In [1]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 1.1. BigQuery Storage & Spark DataFrames - Python

### Create Dataproc Cluster with Jupyter

This notebook is designed to be run on Google Cloud Dataproc.

Follow the links below for instructions on how to create a Dataproc cluster with the Jupyter component installed.

* [Tutorial - Install and run a Jupyter notebook on a Dataproc cluster](https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook)
* [Blog post - Apache Spark and Jupyter Notebooks made easy with Dataproc component gateway](https://medium.com/google-cloud/apache-spark-and-jupyter-notebooks-made-easy-with-dataproc-component-gateway-fa91d48d6a5a)

### Python 3 Kernel

Use a Python 3 kernel (not PySpark) to allow you to configure the SparkSession in the notebook and include the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) required to use the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage).

### Scala Version

Check which version of Scala you are running so you can include the matching spark-bigquery-connector jar.

In [2]:
!scala -version

cat: /release: No such file or directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL


### Create Spark Session

Include the spark-bigquery-connector jar that matches your Scala version:

Scala version 2.11 - `'gs://spark-lib/bigquery/spark-bigquery-latest.jar'`.

Scala version 2.12 - `'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'`.

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .appName('1.1. BigQuery Storage & Spark DataFrames - Python')\
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
  .getOrCreate()
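
The jar choice above can also be made programmatically. The following is a minimal sketch, assuming the version string has the same shape as the `scala -version` output shown earlier; `CONNECTOR_JARS` and `connector_jar_for` are hypothetical names introduced here, and only the two jar paths listed in this notebook are covered.

```python
import re

# Jar locations as listed in this notebook, keyed by Scala binary version.
CONNECTOR_JARS = {
    "2.11": "gs://spark-lib/bigquery/spark-bigquery-latest.jar",
    "2.12": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar",
}


def connector_jar_for(scala_version_output):
    """Return the connector jar matching the output of `scala -version`."""
    # Capture the major.minor part of a version such as 2.11.12.
    match = re.search(r"version (\d+\.\d+)\.\d+", scala_version_output)
    if match is None:
        raise ValueError("no Scala version found in: %r" % scala_version_output)
    binary_version = match.group(1)
    if binary_version not in CONNECTOR_JARS:
        raise ValueError("no connector jar listed for Scala %s" % binary_version)
    return CONNECTOR_JARS[binary_version]


jar = connector_jar_for(
    "Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL"
)
print(jar)
```

The returned path can then be passed to `.config('spark.jars', jar)` when building the SparkSession.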

### Enable repl.eagerEval

This displays the contents of a DataFrame at each step without the need to call `df.show()`, and also improves the formatting of the output.

In [4]:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
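
Eager evaluation has two companion settings that control how much is displayed. The sketch below assumes Spark 2.4 or later, where `spark.sql.repl.eagerEval.maxNumRows` and `spark.sql.repl.eagerEval.truncate` are available; the limit values and the helper name `apply_settings` are illustrative choices, not part of the original notebook.

```python
# Eager-evaluation settings; the two limits are illustrative values.
EAGER_EVAL_SETTINGS = {
    "spark.sql.repl.eagerEval.enabled": True,
    "spark.sql.repl.eagerEval.maxNumRows": "10",  # rows displayed per DataFrame
    "spark.sql.repl.eagerEval.truncate": "20",    # characters shown per cell
}


def apply_settings(spark, settings):
    """Apply each setting to an existing SparkSession through spark.conf.set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return sorted(settings)
```

Calling `apply_settings(spark, EAGER_EVAL_SETTINGS)` in the notebook would replace the single `spark.conf.set` call above.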

### Read BigQuery table into Spark DataFrame

Use the connector's `filter` option to limit the rows read from a partitioned table, here to a single day of `datehour` values.

In [5]:
table = "bigquery-public-data.wikipedia.pageviews_2020"
df_wiki_pageviews = spark.read \
  .format("bigquery") \
  .option("table", table) \
  .option("filter", "datehour >= '2020-03-01' AND datehour < '2020-03-02'") \
  .load()

df_wiki_pageviews.printSchema()

root
 |-- datehour: timestamp (nullable = true)
 |-- wiki: string (nullable = true)
 |-- title: string (nullable = true)
 |-- views: long (nullable = true)
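
The one-day filter string used above can be generated instead of typed by hand. This is a minimal sketch: `day_filter` and `read_day` are hypothetical helpers, and `read_day` assumes `spark` is a SparkSession already configured with the connector jar as shown earlier.

```python
from datetime import date, timedelta


def day_filter(day, column="datehour"):
    """Build a half-open [day, day + 1) predicate for the `filter` option."""
    next_day = day + timedelta(days=1)
    return "{col} >= '{start}' AND {col} < '{end}'".format(
        col=column, start=day.isoformat(), end=next_day.isoformat()
    )


def read_day(spark, table, day):
    """Read a single day of `table` through the BigQuery connector."""
    return (
        spark.read
        .format("bigquery")
        .option("table", table)
        .option("filter", day_filter(day))
        .load()
    )


print(day_filter(date(2020, 3, 1)))
# datehour >= '2020-03-01' AND datehour < '2020-03-02'
```

For example, `read_day(spark, "bigquery-public-data.wikipedia.pageviews_2020", date(2020, 3, 1))` builds the same read as the cell above.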



Select the required columns and apply a filter using `where()`, which is an alias for `filter()`, then cache the result.

In [7]:
df_wiki_en = df_wiki_pageviews \
  .select("title", "wiki", "views") \
  .where("views > 1000 AND wiki in ('en', 'en.m')") \
  .cache()

df_wiki_en

title,wiki,views
2020_Democratic_P...,en,3242
Eurovision_Song_C...,en,2368
Colin_McRae,en,2360
Donald_trump,en,2223
Comparison_of_onl...,en,1398
Coronavirus,en,1872
-,en,136620
Bombshell_(2019_f...,en,1084
Brooklyn,en,1946
2019–20_coronavir...,en,8313


Group by title and order by total page views to see the top pages.

In [8]:
import pyspark.sql.functions as F

df_wiki_en_totals = df_wiki_en \
  .groupBy("title") \
  .agg(F.sum('views').alias('total_views'))

df_wiki_en_totals.orderBy('total_views', ascending=False)

title,total_views
Main_Page,10939337
United_States_Senate,5619797
-,3852360
Special:Search,1538334
2019–20_coronavir...,407042
2020_Democratic_P...,260093
Coronavirus,254861
The_Invisible_Man...,233718
Super_Tuesday,201077
Colin_McRae,200219
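
To make the aggregation concrete, here is a plain-Python sketch of what `groupBy("title")`, `F.sum('views')`, and the descending `orderBy` compute together. The rows are hypothetical, with made-up view counts; Spark performs the same logic distributed across the cluster.

```python
from collections import defaultdict

# Hypothetical (title, views) rows; the numbers are made up for illustration.
rows = [
    ("Main_Page", 5000),
    ("Coronavirus", 1200),
    ("Main_Page", 7000),
    ("Coronavirus", 1800),
    ("Brooklyn", 1500),
]

# groupBy("title") with sum("views"): accumulate views per title.
totals = defaultdict(int)
for title, views in rows:
    totals[title] += views

# orderBy("total_views", ascending=False): largest totals first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
# [('Main_Page', 12000), ('Coronavirus', 3000), ('Brooklyn', 1500)]
```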


### Write Spark DataFrame to BigQuery table

Write the Spark DataFrame to a BigQuery table using the spark-bigquery-connector. The table is created if it does not exist.

The GCS bucket and BigQuery dataset must already exist. If they do not, create them before running `df.write`:

- [Instructions here for creating a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets)
- [Instructions here for creating a BigQuery Dataset](https://cloud.google.com/bigquery/docs/datasets) 

In [9]:
# Update to your GCS bucket
gcs_bucket = 'dataproc-bucket-name'

# Update to the name of the BigQuery dataset you created
bq_dataset = 'dataset_name'

# Enter the name of the BigQuery table you want to create or overwrite.
# If the table does not exist, it is created when you run the write
bq_table = 'wiki_total_pageviews'

df_wiki_en_totals.write \
  .format("bigquery") \
  .option("table","{}.{}".format(bq_dataset, bq_table)) \
  .option("temporaryGcsBucket", gcs_bucket) \
  .mode('overwrite') \
  .save()

### Use BigQuery magic to query table

Use the [BigQuery magic](https://googleapis.dev/python/bigquery/latest/magics.html) to check that the data was written to BigQuery. This runs the SQL query in BigQuery and returns the results.

In [10]:
%%bigquery
SELECT title, total_views
FROM dataset_name.wiki_total_pageviews
ORDER BY total_views DESC
LIMIT 10

Unnamed: 0,title,total_views
0,Main_Page,10939337
1,United_States_Senate,5619797
2,-,3852360
3,Special:Search,1538334
4,2019–20_coronavirus_outbreak,407042
5,2020_Democratic_Party_presidential_primaries,260093
6,Coronavirus,254861
7,The_Invisible_Man_(2020_film),233718
8,Super_Tuesday,201077
9,Colin_McRae,200219
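
The magic cell above hard-codes `dataset_name`. As an alternative, the query text can be built from the dataset and table names used in the write step. This is a sketch under assumptions: `top_pages_sql` and `run_query` are hypothetical helpers, `run_query` assumes the `google-cloud-bigquery` client library and pandas are installed, and the names are inserted without sanitization, so they should come from a trusted source.

```python
def top_pages_sql(dataset, table, limit=10):
    """Build the verification query for a given dataset and table."""
    return (
        "SELECT title, total_views\n"
        "FROM `{}.{}`\n"
        "ORDER BY total_views DESC\n"
        "LIMIT {}"
    ).format(dataset, table, int(limit))


def run_query(sql):
    """Run the query with the BigQuery client and return a pandas DataFrame."""
    # Imported lazily so top_pages_sql works without the client library.
    from google.cloud import bigquery

    return bigquery.Client().query(sql).to_dataframe()


sql = top_pages_sql("dataset_name", "wiki_total_pageviews")
print(sql)
```

In the notebook, `run_query(top_pages_sql(bq_dataset, bq_table))` would query whichever table the write step targeted.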
