In [1]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 1.1. BigQuery Storage & Spark DataFrame - Python

### Python 3 Kernel

Use a Python 3 kernel (not PySpark) so that you can configure the SparkSession in the notebook and include the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector), which is required to use the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage).

### Scala Version

Check which version of Scala you are running so that you can include the matching spark-bigquery-connector jar.

In [3]:
!scala -version

Scala code runner version 2.12.10 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.
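
The jar can also be chosen from the version banner instead of by eye. The following is a minimal sketch, assuming the `scala -version` output has the shape shown above. `connector_jar_for` and `CONNECTOR_JARS` are hypothetical names introduced for illustration, and the two jar paths are the ones listed in the next section.

```python
import re

# Jar locations as listed in this notebook, keyed by Scala binary version.
CONNECTOR_JARS = {
    "2.11": "gs://spark-lib/bigquery/spark-bigquery-latest.jar",
    "2.12": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar",
}


def connector_jar_for(version_banner: str) -> str:
    """Hypothetical helper: pick the connector jar from a `scala -version` banner."""
    # Capture only major.minor (e.g. "2.12" from "2.12.10").
    match = re.search(r"version (\d+\.\d+)", version_banner)
    if match is None:
        raise ValueError(f"no Scala version found in: {version_banner!r}")
    binary_version = match.group(1)
    if binary_version not in CONNECTOR_JARS:
        raise ValueError(f"no connector jar listed for Scala {binary_version}")
    return CONNECTOR_JARS[binary_version]


banner = "Scala code runner version 2.12.10 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc."
print(connector_jar_for(banner))
```

Raising on an unknown version is a deliberate choice here: falling back silently to one of the jars would only defer the failure to the first read.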


### Create Spark Session

Include the spark-bigquery-connector jar that matches your Scala version:

Scala version 2.11 - `'gs://spark-lib/bigquery/spark-bigquery-latest.jar'`.

Scala version 2.12 - `'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'`.

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .appName('BigQuery Storage & Spark DataFrame - Python') \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

### Read BigQuery table into Spark DataFrame

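Use the connector's `filter` option to restrict the read to the required partitions of a partitioned table. The filter below selects a single day of data by `datehour`.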

In [6]:
table = "bigquery-public-data.wikipedia.pageviews_2020"
df = spark.read \
  .format("bigquery") \
  .option("table", table) \
  .option("filter", "datehour >= '2020-03-01' AND datehour < '2020-03-02'") \
  .load()

df.printSchema()

root
 |-- datehour: timestamp (nullable = true)
 |-- wiki: string (nullable = true)
 |-- title: string (nullable = true)
 |-- views: long (nullable = true)
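
The filter string above is a half-open range, `[2020-03-01, 2020-03-02)`, which covers exactly one day. As a minimal sketch, the same string can be built for any day with the standard library. `day_filter` is a hypothetical helper, not part of the connector, and the `datehour` column name is taken from the schema above.

```python
from datetime import date, timedelta


def day_filter(day: date, column: str = "datehour") -> str:
    """Hypothetical helper: build a half-open [day, day + 1) predicate string."""
    next_day = day + timedelta(days=1)
    return f"{column} >= '{day.isoformat()}' AND {column} < '{next_day.isoformat()}'"


print(day_filter(date(2020, 3, 1)))
```

The resulting string can be passed as the value of `.option("filter", ...)` in place of the literal used in the cell above.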



Select the required columns, apply a row filter using `where()` (an alias for `filter()`), and cache the result.

In [10]:
df1 = df \
  .select("title", "wiki", "views") \
  .where("views > 1000 AND wiki in ('en', 'en.m')") \
  .cache()

df1.show(5)

+--------------------+----+-----+
|               title|wiki|views|
+--------------------+----+-----+
|2020_Democratic_P...|  en| 3242|
|Eurovision_Song_C...|  en| 2368|
|         Colin_McRae|  en| 2360|
|        Donald_trump|  en| 2223|
|Comparison_of_onl...|  en| 1398|
+--------------------+----+-----+
only showing top 5 rows



Group by title and find the top 20 pages by total page views.

In [12]:
import pyspark.sql.functions as F

df2 = df1 \
  .groupBy("title") \
  .agg(F.sum('views').alias('total_views'))

df2.orderBy('total_views', ascending=False).show(20)

+--------------------+-----------+
|               title|total_views|
+--------------------+-----------+
|           Main_Page|   10939337|
|United_States_Senate|    5619797|
|                   -|    3852360|
|      Special:Search|    1538334|
|2019–20_coronavir...|     407042|
|2020_Democratic_P...|     260093|
|         Coronavirus|     254861|
|The_Invisible_Man...|     233718|
|       Super_Tuesday|     201077|
|         Colin_McRae|     200219|
|         David_Byrne|     189989|
|2019–20_coronavir...|     156803|
|        John_Mulaney|     155605|
|2020_South_Caroli...|     152137|
|      AEW_Revolution|     140503|
|       Boris_Johnson|     120957|
|          Tom_Steyer|     120926|
|Dyatlov_Pass_inci...|     117704|
|         Spanish_flu|     108335|
|2020_coronavirus_...|     107653|
+--------------------+-----------+
only showing top 20 rows
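
To make the semantics of the filter, group and sort steps concrete, here is a plain-Python sketch of the same logic on a handful of made-up rows. The sample data and the `top_pages` helper are assumptions for illustration only and do not use Spark. Note that the `views > 1000` condition is applied per row before summing, so rows at or below that threshold never contribute to `total_views`.

```python
from collections import defaultdict

# Made-up sample rows in (title, wiki, views) order; not real pageview data.
rows = [
    ("Main_Page", "en", 5000),
    ("Main_Page", "en.m", 7000),
    ("Main_Page", "de", 9000),    # dropped: wiki not in ('en', 'en.m')
    ("Coronavirus", "en", 1500),
    ("Coronavirus", "en", 800),   # dropped: views not > 1000
    ("Spanish_flu", "en.m", 1200),
]


def top_pages(rows, n):
    """Filter rows, sum views per title, and return the n largest totals."""
    totals = defaultdict(int)
    for title, wiki, views in rows:
        if views > 1000 and wiki in ("en", "en.m"):
            totals[title] += views
    # Sort by total views, largest first, and keep the first n entries.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


print(top_pages(rows, 2))
```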



### Write Spark DataFrame to BigQuery table

Write the Spark DataFrame to a BigQuery table using the BigQuery Storage connector. The write creates the table if it does not exist, and because the mode is `overwrite`, it replaces the contents of an existing table. The GCS bucket and BigQuery dataset must already exist.

In [16]:
bucket_name = 'dataproc-bucket-name'

df2.write \
  .format("bigquery") \
  .option("table", "dataset_name.wiki_total_pageviews") \
  .option("temporaryGcsBucket", bucket_name) \
  .mode('overwrite') \
  .save()
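
The cell above passes a bare bucket name to `temporaryGcsBucket`. As a small sketch, assuming the option expects that bare form rather than a `gs://` URI, a value copied from a console or URL can be normalized first. `bare_bucket_name` is a hypothetical helper, not part of the connector.

```python
def bare_bucket_name(bucket: str) -> str:
    """Hypothetical helper: reduce 'gs://name/' or 'gs://name/path' to 'name'."""
    name = bucket.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    # Keep only the bucket part; anything after the first '/' is an object path.
    name = name.split("/", 1)[0]
    if not name:
        raise ValueError(f"no bucket name in: {bucket!r}")
    return name


print(bare_bucket_name("gs://dataproc-bucket-name/"))
```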