In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-managed-notebook?download_url=https://raw.githubusercontent.com/hyunuk/vertex-ai-samples/experiment/notebooks/official/workbench/spark/spark_sample_notebook.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to ingest data from BigQuery, analyze it, and write the results back to BigQuery using Apache Spark on Dataproc Serverless. Using the GitHub Activity Data, we analyze GitHub repositories and find out which programming languages they use.

### Dataset

The dataset we are using is the [GitHub Activity Data](https://console.cloud.google.com/marketplace/product/github/github-repos), available in [BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data). The first 1TB of data queried each month is free.

### Objective

This notebook demonstrates Apache Spark jobs that fetch data from BigQuery, analyze it, and write the results back to BigQuery. Through this process, we can learn a common use case in data engineering: ingesting data from a database, performing transformations during preprocessing, and writing back to another database. We also learn how to submit Apache Spark jobs in the Dataproc Serverless environment on Google Cloud Platform. 

This notebook answers the following questions:

- Which language is the most frequently used among the monoglot repos?
- What is the average size of each language among the monoglot repos?
- Given a language, which other languages are most frequently found in polyglot repos with it?

Note: repositories that encompass polyglot programming are referred to as polyglot repos and those which only contain one programming language are referred to as monoglot repos.
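
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes each repository is represented as a list of (language name, bytes) pairs, similar in spirit to the `language` column of the dataset. The sample repositories and the `classify_repo` helper are hypothetical and only illustrate the definitions.

```python
from typing import List, Tuple


def classify_repo(languages: List[Tuple[str, int]]) -> str:
    """Classify a repository by how many languages it contains."""
    if len(languages) == 0:
        return "empty"
    if len(languages) == 1:
        return "monoglot"
    return "polyglot"


# Hypothetical sample data: repo name -> list of (language, bytes).
sample_repos = {
    "alice/cli-tool": [("Go", 52000)],
    "bob/web-app": [("JavaScript", 120000), ("CSS", 8000), ("HTML", 3000)],
    "carol/notes": [],
}

labels = {name: classify_repo(langs) for name, langs in sample_repos.items()}
print(labels)
```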


The steps performed include the following:

- Setting up the serverless environment.
- Configuring spark-bigquery-connector.
- Ingesting data from BigQuery to Spark DataFrame.
- Preprocessing ingested data.
- Finding the most frequently used programming language among the monoglot repos.
- Finding the average size of each language among the monoglot repos.
- Finding the languages most frequently used together with a given language in polyglot repos.
- Writing the results back to BigQuery.
- Deleting the Dataproc Serverless session.
- Disabling the APIs used in the project.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Dataproc Serverless

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), [Dataproc Serverless pricing](https://cloud.google.com/dataproc-serverless/pricing)
and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable Notebooks API, Vertex AI API, and Cloud Dataproc API](https://console.cloud.google.com/flows/enableapi?apiid=notebooks.googleapis.com,aiplatform.googleapis.com,dataproc&_ga=2.209429842.1903825585.1657549521-326108178.1655322249)

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.
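
Outside of Jupyter, the `!` syntax isn't available. As a rough sketch of what the interpolation does, the following plain-Python snippet builds the same kind of command with `subprocess`. The `echo` command and the `PROJECT_ID_EXAMPLE` value are placeholders chosen for illustration and are not part of this notebook; the snippet assumes an `echo` executable is available on the system.

```python
import subprocess

PROJECT_ID_EXAMPLE = "my-sample-project"  # hypothetical value

# In a notebook cell you could write: ! echo $PROJECT_ID_EXAMPLE
# A rough plain-Python equivalent passes the variable as an argument.
result = subprocess.run(
    ["echo", PROJECT_ID_EXAMPLE],
    capture_output=True,
    text=True,
    check=True,
)

# Like the notebook's `shell_output = !command`, keep the output as a list of lines.
shell_output = result.stdout.splitlines()
print(shell_output[0])
```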

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [29]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Project ID:  dasi22


Otherwise, set your project ID here.

In [22]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type: "string"}

In [32]:
! gcloud config set project $PROJECT_ID -q

Updated property [core/project].


#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  We recommend that you choose the region closest to you.

In [33]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append it onto the name of resources you create in this tutorial.

In [34]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

#### Create a Cloud Storage bucket

The Spark DataFrame created in this notebook is stored in BigQuery. The data is first written to a bucket in Google Cloud Storage (GCS) and then loaded into BigQuery.
You must configure a GCS bucket as this temporary data location.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = f"{PROJECT_ID}{TIMESTAMP}"
    BUCKET_URI = f"gs://{BUCKET_NAME}"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

#### Create a BigQuery resource

Create a BigQuery dataset to hold the results. The command below reuses `BUCKET_NAME` as the dataset name.

In [None]:
! bq mk $BUCKET_NAME
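
BigQuery dataset IDs may contain only letters, digits, and underscores, whereas bucket names and project IDs often contain hyphens. If `bq mk` rejects the name, one option is to derive a dataset-safe name first. The sketch below is only one way you might do that; `to_dataset_id` is a hypothetical helper and the sample name is made up.

```python
import re


def to_dataset_id(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


# Hypothetical bucket name built from a project ID and a timestamp.
dataset_id = to_dataset_id("my-project-20220101120000")
print(dataset_id)
```

If you use a derived name like this, pass the same name wherever the dataset is referenced later.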

### Create a Dataproc cluster

Note: If you already have a cluster on Dataproc, you can skip this part and go to "Change your kernel"

To run your Spark jobs on a Dataproc cluster, you need to create a cluster.

In [36]:
CLUSTER_NAME = "[your-cluster]" # @param {type: "string"}

if CLUSTER_NAME == "[your-cluster]":
    CLUSTER_NAME = f"{PROJECT_ID}-{TIMESTAMP}"

In [37]:
!gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$REGION \
    --enable-component-gateway \
    --optional-components=JUPYTER

ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Request had insufficient authentication scopes.
- '@type': type.googleapis.com/google.rpc.ErrorInfo
  domain: googleapis.com
  metadata:
    method: google.cloud.dataproc.v1.ClusterController.CreateCluster
    service: dataproc.googleapis.com
  reason: ACCESS_TOKEN_SCOPE_INSUFFICIENT

If you are in a compute engine VM, it is likely that the specified scopes during VM creation are not enough to run this command.
See https://cloud.google.com/compute/docs/access/service-accounts#accesscopesiam for more information about access scopes.
See https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes for how to update access scopes of the VM.


#### Change your kernel

The Apache Spark jobs in this notebook run on the Dataproc cluster, so you need to switch your kernel to the Python 3 kernel on that cluster.

Click the button in the top-right corner and select "Python 3 on CLUSTER_NAME: Dataproc cluster in REGION (Remote)".

## Tutorial

### Import required libraries

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import size, col, UserDefinedFunction, explode
from pyspark.sql.types import ArrayType, FloatType, StringType
from itertools import combinations

### Initialize the SparkSession and fetch data from BigQuery

In [None]:
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
spark = SparkSession.builder \
    .appName('spark-bigquery-polyglot-language-demo') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar') \
    .getOrCreate()

# Load data from BigQuery.
df = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data.github_repos.languages') \
  .load()

### Preprocessing

In [None]:
# Define constants for further use.
LIMIT = 10
EXPLODE_PIE_CHART = tuple([.05] * LIMIT)

def array_to_mono_language(arr) -> str:
    if len(arr) != 1:
        return None
    return arr[0].name

def array_to_mono_size(arr) -> int:
    if len(arr) != 1:
        return 0
    return arr[0].bytes

def array_to_poly_language(arr) -> str:
    if len(arr) < 2:
        return None
    arr.sort(key=lambda x : -x.bytes)
    sub = arr[:3]
    sub.sort(key=lambda x : x.name)
    ret = []
    for elem in sub:
        ret.append(elem.name)
    return ', '.join(ret)

udf_map = {
    "mono_language": array_to_mono_language,
    "mono_size": array_to_mono_size,
    "poly_language": array_to_poly_language,
}

preprocessed_df = df.alias("preprocessed_df")
for name, udf in udf_map.items():
    preprocessed_df = preprocessed_df.withColumn(name, UserDefinedFunction(udf)(col("language")))
preprocessed_df = preprocessed_df.drop("language")
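
To see what the polyglot helper returns without starting a Spark job, the following sketch applies equivalent logic to hand-made rows. It assumes each element of the `language` array exposes `name` and `bytes` attributes, as the code above does. The `Lang` tuple, the sample values, and the `top_languages` function are hypothetical stand-ins.

```python
from collections import namedtuple
from typing import List, Optional

# Stand-in for one element of the "language" array.
Lang = namedtuple("Lang", ["name", "bytes"])


def top_languages(arr: List[Lang], k: int = 3) -> Optional[str]:
    """Keep the k largest languages by bytes, then join their names alphabetically."""
    if len(arr) < 2:
        return None
    largest = sorted(arr, key=lambda x: -x.bytes)[:k]
    return ", ".join(sorted(lang.name for lang in largest))


repo = [Lang("HTML", 300), Lang("Python", 9000), Lang("C", 4000), Lang("Shell", 50)]
poly_key = top_languages(repo)
mono_key = top_languages([Lang("Go", 1)])
print(poly_key)  # C, HTML, Python
print(mono_key)  # None
```

Sorting the names alphabetically means the same set of languages always produces the same key, regardless of which language is largest, so a later `GROUP BY` counts those repositories together.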

In [None]:
preprocessed_df.createOrReplaceTempView('df_view')
preprocessed_df.printSchema()
preprocessed_df.show()
# Count the monoglot (single-language) and polyglot (multi-language) repositories.
# The "language" column was dropped from preprocessed_df, so count on the original df.
mono = df.where(size(col("language")) == 1).count()
poly = df.where(size(col("language")) > 1).count()
print(f"The number of repositories that use only one language is {mono}")
print(f"The number of repositories that use multiple languages is {poly}")

In [None]:
# Sort the monoglot repositories by the popularity of languages and see the top 10 languages.
mono_ranking = spark.sql("""
    SELECT
        df_view.mono_language,
        count(df_view.mono_language) AS `cnt`
    FROM
        df_view
    GROUP BY
        df_view.mono_language
    ORDER BY
        `cnt` DESC
""")

mono_ranking.show()
mono_panda = mono_ranking.toPandas()[:LIMIT].copy()
mono_panda.groupby(['mono_language']) \
        .sum() \
        .plot( \
            kind='pie', \
            y='cnt', \
            autopct='%1.0f%%', \
            figsize=(8, 8), \
            explode=EXPLODE_PIE_CHART \
        )

In [None]:
# Get the average size in kilobytes (bytes / 1000) of the monoglot repositories for each language, sorted by size.
mono_ranking_avg_bytes = spark.sql("""
    SELECT
        df_view.mono_language,
        round(avg(df_view.mono_size/1000)) AS `average(KB)`,
        count(df_view.mono_language) AS `cnt`
    FROM
        df_view
    GROUP BY
        df_view.mono_language
    ORDER BY
        `average(KB)` DESC
""")
mono_ranking_avg_bytes = mono_ranking_avg_bytes.filter(col('cnt') > 100)
mono_ranking_avg_bytes.show()
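
The query groups the monoglot repositories by language and averages their size divided by 1000, and the filter then drops languages with 100 or fewer repositories. As a small illustration of that logic in plain Python, the sketch below uses made-up sizes and a lower threshold so the sample stays short; `rows` and `MIN_COUNT` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (language, size in bytes) pairs for monoglot repositories.
rows = [("Python", 4000), ("Python", 6000), ("Go", 10000), ("Python", 2000)]
MIN_COUNT = 2  # the notebook uses 100; lowered here for the tiny sample

# Group the sizes by language.
sizes = defaultdict(list)
for language, size_bytes in rows:
    sizes[language].append(size_bytes)

# Average each group, scale by 1000, and keep only languages with enough repositories.
summary = {
    language: round(sum(values) / len(values) / 1000)
    for language, values in sizes.items()
    if len(values) > MIN_COUNT
}
print(summary)
```

Filtering on the repository count keeps a handful of unusually large repositories in a rare language from dominating the ranking.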

In [None]:
# Rank the language combinations found in polyglot repositories by how often they occur.
poly_ranking = spark.sql("""
    SELECT
        df_view.poly_language,
        count(df_view.poly_language) AS cnt
    FROM
        df_view
    GROUP BY df_view.poly_language
    ORDER BY
        cnt DESC
""")
poly_ranking.show()

poly_panda = poly_ranking.toPandas()[:LIMIT].copy()
poly_panda.groupby(['poly_language']) \
        .sum() \
        .plot( \
            kind='pie', \
            y='cnt', \
            autopct='%1.0f%%', \
            figsize=(8, 8), \
            explode=EXPLODE_PIE_CHART \
        )

In [None]:
def preprocess_reduce_language(arr):
    if len(arr) < 2:
        return None
    languages = []
    for language in arr:
        languages.append(language.name)
    return languages

def preprocess_combination(arr):
    if not arr:
        return None
    arr_combinations = []
    for combination in combinations(arr, 2):
        arr_combinations.append(combination)
        arr_combinations.append(combination[::-1])
    return arr_combinations

df = df.withColumn(
        "reduced_languages",
        UserDefinedFunction(
            preprocess_reduce_language,
            ArrayType(StringType())
        )(col("language"))
    )
df = df.withColumn(
        "combinations",
        UserDefinedFunction(
            preprocess_combination, 
            ArrayType(ArrayType(StringType()))
        )(col("reduced_languages"))
    )

# Keep polyglot repositories only, then emit one row per ordered language pair.
frequency_df = df.where(size(col("language")) > 1).select(col("repo_name"), col("combinations"))
frequency_df = frequency_df.withColumn("languages", explode(col("combinations")))
frequency_df = frequency_df.withColumn("lang0", col('languages')[0])
frequency_df = frequency_df.withColumn("lang1", col('languages')[1])
# Build the co-occurrence table; crosstab names its first column "lang0_lang1".
frequency_df = frequency_df.crosstab("lang0", "lang1")
frequency_df = frequency_df.withColumnRenamed("lang0_lang1", "languages")
frequency_df.createOrReplaceTempView('frequency_df_view')
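
The cell above counts how often two languages appear in the same repository. The following sketch reproduces that counting idea on a tiny, made-up sample with `itertools.combinations` and `collections.Counter` instead of Spark; the `repos` dictionary is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical polyglot repositories and their languages.
repos = {
    "repo_a": ["Python", "Shell"],
    "repo_b": ["Python", "Shell", "C"],
    "repo_c": ["C", "Python"],
}

pair_counts = Counter()
for languages in repos.values():
    for first, second in combinations(languages, 2):
        # Count both orders so the table is symmetric, as in the cell above.
        pair_counts[(first, second)] += 1
        pair_counts[(second, first)] += 1

print(pair_counts[("Python", "Shell")])
print(pair_counts[("C", "Python")])
```

Counting both orders doubles the number of rows, but it lets you look up the partners of any language from a single column of the resulting table.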

In [None]:
MAJOR_LANGUAGES = {"C", "C++", "Java", "Python", "JavaScript", "Go"}
# For each major language, show the 20 languages that most often appear alongside it.
df_arr = []
for language in MAJOR_LANGUAGES:
    df_arr.append(frequency_df.select(col("languages"), language).sort(-col(language)).limit(20))
for i in df_arr:
    i.show()

In [None]:
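# Set this to the Cloud Storage bucket you created earlier.
# The connector stages the data there before loading it into BigQuery.
bucket = "[your-bucket-name]"
spark.conf.set('temporaryGcsBucket', bucket)

# Save a result DataFrame to BigQuery. mono_ranking is used here as an example;
# replace "[your-dataset]" with the BigQuery dataset you created earlier.
mono_ranking.write.format('bigquery') \
  .option('table', '[your-dataset].mono_ranking') \
  .save()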


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the Dataproc cluster
! gcloud dataproc clusters delete $CLUSTER_NAME --region=$REGION --quiet

# Delete the BigQuery dataset and the tables it contains
! bq rm -r -f $BUCKET_NAME

# Delete the Cloud Storage bucket and the objects it contains
if os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI