In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-managed-notebook?download_url=https://raw.githubusercontent.com/hyunuk/vertex-ai-samples/experiment/notebooks/official/workbench/spark/spark_sample_notebook.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to ingest, analyze, and write data to BigQuery using Apache Spark with [Dataproc](https://cloud.google.com/dataproc). Using GitHub Activity Data, you will analyze repositories in GitHub and find out what kind of programming languages are used in their repositories.

### Dataset

The dataset you are using is the [GitHub Activity Data](https://console.cloud.google.com/marketplace/product/github/github-repos), available in [BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data). This notebook refers to repositories that contain more than one programming language as polyglot repos, and to those that contain only one programming language as monoglot repos. The first 1TB of data queried each month is free.

### Objective

This notebook demonstrates Apache Spark jobs that fetch data from BigQuery, analyze it, and write the results back to BigQuery. Through this process, you will learn a common use case in data engineering: ingesting data from a database, performing transformations during preprocessing, and writing back to another database. You will also learn how to submit Apache Spark jobs to Dataproc Serverless on Google Cloud.

In this project, you will answer the following questions.

- Which language is the most frequently used among the monoglot repos?
- What is the average size of each language among the monoglot repos?
- Given a language, which other languages are most frequently found in polyglot repos with it?

The steps performed include the following:

- Set up the Dataproc Serverless environment.
- Configure the spark-bigquery-connector.
- Ingest data from BigQuery into a Spark DataFrame.
- Preprocess the ingested data.
- Discover the most frequently used programming language among the monoglot repos.
- Discover the average size of each language among the monoglot repos.
- Discover the most frequently used languages with a given language in polyglot repos.
- Write the results back to BigQuery.
- Delete the Dataproc Serverless session.
- Disable the APIs used in the project.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Dataproc Serverless

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), [Dataproc Serverless pricing](https://cloud.google.com/dataproc-serverless/pricing)
and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable Notebooks API, Vertex AI API, and Cloud Dataproc API](https://console.cloud.google.com/flows/enableapi?apiid=notebooks.googleapis.com,aiplatform.googleapis.com,dataproc&_ga=2.209429842.1903825585.1657549521-326108178.1655322249)

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.
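
As a rough illustration of what that interpolation amounts to, the sketch below builds a shell command from a Python variable by hand and runs it with `subprocess`. The `project_id` value is hypothetical, and this is only an approximation of what Jupyter does for `!echo $project_id`, assuming a POSIX shell.

```python
import shlex
import subprocess

# Hypothetical value; in the notebook this would be the PROJECT_ID variable.
project_id = "my-sample-project"

# Jupyter expands `$project_id` before handing the line to the shell.
# Here the expansion is done by hand, with quoting to keep the value intact.
command = f"echo {shlex.quote(project_id)}"
result = subprocess.run(
    command, shell=True, capture_output=True, text=True, check=True
)
echoed = result.stdout.strip()
print(echoed)
```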

### Create a Dataproc cluster

**Note**: If you already have a cluster on Dataproc, you can skip this part and go to `Switch your kernel`.

The Spark job that you are going to execute in this project is compute-intensive and could take a lot of time on a standard Notebook environment, so this tutorial uses a Dataproc cluster. To run your Spark jobs on a Dataproc cluster, you need to create a cluster with component gateway enabled and JupyterLab extension.

In [None]:
CLUSTER_NAME = "[your-cluster]"  # @param {type: "string"}
CLUSTER_REGION = "[your-region]"  # @param {type: "string"}

if CLUSTER_REGION == "[your-region]":
    CLUSTER_REGION = "us-central1"

print(f"CLUSTER_NAME: {CLUSTER_NAME}")
print(f"CLUSTER_REGION: {CLUSTER_REGION}")

In [None]:
!gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$CLUSTER_REGION \
    --enable-component-gateway \
    --optional-components=JUPYTER

**Note**: Your `CLUSTER_NAME` must be unique and must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
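
If you want to check a name before calling `gcloud`, the rule above can be sketched as a regular expression. The helper name `is_valid_cluster_name` is made up for this example, and the pattern encodes only the rule stated in the note, not any additional checks Dataproc may perform.

```python
import re

# One lowercase letter, then up to 51 more characters (lowercase letters,
# digits, hyphens), where the last character may not be a hyphen.
CLUSTER_NAME_PATTERN = re.compile(r"[a-z]([-a-z0-9]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule in the note above."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["spark-demo", "Spark-demo", "demo-", "1demo"]:
    print(candidate, is_valid_cluster_name(candidate))
```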

#### Switch your kernel

**Note**: If you have already selected Python 3 on your Dataproc cluster as your kernel, you can skip this part.

In order to execute Apache Spark jobs on Dataproc Clusters, you need to change your kernel to Python 3 on your cluster.

Click the button on the top-right corner and select `Python 3 on CLUSTER_NAME: Dataproc cluster in REGION (Remote)`

If you switch your kernel here, you will lose all variables declared earlier.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type: "string"}

In [None]:
! gcloud config set project $PROJECT_ID -q

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  We recommend that you choose [the region closest to you](https://cloud.google.com/compute/docs/regions-zones).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
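
To see what the resulting resource names look like, here is a small sketch that appends a timestamp in the same format to a prefix. The function `timestamped_name` and the prefix `my-project` are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime


def timestamped_name(prefix: str, now: datetime) -> str:
    """Append a YYYYmmddHHMMSS timestamp to a resource name prefix."""
    return f"{prefix}{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime keeps the example reproducible.
example_name = timestamped_name("my-project", datetime(2022, 7, 11, 9, 30, 5))
print(example_name)  # my-project20220711093005
```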

#### Create a Cloud Storage bucket

The Spark DataFrames created during this project are stored in BigQuery. The data is first written to a Cloud Storage (GCS) bucket and then loaded into BigQuery.
A GCS bucket must therefore be configured as the temporary data location.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = f"{PROJECT_ID}{TIMESTAMP}"
    BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Validate access to your Cloud Storage bucket by displaying the bucket's metadata:

In [None]:
! gsutil ls -L -b $BUCKET_URI

#### Create a BigQuery dataset

Set a name for your BigQuery Dataset and create it.

In [None]:
DATASET_NAME = "[your-dataset-name]"  # @param {type:"string"}

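if DATASET_NAME == "" or DATASET_NAME is None or DATASET_NAME == "[your-dataset-name]":
    # BigQuery dataset names allow only letters, numbers, and underscores,
    # so replace the hyphens that project IDs commonly contain.
    DATASET_NAME = f"{PROJECT_ID}{TIMESTAMP}".replace("-", "_")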

In [None]:
! bq mk $DATASET_NAME

## Tutorial

### Import required libraries

In [None]:
# A Spark Session is how you interact with Spark SQL to create Dataframes
from pyspark.sql import SparkSession
# PySpark functions
from pyspark.sql.functions import avg, col, count, desc, round, size, udf
# These allow us to create a schema for our data
from pyspark.sql.types import ArrayType, IntegerType, StringType

### Initialize the SparkSession

To use Apache Spark with BigQuery, you must include the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) when you initialize the SparkSession.

In [None]:
# Initialize the SparkSession with the following config.
spark = (
    SparkSession.builder.appName("spark-bigquery-polyglot-language-demo")
    .config(
        "spark.jars",
        "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar",
    )
    .config("spark.sql.debug.maxToStringFields", "500")
    .getOrCreate()
)

### Fetch data from BigQuery

In [None]:
# Load Github Activity Public Dataset from BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.github_repos.languages")
    .load()
)

df.printSchema()

### Preprocessing

Based on the schema printed above, data of the GitHub Activity is not stored in primitive types, but is instead stored in arrays. 

To work more effectively with the data, you need to convert it to primitive types and separate data for monoglot repos and polyglot repos.

The three Python functions below carry the `@udf` decorator, which also declares their return type. `udf` is short for [User Defined Function](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html), a mechanism for extending PySpark with custom functions.

In [None]:
# Set the LIMIT constant to 10 to keep the top 10 results for later use.
LIMIT = 10

# Offsets that pull each pie chart slice slightly outward so the slices are easier to tell apart.
EXPLODE_PIE_CHART = tuple([0.05] * LIMIT)


@udf(returnType=StringType())
def language_to_mono_language(language) -> str:
    """
    Return the language's name if the repo contains exactly one language.
    Args:
        language: list of structs that contain name and bytes.
                  (e.g., language = [[name: "C", bytes: 300]])
    Returns:
        Name of the monoglot repo's language, otherwise None
    """
    return language[0].name if len(language) == 1 else None


@udf(returnType=IntegerType())
def language_to_mono_size(language) -> int:
    """
    Return the language's bytes if the repo contains exactly one language.
    Args:
        language: list of structs that contain name and bytes.
                  (e.g., language = [[name: "C", bytes: 300]])
    Returns:
        Size in bytes of the monoglot repo's language, otherwise 0
    """
    return language[0].bytes if len(language) == 1 else 0


@udf(returnType=StringType())
def language_to_poly_language(language) -> str:
    """
    Return the names of a polyglot repo's top 3 languages, ranked by bytes.
    Args:
        language: list of structs that contain name and bytes.
                  (e.g., language = [[name: "C", bytes: 300],
                                     [name: "Java", bytes: 200]])
    Returns:
        Names of the top 3 languages, sorted alphabetically and separated by commas, or None for a monoglot repo
    """
    if len(language) < 2:
        return None
    # Sort language by their bytes in a descending order.
    language.sort(key=lambda x: -x.bytes)
    top_3 = language[:3]

    # Sort top_3 language by their name.
    top_3.sort(key=lambda x: x.name)
    ret = []
    for elem in top_3:
        ret.append(elem.name)
    return ", ".join(ret)

In [None]:
# Create a new DataFrame called preprocessed_df, where the array is split into three columns using UDFs.
preprocessed_df = df.select(
    col("repo_name"),
    language_to_mono_language(col("language")).alias("mono_language"),
    language_to_mono_size(col("language")).alias("mono_size"),
    language_to_poly_language(col("language")).alias("poly_language"),
)
preprocessed_df.printSchema()

After preprocessing, you can see `preprocessed_df`'s schema, and the language column is separated into three columns, `mono_language`, `mono_size`, and `poly_language`.
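
To make the three derived columns concrete, the sketch below applies the same logic to plain Python data, without Spark. The `Language` namedtuple is an assumed stand-in for the structs in the `language` array, and the sample repos are invented for illustration.

```python
from collections import namedtuple
from typing import List, Optional

# Stand-in for the structs in the `language` array column (an assumption
# for this sketch; in Spark the elements expose the same two fields).
Language = namedtuple("Language", ["name", "bytes"])


def mono_language(language: List[Language]) -> Optional[str]:
    """Name of the only language, or None for a polyglot repo."""
    return language[0].name if len(language) == 1 else None


def mono_size(language: List[Language]) -> int:
    """Bytes of the only language, or 0 for a polyglot repo."""
    return language[0].bytes if len(language) == 1 else 0


def poly_language(language: List[Language]) -> Optional[str]:
    """Top 3 languages by bytes, listed alphabetically, or None for a monoglot repo."""
    if len(language) < 2:
        return None
    top_3 = sorted(language, key=lambda lang: lang.bytes, reverse=True)[:3]
    return ", ".join(sorted(lang.name for lang in top_3))


mono_repo = [Language("C", 300)]
poly_repo = [
    Language("Shell", 10),
    Language("Python", 500),
    Language("HTML", 200),
    Language("CSS", 150),
]

print(mono_language(mono_repo), mono_size(mono_repo), poly_language(mono_repo))
print(mono_language(poly_repo), mono_size(poly_repo), poly_language(poly_repo))
```

The first repo fills only the two `mono_*` values, while the second fills only `poly_language`, which is why the later queries filter on `isNotNull()`.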

In [None]:
# Count the monoglot (single-language) and polyglot (multi-language) repositories.
mono = preprocessed_df.where(col("mono_language").isNotNull()).count()
print(f"The number of repositories that use only one language is {mono}")

poly = preprocessed_df.where(col("poly_language").isNotNull()).count()
print(f"The number of repositories that use multiple languages is {poly}")

poly_percent = (poly / (mono + poly)) * 100
print(f"Polyglot repositories account for about {poly_percent:.2f}% of all repos.")

### Analyze

#### Which language is the most frequently used among the monoglot repos?
To answer this question, you can run the query below on the preprocessed column, `mono_language`.

In [None]:
# Get the monoglot repositories and sort them based on the popularity of languages.
mono_ranking = (
    preprocessed_df.groupBy("mono_language")
    .count()
    .sort(desc("count"))
    .where(col("mono_language").isNotNull())
)
mono_ranking.show()
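
The same group, count, and sort pattern can be sketched in plain Python with `collections.Counter`. The sample values are invented, and `None` stands in for the null `mono_language` of polyglot repos.

```python
from collections import Counter

# Invented sample of the mono_language column; None marks polyglot repos.
mono_languages = ["Python", "C", None, "Python", "Java", None, "Python", "C"]

# Drop the nulls, count each language, and order by descending count.
ranking = Counter(
    language for language in mono_languages if language is not None
).most_common()

print(ranking)  # [('Python', 3), ('C', 2), ('Java', 1)]
```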

You can also visualize `mono_ranking` as a pie chart.

In [None]:
# Convert Spark DataFrame to Pandas DataFrame to display the pie chart.
mono_panda = mono_ranking.toPandas()[:LIMIT].copy()
mono_panda.groupby(["mono_language"]).sum().plot(
    kind="pie",
    y="count",
    autopct="%1.1f%%",
    label="",
    title="Monoglot repositories",
    legend=False,
    figsize=(7, 7),
    explode=EXPLODE_PIE_CHART,
)

#### What is the average size of each language among the monoglot repos?

You can use the preprocessed columns `mono_size` and `mono_language` to get the average size of each language.

The values in the `mono_size` column are in kilobytes. In the following query, `mono_size` is divided by 1000 to convert it to megabytes.

In [None]:
mono_ranking_avg_bytes = (
    preprocessed_df.groupBy("mono_language")
    .agg(
        count("mono_language").alias("count"),
        round(avg("mono_size") / 1000).alias("average_in_MB"),
    )
    .sort(desc("average_in_MB"))
    .where(col("mono_language").isNotNull() & (col("count") > 500))
)

mono_ranking_avg_bytes.show()
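
As a plain Python sketch of the same aggregation, the function below averages sizes per language, divides by 1000, and keeps only languages with more than a minimum number of repos. The function name, the sample rows, and the small `min_repos` value are assumptions for illustration; the notebook's query uses a threshold of 500.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def average_size_by_language(
    rows: List[Tuple[Optional[str], int]], min_repos: int
) -> List[Tuple[str, int]]:
    """Average size per language, divided by 1000, sorted in descending order."""
    sizes = defaultdict(list)
    for language, size in rows:
        # Skip polyglot repos, whose mono_language is null.
        if language is not None:
            sizes[language].append(size)

    averages = [
        (language, round(sum(values) / len(values) / 1000))
        for language, values in sizes.items()
        if len(values) > min_repos
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)


sample_rows = [
    ("C", 3000),
    ("C", 1000),
    ("Python", 4000),
    ("Python", 8000),
    ("Go", 9000),
    (None, 0),
]
result = average_size_by_language(sample_rows, min_repos=1)
print(result)  # [('Python', 6), ('C', 2)]
```

Go is dropped because it appears in only one repo, which mirrors how the count filter removes rarely used languages whose averages would be dominated by a few repos.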

#### Given a language, which other languages are most frequently found in polyglot repos with it?

You already have a preprocessed column, `poly_language`. Using this column, you can implement a query to show the ranking of polyglot repositories with the top 3 languages, based on their size.

In [None]:
# Get the polyglot repositories by the popularity of languages.
poly_ranking = (
    preprocessed_df.groupBy("poly_language")
    .count()
    .sort(desc("count"))
    .where(col("poly_language").isNotNull())
)

poly_ranking.show()

The majority of results are mixtures involving HTML or CSS, so the result has many similar combinations of HTML, CSS, and JavaScript.

It may not be as interesting. What about the pie chart?

In [None]:
# Convert Spark DataFrame to Pandas DataFrame to display the pie chart.
poly_panda = poly_ranking.toPandas()[:LIMIT].copy()
poly_panda.groupby(["poly_language"]).sum().plot(
    kind="pie",
    y="count",
    autopct="%1.1f%%",
    label="",
    title="Polyglot repositories",
    legend=False,
    figsize=(7, 7),
    explode=EXPLODE_PIE_CHART,
)

When you visualize the top 10 results as a pie chart, eight out of ten contain either `HTML` or `CSS`.

Using the original data fetched from BigQuery, `df`, you can create combinations of languages in each repo.

In [None]:
# A Python package to get combinations.
from itertools import combinations
# A Python package to use type hint
from typing import List

# PySpark functions
from pyspark.sql.functions import explode

In [None]:
def normalize_name(name: str) -> str:
    """
    Change the name of language since BigQuery has a set of invalid characters that cannot be used in their field.
    Args:
        name: string
    Returns:
        Normalized name: string
    """
    normalized_arr = []

    # The following set of characters cannot be used in BigQuery's field.
    invalid_chars = {",", ";", "{", "}", "(", ")", "\n", "\t", "="}

    # The name must start with a letter or underscore.
    if name[0].isnumeric():
        normalized_arr.append("_")

    for ch in name:
        # Skip if a character is in the set of invalid characters.
        if ch in invalid_chars:
            continue

        # Convert space or dot to underscore
        elif ch == " " or ch == ".":
            normalized_arr.append("_")

        # Lower the character to merge the same name (e.g., "Java" and "java")
        else:
            normalized_arr.append(ch.lower())

    # Convert the array to string
    return "".join(normalized_arr)


def reduce_language(language) -> List[str]:
    """
    Reduce language to a list of normalized names, dropping "bytes".
    Args:
        language: list of structs that contain name and bytes.
                  (e.g., language = [[name: "C", bytes: 300],
                                     [name: "Java", bytes: 200]])
    Returns:
        list of normalized language names, or None for a monoglot repo.
                  (e.g., reduced_languages = ["c", "java"])
    """
    if len(language) < 2:
        return None
    reduced_languages = []
    for elem in language:
        # To write back to BigQuery, the name must be normalized. See normalize_name() function above.
        normalized_name = normalize_name(elem.name)
        reduced_languages.append(normalized_name)
    return reduced_languages


def preprocess_combination(language) -> List[List[str]]:
    """
    Return every ordered pair of languages in a repo.
    Args:
        language: list of normalized language names.
                  (e.g., language = ["c", "java"])
    Returns:
        List of every ordered pair of languages.
                  (e.g., arr_combinations = [["c", "java"], ["java", "c"]])
    """
    if not language:
        return None
    arr_combinations = []
    for combination in combinations(language, 2):
        arr_combinations.append(combination)
        arr_combinations.append(combination[::-1])
    return arr_combinations


# Preprocess "reduced_languages" column using UDF.
# udf() is imported from pyspark.sql.functions at the top of the notebook.
df = df.withColumn(
    "reduced_languages",
    udf(reduce_language, ArrayType(StringType()))(col("language")),
)

# Preprocess "combinations" column using UDF.
df = df.withColumn(
    "combinations",
    udf(preprocess_combination, ArrayType(ArrayType(StringType())))(
        col("reduced_languages")
    ),
)

# Create another DataFrame from df that has repo_name and combinations as columns.
frequency_df = df.select(col("repo_name"), col("combinations")).where(
    size(col("language")) > 1
)
frequency_df.printSchema()

Now frequency_df has repo name and combinations of languages.
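
The pair generation can be tried outside Spark. This sketch repeats the approach of `preprocess_combination()` on an invented list of names: each unordered pair from `itertools.combinations` is added in both orders, which yields the same pairs as `itertools.permutations(languages, 2)`.

```python
from itertools import combinations, permutations

# Invented sample of normalized language names for one repo.
languages = ["c", "c++", "java"]

ordered_pairs = []
for pair in combinations(languages, 2):
    ordered_pairs.append(pair)        # e.g., ("c", "c++")
    ordered_pairs.append(pair[::-1])  # e.g., ("c++", "c")

print(ordered_pairs)

# Both orders of every pair are present, just as permutations would produce.
same_pairs = sorted(ordered_pairs) == sorted(permutations(languages, 2))
print(same_pairs)  # True
```

Storing both orders means that each language later appears as `language0` for every language it co-occurs with, which keeps the frequency table symmetric.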

Next, you need to use Spark's `explode` function, which is similar to the `UNNEST` operator in SQL.

The current table looks like the following.

| repo_name   | combinations      |
| :----------: | :---------------: |
| a           | [['C', 'C++'], ['C++', 'C'], ['C', 'Java'], ['Java', 'C'], ['C++', 'Java'], ['Java', 'C++']]|
| b           | [['C', 'C++'], ['C++', 'C'], ['C', 'Python'], ['Python', 'C'], ['C++', 'Python'], ['Python', 'C++']]|

In [None]:
# Using explode(), the elements in combinations converted to rows.
frequency_df = frequency_df.withColumn("languages", explode(col("combinations")))

# Create columns for combinations of languages.
frequency_df = frequency_df.withColumn("language0", col("languages")[0])
frequency_df = frequency_df.withColumn("language1", col("languages")[1])

After exploding and adding the two columns `language0` and `language1`, `frequency_df` looks like the following.

| repo_name   | languages         | language0    | language1 |
| :---------: | :---------------: | :--------:   | :-------: |
| a           | ['C', 'C++']      | 'C'          |'C++'      |
| a           | ['C++', 'C']      | 'C++'        |'C'        |
| a           | ['C', 'Java']     | 'C'          |'Java'     |
| a           | ['Java', 'C']     | 'Java'       |'C'        |
| a           | ['C++', 'Java']   | 'C++'        |'Java'     |
| a           | ['Java', 'C++']   | 'Java'       |'C++'      |
| b           | ['C', 'C++']      | 'C'          |'C++'      |
| b           | ['C++', 'C']      | 'C++'        |'C'        |
| b           | ['C', 'Python']   | 'C'          |'Python'   |
| b           | ['Python', 'C']   | 'Python'     |'C'        |
| b           | ['C++', 'Python'] | 'C++'        |'Python'   |
| b           | ['Python', 'C++'] | 'Python'     |'C++'      |
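
The effect of `explode()` followed by the two indexing steps can be mimicked in plain Python with a nested comprehension. The rows below are a shortened, invented version of the example table.

```python
# Each row holds a repo name and its list of ordered language pairs.
rows = [
    ("a", [("C", "C++"), ("C++", "C")]),
    ("b", [("C", "Python"), ("Python", "C")]),
]

# One output row per pair, plus the two elements as separate columns.
exploded = [
    (repo_name, pair, pair[0], pair[1])
    for repo_name, pairs in rows
    for pair in pairs
]

for row in exploded:
    print(row)
```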

Now you are going to calculate a pair-wise frequency table of given `language0` and `language1` columns. The first column of each row will be the distinct values of `language0` and the column names will be the distinct values of `language1`.

In [None]:
# crosstab() reshapes the table into the frequency distribution table by using cross tabulations.
frequency_df = frequency_df.crosstab("language0", "language1").withColumnRenamed(
    "language0_language1", "languages"
)

After applying `crosstab()` to `frequency_df`, the DataFrame looks like the following.

| languages  |  C  | C++ | Java | Python |
| :--------: | :-: | :-: | :-:  |  :-:   |
|     C      |  0  |  2  |  1   |   1    |
|     C++    |  2  |  0  |  1   |   1    |
|    Java    |  1  |  1  |  0   |   0    |
|   Python   |  1  |  1  |  0   |   0    |

Note that this table is an example and does not show the real data.

See [frequency distribution](https://en.wikipedia.org/wiki/Frequency_(statistics)) and [cross tabulations](https://en.wikipedia.org/wiki/Contingency_table).
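
To check the example table by hand, the sketch below builds the same pair-wise frequency table in plain Python. It assumes the two example repos from the tables above and uses `collections.Counter` in place of `crosstab()`.

```python
from collections import Counter
from itertools import permutations

# The two example repos used in the tables above.
repos = {
    "a": ["C", "C++", "Java"],
    "b": ["C", "C++", "Python"],
}

# Count every ordered (language0, language1) pair across all repos.
pair_counts = Counter(
    pair for languages in repos.values() for pair in permutations(languages, 2)
)

all_languages = sorted({lang for languages in repos.values() for lang in languages})

# A Counter returns 0 for pairs that never occur, which fills the empty cells.
table = {
    row: {column: pair_counts[(row, column)] for column in all_languages}
    for row in all_languages
}

for row in all_languages:
    print(row, table[row])
```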

Now you have a DataFrame that shows, for each language, how often every other language appears alongside it. Let's visualize it for some popular languages.

In [None]:
# Set of popular languages. You can modify this set to show your favorite languages.
MAJOR_LANGUAGES = {"C", "C++", "Java", "Python", "JavaScript", "Go"}

# Declare a dictionary to store key as language name and value as the selected DataFrame
df_dict = dict()

for language in MAJOR_LANGUAGES:
    # Get the top ten co-occurring languages for each language and store them in the dictionary.
    df_dict[language] = (
        frequency_df.select(col("languages"), language).sort(-col(language)).limit(10)
    )

In [None]:
for language in df_dict:
    # Convert Spark DataFrame to Pandas DataFrame to display the bar chart.
    elem_panda = df_dict[language].toPandas()[:LIMIT].copy()
    elem_panda.set_index("languages", inplace=True)
    elem_panda.sort_values(language, ascending=True, inplace=True)
    elem_panda.plot(
        kind="barh",
        title=language,
        legend=False,
        xlabel="",
    )

### Write back to BigQuery

After analyzing these queries, we have several DataFrames: the ranking of monoglot repositories, the average bytes of monoglot repositories, and the frequency table of each language being used in a repository. 

In this project, these three DataFrames will be stored in BigQuery using the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).

In [None]:
dataframes = {
    "mono_ranking": mono_ranking,
    "mono_ranking_avg_bytes": mono_ranking_avg_bytes,
    "frequency_table": frequency_df,
}

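# Iterate through the DataFrames and save each one to BigQuery.
# The connector stages data in the Cloud Storage bucket created earlier.
for table_name, dataframe in dataframes.items():
    (
        dataframe.write.format("bigquery")
        .option("temporaryGcsBucket", BUCKET_NAME)
        .option("table", f"{DATASET_NAME}.{table_name}")
        .save()
    )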
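
If you write more tables later, the write options can be wrapped in a small helper. The function `write_to_bigquery` below is a hypothetical convenience, not part of the connector; it only chains the `format`, `option`, and `save` calls on whatever DataFrame it receives, and it assumes `temporary_bucket` is a bucket name without the `gs://` prefix.

```python
def write_to_bigquery(dataframe, table: str, temporary_bucket: str) -> None:
    """Write a Spark DataFrame to a BigQuery table through a staging bucket.

    Args:
        dataframe: a Spark DataFrame.
        table: destination in the form "dataset.table".
        temporary_bucket: name of the Cloud Storage bucket used for staging.
    """
    (
        dataframe.write.format("bigquery")
        .option("temporaryGcsBucket", temporary_bucket)
        .option("table", table)
        .save()
    )


# Example call, assuming the variables defined earlier in this notebook:
# write_to_bigquery(mono_ranking, f"{DATASET_NAME}.mono_ranking", BUCKET_NAME)
```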

If there is no error above, congratulations! Your DataFrames are now stored in BigQuery.

You can find the data in the [BigQuery console](https://console.cloud.google.com/bigquery) or query it with the `bq` command-line tool, as shown below.

In [None]:
QUERY = f"SELECT languages, python FROM {PROJECT_ID}.{DATASET_NAME}.frequency_table ORDER BY python DESC LIMIT 10"

! bq query --nouse_legacy_sql $QUERY

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

### Delete Vertex AI Workbench - Managed Notebook

To delete the Vertex AI Workbench managed notebook used in this project, follow the [Clean up](https://cloud.google.com/vertex-ai/docs/workbench/managed/create-managed-notebooks-instance-console-quickstart#clean-up) section of the `Managed notebooks` page.

### Delete a Dataproc Cluster

To delete a Dataproc cluster, follow the [Deleting a cluster](https://cloud.google.com/dataproc/docs/guides/manage-cluster#deleting_a_cluster) section of the `Manage a cluster` page.

In [None]:
# Delete Google Cloud Storage bucket
! gsutil rm -r $BUCKET_URI

In [None]:
# Delete BigQuery dataset
! bq rm -r -f $DATASET_NAME

After you delete the BigQuery dataset, you can check your Datasets in BigQuery using the following command.

In [None]:
! bq ls