## The right tool for the job: combining pandas and Spark

When it comes to data analysis with Python, `pandas` is a very popular library for data manipulation. However, it is constrained to datasets that fit into the memory of a single machine.

When scale is needed, `Apache Spark` is a distributed computing system that can process large amounts of data quickly thanks to its scale-out architecture. It is very common to combine different tools in the data analysis process, for instance starting with `pandas` and bursting heavy analysis to `spark` as needed.

With Google Cloud, it is possible to use both `pandas` and `spark` together to perform data analysis tasks without increasing the complexity of the solution. **Google Cloud's Spark Serverless** is a fully managed service for running `spark` workloads on Google Cloud. It allows you to easily submit `spark` jobs to be executed on Cloud Dataproc, Google Cloud's fully managed service for running `spark` and `hadoop` workloads.

With **Google Cloud's Spark Serverless**, you don't need to worry about managing the underlying infrastructure or configuring `spark` clusters. You simply submit your `spark` job, and the service automatically provisions the necessary resources and executes the job on a managed `spark` cluster. When the job is finished, the resources are automatically released, so you only pay for what you use.

In this notebook we will explore the following use case:

* Use `pandas` to pre-process data: `pandas` is good at handling small to medium-sized data sets, so you can use it to perform initial data cleaning and manipulation. For example, you can use `pandas` to read data from `BigQuery`, filter the data, and create simple visualizations.

* Use `spark` to scale up: Once you have pre-processed your data using `pandas`, you can use `spark` to scale up your analysis to handle larger data sets. `spark` can be used to perform distributed computing on a cluster of machines, which makes it well-suited for big data tasks. `spark` has a fundamental data structure called a `spark dataframe`, which is similar to a `pandas dataframe`. You can use a `spark dataframe` to perform distributed computations on large data sets, and you can also convert a `pandas dataframe` to a `spark dataframe` using the `spark.createDataFrame` method.


### 1. Using pandas together with BigQuery

Let's start by reading some data from the `BigQuery` public datasets into a `pandas dataframe`. The integration of Vertex AI managed notebooks with `BigQuery` makes this process simple.

#@bigquery
SELECT * FROM bigquery-public-data.chicago_crime.crime

In [None]:
# The following two lines are only necessary to run once.
# Comment out otherwise for speed-up.
from google.cloud.bigquery import Client, QueryJobConfig
client = Client()

query = """SELECT * FROM bigquery-public-data.chicago_crime.crime"""
job = client.query(query)
df = job.to_dataframe()

Let's inspect the memory footprint of the pandas dataframe.

In [None]:
df.info()
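
When a frame holds object (Python string) columns, the total reported by `df.info()` is only a lower bound unless a deep inspection is requested. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset) comparing the shallow and deep measurements:

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the crime table.
toy = pd.DataFrame({
    "unique_key": range(1000),
    "primary_type": ["THEFT", "BATTERY"] * 500,
    "arrest": [True, False] * 500,
})

# Shallow: size of the column buffers only (pointers for object columns).
shallow = toy.memory_usage(deep=False).sum()
# Deep: additionally inspects the string payloads behind object columns.
deep = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow / 1024:.1f} KiB, deep: {deep / 1024:.1f} KiB")

# toy.info(memory_usage="deep") reports the deep total in the info summary.
```

On the full table the deep measurement can take noticeably longer, since every string value has to be inspected.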

Let's inspect the memory configuration of this notebook. Note that we can always change the machine type.

In [None]:
!awk '$3=="kB"{$2=$2/1024^2;$3="GB";} 1' /proc/meminfo | grep MemTotal

In [None]:
!awk '$3=="kB"{$2=$2/1024^2;$3="GB";} 1' /proc/meminfo | grep MemFree
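
The same figures can be read from Python, which is handy when they are needed in later cells. This is a sketch only: `meminfo_gb` is a hypothetical helper, and the sample text is made up in the `/proc/meminfo` layout so that the example runs on any machine.

```python
def meminfo_gb(text, field):
    """Return the value of `field` (e.g. "MemTotal") from /proc/meminfo text, in GB."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            kb = int(rest.split()[0])  # /proc/meminfo reports values in kB
            return kb / 1024**2        # same conversion as the awk one-liner
    raise KeyError(field)

# Made-up sample in the /proc/meminfo layout.
sample = "MemTotal:       16777216 kB\nMemFree:         4194304 kB\n"
total_gb = meminfo_gb(sample, "MemTotal")
free_gb = meminfo_gb(sample, "MemFree")
print(total_gb, free_gb)  # 16.0 4.0

# On the notebook VM (Linux only), assuming the file is readable:
# meminfo_gb(open("/proc/meminfo").read(), "MemTotal")
```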

Let's perform some simple aggregations on the dataset. For transformations on small datasets, `pandas` is a very expressive and rich tool.

In [None]:
count_arrest = df.groupby(by=["primary_type","arrest"]).size().sort_values(ascending=False).rename("count").reset_index()
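
To make each step of that chain visible, here is the same pattern applied to a tiny made-up frame (the rows are invented for illustration):

```python
import pandas as pd

# Six made-up rows: 3x BATTERY/arrest, 2x THEFT/no arrest, 1x THEFT/arrest.
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY", "BATTERY"],
    "arrest": [False, False, True, True, True, True],
})

counts = (
    toy.groupby(by=["primary_type", "arrest"])
    .size()                        # rows per (primary_type, arrest) pair
    .sort_values(ascending=False)  # largest groups first
    .rename("count")               # name the Series so it becomes a column
    .reset_index()                 # turn the group keys back into columns
)
print(counts)
```

The result has one row per observed combination, with columns `primary_type`, `arrest` and `count`, ordered from the most to the least frequent pair.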

Let's visualize the aggregation results using `seaborn` and `matplotlib`.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(5,5))
sns.barplot( y="primary_type",x="count" , data=count_arrest.iloc[:20, :], hue='arrest', color='red')
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(ylabel="Type",xlabel="Crimes")
sns.despine(left=True, bottom=True)

Let's now perform an expensive (and not very useful) operation.

In [None]:
# Keep the result under a new name so that `df` remains a DataFrame for later cells
df_joined = [df.join(df,on="unique_key",rsuffix="_y") for _ in range(5)]

Either the kernel will die or an out-of-memory (OOM) error will be thrown.
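
A side note on the call used above: in pandas, `join(..., on="unique_key")` looks up the caller's `unique_key` values in the *index* of the other frame, whereas `merge(..., on="unique_key")` matches column against column. For a stress test this makes little difference, but the sketch below, on made-up rows, shows how the two results differ:

```python
import pandas as pd

# Made-up rows; the default index is 0, 1, 2 and never equals a unique_key.
toy = pd.DataFrame({
    "unique_key": [10, 20, 30],
    "primary_type": ["THEFT", "BATTERY", "THEFT"],
})

# join(on=...) matches toy["unique_key"] against the other frame's index,
# so no row finds a partner and the "_y" columns are filled with NaN.
via_join = toy.join(toy, on="unique_key", rsuffix="_y")

# merge(on=...) matches the unique_key column of both frames.
via_merge = toy.merge(toy, on="unique_key", suffixes=("", "_y"))

print(via_join["primary_type_y"].isna().all())                          # True
print((via_merge["primary_type"] == via_merge["primary_type_y"]).all())  # True
```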

### 2. Scaling data analysis with Spark Serverless interactive

Let's replicate the previous workload with `PySpark` running on Google Cloud Spark Serverless.

In [None]:
import sys
import pyspark
from pyspark.sql import SparkSession
# Import only what is used; a wildcard import would shadow built-ins such as sum, min and max
from pyspark.sql.functions import col
from datetime import datetime
import pyspark.pandas as ps
import pandas as pd
from random import randint

In [None]:
spark = SparkSession.builder.getOrCreate()

In [None]:
spark.sparkContext.getConf().getAll()

#### 2.1 Using pandas on Spark

In [None]:
ps.set_option("compute.default_index_type", "distributed")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)

In [None]:
# Change type to avoid TypeError: Type datetime64[ns, UTC] was not understood. when converting to pandas on spark dataframe
# See https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/types.html for additional information
df['date'] = pd.to_datetime(df.date).dt.tz_localize(None)
df['updated_on'] = pd.to_datetime(df.updated_on).dt.tz_localize(None)
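
The effect of `tz_localize(None)` can be checked on a couple of made-up timestamps. This sketch assumes, as the comment in the cell above suggests, that the columns arrive as timezone-aware UTC values:

```python
import pandas as pd

# Made-up timezone-aware timestamps standing in for the BigQuery columns.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-01 12:00:00", "2023-06-15 08:30:00"], utc=True
    )
})
print(toy["date"].dt.tz is not None)  # True: the column is timezone-aware

# Drop the timezone while keeping the UTC wall-clock time unchanged.
naive = toy["date"].dt.tz_localize(None)
print(naive.dt.tz)  # None
```

Note that `tz_localize(None)` keeps the UTC wall-clock values; if local time were needed, `tz_convert` would have to be applied first.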

In [None]:
psdf = ps.from_pandas(df)

In [None]:
count_arrest_psdf = psdf.groupby(by=["primary_type","arrest"]).size().sort_values(ascending=False).rename("count").reset_index()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(5,5))
sns.barplot( y="primary_type",x="count" , data=count_arrest_psdf.to_pandas().iloc[:20, :], hue='arrest', color='green')
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(ylabel="Type",xlabel="Crimes")
sns.despine(left=True, bottom=True)

In [None]:
psdf = [psdf.join(psdf.rename( columns = { "unique_key" : "unique_key-{}".format(randint(0,1000)) } ),on="unique_key",rsuffix="_y") for _ in range(5)]

In [None]:
psdf

#### 2.2 Using the Spark DataFrame API

It is possible to read data directly from BigQuery storage using the Spark BigQuery connector.

In [None]:
spark_df = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:chicago_crime.crime') \
  .load()

In [None]:
spark_df

The PySpark dataframe API is quite similar to the pandas one, so the code refactor is minimal.

In [None]:
spark_count_arrest = spark_df.groupby("primary_type","arrest").count().orderBy(col("count").desc())

We can switch between pandas dataframes and spark dataframes easily. Unlike pandas, Spark executes lazily: transformations are only computed when an action such as `toPandas()` requests the result.

In [None]:
spark_count_arrest = spark_count_arrest.toPandas()

In [None]:
spark_count_arrest

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(5,5))
sns.barplot( y="primary_type",x="count" , data=spark_count_arrest.iloc[:20, :], hue='arrest', color='blue')
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(ylabel="Type",xlabel="Crimes")
sns.despine(left=True, bottom=True)

Let's execute the heavy operation again. This time we will run it on a distributed Spark cluster.

In [None]:
spark_df = spark_df.join(spark_df,on="unique_key").join(spark_df,on="unique_key").join(spark_df,on="unique_key").join(spark_df,on="unique_key").join(spark_df,on="unique_key")

In [None]:
spark_df.explain()

In [None]:
spark_df

To avoid saturating the node's memory, let's retrieve a small fraction of the processed data with the `sample` operation.

In [None]:
spark_df = spark_df.sample(0.01).collect()

In [None]:
spark_df[0]

End of notebook