# PySpark Tutorial

## Environment Setup

**Step 0:** Clone the repository to the cluster's local disk

In [None]:
!git clone https://github.com/Jace-Yang/yelp_db_clone

**Step 1:** Every time you create a new cluster, download the external package needed to parse XML files: spark-xml 0.14.0 built for Scala 2.12 (`spark-xml_2.12-0.14.0.jar`), which supports Spark 3.1.2.

In [4]:
# Reference: https://csee-4121-2022.github.io/homeworks/hw2.html
!sudo hdfs dfs -get gs://csee4121/homework2/spark-xml_2.12-0.14.0.jar /usr/lib/spark/jars/

> Note: if your GCP Dataproc cluster has multiple nodes, SSH into every worker VM and run `sudo hdfs dfs -get gs://csee4121/homework2/spark-xml_2.12-0.14.0.jar /usr/lib/spark/jars/` there as well.
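
Repeating the download by hand on each worker is easy to get wrong, so the commands can be generated in a loop. The sketch below only builds `gcloud compute ssh` command strings and does not run them. The cluster name `my-cluster`, the zone `us-east1-b`, and the worker count are hypothetical placeholders, and the code assumes Dataproc's usual `<cluster>-w-<index>` worker naming.

```python
import shlex

# The download command from Step 1, to be run on each worker.
JAR_COMMAND = (
    "sudo hdfs dfs -get "
    "gs://csee4121/homework2/spark-xml_2.12-0.14.0.jar /usr/lib/spark/jars/"
)


def worker_ssh_commands(cluster, zone, num_workers):
    """Build one `gcloud compute ssh` command per worker VM.

    Assumes workers are named <cluster>-w-0, <cluster>-w-1, ...
    """
    commands = []
    for index in range(num_workers):
        argv = [
            "gcloud", "compute", "ssh", f"{cluster}-w-{index}",
            f"--zone={zone}",
            f"--command={JAR_COMMAND}",
        ]
        # shlex.join quotes each argument so the line can be pasted into a shell.
        commands.append(shlex.join(argv))
    return commands


# Hypothetical cluster name, zone, and size; replace with your own.
for command in worker_ssh_commands("my-cluster", "us-east1-b", 2):
    print(command)
```

Printing the commands first, rather than executing them directly, makes it easy to check the instance names before anything runs on the cluster.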

**Step 2:** Download the dataset from the [Yelp dataset](https://www.yelp.com/dataset/documentation/main) page and upload all JSON files except `photos` to a GCP bucket. In this case the bucket is named `coms4111`, and I placed the files in a directory that JupyterLab can access directly through its `GCS` folder.

**Step 3:** Copy the data from GCS into an HDFS directory every time you create a new cluster. We do this by copying the data to the local disk first, then to HDFS.

In [30]:
# GCS -> local disk (-p keeps mkdir from failing if the directory already exists)
!mkdir -p yelp_db_clone/data/
!gsutil cp gs://coms4111/notebooks/jupyter/data/*.json file:///yelp_db_clone/data/

Copying gs://coms4111/notebooks/jupyter/data/yelp_academic_dataset_business.json...
Copying gs://coms4111/notebooks/jupyter/data/yelp_academic_dataset_checkin.json...
Copying gs://coms4111/notebooks/jupyter/data/yelp_academic_dataset_review.json...
Copying gs://coms4111/notebooks/jupyter/data/yelp_academic_dataset_tip.json...  
/ [4 files][  5.5 GiB/  5.5 GiB]   54.4 MiB/s                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://coms4111/notebooks/jupyter/data/yelp_academic_dataset_user.json...
/ [5 files][  8.6 GiB/  8.6 GiB]   92.4 MiB/s                                   
Operation completed over 5 objects/8.6 GiB.                                      


In [36]:
# Local -> HDFS
!hdfs dfs -cp -f file:///yelp_db_clone/data/* hdfs:///user/dataproc/

In [37]:
# Check whether data is now in HDFS!
!hdfs dfs -ls hdfs:///user/dataproc/

Found 5 items
-rw-r--r--   2 root hadoop  118863795 2022-05-03 20:19 hdfs:///user/dataproc/yelp_academic_dataset_business.json
-rw-r--r--   2 root hadoop  286958945 2022-05-03 20:19 hdfs:///user/dataproc/yelp_academic_dataset_checkin.json
-rw-r--r--   2 root hadoop 5341868833 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_review.json
-rw-r--r--   2 root hadoop  180604475 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_tip.json
-rw-r--r--   2 root hadoop 3363329011 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_user.json
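
A quick sanity check is to compare the total size reported by HDFS with the 8.6 GiB that `gsutil` copied. The sketch below parses the listing above as plain text. The helper name `parse_hdfs_ls` is hypothetical, and the code assumes the eight-column layout shown in this output.

```python
LISTING = """\
Found 5 items
-rw-r--r--   2 root hadoop  118863795 2022-05-03 20:19 hdfs:///user/dataproc/yelp_academic_dataset_business.json
-rw-r--r--   2 root hadoop  286958945 2022-05-03 20:19 hdfs:///user/dataproc/yelp_academic_dataset_checkin.json
-rw-r--r--   2 root hadoop 5341868833 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_review.json
-rw-r--r--   2 root hadoop  180604475 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_tip.json
-rw-r--r--   2 root hadoop 3363329011 2022-05-03 20:20 hdfs:///user/dataproc/yelp_academic_dataset_user.json
"""


def parse_hdfs_ls(listing):
    """Return {file name: size in bytes} from `hdfs dfs -ls` text output."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        # File rows have 8 columns:
        # permissions, replication, owner, group, size, date, time, path.
        if len(fields) != 8 or not fields[4].isdigit():
            continue  # skip the "Found N items" header and blank lines
        file_name = fields[7].rsplit("/", 1)[-1]
        sizes[file_name] = int(fields[4])
    return sizes


sizes = parse_hdfs_ls(LISTING)
total_bytes = sum(sizes.values())
print(f"{len(sizes)} files, {total_bytes / 2**30:.2f} GiB")  # 5 files, 8.65 GiB
```

The total agrees with the `gsutil` summary, so all five files reached HDFS at full size.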


## Examples

In [1]:
import pyspark
from pyspark.sql.functions import col

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/03 20:44:01 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/05/03 20:44:01 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/05/03 20:44:01 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/05/03 20:44:01 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator


### Read Data & Print Schema

In [3]:
review = spark.read.json('hdfs:///user/dataproc/yelp_academic_dataset_review.json')
review.printSchema()



root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)

In [4]:
business = spark.read.json('hdfs:///user/dataproc/yelp_academic_dataset_business.json')
business.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

22/05/03 20:44:25 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


- Here we see that Spark handles semi-structured data: `attributes` is inferred as a nested struct rather than a flat column.
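
The nested `attributes` struct in the schema above comes from schema inference over JSON records that do not all carry the same keys. The plain-Python sketch below imitates that idea without Spark: the inferred struct holds the union of the keys seen, and a field missing from a record reads as null (`None` here). The three records and the `get_path` helper are hypothetical; only the field names come from the schema.

```python
import json

# Three hypothetical records in the same JSON Lines layout as the business file.
RAW_LINES = [
    '{"business_id": "b1", "name": "Cafe A", '
    '"attributes": {"RestaurantsTakeOut": "True", "BikeParking": "False"}}',
    '{"business_id": "b2", "name": "Salon B", '
    '"attributes": {"ByAppointmentOnly": "True"}}',
    '{"business_id": "b3", "name": "Shop C", "attributes": null}',
]


def get_path(record, dotted_path):
    """Follow a dotted path such as 'attributes.RestaurantsTakeOut'.

    Returns None as soon as a step is missing, like a nullable nested column.
    """
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


records = [json.loads(line) for line in RAW_LINES]

# The inferred struct contains the union of all keys seen under `attributes`.
attribute_fields = sorted(
    {key for record in records for key in (record["attributes"] or {})}
)
print(attribute_fields)

for record in records:
    # Values stay strings ("True"), matching the string-typed fields in the schema.
    print(record["name"], get_path(record, "attributes.RestaurantsTakeOut"))
```

In PySpark the same dotted path is passed straight to `select`, for example `business.select('attributes.RestaurantsTakeOut')`.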

In [5]:
user = spark.read.json('hdfs:///user/dataproc/yelp_academic_dataset_user.json')
user.printSchema()



root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)

### Join

In [6]:
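# Keep only the columns needed from each side, renaming `name` so that the
# business name and the user name do not collide after the joins.
business_cols = business.select('business_id', col('name').alias('biz_name'), 'attributes.RestaurantsTakeOut')
user_cols = user.select('user_id', col('name').alias('user_name'), 'fans', 'yelping_since')

review_wide = (review
               .join(business_cols, on='business_id', how='inner')
               .join(user_cols, on='user_id', how='inner'))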

In [7]:
%%time 
review_wide.count()



CPU times: user 22.6 ms, sys: 3.98 ms, total: 26.5 ms
Wall time: 14.9 s

6990247

- This command takes a long time (14.9 s) because Spark evaluates lazily: `count()` is the first action, so it also executes the deferred joins.

In [8]:
# But we can explicitly tell Spark to keep the joined result in memory
review_wide = review_wide.persist(pyspark.StorageLevel.MEMORY_ONLY)

In [9]:
%%time 
# This run takes even longer: it executes the join again and also stores the result in memory
review_wide.count()



CPU times: user 69.5 ms, sys: 7.25 ms, total: 76.7 ms
Wall time: 37.5 s

6990247

In [11]:
%%time 
# Now the count is much faster because it reads the cached result
review_wide.count()



CPU times: user 4.17 ms, sys: 45 µs, total: 4.21 ms
Wall time: 1.43 s

6990247

- Finally, counting nearly 7 million rows (6,990,247) takes only 1.43 s.
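
The three timings above (14.9 s, 37.5 s, 1.43 s) follow from lazy evaluation combined with caching. The plain-Python toy model below shows the same pattern on a tiny scale. It is only an analogy, not how Spark is implemented: the `LazyTable` class, the `toy_join` function, and the sample rows are all hypothetical.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark's implementation)."""

    def __init__(self, compute):
        self._compute = compute   # deferred work, e.g. a join
        self._persist = False     # set by persist()
        self._cache = None        # filled by the first action after persist()
        self.runs = 0             # how many times the deferred work actually ran

    def persist(self):
        # Only marks the table for caching; nothing is computed yet.
        self._persist = True
        return self

    def _rows(self):
        if self._cache is not None:
            return self._cache    # served from memory, no recomputation
        self.runs += 1
        rows = self._compute()
        if self._persist:
            self._cache = rows
        return rows

    def count(self):
        # An action: forces the deferred work to run (or reads the cache).
        return len(self._rows())


def toy_join():
    """Inner-join three toy reviews with two toy businesses."""
    reviews = [("r1", "b1"), ("r2", "b1"), ("r3", "b2")]
    names = {"b1": "Cafe A", "b2": "Shop C"}
    return [(rid, bid, names[bid]) for rid, bid in reviews if bid in names]


table = LazyTable(toy_join)
table.count()      # first action: runs the join
table.persist()    # marks the table for caching, still no work done
table.count()      # runs the join again and stores the rows
table.count()      # served from the cache
print(table.runs)  # 2
```

The join runs twice, not three times: once for the first `count()` and once more to fill the cache, which is why the second timed cell is the slowest. In Spark, calling `review_wide.unpersist()` releases the cached data once it is no longer needed.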

In [12]:
review_wide.show(5)

+--------------------+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+------------------+---------+----+-------------------+
|             user_id|         business_id|cool|               date|funny|           review_id|stars|                text|useful|            biz_name|RestaurantsTakeOut|user_name|fans|      yelping_since|
+--------------------+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+------------------+---------+----+-------------------+
|--RJK834fiQXm21Vp...|aIoUwpy5ZFQXUDxWM...|   0|2019-08-25 23:17:52|    0|QPF7spAqCc-D81GeX...|  1.0|There are new own...|     0|     Pete & Shorty's|              True|    Renee|   0|2018-02-04 20:34:16|
|--UhENQdbuWEh0mU5...|K_s-9Wd6vXSfnxYFz...|   1|2017-08-06 02:42:02|    1|dghJt1TSuyFkmLddu...|  5.0|When im first arr...|     0|Kei Sushi Restaurant|              True|    Sonny| 

### Create View