## Welcome to this course "Getting started with Apache Spark"
## Video: Convert PySpark dataframe to Pandas dataframe and vice-versa

![PySpark](https://drive.google.com/uc?id=1oU2tHXn4Tb4NJ0GQLbFQanLUVWj-3M-G)

## Contents
- Setting up PySpark environment
- Conversion of Dataframes from Spark to Pandas and vice-versa

## Setting up the PySpark environment
- Check out this video for more details: https://www.youtube.com/watch?v=r5PbUuLUZiE
- Run the cell below to install the required libraries and files.

In [1]:
# Setting up the PySpark environment

# Install Java 8
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download the Apache Spark binary. The link changes with each release, so update it to a current version before running.
# Older releases are moved to https://archive.apache.org/dist/spark/
!wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

# Unzip file
!tar -xf spark-3.0.2-bin-hadoop2.7.tgz

# Install findspark: Adds Pyspark to sys.path at runtime
!pip install -q findspark

# Install pyspark
!pip install pyspark

# Set environment variables (SPARK_HOME must match the Spark folder extracted above)
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# findspark will locate spark in the system
import findspark
findspark.init()

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Ign:5 http
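
A wrong `JAVA_HOME` or `SPARK_HOME` is a common reason for `findspark.init()` to fail, for example when the Spark version in the download link is updated but the path is not. The following is a minimal sketch of a sanity check that uses only the standard library. `check_spark_env` is a hypothetical helper name, not part of findspark or PySpark.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found with JAVA_HOME and SPARK_HOME.

    `env` is any mapping of variable names to values; it defaults to os.environ.
    An empty list means both variables are set and point to existing directories.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


# Print a warning for every problem found in the current environment
for problem in check_spark_env():
    print("WARNING:", problem)
```

If the check prints nothing, both paths exist. It does not verify that the directories actually contain a working Java or Spark installation.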

### Initialize SparkSession

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("Hands-on PySpark on Google Colab") \
        .getOrCreate()

In [3]:
spark

## Conversion of Dataframe

### Read data
Dataset: https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation

In [4]:
!wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/00603/in-vehicle-coupon-recommendation.csv -P sample_data/

In [5]:
# header='true' treats the first row as column names; inferSchema='true' makes Spark scan the data to infer column types

filepath = "sample_data/in-vehicle-coupon-recommendation.csv"
spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(filepath)
spark_df.show(5, truncate=False)

+---------------+---------+-------+-----------+----+---------------------+----------+------+---+-----------------+------------+------------------------+----------+---------------+----+-----+-----------+---------+--------------------+----------------+----------------+-----------------+-----------------+--------------+-------------+---+
|destination    |passanger|weather|temperature|time|coupon               |expiration|gender|age|maritalStatus    |has_children|education               |occupation|income         |car |Bar  |CoffeeHouse|CarryAway|RestaurantLessThan20|Restaurant20To50|toCoupon_GEQ5min|toCoupon_GEQ15min|toCoupon_GEQ25min|direction_same|direction_opp|Y  |
+---------------+---------+-------+-----------+----+---------------------+----------+------+---+-----------------+------------+------------------------+----------+---------------+----+-----+-----------+---------+--------------------+----------------+----------------+-----------------+-----------------+--------------+--------

In [6]:
spark_df.printSchema()

root
 |-- destination: string (nullable = true)
 |-- passanger: string (nullable = true)
 |-- weather: string (nullable = true)
 |-- temperature: integer (nullable = true)
 |-- time: string (nullable = true)
 |-- coupon: string (nullable = true)
 |-- expiration: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- maritalStatus: string (nullable = true)
 |-- has_children: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- income: string (nullable = true)
 |-- car: string (nullable = true)
 |-- Bar: string (nullable = true)
 |-- CoffeeHouse: string (nullable = true)
 |-- CarryAway: string (nullable = true)
 |-- RestaurantLessThan20: string (nullable = true)
 |-- Restaurant20To50: string (nullable = true)
 |-- toCoupon_GEQ5min: integer (nullable = true)
 |-- toCoupon_GEQ15min: integer (nullable = true)
 |-- toCoupon_GEQ25min: integer (nullable = true)
 |-- direction_same: inte

### Convert Spark DF to Pandas DF
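- Once the data is in pandas, you can use the usual Python libraries for modeling, visualization, preprocessing, etc. Note that `toPandas()` collects the entire DataFrame into the driver's memory, so use it only when the data fits on a single machine.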

In [7]:
pandas_df = spark_df.toPandas()
pandas_df.shape

(12684, 26)
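
This dataset is small, so converting all of it is fine. For larger data, one option is to cap the number of rows before calling `toPandas()`. The sketch below assumes a hypothetical helper `to_pandas_capped` with hypothetical parameters `max_rows` and `use_arrow`. The Arrow setting shown is the Spark 3.x configuration key and needs the `pyarrow` package, so it is off by default here.

```python
def to_pandas_capped(spark, spark_df, max_rows=100_000, use_arrow=False):
    """Convert at most `max_rows` rows of a Spark DataFrame to pandas.

    `spark` is the active SparkSession and `spark_df` a Spark DataFrame.
    The helper name and its defaults are illustrative assumptions.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    if use_arrow:
        # Spark 3.x setting that enables Arrow-based transfer (requires pyarrow)
        spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # limit() is applied by Spark, so only max_rows rows reach the driver
    return spark_df.limit(max_rows).toPandas()


# Usage with the session and DataFrame defined earlier in this notebook:
# pandas_preview = to_pandas_capped(spark, spark_df, max_rows=1000)
```

`limit()` returns the first rows Spark finds rather than a random subset, so use `sample()` instead when you need a representative slice.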

In [8]:
pandas_df.isnull().sum()

destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64

In [9]:
pandas_df.head(2)

| | destination | passanger | weather | temperature | time | coupon | expiration | gender | age | maritalStatus | has_children | education | occupation | income | car | Bar | CoffeeHouse | CarryAway | RestaurantLessThan20 | Restaurant20To50 | toCoupon_GEQ5min | toCoupon_GEQ15min | toCoupon_GEQ25min | direction_same | direction_opp | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No Urgent Place | Alone | Sunny | 55 | 2PM | Restaurant(<20) | 1d | Female | 21 | Unmarried partner | 1 | Some college - no degree | Unemployed | \$37500 - \$49999 | | never | never | | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Coffee House | 2h | Female | 21 | Unmarried partner | 1 | Some college - no degree | Unemployed | \$37500 - \$49999 | | never | never | | 4~8 | 1~3 | 1 | 0 | 0 | 0 | 1 | 0 |


In [10]:
pandas_df.dtypes

destination             object
passanger               object
weather                 object
temperature              int32
time                    object
coupon                  object
expiration              object
gender                  object
age                     object
maritalStatus           object
has_children             int32
education               object
occupation              object
income                  object
car                     object
Bar                     object
CoffeeHouse             object
CarryAway               object
RestaurantLessThan20    object
Restaurant20To50        object
toCoupon_GEQ5min         int32
toCoupon_GEQ15min        int32
toCoupon_GEQ25min        int32
direction_same           int32
direction_opp            int32
Y                        int32
dtype: object

In [13]:
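# Use sample() to take a random subset of the dataset and convert only that subset to pandas for further analysis.
# fraction is a per-row probability, so the number of rows returned is approximate.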
sample_pandas_df = spark_df.sample(withReplacement=False, fraction=0.01).toPandas()
sample_pandas_df.shape

(145, 26)
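
In the run above, `fraction=0.01` on 12,684 rows returned 145 rows rather than about 127. That is expected: when sampling without replacement, each row is kept independently with probability `fraction`, so the sample size varies from run to run. The pure-Python sketch below illustrates that idea only. `bernoulli_sample` is a hypothetical function and is not how Spark implements `sample()` internally.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(12684))  # stand-in for the 12,684 rows of the dataset
sizes = [len(bernoulli_sample(rows, 0.01, seed=s)) for s in range(5)]
print(sizes)  # sample sizes scatter around 12684 * 0.01, i.e. roughly 127

# For a repeatable sample in Spark, pass a seed to DataFrame.sample():
# sample_pandas_df = spark_df.sample(withReplacement=False, fraction=0.01, seed=42).toPandas()
```

The seed value 42 is arbitrary. Any fixed integer makes the sample repeatable for the same data and partitioning.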

In [14]:
sample_pandas_df["age"].value_counts()

26         33
21         32
31         20
50plus     19
41         12
36         12
46         11
below21     6
Name: age, dtype: int64

### Convert Pandas DF to Spark DF

In [15]:
# Select a few columns
pandas_df_subset = pandas_df[["destination", "passanger", "weather", "temperature", "time", "coupon", "age", "car", "Y"]].copy()
pandas_df_subset.head(2)

| | destination | passanger | weather | temperature | time | coupon | age | car | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | No Urgent Place | Alone | Sunny | 55 | 2PM | Restaurant(<20) | 21 | | 1 |
| 1 | No Urgent Place | Friend(s) | Sunny | 80 | 10AM | Coffee House | 21 | | 0 |


In [16]:
pandas_df_subset.shape

(12684, 9)

In [17]:
spark_df_from_pandas = spark.createDataFrame(pandas_df_subset)
spark_df_from_pandas.show(2)

+---------------+---------+-------+-----------+----+---------------+---+----+---+
|    destination|passanger|weather|temperature|time|         coupon|age| car|  Y|
+---------------+---------+-------+-----------+----+---------------+---+----+---+
|No Urgent Place|    Alone|  Sunny|         55| 2PM|Restaurant(<20)| 21|null|  1|
|No Urgent Place|Friend(s)|  Sunny|         80|10AM|   Coffee House| 21|null|  0|
+---------------+---------+-------+-----------+----+---------------+---+----+---+
only showing top 2 rows



In [18]:
type(spark_df_from_pandas)

pyspark.sql.dataframe.DataFrame

In [19]:
spark_df_from_pandas.printSchema()

root
 |-- destination: string (nullable = true)
 |-- passanger: string (nullable = true)
 |-- weather: string (nullable = true)
 |-- temperature: long (nullable = true)
 |-- time: string (nullable = true)
 |-- coupon: string (nullable = true)
 |-- age: string (nullable = true)
 |-- car: string (nullable = true)
 |-- Y: long (nullable = true)



### Use schema for spark dataframe

- StructType objects define the schema of Spark DataFrames. StructType objects contain a list of StructField objects that define the name, type, and nullable flag for each column in a DataFrame.

In [20]:
# Import only the types used in the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
      StructField("destination", StringType(), True),
      StructField("passanger", StringType(), True),
      StructField("weather", StringType(), True),
      StructField("temperature", IntegerType(), True),
      StructField("time", StringType(), True),
      StructField("coupon", StringType(), True),
      StructField("age", StringType(), True),
      StructField("car", StringType(), True),
      StructField("Y", IntegerType(), True)])

spark_df_from_pandas = spark.createDataFrame(pandas_df_subset, schema=schema)
spark_df_from_pandas.show(2)

+---------------+---------+-------+-----------+----+---------------+---+----+---+
|    destination|passanger|weather|temperature|time|         coupon|age| car|  Y|
+---------------+---------+-------+-----------+----+---------------+---+----+---+
|No Urgent Place|    Alone|  Sunny|         55| 2PM|Restaurant(<20)| 21|null|  1|
|No Urgent Place|Friend(s)|  Sunny|         80|10AM|   Coffee House| 21|null|  0|
+---------------+---------+-------+-----------+----+---------------+---+----+---+
only showing top 2 rows



In [21]:
spark_df_from_pandas.printSchema()

root
 |-- destination: string (nullable = true)
 |-- passanger: string (nullable = true)
 |-- weather: string (nullable = true)
 |-- temperature: integer (nullable = true)
 |-- time: string (nullable = true)
 |-- coupon: string (nullable = true)
 |-- age: string (nullable = true)
 |-- car: string (nullable = true)
 |-- Y: integer (nullable = true)
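
`createDataFrame` also accepts the schema as a DDL-formatted string instead of a `StructType`, which can be shorter to write. The sketch below builds such a string from (name, type) pairs. `build_ddl_schema` is a hypothetical helper, and the column names are wrapped in backticks as a precaution against clashes with SQL keywords. All columns declared this way are nullable.

```python
def build_ddl_schema(fields):
    """Build a DDL schema string such as "`a` int, `b` string" from (name, type) pairs."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)


fields = [
    ("destination", "string"),
    ("passanger", "string"),
    ("weather", "string"),
    ("temperature", "int"),
    ("time", "string"),
    ("coupon", "string"),
    ("age", "string"),
    ("car", "string"),
    ("Y", "int"),
]
ddl_schema = build_ddl_schema(fields)
print(ddl_schema)

# With the session from this notebook, pass the string in place of the StructType:
# spark_df_from_pandas = spark.createDataFrame(pandas_df_subset, schema=ddl_schema)
```

A `StructType` remains the better choice when you need non-nullable fields or want to reuse field objects in code.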



## Summary
- Spark DataFrame to Pandas DataFrame
- Pandas DataFrame to Spark DataFrame
- Sampling a DataFrame
- Passing a schema while converting a Pandas DataFrame to a Spark DataFrame

### Thank you :)
- That's the end of this video. If you liked it, please like, share, and subscribe to my channel.
- If you are on LinkedIn, please tag me and share your thoughts on this video and the series "Getting started with PySpark - Hands on". This will motivate me to make more videos.
<div>
<img src="https://drive.google.com/uc?id=1ttB2gJaw0cXuJfj6GBx5VaYf2ArjiRXM" width="200"/>
</div>