## Welcome to this course "Getting started with Apache Spark"
## Video: Summarize data in PySpark

![PySpark](https://drive.google.com/uc?id=1oU2tHXn4Tb4NJ0GQLbFQanLUVWj-3M-G)

## Contents
- Summarize a DataFrame in PySpark
  - Shape (no. of rows x no. of columns)
  - Schema
  - describe (count, mean, stddev, min, max)
  - Percentiles (25%, 50%, 75%)

## Setting up the PySpark environment
- For a detailed walkthrough of the setup, see this video: https://www.youtube.com/watch?v=r5PbUuLUZiE
  - The link is also in the video description below
- Run the cell below to install the required libraries and files

In [1]:
# Setting up the PySpark environment

# Install java 8
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download the Apache Spark binary. This link changes with each release: update it (and the file name and SPARK_HOME below) to the version you want before running
!wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

# Extract the archive
!tar -xf spark-3.0.2-bin-hadoop2.7.tgz

# Install findspark: adds PySpark to sys.path at runtime
!pip install -q findspark

# Install pyspark
!pip install pyspark

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# findspark locates the Spark installation (via SPARK_HOME) and makes it importable
import findspark
findspark.init()

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos
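
Because the Spark download link and folder name change between releases, it can help to confirm that `JAVA_HOME` and `SPARK_HOME` point to directories that actually exist. The following is a minimal sketch; the helper name `missing_dirs` is hypothetical and is not part of findspark or PySpark.

```python
import os

def missing_dirs(var_names, environ=None):
    """Return the variables that are unset or do not point to an existing directory."""
    environ = os.environ if environ is None else environ
    return [name for name in var_names
            if not os.path.isdir(environ.get(name, ""))]

# Check the two variables set in the setup cell
problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these environment variables:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```

If a variable is reported, the most likely cause is a version mismatch between the downloaded archive name and the path assigned to `SPARK_HOME`.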

### Initialize SparkSession

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("Hands-on PySpark on Google Colab") \
        .getOrCreate()

In [3]:
spark

## Summarize

### Read data
Dataset (In-vehicle coupon recommendation): https://archive.ics.uci.edu/ml/machine-learning-databases/00603/in-vehicle-coupon-recommendation.csv

In [4]:
!wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/00603/in-vehicle-coupon-recommendation.csv -P sample_data/

In [5]:
# header='true' takes the column names from the first row; inferSchema='true' makes Spark scan the data to infer column types

filepath = "sample_data/in-vehicle-coupon-recommendation.csv"
spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(filepath)
spark_df.show(5, truncate=False)

+---------------+---------+-------+-----------+----+---------------------+----------+------+---+-----------------+------------+------------------------+----------+---------------+----+-----+-----------+---------+--------------------+----------------+----------------+-----------------+-----------------+--------------+-------------+---+
|destination    |passanger|weather|temperature|time|coupon               |expiration|gender|age|maritalStatus    |has_children|education               |occupation|income         |car |Bar  |CoffeeHouse|CarryAway|RestaurantLessThan20|Restaurant20To50|toCoupon_GEQ5min|toCoupon_GEQ15min|toCoupon_GEQ25min|direction_same|direction_opp|Y  |
+---------------+---------+-------+-----------+----+---------------------+----------+------+---+-----------------+------------+------------------------+----------+---------------+----+-----+-----------+---------+--------------------+----------------+----------------+-----------------+-----------------+--------------+--------

In [7]:
columns_to_use = ["destination", "passanger", "weather", "temperature", "time", "coupon", "gender", "age", "has_children", "toCoupon_GEQ5min", "Y"]
spark_df = spark_df.select(*columns_to_use)
spark_df.show(5, truncate=False)

+---------------+---------+-------+-----------+----+---------------------+------+---+------------+----------------+---+
|destination    |passanger|weather|temperature|time|coupon               |gender|age|has_children|toCoupon_GEQ5min|Y  |
+---------------+---------+-------+-----------+----+---------------------+------+---+------------+----------------+---+
|No Urgent Place|Alone    |Sunny  |55         |2PM |Restaurant(<20)      |Female|21 |1           |1               |1  |
|No Urgent Place|Friend(s)|Sunny  |80         |10AM|Coffee House         |Female|21 |1           |1               |0  |
|No Urgent Place|Friend(s)|Sunny  |80         |10AM|Carry out & Take away|Female|21 |1           |1               |1  |
|No Urgent Place|Friend(s)|Sunny  |80         |2PM |Coffee House         |Female|21 |1           |1               |0  |
|No Urgent Place|Friend(s)|Sunny  |80         |2PM |Coffee House         |Female|21 |1           |1               |0  |
+---------------+---------+-------+-----

### Shape of the DataFrame (number of rows x number of columns)

In [8]:
spark_df.count()

12684

In [9]:
print(spark_df.columns)

['destination', 'passanger', 'weather', 'temperature', 'time', 'coupon', 'gender', 'age', 'has_children', 'toCoupon_GEQ5min', 'Y']


In [10]:
len(spark_df.columns)

11
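
PySpark DataFrames have no `shape` attribute like pandas, so the two calls above can be wrapped in a small helper. This is a sketch only; the name `spark_shape` is hypothetical.

```python
def spark_shape(df):
    """Return (number of rows, number of columns), similar to pandas' df.shape.

    df.count() is a Spark action, so calling this triggers a Spark job.
    """
    return df.count(), len(df.columns)

# Usage in this notebook (assumes spark_df from the cells above):
# rows, cols = spark_shape(spark_df)
```

For the DataFrame used here, the two outputs above imply a result of `(12684, 11)`.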

### Schema of the dataframe

In [11]:
spark_df.printSchema()

root
 |-- destination: string (nullable = true)
 |-- passanger: string (nullable = true)
 |-- weather: string (nullable = true)
 |-- temperature: integer (nullable = true)
 |-- time: string (nullable = true)
 |-- coupon: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- has_children: integer (nullable = true)
 |-- toCoupon_GEQ5min: integer (nullable = true)
 |-- Y: integer (nullable = true)



In [12]:
spark_df.dtypes

[('destination', 'string'),
 ('passanger', 'string'),
 ('weather', 'string'),
 ('temperature', 'int'),
 ('time', 'string'),
 ('coupon', 'string'),
 ('gender', 'string'),
 ('age', 'string'),
 ('has_children', 'int'),
 ('toCoupon_GEQ5min', 'int'),
 ('Y', 'int')]
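
Since `dtypes` is a plain Python list of `(name, type)` pairs, it can be used to separate numeric columns from the rest before summarizing. The sketch below uses the pairs printed above; the helper names are hypothetical, and the set of numeric type strings is an assumption covering the common Spark numeric types.

```python
# Type strings Spark reports in df.dtypes for common numeric columns (assumed list)
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}

def is_numeric(dtype):
    """True for the simple numeric types and for decimal(p,s) types."""
    return dtype in NUMERIC_TYPES or dtype.startswith("decimal")

def split_by_type(dtypes):
    """Split (column name, type string) pairs into numeric and other column names."""
    numeric = [name for name, dtype in dtypes if is_numeric(dtype)]
    other = [name for name, dtype in dtypes if not is_numeric(dtype)]
    return numeric, other

# Pairs copied from the spark_df.dtypes output above
dtypes = [('destination', 'string'), ('passanger', 'string'), ('weather', 'string'),
          ('temperature', 'int'), ('time', 'string'), ('coupon', 'string'),
          ('gender', 'string'), ('age', 'string'), ('has_children', 'int'),
          ('toCoupon_GEQ5min', 'int'), ('Y', 'int')]

numeric_cols, other_cols = split_by_type(dtypes)
print(numeric_cols)  # ['temperature', 'has_children', 'toCoupon_GEQ5min', 'Y']
print(other_cols)
```

Note that `age` lands in the non-numeric group because it was read as a string column.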

### Describe the DataFrame (count, mean, stddev, min, max)

In [13]:
spark_df.describe()

DataFrame[summary: string, destination: string, passanger: string, weather: string, temperature: string, time: string, coupon: string, gender: string, age: string, has_children: string, toCoupon_GEQ5min: string, Y: string]

In [14]:
# describe() returns a DataFrame; call an action such as show() or collect() to see its contents
spark_df.describe().show()

+-------+-----------+---------+-------+------------------+-----+---------------+------+------------------+------------------+----------------+------------------+
|summary|destination|passanger|weather|       temperature| time|         coupon|gender|               age|      has_children|toCoupon_GEQ5min|                 Y|
+-------+-----------+---------+-------+------------------+-----+---------------+------+------------------+------------------+----------------+------------------+
|  count|      12684|    12684|  12684|             12684|12684|          12684| 12684|             12684|             12684|           12684|             12684|
|   mean|       null|     null|   null|63.301797540208135| null|           null|  null|29.887815247850035|0.4141438032166509|             1.0|0.5684326710816777|
| stddev|       null|     null|   null| 19.15448575684057| null|           null|  null| 7.697275065801651|   0.4925929797555|             0.0| 0.495314356461186|
|    min|       Home|    Alo

### Percentiles

In [15]:
spark_df.summary().show()

+-------+-----------+---------+-------+------------------+-----+---------------+------+------------------+------------------+----------------+------------------+
|summary|destination|passanger|weather|       temperature| time|         coupon|gender|               age|      has_children|toCoupon_GEQ5min|                 Y|
+-------+-----------+---------+-------+------------------+-----+---------------+------+------------------+------------------+----------------+------------------+
|  count|      12684|    12684|  12684|             12684|12684|          12684| 12684|             12684|             12684|           12684|             12684|
|   mean|       null|     null|   null|63.301797540208135| null|           null|  null|29.887815247850035|0.4141438032166509|             1.0|0.5684326710816777|
| stddev|       null|     null|   null| 19.15448575684057| null|           null|  null| 7.697275065801651|   0.4925929797555|             0.0| 0.495314356461186|
|    min|       Home|    Alo

In [16]:
spark_df.summary("40%", "60%", "90%").show()

+-------+-----------+---------+-------+-----------+----+------+------+----+------------+----------------+---+
|summary|destination|passanger|weather|temperature|time|coupon|gender| age|has_children|toCoupon_GEQ5min|  Y|
+-------+-----------+---------+-------+-----------+----+------+------+----+------------+----------------+---+
|    40%|       null|     null|   null|         55|null|  null|  null|26.0|           0|               1|  0|
|    60%|       null|     null|   null|         80|null|  null|  null|31.0|           1|               1|  1|
|    90%|       null|     null|   null|         80|null|  null|  null|41.0|           1|               1|  1|
+-------+-----------+---------+-------+-----------+----+------+------+----+------------+----------------+---+
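
`summary()` returns its results as strings inside another DataFrame. When the percentiles are needed as Python numbers, `DataFrame.approxQuantile(col, probabilities, relativeError)` is an alternative for numeric columns. The wrapper below is a minimal sketch; the name `quantiles` and the default `relative_error` value are assumptions.

```python
def quantiles(df, column, probabilities, relative_error=0.01):
    """Map each requested probability to its approximate quantile for one numeric column.

    relative_error=0.0 asks Spark for exact quantiles, which can be expensive on large data.
    """
    values = df.approxQuantile(column, list(probabilities), relative_error)
    return dict(zip(probabilities, values))

# Usage in this notebook (assumes spark_df from the cells above):
# quantiles(spark_df, "temperature", [0.4, 0.6, 0.9])
```

A string column such as `age` would need to be cast to a numeric type before it can be passed to `approxQuantile`.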



### To explore other DataFrame methods, use autocomplete in Google Colab (type `spark_df.`)

In [None]:
spark_df.
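
Outside an editor with autocomplete, the same list of methods can be printed with Python's built-in `dir()`. This is a sketch; the helper name `public_methods` is hypothetical.

```python
def public_methods(obj):
    """List the public method names defined on an object's class.

    Attributes are looked up on the class rather than the instance,
    which avoids triggering property getters.
    """
    cls = type(obj)
    return sorted(
        name for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name, None))
    )

# Demonstrated on a plain list so the block runs without Spark;
# in the notebook, call public_methods(spark_df) instead
print(public_methods([]))
```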

## Summary
- We have seen how to extract basic statistics from a DataFrame using PySpark

### Thank you :)
- That's the end of this video. If you liked it, please like, share, and subscribe to my channel.
- If you are on LinkedIn, please tag me and share your thoughts on this video and the series "Getting started with PySpark - Hands on". This will motivate me to make more videos.
<div>
<img src="https://drive.google.com/uc?id=1ttB2gJaw0cXuJfj6GBx5VaYf2ArjiRXM" width="200"/>
</div>