![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/2.PySpark_DataFrames.ipynb)

# PySpark Tutorial-2 DataFrame

# Overview
PySpark DataFrames are implemented on top of RDDs and are lazily evaluated: when Spark transforms data, it does not compute the result immediately but plans how to compute it later. The computation starts only when an action such as `collect()` is explicitly called. This notebook shows the basic usage of the DataFrame API and is geared mainly toward new users.

PySpark applications start by initializing a `SparkSession`, the entry point of PySpark, as shown below. When you run the PySpark shell via the `pyspark` executable, the shell automatically creates the session in the variable `spark`.

[source](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation)

### Install PySpark

In [None]:
# install PySpark
! pip install pyspark==3.2.0

### Initializing Spark

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark

In [None]:
#  DO NOT FORGET WHEN YOU'RE DONE => spark.stop()

# PySpark DataFrame

If you are used to working with pandas or data frames in R, you would probably expect to see a header when you inspect an RDD, but there is none. To make your life easier, you will move on from the RDD and convert it to a DataFrame. DataFrames are preferred over RDDs whenever you can use them; especially when you are working with Python, DataFrames perform better than RDDs.

But what is the difference between the two?

You can use RDDs when you want to perform low-level transformations and actions on unstructured data. This means that you don't care about imposing a schema while processing, or about accessing attributes by name or column. In terms of performance, choosing RDDs means you are willing to forgo the optimizations that DataFrames offer for (semi-)structured data. Use RDDs when you want to manipulate the data with functional programming constructs rather than domain-specific expressions.

To recapitulate, you’ll switch to DataFrames now to use high-level expressions, to perform SQL queries to explore your data further and to gain columnar access.


## DataFrame Creation

A PySpark DataFrame can be created via `pyspark.sql.SparkSession.createDataFrame`, typically by passing a list of lists, tuples, dictionaries, or `pyspark.sql.Row`s, a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), or an RDD consisting of such a list.
`pyspark.sql.SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.

First, import the modules used in this section, then create a PySpark DataFrame from a pandas DataFrame.

In [14]:
from datetime import datetime, date
import pandas as pd

from pyspark.sql import Row
import pyspark.sql.functions as F

In [None]:
# Create a DataFrame from pandas

iphones = pd.DataFrame([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])

names = [ 'Model',
          'Year',
          'Height',
          'Width',
          'Weight'
]

iphones_df = spark.createDataFrame(iphones, schema=names)

type(iphones_df)


pyspark.sql.dataframe.DataFrame

The top rows of a DataFrame can be displayed using `DataFrame.show()`.

In [None]:
iphones_df.show()

+-----+----+------+-----+------+
|Model|Year|Height|Width|Weight|
+-----+----+------+-----+------+
|   XS|2018|  5.65| 2.79|  6.24|
|   XR|2018|  5.94| 2.98|  6.84|
|  X10|2017|  5.65| 2.79|  6.13|
|8Plus|2017|  6.23| 3.07|  7.12|
+-----+----+------+-----+------+



In [None]:
# The rows can also be shown vertically. This is useful when rows are too long to show horizontally.

iphones_df.show(1, vertical=True)

-RECORD 0------
 Model  | XS   
 Year   | 2018 
 Height | 5.65 
 Width  | 2.79 
 Weight | 6.24 
only showing top 1 row



In [None]:
# printSchema(): It displays the schema of the data.

iphones_df.printSchema()

root
 |-- Model: string (nullable = true)
 |-- Year: long (nullable = true)
 |-- Height: double (nullable = true)
 |-- Width: double (nullable = true)
 |-- Weight: double (nullable = true)



In [None]:
# Create a DataFrame from RDDs

rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [None]:
df.show()

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+



In [None]:
# Create a PySpark DataFrame with an explicit schema.

df = spark.createDataFrame([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [None]:
# create a PySpark DataFrame from a list of rows

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [4]:
# Download data from github

! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/airport-codes.csv

In [2]:
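# Create a DataFrame by reading a CSV/JSON/TXT file

# df_json = spark.read.json("./airport-codes.json")   # header/inferSchema are CSV options; the JSON reader infers the schema by default

# df_txt = spark.read.text("./airport-codes.txt")     # the reader method is text(), not txt(); it yields one string column named "value"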

df_airport = spark.read.csv("./airport-codes.csv", header=True, inferSchema=True)

In [3]:
# The top rows of a DataFrame can be displayed using DataFrame.show()

df_airport.show(10)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|40.07080078125, -...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|38.704022, -101.4...|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|59.94919968, -151...|
| 00AL|small_airport|        Epps Airpark|         820|       NA|         US|     

## DataFrame Operations

In [None]:
# printSchema() operation prints the types of columns in the DataFrame

df_airport.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [None]:
# The columns attribute returns the DataFrame's column names as a list:

df_airport.columns

['ident',
 'type',
 'name',
 'elevation_ft',
 'continent',
 'iso_country',
 'iso_region',
 'municipality',
 'gps_code',
 'iata_code',
 'local_code',
 'coordinates']

In [None]:
df_airport.count()

55113

In [None]:
# select() is a transformation that keeps only the given columns.
# show() returns None, so its result is not assigned to a variable.

df_airport.select("type", "municipality", "iso_region").show(5)

# same as df_airport.select(["type", "municipality", "iso_region"]).show(5)

+-------------+------------+----------+
|         type|municipality|iso_region|
+-------------+------------+----------+
|     heliport|    Bensalem|     US-PA|
|small_airport|       Leoti|     US-KS|
|small_airport|Anchor Point|     US-AK|
|small_airport|     Harvest|     US-AL|
|       closed|     Newport|     US-AR|
+-------------+------------+----------+
only showing top 5 rows



In [None]:
# describe() computes summary statistics (count, mean, stddev, min, max) for the selected columns

df_airport.select("ident", "elevation_ft", "municipality").describe().show()

+-------+--------------------+------------------+-------------------+
|summary|               ident|      elevation_ft|       municipality|
+-------+--------------------+------------------+-------------------+
|  count|               55113|             48180|              49473|
|   mean|2.3873375337777779E8|1243.2134703196348|               null|
| stddev| 9.492375382267495E8|1605.0652744362676|               null|
|    min|                 00A|             -1266|"""Big"" Rock Flat"|
|    max|                spgl|             22000|             Žocene|
+-------+--------------------+------------------+-------------------+



`DataFrame.collect()` collects the distributed data to the driver side as the local data in Python. Note that this can throw an out-of-memory error when the dataset is too large to fit in the driver side because it collects all the data from executors to the driver side.

In [None]:
# DataFrame.collect()

df_airport.collect()[:5]

[Row(ident='00A', type='heliport', name='Total Rf Heliport', elevation_ft=11, continent='NA', iso_country='US', iso_region='US-PA', municipality='Bensalem', gps_code='00A', iata_code=None, local_code='00A', coordinates='40.07080078125, -74.93360137939453'),
 Row(ident='00AA', type='small_airport', name='Aero B Ranch Airport', elevation_ft=3435, continent='NA', iso_country='US', iso_region='US-KS', municipality='Leoti', gps_code='00AA', iata_code=None, local_code='00AA', coordinates='38.704022, -101.473911'),
 Row(ident='00AK', type='small_airport', name='Lowell Field', elevation_ft=450, continent='NA', iso_country='US', iso_region='US-AK', municipality='Anchor Point', gps_code='00AK', iata_code=None, local_code='00AK', coordinates='59.94919968, -151.695999146'),
 Row(ident='00AL', type='small_airport', name='Epps Airpark', elevation_ft=820, continent='NA', iso_country='US', iso_region='US-AL', municipality='Harvest', gps_code='00AL', iata_code=None, local_code='00AL', coordinates='34.8

In [None]:
df_airport.take(1)

[Row(ident='00A', type='heliport', name='Total Rf Heliport', elevation_ft=11, continent='NA', iso_country='US', iso_region='US-PA', municipality='Bensalem', gps_code='00A', iata_code=None, local_code='00A', coordinates='40.07080078125, -74.93360137939453')]

A PySpark DataFrame can also be converted back to a pandas DataFrame to leverage the pandas API. Note that `toPandas()` also collects all data to the driver, which can easily cause an out-of-memory error when the data is too large to fit in driver memory.

In [None]:
df_airport.toPandas()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"40.07080078125, -74.93360137939453"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"38.704022, -101.473911"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"59.94919968, -151.695999146"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"34.86479949951172, -86.77030181884766"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"35.6087, -91.254898"
...,...,...,...,...,...,...,...,...,...,...,...,...
55108,ZYYK,medium_airport,Yingkou Lanqi Airport,0.0,AS,CN,CN-21,Yingkou,ZYYK,YKH,,"40.542524, 122.3586"
55109,ZYYY,medium_airport,Shenyang Dongta Airport,,AS,CN,CN-21,Shenyang,ZYYY,,,"41.784400939941406, 123.49600219726562"
55110,ZZ-0001,heliport,Sealand Helipad,40.0,EU,GB,GB-ENG,Sealand,,,,"51.894444, 1.4825"
55111,ZZ-0002,small_airport,Glorioso Islands Airstrip,11.0,AF,TF,TF-U-A,Grande Glorieuse,,,,"-11.584277777799999, 47.296388888900005"


In [None]:
df_airport.columns

['ident',
 'type',
 'name',
 'elevation_ft',
 'continent',
 'iso_country',
 'iso_region',
 'municipality',
 'gps_code',
 'iata_code',
 'local_code',
 'coordinates']

## Selecting and Accessing Data

PySpark DataFrames are lazily evaluated: simply selecting a column does not trigger any computation; it returns a `Column` instance.

In [None]:
df_airport.municipality

Column<'municipality'>

In [None]:
df_airport.select(df_airport.municipality).show(10)

+------------+
|municipality|
+------------+
|    Bensalem|
|       Leoti|
|Anchor Point|
|     Harvest|
|     Newport|
|        Alex|
|      Cordes|
|     Barstow|
|       Biggs|
| Pine Valley|
+------------+
only showing top 10 rows



In [None]:
df_airport.select(F.col("municipality")).show(10)

+------------+
|municipality|
+------------+
|    Bensalem|
|       Leoti|
|Anchor Point|
|     Harvest|
|     Newport|
|        Alex|
|      Cordes|
|     Barstow|
|       Biggs|
| Pine Valley|
+------------+
only showing top 10 rows



Use `DataFrame.withColumn()` to add a new column computed from a `Column` expression.

In [4]:
from pyspark.sql import Column
from pyspark.sql.functions import upper

In [9]:
df_airport.withColumn('upper_MUNIPALITY', upper(df_airport.municipality)).show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+----------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|upper_MUNIPALITY|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+----------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|40.07080078125, -...|        BENSALEM|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|38.704022, -101.4...|           LEOTI|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|59.94919968, -151..

To select a subset of rows, use `DataFrame.filter()`.

In [10]:
# DataFrame.filter()

df_airport.filter(df_airport.type ==  "heliport").show(5)

+-----+--------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|    type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+--------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|40.07080078125, -...|
| 00CN|heliport|Kitchen Creek Hel...|        3350|       NA|         US|     US-CA| Pine Valley|    00CN|     null|      00CN|32.7273736, -116....|
| 00FD|heliport|  Ringhaver Heliport|          25|       NA|         US|     US-FL|   Riverview|    00FD|     null|      00FD|28.84659957885742...|
| 00GE|heliport|    Caffrey Heliport|         957|       NA|         US|     US-GA|       Hiram|    00GE|     nu

In [None]:
# Missing values are real nulls, not the string "null", so test them with isNotNull()
df_airport.filter((df_airport.iso_country != "US") & (df_airport.iata_code.isNotNull())).show(3)

+-----+-------------+---------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
|ident|         type|           name|elevation_ft|continent|iso_country|iso_region|   municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+---------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
|  03N|small_airport| Utirik Airport|           4|       OC|         MH|    MH-UTI|  Utirik Island|    K03N|      UTK|       03N|  11.222, 169.852005|
|  AAD|small_airport|  Adado Airport|        1001|       AF|         SO|     SO-GA|          Adado|    null|      AAD|      null|   6.095802, 46.6375|
|  ABP|small_airport|Atkamba Airport|         150|       OC|         PG|    PG-WPD|Atkamba Mission|    null|      ABP|       AKA|-6.06555555556000...|
+-----+-------------+---------------+------------+---------+-----------+----------+-----------

In [None]:
# we can also use brackets (as in Pandas) instead of filter()

df_airport[df_airport.iso_country != "US" ].show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+----------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|    municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+----------------+--------+---------+----------+--------------------+
| 02PR|small_airport|     Cuylers Airport|          15|       NA|         PR|    PR-U-A|       Vega Baja|    02PR|     null|      02PR|18.45330047607422...|
|  03N|small_airport|      Utirik Airport|           4|       OC|         MH|    MH-UTI|   Utirik Island|    K03N|      UTK|       03N|  11.222, 169.852005|
| 0TT8|     heliport|    Dynasty Heliport|         150|       OC|         MP|    MP-U-A|San Jose, Tinian|    0TT8|     null|      0TT8|14.96329975128173...|
| 12PR|     heliport|Villamil-304 Ponc...|         148|   

In [None]:
# Missing values are real nulls, not the string "null", so test them with isNotNull()
df_airport[(df_airport.iso_country != "US") & (df_airport.iata_code.isNotNull())].show(3)

+-----+-------------+---------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
|ident|         type|           name|elevation_ft|continent|iso_country|iso_region|   municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+---------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
|  03N|small_airport| Utirik Airport|           4|       OC|         MH|    MH-UTI|  Utirik Island|    K03N|      UTK|       03N|  11.222, 169.852005|
|  AAD|small_airport|  Adado Airport|        1001|       AF|         SO|     SO-GA|          Adado|    null|      AAD|      null|   6.095802, 46.6375|
|  ABP|small_airport|Atkamba Airport|         150|       OC|         PG|    PG-WPD|Atkamba Mission|    null|      ABP|       AKA|-6.06555555556000...|
+-----+-------------+---------------+------------+---------+-----------+----------+-----------

In [5]:
# df.col.contains(other): Contains the other element. Returns a boolean Column based on a string match.

df_airport.filter(df_airport.municipality.contains('Island')).show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|      municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------------+--------+---------+----------+--------------------+
|  03N|small_airport|      Utirik Airport|           4|       OC|         MH|    MH-UTI|     Utirik Island|    K03N|      UTK|       03N|  11.222, 169.852005|
| 0IS0|     heliport|Trinity Medical C...|         664|       NA|         US|     US-IL|       Rock Island|    0IS0|     null|      0IS0|41.48109817504883...|
| 0MD2|small_airport|Squier Landing Ai...|          16|       NA|         US|     US-MD|       Cobb Island|    null|     null|      null|38.287781, -76.86...|
| 0XS5|     heliport|Southeastern Heli...|    

In [6]:
# endswith: String ends with. Returns a boolean Column based on a string match.
# startswith: String starts with. Returns a boolean Column based on a string match

df_airport.filter(df_airport.name.endswith('Airpark')).show(5)

+-----+-------------+------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|              name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
| 00AL|small_airport|      Epps Airpark|         820|       NA|         US|     US-AL|     Harvest|    00AL|     null|      00AL|34.86479949951172...|
|  01J|small_airport|  Hilliard Airpark|          59|       NA|         US|     US-FL|    Hilliard|     01J|     null|       01J|30.68630027770996...|
| 01NC|small_airport|   Topsail Airpark|          65|       NA|         US|     US-NC| Holly Ridge|    01NC|     null|      01NC|34.47529983520508...|
| 02MI|small_airport|Fairplains Airpark|         850|       NA|         US|     US-MI|  Greenv

In [7]:
# SQL like expression. Returns a boolean Column based on a SQL LIKE match.

df_airport.filter(df_airport.name.like('Hi%')).show(5)

+-----+-------------+--------------------+------------+---------+-----------+----------+----------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|    municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+----------------+--------+---------+----------+--------------------+
|  01J|small_airport|    Hilliard Airpark|          59|       NA|         US|     US-FL|        Hilliard|     01J|     null|       01J|30.68630027770996...|
| 01MO|     heliport|Highway Patrol Tr...|         615|       NA|         US|     US-MO|Town and Country|    01MO|     null|      01MO|38.64170074462890...|
| 03AL|     heliport|Highland Medical ...|         628|       NA|         US|     US-AL|      Scottsboro|    03AL|     null|      03AL|34.662604, -86.04...|
| 05TE|small_airport|   Hilde-Griff Field|         950|   

## Grouping Data

PySpark DataFrame also provides a way of handling grouped data using the common split-apply-combine strategy: it groups the data by a certain condition, applies a function to each group, and then combines the results back into a DataFrame.
    
    Some Aggregation Operations
    Row Count:                F.count(col)
    Sum of Rows in Group:     F.sum(col)
    Mean of Rows in Group:    F.mean(col)
    Max of Rows in Group:     F.max(col)
    Min of Rows in Group:     F.min(col)
    First Row in Group:       F.first(col)


In [11]:
# groupby() groups the rows by one or more columns and returns a GroupedData object

df_airport_group = df_airport.groupby('type')

df_airport_group.count().show(5)

+-------------+-----+
|         type|count|
+-------------+-----+
|large_airport|  613|
|  balloonport|   23|
|seaplane_base| 1014|
|     heliport|11316|
|       closed| 3618|
+-------------+-----+
only showing top 5 rows



In [None]:
# orderBy() sorts the DataFrame by one or more columns

df_airport_group.count().orderBy('type').show(5)

+--------------+-----+
|          type|count|
+--------------+-----+
|   balloonport|   23|
|        closed| 3618|
|      heliport|11316|
| large_airport|  613|
|medium_airport| 4531|
+--------------+-----+
only showing top 5 rows



In [None]:
# avg() computes the mean of the given numeric columns for each group

df_airport.groupby('type').avg('elevation_ft').show()

+--------------+------------------+
|          type| avg(elevation_ft)|
+--------------+------------------+
| large_airport| 802.8962108731466|
|   balloonport|           1059.55|
| seaplane_base| 651.4975786924939|
|      heliport|1163.2955186461925|
|        closed|1022.8905136857893|
|medium_airport| 1046.634371395617|
| small_airport|1341.5329663040595|
+--------------+------------------+



## Some Built-in Functions

In [8]:
# withColumnRenamed() renames a column in the DataFrame

df_airport = df_airport.withColumnRenamed('ident', 'index')
df_airport = df_airport.withColumnRenamed('elevation_ft', 'ft')

In [9]:
# dropDuplicates() removes the duplicate rows of a DataFrame

df_no_dup = df_airport.select('index').dropDuplicates()

df_no_dup.count()

55113

In [10]:
# drop() returns a new DataFrame without the given columns.

# Note that df_airport is reassigned here, so later cells see only the remaining columns.

df_airport = df_airport.drop("gps_code","iata_code","local_code","iso_country","iso_region","continent")
df_airport.printSchema()

root
 |-- index: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ft: integer (nullable = true)
 |-- municipality: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [None]:
df_airport.coordinates

Column<'coordinates'>

In [11]:
from pyspark.sql.functions import split

# split() takes a regular-expression pattern; use the same ", " separator for both parts
df_airport = df_airport.withColumn("lat", split("coordinates", ", ")[0])\
                       .withColumn("long", split("coordinates", ", ")[1])

df_airport = df_airport.drop("coordinates")
df_airport.show(5)

+-----+-------------+--------------------+----+------------+-----------------+------------------+
|index|         type|                name|  ft|municipality|              lat|              long|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
|  00A|     heliport|   Total Rf Heliport|  11|    Bensalem|   40.07080078125|-74.93360137939453|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|        38.704022|       -101.473911|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|      59.94919968|    -151.695999146|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.86479949951172|-86.77030181884766|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|          35.6087|        -91.254898|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
only showing top 5 rows



In [None]:
# between(lower, upper) returns a boolean Column that is True when the value lies in the inclusive range [lower, upper]

df_airport.filter(df_airport.lat.between(40.0, 41.0)).show(5)

+-----+-------------+--------------------+----+-------------+-----------------+------------------+
|index|         type|                name|  ft| municipality|              lat|              long|
+-----+-------------+--------------------+----+-------------+-----------------+------------------+
|  00A|     heliport|   Total Rf Heliport|  11|     Bensalem|   40.07080078125|-74.93360137939453|
| 00CO|       closed|          Cass Field|4830|   Briggsdale|        40.622202|       -104.344002|
| 00IS|small_airport|Hayenga's Cant Fi...| 820|        Kings|40.02560043334961| -89.1229019165039|
| 00NJ|     heliport|Colgate-Piscatawa...|  78|New Brunswick|40.52090072631836|-74.47460174560547|
| 00PS|       closed|        Thomas Field| 815|    Loysville|          40.3778|        -77.365303|
+-----+-------------+--------------------+----+-------------+-----------------+------------------+
only showing top 5 rows



In [12]:
# replace() returns a new DataFrame in which one value is replaced with another; subset restricts it to the given columns

df_airport.replace({"heliport": "heli_port"}, subset=["type"]).show(5)

+-----+-------------+--------------------+----+------------+-----------------+------------------+
|index|         type|                name|  ft|municipality|              lat|              long|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
|  00A|    heli_port|   Total Rf Heliport|  11|    Bensalem|   40.07080078125|-74.93360137939453|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|        38.704022|       -101.473911|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|      59.94919968|    -151.695999146|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.86479949951172|-86.77030181884766|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|          35.6087|        -91.254898|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
only showing top 5 rows



In [18]:
# round(col, scale=0): Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.

df_airport = df_airport.withColumn('lat', F.round('lat', 1))
df_airport.show(5)

+-----+-------------+--------------------+----+------------+----+------------------+
|index|         type|                name|  ft|municipality| lat|              long|
+-----+-------------+--------------------+----+------------+----+------------------+
|  00A|     heliport|   Total Rf Heliport|  11|    Bensalem|40.1|-74.93360137939453|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|38.7|       -101.473911|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|59.9|    -151.695999146|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.9|-86.77030181884766|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|35.6|        -91.254898|
+-----+-------------+--------------------+----+------------+----+------------------+
only showing top 5 rows



In [19]:
# F.floor(col) Computes the floor of the given value.

df_airport = df_airport.withColumn('long', F.floor('long'))
df_airport.show(5)

+-----+-------------+--------------------+----+------------+----+----+
|index|         type|                name|  ft|municipality| lat|long|
+-----+-------------+--------------------+----+------------+----+----+
|  00A|     heliport|   Total Rf Heliport|  11|    Bensalem|40.1| -75|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|38.7|-102|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|59.9|-152|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.9| -87|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|35.6| -92|
+-----+-------------+--------------------+----+------------+----+----+
only showing top 5 rows



## Applying a Function

PySpark supports various UDFs and APIs that let users execute Python native functions. See also the latest [Pandas UDFs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs) and [Pandas Function APIs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-function-apis). For instance, a Python native function registered as a pandas UDF can directly use the APIs of [a pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

In [23]:
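# Applying a function
# Note: multiply is not wrapped with pandas_udf here, so it receives a Spark
# Column and Spark evaluates `ft * 1000` itself (see the column name below).

def multiply(series: pd.Series) -> pd.Series:
    # Multiply every value by 1000.
    return series * 1000

df_airport.select(multiply(df_airport.ft)).show(10)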

+-----------+
|(ft * 1000)|
+-----------+
|      11000|
|    3435000|
|     450000|
|     820000|
|     237000|
|    1100000|
|    3810000|
|    3038000|
|      87000|
|    3350000|
+-----------+
only showing top 10 rows



In [22]:
def pandas_filter_func(iterator):
    for pandas_df in iterator:
        yield pandas_df[pandas_df.ft > 10000]

df_airport.mapInPandas(pandas_filter_func, schema=df_airport.schema).show(5)

+-------+--------------+--------------------+-----+------------------+-----+----+
|  index|          type|                name|   ft|      municipality|  lat|long|
+-------+--------------+--------------------+-----+------------------+-----+----+
|   01CO|      heliport|St Vincent Genera...|10175|         Leadville| 39.2|-107|
|   AF13| small_airport|Sheber Too Landin...|10490|              null| 34.8|  67|
|    AHJ|medium_airport|    Hongyuan Airport|11600|               Aba| 32.5| 102|
|AR-0478|      heliport|Pachón Minera Hel...|11810|Pachón, Calingasta|-31.8| -71|
|   CD19|      heliport|   Arapahoe Heliport|10672|     Idaho Springs| 39.7|-106|
+-------+--------------+--------------------+-----+------------------+-----+----+
only showing top 5 rows
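
Because the function passed to `mapInPandas` is a plain Python generator over pandas DataFrames, it can be tried out without Spark. The sketch below feeds it two hand-made batches, where the `batches` list and its values are hypothetical toy data, to show what Spark would do for each partition batch.

```python
import pandas as pd

def pandas_filter_func(iterator):
    # Each item is one batch of rows as a pandas DataFrame.
    for pandas_df in iterator:
        yield pandas_df[pandas_df.ft > 10000]

# Two toy batches standing in for the batches Spark would supply.
batches = [
    pd.DataFrame({"name": ["a", "b"], "ft": [11, 10175]}),
    pd.DataFrame({"name": ["c"], "ft": [11600]}),
]

# Concatenate the filtered batches, as Spark does when it builds the result.
filtered = pd.concat(list(pandas_filter_func(iter(batches))), ignore_index=True)
print(filtered)
```

Only rows with `ft` above 10000 survive, and a batch may shrink or even come back empty, which is why `mapInPandas` allows the output length to differ from the input length.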



In [28]:
# Creates a user defined function (UDF)

udf_func = F.udf(lambda x, y: str(x) + " , " + str(y))

df_airport.withColumn('lat-long', udf_func(df_airport.lat,df_airport.long)).show(5)

+-----+-------------+--------------------+----+------------+----+----+-----------+
|index|         type|                name|  ft|municipality| lat|long|   lat-long|
+-----+-------------+--------------------+----+------------+----+----+-----------+
|  00A|     heliport|   Total Rf Heliport|  11|    Bensalem|40.1| -75| 40.1 , -75|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|38.7|-102|38.7 , -102|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|59.9|-152|59.9 , -152|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.9| -87| 34.9 , -87|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|35.6| -92| 35.6 , -92|
+-----+-------------+--------------------+----+------------+----+----+-----------+
only showing top 5 rows



## Data load/save timing

CSV is straightforward and easy to use. Parquet is an efficient, compact columnar file format that is typically faster to read and write.

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/amazonFood.csv

In [None]:
amazon = spark.read.csv('amazonFood.csv', header=True)

For CSV files:

In [None]:
%%time
# Write CSV (the cell magic must be the first line of the cell)
amazon.write.csv('amazon.csv', header=True)

CPU times: user 31.4 ms, sys: 4.22 ms, total: 35.7 ms
Wall time: 4.06 s


In [None]:
%%time
# Read CSV (the cell magic must be the first line of the cell)
df_amazon = spark.read.csv('amazon.csv', header=True)

CPU times: user 9.2 ms, sys: 0 ns, total: 9.2 ms
Wall time: 636 ms


In [None]:
%%time
%%capture
# collect (cell magics must come before any other line in the cell)
df_amazon.collect()

CPU times: user 403 ms, sys: 72.1 ms, total: 475 ms
Wall time: 3.31 s


For Parquet files:

In [None]:
%%time
# Write PARQUET (the cell magic must be the first line of the cell)
amazon.write.parquet('amazon.parquet')

CPU times: user 18.9 ms, sys: 4.01 ms, total: 22.9 ms
Wall time: 2.96 s


In [None]:
%%time
# Read PARQUET (the cell magic must be the first line of the cell)
df_amazon = spark.read.parquet('amazon.parquet')

CPU times: user 4.11 ms, sys: 0 ns, total: 4.11 ms
Wall time: 259 ms


In [None]:
%%time
%%capture
# collect (cell magics must come before any other line in the cell)
df_amazon.collect()

CPU times: user 388 ms, sys: 85.7 ms, total: 473 ms
Wall time: 2.14 s


# PySpark SQL

## Interacting with DataFrames using PySpark SQL

A SQL query returns a table derived from one or more tables contained in a database.

Every SQL query is made up of commands that tell the database what you want to do with the data. The two commands found in nearly every query are SELECT and FROM.

The SELECT command is followed by the columns you want in the resulting table.

The FROM command is followed by the name of the table that contains those columns. The minimal SQL query is:

SELECT * FROM my_table;

The * selects all columns, so this returns the entire table named my_table.

Similar to .withColumn(), you can do column-wise computations within a SELECT statement. For example,

SELECT origin, dest, air_time / 60 FROM flights;

returns a table with the origin, destination, and duration in hours for each flight.

Another commonly used command is WHERE. This command filters the rows of the table based on some logical condition you specify. The resulting table contains the rows where your condition is true. For example, if you had a table of students and grades you could do:

SELECT * FROM students
WHERE grade = 'A';

to select all the columns and the rows containing information about students who got As.


Another common database task is aggregation. That is, reducing your data by breaking it into chunks and summarizing each chunk.

This is done in SQL using the GROUP BY command. This command breaks your data into groups and applies a function from your SELECT statement to each group.

For example, if you wanted to count the number of flights from each origin airport, you could use the query

SELECT COUNT(*) FROM flights
GROUP BY origin;

GROUP BY origin tells SQL that you want the output to have a row for each unique value of the origin column. The SELECT statement selects the values you want to populate each of the columns. Here, we want to COUNT() every row in each of the groups.

It's possible to GROUP BY more than one column. When you do this, the resulting table has a row for every combination of the unique values in each column. The following query counts the number of flights from SEA and PDX to every destination airport:

SELECT origin, dest, COUNT(*) FROM flights
GROUP BY origin, dest;

The output will have a row for every combination of the values in origin and dest (i.e. a row listing each origin and destination that a flight flew to). There will also be a column with the COUNT() of all the rows in each group.

Another very common data operation is the join. Joins are a whole topic unto themselves, so this tutorial only looks at simple joins.

A join will combine two different tables along a column that they share. This column is called the key. Examples of keys here include the tailnum and carrier columns from the flights table.

For example, suppose that you want to know more information about the plane that flew a flight than just the tail number. This information isn't in the flights table because the same plane flies many different flights over the course of two years, so including this information in every row would result in a lot of duplication. To avoid this, you'd have a second table that has only one row for each plane and whose columns list all the information about the plane, including its tail number. You could call this table planes.

When you join the flights table to this table of airplane information, you're adding all the columns from the planes table to the flights table. To fill these columns with information, you'll look at the tail number from the flights table and find the matching one in the planes table, and then use that row to fill out all the new columns.

Now you'll have a much bigger table than before, but now every row has all information about the plane that flew that flight!



In [None]:
# Executing SQL Queries

# The SparkSession sql() method executes SQL query

# sql() method takes a SQL statement as an argument and returns the result as DataFrame

df_airport.createOrReplaceTempView("table1")

df_sql = spark.sql("SELECT * FROM table1")

df_sql.show(5)

+-----+-------------+--------------------+----+------------+-----------------+------------------+
|index|         type|                name|  ft|municipality|              lat|              long|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
|  00A|     heliport|   Total Rf Heliport|  11|    Bensalem|   40.07080078125|-74.93360137939453|
| 00AA|small_airport|Aero B Ranch Airport|3435|       Leoti|        38.704022|       -101.473911|
| 00AK|small_airport|        Lowell Field| 450|Anchor Point|      59.94919968|    -151.695999146|
| 00AL|small_airport|        Epps Airpark| 820|     Harvest|34.86479949951172|-86.77030181884766|
| 00AR|       closed|Newport Hospital ...| 237|     Newport|          35.6087|        -91.254898|
+-----+-------------+--------------------+----+------------+-----------------+------------------+
only showing top 5 rows



In [None]:
df_airport.createOrReplaceTempView("table1")

df_sql = spark.sql("SELECT name, type, municipality, lat, long FROM table1 WHERE ft > 100")

df_sql.show(5)

+--------------------+-------------+------------+-----------------+------------------+
|                name|         type|municipality|              lat|              long|
+--------------------+-------------+------------+-----------------+------------------+
|Aero B Ranch Airport|small_airport|       Leoti|        38.704022|       -101.473911|
|        Lowell Field|small_airport|Anchor Point|      59.94919968|    -151.695999146|
|        Epps Airpark|small_airport|     Harvest|34.86479949951172|-86.77030181884766|
|Newport Hospital ...|       closed|     Newport|          35.6087|        -91.254898|
|      Fulton Airport|small_airport|        Alex|       34.9428028|       -97.8180194|
+--------------------+-------------+------------+-----------------+------------------+
only showing top 5 rows



In [None]:
df_airport.createOrReplaceTempView("table2")

query = 'SELECT type, max(lat) FROM table2 GROUP BY type'

spark.sql(query).show(10)

+--------------+-----------------+
|          type|         max(lat)|
+--------------+-----------------+
|   balloonport|        51.710142|
|        closed|           9.3615|
|      heliport|9.971667289733887|
| large_airport|     9.0713596344|
|medium_airport|9.993860244750977|
| seaplane_base|         9.299746|
| small_airport|         9.994933|
+--------------+-----------------+



In [None]:
df_sql.createOrReplaceTempView("airport")

query = "SELECT * FROM airport LIMIT 10"

# Get the first 10 rows of the airport view
airport = spark.sql(query)

# Show the results
airport.show()

+--------------------+-------------+------------+------------------+-------------------+
|                name|         type|municipality|               lat|               long|
+--------------------+-------------+------------+------------------+-------------------+
|Aero B Ranch Airport|small_airport|       Leoti|         38.704022|        -101.473911|
|        Lowell Field|small_airport|Anchor Point|       59.94919968|     -151.695999146|
|        Epps Airpark|small_airport|     Harvest| 34.86479949951172| -86.77030181884766|
|Newport Hospital ...|       closed|     Newport|           35.6087|         -91.254898|
|      Fulton Airport|small_airport|        Alex|        34.9428028|        -97.8180194|
|      Cordes Airport|small_airport|      Cordes|34.305599212646484|-112.16500091552734|
|Goldstone /Gts/ A...|small_airport|     Barstow|35.350498199499995|     -116.888000488|
|Kitchen Creek Hel...|     heliport| Pine Valley|        32.7273736|       -116.4597417|
|          Cass Field

In [None]:
query = "SELECT type, COUNT(*) as N FROM airport GROUP BY type"

# Run the query
airport_counts = spark.sql(query)

# Convert the results to a pandas DataFrame
pd_counts = airport_counts.toPandas()

# Print the full pandas DataFrame
print(pd_counts)

             type      N
0   large_airport    406
1     balloonport     17
2   seaplane_base    592
3        heliport   7497
4          closed   2126
5  medium_airport   3083
6   small_airport  26014


# Resources

1. https://spark.apache.org/docs/latest/rdd-programming-guide.html
2. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html
3. https://github.com/vkocaman/PySpark_Essentials_March_2019
4. https://github.com/sundarramamurthy/pyspark
5. https://towardsdatascience.com/beginners-guide-to-pyspark-bbe3b553b79f
6. https://www.guru99.com/pyspark-tutorial.html
7. https://towardsdatascience.com/exploratory-data-analysis-eda-with-pyspark-on-databricks-e8d6529626b1

