<a href="https://colab.research.google.com/github/Fuenfgeld/2022TeamADataEngineeringBC/blob/PySpark/PySparkTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
pip install pyspark



In [7]:
!wget -cq https://raw.githubusercontent.com/Fuenfgeld/2022TeamADataEngineeringBC/ca4b2ecc9e9ee242037d11c27edd4f4ad770e7ee/iris.json

In [8]:
!wget -cq https://raw.githubusercontent.com/Fuenfgeld/2022TeamADataEngineeringBC/PySpark/iris2.json

## 1. Loading Data

Before we can analyze data, we have to load it into our working environment. PySpark provides readers for many file formats, including `.csv` and `.json`. The basic unit of data storage in PySpark is the so-called `DataFrame` class.

In [9]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [10]:
df1 = spark.read.option("multiline",True).json('iris.json')
print(f"Object Type: {type(df1)}\n")
print("Column Info:")
df1.printSchema()
print("Overview Dataframe:")
df1.show(10)

Object Type: <class 'pyspark.sql.dataframe.DataFrame'>

Column Info:
root
 |-- petalLength: double (nullable = true)
 |-- petalWidth: double (nullable = true)
 |-- sepalLength: double (nullable = true)
 |-- sepalWidth: double (nullable = true)
 |-- species: string (nullable = true)

Overview Dataframe:
+-----------+----------+-----------+----------+-------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|
+-----------+----------+-----------+----------+-------+
|        1.4|       0.2|        5.1|       3.5| setosa|
|        1.4|       0.2|        4.9|       3.0| setosa|
|        1.3|       0.2|        4.7|       3.2| setosa|
|        1.5|       0.2|        4.6|       3.1| setosa|
|        1.4|       0.2|        5.0|       3.6| setosa|
|        1.7|       0.4|        5.4|       3.9| setosa|
|        1.4|       0.3|        4.6|       3.4| setosa|
|        1.5|       0.2|        5.0|       3.4| setosa|
|        1.4|       0.2|        4.4|       2.9| setosa|
|        1.5|       0.1|

In [12]:
df1.select("petalLength").describe().show()

+-------+------------------+
|summary|       petalLength|
+-------+------------------+
|  count|               150|
|   mean|3.7580000000000027|
| stddev|1.7652982332594662|
|    min|               1.0|
|    max|               6.9|
+-------+------------------+



## 2. Basic transformations
Two of the most basic table operations are accessing specific rows and columns and adding new rows and columns.

`DataFrame.collect()` gathers the distributed data from the executors and returns it to the driver as a local Python list of `Row` objects. Note that this can cause an out-of-memory error when the dataset is too large to fit into the driver's memory.

In [13]:
local_df1 = df1.collect()
local_df1[0]

Row(petalLength=1.4, petalWidth=0.2, sepalLength=5.1, sepalWidth=3.5, species='setosa')

If we want to access specific columns, we can reference them as attributes of the DataFrame (`.NameOfColumn`) and pass them to `.select()`.

In [14]:
petalLength = df1.petalLength
petalWidth = df1.petalWidth
df1.select(petalLength, petalWidth).show(5)

+-----------+----------+
|petalLength|petalWidth|
+-----------+----------+
|        1.4|       0.2|
|        1.4|       0.2|
|        1.3|       0.2|
|        1.5|       0.2|
|        1.4|       0.2|
+-----------+----------+
only showing top 5 rows



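Suppose we have similar datasets from multiple sources. Wouldn't it be practical to combine them into one table? PySpark provides such a functionality via the `.union()` method, which is equivalent to `UNION ALL` in SQL: duplicate rows are kept, and columns are matched by position rather than by name. The method returns a new `DataFrame` and leaves `df1` unchanged.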

In [15]:
df2 = spark.read.json('iris2.json')
df2.show()
df1.union(df2)

+-----------+----------+-----------+----------+---------+
|petalLength|petalWidth|sepalLength|sepalWidth|  species|
+-----------+----------+-----------+----------+---------+
|        5.1|       1.8|        5.9|       3.0|virginica|
+-----------+----------+-----------+----------+---------+



DataFrame[petalLength: double, petalWidth: double, sepalLength: double, sepalWidth: double, species: string]

In case we want to add columns, we can do so via the `.withColumn()` method. Note that we have to specify the name of the new column, which is `newColumn` in this case. Usually the new column is a function of one or more of the existing columns.

In [16]:
df_extraCol = df1.withColumn('newColumn', df1.petalWidth + df1.petalLength)
df_extraCol.show(5)

+-----------+----------+-----------+----------+-------+------------------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|         newColumn|
+-----------+----------+-----------+----------+-------+------------------+
|        1.4|       0.2|        5.1|       3.5| setosa|1.5999999999999999|
|        1.4|       0.2|        4.9|       3.0| setosa|1.5999999999999999|
|        1.3|       0.2|        4.7|       3.2| setosa|               1.5|
|        1.5|       0.2|        4.6|       3.1| setosa|               1.7|
|        1.4|       0.2|        5.0|       3.6| setosa|1.5999999999999999|
+-----------+----------+-----------+----------+-------+------------------+
only showing top 5 rows



The name `'newColumn'` isn't really informative, so it is hard for the user to deduce that it is the sum of `'petalWidth'` and `'petalLength'`. So why not rename it to something more descriptive? We can do this via the `.withColumnRenamed()` method.

In [17]:
df_extraCol = df_extraCol.withColumnRenamed('newColumn','petalSum')
df_extraCol.show(5)

+-----------+----------+-----------+----------+-------+------------------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|          petalSum|
+-----------+----------+-----------+----------+-------+------------------+
|        1.4|       0.2|        5.1|       3.5| setosa|1.5999999999999999|
|        1.4|       0.2|        4.9|       3.0| setosa|1.5999999999999999|
|        1.3|       0.2|        4.7|       3.2| setosa|               1.5|
|        1.5|       0.2|        4.6|       3.1| setosa|               1.7|
|        1.4|       0.2|        5.0|       3.6| setosa|1.5999999999999999|
+-----------+----------+-----------+----------+-------+------------------+
only showing top 5 rows



To get rid of our new column, we can use `.drop()`. In contrast to `.select()`, which returns only the listed columns, this method returns the table without the specified column.




In [18]:
df1 = df_extraCol.drop(df_extraCol.petalSum)
df1.show(5)

+-----------+----------+-----------+----------+-------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|
+-----------+----------+-----------+----------+-------+
|        1.4|       0.2|        5.1|       3.5| setosa|
|        1.4|       0.2|        4.9|       3.0| setosa|
|        1.3|       0.2|        4.7|       3.2| setosa|
|        1.5|       0.2|        4.6|       3.1| setosa|
|        1.4|       0.2|        5.0|       3.6| setosa|
+-----------+----------+-----------+----------+-------+
only showing top 5 rows



Sometimes rows contain entries that make dealing with our data more difficult or lower its quality. Two examples come to mind. First, duplicate entries can introduce bias into our data, which negatively impacts the performance of many machine learning algorithms.

Second, null entries can render some rows useless, because most algorithms can't handle them. Luckily, PySpark provides two methods, `.dropna()` and `.dropDuplicates()`, to get rid of such problematic rows.



In [19]:
df1 = df1.dropna()

In [20]:
df1 = df1.dropDuplicates() 

You have learned how to perform some basic transformations of a table, but what if you want to look up values based not on position but on criteria, such as a certain column's entry being larger than some threshold? In the next chapter, we will look at how to select rows via user-defined conditions.