<a href="https://colab.research.google.com/github/Fuenfgeld/2022TeamADataEngineeringBC/blob/PySpark/PySparkTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Data Engineering Bootcamp
In this tutorial you will be introduced to an aspect of Data Engineering called ETL. Together we will implement an ETL workflow with Apache Spark in Python. By the end of the tutorial you will be able to adapt such a workflow to your specific needs and will understand the benefits of using Spark in doing so.



### 0.1 What is Data Engineering ?

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. Their ultimate goal is to make data accessible so that organizations can use it to evaluate and optimize their performance. This last sentence also sums up the difference between a data engineer and a data analyst: the former manages the data resources, whereas the latter exploits them to gain valuable insights.

### 0.2 What is ETL ?

According to IBM, ETL, which stands for extract, transform and load, is a data integration process that combines data from multiple data sources into a single, consistent data store. It is closely linked with the concept of a *Data Warehouse*, which describes a central repository of integrated data from one or more disparate sources.

#### Extraction
During data extraction, raw data is copied or exported from a variety of data sources, which can be structured or unstructured, such as SQL databases, JSON files or even web pages.
#### Transformation
The collected raw data then undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case. Typical transformation steps are de-duplicating values, performing calculations, translations, or summarizations based on the raw data, and changing the shape of the data via joining and grouping operations in order to match the schema of the target data warehouse. The environment in which the transformation step is performed is also called the *staging area*.
#### Loading
In this last step, the transformed data is moved from the staging area into a target data warehouse. Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes.
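To make the three stages concrete, here is a minimal sketch that uses only the Python standard library. The inline records, the table name `flowers`, and the in-memory SQLite database standing in for a data warehouse are assumptions made for illustration; they are not part of this tutorial's dataset or of any particular ETL tool.

```python
import json
import sqlite3

# Extract: parse raw records from a source. An inline JSON string stands in
# for a file or an API response (illustrative data).
raw = (
    '[{"species": "setosa", "petalLength": 1.4},'
    ' {"species": "setosa", "petalLength": 1.4},'
    ' {"species": "virginica", "petalLength": 5.1}]'
)
records = json.loads(raw)

# Transform: de-duplicate the records and reshape them into tuples that
# match the schema of the target table.
unique_rows = sorted({(r["species"], r["petalLength"]) for r in records})

# Load: write the transformed rows into the target store (hypothetical table).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE flowers (species TEXT, petal_length REAL)")
warehouse.executemany("INSERT INTO flowers VALUES (?, ?)", unique_rows)
warehouse.commit()

loaded = warehouse.execute(
    "SELECT species, petal_length FROM flowers ORDER BY species"
).fetchall()
print(loaded)
```

The rest of this tutorial performs the same kind of steps with Spark, which becomes worthwhile once the data no longer fits comfortably on one machine.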

### 0.3 What is Spark ?

According to the official website

>*Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.*

Now what does that mean? You can think of Spark as a programming library that allows you to outsource your data engineering workflow to a set of servers (a cluster). This lets you parallelize operations, which results in faster execution and the ability to work with amounts of data that couldn't be handled on a single computer (Big Data).

Hence, what Spark does is manage the interaction between your node (computer) and each node (server) of the cluster. Since Spark was originally written in Scala, there is no direct way to access its functionality in Python. This is where *PySpark* comes into play. You can think of PySpark as a Python-based wrapper on top of the Scala API. There are also similar wrappers for *R* and other programming languages, which is why the official website describes Spark as a *multi-language engine*.

In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=6e237fbf417a9ed198dfc17d15d2ed62f12f587856b3600fbf9bd9733161c8f4
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [2]:
!wget -cq https://raw.githubusercontent.com/Fuenfgeld/2022TeamADataEngineeringBC/ca4b2ecc9e9ee242037d11c27edd4f4ad770e7ee/iris.json

In [3]:
!wget -cq https://raw.githubusercontent.com/Fuenfgeld/2022TeamADataEngineeringBC/PySpark/iris2.json

## 1. Loading Data

Before we can analyze data we have to load it into our working environment. PySpark has a lot of functions that can deal with all kinds of formats, from `.csv` to `.json`. The basic unit of data storage in PySpark is the so-called `DataFrame` class.

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [5]:
df1 = spark.read.option("multiline",True).json('iris.json')
print(f"Object Type: {type(df1)}\n")
print("Column Info:")
df1.printSchema()
print("Summary Statistics of columns:")
df1.describe().show()
print("Overview Dataframe:")
df1.show(10)

Object Type: <class 'pyspark.sql.dataframe.DataFrame'>

Column Info:
root
 |-- petalLength: double (nullable = true)
 |-- petalWidth: double (nullable = true)
 |-- sepalLength: double (nullable = true)
 |-- sepalWidth: double (nullable = true)
 |-- species: string (nullable = true)

Summary Statistics of columns:
+-------+------------------+------------------+------------------+-------------------+---------+
|summary|       petalLength|        petalWidth|       sepalLength|         sepalWidth|  species|
+-------+------------------+------------------+------------------+-------------------+---------+
|  count|               150|               150|               150|                150|      150|
|   mean|3.7580000000000027| 1.199333333333334| 5.843333333333335|  3.057333333333334|     null|
| stddev|1.7652982332594662|0.7622376689603467|0.8280661279778637|0.43586628493669793|     null|
|    min|               1.0|               0.1|               4.3|                2.0|   setosa|
|    m

## 2. Basic transformations
Among the most basic operations on a table are accessing specific chunks of its rows and columns and creating new rows and columns.

### 2.1. Accessing Rows

Since Spark was conceived to work with distributed data, there is no simple way to access rows at will.

If you want to do so anyway, you can pull the data onto your local node.

`DataFrame.collect()` collects the distributed data to the driver side as local data in Python. Note that this can throw an out-of-memory error when the dataset is too large to fit on the driver, because it collects all the data from the executors.

In [6]:
# Returns list of Row objects
local_df1 = df1.collect()
print(f"Type of entries: {type(local_df1[0])}\n")
print(f"Entries: {local_df1[:5]}")

Type of entries: <class 'pyspark.sql.types.Row'>

Entries: [Row(petalLength=1.4, petalWidth=0.2, sepalLength=5.1, sepalWidth=3.5, species='setosa'), Row(petalLength=1.4, petalWidth=0.2, sepalLength=4.9, sepalWidth=3.0, species='setosa'), Row(petalLength=1.3, petalWidth=0.2, sepalLength=4.7, sepalWidth=3.2, species='setosa'), Row(petalLength=1.5, petalWidth=0.2, sepalLength=4.6, sepalWidth=3.1, species='setosa'), Row(petalLength=1.4, petalWidth=0.2, sepalLength=5.0, sepalWidth=3.6, species='setosa')]


### 2.2. Accessing Columns

Accessing columns doesn't come with the difficulties associated with handling rows. If we want to get specific columns we can simply do so through the `.select()` method. 

In [7]:
df1.select("petalLength").show(5)

+-----------+
|petalLength|
+-----------+
|        1.4|
|        1.4|
|        1.3|
|        1.5|
|        1.4|
+-----------+
only showing top 5 rows



It is also possible to choose multiple columns. Notice that we can address our columns with `DataFrame.NameOfColumn` instead of `"NameOfColumn"`.

In [8]:
petalLength = df1.petalLength
petalWidth = df1.petalWidth
df1.select(petalLength, petalWidth).show(5)

+-----------+----------+
|petalLength|petalWidth|
+-----------+----------+
|        1.4|       0.2|
|        1.4|       0.2|
|        1.3|       0.2|
|        1.5|       0.2|
|        1.4|       0.2|
+-----------+----------+
only showing top 5 rows



### 2.3. Concatenating DataFrames

Suppose we have a dataset that is split into multiple DataFrames. Wouldn't it be practical to combine them into one table? `pyspark` provides such functionality via the `.union()` method.

In [9]:
df2 = spark.read.json('iris2.json')
df2.show()
df1.union(df2)

+-----------+----------+-----------+----------+---------+
|petalLength|petalWidth|sepalLength|sepalWidth|  species|
+-----------+----------+-----------+----------+---------+
|        5.1|       1.8|        5.9|       3.0|virginica|
+-----------+----------+-----------+----------+---------+



DataFrame[petalLength: double, petalWidth: double, sepalLength: double, sepalWidth: double, species: string]

### 2.4 Adding Columns

In case we want to add columns, we can do so via the `.withColumn()` method. Note that we have to specify the name of the column, which is in this case `newColumn`. Usually the new column is a function of one or more of the old columns.

In [10]:
df_extraCol = df1.withColumn('newColumn', df1.petalWidth + df1.petalLength)
df_extraCol.show(5)

+-----------+----------+-----------+----------+-------+------------------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|         newColumn|
+-----------+----------+-----------+----------+-------+------------------+
|        1.4|       0.2|        5.1|       3.5| setosa|1.5999999999999999|
|        1.4|       0.2|        4.9|       3.0| setosa|1.5999999999999999|
|        1.3|       0.2|        4.7|       3.2| setosa|               1.5|
|        1.5|       0.2|        4.6|       3.1| setosa|               1.7|
|        1.4|       0.2|        5.0|       3.6| setosa|1.5999999999999999|
+-----------+----------+-----------+----------+-------+------------------+
only showing top 5 rows



The name `'newColumn'` isn't really informative. It's therefore hard for the user to deduce that it is the sum of `'petalWidth'` and `'petalLength'`. So why not rename it to something more indicative? We can do this via the `.withColumnRenamed()` method.

In [11]:
df_extraCol = df_extraCol.withColumnRenamed('newColumn','petalSum')
df_extraCol.show(5)

+-----------+----------+-----------+----------+-------+------------------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|          petalSum|
+-----------+----------+-----------+----------+-------+------------------+
|        1.4|       0.2|        5.1|       3.5| setosa|1.5999999999999999|
|        1.4|       0.2|        4.9|       3.0| setosa|1.5999999999999999|
|        1.3|       0.2|        4.7|       3.2| setosa|               1.5|
|        1.5|       0.2|        4.6|       3.1| setosa|               1.7|
|        1.4|       0.2|        5.0|       3.6| setosa|1.5999999999999999|
+-----------+----------+-----------+----------+-------+------------------+
only showing top 5 rows



### 2.5. Removing Columns

To get rid of our new column, `.drop()` can be used. In contrast to `.select()`, which returns the chosen columns as a slice of the table, this method returns the table without the specified column.




In [12]:
df1 = df_extraCol.drop(df_extraCol.petalSum)
df1.show(5)

+-----------+----------+-----------+----------+-------+
|petalLength|petalWidth|sepalLength|sepalWidth|species|
+-----------+----------+-----------+----------+-------+
|        1.4|       0.2|        5.1|       3.5| setosa|
|        1.4|       0.2|        4.9|       3.0| setosa|
|        1.3|       0.2|        4.7|       3.2| setosa|
|        1.5|       0.2|        4.6|       3.1| setosa|
|        1.4|       0.2|        5.0|       3.6| setosa|
+-----------+----------+-----------+----------+-------+
only showing top 5 rows



### 2.6. Basic Data Cleaning

Just as in a hospital, hygiene is of great importance when working with data. Sometimes rows contain entries that make dealing with our data more difficult or lower its quality (information pollution). Two examples come to mind. The first is duplicate entries, which could introduce bias into our data and thereby negatively impact the performance of many machine learning algorithms.

The second example is null entries, which might render some rows useless because most algorithms can't handle them. Luckily, PySpark provides two methods, `.dropna()` and `.dropDuplicates()`, to get rid of such problematic rows.



In [13]:
df1 = df1.dropna()

In [14]:
df1 = df1.dropDuplicates() 

Although our dataframe is now free of unwanted entries we might still want to put further restrictions on the data we want to keep. 

### 2.7. Conditional Selection of Rows.

In 2.1. we explained that directly accessing rows of a DataFrame comes with some caveats. It is, however, possible to access rows indirectly without pulling all the data onto your local node. This is done via conditional selection, where we select rows based on user-given conditions via the `.filter()` method. We don't know in advance which rows we will obtain, which is why we speak of indirect access.

Let's say we want to get only the flowers of type `"virginica"`. We then have to write the following:

In [15]:
df_virginica = df1.filter(df1.species == "virginica")
df_virginica.show(5)

+-----------+----------+-----------+----------+---------+
|petalLength|petalWidth|sepalLength|sepalWidth|  species|
+-----------+----------+-----------+----------+---------+
|        6.0|       1.8|        7.2|       3.2|virginica|
|        5.6|       2.1|        6.4|       2.8|virginica|
|        5.1|       2.3|        6.9|       3.1|virginica|
|        6.1|       2.5|        7.2|       3.6|virginica|
|        5.7|       2.3|        6.9|       3.2|virginica|
+-----------+----------+-----------+----------+---------+
only showing top 5 rows



### 2.8. Conclusion

You learned how to perform some basic transformations of the table, but maybe you also want to apply more complex functions to the dataframe's rows or columns such as summary statistics. In the next chapter we are going to take a look at advanced transformations.

## 3. Advanced Transformations
Advanced transformations are where PySpark really shines, enabling us to execute very complex queries using simple syntax to extract valuable insights from our data. In this chapter we will see the power of methods such as `.groupBy()` and `.join()`, especially in combination with more complex functions that are provided by the `functions` module.

### 3.1 Why use Spark functions ?
In general it is possible to use functions from other libraries such as `numpy` on Spark `DataFrame` objects. However, this defeats the purpose of Spark, namely its ability to optimize the performance of transformation pipelines through lazy execution.

This is why the `functions` module exists, which provides us with a copious amount of functions for all kinds of purposes.

Suppose we want to take the mean petal length of the virginica species. We can reuse the DataFrame `df_virginica` that we created before.


In [16]:
from pyspark.sql.functions import mean
virginica_mean_petalLength = df_virginica.select(mean("petalLength"))
# Execute pipeline.
virginica_mean_petalLength = virginica_mean_petalLength.collect()
print(f"Type of virginica_mean_petalLength: {type(virginica_mean_petalLength[0])}\n")
print(f"Mean petal length of virginica species: {virginica_mean_petalLength[0]}")

Type of virginica_mean_petalLength: <class 'pyspark.sql.types.Row'>

Mean petal length of virginica species: Row(avg(petalLength)=5.561224489795917)
