# Thursday, April 4

# Introduction to PySpark - Part 2 - Dataframes 🗄️🗄️

## What will you learn in this course? 🧐🧐

* DataFrames
    * Differences vs RDDs
        * Pros of DataFrames vs RDDs
    * Creation
        * From an RDD
        * From a pandas DataFrame
* Running SQL queries against DataFrames
    * Select columns in Spark DataFrames
    * Actions
        * `.show()`
        * `.printSchema()`
        * `.take()`
        * `.collect()`
        * `.count()`
        * `.describe()`
        * `display()`
        * `.toPandas()`
        * `.write`
    * Transformations
        * `.na`
        * `.fill()`
        * `.drop()`
        * `.isNull()`
        * `.replace()`
        * `.sql()`
        * `.select()`
        * `.alias(...)`
        * `.drop(...)`
        * `.limit()`
        * `.filter()`
        * `.selectExpr()`
        * `.dropDuplicates()`
        * `.distinct()`
        * `.orderBy()`
        * `.groupBy()`
        * `.withColumn()`
        * `.withColumnRenamed()`
        * Chaining everything together
        
* Some differences with pandas' DataFrames

## DataFrames 🗄️🗄️

A Spark DataFrame is a distributed collection of data organized into named columns.

Conceptually, it is equivalent to a table in a relational database.

---
> ⚠️ Although they're called DataFrames, Spark DataFrames are actually closer to SQL tables than to pandas DataFrames.

---

---
> 💡 If you want an API closer to pandas while maintaining fast big data processing capabilities, take a look at [koalas](https://github.com/databricks/koalas). Since Spark 3.2 it ships with PySpark itself as the pandas API on Spark (`pyspark.pandas`).
---

Spark DataFrames actually have richer optimizations than both SQL tables and pandas DataFrames (cf. [doc](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#overview)).

### DataFrames vs RDDs 🗄️🆚📃

Unlike Spark's RDDs, DataFrames are not schema-less: every column has a name and a type.

#### Pros of DataFrames vs RDDs ➕➖

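* \- they enforce a schema, so every row must fit the declared columns and types
* \+ you can run SQL queries against them
* \+ usually faster than RDDs, because Spark can optimize the query before running it
* \+ much more compact than RDDs when stored in Parquet format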

### Creation ✨

There are several ways of creating a Spark DataFrame. One is to build it from an RDD or from a pandas DataFrame; another is to create it directly from a `.csv` or `.parquet` file stored in a distributed file system. Parquet is a compressed, column-oriented file format: it can encode the values of each column as short references and store the actual values once in a lookup table, which makes the stored files much smaller.
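
To make the lookup-table idea concrete, here is a minimal plain-Python sketch of dictionary encoding on a single column. It only illustrates the principle; it is not how Parquet is actually implemented, and the function names and the `countries` column are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with an index into a table of distinct values."""
    lookup = []      # distinct values, in order of first appearance
    positions = {}   # value -> index in `lookup`
    codes = []       # one small integer per original value
    for value in column:
        if value not in positions:
            positions[value] = len(lookup)
            lookup.append(value)
        codes.append(positions[value])
    return lookup, codes


def dictionary_decode(lookup, codes):
    """Rebuild the original column from the lookup table and the codes."""
    return [lookup[code] for code in codes]


# Hypothetical column with many repeated values
countries = ["France", "Spain", "France", "France", "Spain", "Italy"]
lookup, codes = dictionary_encode(countries)
print(lookup)  # ['France', 'Spain', 'Italy']
print(codes)   # [0, 1, 0, 0, 1, 2]
```

The more often values repeat within a column, the more space this saves, which is one reason column-oriented storage compresses well.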

#### From an RDD 📃➡🗄️

Let's see how we can create a Spark DataFrame from an RDD.

In [None]:
sc = spark.sparkContext

In [None]:
numbers = list(range(10))
numbers_rdd = sc.parallelize(numbers)

In [None]:
# This will fail: Spark cannot infer a schema from bare integers, it needs an RDD of tuples (or a pandas DataFrame)
spark.createDataFrame(numbers_rdd)

We already know how to transform the values of an RDD: `.map(...)`. Let's use it to wrap each number in a one-element tuple.

In [None]:
df = spark.createDataFrame(numbers_rdd.map(lambda k: (k,)))
display(df)

_1
0
1
2
3
4
5
6
7
8
9


#### From a pandas DataFrame 🐼➡🗄️

In [None]:
import pandas as pd
import numpy as np
data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': np.nan, 'e': 3}  # np.nan marks a missing value
pandas_df = pd.DataFrame.from_dict(
    data_dict, orient='index', columns=['position'])
pandas_df

Unnamed: 0,position
a,1.0
b,2.0
c,3.0
d,
e,3.0


In [None]:
spark_df = spark.createDataFrame(pandas_df)
spark_df

In [None]:
display(spark_df)

position
1.0
2.0
3.0
""
3.0


## Running SQL queries against DataFrames 🗄️🔢

Spark lets you run classic SQL queries on your DataFrames. To do so, the DataFrame must first be registered under a table name. We will use the `.createOrReplaceTempView` Spark DataFrame method, which registers the DataFrame as a temporary view for the current Spark session (it does not copy or compute any data). We will then be able to run SQL queries against that name.

In [None]:
# Registers the Spark DataFrame as a temporary view named my_table, which we can now query!
spark_df.createOrReplaceTempView('my_table')

The `.sql` method lets you write queries in SQL while benefiting from the distributed computing advantages of Spark!

In [None]:
result = spark.sql("SELECT * FROM my_table WHERE position >= 2") # filters elements from my_table where position 
# is greater or equal to 2
display(result)

position
2.0
3.0
3.0


This will return a `DataFrame` but **will not compute until an action is called**. Even though the query we wrote is SQL, we use it through the Spark framework which is lazy!
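
As a rough plain-Python analogy of this laziness (not Spark's actual machinery), the sketch below records a filter without running it, and only evaluates it when an action-like method is called. `LazyTable`, `at_least_two` and `calls` are hypothetical names introduced for illustration.

```python
class LazyTable:
    """Toy stand-in for a lazy table: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # "Transformation": nothing is computed, we only remember the step.
        return LazyTable(self._rows, self._steps + (predicate,))

    def collect(self):
        # "Action": only now do the recorded steps actually run.
        result = self._rows
        for predicate in self._steps:
            result = [row for row in result if predicate(row)]
        return list(result)


calls = []  # lets us observe when the predicate really runs


def at_least_two(position):
    calls.append(position)
    return position >= 2


table = LazyTable([1.0, 2.0, 3.0, 3.0])
filtered = table.filter(at_least_two)
print(len(calls))          # 0: the filter has not run yet
print(filtered.collect())  # [2.0, 3.0, 3.0]
print(len(calls))          # 4: the predicate ran once per row, at collect time
```

Deferring the work this way is what gives an engine the chance to look at the whole chain of steps before executing any of them.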

### Select columns in Spark DataFrames ⬇️
There are three ways to refer to a column of a Spark DataFrame. Note that columns are objects in their own right: sometimes it is not enough to refer to them by name, and we need the `Column` object itself to prevent ambiguity and bugs.

In [None]:
# First way: refer to column by indexing
result["position"]

In [None]:
# Second way: refer to column like an attribute
result.position

In [None]:
# Third way: use pyspark sql (you'll learn a lot more about this in further lectures)
from pyspark.sql import functions as F
result.select(F.col("position"))  # F.col builds a Column from its name; it is resolved against the DataFrame it is used with

### Actions 🦸
All actions perform computations. Some, like `show` or `printSchema`, print out results without returning anything; others, like `count`, return a value.

#### `.show(...)`
Prints out the first 20 rows of the DataFrame.

In [None]:
spark_df.show()

The number of rows to print can be changed.

In [None]:
spark_df.show(2)

#### `.printSchema()`
Prints out the schema of the DataFrame.

In [None]:
spark_df.printSchema()

A schema is a description of the content of structured data: it is composed of column names and types. For example, the DataFrame above contains a single column called `position`, whose type is `double`, meaning a double-precision (64-bit) floating point number. Columns may have many other types, such as `int` or `string`, or even nested types such as arrays and maps (the Spark counterparts of lists and dictionaries). We will teach you more about schemas when we cover nested and flat schemas.
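
To illustrate what "column names and types" means in practice, here is a small plain-Python sketch that derives such a description from a list of rows and rejects a column whose values mix types. It is an analogy only: `infer_schema` is a hypothetical helper, and Spark's own schema inference is more elaborate.

```python
def infer_schema(rows):
    """Return {column name: type name}; reject columns with mixed types."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # a missing value fits any column type
            type_name = type(value).__name__
            # setdefault stores the first type seen, then returns the stored one
            if schema.setdefault(name, type_name) != type_name:
                raise TypeError(
                    f"column {name!r} mixes {schema[name]} and {type_name}"
                )
    return schema


rows = [{"position": 1.0}, {"position": 2.0}, {"position": None}]
print(infer_schema(rows))  # {'position': 'float'}
```

Passing a row such as `{"position": "jedha"}` alongside the numeric ones raises a `TypeError`, which mirrors the idea that a DataFrame column holds values of a single type.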

In [None]:
spark_df.columns  # not an `action` (nor a transformation)

#### `.take(...)`
Computes and returns the first `n` rows of the DataFrame.

In [None]:
spark_df.take(2)

As you can see, a PySpark `DataFrame` is a collection of [`Row`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.Row) objects.

#### `.collect(...)`
Like `.take(...)`, but returns all the rows of the DataFrame.

In [None]:
spark_df.collect()

---
⚠️ `.collect()` brings every row back to the driver. Do **NOT** perform this action on a full DataFrame, only on small DataFrames like aggregated results.

---

#### `.count(...)`
Returns the number of `Rows` in the DataFrame

In [None]:
spark_df.count()

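#### `.describe()`
Computes basic statistics (count, mean, standard deviation, min and max) of the columns. It returns a new DataFrame, so chain `.show()` to print the result.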

In [None]:
spark_df.describe()

#### Databricks' `display(...)`
For when `.show()` won't cut it...


In [None]:
display(spark_df)

position
1.0
2.0
3.0
""
3.0


It is slow and won't work everywhere (it is specific to Databricks notebooks), BUT it lets you access the built-in visualization interface: all you have to do is click on the bar chart button to start visualizing, then use the plot options button that just appeared to refine your viz!

#### Alternative: converting to pandas with `.toPandas()`
`.toPandas()` is an action: it computes the DataFrame and loads the whole result into the driver's memory.  
Hence, do **NOT** forget to `limit` or you'll blow up the memory (unless the DataFrame is small, like the result of an aggregate).

In [None]:
spark_df.limit(5).toPandas()

Unnamed: 0,position
0,1.0
1,2.0
2,3.0
3,
4,3.0


#### `.write`

To save a DataFrame to storage (for example to a location in S3), use its `.write` attribute. It is not a method itself: it returns a `DataFrameWriter` whose methods do the actual saving.

In [None]:
# Save the DataFrame in Spark's default format (Parquet, unless configured otherwise).
# mode="overwrite" will erase any file that already occupies the destination path.
spark_df.write.save("path", mode="overwrite")

# If you wish to save the file explicitly in compressed Parquet format, use
spark_df.write.parquet("path", mode="overwrite")

# To load a Parquet file as a DataFrame, use
spark_df = spark.read.parquet("path")

### Transformations 🧙
Let's study some of the transformations available on Spark DataFrames. More exhaustive content can be found at the following link:
- [PySpark Doc](https://spark.apache.org/docs/2.1.0/sql-programming-guide.html)

#### `.na` for missing values
`.na` is an attribute of Spark DataFrames that gives access to functions for handling missing values, such as replacing or dropping them.

In [None]:
spark_df.na

#### `.fill(...)`

In [None]:
spark_df.na.fill(0).show() # this will replace the missing value with 0.0

#### `.drop()`

In [None]:
spark_df.na.drop().show() # this will drop the rows containing missing values

Equivalent to `.dropna()`

In [None]:
spark_df.dropna(subset=['position']).show() # this will also drop the rows with missing values

The optional `subset` parameter restricts the check to the listed columns. This is useful if you only wish to drop rows with missing values in specific columns.
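
Since our example DataFrame has a single column, the effect of `subset` is hard to see. The plain-Python sketch below mimics the behaviour on a hypothetical two-column table (the `label` column and the `drop_missing` helper are made up for illustration, and `None` stands in for a missing value).

```python
def drop_missing(rows, subset=None):
    """Keep only the rows with no None in the checked columns."""
    kept = []
    for row in rows:
        # Without a subset, every column of the row is checked.
        columns = subset if subset is not None else row.keys()
        if all(row[name] is not None for name in columns):
            kept.append(row)
    return kept


rows = [
    {"position": 1.0, "label": "a"},
    {"position": None, "label": "d"},
    {"position": 3.0, "label": None},
]
print(len(drop_missing(rows)))                       # 1: only the first row is complete
print(len(drop_missing(rows, subset=["position"])))  # 2: a missing label is tolerated
```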

#### `.isNull()`
Another way to detect missing values in specific columns

In [None]:
spark_df.select(spark_df["position"].isNull()).show()

#### `.replace(pattern, value)`
This method will replace every data point equal to `pattern` with `value`.

In [None]:
spark_df.replace(2, 4).show() 

Be careful, however: you cannot replace a value with one that conflicts with the schema. For example, it is not possible to replace the value 2 in a `double` type column with a character string.

In [None]:
# This will fail: the replacement must have the same type as the value it replaces
spark_df.replace(2, "jedha").show()

#### `.sql()` 
We can run SQL queries against a registered view

In [None]:
spark.sql("SELECT * FROM my_table LIMIT 5").show()

Multi-line statements need the use of triple quotes `"""`

In [None]:
spark.sql("""
    SELECT position
    FROM my_table
    LIMIT 5
""").show()

That's convenient, but we can use the PySpark DataFrame API to perform the same operations. The main practical difference is that the DataFrame API works directly on the DataFrame object, so you no longer need to register a temporary view before running queries.

#### `.select()`
The `select` method works similarly to the `SELECT` statement in SQL: it lets you access columns of your DataFrame by name.

In [None]:
spark_df.printSchema()

In [None]:
spark_df.select('position')

Similar to `spark.sql("SELECT position FROM my_table")`.  
To claim equivalence, we would have to check the execution plan of both (which is beyond the content of this course).

#### `.alias(...)`

In [None]:
spark_df.select(spark_df['position'].alias('aliased_column')).show()

In [None]:
# This will fail: a plain string has no `.alias` method, a Column object is required. This is where referring to columns by name hurts
spark_df.select('position'.alias('aliased_column')).show()

#### `.drop(...)`
This method lets you remove columns from the DataFrame.

In [None]:
spark_df.drop('position')

In [None]:
spark_df.drop('position').show()

#### `.limit(num)`
Like SQL's `LIMIT`.  
Limits the DataFrame to `num` rows.

In [None]:
spark_df.limit(5).show()

#### `.filter(...)`
It works just like the `WHERE` clause in SQL: it keeps only the rows of the DataFrame that satisfy the condition passed to the filter method.

In [None]:
spark_df.filter(spark_df['position'] < 3)

In [None]:
spark_df.filter(spark_df.position < 3).show()

--- 
> 💡 We can even mix both the SQL and DataFrame APIs: `.filter` also accepts a SQL expression string, such as `spark_df.filter("position < 3")`.

---

#### `.selectExpr(...)`
This method lets you select columns in a DataFrame using SQL expressions, without having to register a temporary view of the table.

In [None]:
spark_df.limit(5).selectExpr("position * 2", "abs(position)").show()

#### `.dropDuplicates(...)`
As its name suggests, this method will drop rows that are identical to other rows in the DataFrame, returning a DataFrame where all rows are different.

In [None]:
spark_df.dropDuplicates().show()

#### `.distinct()`
This method returns the distinct rows of the DataFrame. A row holding a missing value counts as a row like any other, so it is kept once.

In [None]:
spark_df.distinct().show()

#### `.orderBy(...)`
Alias of `.sort(...)`.  
Sorts the DataFrame according to one or more columns.

In [None]:
spark_df.orderBy('position').show()

We can call `.desc()` to get a descending order, but that means we need an actual `Column` object to call it on.

In [None]:
# This will fail: a plain string has no `.desc()` method
spark_df.orderBy(('position').desc()).show()

In [None]:
# This won't
spark_df.orderBy(spark_df['position'].desc()).show()

Knowing when you need a `Column` object rather than a column name is actually one of the keys to SparkSQL fluency, but it requires some practice.
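
The toy class below is a plain-Python sketch of why a `Column` object can do things a string cannot: its methods and operators return new expressions instead of data. It is a simplified, hypothetical stand-in written for illustration, not PySpark's real `Column` class.

```python
class Column:
    """Toy column expression: methods return new expressions, not data."""

    def __init__(self, expression):
        self.expression = expression

    def alias(self, name):
        return Column(f"{self.expression} AS {name}")

    def desc(self):
        return Column(f"{self.expression} DESC")

    def __lt__(self, other):
        # Overloading `<` is what makes `column < 3` build a condition
        # instead of comparing right away.
        return Column(f"({self.expression} < {other!r})")


print(hasattr("position", "desc"))           # False: a str knows nothing about sorting
print(Column("position").desc().expression)  # position DESC
print((Column("position") < 3).expression)   # (position < 3)
```

This is the same reason `'position'.alias(...)` and `('position').desc()` fail: Python looks those methods up on `str`, before Spark is involved at all.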

---

⭐️ No worries, we will review all this later.

---

#### `.groupBy(...)`

It is possible to group your data according to the values in a certain column and then aggregate it. We will learn more about data aggregation in a later lecture; this is just a brief introduction.

In [None]:
spark_df.groupBy('position') 

Returns a `GroupedData` object. We need to apply an aggregation to it before we can see anything.

In [None]:
# This won't work: a `GroupedData` object has no `.show()` method
spark_df.groupBy('position').show()

In [None]:
# An aggregation instead: this one works and returns a DataFrame
spark_df.groupBy('position').count()

⚠️ When applied to a DataFrame, `.count()` is an action. In this case we apply it to an object of type `GroupedData`, where it is an aggregation that returns a `DataFrame`, i.e. it is still waiting for an action.
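
The two-step nature of "group, then aggregate" can be sketched in plain Python as follows. This is only an analogy of what the grouped count computes; `group_by` and `count_groups` are hypothetical helpers, and the rows reuse the non-missing values of our example column.

```python
from collections import defaultdict


def group_by(rows, key):
    """Step 1: bucket the rows by the value found in column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def count_groups(groups, key):
    """Step 2: aggregate each bucket into one output row."""
    return [{key: value, "count": len(members)} for value, members in groups.items()]


rows = [{"position": p} for p in (1.0, 2.0, 3.0, 3.0)]
grouped = group_by(rows, "position")         # buckets only, nothing to display yet
counts = count_groups(grouped, "position")   # a table again: one row per group
print(counts)
# [{'position': 1.0, 'count': 1}, {'position': 2.0, 'count': 1}, {'position': 3.0, 'count': 2}]
```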

In [None]:
spark_df.groupBy('position').count().show()

### Adding columns ➕
Using pure select is possible, but can feel tedious

In [None]:
spark_df.select('*', (spark_df.position*2).alias('newColumn')).show()

#### `.withColumn(...)`
It's usually easier to use `.withColumn` for the same effect.

In [None]:
spark_df.withColumn('newColumn', 2*spark_df['position']).show()

#### `withColumnRenamed(...)`
Changes the name of a given column.

In [None]:
spark_df.withColumnRenamed('position', 'newName').show()

#### Chaining everything together ⛓️

In [None]:
spark_df \
    .filter(spark_df.position < 2) \
    .groupBy('position') \
    .count() \
    .orderBy('count') \
    .limit(5) \
    .show()

## Some differences with pandas' DataFrames 🐼🆚🗄️

- Accessor: `df.features`, `df['features']` vs `df.select('features')` -> more later
- Also, most (if not all) transformations in PySpark are not `inplace`

In [None]:
spark_df.position 

In [None]:
spark_df['position']

When the column name is passed to `.select`, however, Spark can resolve it against the DataFrame's schema, just like SQL does with a table, so this will work:

In [None]:
spark_df.select('position')

It is recommended to always access columns through `Column` objects in select statements rather than through plain string names: names alone stop working as soon as you need column methods or expressions in more advanced queries, whereas using the `Column` object removes any ambiguity.

In [None]:
spark_df.select('position').show()

The two statements at the start of this section, `spark_df.position` and `spark_df['position']`, return `Column` objects.
They are not very useful by themselves, but they can be passed to a `.select(...)`, and then let you run more advanced operations on columns.

In [None]:
spark_df.select(spark_df.position).show()

In [None]:
spark_df.select(spark_df['position']).show()

## Resources 📚📚
- The part about DataFrames in [Mastering Spark SQL](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataFrame.html) (Scala based)
- [Learning Apache Spark with PySpark & Databricks](https://hackersandslackers.com/learning-to-use-apache-spark-pyspark/)