### Library Imports

In [12]:
from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

### Template

In [14]:
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 4 - More Comfortable with SQL?")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

df = spark.read.csv(path, header=True)
df.toPandas()

| | id | species_id | name | birthday | color |
|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | |


### Transformation

In [22]:
(
    df
    .withColumn('birthday_date', F.col('birthday').cast('date'))
    .withColumn('birthday_date_2', df['birthday'].cast('date'))
    .withColumn('owned_by', F.lit('me'))
    .withColumnRenamed('id', 'pet_id')
    .where(F.col('birthday_date') > datetime(2015,1,1))
).toPandas()

| | pet_id | species_id | name | birthday | color | birthday_date | birthday_date_2 | owned_by |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | Argus | 2016-11-22 10:05:10 | | 2016-11-22 | 2016-11-22 | me |


### What Happened?
The small transformation above uses some of the most frequently used functions. Let's dig into what each one of them does.

1. `F.col(col_name)`  
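`F.col` builds a reference to a `column` by name, and Spark resolves that name against whichever `df` the expression is applied to. Because the reference carries no link to a particular `df`, you cannot use it to single out a `column` that belongs to a different `df`.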

**Alternatively...**  
You can also reference a `column` with `df[col_name]`, but this gets unwieldy as the `df` variable name grows longer. Reserve this form for cases where you need to reference a `column` that is not in this `df`.

2. `df.withColumn(colName, col)`  
This function lets you define a new column for your `df`, using either a `literal` value (explained later) such as `F.lit('me')` or an expression built from columns that already exist in the `df`, such as `F.col('birthday').cast('date')`.


3. `df.withColumnRenamed(old_col_name, new_col_name)`  
This function lets you rename an existing column. The old column name will no longer appear in your `df`.

**Note:** the argument order differs from `withColumn`: here `new_col_name` is the second argument, whereas `withColumn` takes the new column name first.

4. `df.where(condition)`  
This function filters the rows of the `df`, keeping only those that satisfy the condition passed to it.

**Note:** this function is an alias for `df.filter(condition)`, which does exactly the same thing. `where` is a bit more intuitive because it mirrors the SQL `WHERE` clause.

### Helper Functions

In [43]:
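def with_named_columns(df, kwargs):
    """Add one column per entry of `kwargs`, a dict of {new_col_name: expression}."""
    for col_name, exp in kwargs.items():
        df = df.withColumn(col_name, exp)
    return df

def with_renamed_columns(df, kwargs):
    """Rename columns using `kwargs`, a dict of {old_col_name: new_col_name}."""
    for old_col_name, new_col_name in kwargs.items():
        df = df.withColumnRenamed(old_col_name, new_col_name)
    return df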

In [39]:
with_named_columns(df, {
    'birthday_date':   F.col('birthday').cast('date'),
    'birthday_date_2': F.col('birthday').cast('date'),
}).toPandas()

| | id | species_id | name | birthday | color | birthday_date_2 | birthday_date |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | brown | 2014-11-22 | 2014-11-22 |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | | 2016-11-22 | 2016-11-22 |


In [44]:
with_renamed_columns(df, {
    'id':   'pet_id',
    'name': 'pet_name',
}).toPandas()

| | pet_id | species_id | pet_name | birthday | color |
|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | |


### What Happened?
At Shopify we created similar wrapper functions around `withColumn` and `withColumnRenamed` to provide a cleaner, more readable API to work with. #syntax-sugar
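The wrappers are a fold over a dictionary: each entry applies one more call to the result of the previous one. As a minimal sketch of that idea, the same thing can be written with `functools.reduce`. The names `with_named_columns_reduce` and `with_renamed_columns_reduce` are hypothetical, and this is one possible formulation, not the Shopify implementation. The functions rely only on the `withColumn` and `withColumnRenamed` methods of whatever object is passed in.

```python
from functools import reduce


def with_named_columns_reduce(df, columns):
    """Add one column per (new_col_name, expression) pair in `columns`.

    Hypothetical variant of with_named_columns: folds withColumn over the
    dict, starting from `df` and threading the result through each call.
    """
    return reduce(
        lambda acc, item: acc.withColumn(item[0], item[1]),
        columns.items(),
        df,
    )


def with_renamed_columns_reduce(df, renames):
    """Rename columns using (old_col_name, new_col_name) pairs in `renames`.

    Hypothetical variant of with_renamed_columns, built the same way.
    """
    return reduce(
        lambda acc, item: acc.withColumnRenamed(item[0], item[1]),
        renames.items(),
        df,
    )
```

They are called exactly like the loop-based versions, for example `with_renamed_columns_reduce(df, {'id': 'pet_id', 'name': 'pet_name'})`. Whether the loop or the fold reads better is a matter of taste; the loop is arguably easier to step through in a debugger.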

### Conclusion

We learnt some of the basic yet very frequently used Spark functions, and we built wrapper functions around two of them to make our code more readable.