# Intro to Pyspark

Welcome to the lesson on working with pyspark.

## Importing libraries
First we have to install pyspark.



In [None]:
!pip install pyspark

Then we import pyspark.

In [None]:
import pyspark

## Creating a Spark Session

In order to actually use spark you need to create something called a `SparkSession`.  Your spark session creates the context and configuration that spark runs on and gives you access to all of its functionality (e.g. making dataframes, running SQL queries, etc.).

If you're going to work locally or on any other configuration you need to create your session.  You can also set custom configuration options here.  For example, if you wanted to work locally and use four cores of your CPU (assuming you have four), you could call `.master("local[4]")` on the builder.

For now, we'll just call `SparkSession.builder` and then `getOrCreate()` to tell spark to build up a session for us to use.  It'll automatically default to standard options based on the configuration of the machine.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("intro_pyspark") \
    .getOrCreate()

from pyspark.sql.functions import *

### Importing data

To get files into your environment and accessible to spark you need to provide a path to where the file is located in this environment.  To do this you add the file to your spark environment with `sparkContext.addFile()` and feed it the URL inside the `addFile()` function.  

We'll need to import the `SparkFiles` class from pyspark as well.

In [None]:
from pyspark import SparkFiles
url = 'http://131.193.32.85:9000/mybucket/covid_daily_cases.csv'
spark.sparkContext.addFile(url)

Once that's done, you can read it in using the local filepath.

Using `SparkFiles` and `.get()` we can get the filepath to our data, called 'covid_daily_cases.csv'.

In [None]:
SparkFiles.get("covid_daily_cases.csv")

But we need to provide the full path, so we must add 'file://' before that path to the file.

In [None]:
fp = 'file://'+SparkFiles.get("covid_daily_cases.csv")
fp

Great!  We have a filepath to our data.  Reading data through pyspark is similar to pandas.  We first call our spark session, then its `.read` attribute, followed by the file type method.  In this case that's `.csv()`.  Inside `.csv()` you feed it the filepath.

In [None]:
covid = spark.read.csv(fp)

We can take a quick look at our data by calling the `.show()` method on our covid data.  `.show()` kicks back a neatly formatted view of your data.

In [None]:
# Use .show() to look at your data
covid.show()

### Modifying your import

Spark has some clear differences in how it imports data.  The big one you can probably see from our `.show()` is that unlike pandas, pyspark doesn't by default import the first row of your dataframe as the column names.  

You can modify that behavior along with a host of other import behaviors by adding in a `.option()` call.  Inside you can feed it the specific option you want to modify and its value.  In this case we'll specify `'header', True`.

In [None]:
covid = (spark.read
         .option('header', True)
         .csv(fp)
        )


There are other options such as specifying the delimiter.  Obviously we're using a comma, but you might have a file that needs to be separated by a different symbol.

In [None]:
covid = (spark.read
         .option('header', True)
         .option('delimiter', ',')
         .csv(fp)
        )

## Printing and altering schema

Pyspark readily works with unstructured data.  We're going to hold off on that, but one thing that's worth noting here is that you can apply a schema to your data on import.  This is important for working with nested data.  **But**, it's also important because pyspark doesn't infer datatypes by default when importing a CSV.  This means that the default read is to have everything be strings.  Let's take a look at the schema using the `printSchema()` method.

In [None]:
# use printSchema()
covid.printSchema()

So above we can see that all our columns are at the same level.  But also that they're all strings and null values are allowed.  But should they all be strings?  Of course not!  Many of the columns are just numeric.  We can give our read another option to infer what the schema and datatypes should be.  Just toss in `.option('inferSchema', True)`

In [None]:
covid = (spark.read
         .option('header', True)
         .option('delimiter', ',')
         .option('inferSchema', True)
         .csv(fp)
        )

Looking at our schema below we can see that now the numeric columns are all considered as integers.  We'll convert those next.

In [None]:
covid.printSchema()

## Datatype conversions in pyspark

Datatype conversions in spark are conceptually straightforward, but the syntax is a bit different.  In pandas you'd use a format as follows to convert a column to an integer:
```
df['new_column'] = df['column_to_convert'].astype(int)
```

But in pyspark it differs a bit.  Specifically, you'll use the `.withColumn()` function to apply a datatype to an existing column.  The generalized format is:
```
df = df.withColumn('new_column', col('column_to_convert').cast(dataType()))
```

The first argument within `withColumn()` is just the column name you want to assign the modified column to.  This could be something new, or it could overwrite the column you modified.  For example, if you just wanted to change the datatype of a column you'd overwrite it.

The second argument then selects a column.  You select columns a bit differently in pyspark, using `col()` with the column name inside.  `.cast()` then, well, casts that column as a new datatype, which you specify inside `.cast()`.

A third thing to note: the datatypes are classes that you have to import from `pyspark.sql.types`, such as `IntegerType`, `StringType`, etc.

The last thing to note: in pandas you'd create a new column by calling `df['new_name']` on the left side of the equals.  When using pyspark's `withColumn()` function you only need to put the dataframe to the left of the equals.  `withColumn()` returns a new dataframe with the column added or replaced, so assigning it back to the same name keeps all of your other columns.

Let's go convert 'county_code' to a string and then make a new column called 'date_dt' that's a datetime version of our 'date' column.

In [None]:
# Import our needed functions
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType, DateType, TimestampType
# Also need to modify a setting to deal with a later date conversion - don't need to know this!
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")


Alright, so we'll start by converting the 'county_code' column back to a string.  

The code below essentially says "overwrite the column called 'county_code' with the existing column 'county_code' cast as a string datatype".

In [None]:
# Convert
covid = covid.withColumn('county_code', col('county_code').cast(StringType()))

In [None]:
# Check schema to see datatypes
covid.printSchema()

Great, now let's make a new column called 'date_dt' that converts our date column from a string to a date.  

The syntax is slightly different in that you call `col()` inside a `to_date()` function, followed by the format your dates are written in.  Our dates are like 01/28/2020, so the format is 'MM/dd/yyyy'.  The case matters: uppercase 'MM' means month, while lowercase 'mm' means minutes.

In [None]:
covid = covid.withColumn('date_dt', to_date(col('date'), 'MM/dd/yyyy'))

In [None]:
# check data
covid.show()

## Other transforms in pyspark

Let's just look at some quick transforms.  

* We'll make a new column of the ratio of covid deaths to cases.  This again just uses `withColumn()`.
* We'll do a groupby so you can see how similar that is!

We'll start by making a column of the ratio of deaths to total cases, so dividing the number of deaths by total cases.  Let's assign to a new column called 'case_death_ratio'.

Again, the first argument in `withColumn` is the column name of the result, second is just the math!

In [None]:
# Make 'case_death_ratio' column
covid = covid.withColumn('case_death_ratio', col('daily_deaths')/col('daily_cases') )

In [None]:
covid.show()

Groupby operations work *very* similarly to pandas.  The format is still `df.groupBy('column_to_group').agg(math)`.  Couple things to note:
* pyspark uses camelCase, so it's `groupBy` not `groupby`
* `.agg()` works a bit differently.  Instead of a dictionary of the column you want to do math on as the key and math function as the value, you instead use a format like `.agg(math_function('column_to_do_math_on'))`

Let's go and sum up the total number of deaths by state and county.

In [None]:
# Make covid_grouped where we group by state and county and then sum up the number of deaths
covid_grouped = covid.groupBy('state', 'county').agg(sum('daily_deaths'))

In [None]:
# Check
covid_grouped.show()

## Wrapping up

This lesson was meant to be a short primer to pyspark and obviously not an exhaustive overview.  Hopefully you can see how the syntax makes it easy to do many of the same things that you would in pandas.