# Handling Missing Data in PySpark HW Solutions

In this homework assignment you will strengthen your skills in dealing with missing data.

**Review:** you have two basic options for handling missing data (you will have to decide which approach is right for your data):

1. Drop the missing data points (including the entire row)
2. Fill them in with some other value.

Let's practice some examples of each of these methods!


#### But first!

Start your Spark session

In [None]:
!pip install pyspark
import pyspark 
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nulls").getOrCreate()

spark


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = 'drive/MyDrive/5. Spark/spark-scripts/section2/Datasets/'

## Read in the dataset for this Notebook

The dataset is Weather.csv, attached to this lecture.

In [None]:
df = spark.read.csv(path+'Weather.csv', inferSchema = True, header = True)

## About this dataset

**New York City Taxi Trip - Hourly Weather Data**

Here is some detailed weather data for the New York City Taxi Trips.

**Source:** https://www.kaggle.com/meinertsen/new-york-city-taxi-trip-hourly-weather-data

### Print a view of the first several lines of the dataframe to see what our data looks like

In [None]:
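import pandas as pd
# Show all columns and rows, untruncated, in pandas output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
# Preview the first five rows of the dataframe
df.limit(5).toPandas()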

### Print the schema 

So that we can see if we need to make any corrections to the data types.

In [None]:
df.printSchema()

root
 |-- pickup_datetime: timestamp (nullable = true)
 |-- tempm: double (nullable = true)
 |-- tempi: double (nullable = true)
 |-- dewptm: double (nullable = true)
 |-- dewpti: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- wspdm: double (nullable = true)
 |-- wspdi: double (nullable = true)
 |-- wgustm: double (nullable = true)
 |-- wgusti: double (nullable = true)
 |-- wdird: integer (nullable = true)
 |-- wdire: string (nullable = true)
 |-- vism: double (nullable = true)
 |-- visi: double (nullable = true)
 |-- pressurem: double (nullable = true)
 |-- pressurei: double (nullable = true)
 |-- windchillm: double (nullable = true)
 |-- windchilli: double (nullable = true)
 |-- heatindexm: double (nullable = true)
 |-- heatindexi: double (nullable = true)
 |-- precipm: double (nullable = true)
 |-- precipi: double (nullable = true)
 |-- conds: string (nullable = true)
 |-- icon: string (nullable = true)
 |-- fog: integer (nullable = true)
 |-- rain: integer (nullab

## 1. How much missing data are we working with?

Get the count and percentage of null values for each variable in the dataset to answer this question.

In [None]:
from pyspark.sql.functions import avg, col  # avg is used again in section 8

total_rows = df.count()  # count the rows once instead of once per column
analysis = []
for column in df.columns:
    null_count = df.where(col(column).isNull()).count()
    if null_count == 0:
        continue  # skip columns with no missing values
    # The last field is a fraction between 0 and 1 (0.82 means 82%)
    analysis.append((column, null_count, null_count / total_rows))

null_analysis_df = spark.createDataFrame(analysis, ['Column name', 'Number of nulls', 'percentage of nulls'])
null_analysis_df.show()

+-----------+---------------+--------------------+
|Column name|Number of nulls| percentage of nulls|
+-----------+---------------+--------------------+
|      tempm|              5|4.770537162484495...|
|      tempi|              5|4.770537162484495...|
|     dewptm|              5|4.770537162484495...|
|     dewpti|              5|4.770537162484495...|
|        hum|              5|4.770537162484495...|
|      wspdm|            737| 0.07031771777502147|
|      wspdi|            737| 0.07031771777502147|
|     wgustm|           8605|  0.8210094456635817|
|     wgusti|           8605|  0.8210094456635817|
|       vism|            245| 0.02337563209617403|
|       visi|            245| 0.02337563209617403|
|  pressurem|            239| 0.02280316763667589|
|  pressurei|            239| 0.02280316763667589|
| windchillm|           7775|  0.7418185287663391|
| windchilli|           7775|  0.7418185287663391|
| heatindexm|           9644|  0.9201412079000095|
| heatindexi|           9644|  
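
The analysis above amounts to counting nulls per column and dividing by the row count. As a rough plain-Python model of that calculation (not PySpark itself; the `rows` sample and the `null_summary` helper are made up for illustration), note that the last value is a fraction between 0 and 1, so 0.82 in the table means 82%:

```python
def null_summary(rows, columns):
    """Return (column, null_count, null_fraction) for columns with at least one null."""
    total = len(rows)
    summary = []
    for column in columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:  # skip columns with no missing values, as the notebook does
            summary.append((column, nulls, nulls / total))
    return summary

# Hypothetical sample rows; None stands in for a Spark null
rows = [
    {"tempm": 7.8, "wgustm": None},
    {"tempm": None, "wgustm": None},
    {"tempm": 7.2, "wgustm": 25.9},
    {"tempm": 6.1, "wgustm": None},
]
summary = null_summary(rows, ["tempm", "wgustm"])
for column, nulls, fraction in summary:
    print(f"{column}: {nulls} nulls ({fraction:.0%})")
```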

## 2. How many rows contain at least one null value?

We want to know how many rows we would lose if we used the df.na.drop option.

In [None]:
before = df.count()
after = df.na.drop(thresh = len(df.columns)).count()
print("Number of rows containing at least one null value: ", before - after)

Number of rows containing at least one null value:  10481
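
The cell above finds the answer indirectly: it keeps only rows whose non-null count reaches the number of columns, then subtracts. A small plain-Python sketch (hypothetical sample tuples, with None standing in for a Spark null) suggests why that matches a direct count of rows containing any null:

```python
# Hypothetical three-column sample
rows = [
    (7.8, 46.0, None),
    (7.2, 45.0, 5.6),
    (None, None, None),
]
n_columns = 3

# Direct count: rows where any value is missing
rows_with_null = sum(1 for row in rows if any(v is None for v in row))

# Threshold route: keep only fully populated rows, then subtract from the total
complete_rows = sum(1 for row in rows if sum(v is not None for v in row) >= n_columns)
via_threshold = len(rows) - complete_rows

print(rows_with_null, via_threshold)
```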


## 3. Drop the missing data

Drop any row that contains missing data across the whole dataset

In [None]:
before = df.count()
new_df = df.na.drop(how = 'any')
print('Number of rows in original data: ',df.count())
print('Dropped null rows: ',before - new_df.count())

Number of rows in original data:  10481
Dropped null rows:  10481


## 4. Drop with a threshold

Count how many rows would be dropped if we kept only rows that have at least 12 non-null values (that is, dropped every row with fewer than 12).

In [None]:
new_df = df.na.drop(thresh = 12)                 
print('After dropping rows with less than 12 non-null values')
print('Number of rows in original data: ',df.count())
print('Dropped null rows: ',df.count()-new_df.count())

After dropping rows with less than 12 non-null values
Number of rows in original data:  10481
Dropped null rows:  5
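
The direction of the threshold is easy to get backwards: `thresh` is the minimum number of non-null values a row needs in order to be kept. A minimal plain-Python model of that rule, assuming None is the only missing marker (the `keep_row` helper and the sample rows are hypothetical):

```python
def keep_row(row, thresh):
    """Keep a row only if it has at least `thresh` non-null values."""
    non_null = sum(1 for value in row if value is not None)
    return non_null >= thresh

rows = [
    (7.8, 46.0, 6.1),    # 3 non-null values
    (7.2, None, 5.6),    # 2 non-null values
    (None, None, 0.5),   # 1 non-null value
]
kept = [row for row in rows if keep_row(row, thresh=2)]
dropped = len(rows) - len(kept)
print("Dropped rows:", dropped)
```

Raising `thresh` makes the filter stricter, so more rows are dropped.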


## 5. Drop rows according to specific column value

Now count how many rows would be dropped if you only drop rows whose values in the tempm column are null/NaN

In [None]:
new_df = df.na.drop(subset = ['tempm'])
print('Number of rows in original data: ',df.count())
print('Dropped null rows: ',df.count()-new_df.count())

Number of rows in original data:  10481
Dropped null rows:  5


## 6. Drop rows that are null across all columns

Count how many rows would be dropped if you only dropped rows where ALL the values are null

In [None]:
new_df = df.na.drop(how = 'all')
print('Number of rows in original data: ',df.count())
print('Dropped null rows: ',df.count()-new_df.count())

Number of rows in original data:  10481
Dropped null rows:  0
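
The `how` and `subset` options used in the drop exercises above can be modeled in a few lines of plain Python. This is only a sketch of the decision rule under the assumption that None is the sole missing marker; `should_drop` and the sample rows are hypothetical:

```python
def should_drop(row, how="any", subset=None):
    """Decide whether a row is dropped, looking only at `subset` columns if given."""
    columns = subset if subset is not None else list(row)
    missing = [row[c] is None for c in columns]
    return any(missing) if how == "any" else all(missing)

rows = [
    {"tempm": 7.8, "wgustm": None},
    {"tempm": None, "wgustm": None},
    {"tempm": 7.2, "wgustm": 25.9},
]
dropped_any = sum(should_drop(r, how="any") for r in rows)
dropped_all = sum(should_drop(r, how="all") for r in rows)
dropped_tempm = sum(should_drop(r, subset=["tempm"]) for r in rows)
print(dropped_any, dropped_all, dropped_tempm)
```

This is consistent with the results above: when every row has a null in at least one column, 'any' removes the whole dataset, while 'all' removes nothing unless a row is entirely empty.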


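## 7. Fill in the missing values of all string columns with "N/A"

Make sure you don't overwrite the df dataframe itself. Spark dataframes are immutable, so df.na.fill returns a new dataframe; assign the result to a new variable and work with that one.

In [None]:
# na.fill returns a new dataframe, so df itself is left unchanged
data = df.na.fill('N/A')
data.limit(5).toPandas()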

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,40,NE,4.0,2.5,1018.2,30.07,6.6,43.9,,,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,0,Variable,6.4,4.0,1017.8,30.06,6.6,43.9,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,20,NNE,8.0,5.0,1017.0,30.04,7.1,44.8,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,0,Variable,12.9,8.0,1016.5,30.02,5.9,42.6,,,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,0,North,12.9,8.0,1016.7,30.03,,,,,,,Overcast,cloudy,0,0,0,0,0,0
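
The numeric columns in the output above still show missing values because a string fill value is applied only to string columns. A plain-Python sketch of that type-matching behavior (the `fill_matching` helper, the `schema` mapping, and the sample rows are hypothetical, and this is a model rather than Spark's implementation):

```python
def fill_matching(rows, schema, value):
    """Replace None only in columns whose declared type matches the fill value's type."""
    targets = {name for name, col_type in schema.items() if isinstance(value, col_type)}
    return [
        {name: (value if name in targets and cell is None else cell)
         for name, cell in row.items()}
        for row in rows
    ]

schema = {"wdire": str, "wgustm": float}
rows = [
    {"wdire": None, "wgustm": None},
    {"wdire": "NE", "wgustm": 25.9},
]
filled = fill_matching(rows, schema, "N/A")
print(filled[0])
```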


## 8. Fill in NaN values with averages for the tempm and tempi columns

*Note: you will first need to compute the averages for each column and then fill in with the corresponding value.*

In [None]:
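from pyspark.sql.functions import avg

def fill_with_mean(df, include=()):
    # Compute the average of each requested column in a single aggregation
    stats = df.agg(*(avg(c).alias(c) for c in df.columns if c in include))
    print(stats)  # prints only the schema of the one-row stats dataframe, not the values
    # Fill nulls column by column, using {column name: mean} as the mapping
    return df.na.fill(stats.first().asDict())

data = fill_with_mean(df, ['tempm', 'tempi'])
data.limit(5).toPandas()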

DataFrame[tempm: double, tempi: double]


Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,wdird,wdire,vism,visi,pressurem,pressurei,windchillm,windchilli,heatindexm,heatindexi,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,40,NE,4.0,2.5,1018.2,30.07,6.6,43.9,,,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,0,Variable,6.4,4.0,1017.8,30.06,6.6,43.9,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,20,NNE,8.0,5.0,1017.0,30.04,7.1,44.8,,,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,0,Variable,12.9,8.0,1016.5,30.02,5.9,42.6,,,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,0,North,12.9,8.0,1016.7,30.03,,,,,,,Overcast,cloudy,0,0,0,0,0,0
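
The two steps in the note above (compute each column's average, then fill with it) can be sketched in plain Python. The average skips missing values, as Spark's avg does, and only the named columns are filled; `fill_with_column_mean` and the sample rows are hypothetical:

```python
def fill_with_column_mean(rows, columns):
    """Fill None in the given columns with that column's mean over non-null values."""
    means = {}
    for column in columns:
        present = [row[column] for row in rows if row[column] is not None]
        if present:  # a column with no values at all has no mean to fill with
            means[column] = sum(present) / len(present)
    filled = [
        {name: (means[name] if cell is None and name in means else cell)
         for name, cell in row.items()}
        for row in rows
    ]
    return filled, means

rows = [
    {"tempm": 8.0, "wgustm": None},
    {"tempm": None, "wgustm": None},
    {"tempm": 6.0, "wgustm": 25.0},
]
filled, means = fill_with_column_mean(rows, ["tempm"])
print(means, filled[1])
```

Columns that are not listed, such as wgustm here, keep their missing values, which matches the wgustm column in the output above.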


### That's it! Great Job!