### MAST30034: Applied Data Science Project 1
---
# Preprocessing Part 2: Aggregating Data by MMWR Week
#### Xavier Travers (1178369)

Aggregate all the data by MMWR week (defined [here](https://ndc.services.cdc.gov/wp-content/uploads/MMWR_Week_overview.pdf)).
This means counting trips to and from each of the boroughs per MMWR week.
This is done for each of the taxi types.
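
The cleaning step that produced the `cdc_week` column is not shown in this notebook. As a rough sketch of the MMWR rule linked above (weeks run Sunday to Saturday, and week 1 is the first week with at least four days in the calendar year), the week number of a date could be derived as follows. The function name `mmwr_week` is hypothetical and is not part of the project's helpers.

```python
from datetime import date, timedelta


def mmwr_week(d):
    """Return (mmwr_year, mmwr_week) for a calendar date (illustrative sketch)."""
    # MMWR weeks start on Sunday; date.weekday() is Monday=0 ... Sunday=6,
    # so (weekday + 1) % 7 is the number of days since the last Sunday.
    week_start = d - timedelta(days=(d.weekday() + 1) % 7)
    # A week belongs to the year that holds at least four of its days,
    # which is always the year of its Wednesday.
    wednesday = week_start + timedelta(days=3)
    week = (wednesday.timetuple().tm_yday - 1) // 7 + 1
    return wednesday.year, week


print(mmwr_week(date(2020, 2, 29)))
```

For 29 February 2020 this gives week 9 of 2020, which agrees with the `cdc_week` value in the COVID sample rows further down.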

In [1]:
# imports used throughout this notebook
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
import os
import sys
import re
from itertools import chain

# add homemade helpers
sys.path.insert(1, '../scripts')
import helpers.aggregation_helpers as ah

# for printouts
DEBUGGING = True

In [2]:
from pyspark.sql import SparkSession

# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName('MAST30034 XT Project 1')
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.sql.repl.eagerEval.enabled', True) 
    .config('spark.sql.parquet.cacheMetadata', 'true')
    .getOrCreate()
)

22/08/09 01:03:10 WARN Utils: Your hostname, Polaris resolves to a loopback address: 127.0.1.1; using 172.20.95.79 instead (on interface eth0)
22/08/09 01:03:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/08/09 01:03:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/09 01:03:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
# read in the taxi zones dataset
zones_df = spark.read.csv('../data/raw/tlc_zones/zones.csv',
    header = True)
zones_df.limit(5)

OBJECTID,Shape_Leng,the_geom,Shape_Area,zone,LocationID,borough
1,0.116357453189,MULTIPOLYGON (((-...,0.0007823067885,Newark Airport,1,EWR
2,0.43346966679,MULTIPOLYGON (((-...,0.00486634037837,Jamaica Bay,2,Queens
3,0.0843411059012,MULTIPOLYGON (((-...,0.000314414156821,Allerton/Pelham G...,3,Bronx
4,0.0435665270921,MULTIPOLYGON (((-...,0.000111871946192,Alphabet City,4,Manhattan
5,0.0921464898574,MULTIPOLYGON (((-...,0.000497957489363,Arden Heights,5,Staten Island


### 1. Aggregating the TLC dataset

In [5]:
# names of the TLC taxi types to aggregate
TLC_NAMES = ['yellow']

In [7]:
# read in the cleaned yellow dataset, unioning one parquet file at a time
tlc_df = None
for filename in os.listdir('../data/curated/tlc/cleaned/yellow'):
    if tlc_df is None:
        tlc_df = spark.read.parquet(f'../data/curated/tlc/cleaned/yellow/{filename}')
    else:
        tlc_df = tlc_df.union(
            spark.read.parquet(f'../data/curated/tlc/cleaned/yellow/{filename}')
        )

print(f'{tlc_df.count()} ROWS')
print(tlc_df.limit(5))


                                                                                

240824104 ROWS




+----+-----+---+----------+--------+----------+----------+-------------+--------------+--------------+
|year|month|day|week_index|cdc_week|      date|passengers|trip_distance|pu_location_id|do_location_id|
+----+-----+---+----------+--------+----------+----------+-------------+--------------+--------------+
|2018|    1|  1|         1|       1|01/01/2018|       1.0|          0.5|            41|            24|
|2018|    1|  1|         1|       1|01/01/2018|       1.0|          2.7|           239|           140|
|2018|    1|  1|         1|       1|01/01/2018|       2.0|          0.8|           262|           141|
|2018|    1|  1|         1|       1|01/01/2018|       1.0|         10.2|           140|           257|
|2018|    1|  1|         1|       1|01/01/2018|       2.0|          2.5|           246|           239|
+----+-----+---+----------+--------+----------+----------+-------------+--------------+--------------+



                                                                                

In [8]:
# join the yellow dataset with the taxi zones dataset
tlc_colnames = tlc_df.columns
tlc_df = ah.extract_borough_name(tlc_df, zones_df, 'pu_location_id', 'pu')
tlc_df = ah.extract_borough_name(tlc_df, zones_df, 'do_location_id', 'do')
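
The source of `ah.extract_borough_name` is not included here. Judging by the `pu_borough` and `do_borough` columns used below, it presumably looks up each location ID in the zones table. The plain-Python sketch below illustrates that lookup on lists of dicts rather than Spark DataFrames. The function `add_borough` is hypothetical, and treating unknown IDs as `None` (left-join behaviour) is an assumption.

```python
def add_borough(rows, zones, id_col, prefix):
    """Attach '<prefix>_borough' to each trip by looking up its location ID."""
    # zones.csv is read without a schema, so its IDs arrive as strings;
    # compare everything as strings to be safe.
    borough_by_id = {str(z['LocationID']): z['borough'] for z in zones}
    return [
        # IDs missing from the zones table map to None (assumed behaviour)
        {**row, f'{prefix}_borough': borough_by_id.get(str(row[id_col]))}
        for row in rows
    ]


# two zones taken from the sample output above
zones = [
    {'LocationID': '1', 'borough': 'EWR'},
    {'LocationID': '4', 'borough': 'Manhattan'},
]
trips = [{'pu_location_id': 4, 'do_location_id': 1}]
trips = add_borough(trips, zones, 'pu_location_id', 'pu')
trips = add_borough(trips, zones, 'do_location_id', 'do')
print(trips)
```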

In [9]:
TLC_GROUP_COLUMNS = [
    'year',
    'cdc_week',
    'week_index',
    'pu_borough',
    'do_borough',
    'passengers'
]
# aggregates to compute per group: a row count, plus total and average trip distance
TLC_AGGREGATE_COLUMNS = {
    '*': ['count'],
    'trip_distance': ['total', 'average'],
}
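
The helper `ah.group_and_aggregate` is defined outside this notebook. Its output columns (`num_*`, `tot_trip_distance`, `avg_trip_distance`) suggest that each aggregate name maps to a column prefix. The sketch below reproduces that behaviour in plain Python on a list of dicts. The `PREFIXES` mapping and the function `group_and_aggregate_rows` are assumptions inferred from those column names, not the helper's actual code.

```python
from collections import defaultdict

# assumed prefix per aggregate name, inferred from the notebook's output columns
PREFIXES = {'count': 'num', 'total': 'tot', 'average': 'avg'}


def group_and_aggregate_rows(rows, group_cols, agg_cols):
    """Group dict rows by group_cols and compute count/total/average columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_cols, key))
        for col, funcs in agg_cols.items():
            for func in funcs:
                name = f'{PREFIXES[func]}_{col}'
                if func == 'count':
                    out[name] = len(members)
                elif func == 'total':
                    out[name] = sum(m[col] for m in members)
                elif func == 'average':
                    out[name] = sum(m[col] for m in members) / len(members)
        result.append(out)
    return result


sample = [
    {'year': 2018, 'pu_borough': 'Queens', 'trip_distance': 2.0},
    {'year': 2018, 'pu_borough': 'Queens', 'trip_distance': 4.0},
    {'year': 2018, 'pu_borough': 'Bronx', 'trip_distance': 1.5},
]
weekly = group_and_aggregate_rows(
    sample,
    ['year', 'pu_borough'],
    {'*': ['count'], 'trip_distance': ['total', 'average']},
)
print(weekly)
```

In Spark the same result would typically come from `groupBy(...).agg(...)` with aliased aggregate expressions; the plain-Python version is only meant to make the naming convention and the three aggregates explicit.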

In [10]:
tlc_df = ah.group_and_aggregate(tlc_df, TLC_GROUP_COLUMNS, 
    TLC_AGGREGATE_COLUMNS)
# collect the aggregated rows to the driver and rebuild the DataFrame from them;
# otherwise writing parquets results in a java executor out of memory error
tlc_df = spark.createDataFrame(tlc_df.collect())

                                                                                

In [11]:
# preview the aggregated trips for the earliest weeks
tlc_df.sort('week_index').limit(5)

year,cdc_week,week_index,pu_borough,do_borough,passengers,num_*,tot_trip_distance,avg_trip_distance
2017,1,1,Queens,Brooklyn,1.0,1,11.91,11.91
2018,1,1,Queens,Staten Island,2.0,14,346.01000000000005,24.715000000000003
2017,1,1,Manhattan,Brooklyn,4.0,2,10.22,5.109999999999999
2018,1,1,Queens,Bronx,2.0,311,4892.789999999997,15.732443729903528
2018,1,1,Manhattan,Staten Island,5.0,11,200.72,18.24727272727273


In [12]:
# save the weekly aggregates, replacing any previous output
tlc_df.write.mode('overwrite').parquet('../data/curated/tlc/aggregated/yellow')

22/08/09 01:04:22 WARN MemoryManager: Total allocation exceeds 95.00% (989,174,157 bytes) of heap memory
Scaling row group sizes to 92.12% for 8 writers

### 2. Aggregating the COVID dataset

In [14]:
# read in the daily covid dataset and preview it
covid_df = spark.read.parquet('../data/curated/virals/covid/cases-by-day')
covid_df.limit(5)

date,year,cdc_week,week_index,cases,deaths,hospitalised,borough
02/29/2020,2020,9,113,1,0,1,Overall
03/01/2020,2020,10,114,0,0,1,Overall
03/02/2020,2020,10,114,0,0,2,Overall
03/03/2020,2020,10,114,1,0,7,Overall
03/04/2020,2020,10,114,5,0,2,Overall


In [15]:
COVID_GROUP_COLUMNS = [
    'year',
    'cdc_week',
    'week_index',
    'borough'
]
# aggregates to compute per group: weekly total and daily average of each count
COVID_AGGREGATE_COLUMNS = {
    'cases': ['total', 'average'],
    'deaths': ['total', 'average'],
    'hospitalised': ['total', 'average'],
}

In [16]:
covid_df = ah.group_and_aggregate(covid_df, COVID_GROUP_COLUMNS, 
    COVID_AGGREGATE_COLUMNS)

# collect the aggregated rows to the driver and rebuild the DataFrame from them;
# otherwise writing parquets results in a java executor out of memory error
covid_df = spark.createDataFrame(covid_df.collect())

In [33]:
# preview the aggregated covid data for the earliest weeks
covid_df.sort('week_index').limit(5)

year,cdc_week,week_index,borough,tot_cases,avg_cases,tot_deaths,avg_deaths,tot_hospitalised,avg_hospitalised
2020,9,113,Overall,1.0,1.0,0.0,0.0,1.0,1.0
2020,9,113,Brooklyn,0.0,0.0,0.0,0.0,1.0,1.0
2020,9,113,Manhattan,1.0,1.0,0.0,0.0,0.0,0.0
2020,9,113,Queens,0.0,0.0,0.0,0.0,0.0,0.0
2020,9,113,Bronx,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# save the weekly aggregates, replacing any previous output
covid_df.write.mode('overwrite').parquet('../data/curated/virals/covid/cases-by-week')

### 3. Aggregating the Flu dataset
*Nothing is done here, since the flu dataset is already grouped by MMWR week.*