# NYC Parking Violations
This example demonstrates ETL operations for transforming New York City parking summons data to create maps. 

Original example can be found [here](https://github.com/JBlumstein/NYCParking/blob/master/NYC_Parking_Violations_Mapping_Example.ipynb). The 2016 and 2017 dataset [here](https://www.kaggle.com/new-york-city/nyc-parking-tickets) is used which is ~4GB.

### Start an IPyParallel cluster 
Run the following code in a cell to start an IPyParallel cluster. 8 cores are used in this example. 

In [1]:
import os
if os.environ.get("BODO_PLATFORM_WORKSPACE_UUID",'NA') == 'NA':
    import ipyparallel as ipp
    import psutil; n = min(psutil.cpu_count(logical=False), 8)
    rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|██████████| 8/8 [00:07<00:00,  1.12engine/s]


In [2]:
%%px
import numpy as np
import pandas as pd
import time
import bodo

## Data Loading
In this section parking tickets data is loaded from S3 bucket and aggregated by day, violation type, and police precinct and placed in a dataframe. 

Each dataframe is added to a list of dataframes, and then the dataframes are all appended into a single dataframe named `main_df`.

In addition, violcation codes, and precincts information are loaded as well.

In [3]:
%%px

@bodo.jit(cache=True)
def load_parking_tickets():
    start = time.time()
    year_2016_df = pd.read_csv('s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv', parse_dates=["Issue Date"])
    year_2016_df = year_2016_df.groupby(['Issue Date','Violation County','Violation Precinct','Violation Code'], as_index=False)['Summons Number'].count()        

    year_2017_df = pd.read_csv('s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2017.csv', parse_dates=["Issue Date"])        
    year_2017_df = year_2017_df.groupby(['Issue Date','Violation County','Violation Precinct','Violation Code'], as_index=False)['Summons Number'].count()    
     
    # concatenate all dataframes into one dataframe
    many_year_df = pd.concat([year_2016_df, year_2017_df])
    end = time.time()
    print("Reading Time: ", end - start)
    return many_year_df

main_df = load_parking_tickets()
if bodo.get_rank() == 0:
    display(main_df.head())

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Reading Time:  11.105094708482284


[output:0]

Unnamed: 0,Issue Date,Violation County,Violation Precinct,Violation Code,Summons Number
0,2015-07-09,K,90,21,134
1,2015-07-09,K,90,37,10
2,2015-07-09,K,90,40,16
3,2015-06-30,Q,110,20,22
4,2015-07-10,BX,46,40,17


In [4]:
%%px
@bodo.jit(distributed=False)
def load_violation_precincts_codes():
    start = time.time()
    violation_codes = pd.read_csv("./DOF_Parking_Violation_Codes.csv")
    violation_codes.columns = ['Violation Code','Definition','manhattan_96_and_below','all_other_areas']
    nyc_precincts_df = pd.read_csv("./nyc_precincts.csv", index_col='index')
    end = time.time()
    if bodo.get_rank() == 0:
        print("Violation and precincts load Time: ", end - start)
    return violation_codes, nyc_precincts_df

violation_codes, nyc_precincts_df = load_violation_precincts_codes()

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Violation and precincts load Time:  0.3703936311126199


## Data Cleaning

1. Remove summons with undefined violations (violation code 36).
2. Delete entries that have dates not within our dataset dates.

In [5]:
%%px
@bodo.jit(cache=True)
def elim_code_36(main_df):
    start = time.time()
    '''function to take out all violations with code 36 (other)'''    
    main_df = main_df[main_df['Violation Code']!=36].sort_values('Summons Number',ascending=False)
    end = time.time()
    print("Eliminate undefined violations time: ", end - start)
    return main_df

main_df = elim_code_36(main_df)
if bodo.get_rank() == 0:
    print(main_df.head())

[stdout:0] Eliminate undefined violations time:  0.0531956434147105
       Issue Date Violation County  Violation Precinct  Violation Code  \
280408 2015-11-27                Q                 114              21   
371916 2016-06-19               BK                   0               7   
336114 2017-05-19               QN                   0               7   
371914 2016-06-19               QN                   0               7   
73316  2016-06-18               QN                   0               7   

        Summons Number  
280408            1165  
371916             910  
336114             907  
371914             891  
73316              889  


In [None]:
%%px
@bodo.jit(cache=True)
def remove_outliers(main_df):
    start = time.time()
    main_df = main_df[(main_df['Issue Date'] >= '2016-01-01') & (main_df['Issue Date'] <= '2017-12-31')]
    end = time.time()
    print("Remove outliers time: ", (end-start)) 
    return main_df

main_df = remove_outliers(main_df)
if bodo.get_rank() == 0:
    display(main_df.head())

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Remove outliers time:  0.008827869091874163


[output:0]

Unnamed: 0,Issue Date,Violation County,Violation Precinct,Violation Code,Summons Number
371916,2016-06-19,BK,0,7,910
336114,2017-05-19,QN,0,7,907
371914,2016-06-19,QN,0,7,891
73316,2016-06-18,QN,0,7,889
335189,2016-06-26,BK,0,7,888


## Collect More Information
Data on each violation type, like ticket cost and violation descriptions, are added to the dataset by joining our main_df dataset with a violation type level dataset

In [None]:
%%px
@bodo.jit(cache=True)
def merge_violation_code(main_df, violation_codes):
    start = time.time()
    # left join main_df and violation_codes df so that there's more info on violation in main_df
    main_df = pd.merge(main_df, violation_codes, on='Violation Code', how='left')
    # cast precincts as integers from floats (inadvertent type change by merge)
    main_df['Violation Precinct'] = main_df['Violation Precinct'].astype(int)    
    end = time.time()
    print("Merge time: ", (end-start))
    print(main_df.shape)
    return main_df

main_w_violation = merge_violation_code(main_df, violation_codes)
if bodo.get_rank() == 0:
    display(main_w_violation.head())

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Merge time:  0.3143356828525157
(872465, 8)


[output:0]

Unnamed: 0,Issue Date,Violation County,Violation Precinct,Violation Code,Summons Number,Definition,manhattan_96_and_below,all_other_areas
0,2016-06-19,BK,0,7,910,Vehicles photographed going through a red ligh...,50,50
1,2017-05-19,QN,0,7,907,Vehicles photographed going through a red ligh...,50,50
2,2016-06-19,QN,0,7,891,Vehicles photographed going through a red ligh...,50,50
3,2016-06-18,QN,0,7,889,Vehicles photographed going through a red ligh...,50,50
4,2016-06-26,BK,0,7,888,Vehicles photographed going through a red ligh...,50,50


## Compute Cost of Summons For Each Precinct.

1. Most violations have different ticket prices, based on whether they occur in Manhattan below 96th St. or elsewhere in New York City. The daily revenue for each violation type in each precinct are determined by multiplying the number of offenses by the average cost of the offense (based on how much of the precinct is in Manhattan below 96th St.).

In [None]:
%%px
#calculate the total summonses in dollars for a violation in a precinct on a day
@bodo.jit(cache=True)
def calculate_total_summons(main_df):
    start = time.time()
    #create column for portion of precinct 96th st. and below
    n = len(main_df)
    portion_manhattan_96_and_below = np.empty(n, np.int64)
    # NOTE: To run pandas, use this loop.
    # for i in range(n):
    for i in bodo.prange(n):
        x = main_df['Violation Precinct'].iat[i]
        if x < 22 or x == 23:
            portion_manhattan_96_and_below[i] = 1.0
        elif x == 22:
            portion_manhattan_96_and_below[i] = 0.75
        elif x == 24:
            portion_manhattan_96_and_below[i] = 0.5
        else: #other
            portion_manhattan_96_and_below[i] = 0
    main_df["portion_manhattan_96_and_below"] = portion_manhattan_96_and_below

    #create column for average dollar amount of summons based on location
    main_df['average_summons_amount'] = (main_df['portion_manhattan_96_and_below'] * main_df['manhattan_96_and_below'] 
                                     + (1 - main_df['portion_manhattan_96_and_below']) * main_df['all_other_areas'])

    #get total summons dollars by multiplying average dollar amount by number of summons given
    main_df['total_summons_dollars'] = main_df['Summons Number'] * main_df['average_summons_amount']
    main_df = main_df.sort_values(by=['total_summons_dollars'], ascending=False)
    end = time.time()    
    print("Calculate Total Summons Time: ", (end-start))
    return main_df

total_summons = calculate_total_summons(main_w_violation)
if bodo.get_rank() == 0:
    display(total_summons.head())

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Calculate Total Summons Time:  0.5892386897232882


[output:0]

Unnamed: 0,Issue Date,Violation County,Violation Precinct,Violation Code,Summons Number,Definition,manhattan_96_and_below,all_other_areas,portion_manhattan_96_and_below,average_summons_amount,total_summons_dollars
290,2017-04-11,NY,19,46,554,Standing or parking on the roadway side of a v...,115,115,1,115,63710
317,2017-03-22,NY,19,46,544,Standing or parking on the roadway side of a v...,115,115,1,115,62560
325,2016-09-30,BK,0,5,542,Failure to make a right turn from a bus lane.,115,115,1,115,62330
329,2017-03-30,NY,19,46,540,Standing or parking on the roadway side of a v...,115,115,1,115,62100
366,2017-04-13,NY,19,46,526,Standing or parking on the roadway side of a v...,115,115,1,115,60490


2. The aggregate function aggregates main_df by precinct. Once the data is run through this function that it will have a single row per precinct with the precinct number, the number of summonses, and the combined dollar value of the summonses.

In [None]:
%%px

@bodo.jit(cache=True)
def aggregate(main_df):
    '''function that aggregates and filters data
    e.g. total violations by precinct
    '''
    start = time.time()
    filtered_dataset = main_df[['Violation Precinct','Summons Number', 'total_summons_dollars']]
    precinct_offenses_df = filtered_dataset.groupby(by=['Violation Precinct']).sum().reset_index().fillna(0)
    end = time.time()
    precinct_offenses_df = precinct_offenses_df.sort_values("total_summons_dollars", ascending=False)
    print("Aggregate code time: ", (end-start))
    return precinct_offenses_df

precinct_offenses_df = aggregate(total_summons)
if bodo.get_rank() == 0:
    display(precinct_offenses_df.head())    

%px:   0%|          | 0/16 [00:00<?, ?tasks/s]

[stdout:0] Aggregate code time:  0.009074624489130656


[output:0]

Unnamed: 0,Violation Precinct,Summons Number,total_summons_dollars
204,19,795615,69402435
133,14,500438,48019275
242,0,710758,46152490
223,1,480341,45338900
102,18,453506,44068990
