# DASK Example work

Using parking ticket data from NYC

Based on ideas from "Data Science with Python and Dask" by Jesse Daniel (Manning)

In [59]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

import pandas as pd
import numpy as np

In [60]:
import os

# not needed when on colab
#os.chdir("D:\\Example_data\\Assorted_DL_data_sets\\NYC_parking_violations")

### NYC parking data set

ticket_url="https://data.cityofnewyork.us/resource/2bnn-yakx.csv"

https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2017/2bnn-yakx

10.8M rows by 43 columns

This is a big data set, so let's read the first 5 lines into a pandas data frame to see what we are dealing with.

Questions to answer from this first look:

-Do we want to load all these columns?

-We should set the data types explicitly when loading the file into Dask. Setting the data types, instead of relying on automatic data type inference, saves a lot of time in the load process.


In [61]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [62]:
infile="/content/drive/MyDrive/Colab Notebooks/Spring 2024/Stats for Big Data/Data/Parking_Violations_Issued_-_Fiscal_Year_2017_short.csv"

temp=pd.read_csv(infile,nrows=5)

In [63]:
temp.head()

Unnamed: 0.1,Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,...,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
0,0,5092469481,GZH7067,NY,PAS,2016-07-10,7,SUBN,TOYOT,V,...,GY,,2001,,0,,FAILURE TO STOP AT RED LIGHT,,,
1,1,5092451658,GZH7067,NY,PAS,2016-07-08,7,SUBN,TOYOT,V,...,GY,,2001,,0,,FAILURE TO STOP AT RED LIGHT,,,
2,2,4631633384,AVM7975,NY,PAS,2017-03-09,36,SUBN,GMC,V,...,GY,,2010,,0,,PHTO SCHOOL ZN SPEED VIOLATION,,,
3,3,8196557280,GWB7054,NY,PAS,2017-01-18,70,SUBN,TOYOT,T,...,BL,,2015,,0,5.0,70A-Reg. Sticker Expired (NYS),,,
4,4,4631184358,EXZ9820,NY,PAS,2017-03-02,36,4DSD,HONDA,V,...,GR,,1997,,0,,PHTO SCHOOL ZN SPEED VIOLATION,,,


In [64]:
# what are all the columns?  Do we want all these?  It looks like it
temp.columns

Index(['Unnamed: 0', 'Summons Number', 'Plate ID', 'Registration State',
       'Plate Type', 'Issue Date', 'Violation Code', 'Vehicle Body Type',
       'Vehicle Make', 'Issuing Agency', 'Street Code1', 'Street Code2',
       'Street Code3', 'Vehicle Expiration Date', 'Violation Location',
       'Violation Precinct', 'Issuer Precinct', 'Issuer Code',
       'Issuer Command', 'Issuer Squad', 'Violation Time',
       'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation'],
      dt

In [65]:
temp.dtypes

Unnamed: 0                             int64
Summons Number                         int64
Plate ID                              object
Registration State                    object
Plate Type                            object
Issue Date                            object
Violation Code                         int64
Vehicle Body Type                     object
Vehicle Make                          object
Issuing Agency                        object
Street Code1                           int64
Street Code2                           int64
Street Code3                           int64
Vehicle Expiration Date                int64
Violation Location                   float64
Violation Precinct                     int64
Issuer Precinct                        int64
Issuer Code                            int64
Issuer Command                        object
Issuer Squad                          object
Violation Time                        object
Time First Observed                  float64
Violation 

# Two data versions
infile: full length

infile2: shorter version

## Dask runs slow

Processing data with Dask is slow, and the bigger the data set, the slower it runs.

I tend to have to fix a lot of things when I write code; it's not right the first time. I double-check myself as I go to find out whether the code is doing what I want. That approach is really, really slow in Dask, because every check triggers a computation.

So, if I were doing something serious with Dask, I would do the entire analysis with a smaller version of the file first, make sure it all works, then run it on the full-size file.

You really want to take advantage of the delayed execution in Dask, so that the scheduler can figure out how to do a lot of calculations on each load of a block. That means avoiding compute instructions until you have a series of operations set up in a row, so they can all be run efficiently together.

I created a shortened version of this file, infile2, which has the first few thousand lines of infile. I can use it to test my code, then run the full data set.

The other option is to use head() to load a limited amount of data for initial testing. Note that head() returns a pandas data frame rather than a Dask data frame, and by default it only reads from the first partition.

nyc_data_raw = dd.read_csv(infile, dtype=dtypes, usecols=dtypes.keys()).head(20000)

In [66]:
# code segment listing 5.1

# Here is a listing of all the data types in the form of a dictionary.
# Without it, read_csv() has to infer each column's type from the data: pandas reads
# everything in, while Dask guesses from a sample at the start of the file.
#
# Automatic inference loads the data as text and then converts it to the appropriate
# type. That takes a fair amount of memory and time, so if space is tight, setting the
# data types manually with a dictionary as shown will save time and memory, and give
# you better control of the data load.

# Even with a smaller data set and pandas, it may make sense to set this up for
# long-running projects. It saves time when loading data and ensures the data types
# are all set correctly. The initial effort of setting up this dictionary will pay off.

dtypes = {
 'Date First Observed': str,
 'Days Parking In Effect    ':str,
 'Double Parking Violation': str,
 'Feet From Curb': np.float32,
 'From Hours In Effect': str,
 'House Number': str,
 'Hydrant Violation': str,
 'Intersecting Street': str,
 'Issue Date': str,
 'Issuer Code': np.float32,
 'Issuer Command': str,
 'Issuer Precinct': np.float32,
 'Issuer Squad': str,
 'Issuing Agency': str,
 'Law Section': np.float32,
 'Meter Number': str,
 'No Standing or Stopping Violation': str,
 'Plate ID': str,
 'Plate Type': str,
 'Registration State': str,
 'Street Code1': np.uint32,
 'Street Code2': np.uint32,
 'Street Code3': np.uint32,
 'Street Name': str,
 'Sub Division': str,
 'Summons Number': np.uint32,
 'Time First Observed': str,
 'To Hours In Effect': str,
 'Unregistered Vehicle?': str,
 'Vehicle Body Type': str,
 'Vehicle Color': str,
 'Vehicle Expiration Date': str,
 'Vehicle Make': str,
 'Vehicle Year': np.float32,
 'Violation Code': np.uint16,
 'Violation County': str,
 'Violation Description': str,
 'Violation In Front Of Or Opposite': str,
 'Violation Legal Code': str,
 'Violation Location': str,
 'Violation Post Code': str,
 'Violation Precinct': np.float32,
 'Violation Time': str
}

infile2="D:\\Example_data\\Assorted_DL_data_sets\\NYC_parking_violations\\Parking_Violations_Issued_-_Fiscal_Year_2017_short.csv"

nyc_data_raw = dd.read_csv(infile, dtype=dtypes, usecols=dtypes.keys())
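
One thing that can go wrong with a hand-written dtypes dictionary is a key that doesn't exactly match a column name in the file, such as 'Days Parking In Effect    ' with its trailing spaces. The sketch below shows one way to check for that before the real load. The two-line CSV text and the `dtypes_to_check` dictionary are hypothetical stand-ins, not the real file or the full dictionary.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; note the trailing spaces in the header
sample_csv = io.StringIO(
    "Plate ID,Days Parking In Effect    ,Vehicle Year\n"
    "GZH7067,YYYYYYY,2001\n"
)

# Hypothetical subset of the dtypes dictionary, with one key missing its trailing spaces
dtypes_to_check = {
    "Plate ID": str,
    "Days Parking In Effect": str,
    "Vehicle Year": "float32",
}

# nrows=0 reads only the header row, so this stays cheap even for a large file
header = pd.read_csv(sample_csv, nrows=0).columns

# Any key listed here would not match a column when passed as dtype/usecols
missing_from_file = [name for name in dtypes_to_check if name not in header]
print(missing_from_file)  # ['Days Parking In Effect']
```

With the real file, the same check would use `infile` in place of `sample_csv` and `dtypes` in place of `dtypes_to_check`.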


In [67]:
#Write output  so we have a smaller test version of this file
# I created this early on to allow for testing

outfile="D:\\Example_data\\Assorted_DL_data_sets\\NYC_parking_violations\\Parking_Violations_Issued_-_Fiscal_Year_2017_short.csv"

nyc_data_raw.to_csv(outfile)



['/content/D:\\Example_data\\Assorted_DL_data_sets\\NYC_parking_violations\\Parking_Violations_Issued_-_Fiscal_Year_2017_short.csv/0.part']

In [68]:
# Listing 5.2
# The ProgressBar only reports progress; it is head() that triggers the computation and
# runs it immediately. This is for demonstration/test purposes here. Don't call head()
# or compute() until you need the result.
#
# These checks are most useful in a "testing phase"; turn them off to run a big data set


with ProgressBar():
    display(nyc_data_raw['Plate ID'].head())

[########################################] | 100% Completed | 4.47 s


0    GZH7067
1    GZH7067
2    AVM7975
3    GWB7054
4    EXZ9820
Name: Plate ID, dtype: object

In [69]:
# Listing 5.3

with ProgressBar():
    print(nyc_data_raw[['Plate ID', 'Registration State']].head())

[########################################] | 100% Completed | 4.07 s
  Plate ID Registration State
0  GZH7067                 NY
1  GZH7067                 NY
2  AVM7975                 NY
3  GWB7054                 NY
4  EXZ9820                 NY


In [70]:
# To find the shape, we actually have to compute it. Dask doesn't know how many rows are still on disk; the data is not fully loaded at the start
a=nyc_data_raw.shape
a[0].compute()

500000

In [71]:
nyc_data_raw.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation'],
      dtype='object')

# Data Manipulation in Dask

The following examples show how data manipulation can be done in Dask

Look through these examples and make use of them on the next homework

## Selecting and Dropping Columns

In [72]:
# Listing 5.4

columns_to_select = ['Plate ID', 'Registration State']

with ProgressBar():
    display(nyc_data_raw[columns_to_select].head())

[########################################] | 100% Completed | 3.35 s


Unnamed: 0,Plate ID,Registration State
0,GZH7067,NY
1,GZH7067,NY
2,AVM7975,NY
3,GWB7054,NY
4,EXZ9820,NY


In [73]:
# listing 5.5   dropping one column

with ProgressBar():
    display(nyc_data_raw.drop('Violation Code', axis=1).head())

[########################################] | 100% Completed | 4.50 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,...,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
0,797502185,GZH7067,NY,PAS,2016-07-10,SUBN,TOYOT,V,0,0,...,GY,,2001.0,,0.0,,FAILURE TO STOP AT RED LIGHT,,,
1,797484362,GZH7067,NY,PAS,2016-07-08,SUBN,TOYOT,V,0,0,...,GY,,2001.0,,0.0,,FAILURE TO STOP AT RED LIGHT,,,
2,336666088,AVM7975,NY,PAS,2017-03-09,SUBN,GMC,V,0,0,...,GY,,2010.0,,0.0,,PHTO SCHOOL ZN SPEED VIOLATION,,,
3,3901589984,GWB7054,NY,PAS,2017-01-18,SUBN,TOYOT,T,59590,8590,...,BL,,2015.0,,0.0,5.0,70A-Reg. Sticker Expired (NYS),,,
4,336217062,EXZ9820,NY,PAS,2017-03-02,4DSD,HONDA,V,0,0,...,GR,,1997.0,,0.0,,PHTO SCHOOL ZN SPEED VIOLATION,,,


In [74]:
# Listing 5.6  Dropping multiple columns
violationColumnNames = [columnName for columnName in nyc_data_raw.columns if 'Violation' in columnName]

with ProgressBar():
    display(nyc_data_raw.drop(violationColumnNames, axis=1).head())


[########################################] | 100% Completed | 4.05 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,...,Law Section,Sub Division,Days Parking In Effect,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb
0,797502185,GZH7067,NY,PAS,2016-07-10,SUBN,TOYOT,V,0,0,...,1111.0,D,,,,GY,,2001.0,,0.0
1,797484362,GZH7067,NY,PAS,2016-07-08,SUBN,TOYOT,V,0,0,...,1111.0,D,,,,GY,,2001.0,,0.0
2,336666088,AVM7975,NY,PAS,2017-03-09,SUBN,GMC,V,0,0,...,1180.0,B,,,,GY,,2010.0,,0.0
3,3901589984,GWB7054,NY,PAS,2017-01-18,SUBN,TOYOT,T,59590,8590,...,408.0,j3,YYYYYYY,,,BL,,2015.0,,0.0
4,336217062,EXZ9820,NY,PAS,2017-03-02,4DSD,HONDA,V,0,0,...,1180.0,B,,,,GR,,1997.0,,0.0


In [75]:
# 5.7 Renaming a column
nyc_data_renamed = nyc_data_raw.rename(columns={'Plate ID':'License Plate'})
nyc_data_renamed


Unnamed: 0_level_0,Summons Number,License Plate,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1
,uint32,object,object,object,object,uint16,object,object,object,uint32,uint32,uint32,object,object,float32,float32,float32,object,object,object,object,object,object,object,object,object,object,float32,object,object,object,object,object,object,object,float32,object,float32,object,object,object,object,object
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [76]:
# 5.8 getting a row by index
#  .loc selects rows by index label, not by position
with ProgressBar():
    display(nyc_data_raw.loc[56].head(1))

[########################################] | 100% Completed | 3.58 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,...,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
56,3988081146,PN6B5W,MO,PAS,2016-11-03,19,SUBN,LINCO,T,18520,...,WHITE,,0.0,,0.0,22 2,19-No Stand (bus stop),,,


# Listing 5.11 Calculating the percentage of missing values by column

In [77]:
missing_values = nyc_data_raw.isnull().sum()
missing_values

Dask Series Structure:
npartitions=1
Date First Observed    int64
Violation Time           ...
dtype: int64
Dask Name: dataframe-sum-agg, 4 graph layers

In [78]:
with ProgressBar():
    percent_missing = ((missing_values / nyc_data_raw.index.size) * 100).compute()
percent_missing

[########################################] | 100% Completed | 6.82 s


Summons Number                         0.0000
Plate ID                               0.0058
Registration State                     0.0000
Plate Type                             0.0000
Issue Date                             0.0000
Violation Code                         0.0000
Vehicle Body Type                      0.3904
Vehicle Make                           0.6554
Issuing Agency                         0.0000
Street Code1                           0.0000
Street Code2                           0.0000
Street Code3                           0.0000
Vehicle Expiration Date                0.0000
Violation Location                    18.4468
Violation Precinct                     0.0000
Issuer Precinct                        0.0000
Issuer Code                            0.0000
Issuer Command                        18.3628
Issuer Squad                          18.3704
Violation Time                         0.0002
Time First Observed                   92.1286
Violation County                  

In [79]:
# 5.12  drop the variables with high missing counts

columns_to_drop = percent_missing[percent_missing > 50].index
with ProgressBar():
   nyc_data_clean_stage1 = nyc_data_raw.drop(columns_to_drop, axis=1).persist()

[########################################] | 100% Completed | 4.12 s


In [80]:
nyc_data_clean_stage1.head()

Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,...,Street Name,Date First Observed,Law Section,Sub Division,Days Parking In Effect,Vehicle Color,Vehicle Year,Feet From Curb,Violation Post Code,Violation Description
0,797502185,GZH7067,NY,PAS,2016-07-10,7,SUBN,TOYOT,V,0,...,ALLERTON AVE (W/B) @,0,1111.0,D,,GY,2001.0,0.0,,FAILURE TO STOP AT RED LIGHT
1,797484362,GZH7067,NY,PAS,2016-07-08,7,SUBN,TOYOT,V,0,...,ALLERTON AVE (W/B) @,0,1111.0,D,,GY,2001.0,0.0,,FAILURE TO STOP AT RED LIGHT
2,336666088,AVM7975,NY,PAS,2017-03-09,36,SUBN,GMC,V,0,...,WB LINDEN BLVD @ LIN,0,1180.0,B,,GY,2010.0,0.0,,PHTO SCHOOL ZN SPEED VIOLATION
3,3901589984,GWB7054,NY,PAS,2017-01-18,70,SUBN,TOYOT,T,59590,...,Prince St,0,408.0,j3,YYYYYYY,BL,2015.0,0.0,5.0,70A-Reg. Sticker Expired (NYS)
4,336217062,EXZ9820,NY,PAS,2017-03-02,36,4DSD,HONDA,V,0,...,WB FLATLANDS AVE @ E,0,1180.0,B,,GR,1997.0,0.0,,PHTO SCHOOL ZN SPEED VIOLATION


with ProgressBar():
    print(df['Vehicle Year'].unique().head(10))

with ProgressBar():
    condition = (df['Vehicle Year'] > 0) & (df['Vehicle Year'] <= 2018)
    vehicle_age_by_year = df[condition]['Vehicle Year'].value_counts().compute().sort_index()
vehicle_age_by_year

https://learning.oreilly.com/library/view/data-science-with/9781617295607/c06.xhtml#h2-295607c06-0003

## 5.2.2 Dropping columns with missing values

In [81]:
columns_to_drop = list(percent_missing[percent_missing >= 50].index)
nyc_data_clean_stage1 = nyc_data_raw.drop(columns_to_drop, axis=1)

In [82]:
columns_to_drop

['Time First Observed',
 'Intersecting Street',
 'Violation Legal Code',
 'From Hours In Effect',
 'To Hours In Effect',
 'Unregistered Vehicle?',
 'Meter Number',
 'No Standing or Stopping Violation',
 'Hydrant Violation',
 'Double Parking Violation']

## 5.2.3 Imputing missing values

In [83]:
# 5.13 assign missing color entries to the most common color
with ProgressBar():
    count_of_vehicle_colors = nyc_data_clean_stage1['Vehicle Color'].value_counts().compute()
most_common_color = count_of_vehicle_colors.sort_values(ascending=False).index[0]


nyc_data_clean_stage2 = nyc_data_clean_stage1.fillna({'Vehicle Color': most_common_color})

[########################################] | 100% Completed | 3.56 s


In [84]:
count_of_vehicle_colors

GY       80295
WH       78237
BK       69441
WHITE    58188
BLACK    29590
         ...  
GRYA         1
GRY/G        1
GRWH         1
GRU          1
ZREEN        1
Name: Vehicle Color, Length: 544, dtype: int64

## 5.2.4 Dropping rows with missing data

Drop all rows with missing entries in columns that are rarely missing



In [85]:
rows_to_drop = list(percent_missing[(percent_missing > 0) & (percent_missing < 5)].index)
nyc_data_clean_stage3 = nyc_data_clean_stage2.dropna(subset=rows_to_drop)

## 5.2.5 Imputing multiple columns with missing values

In [86]:
remaining_columns_to_clean = list(percent_missing[(percent_missing >= 5) & (percent_missing < 50)].index)
nyc_data_clean_stage3.dtypes[remaining_columns_to_clean]

Violation Location                   object
Issuer Command                       object
Issuer Squad                         object
Violation In Front Of Or Opposite    object
House Number                         object
Days Parking In Effect               object
Violation Post Code                  object
Violation Description                object
dtype: object

In [87]:
# this is the dictionary to set up the missing data replacements

unknown_default_dict = {columnName: 'Unknown' for columnName in remaining_columns_to_clean}

In [88]:
unknown_default_dict

{'Violation Location': 'Unknown',
 'Issuer Command': 'Unknown',
 'Issuer Squad': 'Unknown',
 'Violation In Front Of Or Opposite': 'Unknown',
 'House Number': 'Unknown',
 'Days Parking In Effect    ': 'Unknown',
 'Violation Post Code': 'Unknown',
 'Violation Description': 'Unknown'}

In [89]:
# 5.17 replace entries with missing values with the word "unknown"
nyc_data_clean_stage4 =nyc_data_clean_stage3.fillna(unknown_default_dict)
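
The three missing-value rules used in sections 5.2.2 through 5.2.5 can be seen side by side in a small sketch. Once computed, percent_missing is an ordinary pandas Series, so this part needs no Dask at all. The four percentages below are copied from the output earlier in this notebook; `columns_to_fill` is just an illustrative name for what the notebook calls remaining_columns_to_clean.

```python
import pandas as pd

# Four entries taken from the percent_missing output shown earlier
percent_missing = pd.Series({
    "Plate ID": 0.0058,
    "Issue Date": 0.0,
    "Issuer Squad": 18.3704,
    "Time First Observed": 92.1286,
})

# 50% or more missing: drop the whole column
columns_to_drop = list(percent_missing[percent_missing >= 50].index)

# Rarely missing (above 0%, below 5%): drop the affected rows
rows_to_drop = list(percent_missing[(percent_missing > 0) & (percent_missing < 5)].index)

# In between (5% up to 50%): fill the gaps with a default such as 'Unknown'
columns_to_fill = list(percent_missing[(percent_missing >= 5) & (percent_missing < 50)].index)

print(columns_to_drop, rows_to_drop, columns_to_fill)
```

Columns with nothing missing, like Issue Date here, fall into none of the three groups and are left alone.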

As a check, look at where we still have missing data

In [90]:
with ProgressBar():
    print(nyc_data_clean_stage4.isnull().sum().compute())


[########################################] | 100% Completed | 9.73 s
Summons Number                       0
Plate ID                             0
Registration State                   0
Plate Type                           0
Issue Date                           0
Violation Code                       0
Vehicle Body Type                    0
Vehicle Make                         0
Issuing Agency                       0
Street Code1                         0
Street Code2                         0
Street Code3                         0
Vehicle Expiration Date              0
Violation Location                   0
Violation Precinct                   0
Issuer Precinct                      0
Issuer Code                          0
Issuer Command                       0
Issuer Squad                         0
Violation Time                       0
Violation County                     0
Violation In Front Of Or Opposite    0
House Number                         0
Street Name                       

In [91]:
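# persist() returns a new collection; assign it, or the persisted result is discarded
nyc_data_clean_stage4 = nyc_data_clean_stage4.persist()
nyc_data_clean_stage4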

Unnamed: 0_level_0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Date First Observed,Law Section,Sub Division,Days Parking In Effect,Vehicle Color,Vehicle Year,Feet From Curb,Violation Post Code,Violation Description
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1
,uint32,object,object,object,object,uint16,object,object,object,uint32,uint32,uint32,object,object,float32,float32,float32,object,object,object,object,object,object,object,object,float32,object,object,object,float32,float32,object,object
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


## 5.3 Recoding Data

The plate type category has a lot of rare values, many of them appearing only once. Keep the most common types and set the rest to "Other"

In [92]:
with ProgressBar():
    license_plate_types = nyc_data_clean_stage4['Plate Type'].value_counts().compute()
license_plate_types

[########################################] | 100% Completed | 7.43 s


PAS    365949
COM     83826
OMT     20791
OMS      4986
SRF      4046
        ...  
CLG         1
CCK         1
ATV         1
ATD         1
WUG         1
Name: Plate Type, Length: 72, dtype: int64

Use the isin function to determine whether the plate type is in the top two categories,
then create a variable from this where all the other plate types are pooled into an "Other" category

In [93]:
condition = nyc_data_clean_stage4['Plate Type'].isin(['PAS', 'COM'])
plate_type_masked = nyc_data_clean_stage4['Plate Type'].where(condition, 'Other')
nyc_data_recode_stage1 = nyc_data_clean_stage4.drop('Plate Type', axis=1)
nyc_data_recode_stage2 = nyc_data_recode_stage1.assign(PlateType=plate_type_masked)
nyc_data_recode_stage3 = nyc_data_recode_stage2.rename(columns={'PlateType':'Plate Type'})
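
It is easy to mix up where() and mask(), so here is a small pandas sketch of the difference, using a made-up five-row frame. where() keeps values where the condition is True and replaces the rest; mask() does the opposite. Dask mirrors both methods. The last line also shows assign() taking a dictionary, which avoids the drop/assign/rename steps in pandas; I would expect the same to work on a Dask data frame, but treat that as an assumption to test.

```python
import pandas as pd

# Hypothetical five-row stand-in for the ticket data
df = pd.DataFrame({
    "Plate Type": ["PAS", "COM", "OMT", "PAS", "SRF"],
    "Violation Code": [7, 36, 70, 36, 7],
})

is_common = df["Plate Type"].isin(["PAS", "COM"])

# where(): keep values where the condition is True, replace everything else
recoded_where = df["Plate Type"].where(is_common, "Other")

# mask(): replace values where the condition is True, keep everything else
recoded_mask = df["Plate Type"].mask(~is_common, "Other")

# A dictionary lets assign() target a column name that contains a space
df_recoded = df.assign(**{"Plate Type": recoded_where})

print(df_recoded["Plate Type"].tolist())
```

Both calls give the same recoded column here because the condition passed to mask() is the negation of the one passed to where().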

In [94]:
# Note the "stacking" of different stages of the data. These are all "labels" for the underlying data, not independent copies.
# Each one becomes a step in the delayed execution graph used by the scheduler.
# The compute() call below forces the actual computation

with ProgressBar():
    display(nyc_data_recode_stage3['Plate Type'].value_counts().compute())

[########################################] | 100% Completed | 7.24 s


PAS      365949
COM       83826
Other     43885
Name: Plate Type, dtype: int64

Vehicle colors have a similar long tail of rare values. Here we pool the colors that appear only once into "Other"

In [95]:
with ProgressBar():
    count_of_vehicle_colors = nyc_data_recode_stage3['Vehicle Color'].value_counts().compute()

[########################################] | 100% Completed | 8.29 s


In [96]:
count_of_vehicle_colors

GY       86289
WH       77630
BK       69282
WHITE    57094
BLACK    29102
         ...  
GYOR         1
GUY          1
GU           1
GRYE         1
ZREEN        1
Name: Vehicle Color, Length: 517, dtype: int64

In [97]:
single_color = list(count_of_vehicle_colors[count_of_vehicle_colors == 1].index)
# build the mask from the most recent stage so the whole chain stays consistent
condition = nyc_data_recode_stage3['Vehicle Color'].isin(single_color)
vehicle_color_masked = nyc_data_recode_stage3['Vehicle Color'].mask(condition, 'Other')
nyc_data_recode_stage4 = nyc_data_recode_stage3.drop('Vehicle Color', axis=1)
nyc_data_recode_stage5 = nyc_data_recode_stage4.assign(VehicleColor=vehicle_color_masked)
nyc_data_recode_stage6 = nyc_data_recode_stage5.rename(columns={'VehicleColor':'Vehicle Color'})

In [98]:
# compute all these substitutions and then check

with ProgressBar():
    count_of_vehicle_colors2 = nyc_data_recode_stage6['Vehicle Color'].value_counts().compute()

count_of_vehicle_colors2

[########################################] | 100% Completed | 7.65 s


GY       86289
WH       77630
BK       69282
WHITE    57094
BLACK    29102
         ...  
GREN         2
SIVLE        2
GRT          2
GRY.         2
AMETH        2
Name: Vehicle Color, Length: 288, dtype: int64

## 5.4 Parsing the issue date column

In [99]:
from datetime import datetime

def my_striptime(x):
    # helper (not used below): try month-day-year first, then fall back to year-month-day
    try:
        return datetime.strptime(x, "%m-%d-%Y")
    except ValueError:
        return datetime.strptime(x, "%Y-%m-%d")



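# meta tells Dask the name and dtype of the result without running the function
issue_date_parsed = nyc_data_recode_stage6['Issue Date'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d"), meta=('Issue Date', 'datetime64[ns]'))
#issue_date_parsed = nyc_data_recode_stage6['Issue Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y"), meta=('Issue Date', 'datetime64[ns]'))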
nyc_data_derived_stage1 = nyc_data_recode_stage6.drop('Issue Date', axis=1)
nyc_data_derived_stage2 = nyc_data_derived_stage1.assign(IssueDate=issue_date_parsed)
nyc_data_derived_stage3 = nyc_data_derived_stage2.rename(columns={'IssueDate':'Issue Date'})



In [100]:
with ProgressBar():
    display(nyc_data_derived_stage3['Issue Date'].head())

[########################################] | 100% Completed | 13.23 s


0   2016-07-10
1   2016-07-08
2   2017-03-09
3   2017-01-18
4   2017-03-02
Name: Issue Date, dtype: datetime64[ns]

In [101]:
# operation 5.25 extracting month and year
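# strftime returns strings, so the result is declared as object rather than int
issue_date_month_year = nyc_data_derived_stage3['Issue Date'].apply(lambda dt: dt.strftime("%Y%m"), meta=('Issue Date', 'object'))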
nyc_data_derived_stage4 = nyc_data_derived_stage3.assign(IssueMonthYear=issue_date_month_year)
nyc_data_derived_stage5 = nyc_data_derived_stage4.rename(columns={'IssueMonthYear':'Citation Issued Month Year'})



In [102]:
# Listing 5.26
with ProgressBar():
    display(nyc_data_derived_stage4.head(1))

[########################################] | 100% Completed | 15.53 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,...,Sub Division,Days Parking In Effect,Vehicle Year,Feet From Curb,Violation Post Code,Violation Description,Plate Type,Vehicle Color,Issue Date,IssueMonthYear
0,797502185,GZH7067,NY,7,SUBN,TOYOT,V,0,0,0,...,D,Unknown,2001.0,0.0,Unknown,FAILURE TO STOP AT RED LIGHT,PAS,GY,2016-07-10,201607


In [103]:
# listing 5.26
with ProgressBar():
    display(nyc_data_derived_stage5['Citation Issued Month Year'].head())

[########################################] | 100% Completed | 15.86 s


0    201607
1    201607
2    201703
3    201701
4    201703
Name: Citation Issued Month Year, dtype: object

## Selecting values within time spans,  an example of extracting subsets

In [104]:
#listing 5.27  Finding all October citations
months=['201310','201410','201510','201610','201710']
condition = nyc_data_derived_stage5['Citation Issued Month Year'].isin(months)
october_citations = nyc_data_derived_stage5[condition]

with ProgressBar():
    display(october_citations.head())

[########################################] | 100% Completed | 15.68 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,...,Sub Division,Days Parking In Effect,Vehicle Year,Feet From Curb,Violation Post Code,Violation Description,Plate Type,Vehicle Color,Issue Date,Citation Issued Month Year
8,3998577006,FYP7892,NY,14,SUBN,TOYOT,T,14380,35980,36030,...,c,YYYYYYY,2016.0,0.0,CC1,14-No Standing,PAS,WH,2016-10-03,201610
14,3516909129,655790,RI,21,SUBN,FORD,T,18910,27790,24890,...,d1,Y Y,0.0,0.0,CCA,21-No Parking (street clean),PAS,OTHER,2016-10-07,201610
45,4139043893,64533JW,NY,40,VAN,FORD,T,10110,18820,18830,...,e2,YYYYYYY,2006.0,0.0,C 77,40-Fire Hydrant,COM,WH,2016-10-18,201610
52,3859075742,HHU5583,NY,68,SUBN,DODGE,T,70430,49630,48830,...,a1,Unknown,2016.0,0.0,01 3,68-Not Pkg. Comp. w Psted Sign,PAS,BK,2016-10-13,201610
78,4099064391,AG61612,AZ,78,VAN,GMC,T,14230,36430,36480,...,k6,YYYYYYY,0.0,0.0,A 31,78-Nighttime PKG on Res Street,PAS,WHITE,2016-10-22,201610


In [105]:
# listing 5.28

bound_date = '2016-4-25'
condition = nyc_data_derived_stage5['Issue Date'] > bound_date
citations_after_bound = nyc_data_derived_stage5[condition]

with ProgressBar():
    display(citations_after_bound.head())

[########################################] | 100% Completed | 15.88 s


Unnamed: 0,Summons Number,Plate ID,Registration State,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,...,Sub Division,Days Parking In Effect,Vehicle Year,Feet From Curb,Violation Post Code,Violation Description,Plate Type,Vehicle Color,Issue Date,Citation Issued Month Year
0,797502185,GZH7067,NY,7,SUBN,TOYOT,V,0,0,0,...,D,Unknown,2001.0,0.0,Unknown,FAILURE TO STOP AT RED LIGHT,PAS,GY,2016-07-10,201607
1,797484362,GZH7067,NY,7,SUBN,TOYOT,V,0,0,0,...,D,Unknown,2001.0,0.0,Unknown,FAILURE TO STOP AT RED LIGHT,PAS,GY,2016-07-08,201607
2,336666088,AVM7975,NY,36,SUBN,GMC,V,0,0,0,...,B,Unknown,2010.0,0.0,Unknown,PHTO SCHOOL ZN SPEED VIOLATION,PAS,GY,2017-03-09,201703
3,3901589984,GWB7054,NY,70,SUBN,TOYOT,T,59590,8590,57790,...,j3,YYYYYYY,2015.0,0.0,05,70A-Reg. Sticker Expired (NYS),PAS,BL,2017-01-18,201701
4,336217062,EXZ9820,NY,36,4DSD,HONDA,V,0,0,0,...,B,Unknown,1997.0,0.0,Unknown,PHTO SCHOOL ZN SPEED VIOLATION,PAS,GR,2017-03-02,201703


In [106]:
# Listing 5.29: filter to a date range, then index by month-year
with ProgressBar():
    issue_date = nyc_data_derived_stage5['Issue Date']
    condition = (issue_date > '2014-01-01') & (issue_date <= '2017-12-31')
    nyc_data_filtered = nyc_data_derived_stage5[condition]
    # set_index has to sort the data by the new index, which is the work timed below
    nyc_data_new_index = nyc_data_filtered.set_index('Citation Issued Month Year')

[########################################] | 100% Completed | 43.98 s


In [107]:
# ## Writing to other file formats

# Parquet is a columnar storage format used by many distributed data systems, such as Pig, Hive and Spark.

# Dask can write Parquet files, which should offer better performance and some level of file compression.

# However, the routines that write from Dask to Parquet appear to be "fussy" at best.

# Users report that the write does run with very recent versions of Python (3.9 and higher), but I haven't chased the issue down.

In [108]:
nyc_data_derived_stage5.to_csv("nyc_data_derived5.csv")

['/content/nyc_data_derived5.csv/0.part']

In [109]:
# This cell usually fails; it attempts to save the data as Parquet files
import numpy as np

import pyarrow
#import fastparquet

#nyc_data_derived_stage5.index = nyc_data_derived_stage5.index.astype('i8')

#nyc_data_derived_stage5.map_partitions(lambda d: d[nyc_data_derived_stage5.columns.tolist()])

#tdf.to_parquet("nyc_data_derived5",compression='snappy', engine='fastparquet')

nyc_data_derived_stage5.to_parquet("nyc_data_derived5",compression='snappy', engine='pyarrow', write_index=False)

#nyc_data_derived5=dd.read_parquet("nyc_data_derived5")

ValueError: Failed to convert partition to expected pyarrow schema:
    `ArrowTypeError('Expected a string or bytes dtype, got datetime64[ns]', 'Conversion failed for column Issue Date with type datetime64[ns]')`

Expected partition schema:
    Summons Number: uint32
    Plate ID: string
    Registration State: string
    Violation Code: uint16
    Vehicle Body Type: string
    Vehicle Make: string
    Issuing Agency: string
    Street Code1: uint32
    Street Code2: uint32
    Street Code3: uint32
    Vehicle Expiration Date: string
    Violation Location: string
    Violation Precinct: float
    Issuer Precinct: float
    Issuer Code: float
    Issuer Command: string
    Issuer Squad: string
    Violation Time: string
    Violation County: string
    Violation In Front Of Or Opposite: string
    House Number: string
    Street Name: string
    Date First Observed: string
    Law Section: float
    Sub Division: string
    Days Parking In Effect    : string
    Vehicle Year: float
    Feet From Curb: float
    Violation Post Code: string
    Violation Description: string
    Plate Type: string
    Vehicle Color: string
    Issue Date: string
    Citation Issued Month Year: int64

Received partition schema:
    Summons Number: uint32
    Plate ID: string
    Registration State: string
    Violation Code: uint16
    Vehicle Body Type: string
    Vehicle Make: string
    Issuing Agency: string
    Street Code1: uint32
    Street Code2: uint32
    Street Code3: uint32
    Vehicle Expiration Date: string
    Violation Location: string
    Violation Precinct: float
    Issuer Precinct: float
    Issuer Code: float
    Issuer Command: string
    Issuer Squad: string
    Violation Time: string
    Violation County: string
    Violation In Front Of Or Opposite: string
    House Number: string
    Street Name: string
    Date First Observed: string
    Law Section: float
    Sub Division: string
    Days Parking In Effect    : string
    Vehicle Year: float
    Feet From Curb: float
    Violation Post Code: string
    Violation Description: string
    Plate Type: string
    Vehicle Color: string
    Issue Date: timestamp[ns]
    Citation Issued Month Year: string

This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.

## Joins

Dask supports joins, so you can combine two or more files, each on disk, into a single Dask object. This is one way to pull together information that is spread across multiple large files.