# Preparing the Data

<div class="alert alert-block alert-info">
<b>Warning:</b>
Executing this initial notebook is likely to crash your Jupyter kernel on a system with less than 16GB physical RAM—the basic calls to Pandas's `read_csv` function alone will use up nearly all the available memory. In that case, you can use the <a href="https://www.bodo.ai/try-bodo">Bodo cloud platform</a> for free, or just read the notebook without attempting to execute cells.
</div>

This series of notebooks is inspired by Jonah Blumstein's [NYC Parking Violations Mapping Example](https://github.com/JBlumstein/NYCParking/blob/master/NYC_Parking_Violations_Mapping_Example.ipynb).

The data used comes from NYC parking tickets issued during the years 2016 and 2017. The original CSV files are available in an S3 bucket (`s3://bodo-example-data/nyc-parking-tickets`). You can alternatively get the data from [Kaggle](https://www.kaggle.com/new-york-city/nyc-parking-tickets) (data for other years are also available, but may not run through the analysis here).

In our code, we would very much like to have a function that looks something like the following:

```python
def load_parking_tickets():
    groupby_cols = ['Issue Date','Violation County','Violation Precinct','Violation Code']
    DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016.csv'
    df = pd.read_csv(DATA_SRC, parse_dates=["Issue Date"])
    df = df.groupby(groupby_cols, as_index=False)['Summons Number'].count()
    return df
```

Even assuming the data is stored locally in the directory `ParkingData`, you encounter a few problems on execution:
- First, mixed data types (those `NaN`s get you every time).
- Second, this will eat up a lot of physical memory; yes, you might happen to have a machine/cluster that handles it cleanly, but that might not always be the case with other data.

---------------------

## Examining a chunk of the 2016 data
Let's take a quick look at what we're dealing with:

In [1]:
# import some relevant packages
import pandas as pd
from s3fs import S3FileSystem
import pathlib

The data is available remotely on an S3 bucket. It will be more convenient to work from local files. The code in the following cell will download two large CSV files from the S3 bucket (unless the files have already been downloaded).

<div class="alert alert-block alert-info">
<b>Warning:</b>
<ul>
    <li> Downloading this data could take some time (depending on your network speed).</li>
    <li> The "test" <tt>not local_file.exists()</tt> is easily fooled (for instance, by an empty file in the same location). In particular, there is no attempt made to verify the integrity of the data downloaded (e.g., by verifying a hash or some similar method).</li>
</ul>
</div>
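
One way to make the existence check less easily fooled is sketched below. It treats empty files as missing and, optionally, compares a SHA-256 digest. The helper names `sha256_of_file` and `needs_download` are hypothetical, and no reference digest is published with this notebook, so `expected_sha256` is an assumption you would have to supply yourself.

```python
import hashlib
import pathlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for block in iter(lambda: fh.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

def needs_download(local_file, expected_sha256=None):
    """Return True if the file is missing, empty, or fails the optional digest check.

    expected_sha256 is a hypothetical reference value supplied by the caller.
    """
    local_file = pathlib.Path(local_file)
    if not local_file.exists() or local_file.stat().st_size == 0:
        return True
    if expected_sha256 is None:
        return False
    return sha256_of_file(local_file) != expected_sha256
```

In the download loop, `if needs_download(local_file):` could stand in for `if not local_file.exists():`.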

In [2]:
# Make sure files are downloaded
s3 = S3FileSystem(anon=True)
S3_PATH = 'bodo-example-data/nyc-parking-tickets'
LOCAL_PATH = pathlib.Path('.') / 'ParkingData'
FNAME_TMPL = 'Parking_Violations_Issued_-_Fiscal_Year_{}.csv'
years = range(2016, 2018)

for yr in years:
    fname = FNAME_TMPL.format(yr)
    local_file = LOCAL_PATH / fname
    remote_file = f'{S3_PATH}/{fname}'
    if not local_file.exists():
        # Checks only filename, not contents/type!
        s3.get(rpath=remote_file, lpath=str(local_file))

These CSV files each contain almost 11 million rows. On a Unix-based system, the `wc` command shows the file line counts:

In [3]:
!wc -l ParkingData/Parking_Violations_Issued_*.csv

  10626900 ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016.csv
  10803029 ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017.csv
  21429929 total


The preceding shell command, if it executes on your operating system, produces output like this:
```bash
  10626900 ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016.csv
  10803029 ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2017.csv
  21429929 total
```
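
If `wc` is not available (on Windows, for instance), a similar count can be obtained in plain Python. This is a minimal sketch; the function name `count_lines` is an assumption for illustration. It counts newline bytes in fixed-size blocks, so the file is never loaded whole.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, similar to `wc -l`."""
    total = 0
    with open(path, 'rb') as fh:
        # Read fixed-size binary blocks until an empty block signals end of file.
        for block in iter(lambda: fh.read(chunk_size), b''):
            total += block.count(b'\n')
    return total
```

For example, `count_lines('ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016.csv')` should report the same number as `wc -l` for that file.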

To get a better sense of the data, let's read the first million rows from the file for 2016.

In [4]:
# Examine a chunk from local source
fname = FNAME_TMPL.format(2016)
DATA_SRC = LOCAL_PATH / fname
%time df_chunk = pd.read_csv(DATA_SRC, parse_dates=["Issue Date"], nrows=1_000_000 )
display(df_chunk.head())



CPU times: user 2.69 s, sys: 301 ms, total: 2.99 s
Wall time: 3.01 s


| | Summons Number | Plate ID | Registration State | Plate Type | Issue Date | Violation Code | Vehicle Body Type | Vehicle Make | Issuing Agency | Street Code1 | ... | Hydrant Violation | Double Parking Violation | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1363745270 | GGY6450 | 99 | PAS | 2015-07-09 | 46 | SDN | HONDA | P | 0 | ... | | | | | | | | | | |
| 1 | 1363745293 | KXD355 | SC | PAS | 2015-07-09 | 21 | SUBN | CHEVR | P | 55730 | ... | | | | | | | | | | |
| 2 | 1363745438 | JCK7576 | PA | PAS | 2015-07-09 | 21 | SDN | ME/BE | P | 42730 | ... | | | | | | | | | | |
| 3 | 1363745475 | GYK7658 | NY | OMS | 2015-07-09 | 21 | SUBN | NISSA | P | 58130 | ... | | | | | | | | | | |
| 4 | 1363745487 | GMT8141 | NY | PAS | 2015-07-09 | 21 | P-U | LINCO | P | 58130 | ... | | | | | | | | | | |


On reading the data, we'll see a warning that resembles this:

<div class="alert alert-block alert-info">
<tt>DtypeWarning: Columns (17,18,20,21,22,23,29,30,31,32,34,36,38,39) have mixed types.Specify dtype option on import or set low_memory=False.</tt>
</div>

This tells us that some columns are being parsed using mixed data-types—with an accompanying memory penalty. Let's see which columns these correspond to.

In [5]:
df_chunk.columns[[17,18,20,21,22,23,29,30,31,32,34,36,38,39]]

Index(['Issuer Command', 'Issuer Squad', 'Time First Observed',
       'Violation County', 'Violation In Front Of Or Opposite', 'House Number',
       'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Unregistered Vehicle?',
       'Meter Number', 'Violation Post Code', 'Violation Description'],
      dtype='object')

A simple solution may be to cast those columns as strings on read with the `dtype` option to `pd.read_csv` and then clean things up later. However, we want to explore a bit first to see what these columns actually contain.
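
For reference, the `dtype` approach could look roughly like the sketch below. It uses a tiny made-up CSV held in memory rather than the real file; the `sample_csv` contents and the `string_cols` name are assumptions for illustration only.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; the values are made up for illustration.
sample_csv = io.StringIO(
    "Summons Number,Issuer Squad,Violation County\n"
    "1,0,K\n"
    "2,,Q\n"
    "3,J,\n"
    "4,0000,NY\n"
)

# Ask pandas to keep the troublesome columns as strings instead of inferring types.
string_cols = {'Issuer Squad': str, 'Violation County': str}
df_small = pd.read_csv(sample_csv, dtype=string_cols)

print(df_small['Issuer Squad'].tolist())
```

With this option, `'0'` and `'0000'` stay distinct strings. Empty fields still come through as missing values, so they need separate handling.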

In [6]:
df_chunk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 51 columns):
 #   Column                             Non-Null Count    Dtype         
---  ------                             --------------    -----         
 0   Summons Number                     1000000 non-null  int64         
 1   Plate ID                           999785 non-null   object        
 2   Registration State                 1000000 non-null  object        
 3   Plate Type                         1000000 non-null  object        
 4   Issue Date                         1000000 non-null  datetime64[ns]
 5   Violation Code                     1000000 non-null  int64         
 6   Vehicle Body Type                  996603 non-null   object        
 7   Vehicle Make                       994287 non-null   object        
 8   Issuing Agency                     1000000 non-null  object        
 9   Street Code1                       1000000 non-null  int64         
 10  Street 

The columns are mostly strings (represented as `object` in the output to `df_chunk.info()`). The `Issue Date` column is of dtype `datetime64` (as specified in the call to `read_csv`) and the remaining columns are numeric.

Looking at a few of the offending mixed-dtype columns, we can see that they are mostly strings. We can use the `Series.unique` method to get a sense of how many distinct entries there are.

In [7]:
# How many distinct entries are there in the 'Violation County' column?
df_chunk['Violation County'].unique()

array(['K', nan, 'Q', 'NY', 'BX', 'R', 'QU', 'KINGS'], dtype=object)

In [8]:
# How many distinct entries are there in the 'Issuer Squad' column?
df_chunk['Issuer Squad'].unique()

array([0, nan, 'J', 'D', '0000', 'O', 'H', 'U', 'E', 'B', 'R', 'G', 'F',
       'T', 'S', 'A', 'Q', 'P', 'M', 'L', 'N', 'I', 'C', 'K', 'X', 'V',
       'X1', 'CC', 'Y', 'E1', 'B2', 'GP', 'A2', 'AA', 'A1', 'B1', 'X2',
       'D1', 'YA'], dtype=object)

In [9]:
# How many distinct entries are there in the 'Violation In Front Of Or Opposite' column?
df_chunk['Violation In Front Of Or Opposite'].unique()

array(['F', 'O', nan, 'I', 'R', 'X'], dtype=object)

These columns are in fact largely *categorical*: each contains relatively few distinct strings. Moreover, any missing entries are interpreted as `numpy.nan`. There are also some instances of strings like `'0'` being parsed as the integer `0`, which also contributes to the "mixed dtype" warning. By default, `read_csv` processes the file in chunks and infers a dtype for each chunk, so a stretch of rows with only numeric strings in a column is parsed as integers while other stretches are kept as strings.
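
To see why the categorical nature of such columns matters, the following sketch compares the memory footprint of a synthetic column stored as plain strings with the same column stored as the pandas `category` dtype. The data is made up by repeating the county codes seen above; the exact byte counts depend on your pandas version, so none are quoted here.

```python
import pandas as pd

# Synthetic column reusing the county codes seen above, repeated many times.
codes = ['K', 'Q', 'NY', 'BX', 'R', 'QU', 'KINGS']
as_strings = pd.Series(codes * 10_000, name='Violation County')
as_category = as_strings.astype('category')

# deep=True includes the memory used by the string values themselves.
bytes_strings = as_strings.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f'strings:  {bytes_strings:,} bytes')
print(f'category: {bytes_category:,} bytes')
```

A categorical column stores each distinct string once, plus a small integer code per row, which is why it tends to be much smaller when there are few distinct values.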

---------------------

## Working with the entire data set

From this quick look, we might decide to replace the NaNs in `'Violation County'` with an explicit string and handle other columns later. An alternative is to use the `read_csv` option `usecols` to load only the columns that are needed.

```python
def load_parking_tickets():
    groupby_cols = ['Issue Date','Violation County','Violation Precinct','Violation Code']
    DATA_SRC = 'ParkingData/Parking_Violations_Issued_-_Fiscal_Year_2016.csv'
    df = pd.read_csv(DATA_SRC, parse_dates=["Issue Date"])
    df['Violation County'] = df['Violation County'].fillna('NAN')
    df = df.groupby(groupby_cols, as_index=False)['Summons Number'].count()
    return df
```
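
The `usecols` alternative could be sketched as follows. This is a hypothetical variant of `load_parking_tickets` that takes its source as an argument (an assumption, so that it can be tried on a small in-memory sample). It reads only the grouping columns plus `'Summons Number'`.

```python
import io
import pandas as pd

GROUPBY_COLS = ['Issue Date', 'Violation County', 'Violation Precinct', 'Violation Code']

def load_parking_tickets(src):
    """Read only the needed columns from src and count summonses per group."""
    cols = GROUPBY_COLS + ['Summons Number']
    df = pd.read_csv(src, usecols=cols, parse_dates=['Issue Date'],
                     dtype={'Violation County': str})
    # groupby drops rows whose key is NaN by default, so label missing counties.
    df['Violation County'] = df['Violation County'].fillna('NAN')
    return df.groupby(GROUPBY_COLS, as_index=False)['Summons Number'].count()

# Made-up sample rows; 'Plate ID' is present in the source but never loaded.
sample = io.StringIO(
    "Summons Number,Plate ID,Issue Date,Violation Code,Violation County,Violation Precinct\n"
    "1,AAA,07/09/2015,46,K,70\n"
    "2,BBB,07/09/2015,46,K,70\n"
    "3,CCC,07/09/2015,21,,0\n"
)
result = load_parking_tickets(sample)
print(result)
```

Because the columns that trigger the mixed-dtype warning are mostly outside this subset, restricting the read also sidesteps most of the warning, besides reducing memory use.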

Let's now load the entire file of about 11 million rows into memory (if possible).

In [10]:
%time year_2016_df = pd.read_csv(DATA_SRC, parse_dates=["Issue Date"])



CPU times: user 30.6 s, sys: 6.7 s, total: 37.3 s
Wall time: 37.3 s
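
If loading the whole file at once is not feasible on your machine, one possible workaround (not used in the rest of this series) is to aggregate chunk by chunk with the `chunksize` option of `read_csv`. The sketch below is an assumption-laden illustration: the function name `count_tickets_in_chunks`, the reduced set of grouping columns, and the sample rows are all made up.

```python
import io
import pandas as pd

GROUPBY_COLS = ['Violation County', 'Violation Code']

def count_tickets_in_chunks(src, chunksize):
    """Aggregate summons counts chunk by chunk, then combine the partial counts."""
    partials = []
    reader = pd.read_csv(src, usecols=GROUPBY_COLS + ['Summons Number'],
                         dtype={'Violation County': str}, chunksize=chunksize)
    for chunk in reader:
        chunk['Violation County'] = chunk['Violation County'].fillna('NAN')
        partials.append(chunk.groupby(GROUPBY_COLS)['Summons Number'].count())
    # The same group may appear in several chunks, so sum the partial counts.
    combined = pd.concat(partials).groupby(level=GROUPBY_COLS).sum()
    return combined.reset_index()

# Made-up sample rows, read two at a time.
sample = io.StringIO(
    "Summons Number,Violation Code,Violation County\n"
    "1,46,K\n"
    "2,21,\n"
    "3,46,K\n"
    "4,21,Q\n"
    "5,46,K\n"
)
result = count_tickets_in_chunks(sample, chunksize=2)
print(result)
```

Only one chunk plus the small partial results are held in memory at a time, at the cost of a second aggregation step.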


---------------------

## Working with Parquet files

The strategies above address the issue of parsing columns from a CSV file as mixed dtypes; if we're stuck with no choice but to use CSV files, so be it. But another option is available: re-encoding the data as *Parquet* files. This has a number of advantages, but encoding the data as a Parquet file does require being explicit about the column data types. To resolve the matter of a few columns being parsed as a mixture of data types, we can convert those columns to strings after reading:

In [11]:
year_2016_df['Summons Number'] = year_2016_df['Summons Number'].astype(str) # This identifier
year_2016_df['Violation County'] = year_2016_df['Violation County'].astype(str)
year_2016_df['House Number'] = year_2016_df['House Number'].astype(str)
year_2016_df['Issuer Squad'] = year_2016_df['Issuer Squad'].astype(str)
year_2016_df['Unregistered Vehicle?'] = year_2016_df['Unregistered Vehicle?'].astype(str)
year_2016_df['Violation Post Code'] = year_2016_df['Violation Post Code'].astype(str)

We can now encode the dataframe as a Parquet file using the `DataFrame.to_parquet` method. Actually, we'll do it twice: once without using any keyword options and the second time using the `row_group_size` option to segment the resulting Parquet file (more on this later).

In [12]:
DATA_TARGET = LOCAL_PATH / 'Parking_Violations_Issued_-_Fiscal_Year_2016.parquet'
if not DATA_TARGET.exists():
    year_2016_df.to_parquet(DATA_TARGET)

In [13]:
DATA_TARGET = LOCAL_PATH / 'Parking_Violations_Issued_-_Fiscal_Year_2016_segmented.parquet'
if not DATA_TARGET.exists():
    year_2016_df.to_parquet(DATA_TARGET, row_group_size=100_000, engine='pyarrow')

There are a number of advantages to using Parquet files. First of all, Parquet encoding comes with considerable space savings. This data required ~2 GB when stored as a CSV file; as a Parquet file (with the same data, and even dtypes to boot) the storage needed is about 380 MB.

And then there's how the data is read in: Parquet files are *column-oriented* (rather than row-oriented like CSV files). So reading in a selected subset of columns from Parquet is very efficient (saving both time and RAM overhead). By contrast, for CSV files, you need to read through the entire file... and wait for each row to be parsed to extract the required columns.

---------------------

## Repeating for the 2017 data

If you have the memory to do so, you can just do the same again for 2017. Otherwise, you'll likely need to restart your kernel and skip all the 2016 cells before repeating the write for 2017. Notice that slightly different columns yield mixed-dtype warning messages for this file.

In [14]:
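import pandas as pd
import pathlib
# Redefine LOCAL_PATH in case the kernel was restarted
LOCAL_PATH = pathlib.Path('.') / 'ParkingData'
DATA_SRC = LOCAL_PATH / 'Parking_Violations_Issued_-_Fiscal_Year_2017.csv'
year_2017_df = pd.read_csv(DATA_SRC, parse_dates=["Issue Date"])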



In [15]:
year_2017_df['Summons Number'] = year_2017_df['Summons Number'].astype(str) # This identifier
year_2017_df['Violation County'] = year_2017_df['Violation County'].astype(str)
year_2017_df['House Number'] = year_2017_df['House Number'].astype(str)
year_2017_df['Issuer Squad'] = year_2017_df['Issuer Squad'].astype(str)
year_2017_df['Unregistered Vehicle?'] = year_2017_df['Unregistered Vehicle?'].astype(str)
year_2017_df['Violation Post Code'] = year_2017_df['Violation Post Code'].astype(str)

In [16]:
DATA_TARGET = LOCAL_PATH / 'Parking_Violations_Issued_-_Fiscal_Year_2017.parquet'
if not DATA_TARGET.exists():
    year_2017_df.to_parquet(DATA_TARGET)

In [17]:
DATA_TARGET = LOCAL_PATH / 'Parking_Violations_Issued_-_Fiscal_Year_2017_segmented.parquet'
if not DATA_TARGET.exists():
    year_2017_df.to_parquet(DATA_TARGET, row_group_size=100_000, engine='pyarrow')

Now we have the data encoded as CSV, Parquet, and segmented Parquet files—in the next notebook we'll take a look at what that gets us.

---------------------