## Exercise - cuDF - NYC Parking Violations

We've learned how to work with numeric data using CuPy, but many applications in data science and machine learning involve other kinds of data, such as dates and strings.

[cuDF](https://docs.rapids.ai/api/cudf/stable/) is a DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It offers both a [Pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) and a [Polars](https://docs.rapids.ai/api/cudf/stable/cudf_polars/) API.
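For the Pandas side, one way to use cuDF without changing existing code is the `cudf.pandas` accelerator mode described in the linked docs. The sketch below is a minimal illustration, not part of the original notebook: it tries to enable the accelerator before Pandas is imported and, as an assumption made so the example runs anywhere, falls back to regular Pandas when cuDF or a usable GPU is missing. The `accelerated` flag is a name introduced here for illustration.

```python
# Try to enable the cudf.pandas accelerator before pandas is imported.
# If cuDF (or a usable GPU) is unavailable, fall back to regular Pandas.
try:
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except Exception:
    accelerated = False

import pandas as pd

# The code below is ordinary Pandas either way.
s = pd.Series([1, 2, 3])
print("accelerated:", accelerated, "| max:", s.max())
```

In a Jupyter notebook, the equivalent is to run `%load_ext cudf.pandas` in a cell before importing Pandas.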

### A quick Pandas introduction

In [1]:
import pandas as pd

#### Series and DataFrame objects

In [2]:
s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

In [3]:
print("Max value: ", s.max())
print("Mean value: ", s.mean())

Max value:  3
Mean value:  2.0


In [4]:
s = pd.Series(["one", "two", "three"])

In [5]:
print("Max value: ", s.max())

Max value:  two
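
The result `two` may look surprising. For strings, `max` compares values lexicographically (character by character), not by the numbers the words spell out. A minimal sketch using the same Series:

```python
import pandas as pd

s = pd.Series(["one", "two", "three"])

# Strings are compared character by character: "two" sorts after "three"
# because "w" comes after "h", and both sort after "one" because "t" > "o".
largest = s.max()
smallest = s.min()

# Python's built-in max() applies the same ordering to a plain list.
assert largest == max(["one", "two", "three"])

print(largest, smallest)
```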


In [6]:
df = pd.DataFrame({
    "a" : [1, 2, 1, 3, 2],
    "b" : [1, 4, 7, 2, 0],
    "c" : [3, 3, 3, 4, 5]
}, index = [1, 2, 3, 4, 5])
df

Unnamed: 0,a,b,c
1,1,1,3
2,2,4,3
3,1,7,3
4,3,2,4
5,2,0,5


In [7]:
df.index

Index([1, 2, 3, 4, 5], dtype='int64')

In [8]:
df.columns

Index(['a', 'b', 'c'], dtype='object')

#### Selecting and filtering data

In [9]:
df.head(2)

Unnamed: 0,a,b,c
1,1,1,3
2,2,4,3


In [10]:
df.tail(2)

Unnamed: 0,a,b,c
4,3,2,4
5,2,0,5


In [11]:
df["a"]

1    1
2    2
3    1
4    3
5    2
Name: a, dtype: int64

In [12]:
df[["b", "c"]]

Unnamed: 0,b,c
1,1,3
2,4,3
3,7,3
4,2,4
5,0,5


In [13]:
df.iloc[0:2]

Unnamed: 0,a,b,c
1,1,1,3
2,2,4,3


In [14]:
df.iloc[0, 1:3]

b    1
c    3
Name: 1, dtype: int64

In [15]:
df.loc[2:3, "b":"c"]

Unnamed: 0,b,c
2,4,3
3,7,3
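
The two indexers used above differ in an important way: `iloc` selects by integer position and excludes the end of a slice, while `loc` selects by label and includes it. A short sketch on the same DataFrame makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 1, 3, 2], "b": [1, 4, 7, 2, 0], "c": [3, 3, 3, 4, 5]},
    index=[1, 2, 3, 4, 5],
)

# iloc slices by position and excludes the stop: positions 0 and 1.
by_position = df.iloc[0:2]

# loc slices by label and includes the stop: labels 1 and 2.
by_label = df.loc[1:2]

# With this index the two selections happen to contain the same rows...
assert by_position.equals(by_label)

# ...but the same numbers mean different things to each indexer.
print(df.iloc[2:3].index.tolist())  # position 2 only
print(df.loc[2:3].index.tolist())   # labels 2 and 3
```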


In [16]:
df[df['a'] > 1]

Unnamed: 0,a,b,c
2,2,4,3
4,3,2,4
5,2,0,5


#### Sorting

In [17]:
df.sort_values("a")

Unnamed: 0,a,b,c
1,1,1,3
3,1,7,3
2,2,4,3
5,2,0,5
4,3,2,4


#### Summarizing Data

In [18]:
df.sum()

a     9
b    14
c    18
dtype: int64

In [19]:
df["a"].mean()

np.float64(1.8)

#### Grouped aggregations

In [20]:
df["a"].value_counts()

a
1    2
2    2
3    1
Name: count, dtype: int64

In [21]:
df["c"].value_counts()

c
3    3
4    1
5    1
Name: count, dtype: int64

In [22]:
df.groupby("a").mean()

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4.0,3.0
2,2.0,4.0
3,2.0,4.0


In [23]:
df.groupby("c").count()

Unnamed: 0_level_0,a,b
c,Unnamed: 1_level_1,Unnamed: 2_level_1
3,3,3
4,1,1
5,1,1


In [24]:
df.groupby("a").agg({"b": ["min", "mean"], "c": ["max"]})

Unnamed: 0_level_0,b,b,c
Unnamed: 0_level_1,min,mean,max
a,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,1,4.0,3
2,0,2.0,5
3,2,2.0,4


#### String operations

In [25]:
df["d"] = ["mario", "luigi", "yoshi", "peach", "toad"]
df

Unnamed: 0,a,b,c,d
1,1,1,3,mario
2,2,4,3,luigi
3,1,7,3,yoshi
4,3,2,4,peach
5,2,0,5,toad


In [26]:
df["d"].str.upper()

1    MARIO
2    LUIGI
3    YOSHI
4    PEACH
5     TOAD
Name: d, dtype: object

#### Time Series

In [27]:
import numpy as np

date_df = pd.DataFrame()
date_df["date"] = pd.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df

Unnamed: 0,date,value
0,2018-11-20,0.453224
1,2018-11-21,0.127411
2,2018-11-22,0.439165
3,2018-11-23,0.694320
4,2018-11-24,0.979057
...,...,...
67,2019-01-26,0.326564
68,2019-01-27,0.303860
69,2019-01-28,0.936735
70,2019-01-29,0.307780


In [28]:
date_df[date_df["date"] < "2018-11-24"]

Unnamed: 0,date,value
0,2018-11-20,0.453224
1,2018-11-21,0.127411
2,2018-11-22,0.439165
3,2018-11-23,0.69432


In [29]:
date_df["year"] = date_df["date"].dt.year
date_df

Unnamed: 0,date,value,year
0,2018-11-20,0.453224,2018
1,2018-11-21,0.127411,2018
2,2018-11-22,0.439165,2018
3,2018-11-23,0.694320,2018
4,2018-11-24,0.979057,2018
...,...,...,...
67,2019-01-26,0.326564,2019
68,2019-01-27,0.303860,2019
69,2019-01-28,0.936735,2019
70,2019-01-29,0.307780,2019


#### User-defined operations

In [30]:
def add_ten(x):
    return x + 10

df["a"] = df["a"].apply(add_ten)
df

Unnamed: 0,a,b,c,d
1,11,1,3,mario
2,12,4,3,luigi
3,11,7,3,yoshi
4,13,2,4,peach
5,12,0,5,toad


### Now let's do the same thing with cuDF

In [31]:
import cudf

In [32]:
df = cudf.DataFrame({
    "a" : [1, 2, 1, 3, 2],
    "b" : [1, 4, 7, 2, 0],
    "c" : [1, 1, 8, 2, 9]
}, index = [1, 2, 3, 4, 5])
df

Unnamed: 0,a,b,c
1,1,1,1
2,2,4,1
3,1,7,8
4,3,2,2
5,2,0,9


In [33]:
type(df)

cudf.core.dataframe.DataFrame

In [34]:
df.loc[2:3, "b":"c"]

Unnamed: 0,b,c
2,4,1
3,7,8


In [35]:
df.groupby("a").agg({"b": ["min", "mean"], "c": ["max"]})

Unnamed: 0_level_0,b,b,c
Unnamed: 0_level_1,min,mean,max
a,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2,0,2.0,9
1,1,4.0,8
3,2,2.0,2


Some things are different, though!

In [36]:
import numpy as np

date_df = cudf.DataFrame()
date_df["date"] = cudf.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df

ValueError: Error parsing datetime string "11/20/2018" at position 2

Unlike Pandas, cuDF cannot (yet) parse the date string `"11/20/2018"`. Use the ISO 8601 format `"2018-11-20"` instead:

In [37]:
date_df = cudf.DataFrame()
date_df["date"] = cudf.date_range("2018-11-20", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df

Unnamed: 0,date,value
0,2018-11-20,0.896092
1,2018-11-21,0.524976
2,2018-11-22,0.375614
3,2018-11-23,0.950067
4,2018-11-24,0.291348
...,...,...
67,2019-01-26,0.099531
68,2019-01-27,0.372808
69,2019-01-28,0.379842
70,2019-01-29,0.164194


In [38]:
date_df[date_df["date"] < "2018-11-24"]

Unnamed: 0,date,value
0,2018-11-20,0.896092
1,2018-11-21,0.524976
2,2018-11-22,0.375614
3,2018-11-23,0.950067
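
If your dates arrive in month/day/year form, one option is to normalize them with the standard library before handing them to cuDF. The sketch below uses only `datetime`; `to_iso_date` is a hypothetical helper name introduced for illustration, and it assumes the input always follows the `MM/DD/YYYY` pattern.

```python
from datetime import datetime


def to_iso_date(us_date: str) -> str:
    """Convert a month/day/year string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%Y-%m-%d")


# The string that cuDF rejected above, converted to the accepted format.
start = to_iso_date("11/20/2018")
print(start)
```

The resulting string can then be passed to `cudf.date_range` in place of the original one.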


In [39]:
def add_ten(x):
    return x + 10

df["a"] = df["a"].apply(add_ten)
df

ValueError: user defined function compilation failed.
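
When a user-defined function fails like this, simple arithmetic can often be written as a vectorized expression instead, which avoids `apply` entirely. The sketch below uses Pandas so that it runs without a GPU. The assumption, not shown in the original notebook, is that the same expression also works on a cuDF Series, since cuDF supports element-wise arithmetic operators.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 3, 2]}, index=[1, 2, 3, 4, 5])

# Element-wise addition on the whole column: no Python function is
# called per row, so there is nothing that needs to be compiled.
df["a"] = df["a"] + 10

print(df["a"].tolist())
```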

### Exercise: Working With Real Data

In this exercise, you'll use Pandas to analyze some real-world data, and then repeat the analysis with cuDF.

#### Download the data

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data.

To provide faster download speeds, we'll download a copy of this dataset from an S3 bucket hosted by NVIDIA. This should take about 30 seconds.

#### Data License and Terms

As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

##### Are there restrictions on how I can use Open Data?

> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

##### [Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)

> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget -nc https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet -O nyc_parking_violations_2022.parquet

In [None]:
import pandas as pd

data = pd.read_parquet("nyc_parking_violations_2022.parquet")
data.head()

#### Task 1

This dataset is relatively large, with lots of columns.

* How many columns are there?
* Extract a subset of the data with just the following columns:
  * `"Registration State"`
  * `"Violation Description"`
  * `"Vehicle Body Type"`
  * `"Issue Date"`

#### Task 2

For vehicles with body type `"TAXI"`, what is the number of vehicles from each state?

#### Task 3

Now, repeat the analysis (starting from `read_parquet`) using cuDF. How much faster is it compared to Pandas? To measure the execution time of a cell in a Jupyter notebook, add `%%time` as the first line of the cell. For example:

```python
%%time

import cudf
data = cudf.read_parquet("nyc_parking_violations_2022.parquet")
data.head()
```
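
`%%time` only works inside IPython or Jupyter. In a plain Python script, a small helper built on `time.perf_counter` gives a comparable wall-clock measurement. The `timed` helper and the `timings` dictionary below are hypothetical names introduced for illustration, and the workload is a stand-in for the `read_parquet` and analysis steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; replace with the Pandas or cuDF code being compared.
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Running the Pandas and cuDF versions under separate labels puts both durations in `timings`, so the speedup is simply their ratio.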

### Resources

* `cudf.pandas` docs: https://docs.rapids.ai/api/cudf/stable/cudf_pandas/
* cuDF documentation: https://docs.rapids.ai/api/cudf/stable
* cuDF API reference: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/