# Exercise - cuDF - NYC Parking Violations

We've learned how to work with numeric data using CuPy. But many applications in data science and machine learning involve other kinds of data, such as dates and strings.

[cuDF](https://docs.rapids.ai/api/cudf/stable/) is a DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It offers both a [Pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) and a [Polars](https://docs.rapids.ai/api/cudf/stable/cudf_polars/) API.

## A quick Pandas introduction

In [None]:
import pandas as pd

### Series and DataFrame objects

In [None]:
s = pd.Series([1, 2, 3])
s

In [None]:
print("Max value: ", s.max())
print("Mean value: ", s.mean())

In [None]:
s = pd.Series(["one", "two", "three"])

In [None]:
print("Max value: ", s.max())

In [None]:
df = pd.DataFrame({
    "a" : [1, 2, 1, 3, 2],
    "b" : [1, 4, 7, 2, 0],
    "c" : [3, 3, 3, 4, 5]
}, index = [1, 2, 3, 4, 5])
df

In [None]:
df.index

In [None]:
df.columns

### Selecting and filtering data

In [None]:
df.head(2)

In [None]:
df.tail(2)

In [None]:
df["a"]

In [None]:
df[["b", "c"]]

In [None]:
df.iloc[0:2]

In [None]:
df.iloc[0, 1:3]

In [None]:
df.loc[2:3, "b":"c"]

In [None]:
df[df['a'] > 1]

### Sorting

In [None]:
df.sort_values("a")

### Summarizing data

In [None]:
df.sum()

In [None]:
df["a"].mean()

### Grouped aggregations

In [None]:
df["a"].value_counts()

In [None]:
df["c"].value_counts()

In [None]:
df.groupby("a").mean()

In [None]:
df.groupby("c").count()

In [None]:
df.groupby("a").agg({"b": ["min", "mean"], "c": ["max"]})

### String operations

In [None]:
df["d"] = ["mario", "luigi", "yoshi", "peach", "toad"]
df

In [None]:
df["d"].str.upper()

### Time series

In [None]:
import numpy as np

date_df = pd.DataFrame()
date_df["date"] = pd.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df

In [None]:
date_df[date_df["date"] < "2018-11-24"]

In [None]:
date_df["year"] = date_df["date"].dt.year
date_df

### User-defined operations

In [None]:
def add_ten(x):
    return x + 10

df["a"] = df["a"].apply(add_ten)
df

## Now let's do the same thing with cuDF

In [None]:
import cudf

In [None]:
df = cudf.DataFrame({
    "a" : [1, 2, 1, 3, 2],
    "b" : [1, 4, 7, 2, 0],
    "c" : [1, 1, 8, 2, 9]
}, index = [1, 2, 3, 4, 5])
df

In [None]:
type(df)

In [None]:
df.loc[2:3, "b":"c"]

In [None]:
df.groupby("a").agg({"b": ["min", "mean"], "c": ["max"]})

Some things are different, though!

In [None]:
import numpy as np

date_df = cudf.DataFrame()
date_df["date"] = cudf.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df

Unlike Pandas, cuDF cannot (yet) interpret the date string `"11/20/2018"`. Use the ISO 8601 form `"2018-11-20"` instead:

In [None]:
date_df = cudf.DataFrame()
date_df["date"] = cudf.date_range("2018-11-20", periods=72, freq="D")
date_df["value"] = np.random.sample(len(date_df))
date_df
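
When dates arrive as US-style strings such as `"11/20/2018"`, one option is to normalize them to ISO 8601 before handing them to cuDF. The following is a minimal sketch using only the Python standard library; the helper name `to_iso_date` is hypothetical, and it assumes the input is always in month/day/year order.

```python
from datetime import datetime


def to_iso_date(us_date: str) -> str:
    """Convert a month/day/year string to ISO 8601 (YYYY-MM-DD)."""
    # strptime raises ValueError if the string does not match the format.
    parsed = datetime.strptime(us_date, "%m/%d/%Y")
    return parsed.strftime("%Y-%m-%d")


start = to_iso_date("11/20/2018")
print(start)  # 2018-11-20
```

The resulting string can then be passed to `cudf.date_range(start, periods=72, freq="D")` as in the cell above.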

In [None]:
date_df[date_df["date"] < "2018-11-24"]

In [None]:
def add_ten(x):
    return x + 10

df["a"] = df["a"].apply(add_ten)
df

## Exercise: Working With Real Data

In this exercise, you'll use Pandas to analyze some real-world data, and then repeat the analysis with cuDF.

### Download the data

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data.

We're downloading a copy of this dataset from an S3 bucket hosted by NVIDIA to provide faster download speeds. The download should take about 30 seconds.

### Data License and Terms

As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

#### Are there restrictions on how I can use Open Data?

> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

#### [Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)

> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget -nc https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet -O nyc_parking_violations_2022.parquet
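
On systems where `wget` is not available, a similar skip-if-present download can be sketched in plain Python. This is an illustrative alternative, not part of the original notebook; `download_if_missing` is a hypothetical helper built on the standard library's `urllib.request.urlretrieve`.

```python
import os
import urllib.request


def download_if_missing(url: str, path: str) -> bool:
    """Download url to path unless path already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(path):
        # Same idea as `wget -nc`: never overwrite an existing file.
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

Calling `download_if_missing(url, "nyc_parking_violations_2022.parquet")` with the URL from the cell above would fetch the file on the first run and skip it afterwards. Note that an interrupted download can leave a partial file behind, which this sketch would then treat as already present.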

In [None]:
import pandas as pd

data = pd.read_parquet("nyc_parking_violations_2022.parquet")
data.head()

### Task 1

This dataset is relatively large, with lots of columns.

* How many columns are there?
* Extract a subset of the data with just the following columns:
  * `"Registration State"`
  * `"Violation Description"`
  * `"Vehicle Body Type"`
  * `"Issue Date"`

### Task 2

For vehicles with body type `"TAXI"`, what is the number of vehicles from each state?

### Task 3

Now, repeat the analysis (starting from `read_parquet`) using cuDF. How much faster is it compared to Pandas? To measure the execution time of a cell in a Jupyter notebook, add `%%time` as the first line of the cell. For example:

```python
%%time

import cudf
data = cudf.read_parquet("nyc_parking_violations_2022.parquet")
data.head()
```
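
Outside Jupyter, where `%%time` is not available, a comparable measurement can be made with `time.perf_counter`. The following is a minimal sketch; `timed` is a hypothetical helper, and the workload is a stand-in rather than the parking dataset.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; swap in e.g. pd.read_parquet or cudf.read_parquet.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

Timing the same function once with Pandas and once with cuDF gives the two numbers to compare. The first GPU call in a session may include one-time initialization overhead, so it can be worth timing a second run as well.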

## Resources

* `cudf.pandas` docs: https://docs.rapids.ai/api/cudf/stable/cudf_pandas/
* cuDF documentation: https://docs.rapids.ai/api/cudf/stable
* cuDF API reference: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/