# Data Analysis with cuDF

We've learned how to work with numeric data using CuPy. But many applications in data science and machine learning involve other kinds of data, like dates and strings.

[cuDF](https://docs.rapids.ai/api/cudf/stable/) is a DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It offers both a [Pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) and a [Polars](https://docs.rapids.ai/api/cudf/stable/cudf_polars/) API.

# Download the data

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data.

We'll start by downloading a copy of this dataset from an S3 bucket hosted by NVIDIA, which provides faster download speeds. This should take about 30 seconds.

## Data License and Terms
As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

### Are there restrictions on how I can use Open Data?

> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

### [Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)

> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget -nc https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet -O nyc_parking_violations_2022.parquet

## Analysis using standard Pandas

First, let's use Pandas to read in some columns of the dataset:

In [None]:
# %load_ext cudf.pandas  # left disabled so this section runs on standard Pandas; Exercise 1 enables it

In [None]:
import pandas as pd

In [None]:
%%time
# read 5 columns of data (the file was downloaded to the current directory above):
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"],
)
df["Issue Date"] = df["Issue Date"].astype("datetime64[s]")

# view a random sample of 10 rows:
df.sample(10)

Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking offence. Let's say we want to get the most common type of offence for vehicles registered in different states. We can do this in Pandas using a combination of [value_counts](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) and [GroupBy.head](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.head.html):

In [None]:
%%time
(df[["Registration State", "Violation Description"]]  # get only these two columns
 .value_counts()  # get the count of offences per state and per type of offence
 .groupby("Registration State")  # group by state
 .head(1)  # get the first row in each group (the type of offence with the largest count)
 .sort_index()  # sort by state name
 .reset_index()
)

The code above uses [method chaining](https://tomaugspurger.net/posts/method-chaining/) to combine a series of operations into a single statement. You might find it useful to break the code up into multiple statements and inspect each of the intermediate results!
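As a minimal sketch of that suggestion, the chain can be split into named steps. The tiny `toy` DataFrame and the variable names below are made up for illustration; only the two column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Registration State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
    "Violation Description": [
        "NO PARKING", "NO PARKING", "EXPIRED METER",
        "EXPIRED METER", "EXPIRED METER", "NO PARKING",
    ],
})

# Step 1: count each (state, offence) pair; the result is a Series
# sorted by count in descending order
pair_counts = toy.value_counts()

# Step 2: keep the first (largest-count) row for each state
top_per_state = pair_counts.groupby("Registration State").head(1)

# Step 3: sort by state and turn the index levels back into columns
result = top_per_state.sort_index().reset_index()
print(result)
```

Printing `pair_counts` and `top_per_state` on their own shows why `head(1)` picks the most common offence: the counts are already in descending order before the grouping happens.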

### Which vehicle body types are most frequently involved in parking violations?

We can also investigate which vehicle body types most commonly appear in parking violations:

In [None]:
%%time
(df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)

### How do parking violations vary across days of the week?

In [None]:
%%time
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

## Exercise 1: using cudf.pandas

In this exercise, you'll repeat the analysis we did above using `cudf.pandas`. Fortunately, it needs no code changes at all.

1. Make note of the timings we obtained in the previous cells.
2. Restart the notebook kernel (Kernel -> Restart Kernel).
3. Before the line `import pandas as pd`, insert the following line of code:

```python
%load_ext cudf.pandas
```

4. Re-run the rest of the cells and note the new timings.
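The `%load_ext` magic is only available inside Jupyter or IPython. For a plain Python script, cuDF documents two routes: running the script with `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` before Pandas is imported. The sketch below uses the second route. The fallback to ordinary CPU Pandas when cuDF is not installed, and the toy data, are assumptions made for this example only.

```python
# Enable the accelerator before pandas is imported, if cuDF is available.
try:
    import cudf.pandas
    cudf.pandas.install()
    gpu_accelerated = True
except ImportError:
    # cuDF is not installed: continue with regular CPU pandas
    gpu_accelerated = False

import pandas as pd

# Hypothetical toy data; the pandas code is the same in both cases
demo = pd.DataFrame({"state": ["NY", "NJ", "NY"], "tickets": [3, 5, 2]})
totals = demo.groupby("state")["tickets"].sum()
print("cudf.pandas active:", gpu_accelerated)
print(totals)
```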

# Understanding Performance

`cudf.pandas` works by using the GPU for operations that are supported, and falling back to Pandas (CPU) when an operation is not supported.

The `line_profile` magic can help you figure out which lines of your code ran on the GPU and which fell back to the CPU:

In [None]:
%%cudf.pandas.line_profile

small_df = pd.DataFrame({'a': ["0", "1", "2", "0", "1", "2"], 
                         'b': ["x", "y", "z", "x", "y", "z"]})
small_df.min(axis=0)
small_df.min(axis=1)
counts = small_df.groupby("a").b.count()

## Exercise 2: writing GPU-optimized Pandas code

`cudf.pandas` isn't completely magic. It's very possible to write code that will perform quite badly if you don't know what you're doing. Below is some code to extract just the records for violations in the month of March:

In [None]:
%%time
def is_in_march(datetime):
    datetime = str(datetime)  # "YYYY-MM-DD HH:MM:SS"
    date = datetime.split(" ")[0]
    year, month, day = date.split("-")
    return month == "03"

date = df["Issue Date"]
cond = date.apply(is_in_march)
df[cond].head()

The above snippet takes several seconds to complete - why do you think it's so slow? Rewrite the code above to do the same operation using more efficient operations.
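As a hint that stops short of the solution, the sketch below contrasts the two styles on a different filter: selecting weekend dates. The toy Series and the function name are invented for illustration. `Series.apply` with a Python function handles one element at a time in the Python interpreter, whereas the `.dt` accessor expresses the same test as a single column-wide operation.

```python
import pandas as pd

# Hypothetical toy data: seven consecutive days starting on a Monday
dates = pd.Series(pd.date_range("2022-03-07", periods=7, freq="D"))

def is_weekend(ts):
    # Element-wise Python function, called once per value
    return ts.weekday() >= 5

# Slow style: a Python-level call for every row
mask_apply = dates.apply(is_weekend)

# Vectorized style: one operation over the whole column
mask_vectorized = dates.dt.weekday >= 5

# Both masks select the same rows
print(mask_apply.tolist() == mask_vectorized.tolist())
print(dates[mask_vectorized])
```

On a seven-row Series the difference is invisible, but on millions of rows the per-element calls dominate the runtime.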