# DATA 608: HW3

**Learning Objectives**
- Perform exploratory data analysis on a large dataset. 
- Investigate performance improvements promised by GPUs.

_This is an individual homework assignment._ 

Please complete this homework assignment within the Jupyter notebook environment, providing markdown cells and code cells as appropriate.

#### Submission 

Your submission will be manually graded. Please submit via Gradescope.

_Please submit two Jupyter notebook files, one for Q1 and one for Q2._

## Question 1: Exploring a Large Dataset

In this question, you are asked to use Dask to explore a large dataset that is too big to completely fit within memory.

### Step 1: Setup Environment

Please set up a cloud instance with at least 8 GB of memory and preferably 4 vCPUs. Install Dask on your instance and ensure that you are able to launch a local cluster.

In your Jupyter notebook, include appropriate cells that show the configuration of your instance and Dask cluster:
- No. of CPUs
- Amount of RAM 
- No. of workers in your cluster
- No. of threads per worker in your cluster
- Amount of memory per worker in your cluster.
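
One way to collect these values is sketched below. It is a minimal sketch, not a required approach: the helper names `summarize_workers` and `report_environment` and the default cluster sizes are hypothetical, and it assumes that `psutil` is available (it is normally installed with `dask.distributed`) and that each worker entry returned by `Client.scheduler_info()` carries `nthreads` and `memory_limit` keys.

```python
import os


def summarize_workers(workers):
    """Summarize the `workers` mapping taken from `Client.scheduler_info()`."""
    return {
        "n_workers": len(workers),
        "threads_per_worker": sorted({w["nthreads"] for w in workers.values()}),
        "memory_per_worker_gib": sorted(
            {round(w["memory_limit"] / 2**30, 2) for w in workers.values()}
        ),
    }


def report_environment(n_workers=4, threads_per_worker=1, memory_limit="2GiB"):
    """Launch a local cluster and return the client plus a configuration report."""
    # Imported here so the pure helper above can be used without Dask installed.
    import psutil
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,  # memory limit per worker
    )
    client = Client(cluster)
    report = {
        "cpus": os.cpu_count(),
        "ram_gib": round(psutil.virtual_memory().total / 2**30, 2),
        **summarize_workers(client.scheduler_info()["workers"]),
    }
    return client, report
```

In a notebook cell, `client, report = report_environment()` followed by `report` would display the values; displaying `client` itself also renders a summary of the cluster.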

### Step 2: Download and Organize Data

The data for this question can be downloaded from the NYC Taxi and Limousine Commission (TLC). Follow the link below to download _Yellow Taxi Trip Records_ for all months of the year _2023_. The files are available in Parquet format, with one file per month.

- [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

More information about the different fields is available at the following link:
- [Yellow Trips Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)

Store the files locally within block storage (for fast reads) that is available to your instance. Ensure that you put all the files within a dedicated folder.

You may find it helpful to write a short Python loop to download the files. You can use the Unix command `wget` to download the files.  
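
A download loop along these lines might look as follows. This is a sketch under stated assumptions: the base URL and file-name pattern are assumed from the links on the TLC page and should be checked there before use, and the folder name `data/yellow_2023` and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed URL pattern; confirm it against the links on the TLC page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_urls(year=2023):
    """Return one Yellow Taxi parquet URL per month of `year`."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]


def download_all(dest="data/yellow_2023", year=2023):
    """Download every monthly file into the dedicated folder `dest` with wget."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    for url in monthly_urls(year):
        # -nc skips files that already exist; -P sets the target folder.
        subprocess.run(["wget", "-nc", "-P", dest, url], check=True)
```

Calling `download_all()` once from a notebook cell fetches the twelve files; rerunning it is harmless because existing files are skipped.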

Altogether, there are over 30 million records for the year 2023. It is therefore not advisable to load all of the data into memory at once.

### Step 3: Counts by Month and Weekday

Use your Dask cluster to read _relevant_ data to produce the following charts:

1. Monthly distribution of trips for the year 2023 (i.e. how many trips were there in January, February, March, ...,  December).
2. Weekday distribution of trips for the year 2023 (i.e. how many trips fall on a Monday, Tuesday, Wednesday, ... , Sunday).

For the tasks above, use the trip start time to determine the month/weekday of the trip.

For at least one of the tasks above, please produce a visualization of the Dask task graph associated with the task.

Please include commentary on the following:
- For your chosen task graph, please explain how the computation is parallelized.
- Comment on the monthly and weekday distribution of trips. Are there any noticeable trends? Which months are the busiest? Which weekdays are the busiest?

_Hint: Date/time information should be read in as appropriate `datetime` types. Please use existing [`datetime`](https://pandas.pydata.org/docs/user_guide/timeseries.html) functionality whenever possible._  
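
The counting itself can be sketched as below. This is a minimal sketch, assuming the pickup column is named `tpep_pickup_datetime` as in the data dictionary; the helper names are hypothetical, and the filter on `year` is an optional precaution in case any file contains timestamps outside the target year. The helpers accept either a pandas Series or a Dask Series, so the logic can be checked on a tiny in-memory sample first.

```python
import pandas as pd

PICKUP = "tpep_pickup_datetime"  # trip start time, per the data dictionary


def load_pickups(folder):
    """Lazily read only the pickup column from every parquet file in `folder`."""
    import dask.dataframe as dd

    return dd.read_parquet(folder, columns=[PICKUP])[PICKUP]


def _counts(values):
    """Count occurrences of each value; accepts a pandas or Dask Series."""
    counts = values.value_counts()
    if hasattr(counts, "compute"):  # Dask: execute the task graph on the cluster
        counts = counts.compute()
    return counts.sort_index()


def monthly_counts(pickups, year=2023):
    """Number of trips per month (1 = January, ..., 12 = December)."""
    pickups = pickups[pickups.dt.year == year]
    return _counts(pickups.dt.month)


def weekday_counts(pickups, year=2023):
    """Number of trips per weekday (0 = Monday, ..., 6 = Sunday)."""
    pickups = pickups[pickups.dt.year == year]
    return _counts(pickups.dt.dayofweek)


# Tiny in-memory check with pandas; a Dask Series is handled the same way.
sample = pd.Series(
    pd.to_datetime(["2023-01-02 08:00", "2023-01-03 09:30", "2023-02-05 23:10"])
)
print(monthly_counts(sample).to_dict())
print(weekday_counts(sample).to_dict())
```

For the task graph, calling `.visualize(filename="monthly_counts.svg")` on the lazy result (for example `load_pickups(folder).dt.month.value_counts()`) before computing it renders the graph, provided Graphviz and its Python bindings are installed.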


### Step 4: Histogram of Trip Durations

Read in _relevant_ data and perform appropriate computations to determine the duration (in seconds) for each trip for the year 2023. Then use the computed durations to produce a histogram that shows the distribution of the trip durations. 

- You can use [`dask.array.histogram`](https://docs.dask.org/en/stable/generated/dask.array.histogram.html) to compute the histogram using your Dask cluster. Alternatively, if your instance configuration allows, you may be able to convert the results to a `numpy` array and use appropriate `numpy` functions to compute the histogram.
- You will need to specify the number of bins and an appropriate range to limit your histogram. Set the number of bins to $128$ and the range to $[0, 7200]$ (seconds).

Please comment on your findings. How long are most trips?

_Hint: You can use the [`astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) function to convert `datetime` and `timedelta` types._
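
The duration and histogram computations can be sketched as follows. This is one possible approach under stated assumptions: it uses `.dt.total_seconds()` on the timedelta rather than the `astype()` route from the hint, it assumes the column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` from the data dictionary, and the helper names are hypothetical. With 128 bins over $[0, 7200]$, each bin is $7200 / 128 = 56.25$ seconds wide.

```python
import numpy as np
import pandas as pd

PICKUP = "tpep_pickup_datetime"
DROPOFF = "tpep_dropoff_datetime"
BINS, RANGE = 128, (0, 7200)  # values given in the assignment


def duration_seconds(df):
    """Trip duration in seconds; accepts a pandas or Dask DataFrame."""
    return (df[DROPOFF] - df[PICKUP]).dt.total_seconds()


def duration_histogram(durations):
    """Return (counts, bin_edges) using the assignment's bins and range."""
    if hasattr(durations, "to_dask_array"):  # Dask Series: histogram on the cluster
        import dask.array as da

        counts, edges = da.histogram(
            durations.to_dask_array(lengths=True), bins=BINS, range=RANGE
        )
        return counts.compute(), np.asarray(edges)
    # In-memory fallback for a pandas Series.
    return np.histogram(durations.to_numpy(), bins=BINS, range=RANGE)


# Tiny in-memory check: two trips lasting 600 s and 1830 s.
sample = pd.DataFrame(
    {
        PICKUP: pd.to_datetime(["2023-01-02 08:00:00", "2023-01-02 09:00:00"]),
        DROPOFF: pd.to_datetime(["2023-01-02 08:10:00", "2023-01-02 09:30:30"]),
    }
)
counts, edges = duration_histogram(duration_seconds(sample))
print(counts.sum(), edges[1] - edges[0])
```

The returned `counts` and `edges` can be passed to a bar chart (for example `matplotlib`'s `ax.bar(edges[:-1], counts, width=56.25, align="edge")`) to draw the histogram.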

### Step 5: Scatter Plot of Trip Duration and Tip Amount

Finally, produce a scatter plot that shows the relationship between trip duration and tip amount. Note that your visualization package will likely choke if you try to plot all the data. Instead, subsample the data and choose an appropriate marker style so that the visualization is clear and can be completed in a reasonable amount of time. <br>
Again, you may want to limit the trip duration to the range: $[0, 7200]$ (seconds). 

Please comment on your findings. Are there any observable relationships between trip duration and tip amount? 
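
The subsampling and plotting steps might be sketched as below. The sampling fraction, marker settings, and the column name `duration_s` (a precomputed duration in seconds) are illustrative assumptions; `tip_amount` is the column name from the data dictionary. `DataFrame.sample` accepts `frac` and `random_state` in both pandas and Dask, so the same helper covers either case.

```python
import pandas as pd


def subsample(df, frac=0.001, seed=0):
    """Random subsample of the rows, returned as an in-memory pandas DataFrame."""
    sample = df.sample(frac=frac, random_state=seed)
    return sample.compute() if hasattr(sample, "compute") else sample


def plot_duration_vs_tip(sample, max_seconds=7200):
    """Scatter plot of trip duration against tip amount for an in-memory sample."""
    import matplotlib.pyplot as plt

    # Keep only durations inside the range suggested in the assignment.
    sample = sample[sample["duration_s"].between(0, max_seconds)]
    fig, ax = plt.subplots(figsize=(8, 5))
    # Small, semi-transparent markers keep dense regions readable.
    ax.scatter(sample["duration_s"], sample["tip_amount"], s=2, alpha=0.3, marker=".")
    ax.set_xlabel("Trip duration (s)")
    ax.set_ylabel("Tip amount (USD)")
    return ax


# Small in-memory demonstration of the subsampling step with pandas.
demo = pd.DataFrame({"duration_s": range(1000), "tip_amount": [1.0] * 1000})
demo_sample = subsample(demo, frac=0.01)
print(len(demo_sample))
```

Fixing `seed` makes the sample, and therefore the plot, reproducible between notebook runs.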

## Question 2: CPU vs GPU Performance for Normalization

Normalizing observations is a common task in data preparation. In this question, you will investigate if there is any benefit to performing the normalization on the GPU rather than the CPU.

### Environment Setup

You will need access to a GPU-enabled Jupyter notebook environment that is capable of running `CuPy`. If you are unable to set up your own environment, [Google Colab](https://colab.research.google.com/) is a good option. Please ensure that you are using GPU hardware acceleration on Google Colab.

### Problem Statement

Let $\mathbf{A}$ be an $n \times m$ data matrix which consists of $n$ observations (rows), where each observation belongs to $\mathbb{R}^m$ ($m$ columns). Normalizing the rows so that they have unit 2-norm (Euclidean-norm) is a common data pre-processing step. Let $\mathbf{a}_i$ represent the $i$-th row of $\mathbf{A}$. Normalizing $\mathbf{a}_i$ replaces the $i$-th row as follows:<br>
$\mathbf{a}_i \mapsto \frac{\mathbf{a}_i}{\lVert \mathbf{a}_i \rVert}$ (where $\lVert \cdot \rVert$ represents the 2-norm).
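
As a small illustration of this definition, the sketch below normalizes the rows of a tiny matrix with `numpy`. The matrix size and the seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 512))  # n = 4 observations, m = 512 columns

# Row-wise 2-norms, kept as an (n, 1) column so the division broadcasts per row.
norms = np.linalg.norm(A, axis=1, keepdims=True)
A_normalized = A / norms

# Every row of the result should now have unit 2-norm.
print(np.allclose(np.linalg.norm(A_normalized, axis=1), 1.0))
```

Using `keepdims=True` avoids an explicit reshape: the norms have shape `(n, 1)` and divide each row of `A` element-wise.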

### Experimentation

For this question, keep $m$ fixed at $m=512$ and take increasingly larger values of $n$. For each value of $n$, generate the matrix $\mathbf{A}$ on the CPU by filling it with normally distributed random numbers of zero mean and unit variance. Gather the following timing data for each $n$. Repeat the experiment a few times for each $n$ and record the mean of each measurement.

- Time taken to normalize the rows of $\mathbf{A}$ on the CPU via `numpy`.
- Time taken to transfer the matrix $\mathbf{A}$ from the CPU to the GPU.
- Time taken to perform the normalization on the GPU via `CuPy`.

#### Suggestions

- Please use appropriate timing routines when timing your code on the GPU. Check [CuPy Performance Best Practices](https://docs.cupy.dev/en/stable/user_guide/performance.html) for guidance on how to time GPU code. 
- You will find the functions [`numpy.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) and [`cupy.linalg.norm`](https://docs.cupy.dev/en/latest/reference/generated/cupy.linalg.norm.html) useful.
- The exact values of $n$ you take are up to you. You should take sufficiently large values so that the performance trend becomes clear. Please also be wary of the resource limitations of your compute environment.
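
A timing harness following these suggestions might be sketched as below. This is a sketch under stated assumptions rather than a prescribed method: the helper names and repeat counts are hypothetical, the GPU part assumes `cupyx.profiler.benchmark` as described in the CuPy performance guide, and it reports the mean of `gpu_times`; whether CPU-side or GPU-side times are the better measure for the transfer is a choice to justify in your write-up.

```python
import time

import numpy as np


def normalize_rows(A, xp=np):
    """Normalize rows to unit 2-norm; `xp` is the array module (numpy or cupy)."""
    return A / xp.linalg.norm(A, axis=1, keepdims=True)


def time_cpu(A, repeats=5):
    """Mean wall-clock seconds for the NumPy normalization."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        normalize_rows(A, np)
        times.append(time.perf_counter() - start)
    return float(np.mean(times))


def time_gpu(A, repeats=5):
    """Mean seconds for the CPU-to-GPU transfer and for the CuPy normalization."""
    # Imported here so the CPU helpers run on machines without a GPU.
    import cupy as cp
    from cupyx.profiler import benchmark

    transfer = benchmark(cp.asarray, (A,), n_repeat=repeats)
    A_gpu = cp.asarray(A)  # keep one copy on the device for the compute timing
    compute = benchmark(normalize_rows, (A_gpu, cp), n_repeat=repeats)
    # gpu_times has shape (n_devices, n_repeat); values are in seconds.
    return float(transfer.gpu_times.mean()), float(compute.gpu_times.mean())


# CPU-only demonstration on a small matrix (m = 512 as in the assignment).
A_demo = np.random.default_rng(0).standard_normal((1000, 512))
cpu_seconds = time_cpu(A_demo, repeats=3)
print(f"{cpu_seconds:.6f} s")
```

Passing the array module as `xp` keeps the normalization code identical on both devices, so any timing difference comes from the hardware and the library rather than from a different implementation.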



### Visualization

Produce visualizations that show how the performance scales as $n$ increases. Choose appropriate charts that clearly show the relative relationship between the three timing measurements. 

Comment on your findings. Is there always an advantage to using the GPU over the CPU for this problem? What can you say about the CPU-to-GPU data transfer time as compared to the computation time on the GPU?