In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json



Thought process:

1. The code begins by setting up Kaggle API credentials, which are necessary to download the dataset from Kaggle.

2. It then downloads and extracts the NYC Yellow Taxi Trip dataset using the Kaggle CLI.

3. Dask is installed and imported along with the other required libraries. Dask is used because it processes data in partitions, so it can handle datasets that do not fit into memory.

4. The dataset is loaded with Dask's `read_csv` function, which reads every CSV file matching a glob pattern into a single lazy dataframe.

5. The coordinates of the Crate and Barrel store are defined.

6. A function `is_near_crate_and_barrel` is created to check whether a dropoff location is near the Crate and Barrel store. It uses numpy's `isclose` function with an absolute tolerance of 0.001 degrees, so coordinates do not have to match exactly.

7. This function is applied to the dataframe to create a new boolean column 'near_crate_and_barrel'.

8. The dataframe is then filtered to include only the trips near Crate and Barrel.

9. Pickup and dropoff times are converted to datetime objects for easier manipulation.

10. Hour and minute are extracted from the dropoff time and added as new columns.

11. The final dataframe is computed. Dask operations are lazy, so calling `compute()` is what actually runs them and returns a pandas dataframe.

12. Finally, the processed data is saved to a CSV file.
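
Steps 6 to 10 can be sketched on a small in-memory table. This is a minimal illustration using plain pandas rather than Dask, and it assumes the column names `dropoff_longitude`, `dropoff_latitude`, and `tpep_dropoff_datetime` used in the cells below. The helper `trips_near_store` and the sample rows are hypothetical and not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd

# Store coordinates as (longitude, latitude), as defined in the notebook.
STORE_LON, STORE_LAT = -73.974785, 40.750618
TOLERANCE_DEG = 0.001  # same atol the notebook passes to np.isclose


def trips_near_store(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep trips dropped off near the store and add hour/minute columns."""
    # Vectorized check over whole columns instead of a row-by-row apply.
    near = (
        np.isclose(trips["dropoff_longitude"].to_numpy(), STORE_LON, atol=TOLERANCE_DEG)
        & np.isclose(trips["dropoff_latitude"].to_numpy(), STORE_LAT, atol=TOLERANCE_DEG)
    )
    result = trips.loc[near].copy()
    dropoff = pd.to_datetime(result["tpep_dropoff_datetime"])
    result["dropoff_hour"] = dropoff.dt.hour
    result["dropoff_minute"] = dropoff.dt.minute
    return result


# Made-up rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame({
    "tpep_dropoff_datetime": ["2016-01-01 09:15:00", "2016-01-01 18:40:00"],
    "dropoff_longitude": [-73.9750, -73.9900],
    "dropoff_latitude": [40.7510, 40.7300],
})
nearby = trips_near_store(sample)
print(nearby[["dropoff_hour", "dropoff_minute"]])
```

Only the first sample row falls inside the tolerance box, so it is the only one kept.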

In [None]:
!kaggle datasets download -d elemento/nyc-yellow-taxi-trip-data
!unzip nyc-yellow-taxi-trip-data.zip -d data


Dataset URL: https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data
License(s): U.S. Government Works
Downloading nyc-yellow-taxi-trip-data.zip to /content
100% 1.78G/1.78G [00:18<00:00, 137MB/s]
100% 1.78G/1.78G [00:18<00:00, 101MB/s]
Archive:  nyc-yellow-taxi-trip-data.zip
  inflating: data/yellow_tripdata_2015-01.csv  
  inflating: data/yellow_tripdata_2016-01.csv  
  inflating: data/yellow_tripdata_2016-02.csv  
  inflating: data/yellow_tripdata_2016-03.csv  


In [None]:
# Install and import required libraries
# The "dataframe" extra pulls in the dependencies needed by dask.dataframe
!pip install "dask[dataframe]"
import dask.dataframe as dd
import pandas as pd

# Load dataset using Dask
df = dd.read_csv('data/yellow_tripdata_*.csv')




In [None]:
import numpy as np

# Define the coordinates of the Crate and Barrel store as (longitude, latitude)
crate_and_barrel_coords = (-73.974785, 40.750618)

def is_near_crate_and_barrel(row):
    # True when both dropoff coordinates fall within the tolerance of the store
    return (
        np.isclose(row['dropoff_longitude'], crate_and_barrel_coords[0], atol=0.001)
        and np.isclose(row['dropoff_latitude'], crate_and_barrel_coords[1], atol=0.001)
    )

# Apply the function row by row to create a new column 'near_crate_and_barrel'
df['near_crate_and_barrel'] = df.apply(is_near_crate_and_barrel, axis=1, meta=('near_crate_and_barrel', 'bool'))
# Filter the dataframe to include only trips near Crate and Barrel
filtered_df = df[df['near_crate_and_barrel']]
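
A note on how wide the tolerance is in practice. `np.isclose(a, b, atol=...)` tests `|a - b| <= atol + rtol * |b|`, and `rtol` defaults to `1e-5`, so the effective tolerance is somewhat larger than the `atol` of 0.001 degrees. The sketch below converts it to metres, assuming roughly 111.32 km per degree of latitude and a simple cosine scaling for longitude; both are approximations, and the helper name is hypothetical.

```python
import math

ATOL_DEG = 0.001           # atol passed to np.isclose in the notebook
RTOL = 1e-5                # np.isclose default relative tolerance
STORE_LON, STORE_LAT = -73.974785, 40.750618
METERS_PER_DEG_LAT = 111_320  # approximate length of one degree of latitude


def effective_tolerance_deg(reference: float) -> float:
    """Tolerance np.isclose applies around `reference`: atol + rtol * |reference|."""
    return ATOL_DEG + RTOL * abs(reference)


lat_tol_deg = effective_tolerance_deg(STORE_LAT)
lon_tol_deg = effective_tolerance_deg(STORE_LON)

# A degree of longitude shrinks with the cosine of the latitude.
lat_tol_m = lat_tol_deg * METERS_PER_DEG_LAT
lon_tol_m = lon_tol_deg * METERS_PER_DEG_LAT * math.cos(math.radians(STORE_LAT))

print(f"latitude tolerance:  {lat_tol_deg:.5f} deg, about {lat_tol_m:.0f} m")
print(f"longitude tolerance: {lon_tol_deg:.5f} deg, about {lon_tol_m:.0f} m")
```

Under these assumptions the filter accepts dropoffs within roughly 150 m of the store in each direction, which covers several neighbouring addresses. Passing `rtol=0` would make the box exactly 0.001 degrees wide on each side.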


In [None]:
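# Work on the filtered trips only; the unfiltered data may not fit into memory
filtered_df['tpep_pickup_datetime'] = dd.to_datetime(filtered_df['tpep_pickup_datetime'])
filtered_df['tpep_dropoff_datetime'] = dd.to_datetime(filtered_df['tpep_dropoff_datetime'])
filtered_df['dropoff_hour'] = filtered_df['tpep_dropoff_datetime'].dt.hour
filtered_df['dropoff_minute'] = filtered_df['tpep_dropoff_datetime'].dt.minute

# Compute the filtered dataframe (runs the lazy Dask graph and returns a pandas dataframe)
df = filtered_df.compute()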


In [None]:
df.to_csv('final_filtered_data.csv', index=False)