# Exploring the NYC taxi data

In Project 2, you will work on the [NYC taxi trip data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Every month, the city of New York publishes open data which contains a record of every taxi ride taken that month in the city.

The function `get_taxi_data()` is provided for you in `utils.py` to easily download and read data for a particular month and type of taxi. You should use it in your project.

Open `utils.py` in VSCode, study it carefully, and try the example below. If you are not sure how it works, ask a tutor!

In [6]:
import pandas as pd

# Import the function get_taxi_data() from utils.py
from utils import get_taxi_data

In [7]:
# Example: get yellow taxi data for January 2022
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount']

# Download the data and get the specified columns, save the file locally
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read, save=True)
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount
0,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,14.5
1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,8.0
2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,7.5
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,8.0
4,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,23.5


In [3]:
# Now, get the data only for those 3 columns.
# We have the file already saved from the previous command, so this should be faster!
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'trip_distance']

# We also don't need to save this as it's a subset of the file we already have.
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance
0,2022-01-01 00:35:40,2022-01-01 00:53:29,3.8
1,2022-01-01 00:33:43,2022-01-01 00:42:07,2.1
2,2022-01-01 00:53:21,2022-01-01 01:02:19,0.97
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.09
4,2022-01-01 00:36:48,2022-01-01 01:14:20,4.3


In [4]:
# Now, I want the same data, but I need a new column 'total_amount' which is not in my current file.
cols_to_read = ['fare_amount',
                'total_amount']

# The function tries to get the columns from the existing data file,
# but can't find them, so it automatically re-downloads the data.
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df.head()

File is in current folder, but may not contain all required columns.
Re-downloading data...


Unnamed: 0,fare_amount,total_amount
0,14.5,21.95
1,8.0,13.3
2,7.5,10.56
3,8.0,11.8
4,23.5,30.3
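The re-download message above suggests that the function first checks whether the saved file has the requested columns. How `utils.py` actually does this is not shown here; the sketch below only illustrates one way such a check could work in plain pandas. The helper name `has_columns` and the in-memory CSV are hypothetical.

```python
import io
import pandas as pd

def has_columns(csv_source, required):
    """Return True if the CSV header contains every name in `required`."""
    # nrows=0 parses only the header row, so no trip records are loaded.
    header = pd.read_csv(csv_source, nrows=0).columns
    return set(required).issubset(header)

# A tiny in-memory stand-in for a previously saved file (hypothetical data).
saved = "fare_amount,trip_distance\n14.5,3.8\n8.0,2.1\n"

# Every requested column is present, so the saved file could be reused.
print(has_columns(io.StringIO(saved), ["fare_amount"]))

# 'total_amount' is missing, so the data would need to be fetched again.
print(has_columns(io.StringIO(saved), ["fare_amount", "total_amount"]))
```

Reading only the header keeps the check cheap even when the saved file is large.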


In [39]:
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount','payment_type']

# Download the data and get the specified columns, save the file locally
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read, save=True)
df.head()

File is in current folder, but may not contain all required columns.
Re-downloading data...


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount,payment_type
0,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,14.5,1
1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,8.0,1
2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,7.5,1
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,8.0,2
4,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,23.5,1


In [45]:
popular_payment_method = df['payment_type'].value_counts().idxmax()
print(popular_payment_method)

1


As the result shows, the most common payment type for yellow taxis in January 2022 is `1`, which the TLC data dictionary lists as credit card.
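`idxmax()` returns only the winning code. To see how the payment types compare, the shares can be computed as in this minimal sketch on toy data. The label mapping is an assumption taken from the TLC yellow taxi data dictionary and should be checked against the current version before use.

```python
import pandas as pd

# Toy stand-in for df['payment_type']; the real column has millions of rows.
payments = pd.Series([1, 1, 2, 1, 2, 1, 4], name="payment_type")

# Assumed labels (from the TLC data dictionary); verify before relying on them.
labels = {1: "Credit card", 2: "Cash", 3: "No charge",
          4: "Dispute", 5: "Unknown", 6: "Voided trip"}

# Codes missing from the mapping become NaN and are left out by value_counts().
# normalize=True returns fractions of the total instead of raw counts.
shares = payments.map(labels).value_counts(normalize=True)
print(shares)

# Label of the most frequent payment type
most_common = shares.idxmax()
print(most_common)
```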

In [50]:
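import numpy as np

# Count missing values in every column
print(df.isnull().sum())
# Replace NA values in 'passenger_count' with the column mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
df['passenger_count'] = df['passenger_count'].fillna(np.mean(df['passenger_count']))
# Check that no missing values remain in 'passenger_count'
df['passenger_count'].isnull().sum()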


tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
fare_amount              0
payment_type             0
dtype: int64


0
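Filling with the mean produces fractional passenger counts, which is why a value like 1.389453 can appear among the quartiles reported by `df.describe()`. One possible alternative, shown here only as a sketch on toy data, is to fill with the median. Which choice suits your analysis is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (hypothetical data)
toy = pd.DataFrame({"passenger_count": [1.0, 1.0, np.nan, 2.0, 1.0, 3.0, np.nan]})

# Both mean() and median() skip NaN values by default.
mean_filled = toy["passenger_count"].fillna(toy["passenger_count"].mean())
median_filled = toy["passenger_count"].fillna(toy["passenger_count"].median())

# Compare the value used to fill the first gap (position 2) under each strategy
print(mean_filled.iloc[2], median_filled.iloc[2])
```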

In [56]:
from utils import get_taxi_data

In [60]:
df.nunique()

tpep_pickup_datetime     1423522
tpep_dropoff_datetime    1424266
passenger_count               11
trip_distance               4305
fare_amount                 6403
payment_type                   6
dtype: int64

In [61]:
df.describe()

Unnamed: 0,passenger_count,trip_distance,fare_amount,payment_type
count,2463931.0,2463931.0,2463931.0,2463931.0
mean,1.389453,5.372751,12.94648,1.194449
std,0.9686008,547.8714,255.8149,0.5001778
min,0.0,0.0,-480.0,0.0
25%,1.0,1.04,6.5,1.0
50%,1.0,1.74,9.0,1.0
75%,1.389453,3.13,14.0,1.0
max,9.0,306159.3,401092.3,5.0
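The summary above includes a minimum fare of -480.0 and a maximum trip distance of 306159.3, values that look like data errors rather than real trips. A minimal sketch of filtering such rows is shown below on toy data. The bounds `MAX_DISTANCE` and `MAX_FARE` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy rows that mimic the extremes seen in the summary (hypothetical data)
toy = pd.DataFrame({
    "trip_distance": [3.8, 0.0, 306159.3, 2.1],
    "fare_amount": [14.5, -480.0, 12.0, 8.0],
})

# Hypothetical plausibility bounds; pick your own and justify them in your report.
MAX_DISTANCE = 100.0
MAX_FARE = 500.0

# Keep rows with a positive, bounded distance and a non-negative, bounded fare.
plausible = (
    (toy["trip_distance"] > 0)
    & (toy["trip_distance"] <= MAX_DISTANCE)
    & (toy["fare_amount"] >= 0)
    & (toy["fare_amount"] <= MAX_FARE)
)
clean = toy[plausible]
print(len(toy), "rows before,", len(clean), "rows after")
```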


In [63]:
from geopy.distance import great_circle

def cal_distance(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    """Return the great-circle distance in km between the pickup and drop-off points (in degrees)."""
    start_coordinates = (pickup_lat, pickup_long)
    stop_coordinates = (dropoff_lat, dropoff_long)
    return great_circle(start_coordinates, stop_coordinates).km

# Example: get green taxi data for January 2022.
# Green taxi files use the 'lpep_' prefix for the pickup and drop-off timestamps.
cols_to_read = ['lpep_pickup_datetime',
                'lpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount',
                'payment_type']

# Download the data and get the specified columns, save the file locally
df = get_taxi_data('2022', '01', 'green', columns=cols_to_read, save=True)
df.head()
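The columns requested above do not include coordinates, so `cal_distance()` is only useful if you have pickup and drop-off latitudes and longitudes from another source. For reference, the sketch below computes the same kind of great-circle distance with the haversine formula, using only the standard library. The Earth radius and the example coordinates are approximate assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula on a sphere
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate, hypothetical coordinates for a trip from Midtown to JFK airport
print(round(haversine_km(40.7580, -73.9855, 40.6413, -73.7781), 1), "km")
```

This is the straight-line distance over the Earth's surface, so it will generally be shorter than the driven `trip_distance`.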

Now, choose another month, a type of vehicle, use `get_taxi_data()` to obtain the data, and start exploring the dataset!

---

## Important tips about memory usage

Some of the data files are very large (several gigabytes!). Depending on how much RAM (memory) your computer has, you may not be able to read an entire data file into a single dataframe at once.
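To get a feel for how much memory a dataframe occupies, pandas can report the size of each column. The following is a minimal sketch on a small toy dataframe, not on the real taxi data.

```python
import numpy as np
import pandas as pd

# Toy dataframe with 1000 rows of float64 values (8 bytes per value)
toy = pd.DataFrame({
    "trip_distance": np.zeros(1000),
    "passenger_count": np.ones(1000),
})

# Bytes used by each column (the index is reported as its own entry)
per_column = toy.memory_usage(deep=True)
print(per_column)

# Total size in megabytes
total_mb = per_column.sum() / 1e6
print(f"{total_mb:.3f} MB")
```

Running the same two calls on your own `df` shows which columns cost the most memory.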

### Specify `columns`

The `columns` input argument is provided for you to select which columns you want to include in your dataframe. You should always specify which columns you need when you read data, to avoid loading unnecessary data into memory.
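How `get_taxi_data()` applies `columns` internally is not shown here, but plain pandas readers offer the same idea: `read_csv` accepts `usecols`, and `read_parquet` accepts `columns`. The sketch below illustrates this with a small in-memory CSV (hypothetical data).

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a much larger file
raw = (
    "tpep_pickup_datetime,passenger_count,trip_distance,fare_amount\n"
    "2022-01-01 00:35:40,2.0,3.8,14.5\n"
    "2022-01-01 00:33:43,1.0,2.1,8.0\n"
)

# Only the listed columns are parsed and kept in memory.
wanted = ["trip_distance", "fare_amount"]
subset = pd.read_csv(io.StringIO(raw), usecols=wanted)
print(subset)
```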

### Save your processed data into CSV files

To create your report, you will be selecting specific parts of the data, and likely performing some cleaning and/or aggregation on this data. You may wish to save your data at intermediate steps of your processing into CSV files, so that you can load these directly the next time you start your notebook (instead of having to re-do all the processing every time you restart Jupyter).
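A minimal sketch of this save-and-reload workflow is shown below. The file name is hypothetical, and a temporary folder is used only to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Toy "processed" data (hypothetical values)
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:35:40", "2022-01-01 00:33:43"]),
    "fare_amount": [14.5, 8.0],
})

# Hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), "yellow_2022_01_clean.csv")

# index=False skips the row index, which would otherwise be read back
# as an extra 'Unnamed: 0' column.
trips.to_csv(path, index=False)

# CSV stores datetimes as text, so ask pandas to parse them again on load.
reloaded = pd.read_csv(path, parse_dates=["tpep_pickup_datetime"])
print(reloaded.dtypes)
```

In your own project you would write to a path in your project folder, so the file is still there the next time you start Jupyter.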

---