# Homework 1
The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

In [1]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Q1. Downloading the data
We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

### Answer:
1154112

In [37]:
jan_data = pd.read_parquet("data/fhv_tripdata_2021-01.parquet")

In [38]:
jan_data

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037
...,...,...,...,...,...,...,...
1154107,B03266,2021-01-31 23:43:03,2021-01-31 23:51:48,7.0,7.0,,B03266
1154108,B03284,2021-01-31 23:50:27,2021-02-01 00:48:03,44.0,91.0,,
1154109,B03285,2021-01-31 23:13:46,2021-01-31 23:29:58,171.0,171.0,,B03285
1154110,B03285,2021-01-31 23:58:03,2021-02-01 00:17:29,15.0,15.0,,B03285


### Q2. Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

### Answer:
19.16

In [4]:
jan_data["duration"] = jan_data["dropOff_datetime"] - jan_data["pickup_datetime"]

In [5]:
jan_data['duration']

0         0 days 00:17:00
1         0 days 00:17:00
2         0 days 01:50:00
3         0 days 00:08:17
4         0 days 00:15:13
                ...      
1154107   0 days 00:08:45
1154108   0 days 00:57:36
1154109   0 days 00:16:12
1154110   0 days 00:19:26
1154111   0 days 00:36:00
Name: duration, Length: 1154112, dtype: timedelta64[ns]

In [6]:
# Convert the timedelta to minutes with the vectorized .dt accessor
jan_data['duration'] = jan_data['duration'].dt.total_seconds() / 60

In [7]:
avg_trip_time = sum(jan_data['duration'])/jan_data['duration'].count()
avg_trip_time

19.1672240937939
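
The duration computation above (subtract the timestamps, then convert the timedelta to minutes) can be checked on a small example. The sketch below uses a toy frame whose values are copied from the first three rows shown earlier; the `trips` name is illustrative and not part of the homework.

```python
import pandas as pd

# Toy frame with the same column names as the FHV data
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2021-01-01 00:27:00", "2021-01-01 00:50:00", "2021-01-01 00:01:00"]
    ),
    "dropOff_datetime": pd.to_datetime(
        ["2021-01-01 00:44:00", "2021-01-01 01:07:00", "2021-01-01 01:51:00"]
    ),
})

# Timedelta -> minutes using the vectorized .dt accessor
trips["duration"] = (
    trips["dropOff_datetime"] - trips["pickup_datetime"]
).dt.total_seconds() / 60

print(trips["duration"].tolist())  # [17.0, 17.0, 110.0]
print(trips["duration"].mean())    # 48.0
```

The first three durations match the `0 days 00:17:00`, `0 days 00:17:00`, and `0 days 01:50:00` values in the notebook output.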

### Data preparation
Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

In [8]:
jan_data["duration"].describe()

count    1.154112e+06
mean     1.916722e+01
std      3.986922e+02
min      1.666667e-02
25%      7.766667e+00
50%      1.340000e+01
75%      2.228333e+01
max      4.233710e+05
Name: duration, dtype: float64

In [9]:
# Original no. of records in the dataset
original_records = jan_data["duration"].count()

In [10]:
jan_data = jan_data[(jan_data["duration"] >= 1) & (jan_data["duration"] <= 60)]

In [11]:
jan_data

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009,17.000000
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009,17.000000
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037,15.216667
5,B00037,2021-01-01 00:59:02,2021-01-01 01:08:05,,71.0,,B00037,9.050000
...,...,...,...,...,...,...,...,...
1154107,B03266,2021-01-31 23:43:03,2021-01-31 23:51:48,7.0,7.0,,B03266,8.750000
1154108,B03284,2021-01-31 23:50:27,2021-02-01 00:48:03,44.0,91.0,,,57.600000
1154109,B03285,2021-01-31 23:13:46,2021-01-31 23:29:58,171.0,171.0,,B03285,16.200000
1154110,B03285,2021-01-31 23:58:03,2021-02-01 00:17:29,15.0,15.0,,B03285,19.433333


In [12]:
# Number of records left after dropping the outliers
records_after_dropping = jan_data["duration"].count()

In [13]:
total_records_dropped = original_records - records_after_dropping
total_records_dropped

44286
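
The inclusive 1 to 60 minute filter can also be expressed with `Series.between`, which includes both endpoints by default. A minimal sketch on made-up durations (the values below are assumptions chosen to hit both boundaries, not data from the file):

```python
import pandas as pd

# Hypothetical durations in minutes, including values on both boundaries
durations = pd.Series([0.5, 1.0, 13.4, 60.0, 60.1, 423371.0], name="duration")

# between() is inclusive on both ends by default, matching >= 1 and <= 60
mask = durations.between(1, 60)

kept = durations[mask]
dropped = len(durations) - len(kept)

print(kept.tolist())  # [1.0, 13.4, 60.0]
print(dropped)        # 3
```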

### Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

These columns have a lot of missing values, though. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? I.e. the fraction of "-1"s after you filled the NAs.

### Answer:
83%

In [14]:
# Fill missing values for columns PUlocationID and DOlocationID.
# Reassigning the result (instead of inplace=True on a column of the filtered
# frame) avoids pandas' SettingWithCopyWarning.
jan_data = jan_data.fillna({"PUlocationID": -1, "DOlocationID": -1})


In [15]:
jan_data

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,-1.0,-1.0,,B00009,17.000000
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,-1.0,-1.0,,B00009,17.000000
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,-1.0,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,-1.0,61.0,,B00037,15.216667
5,B00037,2021-01-01 00:59:02,2021-01-01 01:08:05,-1.0,71.0,,B00037,9.050000
...,...,...,...,...,...,...,...,...
1154107,B03266,2021-01-31 23:43:03,2021-01-31 23:51:48,7.0,7.0,,B03266,8.750000
1154108,B03284,2021-01-31 23:50:27,2021-02-01 00:48:03,44.0,91.0,,,57.600000
1154109,B03285,2021-01-31 23:13:46,2021-01-31 23:29:58,171.0,171.0,,B03285,16.200000
1154110,B03285,2021-01-31 23:58:03,2021-02-01 00:17:29,15.0,15.0,,B03285,19.433333


In [16]:
# Share of missing values (now -1) for PUlocationID, as a percentage
frac_of_missing_values = (jan_data[jan_data["PUlocationID"] == -1]["PUlocationID"].count() / jan_data["PUlocationID"].count()) * 100
frac_of_missing_values

83.52732770722618
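
Calling `fillna(..., inplace=True)` on a column of a filtered frame can trigger pandas' `SettingWithCopyWarning`. One way around it, sketched here on a toy frame with made-up values, is to pass a dict to `DataFrame.fillna` and reassign the result. The fraction of "-1"s is then the mean of a boolean mask.

```python
import numpy as np
import pandas as pd

# Toy frame: three of the four pickup IDs are missing (illustrative values)
df = pd.DataFrame({
    "PUlocationID": [np.nan, np.nan, 7.0, np.nan],
    "DOlocationID": [72.0, np.nan, 7.0, 61.0],
})

# Filling via a dict returns a new frame, so nothing is written into a slice
df = df.fillna({"PUlocationID": -1, "DOlocationID": -1})

# The mean of a boolean mask is the fraction of True values
share_missing = (df["PUlocationID"] == -1).mean()

print(share_missing)  # 0.75
```

Multiplying `share_missing` by 100 gives the percentage form reported in this notebook.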

### Q4. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

Turn the dataframe into a list of dictionaries. Fit a dictionary vectorizer. Get a feature matrix from it. What's the dimensionality of this matrix (the number of columns)?

### Answer:
525


In [17]:
features = ["PUlocationID", "DOlocationID"]
train_data = jan_data[features].astype(str)

In [18]:
# Convert the data to a list of dictionaries (one per row)
train_dict = train_data[features].to_dict(orient="records")

In [19]:
# Get the feature matrix
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_dict)

In [20]:
X_train.get_shape()

(1109826, 525)
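
To see where the column count comes from, here is a small sketch of how `DictVectorizer` one-hot encodes string values: every distinct (feature, value) pair becomes one column. The three records are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# String values are treated as categories
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
    {"PUlocationID": "7.0", "DOlocationID": "7.0"},
]

dv = DictVectorizer()
X = dv.fit_transform(records)  # SciPy sparse matrix

print(X.shape)            # (3, 5)
print(dv.feature_names_)  # one "feature=value" name per column
print(X.toarray())        # each row has exactly two 1s: one pickup, one dropoff
```

By the same logic, the 525 columns in the homework matrix are the distinct pickup values plus the distinct dropoff values, with "-1.0" counted as a category in each.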

### Q5. Training a model
Now let's use the feature matrix from the previous step to train a model.

Train a plain linear regression model with default parameters. Calculate the RMSE of the model on the training data.

### Answer:
10.52

In [21]:
train_data['duration'] = jan_data["duration"]
y_train = train_data["duration"].values

In [22]:
y_train

array([17.        , 17.        ,  8.28333333, ..., 16.2       ,
       19.43333333, 36.        ])

In [23]:
# Fit a linear regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

In [24]:
# Predict on the train data
y_pred = lr_model.predict(X_train)

In [25]:
# RMSE
mean_squared_error(y_train, y_pred, squared=False)

10.52851910720242
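
For reference, RMSE is the square root of the mean squared residual. The sketch below computes it directly with NumPy on made-up values. This form does not depend on the `squared=False` argument, which newer scikit-learn releases have deprecated in favor of `root_mean_squared_error` (worth checking against your installed version).

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([13.0, 17.0, 33.0, 37.0])

# RMSE = sqrt(mean((y - y_hat)^2)); every residual here is +/-3
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(rmse)  # 3.0
```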

### Q6. Evaluating the model
Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?

### Answer:
11.01

In [26]:
# Read feb data
feb_data = pd.read_parquet("data/fhv_tripdata_2021-02.parquet")

In [33]:
# Perform the same preprocessing on feb data
# Get the duration
feb_data["duration"] = feb_data["dropOff_datetime"] - feb_data["pickup_datetime"]
feb_data['duration'] = feb_data['duration'].apply(lambda val: val.total_seconds() / 60)

# Subset only data with duration between 1 and 60 mins
feb_data = feb_data[(feb_data["duration"] >= 1) & (feb_data["duration"] <= 60)]

# Fill missing values (reassigning avoids pandas' SettingWithCopyWarning)
feb_data = feb_data.fillna({"PUlocationID": -1, "DOlocationID": -1})

# Select the features and cast them to strings so DictVectorizer one-hot encodes them
features = ["PUlocationID", "DOlocationID"]
valid_data = feb_data[features].astype(str)

# Convert the data to a list of dictionaries (one per row)
valid_dict = valid_data[features].to_dict(orient="records")

In [34]:
# Transform with the vectorizer already fitted on the January data (no refitting) and set the target
X_valid = vectorizer.transform(valid_dict)

y_valid = feb_data["duration"].values
y_valid

array([10.66666667, 14.56666667,  7.95      , ..., 25.38333333,
       18.05      , 16.        ])
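
The validation data is passed through `transform`, not `fit_transform`, so it gets the same column layout the model was trained on. A minimal sketch with made-up records shows the effect; "999.0" is a hypothetical dropoff ID that never appears in the training records, and `DictVectorizer` silently ignores such unseen values.

```python
from sklearn.feature_extraction import DictVectorizer

train_records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "7.0", "DOlocationID": "7.0"},
]
# "999.0" is a hypothetical dropoff ID not seen during fitting
valid_records = [{"PUlocationID": "-1.0", "DOlocationID": "999.0"}]

dv = DictVectorizer()
X_train = dv.fit_transform(train_records)  # learns the column layout
X_valid = dv.transform(valid_records)      # reuses it; no refitting

print(X_train.shape)      # (2, 4)
print(X_valid.shape)      # (1, 4)
print(X_valid.toarray())  # only the PUlocationID=-1.0 column is set
```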

In [35]:
# Get the prediction for validation data
y_pred = lr_model.predict(X_valid)

In [36]:
# RMSE for validation data
mean_squared_error(y_valid, y_pred, squared=False)

11.014283129211448
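
Since January and February go through identical preprocessing, the steps could be gathered into one function so the two months cannot drift apart. The sketch below is one possible shape for that. `prepare` is a hypothetical helper, not part of the homework, and the toy input stands in for the frame returned by `pd.read_parquet`.

```python
import pandas as pd

FEATURES = ["PUlocationID", "DOlocationID"]

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the homework's preprocessing to one month of trips (hypothetical helper)."""
    df = df.copy()
    # Duration in minutes
    df["duration"] = (
        df["dropOff_datetime"] - df["pickup_datetime"]
    ).dt.total_seconds() / 60
    # Keep rides between 1 and 60 minutes, inclusive
    df = df[df["duration"].between(1, 60)].copy()
    # Replace missing location IDs with -1 and cast to strings for DictVectorizer
    for col in FEATURES:
        df[col] = df[col].fillna(-1).astype(str)
    return df

# Tiny illustrative input: one 10-minute ride and one 120-minute ride
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-02-01 00:00:00", "2021-02-01 00:00:00"]),
    "dropOff_datetime": pd.to_datetime(["2021-02-01 00:10:00", "2021-02-01 02:00:00"]),
    "PUlocationID": [float("nan"), 7.0],
    "DOlocationID": [72.0, 7.0],
})

prepared = prepare(toy)
print(prepared[FEATURES + ["duration"]])  # only the 10-minute ride remains
```

With such a helper, the records for the vectorizer would come from `prepare(df)[FEATURES].to_dict(orient="records")` for either month.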
