## Import Libraries

We first import the libraries needed for this task.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
# from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_parquet("fhv_tripdata_2021-01.parquet")
df2 = pd.read_parquet("fhv_tripdata_2021-02.parquet")

## 1. Number of records in Jan 2021 FHV data

Here we find the number of rows and columns in `df`, the Jan 2021 FHV data loaded above.

In [3]:
df.shape

(1154112, 7)

In [4]:
def read_dataframe(filename):
    df = pd.read_parquet(filename)
    
    # Trip duration: drop-off time minus pickup time (a Timedelta)
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    # Convert the Timedelta to minutes
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Keep only trips lasting between 1 and 60 minutes (inclusive)
    df = df[((df.duration >= 1) & (df.duration <= 60))]
    
    return df

In [5]:
train_df = read_dataframe('fhv_tripdata_2021-01.parquet')
df_val = read_dataframe('fhv_tripdata_2021-02.parquet')

## 2. Average duration in Jan 2021 FHV

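To calculate the trip duration for Jan 2021, I subtracted the pickup datetime from the dropoff datetime, then converted the result to minutes by dividing the total seconds by 60. I then used the pandas `mean` function to find the mean duration. Note that `train_df` has already been filtered to trips of 1 to 60 minutes, so this is the mean of the filtered data.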

In [6]:
train_df.duration.mean()

16.247253368247375
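
The duration calculation can also be written without `apply`. The following is a minimal sketch on a made-up three-row frame (the `trips` DataFrame and its timestamps are illustrative assumptions, not the FHV data), using the vectorized `.dt.total_seconds()` accessor and the same 1 to 60 minute filter as `read_dataframe`.

```python
import pandas as pd

# Toy stand-in for the FHV data; the timestamps are made up for illustration.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2021-01-01 00:00:00", "2021-01-01 00:10:00", "2021-01-01 01:00:00"]
    ),
    "dropOff_datetime": pd.to_datetime(
        ["2021-01-01 00:15:00", "2021-01-01 00:10:30", "2021-01-01 03:00:00"]
    ),
})

# Timedelta column converted to minutes with the vectorized .dt accessor.
delta = trips["dropOff_datetime"] - trips["pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

# Keep trips between 1 and 60 minutes inclusive, as read_dataframe does.
kept = trips[trips["duration"].between(1, 60)]
mean_duration = kept["duration"].mean()
print(trips["duration"].tolist(), mean_duration)
```

On a frame with over a million rows, the vectorized form is usually noticeably faster than a row-by-row `apply`, though the result should be the same.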

## 3. Fraction of missing values

I computed the fraction of null values for the `PUlocationID` and `DOlocationID` features. Other columns in `train_df` also have missing values, but for the purpose of this homework we work only with these two features.

To get the fraction of missing values, I divided the number of missing values in each categorical column by the number of rows, as shown below.

In [7]:
categorical = ['PUlocationID', 'DOlocationID']
train_df[categorical].isnull().sum() / len(train_df[categorical])

PUlocationID    0.835273
DOlocationID    0.133270
dtype: float64
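
The same fraction can be obtained more directly, because the mean of a boolean mask is the share of `True` values. Here is a small sketch on a made-up four-row frame (the `toy` DataFrame is an assumption for illustration) showing that both forms agree.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing location IDs; values are made up.
toy = pd.DataFrame({
    "PUlocationID": [np.nan, np.nan, 12.0, np.nan],
    "DOlocationID": [7.0, np.nan, 3.0, 5.0],
})
categorical = ["PUlocationID", "DOlocationID"]

# Count of nulls divided by the number of rows, as in the cell above.
by_sum = toy[categorical].isnull().sum() / len(toy)

# Equivalent shortcut: the mean of the boolean null mask.
by_mean = toy[categorical].isnull().mean()

print(by_mean)
```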

In [8]:
categorical = ['PUlocationID', 'DOlocationID']

# train_df[categorical ].isnull().sum() / len(train_df[categorical])
# Cast the IDs to strings so DictVectorizer treats them as categories and one-hot encodes them
train_df[categorical] = train_df[categorical].astype(str)

In [9]:
dict_train = train_df[categorical].to_dict(orient = 'records')
dv = DictVectorizer()
X_train = dv.fit_transform(dict_train)

target = 'duration'
y_train = train_df[target].values

## 4. Dimensionality after OHE

Using `DictVectorizer`, a feature extractor that one-hot encodes string-valued features, we call `fit_transform` on the categorical features and save the result in `X_train`. I then checked the shape of the one-hot encoded `X_train`.

In [10]:
X_train.shape

(1109826, 525)
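
To see where the column count comes from, here is a minimal sketch on three made-up records (the `records` list and the `dv_toy` name are assumptions for illustration). Each distinct `feature=value` pair among the string-valued features becomes one column, so the width of the matrix is the number of distinct pickup values plus the number of distinct drop-off values.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; the IDs are strings, as after astype(str).
records = [
    {"PUlocationID": "nan", "DOlocationID": "7.0"},
    {"PUlocationID": "12.0", "DOlocationID": "7.0"},
    {"PUlocationID": "nan", "DOlocationID": "3.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)

# One column per distinct "feature=value" pair seen during fitting.
print(dv_toy.feature_names_)
print(X_toy.shape)
```

In this toy case there are two distinct pickup values and two distinct drop-off values, which gives four columns.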

## 5. RMSE on Train Data

I trained a `LinearRegression` model on the training data, then computed its RMSE on that same data.

In [11]:
lr = LinearRegression()
lr.fit(X_train, y_train)

x_pred = lr.predict(X_train) # Make predictions on the training data
mean_squared_error(y_train, x_pred, squared=False)

10.528519107211325
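
The cell above relies on the `squared=False` argument of `mean_squared_error`, which newer scikit-learn releases have deprecated. As a sketch that should not depend on that argument, the RMSE can be computed as the square root of the MSE. The arrays below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE as the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# The same quantity written out by hand.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(rmse, rmse_manual)
```

Because RMSE is in the same unit as the target, a value of about 10.5 on the training data means the predictions are off by roughly ten minutes in the root-mean-square sense.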

In [None]:
la = Lasso(alpha = 0.0001)
la.fit(X_train, y_train)

x_pred = la.predict(X_train) # Make predictions on the training data
mean_squared_error(y_train, x_pred, squared=False)

## 6. RMSE on Validation Data

For validation, I used the February 2021 data (`df_val`), which the model has not seen, to test how well the model performs on unseen data.

In [None]:
val_train = df_val[categorical].to_dict(orient = 'records')

val_df = dv.transform(val_train)

target = 'duration'
y_val = df_val[target].values
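
Note that the validation data goes through `dv.transform`, not `fit_transform`, so it is encoded with the columns learned from the training data. A minimal sketch with made-up records (all names below are illustrative assumptions) shows what happens to a location ID that appears only in the validation set: it is silently ignored and the row gets all zeros for that feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: "99.0" occurs only in the validation data.
train_records = [{"PUlocationID": "1.0"}, {"PUlocationID": "2.0"}]
val_records = [{"PUlocationID": "2.0"}, {"PUlocationID": "99.0"}]

dv_toy = DictVectorizer()
X_tr = dv_toy.fit_transform(train_records)  # learns the two columns
X_va = dv_toy.transform(val_records)        # reuses them, learns nothing new

print(X_va.toarray())
```

Both matrices have the same number of columns, which is what allows the model fitted on the training matrix to predict on the validation matrix.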

In [None]:
y_pred = lr.predict(val_df) # Make predictions on the validation data
mean_squared_error(y_val, y_pred, squared=False)

In [None]:
y_pred = la.predict(val_df) # Make predictions on the validation data
mean_squared_error(y_val, y_pred, squared=False)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
# Save the vectorizer and the model together (the models/ directory must already exist)
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
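
To check that the saved file is usable, it can be loaded back and used for prediction. The sketch below is a self-contained round trip on made-up data, written to a temporary directory rather than `models/`. The records, targets, and variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up training data standing in for the FHV features and durations.
records = [
    {"PUlocationID": "1.0", "DOlocationID": "7.0"},
    {"PUlocationID": "2.0", "DOlocationID": "7.0"},
    {"PUlocationID": "1.0", "DOlocationID": "3.0"},
]
durations = [10.0, 20.0, 15.0]

dv_toy = DictVectorizer()
lr_toy = LinearRegression().fit(dv_toy.fit_transform(records), durations)

# Create the output directory first; open() does not create it.
out_dir = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "lin_reg.bin")

# Save the vectorizer and model as one tuple, as in the cell above.
with open(path, "wb") as f_out:
    pickle.dump((dv_toy, lr_toy), f_out)

# Load them back in the same order they were saved.
with open(path, "rb") as f_in:
    dv_loaded, lr_loaded = pickle.load(f_in)

preds_before = lr_toy.predict(dv_toy.transform(records))
preds_after = lr_loaded.predict(dv_loaded.transform(records))
print(preds_before, preds_after)
```

Saving the vectorizer alongside the model matters because new data has to be encoded with the same columns the model was trained on.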