Name: Isaac Ndirangu Muturi
Email: ndirangumuturi749@gmail.com

# Week 1 homework @ MLOps Zoomcamp

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

In [1]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet

--2024-05-14 17:05:39--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.160.203.81, 3.160.203.184, 3.160.203.173, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.160.203.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47673370 (45M) [application/x-www-form-urlencoded]
Saving to: ‘yellow_tripdata_2023-01.parquet’


2024-05-14 17:05:42 (19.7 MB/s) - ‘yellow_tripdata_2023-01.parquet’ saved [47673370/47673370]



In [2]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet

--2024-05-14 17:05:42--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.160.203.173, 3.160.203.81, 3.160.203.53, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.160.203.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47748012 (46M) [application/x-www-form-urlencoded]
Saving to: ‘yellow_tripdata_2023-02.parquet’


2024-05-14 17:05:46 (15.9 MB/s) - ‘yellow_tripdata_2023-02.parquet’ saved [47748012/47748012]



In [3]:
!mkdir -p data && mv yellow_tripdata_2023-01.parquet yellow_tripdata_2023-02.parquet data/

In [4]:
import pandas as pd

df = pd.read_parquet('./data/yellow_tripdata_2023-01.parquet')
df.shape

(3066766, 19)

In [5]:
# 19 columns

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?
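
As a small illustration of this computation, the sketch below builds a tiny frame with invented timestamps (not taken from the TLC data) and converts the dropoff/pickup difference to minutes with the `.dt` accessor. The frame name `toy` is hypothetical; only the two datetime column names match the dataset.

```python
import pandas as pd

# Hypothetical toy data: two rides with invented timestamps
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:10:30", "2023-01-01 10:45:00"]),
})

# Subtracting two datetime columns gives a timedelta column
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# Convert the timedeltas to minutes
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [10.5, 45.0]
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default.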


In [6]:
# Compute duration in minutes
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
df['duration'] = df['duration'].dt.total_seconds() / 60

# Compute standard deviation of trip durations
df['duration'].std()


42.594351241920904

In [7]:
# 42.594351241920904 standard deviation

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?
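
A minimal sketch of the inclusive filter on invented durations: it compares an explicit boolean mask with `Series.between`, which includes both endpoints by default. The series `durations` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical durations in minutes (invented values)
durations = pd.Series([0.5, 1.0, 12.0, 60.0, 75.0])

# Explicit boolean mask: keep 1 <= duration <= 60
mask_explicit = (durations >= 1) & (durations <= 60)

# Series.between is inclusive on both ends by default
mask_between = durations.between(1, 60)

# The mean of a boolean mask is the fraction of records kept
fraction_kept = mask_between.mean()
print(fraction_kept)  # 0.6
```

Both masks select the same rows, so either form can be used for the filtering step.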


In [8]:
# Filter records where duration is between 1 and 60 minutes (inclusive)
df_train = df[(df['duration'] >= 1) & (df['duration'] <= 60)]

# Calculate fraction of records left after dropping outliers
len(df_train) / len(df)

0.9812202822125979

In [9]:
# 98.12%

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the IDs to strings; otherwise
  `DictVectorizer` will treat them as numeric values instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?


In [10]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'duration'],
      dtype='object')

In [11]:
from sklearn.feature_extraction import DictVectorizer

categorical_cols = ['PULocationID', 'DOLocationID']

# Take an explicit copy so the assignment below modifies df_train itself
# rather than a slice of df (this avoids pandas' SettingWithCopyWarning)
df_train = df_train.copy()

# Convert IDs to strings
df_train[categorical_cols] = df_train[categorical_cols].astype(str)

# Convert DataFrame to list of dictionaries
train_dicts = df_train[categorical_cols].to_dict(orient='records')

# Initialize and fit a dictionary vectorizer
vectorizer = DictVectorizer()
vectorizer.fit(train_dicts)

# Get feature matrix
X_train = vectorizer.transform(train_dicts)
X_train


<3009173x515 sparse matrix of type '<class 'numpy.float64'>'
	with 6018346 stored elements in Compressed Sparse Row format>

In [12]:
X_train.shape

(3009173, 515)

In [13]:
# 515

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?
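
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below works it out by hand with NumPy on invented numbers; `y_true`, `y_hat`, and `rmse_toy` are hypothetical names, not values from the taxi data.

```python
import numpy as np

# Hypothetical targets and predictions in minutes (invented values)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 39.0])

# Errors are -2, 2, -3, 1, so the squared errors are 4, 4, 9, 1
errors = y_true - y_hat

# RMSE = sqrt(mean(errors^2)) = sqrt(18 / 4) = sqrt(4.5)
rmse_toy = np.sqrt(np.mean(errors ** 2))
print(rmse_toy)
```

This is the same quantity as `np.sqrt(mean_squared_error(y_true, y_hat))` from scikit-learn.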

In [14]:
target = 'duration'
y_train = df_train[target].values
y_train

array([ 8.43333333,  6.31666667, 12.75      , ..., 24.51666667,
       13.        , 14.4       ])

In [15]:
X_train.shape

(3009173, 515)

In [16]:
y_train.shape

(3009173,)

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Instantiate and fit the linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the training data
y_preds = lr.predict(X_train)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_preds))
rmse


7.649262060255514

In [18]:
# 7.649262060255514

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?
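
A model trained on the January matrix expects the same columns at prediction time, so the vectorizer fitted on the training data is typically reused with `transform` only. The sketch below shows this on invented records: a category seen only in validation is silently ignored and the column count stays fixed. All names and IDs here are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records (invented IDs): "3" appears only in validation
train_toy = [{"PULocationID": "1"}, {"PULocationID": "2"}]
val_toy = [{"PULocationID": "2"}, {"PULocationID": "3"}]

dv = DictVectorizer()
X_train_toy = dv.fit_transform(train_toy)  # learns columns for "1" and "2"
X_val_toy = dv.transform(val_toy)          # reuses those columns, no refit

print(X_train_toy.shape, X_val_toy.shape)  # (2, 2) (2, 2)

# The unseen ID "3" has no column, so its row is all zeros
print(X_val_toy.toarray().tolist())  # [[0.0, 1.0], [0.0, 0.0]]
```

Refitting on the validation records instead would learn columns for "2" and "3", which no longer line up with the columns the model was trained on.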

In [19]:
df_val = pd.read_parquet('./data/yellow_tripdata_2023-02.parquet')

df_val['duration'] = df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime
df_val['duration'] = df_val['duration'].dt.total_seconds() / 60


In [20]:
# Take an explicit copy so the string cast below does not write to a slice
df_val = df_val[(df_val['duration'] >= 1) & (df_val['duration'] <= 60)].copy()

df_val[categorical_cols] = df_val[categorical_cols].astype(str)


In [21]:
val_dicts = df_val[categorical_cols].to_dict(orient='records')

# Reuse the vectorizer fitted on the training data (transform only).
# Calling vectorizer.fit(val_dicts) here would yield a 514-column matrix,
# and lr.predict would then fail with:
# ValueError: X has 514 features, but LinearRegression is expecting 515 features as input.
X_val = vectorizer.transform(val_dicts)
X_val.shape

(2855951, 515)

In [22]:
y_val = df_val[target].values

y_pred = lr.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, y_pred))
rmse