# Homework 1 - Due 23rd May 2023
This notebook fulfills the requirements of Homework 1, which is due on 23rd May 2023. The homework questions are answered in the markdown cells below.

The details for the homework can be found [here](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/01-intro/homework.md) and a brief outline can be seen below:

```{admonition} Homework 1 Outline
The goal of this homework is to train a simple model for predicting the duration of a ride.
We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "Yellow Taxi Trip Records".
The data to be used are from January and February 2022.
```

In [1]:
import sys
sys.path.append('/home/ubuntu/sh-mlops-zoomcamp/mlops_jupyter_book')
import pandas as pd
import pandera as pa
from pandera.typing import Series, DateTime
import seaborn as sns
import plotly.graph_objs as go
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from pathlib import Path
from sklearn.metrics import mean_squared_error
from datetime import datetime
from enum import Enum
import os
from utils.utils import ROOT_DIR, render_itable, init_jb_table_style
from itables import init_notebook_mode

init_jb_table_style()
init_notebook_mode(all_interactive=True, connected=True)

In [2]:
COLS_TO_STRINGS = ['PULocationID', 'DOLocationID']
DURATION_FILTER = {'duration': {'start':1, 'end': 60, 'inclusive':'both'}}
INDEP_VARS = ['PULocationID', 'DOLocationID']
TARGET_VAR = 'duration'

In [3]:
class FileTypes(Enum):
    PARQUET = {"type": "parquet", "function": pd.read_parquet}
    CSV = {"type": "csv", "function": pd.read_csv}

In [4]:
def _import_data(data_path: Path) -> pd.DataFrame:
    """Import the data at the given path into a DataFrame.

    The file type is taken from the file extension. If it is not one of the
    supported types (see the FileTypes class), an exception is raised.

    Args:
        data_path (Path): Path to the data to be imported

    Raises:
        ValueError: If the file type to be imported is not a valid FileTypes class value

    Returns:
        pd.DataFrame: pandas DataFrame containing the data that was imported from the data path
    """
    # Path.suffix includes the leading dot (e.g. ".parquet"), so strip it
    file_type = Path(data_path).suffix.lstrip('.')
    supported_types = [e.value['type'] for e in FileTypes]
    if file_type not in supported_types:
        raise ValueError(f"file type {file_type} is not currently supported, please use one of the following {supported_types}")
    # Enum member names are the upper-cased file type, so look the reader up by name
    return FileTypes[file_type.upper()].value['function'](data_path)
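
The lookup above relies on each enum member name matching the upper-cased file extension. A minimal, self-contained sketch of that dispatch pattern follows. The names `fake_read_parquet`, `fake_read_csv`, `DemoFileTypes` and `demo_import` are hypothetical stand-ins introduced only for this illustration, so that it runs without any data files.

```python
from enum import Enum
from pathlib import Path


def fake_read_parquet(path):  # hypothetical stand-in for pd.read_parquet
    return f"parquet:{path}"


def fake_read_csv(path):  # hypothetical stand-in for pd.read_csv
    return f"csv:{path}"


class DemoFileTypes(Enum):
    PARQUET = {"type": "parquet", "function": fake_read_parquet}
    CSV = {"type": "csv", "function": fake_read_csv}


def demo_import(data_path: Path):
    # Path.suffix includes the leading dot, e.g. ".csv"
    file_type = data_path.suffix.lstrip(".")
    supported = [e.value["type"] for e in DemoFileTypes]
    if file_type not in supported:
        raise ValueError(f"file type {file_type!r} is not supported; use one of {supported}")
    # Member names are the upper-cased extension, so a name lookup selects the reader
    return DemoFileTypes[file_type.upper()].value["function"](data_path)


print(demo_import(Path("trips.csv")))
```

An unsupported extension such as `trips.xlsx` raises a `ValueError` that lists the supported types.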

In [5]:
data_base_path = Path().resolve()
jan_data = (Path(ROOT_DIR) / "mlops_jupyter_book/homeworks/data/yellow_tripdata_2022-01.parquet").resolve()
feb_data = (Path(ROOT_DIR) / "mlops_jupyter_book/homeworks/data/yellow_tripdata_2022-02.parquet").resolve()

```{admonition} Data Validation
:class: tip
I have included a small section below on data validation. Validation matters because the data should have the expected form before we begin training an ML model (dates not in the future, no negative values for fare or journey length, etc.).

Creating a validation schema with [Pandera](https://pandera.readthedocs.io/en/stable/index.html) is straightforward and makes these expectations explicit. The schema I considered reasonable surfaced some interesting errors, but investigating and remediating them is outside the scope of this MLOps course:
- null values in passenger count
- negative values for all amounts (fare, mta, tolls etc.)
```

In [6]:
class TaxiSchema(pa.DataFrameModel):
    """Schema to validate our data against.
    This will ensure that any data we use for training and predictions has the correct form
    """

    VendorID: Series[int] = pa.Field(ge=0)
    tpep_pickup_datetime: Series[DateTime] = pa.Field(le=datetime.now())
    tpep_dropoff_datetime: Series[DateTime] = pa.Field(le=datetime.now())
    passenger_count: Series[float] = pa.Field(ge=0, nullable=True, coerce=True)
    trip_distance: Series[float] = pa.Field(ge=0.0, nullable=True)
    RatecodeID: Series[float] = pa.Field(ge=0, nullable=True, coerce=True)
    store_and_fwd_flag: Series[str] = pa.Field(isin=["N", "Y"], nullable=True)
    PULocationID: Series[str] = pa.Field(nullable=True)
    DOLocationID: Series[str] = pa.Field(nullable=True)
    payment_type: Series[int] = pa.Field(ge=0, le=5, nullable=True)
    fare_amount: Series[float] = pa.Field()  # fare amounts < 0 were identified; assumed to be refunds, but this should be investigated
    extra: Series[float] = pa.Field()
    mta_tax: Series[float] = pa.Field()
    tip_amount: Series[float] = pa.Field()
    tolls_amount: Series[float] = pa.Field()
    improvement_surcharge: Series[float] = pa.Field()
    total_amount: Series[float] = pa.Field()
    congestion_surcharge: Series[float] = pa.Field(nullable=True)
    airport_fee: Series[float] = pa.Field(nullable=True)
    duration: Series[float] = pa.Field(nullable=True)

    # Dataframe-level check: rows where drop-off precedes pick-up fail validation
    @pa.dataframe_check
    def check_dates(cls, df:pd.DataFrame) -> Series[bool]:
        return df['tpep_pickup_datetime'] <= df['tpep_dropoff_datetime']

In [7]:
def import_transform_validate(df_path: Path, validation_model: type[TaxiSchema] = TaxiSchema) -> pd.DataFrame:
    """Import the data, add the `duration` column (in minutes) and validate the result."""
    import_df = _import_data(df_path)
    # Convert PULocationID and DOLocationID to strings so they are treated as categories
    import_df[COLS_TO_STRINGS] = import_df[COLS_TO_STRINGS].astype(str)
    # Trip duration in minutes: drop-off time minus pick-up time
    trip_time = import_df['tpep_dropoff_datetime'] - import_df['tpep_pickup_datetime']
    import_df['duration'] = trip_time.dt.total_seconds() / 60

    validated_df = validation_model.validate(import_df)
    return validated_df

## Q1. Downloading the data
Read the data for January. How many columns are there? 

__Answer__ 

The table below shows 20 columns. One of them, `duration`, was added during import, so the answer is __19 columns are read from the data__.

In [8]:
jan_df = import_transform_validate(jan_data)
feb_df = import_transform_validate(feb_data)
render_itable(jan_df.head())

_Interactive table (row values not captured). Columns:_ VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee, duration



## Q2. Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the standard deviation of the trip durations in January?

__Answer__ 

Using the `describe` method, we can see that the standard deviation of the `duration` column is __46.45 minutes__.


In [9]:
jan_df.describe()

_Interactive table (summary statistics not captured). Columns:_ VendorID, passenger_count, trip_distance, RatecodeID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee, duration
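
To make the computation concrete, here is a small sketch that uses only the standard library and three made-up trips (the timestamps are illustrative, not taken from the dataset). It mirrors the duration calculation used at import time and computes the sample standard deviation (n - 1 denominator), which is also the pandas default for `std` and `describe`.

```python
from datetime import datetime
from statistics import stdev

# Made-up (pickup, dropoff) pairs, for illustration only
trips = [
    (datetime(2022, 1, 1, 0, 0), datetime(2022, 1, 1, 0, 10)),
    (datetime(2022, 1, 1, 8, 30), datetime(2022, 1, 1, 8, 50)),
    (datetime(2022, 1, 1, 23, 55), datetime(2022, 1, 2, 0, 25)),
]

# Subtracting datetimes gives a timedelta; convert its seconds to minutes
durations = [(dropoff - pickup).total_seconds() / 60 for pickup, dropoff in trips]

print(durations)         # [10.0, 20.0, 30.0]
print(stdev(durations))  # 10.0
```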


## Q3. Dropping outliers
Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after dropping the outliers?

__Answer__ 

The fraction of records remaining after excluding those outliers is __98.28%__

In [10]:
def remove_outliers(df: pd.DataFrame, filters: dict[str,dict[str,float]]) -> pd.DataFrame:
    """Function to remove outliers of a column based on two values

    Args:
        df (pd.DataFrame): Dataframe to remove outliers from
        filters (dict[str,dict[str,float]]): Dictionary with column to filter with start and end values
    Returns:
        pd.DataFrame: filtered_dataframe
    """

    df_filtered = df.copy()

    for key, val in filters.items():
        df_filtered = df_filtered[df_filtered[key].between(val['start'], val['end'], inclusive = val['inclusive'])]

    return df_filtered

In [11]:
jan_df_no_outliers = remove_outliers(jan_df, DURATION_FILTER)
print(f'Number of Rows Before: {len(jan_df)}')
print(f'Number of Rows After: {len(jan_df_no_outliers)}')
print(f'Fraction of Records Remaining: {len(jan_df_no_outliers)/ len(jan_df):.2%}')

Number of Rows Before: 2463931
Number of Rows After: 2421440
Fraction of Records Remaining: 98.28%
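
The `inclusive='both'` setting keeps trips of exactly 1 and exactly 60 minutes. The plain-Python sketch below applies the same rule to made-up durations, then repeats the arithmetic behind the reported fraction using the row counts printed above.

```python
# Made-up durations in minutes, for illustration only
durations = [-5.0, 0.5, 1.0, 12.3, 60.0, 60.1, 480.0]

# Same rule as Series.between(1, 60, inclusive='both'): both endpoints are kept
kept = [d for d in durations if 1 <= d <= 60]
print(kept)  # [1.0, 12.3, 60.0]

# Fraction for the January data, from the row counts reported above
rows_before, rows_after = 2_463_931, 2_421_440
print(f"{rows_after / rows_before:.2%}")  # 98.28%
```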



## Q4. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

__Answer__ 

The feature matrix returned by `fit_transform` has __515__ columns.

In [12]:
df_indep = jan_df_no_outliers[INDEP_VARS].to_dict(orient = 'records')
df_dep = jan_df_no_outliers[TARGET_VAR]

d_vect = DictVectorizer()
df_indep_trans = d_vect.fit_transform(df_indep)

print(f'Number of features after using the DictVectorizer class is: {len(d_vect.get_feature_names_out())}')

Number of features after using the DictVectorizer class is: 515
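
For string-valued features, `DictVectorizer` creates one column per distinct `name=value` pair seen during fitting, so the 515 columns should correspond to the distinct pickup IDs plus the distinct drop-off IDs in the filtered January data. The plain-Python sketch below mimics that naming scheme on made-up records. It illustrates the idea and is not scikit-learn's implementation.

```python
# Made-up records in the same shape as to_dict(orient='records')
records = [
    {"PULocationID": "132", "DOLocationID": "236"},
    {"PULocationID": "132", "DOLocationID": "42"},
    {"PULocationID": "79", "DOLocationID": "236"},
]

# One column per distinct "name=value" pair, in sorted order
feature_names = sorted({f"{name}={value}" for rec in records for name, value in rec.items()})
column_index = {name: i for i, name in enumerate(feature_names)}

# Dense one-hot rows: 1.0 where the record has that pair, 0.0 elsewhere
matrix = [[0.0] * len(feature_names) for _ in records]
for row, rec in zip(matrix, records):
    for name, value in rec.items():
        row[column_index[f"{name}={value}"]] = 1.0

print(feature_names)
print(len(feature_names))  # 4 = 2 distinct pickup IDs + 2 distinct drop-off IDs
```

Each row contains exactly two ones, one for the pickup location and one for the drop-off location.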


## Q5. Training a model
Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?

__Answer__ 

The RMSE on the training set is __6.99__

In [13]:
lr = LinearRegression()
lr.fit(df_indep_trans, df_dep)
print(f'The RMSE of the trained model on the training set is {mean_squared_error(df_dep, lr.predict(df_indep_trans), squared=False)}')

The RMSE of the trained model on the training set is 6.986190836477672
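
RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (minutes here). A minimal sketch of the calculation with made-up numbers:

```python
from math import sqrt

# Made-up actual and predicted durations in minutes
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

# Mean of the squared errors, then the square root
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)
```

Here the squared errors are 4, 4, 9 and 9, so the mean is 6.5 and the RMSE is its square root, roughly 2.55 minutes.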


## Q6. Evaluating the model
Now let's apply this model to the validation dataset (February 2022).

What's the RMSE on validation?

__Answer__ 

The RMSE on the validation set is __7.79__

In [14]:
feb_df_filtered = remove_outliers(feb_df, DURATION_FILTER)
df_val_indep_trans_feb = d_vect.transform(feb_df_filtered[INDEP_VARS].to_dict(orient='records'))
df_dep_feb = feb_df_filtered[TARGET_VAR]
val_preds = lr.predict(df_val_indep_trans_feb)
print(f'The RMSE of the trained model on the validation set is {mean_squared_error(df_dep_feb, val_preds, squared=False)}')

The RMSE of the trained model on the validation set is 7.78640879016696
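
Note that the February data is encoded with `transform`, not `fit_transform`, so it reuses the columns learned from January. `DictVectorizer.transform` silently ignores features that were not seen during fitting, which means a location ID that appears only in February contributes no column. The plain-Python sketch below illustrates that behaviour with made-up records. The `encode` helper is hypothetical and is not part of scikit-learn.

```python
# Vocabulary learned from made-up "training" records
train = [{"PULocationID": "132"}, {"PULocationID": "79"}]
vocabulary = sorted({f"{k}={v}" for rec in train for k, v in rec.items()})


def encode(record):
    # Pairs that are not in the vocabulary contribute nothing to the row
    present = {f"{k}={v}" for k, v in record.items()}
    return [1.0 if name in present else 0.0 for name in vocabulary]


print(encode({"PULocationID": "132"}))  # [1.0, 0.0]
print(encode({"PULocationID": "999"}))  # [0.0, 0.0], the ID was unseen in training
```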
