## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride.

In [1]:
import pandas as pd
import seaborn as sns

import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use the "**Yellow** Taxi Trip Records" dataset ([link](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)).

In [2]:
january_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
february_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet"

In [3]:
def read_and_process_dataframe(url: str) -> pd.DataFrame:
    """
    Read and process the dataframe from the given URL.
    """
    dataframe = pd.read_parquet(url)
    df = dataframe.copy()
    categorical_vars = ["PULocationID", "DOLocationID"]
    df[categorical_vars] = df[categorical_vars].astype(str)

    return df

In [4]:
df_january = read_and_process_dataframe(january_data_url)
df_february = read_and_process_dataframe(february_data_url)

#### Read the data for January. How many columns are there?

In [5]:
print(f"Answer: There are {df_january.shape[1]} columns in the January dataset.")

Answer: There are 19 columns in the January dataset.


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

In [6]:
def calculate_ride_duration(dataframe: pd.DataFrame) -> pd.Series:
    """
    Calculate the ride duration in minutes.
    """

    pickup_datetime = pd.to_datetime(dataframe.tpep_pickup_datetime)
    dropoff_datetime = pd.to_datetime(dataframe.tpep_dropoff_datetime)

    duration = (dropoff_datetime - pickup_datetime).dt.total_seconds()/60
    return duration


In [7]:
df_january["duration"] = calculate_ride_duration(df_january)
df_february["duration"] = calculate_ride_duration(df_february)
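
As a quick sanity check of the minutes conversion, here is a minimal sketch on two made-up rides. The timestamps and the names `toy`, `pickup`, `dropoff`, and `toy_duration` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Two made-up rides (hypothetical values, not taken from the dataset).
toy = pd.DataFrame(
    {
        "tpep_pickup_datetime": ["2023-01-01 00:00:00", "2023-01-01 10:15:00"],
        "tpep_dropoff_datetime": ["2023-01-01 00:12:30", "2023-01-01 11:00:00"],
    }
)

pickup = pd.to_datetime(toy["tpep_pickup_datetime"])
dropoff = pd.to_datetime(toy["tpep_dropoff_datetime"])

# Subtracting datetimes gives a timedelta Series;
# total_seconds() / 60 converts it to fractional minutes.
toy_duration = (dropoff - pickup).dt.total_seconds() / 60
print(toy_duration.tolist())  # [12.5, 45.0]
```

Dividing `total_seconds()` by 60 keeps fractional minutes, which is why the real `duration` column contains values such as 7.116667.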

#### What's the standard deviation of the trip duration in January?

In [8]:
df_january[["duration"]].describe()

| | duration |
|---|---|
| count | 3066766.0 |
| mean | 15.669 |
| std | 42.59435 |
| min | -29.2 |
| 25% | 7.116667 |
| 50% | 11.51667 |
| 75% | 18.3 |
| max | 10029.18 |


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

In [9]:
def remove_outliers(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Remove outliers from the dataframe.
    """
    return dataframe[(dataframe["duration"] >= 1) & (dataframe["duration"] <= 60)]
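
The inclusive 1 to 60 minute filter can also be sketched on a handful of made-up durations that sit on and around both boundaries. The values and the names `toy`, `mask`, `kept`, and `fraction_kept` are assumptions for illustration; `Series.between` is used here as an equivalent way to write the two comparisons.

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to hit both boundaries.
toy = pd.DataFrame({"duration": [-29.2, 0.5, 1.0, 11.5, 60.0, 60.01, 10029.18]})

# Series.between is inclusive on both ends by default,
# matching the (>= 1) & (<= 60) filter.
mask = toy["duration"].between(1, 60)
kept = toy[mask].copy()  # .copy() gives an independent frame, not a filtered slice

fraction_kept = len(kept) / len(toy)
print(kept["duration"].tolist())  # [1.0, 11.5, 60.0]
print(f"{fraction_kept:.2%}")     # 42.86%
```

Rides of exactly 1.0 and 60.0 minutes are kept, while 0.5 and 60.01 are dropped.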

#### What fraction of the records is left after you drop the outliers?


In [10]:
df_train = remove_outliers(df_january)
df_val = remove_outliers(df_february)

In [11]:
print(f"Answer: The fraction of records left after dropping outliers is {(df_train.shape[0]/df_january.shape[0])*100:.2f}%")

Answer: The fraction of records left after dropping outliers is 98.12%


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the IDs to strings; otherwise they will
  be treated as numeric values instead of being one-hot encoded)
* Fit a dictionary vectorizer 
* Get a feature matrix from it
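
To see why the IDs need to be strings, the steps above can be sketched on three made-up records. The location IDs and the names `records_str`, `records_int`, `dv_str`, and `dv_int` are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical location IDs stored as strings: each distinct value gets its own column.
records_str = [
    {"PULocationID": "161", "DOLocationID": "236"},
    {"PULocationID": "43", "DOLocationID": "236"},
    {"PULocationID": "161", "DOLocationID": "141"},
]
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(records_str)
print(dv_str.feature_names_)
# ['DOLocationID=141', 'DOLocationID=236', 'PULocationID=161', 'PULocationID=43']
print(X_str.shape)  # (3, 4)

# The same IDs as integers are treated as numeric values: one column per feature.
records_int = [{key: int(value) for key, value in record.items()} for record in records_str]
dv_int = DictVectorizer()
X_int = dv_int.fit_transform(records_int)
print(dv_int.feature_names_)  # ['DOLocationID', 'PULocationID']
print(X_int.shape)  # (3, 2)
```

With integer IDs a linear model would see zone 236 as "larger" than zone 141, which has no meaning for location codes.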


In [12]:
categorical_vars = ["PULocationID", "DOLocationID"]

In [13]:
# astype with a column-to-dtype mapping returns a new DataFrame, so nothing is
# assigned into a filtered slice and pandas raises no SettingWithCopyWarning.
df_train = df_train.astype({col: str for col in categorical_vars})
df_val = df_val.astype({col: str for col in categorical_vars})


In [14]:
dv = DictVectorizer()

train_dicts = df_train[categorical_vars].to_dict(orient="records")
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical_vars].to_dict(orient="records")
X_val = dv.transform(val_dicts)

#### What's the dimensionality of this matrix (number of columns)?

In [15]:
print(f"Answer: The dimensionality of this matrix is {X_train.shape[1]}")

Answer: The dimensionality of this matrix is 515


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters, where duration is the response variable
* Calculate the RMSE of the model on the training data
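
For reference, RMSE is the square root of the mean squared difference between true and predicted values. A minimal sketch with NumPy on made-up numbers (the arrays `y_true` and `y_hat` and the name `toy_rmse` are assumptions for illustration):

```python
import numpy as np

# Hypothetical true and predicted durations in minutes.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 36.0])

# RMSE = sqrt(mean((y_true - y_hat) ** 2))
errors = y_true - y_hat
toy_rmse = np.sqrt(np.mean(errors ** 2))
print(round(float(toy_rmse), 4))  # 2.8723
```

Because RMSE is in the same unit as the target, a value of about 2.87 here reads as "off by roughly 2.9 minutes on average, with large errors weighted more heavily". Newer scikit-learn releases have deprecated the `squared` argument of `mean_squared_error`, so taking `np.sqrt` of the plain MSE is likely the more portable way to get the same number.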

In [16]:
target = "duration"

In [17]:
y_train = df_train[target].values
y_val = df_val[target].values

In [18]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_train)
rmse = mean_squared_error(y_train, y_pred, squared=False)

#### What's the RMSE on train?

In [19]:
print(f"Answer: The RMSE on training set is {rmse:.2f}")

Answer: The RMSE on training set is 7.65


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

In [20]:
y_pred = lr.predict(X_val)
rmse = mean_squared_error(y_val, y_pred, squared=False)
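
One detail worth knowing when reusing a fitted vectorizer on a different month: `DictVectorizer.transform` silently ignores feature values it did not see during fitting. A minimal sketch with made-up records (the IDs, including the unseen `"999"`, and the names `train_records`, `val_records`, `toy_dv`, and `X_toy_val` are assumptions):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical "training month" records.
train_records = [
    {"PULocationID": "161", "DOLocationID": "236"},
    {"PULocationID": "43", "DOLocationID": "141"},
]
# Hypothetical "validation month" record with a pickup zone not seen during fit.
val_records = [{"PULocationID": "999", "DOLocationID": "236"}]

toy_dv = DictVectorizer()
toy_dv.fit(train_records)
X_toy_val = toy_dv.transform(val_records)

# The column count is fixed by fit; the unseen "PULocationID=999" is dropped.
print(toy_dv.feature_names_)
# ['DOLocationID=141', 'DOLocationID=236', 'PULocationID=161', 'PULocationID=43']
print(X_toy_val.toarray().tolist())  # [[0.0, 1.0, 0.0, 0.0]]
```

This is why the validation matrix always has the same number of columns as the training matrix, and why a ride from an unseen zone is predicted from its remaining features and the intercept only.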

#### What's the RMSE on validation?

In [21]:
print(f"Answer: The RMSE on validation set is {rmse:.2f}")

Answer: The RMSE on validation set is 7.81
