## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride.

In [None]:
import pandas as pd
import seaborn as sns

import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

## Q1. Downloading the data

We'll use the "**Yellow** Taxi Trip Records" dataset ([link](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)).

In [21]:
january_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
february_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet"

In [24]:
df_january = pd.read_parquet(january_data_url)
df_february = pd.read_parquet(february_data_url)

#### Read the data for January. How many columns are there?

In [23]:
print(f"Answer: There are {df_january.shape[1]} columns in the January dataset.")

Answer: There are 19 columns in the January dataset.


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 
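As a quick sanity check of the unit conversion, here is a minimal sketch on two made-up rides. The timestamps are illustrative only; the column names are the pickup and dropoff columns used in this notebook.

```python
import pandas as pd

# Two hypothetical rides (values invented for illustration).
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 10:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:07:30", "2023-01-01 11:00:00"]),
})

# Subtracting two datetime columns gives a timedelta Series.
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]

# total_seconds() returns seconds, so divide by 60 to get minutes.
minutes = delta.dt.total_seconds() / 60
print(minutes.tolist())
```

The division by 60 is the step that turns seconds into the minutes the question asks for.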

In [28]:
def calculate_ride_duration(dataframe: pd.DataFrame) -> pd.Series:
    """
    Calculate the ride duration in minutes.
    """

    pickup_datetime = pd.to_datetime(dataframe.tpep_pickup_datetime)
    dropoff_datetime = pd.to_datetime(dataframe.tpep_dropoff_datetime)

    duration = (dropoff_datetime - pickup_datetime).dt.total_seconds()/60
    return duration


In [29]:
df_january["duration"] = calculate_ride_duration(df_january)

#### What's the standard deviation of the trip durations in January?

In [30]:
df_january[["duration"]].describe()

| | duration |
|---|---|
| count | 3066766.0 |
| mean | 15.669 |
| std | 42.59435 |
| min | -29.2 |
| 25% | 7.116667 |
| 50% | 11.51667 |
| 75% | 18.3 |
| max | 10029.18 |


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).
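One way to look at the distribution before filtering is to print a few percentiles and the share of rows inside the target window. This is a minimal sketch on a small made-up sample, not on the real data; the values are chosen only to show the tails and the inclusive boundaries.

```python
import pandas as pd

# Hypothetical ride durations in minutes (illustrative values only).
sample = pd.DataFrame(
    {"duration": [-5.0, 0.5, 1.0, 7.0, 12.0, 18.0, 60.0, 61.0, 600.0]}
)

# Percentiles show how far the tails stretch beyond the bulk of the data.
summary = sample["duration"].describe(percentiles=[0.05, 0.5, 0.95])

# between() is inclusive on both ends by default, so 1 and 60 are kept.
in_range = sample["duration"].between(1, 60)
share_kept = in_range.mean()

print(summary)
print(f"share kept: {share_kept:.2%}")
```

Taking the mean of a boolean mask gives the fraction of `True` values, which is a compact way to get the share of records that would survive the filter.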

In [31]:
def remove_outliers(dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Remove outliers from the dataframe.
    """
    return dataframe[(dataframe["duration"] >= 1) & (dataframe["duration"] <= 60)]

#### What fraction of the records is left after you drop the outliers?


In [32]:
df_transformed = remove_outliers(df_january)

In [41]:
print(f"Answer: The fraction of records left after dropping outliers is {(df_transformed.shape[0]/df_january.shape[0])*100:.2f}%")

Answer: The fraction of records left after dropping outliers is 98.12%


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the IDs to strings; otherwise the vectorizer
  will treat them as numeric values instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it
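
The steps above can be sketched as follows. This is a minimal illustration on a tiny made-up frame, assuming the location columns are named `PULocationID` and `DOLocationID` as in the Yellow Taxi data; the ID values and the variable names (`rides`, `records`, `dv`, `X`) are chosen here for the example.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Tiny hypothetical frame with the two location ID columns.
rides = pd.DataFrame(
    {"PULocationID": [161, 43, 161], "DOLocationID": [141, 237, 237]}
)

categorical = ["PULocationID", "DOLocationID"]

# Cast the IDs to strings so DictVectorizer one-hot encodes them
# instead of passing them through as numeric values.
records = rides[categorical].astype(str).to_dict(orient="records")

# Fit the vectorizer and build the feature matrix in one step.
dv = DictVectorizer()
X = dv.fit_transform(records)  # SciPy sparse matrix by default

print(X.shape)
print(dv.feature_names_)
```

Each distinct pickup or dropoff ID becomes its own column, so the number of columns in `X` equals the number of distinct `column=value` pairs seen during fitting.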
