In [7]:
# !pip install pyarrow

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.  
Read the data for January. How many columns are there?

In [1]:
import pandas as pd

In [2]:
%%time
df_j = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
df_f = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet')

CPU times: user 1.47 s, sys: 719 ms, total: 2.19 s
Wall time: 2.08 s


In [3]:
len(df_j.columns)

19

In [4]:
df_j.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip duration in January?
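
As a small illustration of the duration calculation, the sketch below uses two made-up trips (the timestamps are invented, only the column names match the dataset). Subtracting two datetime columns gives a timedelta column, and `dt.total_seconds() / 60` converts it to fractional minutes.

```python
import pandas as pd

# Toy frame with invented timestamps; column names follow the taxi dataset
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:15:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:12:30", "2023-01-01 11:00:00"]
    ),
})

# Datetime minus datetime yields a timedelta column
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# Convert to minutes, keeping the fractional part
toy["duration"] = delta.dt.total_seconds() / 60

durations = toy["duration"].tolist()
print(durations)  # [12.5, 45.0]
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default.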

In [5]:
# Trip duration in minutes: dropoff minus pickup, converted from seconds
df_j['duration'] = (df_j['tpep_dropoff_datetime'] - df_j['tpep_pickup_datetime']).dt.total_seconds() / 60

In [6]:
df_j['duration'].std()

42.594351241920904

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?
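
The fraction of rows that survive a filter is just the mean of the boolean mask, since `True` counts as 1 and `False` as 0. A minimal sketch with invented durations:

```python
import pandas as pd

# Made-up durations in minutes, including values on both boundaries
durations = pd.Series([0.5, 1.0, 12.0, 60.0, 75.0])

# between() is inclusive on both ends by default, matching "1 to 60 minutes (inclusive)"
mask = durations.between(1, 60)

# Mean of a boolean mask = share of rows that pass the filter
fraction_kept = mask.mean()
print(fraction_kept)  # 0.6
```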

In [7]:
((df_j['duration']>=1) & (df_j['duration']<=60)).mean()

0.9812202822125979

In [10]:
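# .copy() makes the filtered frame independent, so later column assignments don't trigger SettingWithCopyWarning
df_j = df_j[(df_j['duration']>=1) & (df_j['duration']<=60)].copy()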

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise
  `DictVectorizer` will treat them as numeric features instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it
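
To see why the string cast matters, the sketch below fits two vectorizers on the same made-up location IDs, once as integers and once as strings. The IDs and variable names are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented location IDs, once as numbers and once as strings
as_numbers = [{"PULocationID": 10}, {"PULocationID": 20}, {"PULocationID": 10}]
as_strings = [{"PULocationID": "10"}, {"PULocationID": "20"}, {"PULocationID": "10"}]

# Numeric values are passed through as a single numeric column
dv_num = DictVectorizer()
X_num = dv_num.fit_transform(as_numbers)

# String values produce one indicator column per distinct value
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(as_strings)

print(dv_num.feature_names_, X_num.shape)  # ['PULocationID'] (3, 1)
print(dv_str.feature_names_, X_str.shape)  # ['PULocationID=10', 'PULocationID=20'] (3, 2)
```

With integer IDs the model would see the location as a single magnitude, which is not meaningful for zone identifiers.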


In [8]:
from sklearn.feature_extraction import DictVectorizer

In [11]:
categorical = ['PULocationID', 'DOLocationID']
df_j[categorical] = df_j[categorical].astype(str)


In [12]:
train_dicts = df_j[categorical].to_dict(orient='records')

In [13]:
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [14]:
X_train.shape

(3009173, 515)


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data
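
For reference, RMSE is the square root of the mean squared difference between predictions and targets. A hand-rolled sketch on invented numbers (not the taxi data):

```python
import numpy as np

# Made-up targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are 2, -2 and 3; their squares are 4, 4 and 9
errors = y_hat - y_true

# Root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.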

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [16]:
target = 'duration'
y_train = df_j[target].values

In [20]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

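# RMSE is the square root of MSE; this avoids the `squared` argument, which recent scikit-learn releases deprecate
mean_squared_error(y_train, y_pred) ** 0.5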

7.649261929771859

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?
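
On validation data the vectorizer should only be applied with `transform`, not refitted, so the columns line up with what the model was trained on. The sketch below uses invented IDs to show that a category unseen during fitting is silently ignored and becomes an all-zero row.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented IDs; "99" appears only in the validation set
train = [{"PULocationID": "10"}, {"PULocationID": "20"}]
valid = [{"PULocationID": "20"}, {"PULocationID": "99"}]

dv = DictVectorizer()
X_train = dv.fit_transform(train)  # learns the columns from training data
X_valid = dv.transform(valid)      # reuses those columns, no refit

valid_dense = X_valid.toarray().tolist()
print(dv.feature_names_)  # ['PULocationID=10', 'PULocationID=20']
print(valid_dense)        # [[0.0, 1.0], [0.0, 0.0]]
```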

In [22]:
df_f['duration'] = (df_f['tpep_dropoff_datetime'] - df_f['tpep_pickup_datetime']).dt.total_seconds() / 60
# .copy() avoids SettingWithCopyWarning when the ID columns are reassigned below
df_f = df_f[(df_f['duration']>=1) & (df_f['duration']<=60)].copy()
df_f[categorical] = df_f[categorical].astype(str)
valid_dicts = df_f[categorical].to_dict(orient='records')
# Reuse the vectorizer fitted on January; do not refit on validation data
X_valid = dv.transform(valid_dicts)


In [23]:
y_valid = df_f[target].values

In [24]:
y_pred_valid = lr.predict(X_valid)

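# RMSE is the square root of MSE; this avoids the `squared` argument, which recent scikit-learn releases deprecate
mean_squared_error(y_valid, y_pred_valid) ** 0.5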

7.811818933419717