# Homework 01-intro
## Training a simple model for predicting the duration of a ride.
### Dataset

The [NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) contains taxi trip records from New York City (USA). This homework uses the Yellow taxi records for January and February 2023.

#### Column descriptions

| **Column** | **Description** |
| :--- | :--- |
| VendorID | A code indicating the TPEP provider that provided the record. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged. |
| RateCodeID | The final rate code in effect at the end of the trip. *(1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride)* |
| Store_and_fwd_flag | Indicates whether the trip record was held in vehicle memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. *(Y = store and forward trip, N = not a store and forward trip)* |
| Payment_type | A numeric code signifying how the passenger paid for the trip. *(1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip)* |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid during the trip. |
| Total_amount | The total amount charged to passengers. Does not include cash tips. |
| Congestion_Surcharge | Total amount collected during the trip for the NYS congestion surcharge. |
| Airport_fee | $1.25 for pickups only at LaGuardia and John F. Kennedy Airports. |

In [1]:
# Importing libraries

import pandas as pd 
import numpy as np
import pyarrow.parquet as pq

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

from sklearn.metrics import root_mean_squared_error

import warnings
warnings.filterwarnings('ignore')

## Q1. Downloading the data

In [2]:
# Read the parquet files and convert them to pandas DataFrames
file1 = pq.ParquetFile('./data/yellow_tripdata_2023-01.parquet')
file2 = pq.ParquetFile('./data/yellow_tripdata_2023-02.parquet')

table1 = file1.read()
table2 = file2.read()

df_jan = table1.to_pandas()
df_feb = table2.to_pandas()

In [3]:
df_jan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3066766 entries, 0 to 3066765
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[us]
 2   tpep_dropoff_datetime  datetime64[us]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

The January data has 19 columns.

## Q2. Computing duration column

In [4]:
# For January
df_jan['duration'] = (df_jan['tpep_dropoff_datetime']) - (df_jan['tpep_pickup_datetime'])
df_jan['duration'] = np.round((df_jan['duration']) / np.timedelta64(1, 'm'), decimals=2)

# For February
df_feb['duration'] = (df_feb['tpep_dropoff_datetime']) - (df_feb['tpep_pickup_datetime'])
df_feb['duration'] = np.round((df_feb['duration']) / np.timedelta64(1, 'm'), decimals=2)

In [5]:
df_jan[['duration']].describe().round(3)

| **Statistic** | **duration** |
| :--- | ---: |
| count | 3066766.0 |
| mean | 15.669 |
| std | 42.594 |
| min | -29.2 |
| 25% | 7.12 |
| 50% | 11.52 |
| 75% | 18.3 |
| max | 10029.18 |


The standard deviation of trip duration in January is 42.59 minutes.
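The duration column is the drop-off time minus the pickup time, expressed in minutes. The following is a minimal standard-library sketch of that arithmetic, not the notebook's pandas code. The function name `duration_minutes` and the timestamps are made up for illustration and do not come from the dataset.

```python
from datetime import datetime, timedelta


def duration_minutes(pickup: datetime, dropoff: datetime) -> float:
    """Return the trip duration in minutes, rounded to two decimals."""
    # Dividing one timedelta by another yields a plain float ratio.
    return round((dropoff - pickup) / timedelta(minutes=1), 2)


# Hypothetical timestamps, for illustration only.
pickup = datetime(2023, 1, 1, 0, 32, 10)
dropoff = datetime(2023, 1, 1, 0, 40, 36)

print(duration_minutes(pickup, dropoff))   # 8 min 26 s
print(duration_minutes(dropoff, pickup))   # swapped timestamps give a negative value
```

A record whose drop-off timestamp precedes its pickup timestamp produces a negative duration, which is one way a minimum such as -29.2 can appear in the summary statistics.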

## Q3. Dropping outliers

In [6]:
# For January
df_jan_clean = df_jan[(df_jan['duration'] >= 1.00) & (df_jan['duration'] <= 60.00)]

# For February
df_feb_clean = df_feb[(df_feb['duration'] >= 1.00) & (df_feb['duration'] <= 60.00)]

#### Fraction of records remaining

In [7]:
# Total of records
total_with_outliers = df_jan['VendorID'].count()

# Total of records after removing outliers
total_without_outliers = df_jan_clean['VendorID'].count()

In [8]:
100 - (((total_with_outliers - total_without_outliers) * 100) / total_with_outliers)

98.1220282212598

About 98% of the records remain after dropping outliers.
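The same percentage can be written as a direct ratio of kept rows to total rows. The sketch below assumes the two row counts reported elsewhere in this notebook (3,066,766 rows from `df_jan.info()` and 3,009,173 rows in `X_train.shape`); the helper name `percent_kept` is hypothetical.

```python
def percent_kept(total: int, kept: int) -> float:
    """Return the percentage of rows that survive filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * kept / total


total_rows = 3_066_766  # RangeIndex size reported by df_jan.info()
kept_rows = 3_009_173   # row count reported by X_train.shape

print(round(percent_kept(total_rows, kept_rows), 2))
```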

## Q4. One-hot encoding
#### Features to use: pickup and dropoff location IDs

In [9]:
# Cast the location IDs to strings so DictVectorizer treats them as categories,
# then build one dictionary per row
train_feat = df_jan_clean[['PULocationID', 'DOLocationID']].astype(str)
train_dicts = train_feat.to_dict(orient='records')

# Vectorizing dictionary
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [10]:
X_train.shape

(3009173, 515)

The feature matrix has 515 columns.
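To see where the column count comes from, here is a simplified pure-Python stand-in for one-hot encoding of string-valued dictionaries. It is a sketch of the idea, not scikit-learn's implementation: each distinct `feature=value` pair gets its own column. The helper names `fit_vocabulary` and `one_hot` and the three sample rows are hypothetical.

```python
def fit_vocabulary(rows):
    """Assign a column index to every distinct 'feature=value' pair."""
    names = sorted({f"{key}={value}" for row in rows for key, value in row.items()})
    return {name: index for index, name in enumerate(names)}


def one_hot(row, vocab):
    """Encode one row as a 0/1 list using a fitted vocabulary."""
    vector = [0] * len(vocab)
    for key, value in row.items():
        index = vocab.get(f"{key}={value}")
        if index is not None:
            vector[index] = 1
    return vector


# Hypothetical rows, for illustration only.
rows = [
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "237"},
    {"PULocationID": "161", "DOLocationID": "237"},
]

vocab = fit_vocabulary(rows)
matrix = [one_hot(row, vocab) for row in rows]

print(sorted(vocab))  # column names
for encoded in matrix:
    print(encoded)
```

Here two distinct pickup IDs plus two distinct drop-off IDs give four columns, and every row has exactly two ones. By the same logic, the number of columns in `X_train` is the number of distinct pickup values plus the number of distinct drop-off values in the filtered January data.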

## Q5. Training a model

In [11]:
# Setting the target
y_train = df_jan_clean['duration'].values

In [12]:
# Training the model with linear regression
lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

root_mean_squared_error(y_train, y_pred)

7.649261720998863

The RMSE on the training set is about 7.65 minutes (7.649 before rounding).
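RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same unit as the target (minutes here). A minimal standard-library sketch follows; the function `rmse` and the sample numbers are illustrative and are not the notebook's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations and predictions in minutes.
durations = [10.0, 20.0, 30.0]
predictions = [12.0, 18.0, 33.0]

print(round(rmse(durations, predictions), 3))
```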

## Q6. Evaluating the model

In [13]:
# One-hot encoding (dictionaries and vectorization) with the vectorizer fitted on January
val_feat = df_feb_clean[['PULocationID', 'DOLocationID']].astype(str)
val_dicts = val_feat.to_dict(orient='records')

X_val = dv.transform(val_dicts)

In [14]:
# Prediction on validation
y_val = df_feb_clean['duration'].values

y_val_pred = lr.predict(X_val)

root_mean_squared_error(y_val, y_val_pred)

7.811819484670213

The RMSE on the validation set (February) is 7.81 minutes.
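One detail worth keeping in mind for validation: the vectorizer is fitted on January only, and scikit-learn documents that `DictVectorizer.transform` silently ignores features not seen during fitting. The sketch below imitates that documented behavior in plain Python under stated assumptions; the vocabulary, the helper `transform_row`, and the sample row are hypothetical.

```python
# Hypothetical vocabulary, standing in for one learned from training rows.
train_vocab = {
    "DOLocationID=141": 0,
    "DOLocationID=237": 1,
    "PULocationID=161": 2,
    "PULocationID=43": 3,
}


def transform_row(row, vocab):
    """Encode a row with a fixed vocabulary and report pairs it cannot place."""
    vector = [0] * len(vocab)
    unseen = []
    for key, value in row.items():
        name = f"{key}={value}"
        if name in vocab:
            vector[vocab[name]] = 1
        else:
            unseen.append(name)  # dropped from the encoding
    return vector, unseen


# A validation row whose drop-off zone never appeared in training.
vector, unseen = transform_row({"PULocationID": "43", "DOLocationID": "999"}, train_vocab)

print(vector)
print(unseen)
```

For such a row the model only sees the pickup indicator, so its prediction relies on the intercept plus one coefficient. Counting unseen pairs in the February data would show how often this happens.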