## Model Development Overview

- The target variable is `Trip_Price`
- Rows with missing target values will be removed, as decided during the EDA

## Regression Models

Since this is a regression problem, the following models will be evaluated:
1. Linear Regression
2. KNN Regression
3. Random Forest Regression

The models will be compared using the following evaluation metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)

The model with the best overall performance will be selected for the final prediction task.

   

## Let's start with the cleaning part of the row that has nan valors.
I will use the function clean_data defined in the data_processing.py in the backend folder

In [33]:
from taxipred.backend.data_processing import load_csv, clean_data 
df = load_csv("taxi_trip_pricing.csv")
df.head()

Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.8,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.7,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.618


In [34]:
df_clean = clean_data(df)

In [35]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 562 entries, 0 to 998
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       562 non-null    float64
 1   Time_of_Day            562 non-null    object 
 2   Day_of_Week            562 non-null    object 
 3   Passenger_Count        562 non-null    float64
 4   Traffic_Conditions     562 non-null    object 
 5   Weather                562 non-null    object 
 6   Base_Fare              562 non-null    float64
 7   Per_Km_Rate            562 non-null    float64
 8   Per_Minute_Rate        562 non-null    float64
 9   Trip_Duration_Minutes  562 non-null    float64
 10  Trip_Price             562 non-null    float64
dtypes: float64(7), object(4)
memory usage: 52.7+ KB


In [36]:
df_clean.shape, df.shape

((562, 11), (1000, 11))

## Categorical Feature Encoding

The dataset contains several categorical features such as `Time_of_Day`, `Day_of_Week`, `Traffic_Conditions`, and `Weather`.
Since regression models require numerical inputs, these features are encoded using one-hot encoding (`get_dummies`).

This allows the model to use all available information without introducing artificial numerical ordering.

In [37]:
import pandas as pd 

df_encoded = pd.get_dummies(df_clean, drop_first=True)

df_encoded.head()

Unnamed: 0,Trip_Distance_km,Passenger_Count,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price,Time_of_Day_Evening,Time_of_Day_Morning,Time_of_Day_Night,Day_of_Week_Weekend,Traffic_Conditions_Low,Traffic_Conditions_Medium,Weather_Rain,Weather_Snow
0,19.35,3.0,3.56,0.8,0.32,53.82,36.2624,False,True,False,False,True,False,False,False
2,36.87,1.0,2.7,1.21,0.15,37.27,52.9032,True,False,False,True,False,False,False,False
5,8.64,2.0,2.55,1.71,0.48,89.33,60.2028,False,False,False,True,False,True,False,False
12,41.79,3.0,4.6,1.77,0.11,86.95,88.1328,False,False,True,True,False,False,False,False
14,9.91,2.0,2.32,1.26,0.34,41.72,28.9914,True,False,False,False,False,False,False,False


In [38]:
df_encoded.shape


(562, 15)

In [39]:
df_encoded.isnull().sum().sum()


np.int64(0)

In [40]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 562 entries, 0 to 998
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Trip_Distance_km           562 non-null    float64
 1   Passenger_Count            562 non-null    float64
 2   Base_Fare                  562 non-null    float64
 3   Per_Km_Rate                562 non-null    float64
 4   Per_Minute_Rate            562 non-null    float64
 5   Trip_Duration_Minutes      562 non-null    float64
 6   Trip_Price                 562 non-null    float64
 7   Time_of_Day_Evening        562 non-null    bool   
 8   Time_of_Day_Morning        562 non-null    bool   
 9   Time_of_Day_Night          562 non-null    bool   
 10  Day_of_Week_Weekend        562 non-null    bool   
 11  Traffic_Conditions_Low     562 non-null    bool   
 12  Traffic_Conditions_Medium  562 non-null    bool   
 13  Weather_Rain               562 non-null    bool   
 14 

## Split
The dataset is split into training and test sets (80/20) to evaluate model performance on unseen data.


In [41]:
X = df_encoded.drop(columns=["Trip_Price"])
y = df_encoded["Trip_Price"]


## Train / Test split 

In [42]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state= 42
)
X_train.shape, X_test.shape

((449, 14), (113, 14))

## Now we can start the real model development with the same data for all the models

### Model: Linear Regression


In [43]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

lin_reg.fit(X_train, y_train)

print(f"Parameters: {lin_reg.coef_}")
print(f"Intercept parameter: {lin_reg.intercept_}")

Parameters: [ 1.7473637  -0.38362902 -0.50938979 25.34405718 60.67285456  0.29936423
 -1.94039299  2.06511074 -0.67438288  1.81845669 -4.44546331 -4.33718321
 -1.40973211  1.21214312]
Intercept parameter: -52.09238969012227


In [44]:
y_pred_lr = lin_reg.predict(X_test)

In [45]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae_lr = mean_absolute_error(y_test , y_pred_lr) 
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)

mae_lr, mse_lr, rmse_lr

(9.485349730081861, 289.9922871891947, np.float64(17.029159908497974))

The MAE (9.48) and RMSE (17.02) show a relatively large gap between them.
This suggests that larger errors have a significant impact on the model evaluation.

As already observed in the EDA, the relationship between features and the target variable may not be purely linear.
Therefore, Linear Regression may not be the most suitable model for this dataset.

Let's move on with other regression models, and the final model will be selected based on their overall performance for the application.


### Model: KNN


For perform the knn we have to scale our data, knn is a model that evaluate the similarity of data and if the valor are not scaled the metod will consider the larger number more important.

In [46]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [47]:
X_train_scaled = scaler.fit_transform(X_train) # fit only train data
X_test_scaled = scaler.fit_transform(X_test)


Now we can proced with KNN 

In [48]:
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_scaled, y_train)

y_pred_knn = knn_reg.predict(X_test_scaled)

In [49]:
mae_knn = mean_absolute_error(y_test, y_pred_knn)
mse_knn = mean_squared_error(y_test, y_pred_knn)
rmse_knn = np.sqrt(mse_knn)

mae_knn, mse_knn, rmse_knn


(16.723369167583474, 958.7952545308307, np.float64(30.96441916992519))

## KNN Regression Results

Despite applying feature scaling, KNN Regression shows higher error values compared to Linear Regression.
We will move on with the Random forest regression and we will see if it will be the method that have the best resault.