### Load Cleaned Datasets

In this step, we load the cleaned and processed datasets produced during the EDA phase:

- **df_train**: Used for training and evaluating machine learning models  
- **df_predict**: Contains trips without a known price and will be used for final predictions

Successful loading confirms that the EDA phase wrote both files to the location this notebook expects and that they can be parsed as CSV.


In [None]:
import pandas as pd

df_train = pd.read_csv('../data/df_train.csv')
df_predict = pd.read_csv('../data/df_predict.csv')

print(f'✅ Success: Training data loaded with {df_train.shape[0]} rows.')
print(f'✅ Success: Prediction data loaded with {df_predict.shape[0]} rows.')

### Sanity Check: Training Data Validation

Before starting model training, we perform a final sanity check on the cleaned training dataset.  
The purpose of this step is to ensure that:

- The data has been loaded correctly
- The dataset structure matches expectations
- There are no remaining missing values that could break model training
- All features are ready for use in a machine learning pipeline

This validation step helps confirm that the output from the EDA and data cleaning phase is reliable and suitable for modeling.


In [None]:
display(df_train.head())
df_train.info()  # info() prints its summary itself and returns None, so it is not wrapped in display()

print('\nMissing values in training data:')
print(df_train.isna().sum())
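
The checks listed above can also be collected into one small helper, so that a problem shows up as a short report rather than having to be spotted by eye. The following is a minimal sketch: `check_model_ready` is a hypothetical helper, and the toy frame with its column names (`Trip_Distance_km`, `Weather`) is made up for illustration and not taken from the project data.

```python
import pandas as pd


def check_model_ready(df: pd.DataFrame) -> dict:
    """Summarise issues that would typically break scikit-learn estimators."""
    missing = df.isna().sum()
    # Note: boolean columns are not counted as "number" by pandas,
    # so they would also be listed here.
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "n_rows": len(df),
        "columns_with_missing": missing[missing > 0].index.tolist(),
        "non_numeric_columns": non_numeric,
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for df_train (values are invented for illustration).
toy = pd.DataFrame({
    "Trip_Distance_km": [5.0, 12.5, None],
    "Weather": ["Clear", "Rain", "Clear"],
    "Trip_Price_log": [2.1, 3.0, 2.7],
})
report = check_model_ready(toy)
print(report)
```

In the notebook, the same helper could be called as `check_model_ready(df_train)`; empty lists for the missing and non-numeric entries would suggest the frame is ready for modeling.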

### Define Features and Target Variable

In this step, we separate the dataset into:

- **Features (X):** All input columns except the target. Both `Trip_Price` and `Trip_Price_log` are dropped, so the raw price cannot leak into the features
- **Target (y):** The variable we want to predict

We use a log-transformed version of the trip price (`Trip_Price_log`) as the target variable to reduce skewness and improve model stability.


In [None]:
X = df_train.drop(columns=['Trip_Price', 'Trip_Price_log'])
y = df_train['Trip_Price_log']

print('Feature matrix shape:', X.shape)
print('Target vector shape:', y.shape)
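
To illustrate why a log-transformed target helps, the sketch below compares skewness before and after the transform on synthetic, right-skewed values and shows the matching inverse. It assumes the EDA phase used the natural log (`np.log`); if `np.log1p` was used instead, the inverse would be `np.expm1`. The data is randomly generated and is not the project's trip data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (not the project's data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=5_000))

log_prices = np.log(prices)     # assumed forward transform
recovered = np.exp(log_prices)  # matching inverse, back on the price scale

print(f"Skewness before log: {prices.skew():.2f}")
print(f"Skewness after log:  {log_prices.skew():.2f}")
print("Round trip recovers prices:", np.allclose(recovered, prices))
```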

### Train–Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f'{X_train.shape = }')
print(f'{X_test.shape = }')
print(f'{y_train.shape = }')
print(f'{y_test.shape = }')

### Feature Scaling (StandardScaler)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('Scaled shapes:', X_train_scaled.shape, X_test_scaled.shape)

### Baseline Model (Median Predictor)

Before testing machine learning models, we establish a simple baseline.
The baseline predicts the **median** value of the target variable for all trips.

Any trained model must outperform this baseline to be considered useful.

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

baseline_pred = np.full(shape=len(y_test), fill_value=np.median(y_train))

mae = mean_absolute_error(y_test, baseline_pred)
rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
r2 = r2_score(y_test, baseline_pred)

print('Baseline (Median) performance on log target:')
print(f'MAE:  {mae:.4f}')
print(f'RMSE: {rmse:.4f}')
print(f'R2:   {r2:.4f}')
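
The same median baseline can also be expressed with scikit-learn's `DummyRegressor`, which gives it the usual `fit`/`predict` interface so it can be dropped into the same evaluation code as the real models. This is a sketch on synthetic arrays; the `*_demo` names are stand-ins for `X_train`, `y_train`, and `X_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic stand-ins for X_train, y_train and X_test.
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(100, 3))
y_train_demo = rng.normal(loc=3.0, size=100)
X_test_demo = rng.normal(size=(20, 3))

# strategy="median" ignores the features and always predicts the training median.
dummy = DummyRegressor(strategy="median")
dummy.fit(X_train_demo, y_train_demo)
dummy_pred = dummy.predict(X_test_demo)

print("Matches np.median(y_train):", np.allclose(dummy_pred, np.median(y_train_demo)))
```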


## Linear Regression Model

Linear Regression is used as the first machine learning model due to its simplicity
and interpretability. It serves as a reference point before applying more complex models.


### Train the model

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

### Predict & evaluate

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_preds = lr_model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_preds))
mae = mean_absolute_error(y_test, y_preds)
r2 = r2_score(y_test, y_preds)

print('--- Linear Regression ---')
print(f'RMSE: {rmse:.4f}')
print(f'MAE:  {mae:.4f}')
print(f'R2:   {r2:.4f}')
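
The metrics above are on the log scale, which makes them hard to read as prices. A possible complement, sketched here, is to back-transform both the true and predicted values and report the errors in price units. `price_scale_errors` is a hypothetical helper, it assumes the target was created with the natural log, and the demo values are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def price_scale_errors(y_true_log, y_pred_log):
    """Back-transform log-scale values with exp and return (MAE, RMSE) in price units."""
    y_true = np.exp(np.asarray(y_true_log))
    y_pred = np.exp(np.asarray(y_pred_log))
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae, rmse


# Invented log-scale values for illustration only.
demo_true_log = np.log([10.0, 20.0, 40.0])
demo_pred_log = np.log([12.0, 18.0, 40.0])
demo_mae, demo_rmse = price_scale_errors(demo_true_log, demo_pred_log)
print(f"MAE (price units):  {demo_mae:.4f}")
print(f"RMSE (price units): {demo_rmse:.4f}")
```

In the notebook this could be called as `price_scale_errors(y_test, y_preds)`.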


### Final Predictions on Unseen Data

After training and evaluating the Linear Regression model using a train–test split on the training dataset, the model is applied to a separate prediction dataset (`df_predict`).

This dataset does not contain target values and is therefore not used for model evaluation. Instead, it represents unseen data for which the trained model generates final trip price predictions.

Predictions are produced on the log-transformed scale and then converted back to the original price scale before being stored in the dataset.


In [None]:
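# The model was trained on scaled features, so apply the same fitted scaler first.
X_predict_scaled = scaler.transform(df_predict[X.columns])
predict_log = lr_model.predict(X_predict_scaled)
# Invert the log transform (assumes a natural log; use np.expm1 if log1p was used in EDA).
predict_price = np.exp(predict_log)
df_predict['Trip_Price_pred'] = predict_price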
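
Because the Linear Regression model was fitted on scaled features, any new data has to pass through the same fitted scaler before prediction. One way to make that step automatic, sketched here under assumptions, is to bundle the scaler and the model in a scikit-learn pipeline. The data and the column names (`distance`, `duration`) are synthetic, and the target is assumed to be a natural log of the price.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "distance": rng.uniform(1, 50, size=200),
    "duration": rng.uniform(5, 120, size=200),
})
# Noise-free linear target on the (assumed) log-price scale.
y_demo_log = 0.03 * X_demo["distance"] + 0.01 * X_demo["duration"] + 1.0

# The pipeline fits the scaler and the model together and reuses
# the fitted scaler on every later predict() call.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_demo, y_demo_log)

new_trips = pd.DataFrame({"distance": [10.0], "duration": [30.0]})
pred_price = np.exp(pipe.predict(new_trips))  # back to the price scale
print(pred_price)
```

With this design, unscaled frames such as `df_predict[X.columns]` can be passed to `predict` directly, which removes the risk of forgetting the scaling step.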

### Linear Regression Interpretation

Linear Regression clearly outperformed the median baseline on the test split.
A constant predictor cannot score above R² = 0 on the test set, so the model's positive
R² shows that it explains a substantial portion of the variance in the log-transformed
trip price that the baseline misses.

This suggests that the engineered features capture meaningful pricing patterns.


### Random Forest Regressor

Random Forest is evaluated to capture non-linear relationships and feature interactions.
Because it is tree-based, it does not require feature scaling, so we train it on the original
(unscaled) feature values.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

X_train_rf = X_train
X_test_rf = X_test

rf_model = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_rf, y_train)
rf_preds = rf_model.predict(X_test_rf)

mae = mean_absolute_error(y_test, rf_preds)
rmse = np.sqrt(mean_squared_error(y_test, rf_preds))
r2 = r2_score(y_test, rf_preds)

print('--- Random Forest ---')
print(f'MAE:  {mae:.4f}')
print(f'RMSE: {rmse:.4f}')
print(f'R2:   {r2:.4f}')


### Random Forest Interpretation

Compared to Linear Regression, Random Forest achieved lower error metrics
(MAE and RMSE) and a higher R² score on the same test split.

This indicates that non-linear relationships and feature interactions, which a
linear model cannot represent, contribute to predictive performance.
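
A single train–test split can favour one model by chance. As a possible robustness check, the sketch below compares both model types with 5-fold cross-validation. It runs on synthetic data from `make_regression`, which is linear by construction, so the ranking it prints says nothing about the trip-price data. The smaller `n_estimators=50` is chosen here only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for X and y.
X_cv, y_cv = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

models = {
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_mae = {}
for name, model in models.items():
    # scikit-learn maximises scores, so MAE is reported as a negative value.
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()
    print(f"{name}: MAE {cv_mae[name]:.3f} (+/- {scores.std():.3f})")
```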


### Final Model Selection

Linear Regression was used as a reference model due to its simplicity and
interpretability. Random Forest achieved the best overall performance with
lower error metrics and a higher R² score.

Therefore, Random Forest was selected as the final model.


In [None]:
from sklearn.ensemble import RandomForestRegressor
import joblib


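# Refit on all labelled rows: the test split has already served its evaluation purpose.
rf_final = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)
rf_final.fit(X, y)

# Persist the fitted model so it can be loaded outside this notebook.
joblib.dump(rf_final, '../backend/random_forest_model.joblib')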
print('Saved model to: ../backend/random_forest_model.joblib')