### Load Cleaned Datasets

In this step, we load the cleaned and processed datasets produced during the EDA phase:

- **df_train**: Used for training and evaluating machine learning models  
- **df_predict**: Contains trips without a known price and will be used for final predictions

A successful load confirms that the files exported from the EDA phase exist and can be parsed; their contents are validated in the sanity check that follows.


In [1]:
import pandas as pd

df_train = pd.read_csv("../data/df_train.csv")
df_predict = pd.read_csv("../data/df_predict.csv")

print(f"✅ Success: Training data loaded with {df_train.shape[0]} rows.")
print(f"✅ Success: Prediction data loaded with {df_predict.shape[0]} rows.")

✅ Success: Training data loaded with 916 rows.
✅ Success: Prediction data loaded with 32 rows.


### Sanity Check: Training Data Validation

Before starting model training, we perform a final sanity check on the cleaned training dataset.  
The purpose of this step is to ensure that:

- The data has been loaded correctly
- The dataset structure matches expectations
- There are no remaining missing values that could break model training
- All features are ready for use in a machine learning pipeline

This validation step helps confirm that the output from the EDA and data cleaning phase is reliable and suitable for modeling.


In [None]:
display(df_train.head())
df_train.info()  # info() prints its summary directly and returns None, so it is not wrapped in display()

print("\nMissing values in training data:")
print(df_train.isna().sum())
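
The checks listed above can also be expressed as code that fails loudly instead of relying on visual inspection. The following is a minimal sketch under stated assumptions: the helper `validate_training_frame`, the toy frame, and the column name `Trip_Distance_km` are hypothetical, and "ready for use" is assumed to mean numeric or boolean dtypes.

```python
import pandas as pd

def validate_training_frame(df, target_col):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df.empty:
        problems.append("DataFrame is empty")
    if target_col not in df.columns:
        problems.append(f"Missing target column: {target_col}")
    # Report every column that still contains missing values.
    missing = df.isna().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{col}: {count} missing value(s)")
    # Assumption: model-ready features are numeric or boolean.
    for col in df.select_dtypes(exclude=["number", "bool"]).columns:
        problems.append(f"{col}: non-numeric dtype {df[col].dtype}")
    return problems

# Hypothetical toy frame standing in for df_train.
toy = pd.DataFrame({
    "Trip_Distance_km": [3.2, None, 7.5],
    "Trip_Price_log": [2.1, 2.8, 3.0],
})
print(validate_training_frame(toy, target_col="Trip_Price_log"))
```

In the notebook itself, the same helper could be called on `df_train` and followed by an `assert` that the returned list is empty.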


### Define Features and Target Variable

In this step, we separate the dataset into:

- **Features (X):** All columns except the target; the untransformed `Trip_Price` is dropped as well so that it cannot leak the answer to the model
- **Target (y):** The variable we want to predict

We use a log-transformed version of the trip price (`Trip_Price_log`) as the target variable to reduce skewness and improve model stability.
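
As a rough illustration of why a log target helps, the sketch below applies a natural-log transform to a small made-up set of right-skewed prices, compares skewness before and after, and maps the values back with `np.exp`. The prices and the `skewness` helper are hypothetical, and the use of `np.log` (rather than, say, `np.log1p`) is an assumption about how `Trip_Price_log` was built during EDA.

```python
import numpy as np

# Hypothetical right-skewed prices; not taken from the dataset.
prices = np.array([12.0, 15.0, 18.0, 22.0, 25.0, 30.0, 45.0, 80.0, 150.0])

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

log_prices = np.log(prices)  # assumed transform; the EDA step may differ

print(f"Skewness of raw prices: {skewness(prices):.3f}")
print(f"Skewness of log prices: {skewness(log_prices):.3f}")

# Predictions made on the log scale are mapped back with the inverse transform.
recovered = np.exp(log_prices)
```

Whatever transform was actually used, its exact inverse has to be applied to model predictions before they are reported as prices.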


In [5]:
X = df_train.drop(columns=['Trip_Price', 'Trip_Price_log'])
y = df_train['Trip_Price_log']

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Feature matrix shape: (916, 14)
Target vector shape: (916,)


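### Train–Test Split

We hold out 20% of the rows as a test set (`test_size=0.2`) and fix `random_state=42` so that the split is reproducible.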

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"{X_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (732, 14)
X_test.shape = (184, 14)
y_train.shape = (732,)
y_test.shape = (184,)
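
The 732/184 split above is not exactly 80/20 because 916 × 0.2 is not a whole number. The arithmetic below reproduces the reported sizes under the assumption that the test count is rounded up and the remaining rows go to training, which is consistent with the output shown.

```python
import math

n_samples = 916   # rows in df_train, from the load step
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(f"{n_train = }, {n_test = }")
```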


### Baseline Model (Median Predictor)

Before testing machine learning models, we establish a simple baseline.
The baseline predicts the **median** value of the target variable for all trips.

Any trained model must outperform this baseline to be considered useful.

In [19]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

baseline_pred = np.full(shape=len(y_test), fill_value=np.median(y_train))

mae = mean_absolute_error(y_test, baseline_pred)
rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
r2 = r2_score(y_test, baseline_pred)

print("Baseline (Median) performance on log target:")
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")


Baseline (Median) performance on log target:
MAE:  0.3770
RMSE: 0.4984
R2:   -0.0255
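
The slightly negative R² is expected for a constant predictor. R² compares the model's squared error with that of always predicting the mean of the test targets, so any other constant, such as the training median, scores at or below zero. The sketch below shows this with made-up numbers; the toy arrays and the `r2` helper are hypothetical and only mirror the standard definition.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical log-price targets; not taken from the dataset.
y_train_toy = np.array([2.0, 2.4, 2.9, 3.1, 3.3, 4.2])
y_test_toy = np.array([2.2, 2.7, 3.0, 3.9])

# Constant predictions: the training median versus the test mean.
median_pred = np.full(len(y_test_toy), np.median(y_train_toy))
mean_pred = np.full(len(y_test_toy), y_test_toy.mean())

r2_median = r2(y_test_toy, median_pred)
r2_test_mean = r2(y_test_toy, mean_pred)
print(f"R2 of training-median predictor: {r2_median:.4f}")
print(f"R2 of test-mean predictor:       {r2_test_mean:.4f}")
```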


## Linear Regression Model

Linear Regression is used as the first machine learning model due to its simplicity
and interpretability. It serves as a reference point before applying more complex models.
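
One reason a linear model on a log target is easy to interpret is that each coefficient reads as an approximate percentage effect on price. The sketch below fits ordinary least squares with NumPy on synthetic data to show the idea; the `distance_km` feature, the true slope of 0.05, and the noise level are all made up, and a natural-log target is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 30, size=200)  # hypothetical feature
log_price = 1.5 + 0.05 * distance_km + rng.normal(0, 0.1, size=200)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones_like(distance_km), distance_km])
(intercept, slope), *_ = np.linalg.lstsq(design, log_price, rcond=None)

# On a natural-log target, exp(slope) - 1 is the relative price change per unit.
pct_per_km = (np.exp(slope) - 1) * 100
print(f"slope = {slope:.4f} -> about {pct_per_km:.1f}% price change per extra km")
```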


### Train the model

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)  # fit only; predictions are generated in the evaluation cell

### Predict & evaluate

In [18]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

lr_preds = lr_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, lr_preds))
mae = mean_absolute_error(y_test, lr_preds)
r2 = r2_score(y_test, lr_preds)

print("--- Linear Regression ---")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
print(f"R2:   {r2:.4f}")


--- Linear Regression ---
RMSE: 0.3003
MAE:  0.2399
R2:   0.6277


### Linear Regression Interpretation

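Linear Regression clearly outperforms the median baseline on the held-out test set: MAE falls from 0.3770 to 0.2399 and RMSE from 0.4984 to 0.3003, both measured on the log scale.
An R² of 0.6277 means the model explains roughly 63% of the variance in the log-transformed trip price, compared with essentially none for the baseline.

This suggests that the engineered features capture meaningful pricing patterns, while leaving room for more complex models to reduce the remaining error.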
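
Because these metrics are computed on the log target, they are easier to read once converted to multiplicative terms. The sketch below exponentiates the two reported MAE values to get a rough "typical error factor" on the price scale. It assumes a natural-log target (with `log1p` the factor would apply to price + 1), and it is not the same as an MAE computed directly on prices.

```python
import math

# MAE values reported above, measured on the log-transformed target.
baseline_mae_log = 0.3770
lr_mae_log = 0.2399

for name, mae_log in [("Baseline", baseline_mae_log), ("Linear Regression", lr_mae_log)]:
    factor = math.exp(mae_log)  # assumes a natural-log target
    print(f"{name}: typical error factor ~{factor:.3f} (about {100 * (factor - 1):.0f}% off)")

# Relative improvement of Linear Regression over the baseline, on the log scale.
relative_mae_reduction = (baseline_mae_log - lr_mae_log) / baseline_mae_log
print(f"Relative MAE reduction on the log scale: {relative_mae_reduction:.1%}")
```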
