### Load Cleaned Datasets

In this step, we load the cleaned and processed datasets produced during the EDA phase:

- **df_train**: Used for training and evaluating machine learning models  
- **df_predict**: Contains trips without a known price and will be used for final predictions

A successful load confirms that the files exported at the end of the EDA phase exist and can be parsed. It does not validate their contents; that is done in the sanity check that follows.


In [1]:
import pandas as pd

# Paths are relative to the notebook's working directory
df_train = pd.read_csv("../data/df_train.csv")
df_predict = pd.read_csv("../data/df_predict.csv")

# Report row counts as a quick confirmation that both files were read
print(f"✅ Success: Training data loaded with {df_train.shape[0]} rows.")
print(f"✅ Success: Prediction data loaded with {df_predict.shape[0]} rows.")

✅ Success: Training data loaded with 916 rows.
✅ Success: Prediction data loaded with 32 rows.
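
The loading step can be made a little more defensive. The following is a minimal sketch, not part of the original notebook: `load_csv_checked` and `feature_mismatch` are hypothetical helper names, and the idea that `df_predict` should share every non-target column with `df_train` is an assumption based on the description of the two datasets.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path, min_rows=1):
    """Read a CSV and fail early with a clear message if it is missing or too small."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset not found: {path}")
    df = pd.read_csv(path)
    if len(df) < min_rows:
        raise ValueError(f"{path} has {len(df)} rows, expected at least {min_rows}")
    return df


def feature_mismatch(train, predict, target_cols):
    """Return (missing_in_predict, extra_in_predict) column sets, ignoring targets.

    Assumption: apart from the target columns, both frames should carry
    the same feature columns.
    """
    train_features = set(train.columns) - set(target_cols)
    predict_features = set(predict.columns) - set(target_cols)
    return train_features - predict_features, predict_features - train_features


# Possible usage with the paths from the cell above:
# df_train = load_csv_checked("../data/df_train.csv")
# df_predict = load_csv_checked("../data/df_predict.csv")
# missing, extra = feature_mismatch(df_train, df_predict, ["Trip_Price", "Trip_Price_log"])
```

If either returned set is non-empty, the two files were probably exported with different encodings of the categorical columns, which is worth resolving before any model is fitted.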


### Sanity Check: Training Data Validation

Before starting model training, we perform a final sanity check on the cleaned training dataset.  
The purpose of this step is to ensure that:

- The data has been loaded correctly
- The dataset structure matches expectations
- There are no remaining missing values that could break model training
- All features are numeric, so they can be passed directly to a machine learning pipeline

This validation step helps confirm that the output from the EDA and data cleaning phase is reliable and suitable for modeling.


In [4]:
display(df_train.head())
# info() prints its summary directly and returns None, so it is not wrapped in display()
df_train.info()

print("\nMissing values in training data:")
print(df_train.isna().sum())


| | Trip_Distance_km | Trip_Duration_Minutes | Trip_Price | Trip_Price_log | Time_of_Day_Evening | Time_of_Day_Morning | Time_of_Day_Night | Time_of_Day_Unknown | Day_of_Week_Weekday | Day_of_Week_Weekend | Traffic_Conditions_Low | Traffic_Conditions_Medium | Traffic_Conditions_Unknown | Weather_Rain | Weather_Snow | Weather_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.35 | 53.82 | 36.2624 | 3.617985 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 36.87 | 37.27 | 52.9032 | 3.98719 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 30.33 | 116.81 | 36.4698 | 3.623535 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 8.64 | 22.64 | 15.618 | 2.810486 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 8.64 | 89.33 | 60.2028 | 4.114193 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Trip_Distance_km            916 non-null    float64
 1   Trip_Duration_Minutes       916 non-null    float64
 2   Trip_Price                  916 non-null    float64
 3   Trip_Price_log              916 non-null    float64
 4   Time_of_Day_Evening         916 non-null    float64
 5   Time_of_Day_Morning         916 non-null    float64
 6   Time_of_Day_Night           916 non-null    float64
 7   Time_of_Day_Unknown         916 non-null    float64
 8   Day_of_Week_Weekday         916 non-null    float64
 9   Day_of_Week_Weekend         916 non-null    float64
 10  Traffic_Conditions_Low      916 non-null    float64
 11  Traffic_Conditions_Medium   916 non-null    float64
 12  Traffic_Conditions_Unknown  916 non-null    float64
 13  Weather_Rain                916 non


Missing values in training data:
Trip_Distance_km              0
Trip_Duration_Minutes         0
Trip_Price                    0
Trip_Price_log                0
Time_of_Day_Evening           0
Time_of_Day_Morning           0
Time_of_Day_Night             0
Time_of_Day_Unknown           0
Day_of_Week_Weekday           0
Day_of_Week_Weekend           0
Traffic_Conditions_Low        0
Traffic_Conditions_Medium     0
Traffic_Conditions_Unknown    0
Weather_Rain                  0
Weather_Snow                  0
Weather_Unknown               0
dtype: int64
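
The visual checks above can also be expressed as programmatic checks that fail loudly if a later re-export breaks something. This is a minimal sketch under stated assumptions: `validate_training_frame` is a hypothetical helper, the one-hot prefixes are read off the column names shown above, and the `log1p` relationship between `Trip_Price` and `Trip_Price_log` is inferred from the five displayed rows rather than stated anywhere in the notebook.

```python
import numpy as np
import pandas as pd

# Assumption: these prefixes mark the one-hot encoded columns seen in the output above
DUMMY_PREFIXES = ("Time_of_Day_", "Day_of_Week_", "Traffic_Conditions_", "Weather_")


def validate_training_frame(df, expected_rows=None,
                            target="Trip_Price", log_target="Trip_Price_log"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Optional row-count check (for example, 916 for the training set shown above)
    if expected_rows is not None and len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")

    # No missing values anywhere
    missing = df.isna().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        problems.append(f"missing values in: {list(missing.index)}")

    # Every column must be numeric
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # One-hot columns should only contain 0 and 1
    for col in df.columns:
        if col.startswith(DUMMY_PREFIXES) and not df[col].isin([0, 1]).all():
            problems.append(f"column {col} contains values other than 0 and 1")

    # Assumption: the log target was built as log1p(target)
    if (target in df.columns and log_target in df.columns
            and pd.api.types.is_numeric_dtype(df[target])
            and pd.api.types.is_numeric_dtype(df[log_target])):
        if not np.allclose(np.log1p(df[target]), df[log_target], atol=1e-5):
            problems.append(f"{log_target} does not match log1p({target})")

    return problems


# Possible usage in the notebook:
# problems = validate_training_frame(df_train, expected_rows=916)
# assert not problems, problems
```

Returning a list instead of raising on the first failure shows every problem in one run, which tends to be more convenient when iterating on the cleaning step.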
