# Understand the nature of the target variable:

Target Variable: Trip_Price (continuous)

Since the target is continuous, this is a regression problem, so we can try a few common models:

- Linear Regression: A simple yet powerful baseline model for continuous targets.
- Random Forest Regressor: A more powerful, non-linear model that can capture complex relationships.
- Gradient Boosting Regressor: Another ensemble model that is typically very strong for regression tasks.


Apply Hyperparameter Tuning:
For the two ensemble models, we'll tune hyperparameters with GridSearchCV to improve their performance (RandomizedSearchCV is an alternative when the grid is large). Linear Regression is fit with its default settings.


Evaluation Metrics:

We will use regression-specific metrics to evaluate the models:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R-squared (R²)
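
To make the three metrics concrete, here is a minimal sketch that computes them by hand on a tiny made-up example. The values in `y_true` and `y_pred` are illustrative and do not come from the taxi dataset, and the helper name `regression_metrics` is hypothetical. scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` should return the same numbers for the same inputs.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R²) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in the same units as the target
    mae = sum(abs(e) for e in errors) / n
    # MSE: average squared error, which penalizes large misses more heavily
    mse = sum(e ** 2 for e in errors) / n

    # R²: 1 minus (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Illustrative values only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.4f}, MSE: {mse:.4f}, R²: {r2:.4f}")
# prints: MAE: 2.5000, MSE: 6.5000, R²: 0.9480
```

Lower is better for MAE and MSE, while higher is better for R².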

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
# Load data
df = pd.read_csv('taxi_trip_pricing.csv')
df.head()

Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.8,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.7,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.618


In [None]:
# Check data types
print("\nData types of columns:")
print(df.dtypes)


Data types of columns:
Trip_Distance_km         float64
Time_of_Day               object
Day_of_Week               object
Passenger_Count          float64
Traffic_Conditions        object
Weather                   object
Base_Fare                float64
Per_Km_Rate              float64
Per_Minute_Rate          float64
Trip_Duration_Minutes    float64
Trip_Price               float64
dtype: object


In [None]:
df.describe()

Unnamed: 0,Trip_Distance_km,Passenger_Count,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
count,950.0,950.0,950.0,950.0,950.0,950.0,951.0
mean,27.070547,2.476842,3.502989,1.233316,0.292916,62.118116,56.874773
std,19.9053,1.102249,0.870162,0.429816,0.115592,32.154406,40.469791
min,1.23,1.0,2.01,0.5,0.1,5.01,6.1269
25%,12.6325,1.25,2.73,0.86,0.19,35.8825,33.74265
50%,25.83,2.0,3.52,1.22,0.29,61.86,50.0745
75%,38.405,3.0,4.26,1.61,0.39,89.055,69.09935
max,146.067047,4.0,5.0,2.0,0.5,119.84,332.043689


In [None]:
# Create dummy features for categorical variables
categorical_cols = ['Time_of_Day', 'Day_of_Week', 'Traffic_Conditions', 'Weather']

# Apply one-hot encoding to categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
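
One detail worth checking at this step: the preview of the data shows a row with a missing `Weather` value, and `pd.get_dummies` does not raise on missing categories. The sketch below uses a small made-up frame (the `toy` DataFrame is an assumption for illustration) to show what happens to such a row, and how the `dummy_na=True` option keeps it distinguishable.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only) with one missing Weather entry
toy = pd.DataFrame({"Weather": ["Clear", "Rain", np.nan, "Snow"]})

# Default behaviour: the NaN row gets 0 in every dummy column, so it looks
# identical to the dropped baseline category ("Clear")
default_enc = pd.get_dummies(toy, columns=["Weather"], drop_first=True).astype(int)

# dummy_na=True adds an explicit indicator column for missing values
flagged_enc = pd.get_dummies(
    toy, columns=["Weather"], drop_first=True, dummy_na=True
).astype(int)

print(default_enc)
print(flagged_enc)
```

This would explain why the null counts for the dummy columns are all zero even though the raw categorical columns contain gaps. Whether to keep an explicit missing-value indicator is a modelling choice; the notebook itself uses the default behaviour.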

In [None]:
# Define target and features
target = 'Trip_Price'
X = df_encoded.drop(columns=[target])
y = df_encoded[target]

# Identify numeric columns in X
numeric_cols = X.select_dtypes(include=['float64', 'int64']).columns.tolist()

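# Initialize the scaler
scaler = StandardScaler()

# Fit and transform numeric features.
# Note: a later cell rebuilds X from df_encoded, so this scaled copy is not
# what the models below are trained on.
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])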


In [None]:
print(X.isnull().sum())

Trip_Distance_km             50
Passenger_Count              50
Base_Fare                    50
Per_Km_Rate                  50
Per_Minute_Rate              50
Trip_Duration_Minutes        50
Time_of_Day_Evening           0
Time_of_Day_Morning           0
Time_of_Day_Night             0
Day_of_Week_Weekend           0
Traffic_Conditions_Low        0
Traffic_Conditions_Medium     0
Weather_Rain                  0
Weather_Snow                  0
dtype: int64


In [None]:
# Define target and features
target = 'Trip_Price'
X = df_encoded.drop(columns=[target])
y = df_encoded[target]

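# Combine X and y for imputation
df_combined = pd.concat([X, y], axis=1)

# Handle missing values (Imputation)
# Apply SimpleImputer to impute missing values with the mean.
# Note: because y is included, missing Trip_Price values are mean-imputed too,
# and the imputer is fit on all rows before the train/test split.
imputer = SimpleImputer(strategy='mean')
df_imputed = imputer.fit_transform(df_combined)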

# Convert the imputed array back to a DataFrame to maintain column names
df_imputed = pd.DataFrame(df_imputed, columns=df_combined.columns)

# Separate X and y after imputation
X_imputed = df_imputed.drop(columns=[target])
y_imputed = df_imputed[target]


# Split data into training and testing sets (20% test, 80% train)
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_imputed, test_size=0.2, random_state=42)
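
The cell above imputes features and target together on the full dataset and then splits. A commonly used alternative, sketched below on synthetic data, is to drop rows whose target is missing, split first, and let a `Pipeline` fit the imputer and scaler on the training rows only. The column names, coefficients, and the `demo` frame are all made up for illustration, so the score printed here says nothing about the taxi data. This is a sketch of the pattern, not the method used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the taxi data (made-up columns and values)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "distance": rng.uniform(1, 50, n),
    "duration": rng.uniform(5, 120, n),
})
demo["price"] = 3 + 1.2 * demo["distance"] + 0.3 * demo["duration"] + rng.normal(0, 2, n)

# Knock out some feature values and some target values
demo.loc[demo.index[::10], "distance"] = np.nan
demo.loc[demo.index[5::25], "price"] = np.nan

# 1) Drop rows whose target is missing instead of imputing the target
demo = demo.dropna(subset=["price"])
X_demo = demo.drop(columns=["price"])
y_demo = demo["price"]

# 2) Split first, so the test rows stay unseen by the preprocessing steps
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 3) Imputer and scaler are fit on the training rows only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(Xd_train, yd_train)

demo_r2 = pipe.score(Xd_test, yd_test)
print(f"Held-out R² on the synthetic data: {demo_r2:.4f}")
```

The same pipeline object can be passed to `GridSearchCV` as the estimator, in which case the preprocessing is refit inside each cross-validation fold.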

# Model 1: Linear Regression

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict and evaluate
lr_pred = lr_model.predict(X_test)
lr_mae = mean_absolute_error(y_test, lr_pred)
lr_mse = mean_squared_error(y_test, lr_pred)
lr_r2 = r2_score(y_test, lr_pred)

print("Linear Regression Results:")
print(f"MAE: {lr_mae:.4f}")
print(f"MSE: {lr_mse:.4f}")
print(f"R-squared: {lr_r2:.4f}")


Linear Regression Results:
MAE: 9.8352
MSE: 193.9021
R-squared: 0.7665


# Model 2: Random Forest Regressor

In [None]:
rf_model = RandomForestRegressor(random_state=42)

# Hyperparameter tuning with GridSearchCV
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error')
rf_grid_search.fit(X_train, y_train)

# Best model
rf_best_model = rf_grid_search.best_estimator_

# Predict and evaluate
rf_pred = rf_best_model.predict(X_test)
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_mse = mean_squared_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)
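
Grid search cost grows multiplicatively with each parameter list, which is worth estimating before running the cell above. The sketch below counts the fits implied by the Random Forest grid; the variable names `cv_folds`, `n_candidates`, and `n_search_fits` are introduced here for illustration.

```python
from math import prod

# Same grid as in the Random Forest cell
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 5

# Number of distinct parameter combinations in the grid
n_candidates = prod(len(values) for values in rf_param_grid.values())
# Each candidate is fit once per cross-validation fold
n_search_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_search_fits} fits")
# prints: 36 candidates x 5 folds = 180 fits
```

`GridSearchCV` then refits the best candidate once on the full training set, since `refit=True` is its default. Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel without changing which candidate is selected.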

# Model 3: Gradient Boosting Regressor

In [None]:
gb_model = GradientBoostingRegressor(random_state=42)

# Hyperparameter tuning with GridSearchCV
gb_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

gb_grid_search = GridSearchCV(estimator=gb_model, param_grid=gb_param_grid, cv=5, scoring='neg_mean_squared_error')
gb_grid_search.fit(X_train, y_train)

# Best model
gb_best_model = gb_grid_search.best_estimator_

# Predict and evaluate
gb_pred = gb_best_model.predict(X_test)
gb_mae = mean_absolute_error(y_test, gb_pred)
gb_mse = mean_squared_error(y_test, gb_pred)
gb_r2 = r2_score(y_test, gb_pred)
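
The introduction mentions RandomizedSearchCV as an alternative to GridSearchCV, but the notebook only uses the latter. As a hedged sketch of what the randomized variant could look like for the same search space, the example below runs on synthetic data from `make_regression`, not on the taxi dataset. The names `X_syn`, `y_syn`, `gb_param_dist`, and `gb_random_search`, as well as `n_iter=5` and `cv=3`, are choices made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (a stand-in, not the taxi dataset)
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Same search space as gb_param_grid
gb_param_dist = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}

# Sample 5 of the 27 combinations instead of trying them all
gb_random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=gb_param_dist,
    n_iter=5,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
gb_random_search.fit(X_syn, y_syn)

print("Best parameters:", gb_random_search.best_params_)
# The scorer is negated MSE, so flip the sign to read it as an MSE
print(f"Best CV MSE: {-gb_random_search.best_score_:.4f}")
```

The trade-off is that a randomized search may miss the best combination in the grid, in exchange for a fixed and usually much smaller number of fits.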

# Model Evaluation

In [None]:
print("Linear Regression Performance:")
print(f"MAE: {lr_mae}, MSE: {lr_mse}, R²: {lr_r2}")

print("\nRandom Forest Performance:")
print(f"MAE: {rf_mae}, MSE: {rf_mse}, R²: {rf_r2}")

print("\nGradient Boosting Performance:")
print(f"MAE: {gb_mae}, MSE: {gb_mse}, R²: {gb_r2}")

Linear Regression Performance:
MAE: 9.835191540739011, MSE: 193.90214139306588, R²: 0.7664853822298097

Random Forest Performance:
MAE: 5.40104341697513, MSE: 59.84833450951979, R²: 0.9279251850610414

Gradient Boosting Performance:
MAE: 4.954309891419601, MSE: 56.611804173952585, R²: 0.9318229096492374
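
MSE is expressed in squared price units, which makes it hard to read next to MAE. Taking the square root gives the root mean squared error (RMSE) in the same units as `Trip_Price`. RMSE is not reported elsewhere in this notebook; the sketch below simply derives it from the MSE values printed above.

```python
import math

# MSE values copied from the printed output
mse_by_model = {
    'Linear Regression': 193.90214139306588,
    'Random Forest': 59.84833450951979,
    'Gradient Boosting': 56.611804173952585,
}

# RMSE is the square root of MSE, expressed in the target's own units
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name} RMSE: {rmse:.2f}")
```

For each model the RMSE is noticeably larger than the MAE, which suggests that a few large errors contribute heavily to the squared-error total.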


# Model Comparison: Choose the Best

In [None]:
models_performance = {
    'Linear Regression': [lr_mae, lr_mse, lr_r2],
    'Random Forest': [rf_mae, rf_mse, rf_r2],
    'Gradient Boosting': [gb_mae, gb_mse, gb_r2]
}

# Convert to DataFrame for better display
performance_df = pd.DataFrame(models_performance, index=["MAE", "MSE", "R²"]).T
print("\nModel Comparison:")
print(performance_df)

# Identifying the best model based on R² (higher is better)
best_model = performance_df['R²'].idxmax()
print(f"\nThe best model based on R² score is: {best_model}")


Model Comparison:
                        MAE         MSE        R²
Linear Regression  9.835192  193.902141  0.766485
Random Forest      5.401043   59.848335  0.927925
Gradient Boosting  4.954310   56.611804  0.931823

The best model based on R² score is: Gradient Boosting
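
The selection above looks at R² only. As a quick cross-check, the sketch below rebuilds the comparison table from the rounded values printed above and picks the best model under each metric, remembering that MAE and MSE are minimized while R² is maximized. The variable names `best_by_mae`, `best_by_mse`, and `best_by_r2` are introduced for this illustration.

```python
import pandas as pd

# Rounded values copied from the comparison table
performance_df = pd.DataFrame(
    {
        'MAE': [9.835192, 5.401043, 4.954310],
        'MSE': [193.902141, 59.848335, 56.611804],
        'R²': [0.766485, 0.927925, 0.931823],
    },
    index=['Linear Regression', 'Random Forest', 'Gradient Boosting'],
)

# Errors: lower is better, so take the row label of the minimum
best_by_mae = performance_df['MAE'].idxmin()
best_by_mse = performance_df['MSE'].idxmin()
# R²: higher is better, so take the row label of the maximum
best_by_r2 = performance_df['R²'].idxmax()

print(best_by_mae, best_by_mse, best_by_r2, sep=' | ')
# prints: Gradient Boosting | Gradient Boosting | Gradient Boosting
```

All three metrics agree on Gradient Boosting here, although its margin over Random Forest (R² of 0.9318 versus 0.9279) comes from a single train-test split and is small.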


# Conclusion

No time series or forecasting is involved here, so a random train-test split is appropriate for this non-sequential data.


My response variable is Trip_Price, which is continuous because it represents a numerical price. This makes the task a regression problem: I am trying to predict a continuous target from features such as trip distance, time of day, and weather conditions.