# Regression Modeling — Predicting Uber Ride Fare Amount

This notebook builds machine learning regression models to predict the fare amount of Uber rides, using the cleaned and processed dataset created in the previous notebook (`Data_Preprocessing_and_Exploration_Gouthami`).

### Key Steps:
1. Load processed training and test datasets  
2. Select suitable regression models  
3. Train models using the training data  
4. Evaluate model performance using multiple metrics  
5. Fine-tune the best model using hyperparameter tuning  
6. Select the final model based on evaluation results  


### Step 1: Import Required Libraries

We import all necessary libraries for:
- Regression models  
- Model evaluation  
- Hyperparameter tuning  


In [1]:
# Data handling
import pandas as pd
import numpy as np

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")


### Step 2: Load the Prepared Training and Testing Data

These files were generated in Notebook 1.

We load:
- `X_train_processed.csv`  
- `X_test_processed.csv`  
- `y_train.csv`  
- `y_test.csv`  


In [2]:
# Load processed datasets
X_train = pd.read_csv("X_train_processed.csv")
X_test = pd.read_csv("X_test_processed.csv")
y_train = pd.read_csv("y_train.csv")
y_test = pd.read_csv("y_test.csv")

# Convert y values to 1D arrays
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

X_train.head()

|   | passenger_count | pickup_hour | pickup_dayofweek | distance_km | time_of_day_Evening | time_of_day_Morning | time_of_day_Night | distance_category_Medium | distance_category_Long |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.527292 | -0.227766 | -1.565663 | -0.045001 | False | False | False | True | False |
| 1 | -0.528878 | 0.539294 | -1.052182 | -0.054485 | True | False | False | False | False |
| 2 | -0.528878 | -1.915299 | 1.001743 | -0.054166 | False | False | True | False | False |
| 3 | -0.528878 | -0.074354 | 1.001743 | -0.053925 | False | False | False | False | False |
| 4 | -0.528878 | 0.99953 | -0.538701 | -0.052404 | True | False | False | False | False |
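
The `ravel()` step above is needed because `pd.read_csv` returns a two-dimensional DataFrame even when the file holds a single target column, and some scikit-learn estimators emit a shape warning when given a column-vector target. A minimal sketch of the shape change, using a made-up in-memory CSV with a hypothetical `fare_amount` column as a stand-in for `y_train.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for y_train.csv: one target column, three rows.
csv_text = "fare_amount\n7.5\n12.0\n5.3\n"

y_frame = pd.read_csv(io.StringIO(csv_text))  # 2D: one column of targets
y_flat = y_frame.values.ravel()               # 1D array of the same values

print("DataFrame shape:", y_frame.shape)
print("Flattened shape:", y_flat.shape)
```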


### Step 3: Define Evaluation Function

This function will calculate:
- Mean Absolute Error (MAE)  
- Mean Squared Error (MSE)  
- Root Mean Squared Error (RMSE)  
- R-squared (R²)

We will use this function for all models.
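
To make the four metrics concrete, here is a small worked example that computes them by hand with NumPy. The fare values are invented purely for illustration and are not taken from the dataset.

```python
import numpy as np

# Invented fares (illustrative only, not from the Uber dataset).
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_pred = np.array([11.0, 10.0, 8.0, 18.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the fare

# R² compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

Because RMSE squares the errors before averaging, the single 3-unit miss weighs more heavily in RMSE than in MAE.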


In [3]:
def evaluate_model(model, X_test, y_test):
    """
    Print and return MAE, MSE, RMSE, and R² for a fitted regression model on the given data.
    """
    predictions = model.predict(X_test)

    mae = mean_absolute_error(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, predictions)

    print(f"MAE:  {mae:.3f}")
    print(f"MSE:  {mse:.3f}")
    print(f"RMSE: {rmse:.3f}")
    print(f"R²:   {r2:.3f}")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


### Step 4: Model 1 — Linear Regression

Linear Regression is the simplest baseline model.  
We use it as a starting point to compare with other models.


In [4]:
# Initialize and train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [5]:
print("Linear Regression Performance:")
lr_results = evaluate_model(lr_model, X_test, y_test)

Linear Regression Performance:
MAE:  3.644
MSE:  41.835
RMSE: 6.468
R²:   0.570


### Step 5: Model 2 — Decision Tree Regression

Decision Trees can capture non-linear relationships and feature interactions that a linear model misses.  
An unconstrained tree can also fit noise in the training data, so we limit its depth with `max_depth=10`.
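
As an illustration of the non-linearity point, the sketch below fits both model types to synthetic V-shaped data. The data, the split, and `max_depth=4` are assumptions chosen for the demonstration and have nothing to do with the Uber dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = |x| plus a little noise, which no straight line can follow.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.abs(X[:, 0]) + rng.normal(0, 0.1, size=500)

# Simple holdout split: first 400 rows for training, last 100 for testing.
X_tr, X_te = X[:400], X[400:]
y_tr, y_te = y[:400], y[400:]

linear = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

linear_r2 = r2_score(y_te, linear.predict(X_te))
tree_r2 = r2_score(y_te, tree.predict(X_te))

print(f"Linear R²: {linear_r2:.3f}")
print(f"Tree R²:   {tree_r2:.3f}")
```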

In [6]:
dt_model = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)

In [7]:
print("Decision Tree Performance:")
dt_results = evaluate_model(dt_model, X_test, y_test)


Decision Tree Performance:
MAE:  2.347
MSE:  24.248
RMSE: 4.924
R²:   0.750


### Step 6: Model 3 — Random Forest Regression

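Random Forest is an ensemble model that:
- Reduces overfitting by averaging the predictions of many trees  
- Handles non-linear patterns  
- Typically provides high accuracy  

Random forests are often among the strongest performers on tabular data of this kind, but that is not guaranteed, so we check it against the other models on the test set.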
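
The overfitting point can be sketched on synthetic data: a single fully grown tree memorizes the training set, while a forest of such trees averages much of that noise away. The target function, noise level, and sample sizes below are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy, non-linear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 1.0, size=600)

X_tr, X_te = X[:450], X[450:]
y_tr, y_te = y[:450], y[450:]

# Both models are left at unlimited depth so the comparison isolates averaging.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_train_r2 = r2_score(y_tr, tree.predict(X_tr))
tree_test_r2 = r2_score(y_te, tree.predict(X_te))
forest_test_r2 = r2_score(y_te, forest.predict(X_te))

print(f"Single tree  train R²: {tree_train_r2:.3f}  test R²: {tree_test_r2:.3f}")
print(f"Forest       test R²:  {forest_test_r2:.3f}")
```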


In [8]:
rf_model = RandomForestRegressor(
    n_estimators=150,
    random_state=42,
    max_depth=None,
    min_samples_split=2
)

rf_model.fit(X_train, y_train)

In [9]:
print("Random Forest Performance:")
rf_results = evaluate_model(rf_model, X_test, y_test)

Random Forest Performance:
MAE:  2.514
MSE:  24.747
RMSE: 4.975
R²:   0.745


### Step 7: Compare Model Performances

Let's compare MAE, MSE, RMSE, and R² for all three models.


In [10]:
results_df = pd.DataFrame({
    "Linear Regression": lr_results,
    "Decision Tree": dt_results,
    "Random Forest": rf_results
})

results_df

|   | Linear Regression | Decision Tree | Random Forest |
|---|---|---|---|
| MAE | 3.643682 | 2.346513 | 2.513832 |
| MSE | 41.83469 | 24.248222 | 24.747246 |
| RMSE | 6.467974 | 4.924248 | 4.97466 |
| R2 | 0.569508 | 0.750478 | 0.745343 |
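
One way to read this table is as an improvement over the linear baseline. The sketch below re-enters the RMSE values reported above and computes each model's RMSE reduction relative to Linear Regression. The variable names are illustrative and are not part of the notebook.

```python
import pandas as pd

# RMSE values copied from the comparison table above.
rmse = pd.Series({
    "Linear Regression": 6.467974,
    "Decision Tree": 4.924248,
    "Random Forest": 4.97466,
})

# Percentage reduction in RMSE relative to the linear baseline.
rmse_reduction_pct = (1 - rmse / rmse["Linear Regression"]) * 100
best_by_rmse = rmse.idxmin()

print(rmse_reduction_pct.round(2))
print("Lowest RMSE:", best_by_rmse)
```

On these numbers, both tree-based models cut RMSE by a little under a quarter, and the depth-limited decision tree edges out the untuned random forest.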


### Step 8: Hyperparameter Tuning for Random Forest

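We use **RandomizedSearchCV**, which:
- Samples a fixed number (`n_iter`) of random hyperparameter combinations  
- Usually runs faster than an exhaustive grid search, because it does not try every combination  
- Often finds settings that perform comparably well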

We tune parameters such as:
- Number of trees  
- Maximum depth  
- Minimum samples per split  
- Minimum samples per leaf  


In [11]:
# Define parameter grid
param_grid = {
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}


In [12]:
# Setup RandomizedSearchCV
rf_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    cv=3,
    random_state=42,
    n_jobs=-1,
    scoring="neg_mean_squared_error"
)
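
To see how much work the randomized search saves, the sketch below counts the combinations in the search space defined above and compares the number of model fits with what an exhaustive grid search would need under the same `cv=3`. `ParameterGrid` is used here only to count combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same search space and settings as in the cells above.
param_grid = {
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_iter = 10
cv = 3

n_combinations = len(ParameterGrid(param_grid))  # every possible combination
grid_search_fits = n_combinations * cv           # exhaustive search cost
randomized_search_fits = n_iter * cv             # randomized search cost

# Both counts exclude the one final refit on the full training set.
print("Combinations in the grid:", n_combinations)
print("Fits for a full grid search:", grid_search_fits)
print("Fits for this randomized search:", randomized_search_fits)
```

With these settings the randomized search samples 10 of the possible combinations, so it may miss the best one. That is the trade-off for the shorter run time.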

In [13]:
# Fit the randomized search
rf_search.fit(X_train, y_train)

In [14]:
rf_search.best_params_

{'n_estimators': 150,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_depth': 10}
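
Alongside `best_params_`, the search object exposes `best_score_`, the mean cross-validated score of the best combination. With `scoring="neg_mean_squared_error"` that score is a negated MSE, so it has to be sign-flipped before taking the square root. The sketch below shows the conversion on a small synthetic problem. The data and the reduced search space are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (illustrative only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5, None]},
    n_iter=3,
    cv=3,
    random_state=0,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# best_score_ is negative MSE: flip the sign, then take the square root.
cv_rmse = np.sqrt(-demo_search.best_score_)

print("Best parameters:", demo_search.best_params_)
print(f"Cross-validated RMSE of best combination: {cv_rmse:.3f}")
```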

### Step 9: Evaluate the Tuned Random Forest Model

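RandomizedSearchCV has already refit a model on the full training set using the best parameters it found (its default `refit=True` behavior). We retrieve that model from `best_estimator_` and evaluate it on the test set.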


In [15]:
best_rf = rf_search.best_estimator_

print("Tuned Random Forest Performance:")
best_rf_results = evaluate_model(best_rf, X_test, y_test)


Tuned Random Forest Performance:
MAE:  2.297
MSE:  22.170
RMSE: 4.708
R²:   0.772


### Step 10: Final Model Selection

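Based on the test-set metrics:

- **Linear Regression** → baseline model (RMSE 6.468, R² 0.570)  
- **Decision Tree** (`max_depth=10`) → clear improvement over the baseline (RMSE 4.924, R² 0.750)  
- **Random Forest (untuned)** → on par with the decision tree (RMSE 4.975, R² 0.745)  
- **Random Forest (tuned)** → best on every metric (MAE 2.297, RMSE 4.708, R² 0.772)  

Therefore, the **tuned Random Forest model** is selected as the final model for predicting fare amounts.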

### Step 11: Save Final Model

We save the final trained model to a file with `joblib` so it can be loaded later without retraining.


In [16]:
import joblib

joblib.dump(best_rf, "final_fare_prediction_model.pkl")

print("Model saved successfully.")


Model saved successfully.
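
Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in model through `joblib.dump` and `joblib.load` and checks that the reloaded copy gives the same predictions. The synthetic data, the tiny forest, and the temporary file name are assumptions for illustration. In practice the path would be `final_fare_prediction_model.pkl`, and the input must have the same columns, in the same order, as `X_train`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, and only from sources you trust.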
