## Milestone 5: Advanced Model Development

Objective: Train ML models for dynamic pricing.

Deliverables:

- XGBoost and LightGBM models trained and evaluated.
- Backtesting with historical data.

Evaluation:

- Simulated revenue lift achieved and validated.



In [50]:
import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")  # raw string avoids invalid escape sequences
print(tabulate(df.head(10), headers='keys', tablefmt='psql'))

+----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------+---------------+
|    |   Number_of_Riders |   Number_of_Drivers | Location_Category   | Customer_Loyalty_Status   |   Number_of_Past_Rides |   Average_Ratings | Time_of_Booking   | Vehicle_Type   |   Expected_Ride_Duration |   Historical_Cost_of_Ride |   Unnamed: 10 |
|----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------+---------------|
|  0 |                 90 |                  45 | Urban               | Silver                    |                     13 |              4.47 | Night             | Premium        |                       90 |                   284.257 |     

1. Train XGBoost and LightGBM models:
    - You'll need to prepare your data for training, including feature engineering and splitting into training and testing sets.
    - Train XGBoost and LightGBM models on your training data, tuning hyperparameters as needed.
2. Evaluate model performance:
    - Use metrics such as mean absolute error (MAE) or mean squared error (MSE) to evaluate the performance of both models.
    - Compare the performance of the two models and select the best one.
3. Backtesting with historical data:
    - Use historical data to simulate the performance of your model over time.
    - Evaluate the model's performance on unseen data and calculate the simulated revenue lift.
4. Simulated revenue lift:
    - Calculate the revenue lift achieved by using your dynamic pricing model compared to a baseline (e.g., fixed pricing).
    - Validate the results to ensure they are statistically significant.
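The revenue-lift and validation steps above can be sketched as follows. This is a minimal illustration, not the notebook's method: the function names, the synthetic prices, and the choice of a percentile bootstrap are all assumptions made for the example.

```python
import numpy as np

def revenue_lift_pct(dynamic_prices, baseline_prices):
    """Percentage change in total revenue of dynamic pricing relative to a baseline."""
    dynamic_total = np.sum(dynamic_prices)
    baseline_total = np.sum(baseline_prices)
    return (dynamic_total - baseline_total) / baseline_total * 100.0

def bootstrap_lift_ci(dynamic_prices, baseline_prices, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the lift, resampling rides in pairs."""
    dynamic_prices = np.asarray(dynamic_prices, dtype=float)
    baseline_prices = np.asarray(baseline_prices, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(dynamic_prices)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rides with replacement
        lifts[b] = revenue_lift_pct(dynamic_prices[idx], baseline_prices[idx])
    lower, upper = np.percentile(lifts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Hypothetical data: 200 rides, dynamic prices roughly 5% above the baseline (assumed)
rng = np.random.default_rng(42)
baseline = rng.uniform(100, 500, size=200)
dynamic = baseline * rng.normal(1.05, 0.02, size=200)

lift = revenue_lift_pct(dynamic, baseline)
low, high = bootstrap_lift_ci(dynamic, baseline)
print(f"Revenue lift: {lift:.2f}% (95% CI: {low:.2f}% to {high:.2f}%)")
```

If the interval excludes 0, the lift is unlikely to be a resampling artifact. Note that this only measures price differences on the same rides; it does not model how demand would react to the new prices.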

## Objective: Train ML models for dynamic pricing.

XGBoost documentation: https://xgboost.readthedocs.io/en/stable/tutorials/model.html

# XGBoost (Extreme Gradient Boosting)

XGBoost is a popular open-source library that implements gradient boosting over decision trees to build predictive models. It is particularly effective for:

1. Handling complex relationships: XGBoost can capture non-linear relationships between features, making it suitable for dynamic pricing.
2. Handling large datasets: XGBoost is designed to handle large datasets efficiently.
3. Interpretability: XGBoost provides feature importance scores, helping you understand which factors drive pricing decisions.

# LightGBM (Light Gradient Boosting Machine)

LightGBM is another popular gradient boosting algorithm that's known for its:

1. Speed: LightGBM is optimized for speed and can handle large datasets quickly.
2. Efficiency: LightGBM uses novel techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to reduce computational complexity.
3. Handling categorical features: LightGBM has built-in support for categorical features.

# Why use XGBoost and LightGBM for this dynamic pricing project?

Both libraries:

1. Handle complex relationships between features.
2. Provide high accuracy and efficiency.
3. Offer feature importance insights.

# loading the libraries

In [51]:
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from tabulate import tabulate

# Step 1: Data Preprocessing

- The code loads the dataset from a CSV file.
- It drops the Unnamed: 10 column, which is likely an empty column.
- It encodes categorical variables (Location_Category, Customer_Loyalty_Status, Time_of_Booking, and Vehicle_Type) using LabelEncoder, which converts categorical values into numerical values.


In [52]:
# Load data
import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")  # raw string avoids invalid escape sequences

# Drop the Unnamed column
df = df.drop('Unnamed: 10', axis=1)
print(tabulate(df.head(10), headers='keys', tablefmt='psql'))

+----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------+
|    |   Number_of_Riders |   Number_of_Drivers | Location_Category   | Customer_Loyalty_Status   |   Number_of_Past_Rides |   Average_Ratings | Time_of_Booking   | Vehicle_Type   |   Expected_Ride_Duration |   Historical_Cost_of_Ride |
|----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------|
|  0 |                 90 |                  45 | Urban               | Silver                    |                     13 |              4.47 | Night             | Premium        |                       90 |                   284.257 |
|  1 |                 58 |                  39 | Su

# Encode categorical variables

LabelEncoder converts the categorical variables (Location_Category, Customer_Loyalty_Status, Time_of_Booking, and Vehicle_Type) into integer codes, assigned in sorted (alphabetical) order of the distinct values.

For example, if the Location_Category column has the values "Rural", "Suburban", and "Urban", the encoding is:

- Rural: 0
- Suburban: 1
- Urban: 2
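The mapping above can be checked on a small example. The toy DataFrame below is hypothetical, and keeping one encoder per column in an `encoders` dictionary is a design choice of this sketch (it lets each mapping be inspected or inverted later), not something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data using two of the dataset's column names
toy = pd.DataFrame({
    "Location_Category": ["Urban", "Suburban", "Rural", "Urban"],
    "Vehicle_Type": ["Premium", "Economy", "Premium", "Economy"],
})

# Fit a separate encoder per column and keep it for later inspection
encoders = {}
for col in ["Location_Category", "Vehicle_Type"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ holds the original labels in sorted order; the position is the code
location_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Location_Category"].classes_)
}
print(location_mapping)
print(toy)
```

Because the codes are ordinal integers, a tree model may treat "Urban" (2) as greater than "Rural" (0); whether that ordering is meaningful depends on the feature.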

In [53]:
# Encode categorical variables
import pandas as pd
from tabulate import tabulate
from sklearn.preprocessing import LabelEncoder
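# Note: reloading the CSV brings back the 'Unnamed: 10' column dropped in the previous cell
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")
# One encoder is refit for each column, so only the last column's mapping remains in `le`
le = LabelEncoder()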
df['Location_Category'] = le.fit_transform(df['Location_Category'])
df['Customer_Loyalty_Status'] = le.fit_transform(df['Customer_Loyalty_Status'])
df['Time_of_Booking'] = le.fit_transform(df['Time_of_Booking'])
df['Vehicle_Type'] = le.fit_transform(df['Vehicle_Type'])
print(tabulate(df.head(10), headers='keys', tablefmt='psql'))

+----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------+---------------+
|    |   Number_of_Riders |   Number_of_Drivers |   Location_Category |   Customer_Loyalty_Status |   Number_of_Past_Rides |   Average_Ratings |   Time_of_Booking |   Vehicle_Type |   Expected_Ride_Duration |   Historical_Cost_of_Ride |   Unnamed: 10 |
|----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+---------------------------+---------------|
|  0 |                 90 |                  45 |                   2 |                         2 |                     13 |              4.47 |                 3 |              1 |                       90 |                   284.257 |     

# Define features and target

In machine learning, we need to define two main components:

1. Features (X): These are the input variables that we use to predict the target variable. Features are also known as independent variables or predictors.
2. Target (y): This is the variable that we're trying to predict. The target is also known as the dependent variable or response variable.

In [54]:
# Define features and target
import pandas as pd
from tabulate import tabulate
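# Note: reloading the CSV discards the label encoding applied above,
# so the categorical columns hold their original string values again
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")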

X = df[['Number_of_Riders', 'Number_of_Drivers', 'Location_Category', 'Customer_Loyalty_Status',  
        'Number_of_Past_Rides', 'Average_Ratings', 'Time_of_Booking', 'Vehicle_Type', 'Expected_Ride_Duration']]
y = df['Historical_Cost_of_Ride']

print("Features (X):")
print(tabulate(X.head(10), headers='keys', tablefmt='psql'))

print("\nTarget (y):")
print(tabulate(y.head(10).to_frame(), headers='keys', tablefmt='psql'))



Features (X):
+----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+
|    |   Number_of_Riders |   Number_of_Drivers | Location_Category   | Customer_Loyalty_Status   |   Number_of_Past_Rides |   Average_Ratings | Time_of_Booking   | Vehicle_Type   |   Expected_Ride_Duration |
|----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------|
|  0 |                 90 |                  45 | Urban               | Silver                    |                     13 |              4.47 | Night             | Premium        |                       90 |
|  1 |                 58 |                  39 | Suburban            | Silver                    |                     72 |              4.06 | Eveni

# Split data into training and testing sets

1. Training dataset (train set): the portion of the dataset used to train a machine learning model. The model learns patterns and relationships between the features and target variable from this data.
2. Testing dataset (test set): the portion of the dataset used to evaluate the trained model. The model is tested on this data to see how well it generalizes to new, unseen data.

# How to split data?

A typical split ratio, and the one used here, is:

- 80% for training (train set)
- 20% for testing (test set)

# The parameters used in the code

- X: The feature dataset.
- y: The target dataset.
- test_size=0.2: The proportion of the dataset to be used for testing. In this case, 20% of the data is used for testing and the remaining 80% for training.
- random_state=42: A seed that makes the split reproducible. Running the code again with the same random state produces the same split.
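The effect of `test_size` and `random_state` can be seen on synthetic data. The arrays below are placeholders with the same dimensions as the dataset (1000 rows, 9 features), not the ride data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 samples with 9 features each
X_demo = np.arange(1000 * 9).reshape(1000, 9)
y_demo = np.arange(1000)

# Two splits with the same seed select exactly the same rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)
print("Same test rows on both runs:", np.array_equal(X_te, X_te2))
print("Rows shared by train and test:", len(set(y_tr) & set(y_te)))
```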

In [55]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (800, 9)
X_test shape: (200, 9)
y_train shape: (800,)
y_test shape: (200,)


# Conclusion
- X_train has 800 rows and 9 columns
- X_test has 200 rows and 9 columns
- y_train has 800 rows
- y_test has 200 rows

The specific rows that land in X_train, X_test, y_train, and y_test depend on the data in X and y and on the random_state passed to train_test_split.

In [56]:
from tabulate import tabulate
print("X_train:")
print(tabulate(X_train.head(10),headers='keys',tablefmt='psql'))

print("\nX_test:")
print(tabulate(X_test.head(10),headers='keys',tablefmt='psql'))



X_train:
+-----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+
|     |   Number_of_Riders |   Number_of_Drivers | Location_Category   | Customer_Loyalty_Status   |   Number_of_Past_Rides |   Average_Ratings | Time_of_Booking   | Vehicle_Type   |   Expected_Ride_Duration |
|-----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------|
|  29 |                 57 |                  38 | Suburban            | Silver                    |                     54 |              4.1  | Night             | Premium        |                      131 |
| 535 |                 21 |                   8 | Urban               | Regular                   |                     57 |              4.78 | Night

In [8]:
import pandas as pd
from tabulate import tabulate

print("y_train:")
print(tabulate(y_train.head(10).to_frame(), headers='keys', tablefmt='psql'))

print("\ny_test:")
print(tabulate(y_test.head(10).to_frame(), headers='keys', tablefmt='psql'))

y_train:
+-----+---------------------------+
|     |   Historical_Cost_of_Ride |
|-----+---------------------------|
|  29 |                   537.785 |
| 535 |                   293.56  |
| 695 |                   212.741 |
| 557 |                   638.045 |
| 836 |                   304.526 |
| 596 |                   650.299 |
| 165 |                   731.782 |
| 918 |                   766.919 |
| 495 |                   328.458 |
| 824 |                   383.318 |
+-----+---------------------------+

y_test:
+-----+---------------------------+
|     |   Historical_Cost_of_Ride |
|-----+---------------------------|
| 521 |                   470.269 |
| 737 |                   286.409 |
| 740 |                   552.269 |
| 660 |                   267.74  |
| 411 |                   111.113 |
| 678 |                   359.128 |
| 626 |                   173.887 |
| 513 |                   196.315 |
| 859 |                   555.402 |
| 136 |                   163.215 |
+-----+---

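# StandardScaler?

StandardScaler rescales each numerical feature to have a mean of 0 and a standard deviation of 1.

- This can improve the performance and stability of models that are sensitive to feature magnitudes, such as linear or distance-based models.

# Scale Data?
- Scaling puts all numerical features on a comparable range, so no feature dominates simply because of its units. Tree-based models such as XGBoost and LightGBM split on thresholds and are largely insensitive to this kind of rescaling, so the step matters less for them.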
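As a small check of what StandardScaler computes, the sketch below compares its output with the formula z = (x - mean) / std applied by hand. The rider counts are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rider counts, one feature column
riders = np.array([[90.0], [58.0], [42.0], [89.0], [78.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(riders)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (riders - riders.mean(axis=0)) / riders.std(axis=0)

print("Scaled values:", scaled.ravel())
print("Matches manual z-score:", np.allclose(scaled, manual))
print("Mean after scaling:", scaled.mean())
print("Std after scaling:", scaled.std())
```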

In [57]:
# Scale numerical variables
from sklearn.preprocessing import StandardScaler
from tabulate import tabulate

num_cols = ['Number_of_Riders', 'Number_of_Drivers', 'Number_of_Past_Rides',
            'Average_Ratings', 'Expected_Ride_Duration']

# Work on copies to avoid pandas SettingWithCopyWarning on assignment
X_train, X_test = X_train.copy(), X_test.copy()

scaler = StandardScaler()
# Fit on the training set only, then apply the same transformation to the test set
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
print(tabulate(X_train.head(10), headers='keys', tablefmt='psql'))
print(tabulate(X_test.head(10), headers='keys', tablefmt='psql'))


+-----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------+
|     |   Number_of_Riders |   Number_of_Drivers | Location_Category   | Customer_Loyalty_Status   |   Number_of_Past_Rides |   Average_Ratings | Time_of_Booking   | Vehicle_Type   |   Expected_Ride_Duration |
|-----+--------------------+---------------------+---------------------+---------------------------+------------------------+-------------------+-------------------+----------------+--------------------------|
|  29 |         -0.161036  |           0.568451  | Suburban            | Silver                    |              0.0984401 |         -0.35     | Night             | Premium        |                 0.659185 |
| 535 |         -1.70031   |          -1.00784   | Urban               | Regular                   |              0.202107  |          1.23179  | Night         

# Conclusion
- The scaled values can be read as follows:
    - A value of 0 means the actual value is equal to the mean of the feature.
    - A positive value means the actual value is above the mean.
    - A negative value means the actual value is below the mean.
- For example, a Number_of_Riders value of -0.161036 means the actual number of riders is 0.161036 standard deviations below the mean number of riders.
- The output shows that the numerical features have been scaled, while the categorical columns (Location_Category, Customer_Loyalty_Status, Time_of_Booking, Vehicle_Type) still hold their original string values.
- These features are the inputs used to predict the target, Historical_Cost_of_Ride.

# Step 2: Train XGBoost model
- xgb_model.fit(X_train, y_train): The model is trained on the training data X_train and y_train.
- The XGBoost model learns patterns in the training data to predict continuous values.
- It combines multiple decision trees to improve prediction accuracy and robustness.
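The idea of combining many trees can be sketched with a hand-written boosting loop. This is a simplified illustration of gradient boosting for squared error built on scikit-learn's DecisionTreeRegressor, not XGBoost's actual implementation (which adds regularization, second-order gradients, and histogram-based splits). The data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) * 50 + 300 + rng.normal(0, 5, size=300)

learning_rate = 0.1
n_estimators = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_estimators):
    # For squared error, the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Each new tree corrects a fraction of the remaining error
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.2f}")
print(f"Training MSE after {n_estimators} trees: {final_mse:.2f}")
```

The roles of `learning_rate`, `n_estimators`, and `max_depth` here mirror the hyperparameters of the same names tuned in the grid search.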

In [58]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd
from tabulate import tabulate
from sklearn.model_selection import train_test_split
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")  # raw string avoids invalid escape sequences


X = df[['Number_of_Riders', 'Number_of_Drivers', 'Location_Category', 'Customer_Loyalty_Status',  
        'Number_of_Past_Rides', 'Average_Ratings', 'Time_of_Booking', 'Vehicle_Type', 'Expected_Ride_Duration']]
y = df['Historical_Cost_of_Ride']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
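# Keep only numeric columns and fill NaNs with 0. The CSV was reloaded above, so the four
# categorical columns are still strings and are dropped here, leaving five numeric features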
X_train = X_train.select_dtypes(include=[np.number]).fillna(0)
y_train = np.array(y_train)

# Define the XGBoost model
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    random_state=42,
    tree_method='hist'  # faster
)

# Define a smaller parameter grid
param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'n_estimators': [100, 200],
    'gamma': [0, 0.5],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1],
    'reg_alpha': [0, 0.5],
    'reg_lambda': [1, 1.5]
}

# Perform grid search
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

# Get the best XGBoost model
best_xgb_model = grid_search.best_estimator_

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)


Fitting 5 folds for each of 256 candidates, totalling 1280 fits
Best Parameters:  {'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'reg_alpha': 0.5, 'reg_lambda': 1.5, 'subsample': 1}
Best Score:  0.8488844376071892


# Conclusion
- This code performs a grid search over a range of hyperparameters for an XGBoost regressor, evaluating 256 candidate combinations with 5-fold cross-validation (1280 fits in total).

# Best Parameters

- The output lists the best-scoring value found for each tuned hyperparameter:
- max_depth: 3
- learning_rate: 0.1
- n_estimators: 100
- gamma: 0
- subsample: 1
- colsample_bytree: 1
- reg_alpha: 0.5
- reg_lambda: 1.5

# Best Score

- The output reports the best mean cross-validated score. No scoring argument is passed to GridSearchCV, so the estimator's default score method is used, which for a regressor is R-squared. The best value here is about 0.849.

- Overall, the grid search identifies the best combination among the candidates tried. The Best Score is the average over the held-out validation folds, not the model's performance on the training data or on the test set.
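The candidate count in the grid search log can be reproduced with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. The grid below is copied from the XGBoost cell; the variable names are chosen for this sketch.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the XGBoost grid search: 8 hyperparameters, 2 values each
xgb_param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'n_estimators': [100, 200],
    'gamma': [0, 0.5],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1],
    'reg_alpha': [0, 0.5],
    'reg_lambda': [1, 1.5],
}

cv_folds = 5
n_candidates = len(ParameterGrid(xgb_param_grid))  # 2 ** 8 combinations
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

The count doubles with every added two-valued hyperparameter, which is why the LightGBM grid with nine hyperparameters reaches 512 candidates. To rank candidates by error instead of R-squared, `GridSearchCV` accepts a `scoring` argument such as `'neg_mean_squared_error'`.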

In [66]:
# Train XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', max_depth=5, learning_rate=0.1, n_estimators=100)
xgb_model.fit(X_train, y_train) 


0,1,2
,objective,'reg:squarederror'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,
,device,
,early_stopping_rounds,
,enable_categorical,False


# Conclusion

- The cell above trains a second XGBoost model with hand-picked hyperparameters (max_depth=5, learning_rate=0.1, n_estimators=100) rather than the grid-search result. The trained model is then used to make predictions on the test data X_test:

1. y_pred = xgb_model.predict(X_test):
    - xgb_model.predict() uses the trained model to make predictions on new data.
    - X_test is the test data to predict on.
    - The predicted values are stored in the y_pred variable.

# Train LightGBM Model

In [7]:
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv(r"C:\Users\priya\Downloads\dynamic_pricing.csv")  # raw string avoids invalid escape sequences


X = df[['Number_of_Riders', 'Number_of_Drivers', 'Location_Category', 'Customer_Loyalty_Status',  
        'Number_of_Past_Rides', 'Average_Ratings', 'Time_of_Booking', 'Vehicle_Type', 'Expected_Ride_Duration']]
y = df['Historical_Cost_of_Ride']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
# Keep only numeric columns and fill NaNs with 0 (the string-valued categorical columns are dropped)
X_train = X_train.select_dtypes(include=[np.number]).fillna(0)
y_train = np.array(y_train)

# Define the LightGBM model
lgb_model = lgb.LGBMRegressor(
    objective='regression',
    random_state=42,
    verbosity=-1
)

# Define a smaller parameter grid
param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'n_estimators': [100, 200],
    'num_leaves': [31, 62],
    'min_child_samples': [10, 20],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1],
    'reg_alpha': [0, 0.5],
    'reg_lambda': [0, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(
    estimator=lgb_model,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

# Get the best LightGBM model
best_lgb_model = grid_search.best_estimator_
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)


Fitting 5 folds for each of 512 candidates, totalling 2560 fits
Best Parameters:  {'colsample_bytree': 1, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0, 'reg_lambda': 0.5, 'subsample': 0.8}
Best Score:  0.851712129402436


- The grid search above tunes a LightGBM regression model to predict continuous values.
- The parameters below describe the fixed-hyperparameter LightGBM model fitted in the training cell that follows:
- objective='regression': The model is trained for regression tasks, where the goal is to predict continuous values.
- max_depth=5: The maximum depth of each decision tree in the model.
- learning_rate=0.1: The step size applied to each tree's contribution as the model moves toward a minimum of the loss function.
- n_estimators=100: The number of decision trees in the model.

# What the Model Does:

- The LightGBM model learns patterns in the training data to predict continuous values.
- It combines multiple decision trees to improve prediction accuracy and robustness.

Example Use Cases:

- Predicting house prices based on features like location, size, and number of bedrooms.
- Forecasting energy consumption based on historical data and weather patterns.
- Estimating student grades based on study habits and past performance.

# training lightgbm model

In [67]:
# Train LightGBM model
lgb_model = lgb.LGBMRegressor(objective='regression', max_depth=5, learning_rate=0.1, n_estimators=100)
lgb_model.fit(X_train, y_train)

0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,5
,learning_rate,0.1
,n_estimators,100
,subsample_for_bin,200000
,objective,'regression'
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001



1. XGBoost Model: xgb_pred = xgb_model.predict(X_test)
    - This line uses the trained XGBoost model xgb_model to make predictions on the test data X_test.
    - The predicted values are stored in the xgb_pred variable.

2. LightGBM Model: lgb_pred = lgb_model.predict(X_test)
    - This line uses the trained LightGBM model lgb_model to make predictions on the test data X_test.
    - The predicted values are stored in the lgb_pred variable.

Both models are predicting the same target variable, so xgb_pred and lgb_pred will contain similar values, but they might not be identical due to differences in the models' algorithms and training processes.

# Make predictions

-Prediction (in Machine Learning):

- A prediction is the output generated by a trained machine learning model when it is given new, unseen input data. It represents the model’s estimate or forecast of the target variable based on patterns it learned from the training data.

In [68]:

# Make predictions on the test set
X_test = X_test.select_dtypes(include=[np.number]).fillna(0)

y_pred_xgb = best_xgb_model.predict(X_test)
y_pred_lgb = best_lgb_model.predict(X_test)

# Print the predictions
print("XGBoost Predictions:")
print(y_pred_xgb[:10])
print("\nLightGBM Predictions:")
print(y_pred_lgb[:10])

XGBoost Predictions:
[378.07236 276.9862  649.757   322.0389  115.89743 411.17703 188.82324
 175.78423 402.01846 165.0813 ]

LightGBM Predictions:
[378.38040384 275.70283156 655.39604511 319.4856379  103.52285765
 411.22456842 179.49832834 174.43326996 398.87406511 173.66111028]


In [69]:


print("Difference between XGBoost and LightGBM Predictions:")
print(y_pred_xgb[:10]- y_pred_lgb[:10])

Difference between XGBoost and LightGBM Predictions:
[-0.30804666  1.28337449 -5.63902607  2.55327202 12.37457277 -0.04753595
  9.32491384  1.35095551  3.14439802 -8.57981145]


# Conclusion

- Each model was trained on the training set, where it learned patterns and relationships between the input features and the target variable.

- Given new data (the test set), the model applies what it learned to estimate the target variable.

- The prediction values are the model's estimates of the target variable for each sample in the test data.

- Here, values such as 378.07236, 276.9862, and 649.757 are the XGBoost model's estimates of Historical_Cost_of_Ride for the first test samples. The two models agree closely: across the first ten samples their predictions differ by at most about 12.4.


# Step 4: Evaluate the models

This code evaluates the performance of two machine learning models, XGBoost and LightGBM, using the Mean Squared Error (MSE) metric.



1. xgb_mse = mean_squared_error(y_test, xgb_pred):
    - This line calculates the MSE between the actual values y_test and the predicted values xgb_pred from the XGBoost model.
    - The mean_squared_error function from scikit-learn's metrics module is used to calculate the MSE.

2. lgb_mse = mean_squared_error(y_test, lgb_pred):
    - This line calculates the MSE between the actual values y_test and the predicted values lgb_pred from the LightGBM model.

What is Mean Squared Error (MSE)?

- MSE measures the average squared difference between predicted and actual values.
- Lower MSE values indicate better model performance.

In [74]:
# Make predictions and evaluate the fixed-hyperparameter models (xgb_model, lgb_model)
xgb_pred = xgb_model.predict(X_test)
lgb_pred = lgb_model.predict(X_test)

xgb_mse = mean_squared_error(y_test, xgb_pred)
lgb_mse = mean_squared_error(y_test, lgb_pred)


print(f'XGBoost MSE: {xgb_mse}')
print(f'LightGBM MSE: {lgb_mse}')


XGBoost MSE: 5589.842346661605
LightGBM MSE: 5359.510797742965


# Conclusion
- In this code, we trained and evaluated two popular machine learning models, XGBoost and LightGBM, on a regression task, using the Mean Squared Error (MSE) metric to compare them.

- Both models performed comparably. LightGBM slightly outperformed XGBoost, with a lower MSE (5359.51 versus 5589.84).


# Evaluate the models by MSE, MAE, R2
1. Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values are better.
2. Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values. Lower values are better.
3. R-squared (R2): Measures the proportion of variance in the actual values that's explained by the model. Higher values (closer to 1) are better.
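The three metrics can be computed by hand and compared with scikit-learn's functions. The five value pairs below are the first rows of the Actual and XGBoost Predicted columns printed in this section; using only five rows is a simplification for illustration, so the numbers will not match the full test-set results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# First five actual and XGBoost-predicted ride costs from the results table
y_true = np.array([470.269024, 286.409294, 552.269375, 267.740417, 111.112715])
y_hat = np.array([378.072357, 276.986206, 649.757019, 322.038910, 115.897430])

errors = y_true - y_hat

# MSE: mean of squared errors
mse_manual = np.mean(errors ** 2)
# MAE: mean of absolute errors
mae_manual = np.mean(np.abs(errors))
# R2: 1 minus (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE matches sklearn:", np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print("MAE matches sklearn:", np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print("R2 matches sklearn:", np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MSE is in squared price units and penalizes large misses more heavily, while MAE stays in the same units as the ride cost, which makes it easier to interpret as a typical pricing error.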

In [75]:


# Evaluate the models by MSE, MAE, R2
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

mse_lgb = mean_squared_error(y_test, y_pred_lgb)
mae_lgb = mean_absolute_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)

print("\nXGBoost Model Performance:")
print(f"Mean Squared Error (MSE): {mse_xgb}")
print(f"Mean Absolute Error (MAE): {mae_xgb}")
print(f"R-squared (R2): {r2_xgb}")

print("\nLightGBM Model Performance:")
print(f"Mean Squared Error (MSE): {mse_lgb}")
print(f"Mean Absolute Error (MAE): {mae_lgb}")
print(f"R-squared (R2): {r2_lgb}")

# Compare actual and predicted values
import pandas as pd

results_df = pd.DataFrame({
    'Actual': y_test,
    'XGBoost Predicted': y_pred_xgb,
    'LightGBM Predicted': y_pred_lgb
})

print("\nActual vs Predicted Values:")
print(results_df)


XGBoost Model Performance:
Mean Squared Error (MSE): 6059.592500541319
Mean Absolute Error (MAE): 59.98992872841108
R-squared (R2): 0.833805291864141

LightGBM Model Performance:
Mean Squared Error (MSE): 5914.380073170836
Mean Absolute Error (MAE): 58.657987203658465
R-squared (R2): 0.8377879915229681

Actual vs Predicted Values:
         Actual  XGBoost Predicted  LightGBM Predicted
521  470.269024         378.072357          378.380404
737  286.409294         276.986206          275.702832
740  552.269375         649.757019          655.396045
660  267.740417         322.038910          319.485638
411  111.112715         115.897430          103.522858
..          ...                ...                 ...
408  258.014460         345.306519          338.659094
332  412.255607         379.407349          388.505542
208  552.639771         510.483185          518.104519
613  544.602781         553.768860          561.574247
78   151.580404         167.137924          162.950529

[200 

# Conclusion
- I compared the performance of XGBoost and LightGBM models on a regression task. After training and tuning both models, I evaluated them using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2).

- Both models performed well, with R2 scores above 0.83.

- LightGBM slightly outperformed XGBoost, with a lower MSE (5914.38 versus 6059.59), a lower MAE (58.66 versus 59.99), and a higher R2 (0.838 versus 0.834). These figures are for the grid-search best estimators, which is why they differ from the MSE values of the fixed-hyperparameter models reported earlier.







# Step 5: Backtesting with historical data

- Backtesting evaluates the performance of a predictive model or a trading strategy using historical data.

- It simulates the model's predictions or the strategy's trades on past data to assess its potential performance in real-world scenarios.
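When rides are recorded in chronological order, a backtest usually trains on earlier rides and tests on later ones. The sketch below shows such a walk-forward split with scikit-learn's `TimeSeriesSplit`. It assumes the rows are time-ordered, which the dataset shown here does not confirm (no timestamp column is visible), so treat it as an illustration of the technique rather than a description of this notebook's backtest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder for 1000 rides, assumed to be sorted from oldest to newest
n_rides = 1000
ride_index = np.arange(n_rides)

splitter = TimeSeriesSplit(n_splits=4)

folds = []
for fold, (train_idx, test_idx) in enumerate(splitter.split(ride_index), start=1):
    folds.append((train_idx, test_idx))
    # Every training window ends before its test window begins
    print(f"Fold {fold}: train rides 0..{train_idx.max()}, "
          f"test rides {test_idx.min()}..{test_idx.max()}")
```

In each fold a model would be fitted on the training indices and scored on the test indices, so every prediction is made only from data that precedes it.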


In [78]:
# Backtesting
def backtest_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
    return results

xgb_backtest_results = backtest_model(xgb_model, X_test, y_test)
lgb_backtest_results = backtest_model(lgb_model, X_test, y_test)

# Evaluate backtest results
xgb_backtest_mse = mean_squared_error(xgb_backtest_results['Actual'], xgb_backtest_results['Predicted'])
lgb_backtest_mse = mean_squared_error(lgb_backtest_results['Actual'], lgb_backtest_results['Predicted'])
print(f'✅ XGBoost Backtest MSE: {xgb_backtest_mse}')
print(f'✅ LightGBM Backtest MSE: {lgb_backtest_mse}')



✅ XGBoost Backtest MSE: 5589.842346661605
✅ LightGBM Backtest MSE: 5359.510797742965


# Conclusion:

- XGBoost Backtest MSE: 5589.84
- LightGBM Backtest MSE: 5359.51

- The code prints the MSE for both models so their performance can be compared.

- A lower MSE indicates better performance.

- The LightGBM model performs slightly better than the XGBoost model, with an MSE of 5359.51 compared to 5589.84. The difference is relatively small, so both models perform similarly.

- LightGBM has the lower MSE, so it is the better-performing model in this backtest.
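To judge how small the gap is, it can help to convert each MSE to a root mean squared error (RMSE), which is in the same units as the ride cost, and to express the difference as a percentage. This is a minimal sketch using the two backtest MSE values printed above; the variable names for the derived quantities are illustrative.

```python
import math

# Backtest MSE values as printed above
xgb_backtest_mse = 5589.842346661605
lgb_backtest_mse = 5359.510797742965

# RMSE is the square root of MSE, in ride-cost units
xgb_rmse = math.sqrt(xgb_backtest_mse)
lgb_rmse = math.sqrt(lgb_backtest_mse)

# How much lower LightGBM's MSE is, relative to XGBoost's
relative_mse_gap_pct = (xgb_backtest_mse - lgb_backtest_mse) / xgb_backtest_mse * 100

print(f"XGBoost RMSE:  {xgb_rmse:.2f}")
print(f"LightGBM RMSE: {lgb_rmse:.2f}")
print(f"LightGBM MSE is lower by {relative_mse_gap_pct:.2f}%")
```

The RMSE values come out at roughly 75 and 73, and LightGBM's MSE is about 4% lower than XGBoost's.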

In [84]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from tabulate import tabulate

# Backtest function with KPI calculation
def backtest_with_kpis(model, X_test, y_test, model_name):
    predictions = model.predict(X_test)
    results = pd.DataFrame({
        'Actual': y_test,
        'Predicted': predictions
    })
    
    # KPIs
    mse = mean_squared_error(results['Actual'], results['Predicted'])
    total_revenue_actual = results['Actual'].sum()
    total_revenue_predicted = results['Predicted'].sum()
    revenue_lift = (total_revenue_predicted - total_revenue_actual) / total_revenue_actual * 100
    avg_price_actual = results['Actual'].mean()
    avg_price_predicted = results['Predicted'].mean()
    
    kpi_summary = {
        'Model': model_name,
        'MSE': mse,
        'Historical_Revenue': total_revenue_actual,
        'Predicted_Revenue': total_revenue_predicted,
        'Revenue_Lift_%': revenue_lift,
        'Avg_Actual_Price': avg_price_actual,
        'Avg_Predicted_Price': avg_price_predicted
    }
    
    return results, kpi_summary

# Run backtesting for both models
xgb_results, xgb_kpis = backtest_with_kpis(xgb_model, X_test, y_test, "XGBoost")
lgb_results, lgb_kpis = backtest_with_kpis(lgb_model, X_test, y_test, "LightGBM")

# Combine KPI summaries
kpi_list = [xgb_kpis, lgb_kpis]

# Print table using tabulate
print(tabulate(kpi_list, headers="keys", tablefmt="grid", floatfmt=".2f"))


+----------+---------+----------------------+---------------------+------------------+--------------------+-----------------------+
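| Model    |     MSE |   Historical_Revenue |   Predicted_Revenue |   Revenue_Lift_% |   Avg_Actual_Price |   Avg_Predicted_Price |
+==========+=========+======================+=====================+==================+====================+=======================+
| XGBoost  | 5589.84 |             75918.98 |            77254.22 |             1.76 |             379.59 |                386.27 |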
+----------+---------+----------------------+---------------------+------------------+--------------------+-----------------------+
| LightGBM | 5359.51 |             75918.98 |            77203.67 |             1.69 |             379.59 |                386.02 |
+----------+---------+----------------------+---------------------+------------------+--------------------+-----------------------+




# Evaluation:

# Simulated revenue lift achieved and validated.

- The simulated revenue lift compares the total predicted revenue with the total actual (historical) revenue.

- The absolute lift is the sum of the predicted values minus the sum of the actual values. Dividing by the historical revenue expresses it as a percentage:

$$\text{Revenue Lift (\%)} = \frac{\text{Predicted Revenue} - \text{Historical Revenue}}{\text{Historical Revenue}} \times 100$$

In [None]:
# Historical revenue
historical_revenue = y_test.sum()

# Predicted revenue
predicted_revenue_xgb = xgb_pred.sum()
predicted_revenue_lgb = lgb_pred.sum()

# Revenue lift (%)
revenue_lift_xgb = (predicted_revenue_xgb - historical_revenue) / historical_revenue * 100
revenue_lift_lgb = (predicted_revenue_lgb - historical_revenue) / historical_revenue * 100

print(f'XGBoost Simulated Revenue Lift: {revenue_lift_xgb:.2f}%')
print(f'LightGBM Simulated Revenue Lift: {revenue_lift_lgb:.2f}%')


XGBoost Simulated Revenue Lift: 1.76%
LightGBM Simulated Revenue Lift: 1.69%


# Conclusion:

- The simulated revenue lift calculation compares the total predicted revenue from the ML models (XGBoost and LightGBM) with the historical revenue from actual ride costs.

- XGBoost revenue lift: 1.76%. This is how much revenue could increase if the prices suggested by XGBoost were applied, assuming the same rides would still be booked at those prices.

- LightGBM revenue lift: 1.69%, under the same assumption.

- A positive lift indicates the model could increase overall revenue, while a negative lift would indicate a potential loss.
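As a cross-check, the lift can be recomputed from the rounded figures in the KPI table. The sketch below uses both the revenue totals and the average prices; since every row in the test set counts once, the two routes should agree up to rounding. The helper `revenue_lift_pct` and the dictionary `kpi_table` are illustrative names, and the numbers are copied from the table above.

```python
def revenue_lift_pct(predicted, historical):
    """Percentage change of predicted over historical revenue (or price)."""
    return (predicted - historical) / historical * 100

# Rounded values copied from the KPI table above
kpi_table = {
    "XGBoost": {"hist_rev": 75918.98, "pred_rev": 77254.22, "avg_actual": 379.59, "avg_pred": 386.27},
    "LightGBM": {"hist_rev": 75918.98, "pred_rev": 77203.67, "avg_actual": 379.59, "avg_pred": 386.02},
}

lift_from_totals = {}
lift_from_averages = {}
for name, row in kpi_table.items():
    lift_from_totals[name] = revenue_lift_pct(row["pred_rev"], row["hist_rev"])
    lift_from_averages[name] = revenue_lift_pct(row["avg_pred"], row["avg_actual"])
    print(f"{name}: {lift_from_totals[name]:.2f}% (totals), "
          f"{lift_from_averages[name]:.2f}% (averages)")
```

Both routes reproduce the reported 1.76% and 1.69%. They also show that this lift measures how far the average predicted price sits above the average historical price, so it does not account for any change in demand at the new prices.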