## Project 3: Model Training with scikit-learn

### **Objective:**

* #### Train a machine learning model to predict shipment times.

### **Some methods to apply:**

1. Feature Engineering
2. Split the dataset into training and testing sets.
3. Normalize numerical features.
4. Train a regression model.
5. Evaluate the model's performance.

## Import necessary libraries and frameworks.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor 

## Read the CSV file.

In [5]:
shipment_df = pd.read_csv('shipment_df.csv')

In [6]:
# One-hot encode the categorical columns; drop_first=True drops one level per column as the baseline.
# Note: .astype('int64') casts every column, so any float column would be truncated to whole numbers.
shipment_df = pd.get_dummies(shipment_df, columns=['transportation_modes', 'location', 'routes'], drop_first=True).astype('int64')
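
To make the encoding step concrete, here is a minimal sketch on a made-up frame. The `distance` values and the three transport modes are assumptions for illustration only, not data from `shipment_df.csv`. It also casts only the dummy columns to integers, which is one possible way to keep float features untouched.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook, the values are made up.
toy_df = pd.DataFrame({
    "distance": [120.5, 80.0, 300.25],
    "transportation_modes": ["air", "road", "sea"],
})

# drop_first=True drops the first category in sorted order ("air"),
# which then acts as the baseline level.
encoded = pd.get_dummies(toy_df, columns=["transportation_modes"], drop_first=True)

# Cast only the dummy columns, so float features such as distance keep their decimals.
dummy_cols = [c for c in encoded.columns if c.startswith("transportation_modes_")]
encoded = encoded.astype({c: "int64" for c in dummy_cols})

print(encoded)
```

A row with zeros in every `transportation_modes_*` column corresponds to the dropped baseline category.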

## Feature Engineering
#### Normalize numerical features

In [8]:
# 1) Identify the target column and the feature matrix
target_col = 'shipping_times'
shipmentX = shipment_df.drop(columns=[target_col])
shipmenty = shipment_df[target_col]

# 2) Select only the numeric columns in the features
#    (this picks up the distance, count, and one-hot dummy columns too)
num_cols = shipmentX.select_dtypes(include=['int64', 'float64']).columns

# 3) Fit & transform only those numeric columns
#    Note: the scaler is fitted on all rows before the train/test split,
#    so test-set statistics leak into the scaling.
scaler = StandardScaler()
shipmentX[num_cols] = scaler.fit_transform(shipmentX[num_cols])

shipmentX_scaled = shipmentX
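
The cell above scales the full feature matrix before the split, so the test rows contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the column names `distance` and `package_count` and all values are assumptions, not the real shipment data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shipment features; names and values are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "distance": rng.uniform(10, 1000, size=200),
    "package_count": rng.integers(1, 50, size=200).astype("float64"),
})
y = 0.01 * X["distance"] + rng.normal(0, 0.5, size=200)

# Split first, so the test rows never influence the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=30
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

Tree-based models such as Random Forest are largely insensitive to feature scaling, so this ordering matters most for the linear and SVR models.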

## Model training

### Train three regression models from scikit-learn.
- LinearRegression
- RandomForestRegressor
- SVR

## Split the dataset

In [12]:
shipmentX_train, shipmentX_test, shipmenty_train, shipmenty_test = train_test_split(shipmentX_scaled, shipmenty, test_size=0.25, random_state=30)

## Train the Linear Regression model

In [14]:
shipmentLinear_model = LinearRegression()
shipmentLinear_model.fit(shipmentX_train, shipmenty_train)
shipmentlinear_y_pred = shipmentLinear_model.predict(shipmentX_test)
shipment_linear_mse = mean_squared_error(shipmenty_test, shipmentlinear_y_pred)
shipment_linear_score = shipmentLinear_model.score(shipmentX_test, shipmenty_test)

## Random Forest Regression model

In [16]:
shipmentforest_model = RandomForestRegressor()
shipmentforest_model.fit(shipmentX_train, shipmenty_train)
shipmentforest_y_pred = shipmentforest_model.predict(shipmentX_test)
shipment_forest_mse = mean_squared_error(shipmenty_test, shipmentforest_y_pred)
shipment_forest_score = shipmentforest_model.score(shipmentX_test, shipmenty_test)

## SVR model

In [38]:
shipmentsvr_model = SVR(kernel='rbf', C=10, epsilon=0.1)
shipmentsvr_model.fit(shipmentX_train, shipmenty_train)
shipmenty_pred_svr = shipmentsvr_model.predict(shipmentX_test)
shipment_mse_svr = mean_squared_error(shipmenty_test, shipmenty_pred_svr)
shipmentsvr_score = shipmentsvr_model.score(shipmentX_test, shipmenty_test)

In [48]:
#from sklearn.svm import LinearSVR


#shipmentsvr_linear_model = LinearSVR(C=10, epsilon=0.1, max_iter=10000)
#shipmentsvr_linear_model.fit(shipmentX_train, shipmenty_train)
#shipmenty_pred_Lsvr = shipmentsvr_linear_model.predict(shipmentX_test)
#shipment_mse_Lsvr = mean_squared_error(shipmenty_test, shipmenty_pred_Lsvr)
#shipmentLsvr_score = shipmentsvr_linear_model.score(shipmentX_test, shipmenty_test)


## LGBMRegressor

In [20]:
shipmentlgbm_model = LGBMRegressor(n_estimators=5, learning_rate=0.6, max_depth=7, random_state=32)
shipmentlgbm_model.fit(shipmentX_train, shipmenty_train)
shipmenty_pred_lgbm = shipmentlgbm_model.predict(shipmentX_test)
shipment_mse_lgbm = mean_squared_error(shipmenty_test, shipmenty_pred_lgbm)
shipment_LightGBM_score = shipmentlgbm_model.score(shipmentX_test, shipmenty_test)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001625 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 537
[LightGBM] [Info] Number of data points in the train set: 18750, number of used features: 11
[LightGBM] [Info] Start training from score 3.298827


## XGBRegressor

In [22]:
shipmentxgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=32)
shipmentxgb_model.fit(shipmentX_train, shipmenty_train)
shipmenty_pred_xgb = shipmentxgb_model.predict(shipmentX_test)
shipment_mse_xgb = mean_squared_error(shipmenty_test, shipmenty_pred_xgb)
shipment_XGBoost_score = shipmentxgb_model.score(shipmentX_test, shipmenty_test)
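
The five training cells follow the same fit, predict, and score pattern, so they could be collapsed into one loop over a dictionary of estimators. The sketch below uses synthetic data from `make_regression` and only the three scikit-learn models so that it runs without extra packages; `LGBMRegressor` and `XGBRegressor` expose the same `fit`/`predict`/`score` interface and could be added to the dictionary in the same way. The `models` and `results` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (not the shipment dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
y = y / y.std()  # rescale the target so SVR's C and epsilon are on a sensible scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=30
)

models = {
    "Linear": LinearRegression(),
    # A fixed random_state makes the forest reproducible between runs.
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=32),
    "SVR": SVR(kernel="rbf", C=10, epsilon=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": model.score(X_test, y_test),  # R^2 for regressors
    }

for name, metrics in results.items():
    print(f"{name}: MSE={metrics['mse']:.3f}, R^2={metrics['r2']:.3f}")
```

In the notebook, `RandomForestRegressor()` is created without a `random_state`, so its MSE can differ slightly from one run to the next.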

## Evaluate the performance of the models

In [52]:
# Evaluate the models: mean squared error (lower is better)
print(f'Linear MSE: {shipment_linear_mse}')
print(f'Random Forest MSE: {shipment_forest_mse}')
print(f'SVR MSE: {shipment_mse_svr}\n')
#print(f'Linear SVR MSE: {shipment_mse_Lsvr}\n')
print(f'LightGBM MSE: {shipment_mse_lgbm}')
print(f'XGBoost MSE: {shipment_mse_xgb}')

print('-' * 64 + '\n')

# .score() on these regressors returns R^2 (coefficient of determination), not accuracy
print("R^2\n")
print(f'Linear R^2: {shipment_linear_score}')
print(f'Random Forest R^2: {shipment_forest_score}')
print(f'SVR R^2: {shipmentsvr_score}\n')
#print(f'Linear SVR R^2: {shipmentLsvr_score}\n')

print(f'LightGBM R^2: {shipment_LightGBM_score}')
print(f'XGBoost R^2: {shipment_XGBoost_score}')

Linear MSE: 0.7589074944754644
Random Forest MSE: 0.28235580107999997
SVR MSE: 0.3451066719021983

LightGBM MSE: 0.25328209362607923
XGBoost MSE: 0.25490039587020874
----------------------------------------------------------------

R^2

Linear R^2: 0.7794240092156309
Random Forest R^2: 0.9179334622067713
SVR R^2: 0.8996949606700918

LightGBM R^2: 0.9263837171773799
XGBoost R^2: 0.9259133338928223
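
The second group of values in the output comes from each estimator's `.score()` method, which for scikit-learn regressors returns the coefficient of determination R², not a classification accuracy. The sketch below checks this on synthetic data by comparing `.score()` with `r2_score` and a manual calculation, and also derives RMSE, which is in the same units as the target. For example, the LightGBM MSE of about 0.2533 corresponds to an RMSE of about 0.503 in the units of `shipping_times`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)  # same units as the target

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2 via model.score: {model.score(X, y):.4f}")
print(f"R^2 via r2_score:    {r2_score(y, y_pred):.4f}")
print(f"R^2 computed by hand: {r2_manual:.4f}")
```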


## Conclusion

### Some methods used for Project 3:

* Split data (75% train, 25% test)
* Trained LinearRegression, RandomForestRegressor, SVR, LGBMRegressor, and XGBRegressor models
* Evaluated each model on the test set with MSE and R²

### Results: LightGBM achieved the lowest MSE (0.253); Random Forest achieved 0.282

* #### Strong performance achieved: four of the five models reach a test R² (the `.score()` value) of about 0.90 or higher.
* #### The gradient boosters are currently the best performers, with no hyperparameter tuning applied to the other models.
* #### The low MSE for Random Forest and the gradient boosting regressors, with R² of about 0.92 to 0.93, reflects a high level of predictive accuracy.
* LGBMRegressor and XGBRegressor achieved the lowest errors (MSE = 0.2533 and 0.2549), suggesting they capture complex interactions best.
* RandomForestRegressor and SVR followed with moderate errors (MSE = 0.2824 and 0.3451), showing that they learned some non-linear patterns.
* Linear Regression had the highest error (MSE = 0.7589), suggesting it underfits non-linear relationships.


### Possible methods for improvement:

* I strongly believe the selected features are impacting the results more than the choice of algorithm.
* The Random Forest model's performance may improve with better feature selection.
* I may try polynomial features to capture non-linear relationships.
* Use cross-validation for more reliable metrics.
* Apply hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
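
The last two items could look roughly like the sketch below: 5-fold cross-validation for a more stable MSE estimate, followed by a small randomized search over Random Forest settings. The data is synthetic, and the parameter grid and `n_iter=5` are arbitrary assumptions chosen to keep the example fast, not recommended values for the shipment data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (not the shipment dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation; scikit-learn reports negated MSE so that higher is better.
base_model = RandomForestRegressor(n_estimators=30, random_state=32)
cv_scores = cross_val_score(base_model, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -cv_scores
print(f"CV MSE: mean={cv_mse.mean():.2f}, std={cv_mse.std():.2f}")

# Randomized search samples n_iter combinations from the grid below.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=32),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=32,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV MSE: {-search.best_score_:.2f}")
```

`GridSearchCV` takes a `param_grid` argument instead of `param_distributions` and tries every combination, which is more thorough but slower on larger grids.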