<a href="https://colab.research.google.com/github/CaesarQuintero/MLProjectSupplyChain/blob/main/ML_Production_Model_for_late_delivery_detection%2C_Beta%2C_Production_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview
>**Author's note:** I recommend reading this introductory section first to understand the model. This beta of a full production model predicts anomalous cases (late deliveries), and the code includes instructions and notes that help with interpretation.

**Developed by:** Cesar Augusto Quintero Guerra 2023

**Advised and reviewed by:** Johan Sanchez and Juan Lopez

**Lynxus 2023**


>Below you will find the project repository, along with the project's development and planning in Notion, including the production model.

* [Github Repository](https://github.com/CaesarQuintero/MLProjectSupplyChain)

* [Project in Notion](https://www.notion.so/Machine-Learning-Project-abc63e69e99643cb9eb3a51428deb061?pvs=4)

* [Production Model](https://colab.research.google.com/drive/14ulBobu4QZ5tPRMG2uqvU0nkn2i1uO8L?usp=sharing)

* [Linkedin](https://www.linkedin.com/in/caesarquintero/)


## Purpose

*   Create a machine learning model to predict freight delivery timeliness (on-time or late).
*   Employ a Random Forest classifier with SMOTE oversampling and random undersampling to address class imbalance and enhance performance.

## Target Audience
* Individuals or organizations involved in freight transportation seeking to improve delivery reliability.

## Additional Details: Oversampling and Undersampling

### Oversampling:
Creates additional samples of the minority class (SMOTE used in this project).
### Undersampling:
Removes samples from the majority class (Random Undersampling used in this project).
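
To make the two resampling ideas concrete, here is a minimal pure-Python sketch. It is not the imbalanced-learn implementation: it assumes a simplified SMOTE step that interpolates between a minority point and one given neighbor (the real algorithm picks that neighbor among the k nearest minority samples). The helper names `smote_like_sample` and `random_undersample` and the toy data are hypothetical.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


def random_undersample(rows, target_size, rng):
    """Keep target_size rows chosen at random, without replacement."""
    return rng.sample(rows, target_size)


rng = random.Random(123)

# Toy data: two minority rows and ten majority rows (illustrative only).
minority = [[1.0, 2.0], [2.0, 3.0]]
majority = [[float(i), float(i)] for i in range(10)]

# Oversampling adds a new, interpolated minority row.
synthetic = smote_like_sample(minority[0], minority[1], rng)

# Undersampling shrinks the majority class to the new minority size.
reduced = random_undersample(majority, len(minority) + 1, rng)

print("Synthetic minority row:", synthetic)
print("Majority rows kept:", len(reduced))
```

Oversampling invents plausible new minority rows, while undersampling discards real majority rows, so the two trade extra synthetic data against lost information.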
<br/><br/>

---


# Installation Instructions
## Prerequisites
Python 3
## Installation

```
pip install pandas numpy matplotlib seaborn statsmodels gradio scikit-learn imbalanced-learn openpyxl
```
<br/><br/>

---


## Usage Instructions
1.   **Load the dataset:**

```
data = pd.read_excel("MLDatasetforTest.xlsx", usecols=desired_columns)

```
**Note**: Loading Excel files can be slow, so it is recommended to use CSV files instead.

2.   **Preprocess the data (refer to code for detailed steps):**
  * Handle null and duplicate values
  * Remove outliers
  * Create binary target variables for late deliveries

3. **Split data into training and testing sets:**

```
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
```

4. **Undersample the majority class and oversample the minority class (training set only):**

#### Undersampling
```
rus = RandomUnderSampler(sampling_strategy='majority')
x_train_ontime, y_train_ontime = rus.fit_resample(x_train, y_train)
```
#### Oversampling
```
sm = SMOTE(sampling_strategy='minority',random_state=123)
x_train_late, y_train_late = sm.fit_resample(x_train, y_train)
```

5. **Combine oversampled and undersampled data:**


```
x_train_smote = pd.concat([x_train_late, x_train_ontime])
y_train_smote = pd.concat([y_train_late, y_train_ontime])
```

6. **Create and train the Random Forest classifier:**

```
random_forest_model_SMOTE = RandomForestClassifier(**hyperparameters)
random_forest_model_SMOTE.fit(x_train_smote, y_train_smote)
```

7. **Make predictions:**

```
predictions = random_forest_model_SMOTE.predict(x_test)
```

8. **Evaluate model performance:**

```
print(classification_report(y_test, predictions))
print('Accuracy:', accuracy_score(y_test, predictions))
```

9. **Deploy the model:**

* Command-line interface (1st Deployment Option)
* Gradio web interface (2nd Deployment Option)


## Additional Information

## **Feature Importance**
### Most influential features:
* FreightWeight
* miles
* CustomerCharges

## **Hyperparameter Tuning**
### Tuned hyperparameters:
* n_estimators = 60
* max_depth = 4
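
The notebook imports `GridSearchCV` but does not show the search that produced these values. As a rough sketch of what an exhaustive grid search enumerates, the snippet below lists every combination in a small grid. The candidate values and the 5-fold setting are assumptions for illustration, not the grid the author used.

```python
from itertools import product

# Hypothetical candidate values; the original search grid is not given.
param_grid = {
    "n_estimators": [20, 40, 60, 80],
    "max_depth": [2, 4, 6],
}
cv_folds = 5  # assumed number of cross-validation folds

# Build one dictionary per combination of parameter values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# An exhaustive search fits every candidate once per fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
print("Tuned setting in grid:", {"n_estimators": 60, "max_depth": 4} in candidates)
```

The number of fits grows multiplicatively with each added parameter, which is one reason to keep the grid small on a dataset of this size.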

## Troubleshooting

* **Missing libraries:** install any missing library with `pip install`.
* **Data format issues:** ensure the dataset is in Excel format with the specified columns.
* **Model errors:** double-check code syntax and hyperparameter values.
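
For the data-format case, checking the header row before training can point to the problem directly. The sketch below is a pure-Python illustration. The helper name `missing_columns` is hypothetical, and the required names are the three feature columns the model uses plus the delay column behind the target.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


required = [
    "FreightWeight",
    "miles",
    "CustomerCharges",
    "Delivery Late Time (in Mins)",
]

# Example header with a capitalization mismatch ("Miles" instead of "miles").
header = ["FreightWeight", "Miles", "CustomerCharges", "Delivery Late Time (in Mins)"]

problems = missing_columns(header, required)
if problems:
    print("Missing columns:", problems)
```

With pandas, `header` could be taken from `list(data.columns)` right after loading the file.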


## Gradio Installation

In [None]:
!pip install -q gradio

## Import of dependencies

In [None]:
#Essentials
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import os
from google.colab import drive

#Sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix,classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

# SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#Fine-tuning Random Forest
from sklearn.model_selection import GridSearchCV

#Gradio (used in Deployment option 2)
import gradio as gr

## Dataset source | File route

In [None]:
#File variables
file_route = '/content/MLDatasetforTest.xlsx'
file_name = "MLDatasetforTest.xlsx"


In [None]:
#Checking if the file is in the drive folder
!ls

MLDatasetforTest.xlsx  sample_data


### Dataload

In [None]:
# Pick the desired columns for the model
desired_columns = [
    'FreightWeight', 'miles', 'freightType', 'TrailerType', 'Broker Rep',
    'BrokeredTime', 'Delivery Appointment End Time',
    'Delivery Late Time (in Mins)', 'Pickup Late Time (in Mins)',
    'OriginCity', 'OriginState', 'OriginZip',
    'DestinationCity', 'DestinationState', 'DestinationZip',
    'CustomerCharges',
]

In [None]:
#Import dataset
data = pd.read_excel(file_name, usecols=desired_columns)

### Check for null and duplicate values

In [None]:
#Null values
if data.isnull().values.any():
  print('The dataset has null values')
else:
  print('The dataset has no null values')

The dataset has null values


In [None]:
#Duplicate values
if data.duplicated().any():
   print('The dataset has duplicate values')
else:
   print('The dataset has no duplicate values')

The dataset has no duplicate values


In [None]:
# Percentage of duplicate values
pd.options.display.float_format = '{:.3f}%'.format
duplicated_values = round(data.duplicated().mean()*100,1)
print(f"The dataset has {duplicated_values}% duplicated values")

The dataset has 0.0% duplicated values


In [None]:
# Percentage of null values
pd.options.display.float_format = '{:.3f}%'.format
data.isna().mean()*100

FreightWeight                   0.000%
miles                           0.030%
freightType                     0.000%
TrailerType                     0.000%
Broker Rep                      0.000%
BrokeredTime                    0.000%
Pickup Late Time (in Mins)      0.162%
Delivery Appointment End Time   0.000%
Delivery Late Time (in Mins)    0.000%
OriginCity                      0.000%
OriginState                     0.000%
OriginZip                       0.000%
DestinationCity                 0.000%
DestinationState                0.000%
DestinationZip                  0.000%
CustomerCharges                 0.000%
dtype: float64

### Data Processing (ETL) | Statistical Normalization

---

The aim is to remove null values and duplicates, which should strengthen the relationships in the correlation matrix, before setting up the machine learning model.

In [None]:
#Columns category
columnslessthan1percentage = ['Delivery Late Time (in Mins)', 'OriginZip', 'DestinationZip', 'Pickup Late Time (in Mins)']
columnsgreaterthan1percentage = ['miles']

#For the columns in columnslessthan1percentage, drop the rows that contain null values
data = data.dropna(subset=columnslessthan1percentage)

#For the columns in columnsgreaterthan1percentage, fill null values with the median
for column in columnsgreaterthan1percentage:
    data[column] = data[column].fillna(data[column].median())

#Removing duplicates
data = data.drop_duplicates(keep="first")



In [None]:
#Null values
if data.isnull().values.any():
  print('The dataset has null values')
else:
  print('The dataset has no null values')

The dataset has no null values


In [None]:
# Percentage of null values
data.isna().mean()*100

FreightWeight                   0.000%
miles                           0.000%
freightType                     0.000%
TrailerType                     0.000%
Broker Rep                      0.000%
BrokeredTime                    0.000%
Pickup Late Time (in Mins)      0.000%
Delivery Appointment End Time   0.000%
Delivery Late Time (in Mins)    0.000%
OriginCity                      0.000%
OriginState                     0.000%
OriginZip                       0.000%
DestinationCity                 0.000%
DestinationState                0.000%
DestinationZip                  0.000%
CustomerCharges                 0.000%
dtype: float64

In [None]:
#Duplicate values
if data.duplicated().any():
   print('The dataset has duplicate values')
else:
   print('The dataset has no duplicate values')

The dataset has no duplicate values


In [None]:
# Percentage of duplicate values
pd.options.display.float_format = '{:.3f}%'.format
duplicated_values = round(data.duplicated().mean()*100,1)
print(f"{duplicated_values}%")

0.0%


### Checking and removing Outliers

In [None]:
#Picking numeric columns (integer and float)
data = data.select_dtypes(include=["int64", "float64"])
# Filter values outside the range
desired_range = (-45000, 45000)
data = data[data["Delivery Late Time (in Mins)"].between(*desired_range)]

In [None]:
# Atypical values | Outliers
for col in data.columns:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    iqr = Q3 - Q1
    lower_bound = Q1 - 1.5 * iqr
    upper_bound = Q3 + 1.5 * iqr
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    print("Outliers in the variable :", col, len(outliers))

Outliers in the variable : FreightWeight 1
Outliers in the variable : miles 667
Outliers in the variable : Pickup Late Time (in Mins) 1137
Outliers in the variable : Delivery Late Time (in Mins) 1074
Outliers in the variable : CustomerCharges 460


In [None]:
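#Outliers Remover
# Note: `outliers` holds only the rows flagged for the last column of the loop above
# (CustomerCharges), so this drops that column's outliers only.
data = data.drop(outliers.index)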
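
Because the `outliers` variable is reassigned on every pass of the loop, the drop above removes only the rows flagged for the last column. If the intent is to drop rows that are outliers in any numeric column, one possible approach is to build a single boolean mask. This is a sketch on toy data, not the notebook's method. The helper name `iqr_outlier_mask` is hypothetical, and applying it would change the row counts and metrics reported in this notebook.

```python
import pandas as pd


def iqr_outlier_mask(frame: pd.DataFrame) -> pd.Series:
    """Flag rows that fall outside 1.5 * IQR in at least one column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Comparisons align the per-column bounds with the frame's columns.
    return ((frame < lower) | (frame > upper)).any(axis=1)


# Toy frame: row 0 is an outlier in "b", row 8 is an outlier in "a".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "b": [-50, 2, 3, 4, 5, 6, 7, 8, 9],
})

mask = iqr_outlier_mask(toy)
cleaned = toy[~mask]

print("Rows flagged:", int(mask.sum()))
print("Rows kept:", len(cleaned))
```

A single pass like this still leaves new outliers relative to the recomputed quartiles, which is why a second check can report a non-zero count.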

In [None]:
# Atypical values checker | Outliers
for col in data.columns:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    iqr = Q3 - Q1
    lower_bound = Q1 - 1.5 * iqr
    upper_bound = Q3 + 1.5 * iqr
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    print("Outliers in the variable :", col, len(outliers))

Outliers in the variable : FreightWeight 1
Outliers in the variable : miles 387
Outliers in the variable : Pickup Late Time (in Mins) 1099
Outliers in the variable : Delivery Late Time (in Mins) 1028
Outliers in the variable : CustomerCharges 319


## Dataset Transformation for classification models

In [None]:
# Create a new column pickup late
data['Pickup late(Y/N)'] = np.where(data['Pickup Late Time (in Mins)'] > 0, 1, 0)

In [None]:
# Create a new column Deliver late
data['Deliver late(Y/N)'] = np.where(data['Delivery Late Time (in Mins)'] > 0, 1, 0)

## Classification models

#### Define X and Y

In [None]:
# Define the target variable
y = data['Deliver late(Y/N)']

# Select the numerical columns
x = data[["FreightWeight", "miles","CustomerCharges"]]


# Review alternatives using carrier charges instead of customer charges

#### Split the dataset into training and test sets

In [None]:
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

## Oversample the minority class using SMOTE

In [None]:
sm = SMOTE(sampling_strategy='minority',random_state=123)
x_train_late, y_train_late = sm.fit_resample(x_train, y_train)

## Undersample the majority class using RandomUnderSampler

In [None]:
rus = RandomUnderSampler(sampling_strategy='majority')
x_train_ontime, y_train_ontime = rus.fit_resample(x_train, y_train)

## New Balanced Data

In [None]:
# Combine the SMOTE-oversampled and the randomly undersampled training sets
# (each fit_resample result contains both classes)
x_train_smote = pd.concat([x_train_late, x_train_ontime])
y_train_smote = pd.concat([y_train_late, y_train_ontime])

## Random Forest classifier with tuned hyperparameters, trained on the SMOTE-oversampled and undersampled data

In [None]:
#Using the finetuned model
hyperparameters = {
    "n_estimators": 60,
    "max_depth": 4,
}

In [None]:
# Create a Random Forest classifier
random_forest_model_SMOTE = RandomForestClassifier(**hyperparameters)

In [None]:
# Fit the model
random_forest_model_SMOTE.fit(x_train_smote, y_train_smote)

In [None]:
# Make predictions
predictions = random_forest_model_SMOTE.predict(x_test)

In [None]:
# Evaluate the model
print(classification_report(y_test, predictions))
print('Accuracy:', accuracy_score(y_test, predictions))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92      1707
           1       0.23      0.24      0.23       173

    accuracy                           0.85      1880
   macro avg       0.57      0.58      0.58      1880
weighted avg       0.86      0.85      0.86      1880

Accuracy: 0.8537234042553191
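
Since most of the test loads are on time, accuracy alone says little about late-delivery detection. The short calculation below uses only the support, precision, and recall values from the report above. It compares the model's accuracy with a baseline that always predicts on time and recomputes the F1 score for the late class. The variable names are illustrative.

```python
# Values copied from the classification report above.
support_on_time = 1707
support_late = 173
model_accuracy = 0.8537234042553191
precision_late = 0.23
recall_late = 0.24

total = support_on_time + support_late

# Accuracy of a trivial classifier that labels every load as on time.
baseline_accuracy = support_on_time / total

# F1 is the harmonic mean of precision and recall.
f1_late = 2 * precision_late * recall_late / (precision_late + recall_late)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"F1 (late class):   {f1_late:.2f}")
```

The always-on-time baseline reaches about 0.908 accuracy against 0.854 for the model, so the late-class precision and recall are the more informative numbers here.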


## Deployment option 1

Offers a direct query of the model from the command line.

In [None]:
print('Please enter the following information:')
freight_weight = float(input('Freight weight (in pounds): '))
miles = float(input('Miles to travel: '))
customer_charges = float(input('Customer charges: '))

# Create a dataframe with the input data
# (named input_data so the training dataset in `data` is not overwritten)
input_data = pd.DataFrame({
    'FreightWeight': [freight_weight],
    'miles': [miles],
    'CustomerCharges': [customer_charges]
})

# Make a prediction
prediction = random_forest_model_SMOTE.predict(input_data)

# Print the prediction (predict returns an array with one label per input row)
if prediction[0] > 0:
  print('Load forecast : The load will be late')
else:
  print('Load forecast : The load will be on time')


Please enter the following information:
Freight weight (in pounds): 100000000
Miles to travel: 2
Customer charges: 1
Load forecast : The load will be late
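
The sample run above accepts a freight weight of 100000000 pounds without complaint. A small validation step can reject implausible inputs before they reach the model. This is a sketch only: the helper `parse_bounded_float` and the upper limit are hypothetical and would need to be set from the actual data ranges.

```python
def parse_bounded_float(text, label, upper):
    """Convert text to a float and check that it lies in (0, upper]."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"{label} must be a number, got {text!r}") from None
    if not 0 < value <= upper:
        raise ValueError(f"{label} must be greater than 0 and at most {upper}")
    return value


# Hypothetical limit, chosen only for illustration.
MAX_WEIGHT_LB = 80_000

print(parse_bounded_float("42000", "Freight weight", MAX_WEIGHT_LB))

try:
    parse_bounded_float("100000000", "Freight weight", MAX_WEIGHT_LB)
except ValueError as error:
    print("Rejected:", error)
```

The same helper could wrap each `input()` call in the cell above, or each `float()` conversion in the Gradio callback.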


## Deployment option 2

Offers a graphical web interface with text inputs for users

In [None]:
# Define the inputs and outputs
inputs = [
    gr.Textbox(label='Freight weight (in pounds)'),
    gr.Textbox(label='Miles to travel'),
    gr.Textbox(label='Customer charges')
]
outputs = gr.Textbox(label='Load forecast')

# Define the function that will be called when the user clicks the submit button
def predict(freight_weight, miles, customer_charges):
    # Convert the input strings to floats
    freight_weight = float(freight_weight)
    miles = float(miles)
    customer_charges = float(customer_charges)

    # Create a dataframe with the input data
    data_for_forecast = pd.DataFrame({
        'FreightWeight': [freight_weight],
        'miles': [miles],
        'CustomerCharges': [customer_charges]
    })

    # Make a prediction
    prediction = random_forest_model_SMOTE.predict(data_for_forecast)

    # Return the prediction (predict returns an array with one label per input row)
    return 'The load will be ' + ('late' if prediction[0] > 0 else 'on time')

# Create the interface
interface = gr.Interface(predict, inputs, outputs)

# Launch the interface
interface.launch()
