# Scrap Probability Prediction

## Problem Statement

Develop a machine learning model that predicts the probability of a product being scrapped. Identifying products with a high scrap probability makes it possible to take preventive action, plan production schedules better, allocate resources more efficiently, and improve quality control.

### The `CRISP-ML(Q)` process model describes six phases:

1. Business and Data Understanding
2. Data Preparation
3. Model Building
4. Model Evaluation
5. Deployment
6. Monitoring and Maintenance

### Business and Data Understanding

**Objective(s):** Minimize the number of scrapped products.

**Constraint(s):** Maximize business revenue by reducing scrap losses.

**Success Criteria:**

- **Business Success Criteria:** Reduce the total number of scrapped products by 25% within the first year of implementation.

- **Machine Learning Success Criteria:** Achieve an R-squared value greater than 0.85.

- **Economic Success Criteria:** Increase overall sales revenue by reducing the likelihood of products being scrapped.

### Data Collection/Description

**Data:** This is dummy data created to test model building; it contains a set of related columns.

**Data Dictionary:**
- Dataset contains 8 columns/features
- Dataset contains 100 records

**Description:**
- **Item_ID** - Unique identifier for each item
- **Item_Name** - Name of the item or product
- **Item_Category** - Category the item falls into
- **Scrap_Description** - Reason the item was scrapped
- **Quantity** - Number of units scrapped
- **Disposal_Method** - How the scrapped items are disposed of
- **Cost** - Cost of the item for the given quantity
- **Scrap_Probability** - Probability that the item will be scrapped
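
The data dictionary above can be turned into a quick sanity check that runs before modelling. The following is a minimal sketch, not part of the original notebook: the helper `check_schema` and the two sample rows are hypothetical, and only the column names come from the description above.

```python
import pandas as pd

# Column names taken from the data dictionary above.
EXPECTED_COLUMNS = [
    "Item_ID", "Item_Name", "Item_Category", "Scrap_Description",
    "Quantity", "Disposal_Method", "Cost", "Scrap_Probability",
]


def check_schema(frame: pd.DataFrame) -> list:
    """Return a list of readable problems; an empty list means the frame looks OK."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "Scrap_Probability" in frame.columns:
        # A probability must lie in [0, 1]; NaN values are flagged as well.
        out_of_range = ~frame["Scrap_Probability"].between(0, 1)
        if out_of_range.any():
            problems.append("Scrap_Probability has values outside [0, 1]")
    return problems


# Two made-up rows for illustration only (not taken from scrap.xlsx).
sample = pd.DataFrame({
    "Item_ID": [1, 2],
    "Item_Name": ["Widget A", "Widget B"],
    "Item_Category": ["Electronics", "Furniture"],
    "Scrap_Description": ["Damaged", "Obsolete"],
    "Quantity": [10, 5],
    "Disposal_Method": ["Recycle", "Resale"],
    "Cost": [1000, 500],
    "Scrap_Probability": [0.85, 0.65],
})

print(check_schema(sample))
```

Calling `check_schema(df)` right after loading the real dataset would surface a renamed or missing column before it causes an error further down the notebook.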

#### Importing required dependencies 

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import pickle 
import mlflow

#### Loading Dataset

In [1]:
df = pd.read_excel(r"C:\Users\USER\Desktop\Scrap Probability Prediction\scrap.xlsx")
df


#### Exploratory Data Analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Item_ID            100 non-null    int64  
 1   Item_Name          100 non-null    object 
 2   Item_Category      100 non-null    object 
 3   Scrap_Description  100 non-null    object 
 4   Quantity           100 non-null    int64  
 5   Disposal_Method    100 non-null    object 
 6   Cost               100 non-null    int64  
 7   Scrap_Probability  100 non-null    float64
dtypes: float64(1), int64(3), object(4)
memory usage: 6.4+ KB


In [4]:
df.describe()

| | Item_ID | Quantity | Cost | Scrap_Probability |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 50.5 | 51.5 | 5150.0 | 0.6935 |
| std | 29.011492 | 29.011492 | 2901.149198 | 0.11584 |
| min | 1.0 | 2.0 | 200.0 | 0.5 |
| 25% | 25.75 | 26.75 | 2675.0 | 0.6 |
| 50% | 50.5 | 51.5 | 5150.0 | 0.68 |
| 75% | 75.25 | 76.25 | 7625.0 | 0.79 |
| max | 100.0 | 101.0 | 10100.0 | 0.91 |


In [5]:
df.shape

(100, 8)

In [4]:
df.isna().sum()

Item_ID              0
Item_Name            0
Item_Category        0
Scrap_Description    0
Quantity             0
Disposal_Method      0
Cost                 0
Scrap_Probability    0
dtype: int64

In [120]:
# pip install dtale
import dtale

d = dtale.show(df)
d.open_browser()

In [121]:
# Sweetviz
###########
# !pip install sweetviz
import sweetviz as sv

s = sv.analyze(df)
s.show_html()


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


### Data Preparation/Preprocessing 

#### Dealing with Missing Values 

In [3]:
# check for missing values for each column if any
df.isnull().sum()
# there are no missing values in the dataset 

Item_ID              0
Item_Name            0
Item_Category        0
Scrap_Description    0
Quantity             0
Disposal_Method      0
Cost                 0
Scrap_Probability    0
dtype: int64

#### Checking for Duplicates

In [2]:
#Handling Duplicates
duplicates = df.duplicated()
sum(duplicates)

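
The cell above only counts duplicate rows. If any were found, they could be removed with `drop_duplicates`. The following is a small sketch on a made-up frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame with one repeated row; values are made up for illustration.
toy = pd.DataFrame({
    "Item_ID": [1, 2, 2, 3],
    "Quantity": [10, 5, 5, 8],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```

On the real data the equivalent call would be `df = df.drop_duplicates().reset_index(drop=True)`.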

#### Label Encoding of Categorical Features

In [5]:
# Encode categorical features
label_encoders = {}
categorical_features = ['Item_Name', 'Item_Category', 'Scrap_Description', 'Disposal_Method']

In [6]:
for col in categorical_features:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
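
`LabelEncoder` assigns integer codes in sorted order of the labels, so a linear model or SVR will read those codes as magnitudes even though the categories have no natural order. One-hot (dummy) encoding is a common alternative. The sketch below uses `pd.get_dummies` on a made-up frame; the category values are hypothetical. For a high-cardinality column such as `Item_Name`, one-hot encoding would add one column per distinct value, which may be too many for 100 records.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Disposal_Method": ["Recycle", "Landfill", "Resale", "Recycle"],
    "Quantity": [10, 5, 8, 12],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Disposal_Method"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```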

In [7]:
df

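| | Item_ID | Item_Name | Item_Category | Scrap_Description | Quantity | Disposal_Method | Cost | Scrap_Probability |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 90 | 1 | 0 | 10 | 1 | 1000 | 0.85 |
| 1 | 2 | 53 | 7 | 2 | 5 | 2 | 500 | 0.65 |
| 2 | 3 | 45 | 0 | 1 | 8 | 1 | 800 | 0.75 |
| 3 | 4 | 25 | 2 | 3 | 12 | 0 | 1200 | 0.55 |
| 4 | 5 | 87 | 3 | 0 | 15 | 1 | 1500 | 0.90 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 6 | 5 | 3 | 96 | 2 | 9600 | 0.54 |
| 96 | 97 | 67 | 6 | 0 | 98 | 0 | 9800 | 0.88 |
| 97 | 98 | 77 | 4 | 2 | 100 | 1 | 10000 | 0.61 |
| 98 | 99 | 37 | 9 | 1 | 99 | 2 | 9900 | 0.69 |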


#### Splitting the Dataset into Independent and Dependent Variables

In [8]:
# Split the data into features and target
X = df.drop(columns='Scrap_Probability')
y = df['Scrap_Probability']

#### Splitting the data into Training and Test datasets 

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
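
With 100 records, an 80/20 split leaves only 20 rows for testing, so a single test score can vary noticeably with the random seed. K-fold cross-validation is one way to get a steadier estimate. The sketch below is an illustration on synthetic stand-in data (an assumption, not the scrap dataset); in the notebook, `X_demo` and `y_demo` would be replaced by `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 100 rows, 3 numeric features, and a
# target that is a noisy linear function of them.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=100)

# Five folds: each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(scores.round(3))   # one R^2 value per fold
print(scores.mean())     # average R^2 across folds
```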

In [10]:
X_test.to_csv('test1.csv', index = False)

In [11]:
X_test.to_excel("testset.xlsx", index = False)

#### Scaling the Data to improve Model Convergence and performance

In [12]:
# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
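
The scaler above is fitted on the training set only and then applied to the test set, which keeps test-set statistics out of training. A scikit-learn `Pipeline` packages the same two steps into one object so the scaler and model always travel together. The following is a sketch on synthetic data (an assumption standing in for the scrap features), not the notebook's actual workflow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two numeric features, with a target
# that depends linearly on the first one plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(100, 2))
y_demo = 0.5 + 0.004 * X_demo[:, 0] + rng.normal(scale=0.01, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales using training statistics only; predict() reuses them.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)

print(preds.shape)
print(pipe.score(X_te, y_te))   # R^2 on the held-out rows
```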

### Model Building

#### Random Forest Regression

In [15]:
# Train the model
RF_model = RandomForestRegressor(n_estimators=100, random_state=42)
RF_model.fit(X_train, y_train)

In [16]:
# Predict on the test set
y_pred_RF = RF_model.predict(X_test)

#### Evaluation Metrics for Random Forest Regression

In [17]:
# Evaluate the model
RF_mae = mean_absolute_error(y_test, y_pred_RF)
RF_mse = mean_squared_error(y_test, y_pred_RF)
RF_r2 = r2_score(y_test, y_pred_RF)

print(f'MAE: {RF_mae}')
print(f'MSE: {RF_mse}')
print(f'R^2: {RF_r2}')

MAE: 0.030785000000000028
MSE: 0.0013634795000000014
R^2: 0.8886864642011592


#### Support Vector Machine Regression

In [3]:
# Create and train the SVR model
SVR_model = SVR(kernel='linear', C=1.0)
SVR_model.fit(X_train, y_train)


In [19]:
# Predict on the test set
y_pred_SVR = SVR_model.predict(X_test)

#### Evaluation Metrics for Support Vector Machine Regression 

In [20]:
# Evaluate the model
SVR_mae = mean_absolute_error(y_test, y_pred_SVR)
SVR_mse = mean_squared_error(y_test, y_pred_SVR)
SVR_r2 = r2_score(y_test, y_pred_SVR)

print(f'MAE: {SVR_mae}')
print(f'MSE: {SVR_mse}')
print(f'R^2: {SVR_r2}')

MAE: 0.037225582241188306
MSE: 0.0021303483327275758
R^2: 0.8260798160888583


#### Linear Regression

In [139]:
# Create and fit the model
LR_model = LinearRegression()
LR_model.fit(X_train, y_train)

In [140]:
y_pred_LR = LR_model.predict(X_test)

In [141]:
y_pred_LR

array([0.53658051, 0.6373585 , 0.74027152, 0.65330982, 0.83278504,
       0.55201468, 0.73939044, 0.83448914, 0.73840784, 0.84899891,
       0.75036767, 0.75071256, 0.64292154, 0.64687766, 0.74980082,
       0.85091897, 0.84101938, 0.63618294, 0.84571435, 0.5445659 ])

#### Evaluation Metrics for Linear regression

In [142]:
# Evaluate the model
LR_mae = mean_absolute_error(y_test, y_pred_LR)
LR_mse = mean_squared_error(y_test, y_pred_LR)
LR_r2 = r2_score(y_test, y_pred_LR)

print(f'MAE: {LR_mae}')
print(f'MSE: {LR_mse}')
print(f'R^2: {LR_r2}')

MAE: 0.02247005371337879
MSE: 0.0007368257925120196
R^2: 0.9398460451863809


#### MLflow

In [18]:
# Set the experiment name
mlflow.set_experiment('Scrap_Probability')
 
# Start an MLflow run
with mlflow.start_run(run_name='LinearRegression_Model'):
    # Create and fit the model
    LR_model = LinearRegression()
    LR_model.fit(X_train, y_train)
 
    # Make predictions
    y_pred_LR = LR_model.predict(X_test)
    print(y_pred_LR)
 
    # Evaluate the model
    LR_mae = mean_absolute_error(y_test, y_pred_LR)
    LR_mse = mean_squared_error(y_test, y_pred_LR)
    LR_r2 = r2_score(y_test, y_pred_LR)
 
    print(f'MAE: {LR_mae}')
    print(f'MSE: {LR_mse}')
    print(f'R^2: {LR_r2}')
 
    # Log parameters, metrics, and the model
    mlflow.log_param('model', 'LinearRegression')
    mlflow.log_metric('mae', LR_mae)
    mlflow.log_metric('mse', LR_mse)
    mlflow.log_metric('r2', LR_r2)
   
    # Log the model
    mlflow.sklearn.log_model(LR_model, 'model')

[0.53658051 0.6373585  0.74027152 0.65330982 0.83278504 0.55201468
 0.73939044 0.83448914 0.73840784 0.84899891 0.75036767 0.75071256
 0.64292154 0.64687766 0.74980082 0.85091897 0.84101938 0.63618294
 0.84571435 0.5445659 ]
MAE: 0.022470053713378783
MSE: 0.0007368257925120195
R^2: 0.9398460451863809




### Model Dumping/Saving

In [102]:
# LR_model_and_metrics = {
#     'model': LR_model,
#     'Mean Squared Error': LR_mse,
#     'Mean Absolute Error': LR_mae,
#     'R-Squared Error' : LR_r2
# }

#### Joblib File

In [54]:
#saving the model using joblib
joblib_file = 'Scrap_Probability_Prediction.joblib'
joblib.dump(LR_model, joblib_file)

['Scrap_Probability_Prediction.joblib']

#### Pickle File 

In [55]:
with open('Scrap_Probability_Prediction.pkl', 'wb') as file:
    pickle.dump(LR_model, file)
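
The saved files contain only the fitted model. Because the model was trained on scaled features, predictions on new raw data also need the fitted `StandardScaler` (and, for categorical inputs, the fitted label encoders). One option is to save these objects together. The sketch below illustrates the idea on synthetic data; the bundle layout, the file name `scrap_bundle.joblib`, and the `_demo` variables are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the target is an exact linear
# function of the first feature, so the expected prediction is easy to check.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 0.5 + 0.004 * X_demo[:, 0]

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Save the model together with the preprocessing it depends on.
bundle = {"model": model_demo, "scaler": scaler_demo}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scrap_bundle.joblib")  # hypothetical file name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# New raw rows must pass through the same scaler before prediction.
new_rows = np.array([[10.0, 20.0]])
prediction = restored["model"].predict(restored["scaler"].transform(new_rows))
print(prediction)   # close to 0.5 + 0.004 * 10 = 0.54
```

In the notebook, the equivalent bundle would hold `LR_model`, `scaler`, and `label_encoders`.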