# Manufacturing defect classification

## Problem Statement
The goal of this project is to develop a data-driven solution that can help the manufacturing company improve its defect detection and correction processes. The solution should leverage historical data on manufacturing defects to identify patterns, predict future defects, and optimize corrective actions. The solution should also be scalable and adaptable to the company's evolving manufacturing processes and products.

The solution should be able to demonstrate a significant reduction in defects, improved customer satisfaction, and increased revenue for the manufacturing company.

__DATASET DESCRIPTION__
1. __ProductionVolume__: Number of units produced per day.
2. __ProductionCost__: Cost incurred for production per day.
3. __SupplierQuality__: Quality ratings of suppliers.
4. __DeliveryDelay__: Average delay in delivery.
5. __DefectRate__: Defects per thousand units produced.
6. __QualityScore__: Overall quality assessment.
7. __MaintenanceHours__: Hours spent on maintenance per week.
8. __DowntimePercentage__: Percentage of production downtime.
9. __InventoryTurnover__: Ratio of inventory turnover.
10. __StockoutRate__: Rate of inventory stockouts.
11. __WorkerProductivity__: Productivity level of the workforce.
12. __SafetyIncidents__: Number of safety incidents per month.
13. __EnergyConsumption__: Energy consumed in kWh.
14. __EnergyEfficiency__: Efficiency factor of energy usage.
15. __AdditiveProcessTime__: Time taken for additive manufacturing.
16. __AdditiveMaterialCost__: Cost of additive materials per unit.
17. __DefectStatus__: Predicted defect status.

## 1. Data Preprocessing

In [1]:
# import userdefined class to load the data
from User_defined_Data_loader import DataLoader
import warnings
warnings.filterwarnings('ignore')

# initialize the userdefined class
data_loader = DataLoader('manufacturing_defect_dataset.csv')
# read the data using read data function
manufacturing_defect_data = data_loader.read_data()

In [2]:
# first 5 rows from the data
manufacturing_defect_data.head()

Unnamed: 0,ProductionVolume,ProductionCost,SupplierQuality,DeliveryDelay,DefectRate,QualityScore,MaintenanceHours,DowntimePercentage,InventoryTurnover,StockoutRate,WorkerProductivity,SafetyIncidents,EnergyConsumption,EnergyEfficiency,AdditiveProcessTime,AdditiveMaterialCost,DefectStatus
0,202,13175.403783,86.648534,1,3.121492,63.463494,9,0.052343,8.630515,0.081322,85.042379,0,2419.616785,0.468947,5.551639,236.439301,1
1,535,19770.046093,86.310664,4,0.819531,83.697818,20,4.908328,9.296598,0.038486,99.657443,7,3915.566713,0.119485,9.080754,353.957631,1
2,960,19060.820997,82.132472,0,4.514504,90.35055,1,2.464923,5.097486,0.002887,92.819264,2,3392.385362,0.496392,6.562827,396.189402,1
3,370,5647.606037,87.335966,5,0.638524,67.62869,8,4.692476,3.577616,0.055331,96.887013,8,4652.400275,0.183125,8.097496,164.13587,1
4,206,7472.222236,81.989893,3,3.867784,82.728334,9,2.746726,6.851709,0.068047,88.315554,7,1581.630332,0.263507,6.406154,365.708964,1


In [3]:
# dataset shape
manufacturing_defect_data.shape

(3240, 17)

In [4]:
# columns of the data
manufacturing_defect_data.columns

Index(['ProductionVolume', 'ProductionCost', 'SupplierQuality',
       'DeliveryDelay', 'DefectRate', 'QualityScore', 'MaintenanceHours',
       'DowntimePercentage', 'InventoryTurnover', 'StockoutRate',
       'WorkerProductivity', 'SafetyIncidents', 'EnergyConsumption',
       'EnergyEfficiency', 'AdditiveProcessTime', 'AdditiveMaterialCost',
       'DefectStatus'],
      dtype='object')

In [5]:
# variable types in data
manufacturing_defect_data.dtypes

ProductionVolume          int64
ProductionCost          float64
SupplierQuality         float64
DeliveryDelay             int64
DefectRate              float64
QualityScore            float64
MaintenanceHours          int64
DowntimePercentage      float64
InventoryTurnover       float64
StockoutRate            float64
WorkerProductivity      float64
SafetyIncidents           int64
EnergyConsumption       float64
EnergyEfficiency        float64
AdditiveProcessTime     float64
AdditiveMaterialCost    float64
DefectStatus              int64
dtype: object

In [6]:
# check for the noisy data in the object type columns
for i in manufacturing_defect_data.columns:
    print(i)
    print(manufacturing_defect_data[i].unique())
    print("-------------------------------------------")
    print("")

ProductionVolume
[202 535 960 370 206 171 800 120 714 221 566 314 430 558 187 472 199 971
 763 230 761 408 869 443 591 513 905 485 291 376 260 559 413 121 352 847
 956 660 574 158 610 781 575 799 882 289 786 662 975 666 343 931 604 584
 918 746 940 266 373 487 700 415 113 341 876 445 664 997 439 191 466 554
 527 608 875 134 305 180 661 101 489 665 205 871 921 576 802 501 829 655
 261 301 369 962 915 555 561 826 351 801 395 824 819 848 437 978 152 891
 316 863 287 479 592 140 256 114 912 164 938 620 228 747 571 162 238 598
 692 491 774 518 388 478 872 589 330 127 234 300 939 879 132 147 602 506
 673 827 904 198 783 825 646 838 712 742 868 104 317 866 497 970 894 492
 306 957 653 991 560 790 674 963 842 340 663 195 999 833 754 270 640 135
 624 259 798 342 185 895 677 656 745 719 897 739 605 447 572 324 484 476
 382 732 727 844 358 458 809 510 748 417 776 333 926 473 771 707 332 791
 212 929 596 541 367 609 906 486 724 741 319 954 835 502 737 229 515 346
 935 538 302 283 222 500 393 379 9

In [7]:
manufacturing_defect_data.isna().sum()

ProductionVolume        0
ProductionCost          0
SupplierQuality         0
DeliveryDelay           0
DefectRate              0
QualityScore            0
MaintenanceHours        0
DowntimePercentage      0
InventoryTurnover       0
StockoutRate            0
WorkerProductivity      0
SafetyIncidents         0
EnergyConsumption       0
EnergyEfficiency        0
AdditiveProcessTime     0
AdditiveMaterialCost    0
DefectStatus            0
dtype: int64

In [8]:
# Creating a function to replace null values with median for Numerical variables and Mode for Categorical variables.

def Handling_null_values(data,column_name,thresh_hold):
    if (data.isna().sum() > 0) and (data.isna().sum() < thresh_hold*len(data)):
        if data.dtype == 'int' or data.dtype == 'float':
            if  len(data[data>(data.mean()+(2*data.std()))]) > 0:
                filled_data = data.fillna(data.median())
                return filled_data
            else:
                filled_data = data.fillna(data.mean())
                return filled_data
        elif data.dtype == 'object':
            filled_data = data.fillna(data.mode())
            return filled_data
        else:
            print("Data is neither Numerical nor Categorical!.")
    elif data.isna().sum() > thresh_hold*len(data):
        print(f"Drop the column: {column_name}")
    else:
        return data

In [9]:
# loop over
for i in manufacturing_defect_data.columns:
    manufacturing_defect_data[i] = Handling_null_values(manufacturing_defect_data[i],i,thresh_hold=0.3)

In [10]:
manufacturing_defect_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ProductionVolume      3240 non-null   int64  
 1   ProductionCost        3240 non-null   float64
 2   SupplierQuality       3240 non-null   float64
 3   DeliveryDelay         3240 non-null   int64  
 4   DefectRate            3240 non-null   float64
 5   QualityScore          3240 non-null   float64
 6   MaintenanceHours      3240 non-null   int64  
 7   DowntimePercentage    3240 non-null   float64
 8   InventoryTurnover     3240 non-null   float64
 9   StockoutRate          3240 non-null   float64
 10  WorkerProductivity    3240 non-null   float64
 11  SafetyIncidents       3240 non-null   int64  
 12  EnergyConsumption     3240 non-null   float64
 13  EnergyEfficiency      3240 non-null   float64
 14  AdditiveProcessTime   3240 non-null   float64
 15  AdditiveMaterialCost 

In [11]:
manufacturing_defect_data.describe()

Unnamed: 0,ProductionVolume,ProductionCost,SupplierQuality,DeliveryDelay,DefectRate,QualityScore,MaintenanceHours,DowntimePercentage,InventoryTurnover,StockoutRate,WorkerProductivity,SafetyIncidents,EnergyConsumption,EnergyEfficiency,AdditiveProcessTime,AdditiveMaterialCost,DefectStatus
count,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0,3240.0
mean,548.523148,12423.018476,89.83329,2.558951,2.749116,80.134272,11.476543,2.501373,6.019662,0.050878,90.040115,4.591667,2988.494453,0.299776,5.472098,299.515479,0.840432
std,262.402073,4308.051904,5.759143,1.705804,1.310154,11.61175,6.872684,1.443684,2.329791,0.028797,5.7236,2.896313,1153.42082,0.1164,2.598212,116.379905,0.366261
min,100.0,5000.174521,80.00482,0.0,0.50071,60.010098,0.0,0.001665,2.001611,2e-06,80.00496,0.0,1000.720156,0.100238,1.000151,100.211137,0.0
25%,322.0,8728.82928,84.869219,1.0,1.598033,70.10342,5.75,1.264597,3.983249,0.0262,85.180203,2.0,1988.140273,0.200502,3.228507,194.922058,1.0
50%,549.0,12405.204656,89.704861,3.0,2.708775,80.265312,12.0,2.465151,6.022389,0.051837,90.125743,5.0,2996.822301,0.29747,5.437134,299.728918,1.0
75%,775.25,16124.462428,94.789936,4.0,3.904533,90.353822,17.0,3.774861,8.050222,0.075473,95.050838,7.0,3984.788299,0.402659,7.741006,403.178283,1.0
max,999.0,19993.365549,99.989214,5.0,4.998529,99.996993,23.0,4.997591,9.998577,0.099997,99.996786,9.0,4997.074741,0.4995,9.999749,499.982782,1.0


## 2. Exploratory Data Analysis - EDA

In [12]:
# import libraries for the Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as px
from plotly.subplots import make_subplots
import plotly.colors
# set the color palette
sns.set_palette('Pastel1')

### 2.1 Univariate Analysis

In [13]:
# funtion to visualize the numerical Variable types 
def univariate_plot_numerical(data,col_name):
    # initialize the figure with 2 subplots in 1 row
    fig = make_subplots(cols=2,rows=1,subplot_titles=(f'{col_name} - Box plot',f'{col_name} - Hist plot'))
    # set height and width of the figure
    fig.update_layout(autosize=False,width=800,height=350)
    # add box plot to the figure
    fig.add_trace(px.Box(x=data,name=str(col_name)),row=1,col=1)
    # add Histogram plot to the figure
    fig.add_trace(px.Histogram(x=data,nbinsx=15,name=str(col_name)),row=1,col=2)
    # show the figure
    fig.show()
    
# function to visualize the Object type variables
def univariate_plot_object(data,col_name):
    
    '''define a function to plot the data'''
    def plotting_function(plot_data):
        # add bar plot to the figure
        fig.add_trace(px.Bar(x=plot_data.value_counts().index.astype(str),
                         y=plot_data.value_counts().values,
                         text=plot_data.value_counts().values,
                         marker_color=plotly.colors.qualitative.Plotly
                        ),
                  row=1,col=1)  
        # add pie chart to the figure
        fig.add_trace(px.Pie(labels=plot_data.value_counts().index.astype(str), 
                         values=plot_data.value_counts().values,
                         domain={'x': [0.5, 1]}))
        # to show the figure
        fig.show()
        
    '''Initializing the plots'''
    # initialize the the figure with 2 subplots
    fig = make_subplots(cols=2,rows=1,
                        subplot_titles=(f'{col_name} - Count plot',f'{col_name} - Pie Chart')
                       )
    # update the height and width of the figure
    fig.update_layout(autosize=False,width=800,height=350)
    
    # check if the object column has more than 10 unique values
    if len(data.unique())>10:
        # select the top 10 unique values form the column
        data_keys = data.value_counts().head(10).keys()
        # store the data of top 10 column values in temp variable
        new_data = data[data.isin(data_keys)]
        # call the plotting function
        plotting_function(new_data)
        
    # if column doesnt have the more than 10 unique values    
    else:
        # call the plotting function
        plotting_function(data)

In [14]:
# loop over the all the variables in the data
for i in manufacturing_defect_data.columns:
    # check if the unique values are >10 or not confirm whether it is Nominal data or ordinal data
    if len(manufacturing_defect_data[i].unique()) > 10 and manufacturing_defect_data[i].dtype != 'object':
        # numerical plot function for ordinal data
        univariate_plot_numerical(manufacturing_defect_data[i],i)
    else:
        # object plot function for nominal data
        univariate_plot_object(manufacturing_defect_data[i].astype('category'),i)

### 2.2 Bivariate Analysis

In [15]:
import plotly.express as go
# creating function to plot numerical column with object column
def numerical_with_object_box_plot(data,numerical_column,category_column):
    fig = px.Figure()
    fig.update_layout(autosize=False,width=1000,height=500,xaxis_title=f'{category_column}', yaxis_title=f'{numerical_column}') 
    groups = data[category_column].value_counts().keys()[:10]
    colors = iter([
                    '#FFA500', '#800080', '#008000', '#000080',
                    '#A52A2A', '#808080','#FFD700', '#FF6347', 
                    '#808000', '#FF1493'
                  ])
    for i in groups:
        fig.add_trace(px.Box(y=data.loc[data[category_column]==i,numerical_column],name=i,marker_color=next(colors),showlegend=False))
    fig.show()
 
 
    
def object_with_object_countplot(data, category_column1, category_column2):
    fig = px.Figure()
    colors = [
                    '#FFA500', '#800080', '#008000', '#000080',
                    '#A52A2A', '#808080','#FFD700', '#FF6347', 
                    '#808000', '#FF1493'
                  ]
    fig.update_layout(autosize=False, width=1000, height=500,barmode='group',
                      xaxis_title=f'{category_column1}', yaxis_title='count')
    
    category_counts = data.groupby([category_column1, category_column2]).size().reset_index(name='count')

    for target_category in category_counts[category_column1].unique():
        subset = category_counts[category_counts[category_column2] == target_category]
        fig.add_trace(px.Bar(x=subset[category_column1].astype(str), y=subset['count'], name=str(target_category),text=subset['count'], textposition='auto'))

    fig.show()

    
    
            
# creating the function to plot numerical with numerical    
def numerical_with_numerical_scatterplot(data,numerical_column1,numerical_column2):
    fig,ax = plt.subplots(1,1,figsize=(7,4))
    sns.scatterplot(x=data[numerical_column1],y=data[numerical_column2])

In [16]:
target_column = 'DefectStatus'
for i in manufacturing_defect_data.columns.drop(target_column):
    if (manufacturing_defect_data[i].dtype in ['int32','int64','float32','float64']) and len(manufacturing_defect_data[i].unique())>10:
        numerical_with_object_box_plot(manufacturing_defect_data,i,target_column)
    elif manufacturing_defect_data[i].dtype in ['object','category'] or len(manufacturing_defect_data[i].unique())<=10:
        object_with_object_countplot(manufacturing_defect_data,i,target_column)
    else:
        numerical_with_numerical_scatterplot(manufacturing_defect_data,i,target_column)

In [17]:
# Calculate the correlation matrix
corr = manufacturing_defect_data.corr().round(3)
fig = px.Figure()
fig.update_layout(autosize=False, width=800, height=600)
# Create a heatmap using Plotly
fig.add_trace(px.Heatmap(
                   z=corr.values,
                   x=corr.columns,
                   y=corr.columns,
                   text=corr.values,
                   texttemplate="%{text}",
                   colorscale='Viridis'))

fig.show()

### 2.3 Statistical Analysis

In [18]:
from scipy.stats import ttest_ind
def two_sample_ttest(numeric_column,category_column,data):
    groups = data[category_column].unique().tolist()
    group1 = data[data[category_column]==groups[0]][numeric_column]
    group2 = data[data[category_column]==groups[1]][numeric_column]
    t_static_value, p_value = ttest_ind(group1, group2)
    print(f'T-static-Value:{t_static_value:.3f}, P-Value:{p_value:.3f}')
    if p_value > 0.05:
        print(f'Probably, means of the {numeric_column} between 2 groups in {category_column} are equal')
    else:
        print(f'Probably, means of the {numeric_column} between 2 groups in {category_column} are not equal')
        
        
from scipy.stats import pearsonr

def pearsons_correlation(data,column1,column2):
    
    t_static_value,p_value = pearsonr(data[column1],data[column2])
    print(f'T-static-Value:{t_static_value:.3f}, P-Value:{p_value:.3f}')
    if p_value > 0.05:
        print(f'Probably, {column1} and {column2} are Correlated')
    else:
        print(f'Probably, {column1} and {column2} may not have any correlation')

In [19]:
for i in manufacturing_defect_data.columns.drop('DefectStatus'):
    if len(manufacturing_defect_data[i].unique())>10:
        two_sample_ttest(i,'DefectStatus',manufacturing_defect_data)
        print()
    else:
        pearsons_correlation(manufacturing_defect_data,i,'DefectStatus')
        print()

T-static-Value:7.401, P-Value:0.000
Probably, means of the ProductionVolume between 2 groups in DefectStatus are not equal

T-static-Value:1.521, P-Value:0.128
Probably, means of the ProductionCost between 2 groups in DefectStatus are equal

T-static-Value:2.174, P-Value:0.030
Probably, means of the SupplierQuality between 2 groups in DefectStatus are not equal

T-static-Value:0.005, P-Value:0.758
Probably, DeliveryDelay and DefectStatus are Correlated

T-static-Value:14.426, P-Value:0.000
Probably, means of the DefectRate between 2 groups in DefectStatus are not equal

T-static-Value:-11.568, P-Value:0.000
Probably, means of the QualityScore between 2 groups in DefectStatus are not equal

T-static-Value:17.706, P-Value:0.000
Probably, means of the MaintenanceHours between 2 groups in DefectStatus are not equal

T-static-Value:0.235, P-Value:0.814
Probably, means of the DowntimePercentage between 2 groups in DefectStatus are equal

T-static-Value:0.383, P-Value:0.702
Probably, means of

## 3. Feature Engineering

In [20]:
import numpy as np
# defining the function to handle the outliers
def handling_Outliers(data):
    # 25 percentile of the data
    q1 = np.percentile(data,25)
    # 75 percentile of the data
    q3 = np.percentile(data,75)
    # Inter quatile range
    iqr = q3-q1
    # upper boundary 
    upper_boundary = q3+1.5*iqr
    # lower boundary
    lower_boundary = q1-1.5*iqr
    # replacing the values with nan which have greater the uper boudary value
    data[data>upper_boundary] = np.nan
    # replacing the values with nan whihc have less than the lower boundary
    data[data<lower_boundary] = np.nan
    return data

In [21]:

for i in manufacturing_defect_data.columns.drop('DefectStatus'):# fill the null values with median of the data
    manufacturing_defect_data[i] = handling_Outliers(manufacturing_defect_data[i])

In [22]:
X = manufacturing_defect_data.drop('DefectStatus',axis=1)
y = manufacturing_defect_data['DefectStatus']

In [23]:
# apply normalization to data
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
scaled_X = mms.fit_transform(X)

In [24]:
# split the data into train and test datasets
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(scaled_X,y,test_size=0.2,random_state=27)

## 4. Handling Imbalance Data

### 4.1 Over Sampling

In [25]:
from imblearn.over_sampling import RandomOverSampler,SMOTE,SMOTENC,ADASYN,SVMSMOTE

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score,precision_score,recall_score


def logistic_regression_model(X_train,X_test,y_train,y_test):
    # initialize the Classifier
    logistic_regression_classifier = LogisticRegression()
    # fit the model
    logistic_regression_classifier.fit(X_train,y_train)

    # store the predictions of the model
    predictions_logistic_regression = logistic_regression_classifier.predict(X_test)

    # print the scores of the model
    print(f"Train Accuracy: {logistic_regression_classifier.score(X_train,y_train)*100:.3f}")
    print(f"Test Accuracy: {logistic_regression_classifier.score(X_test,y_test)*100:.3f}")
    print(f"Precision Value: {precision_score(predictions_logistic_regression,y_test)*100:.3f}")
    print(f"Recall Value: {recall_score(predictions_logistic_regression,y_test)*100:.3f}")
    print(f"F1 Score: {f1_score(predictions_logistic_regression,y_test)*100:.3f}")

#### 4.1.1 Random Over Sampling
Object to over-sample the minority class(es) by picking samples at random with replacement. The bootstrap can be generated in a smoothed manner.

In [27]:
# Oversampling using RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)

logistic_regression_model(X_train_over,X_test,y_train_over,y_test)

Train Accuracy: 75.011
Test Accuracy: 74.537
Precision Value: 72.407
Recall Value: 96.069
F1 Score: 82.577


#### 4.1.2 SMOTE - Synthetic Minority Over-sampling Technique
SMOTE handles this issue by generating samples of minority classes to make the class distribution balanced. SMOTE works by generating synthetic examples in the feature space of the minority class.

In [28]:
over_sample_smote = SMOTE(random_state=47,sampling_strategy='minority')
X_train_smote,y_train_smote = over_sample_smote.fit_resample(X_train,y_train)

# initialize the Classifier
logistic_regression_model(X_train_smote,X_test,y_train_smote,y_test)

Train Accuracy: 78.997
Test Accuracy: 77.469
Precision Value: 76.481
Recall Value: 95.602
F1 Score: 84.979


#### 4.1.3 Synthetic Minority Over-sampling Technique for Nominal and Continuous - SMOTENC

Unlike SMOTE, SMOTE-NC for dataset containing numerical and categorical features. However, it is not designed to work with only categorical features.

In [29]:
over_sample_smotenc = SMOTENC(random_state=47,categorical_features=[3,11])
X_train_smotenc,y_train_smotenc = over_sample_smotenc.fit_resample(X_train,y_train)

# initialize the Classifier
logistic_regression_model(X_train_smotenc,X_test,y_train_smotenc,y_test)

Train Accuracy: 79.317
Test Accuracy: 77.006
Precision Value: 75.741
Recall Value: 95.785
F1 Score: 84.592


#### 4.1.4 Oversample using Adaptive Synthetic (ADASYN)
This method is similar to SMOTE but it generates different number of samples depending on an estimate of the local distribution of the class to be oversampled.

In [30]:
over_sample_adasyn = ADASYN(random_state=47,sampling_strategy='minority')
X_train_adasyn,y_train_adasyn = over_sample_adasyn.fit_resample(X_train,y_train)

# initialize the Classifier
logistic_regression_model(X_train_adasyn,X_test,y_train_adasyn,y_test)

Train Accuracy: 74.167
Test Accuracy: 73.765
Precision Value: 70.926
Recall Value: 96.717
F1 Score: 81.838


#### 4.1.5 Over-sampling using SVM (SVMSMOTE).
Variant of SMOTE algorithm which use an SVM algorithm to detect sample to use for generating new synthetic samples as proposed in.

In [31]:
over_sample_svmsmote = SVMSMOTE(random_state=47,sampling_strategy='minority')

X_train_svmsmote,y_train_svmsmote = over_sample_svmsmote.fit_resample(X_train,y_train)

# initialize the Classifier
logistic_regression_model(X_train_svmsmote,X_test,y_train_svmsmote,y_test)

Train Accuracy: 86.280
Test Accuracy: 81.481
Precision Value: 81.667
Recall Value: 95.455
F1 Score: 88.024


### 4.2 UnderSampling

In [32]:
from imblearn.under_sampling import InstanceHardnessThreshold,NearMiss,NeighbourhoodCleaningRule,OneSidedSelection,TomekLinks
from imblearn.under_sampling import RandomUnderSampler,CondensedNearestNeighbour,EditedNearestNeighbours,RepeatedEditedNearestNeighbours,AllKNN

#### 4.2.1 Random Under Sample
Class to perform random under-sampling. Under-sample the majority class(es) by randomly picking samples with or without replacement.

In [33]:
# Undersampling using RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train_under, y_train_under = undersample.fit_resample(X, y)

logistic_regression_model(X_train_under,X_test,y_train_under,y_test)

Train Accuracy: 76.015
Test Accuracy: 83.642
Precision Value: 100.000
Recall Value: 83.591
F1 Score: 91.062


#### 4.2.2 Condensed Nearest Neighbour
Undersample based on the condensed nearest neighbour method.

In [34]:
under_sample_condensednn = CondensedNearestNeighbour(random_state=42,n_neighbors=5,sampling_strategy='majority')

X_train_undercnn, y_train_undercnn = under_sample_condensednn.fit_resample(X_train, y_train)

logistic_regression_model(X_train_undercnn,X_test,y_train_undercnn,y_test)

Train Accuracy: 75.817
Test Accuracy: 86.420
Precision Value: 90.556
Recall Value: 92.966
F1 Score: 91.745


#### 4.2.3 Edited Nearest Neighbour
Undersample based on the edited nearest neighbour method. This method cleans the dataset by removing samples close to the decision boundary. It removes observations from the majority class or classes when any or most of its closest neighours are from a different class.

In [35]:
under_sample_editednn = EditedNearestNeighbours(n_neighbors=5,sampling_strategy='majority')

X_train_editednn, y_train_editednn = under_sample_editednn.fit_resample(X_train, y_train)

logistic_regression_model(X_train_editednn,X_test,y_train_editednn,y_test)

Train Accuracy: 89.093
Test Accuracy: 83.333
Precision Value: 85.370
Recall Value: 94.082
F1 Score: 89.515


#### 4.2.4 Repeated Edited Nearest Neighbour
Undersample based on the repeated edited nearest neighbour method. This method repeats the EditedNearestNeighbours algorithm several times. The repetitions will stop when:
1. the maximum number of iterations is reached, (or)
2. no more observations are being removed, (or) 
3. one of the majority classes becomes a minority class (or) 
4. one of the majority classes disappears during undersampling.

In [36]:
under_sample_repeatednn = RepeatedEditedNearestNeighbours(n_neighbors=5,sampling_strategy='majority')

X_train_repeatednn, y_train_repeatednn = under_sample_repeatednn.fit_resample(X_train, y_train)

logistic_regression_model(X_train_repeatednn,X_test,y_train_repeatednn,y_test)

Train Accuracy: 88.712
Test Accuracy: 59.568
Precision Value: 52.778
Recall Value: 97.603
F1 Score: 68.510


#### 4.2.5 All KNN
Undersample based on the AllKNN method. This method will apply EditedNearestNeighbours several times varying the number of nearest neighbours at each round. It begins by examining 1 closest neighbour, and it incrases the neighbourhood by 1 at each round. The algorithm stops when the maximum number of neighbours are examined or when the majority class becomes the minority class, whichever comes first.

In [37]:
under_sample_allknn = AllKNN(n_neighbors=5,sampling_strategy='majority')

X_train_allknn, y_train_allknn = under_sample_allknn.fit_resample(X_train, y_train)

logistic_regression_model(X_train_allknn,X_test,y_train_allknn,y_test)

Train Accuracy: 89.149
Test Accuracy: 75.463
Precision Value: 73.333
Recall Value: 96.350
F1 Score: 83.281


#### 4.2.6 Instance Hardness Threshold
Undersample based on the instance hardness threshold.

In [38]:
under_sample_Iht = InstanceHardnessThreshold(random_state=42,sampling_strategy='majority')

X_train_Iht, y_train_Iht = under_sample_Iht.fit_resample(X_train, y_train)

logistic_regression_model(X_train_Iht,X_test,y_train_Iht,y_test)

Train Accuracy: 84.754
Test Accuracy: 75.463
Precision Value: 73.333
Recall Value: 96.350
F1 Score: 83.281


#### 4.2.7 Near Miss
Class to perform under-sampling based on NearMiss methods.

In [39]:
under_sample_nearmiss = NearMiss(n_neighbors=5,sampling_strategy='majority')

X_train_nearmiss, y_train_nearmiss = under_sample_nearmiss.fit_resample(X_train, y_train)

logistic_regression_model(X_train_nearmiss,X_test,y_train_nearmiss,y_test)

Train Accuracy: 69.804
Test Accuracy: 78.704
Precision Value: 79.074
Recall Value: 94.469
F1 Score: 86.089


#### 4.2.8 Neighbourhood Cleaning Rule
Undersample based on the neighbourhood cleaning rule. This class uses ENN and a k-NN to remove noisy samples from the datasets.

In [40]:
under_sample_Ncleaningrule = NeighbourhoodCleaningRule(n_neighbors=5,sampling_strategy='majority')

X_train_Ncrule, y_train_Ncrule = under_sample_Ncleaningrule.fit_resample(X_train, y_train)

logistic_regression_model(X_train_Ncrule,X_test,y_train_Ncrule,y_test)

Train Accuracy: 86.991
Test Accuracy: 87.037
Precision Value: 93.889
Recall Value: 90.860
F1 Score: 92.350


#### 4.2.9 One Sided Selection
Class to perform under-sampling based on one-sided selection method.

In [41]:
under_sample_onesiderule = OneSidedSelection(n_neighbors=5,sampling_strategy='majority')

X_train_onesiderule, y_train_onesiderule = under_sample_onesiderule.fit_resample(X_train, y_train)

logistic_regression_model(X_train_onesiderule,X_test,y_train_onesiderule,y_test)

Train Accuracy: 87.600
Test Accuracy: 87.500
Precision Value: 97.778
Recall Value: 88.442
F1 Score: 92.876


#### 4.2.10 Tomek Links
Under-sampling by removing Tomek’s links.

In [42]:
under_sample_tomelinks = TomekLinks(sampling_strategy='majority')

X_train_tomelinks, y_train_tomelinks = under_sample_tomelinks.fit_resample(X_train, y_train)

logistic_regression_model(X_train_tomelinks,X_test,y_train_tomelinks,y_test)

Train Accuracy: 87.600
Test Accuracy: 87.500
Precision Value: 97.778
Recall Value: 88.442
F1 Score: 92.876


### 4.3 Ensemble Methods

In [43]:
from imblearn.ensemble import EasyEnsembleClassifier,BalancedBaggingClassifier,RUSBoostClassifier,BalancedRandomForestClassifier

#### 4.3.1 Easy Ensemble Classifier
Bag of balanced boosted learners also known as EasyEnsemble. This algorithm is known as EasyEnsemble. The classifier is an ensemble of AdaBoost learners trained on different balanced bootstrap samples. The balancing is achieved by random under-sampling.

In [44]:
# Create a easy ensemble classifier
easy_ensemble_classifier = EasyEnsembleClassifier(sampling_strategy='not majority',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)

   
# Fit the model
easy_ensemble_classifier.fit(X_train, y_train)

# Make predictions
easy_ensemble_classifier_predictions = easy_ensemble_classifier.predict(X_test)
# print the scores of the model
print(f"Train Accuracy : {easy_ensemble_classifier.score(X_train,y_train)}")
print(f"Test Accuracy: {easy_ensemble_classifier.score(X_test,y_test)}")
print(f"Precision Score: {precision_score(easy_ensemble_classifier_predictions,y_test)}")
print(f"Recall Score: {recall_score(easy_ensemble_classifier_predictions,y_test)}")
print(f"F1 Score: {f1_score(easy_ensemble_classifier_predictions,y_test)}")

Train Accuracy : 0.9239969135802469
Test Accuracy: 0.9197530864197531
Precision Score: 0.9925925925925926
Recall Score: 0.9178082191780822
F1 Score: 0.9537366548042705


#### 4.3.2 RUS Boost Classifier
Random under-sampling integrated in the learning of AdaBoost. During learning, the problem of class balancing is alleviated by random under-sampling the sample at each iteration of the boosting algorithm.

In [45]:
# Create a Rus Boost classifier
rus_boost_classifier = RUSBoostClassifier(sampling_strategy='not majority',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)

   
# Fit the model
rus_boost_classifier.fit(X_train, y_train)

# Make predictions
rus_boost_classifier_predictions = rus_boost_classifier.predict(X_test)
# print the scores of the model
print(f"Train Accuracy : {rus_boost_classifier.score(X_train,y_train)}")
print(f"Test Accuracy: {rus_boost_classifier.score(X_test,y_test)}")
print(f"Precision Score: {precision_score(rus_boost_classifier_predictions,y_test)}")
print(f"Recall Score: {recall_score(rus_boost_classifier_predictions,y_test)}")
print(f"F1 Score: {f1_score(rus_boost_classifier_predictions,y_test)}")

Train Accuracy : 0.9270833333333334
Test Accuracy: 0.9259259259259259
Precision Score: 0.9796296296296296
Recall Score: 0.9346289752650176
F1 Score: 0.9566003616636528


#### 4.3.3 Balanced Bagging Classifier
A Bagging classifier with additional balancing. This implementation of Bagging is similar to the scikit-learn implementation. It includes an additional step to balance the training set at fit time using a given sampler. This classifier can serves as a basis to implement various methods such as Exactly Balanced Bagging, Roughly Balanced Bagging, Over-Bagging, or SMOTE-Bagging.

In [46]:
# Create a Balanced Bagging Classifier
balanced_bagging_classifier = BalancedBaggingClassifier(LogisticRegression(),
                                                        sampling_strategy='not majority',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)

   
# Fit the model
balanced_bagging_classifier.fit(X_train, y_train)

# Make predictions
balanced_bagging_classifier_predictions = balanced_bagging_classifier.predict(X_test)
# print the scores of the model
print(f"Train Accuracy : {balanced_bagging_classifier.score(X_train,y_train)}")
print(f"Test Accuracy: {balanced_bagging_classifier.score(X_test,y_test)}")
print(f"Precision Score: {precision_score(balanced_bagging_classifier_predictions,y_test)}")
print(f"Recall Score: {recall_score(balanced_bagging_classifier_predictions,y_test)}")
print(f"F1 Score: {f1_score(balanced_bagging_classifier_predictions,y_test)}")

Train Accuracy : 0.8777006172839507
Test Accuracy: 0.8734567901234568
Precision Score: 0.9796296296296296
Recall Score: 0.8816666666666667
F1 Score: 0.9280701754385965


#### 4.3.4 Balanced Random Forest Classifier
A balanced random forest classifier. A balanced random forest differs from a classical random forest by the fact that it will draw a bootstrap sample from the minority class and sample with replacement the same number of samples from the majority class.

In [47]:
# Create a Balanced Random forest Classifier
balanced_randomforest_classifier = BalancedRandomForestClassifier(
                                                        sampling_strategy='not majority',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)

   
# Fit the model
balanced_randomforest_classifier.fit(X_train, y_train)

# Make predictions
balanced_randomforest_classifier_predictions = balanced_randomforest_classifier.predict(X_test)
# print the scores of the model
print(f"Train Accuracy : {balanced_randomforest_classifier.score(X_train,y_train)}")
print(f"Test Accuracy: {balanced_randomforest_classifier.score(X_test,y_test)}")
print(f"Precision Score: {precision_score(balanced_randomforest_classifier_predictions,y_test)}")
print(f"Recall Score: {recall_score(balanced_randomforest_classifier_predictions,y_test)}")
print(f"F1 Score: {f1_score(balanced_randomforest_classifier_predictions,y_test)}")

Train Accuracy : 1.0
Test Accuracy: 0.9675925925925926
Precision Score: 0.9925925925925926
Recall Score: 0.969258589511754
F1 Score: 0.9807868252516011
