## SECOM Data Set Information

A complex modern semiconductor manufacturing process is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and the useful information is often buried in the latter two. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables an increase in process throughput, a decrease in time to learning, and a reduction in per-unit production costs.

The numerical data are values recorded by a series of sensors placed at specified locations in the production machines, which helps identify the part of the production process that contributes to the faults.


# Objective
The objective is to minimize the rate at which faulty products leave the factory, and this is where the numerical sensor data become useful.

*   To enhance current business improvement techniques, we use feature selection techniques to rank features according to their impact on the overall yield for the product.

    *   Causal relationships may also be considered with a view to identifying the key features.

Dimensionality reduction techniques:

- Percent Missing Values
- Amount of Variation
- Pairwise Correlation
- Correlation with Target
- Recursive feature elimination
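
As a rough illustration of the first two techniques in the list above, the following sketch screens a toy DataFrame by percentage of missing values and by coefficient of variation. The function name `screen_features`, its thresholds, and the toy columns are assumptions made for this example only; they are not part of the SECOM pipeline.

```python
import numpy as np
import pandas as pd

def screen_features(df, max_missing_pct=50.0, min_coeff_var=0.0):
    """Return the columns that pass two simple filters (illustrative thresholds)."""
    # Percentage of missing values per column
    missing_pct = df.isnull().mean() * 100
    # Coefficient of variation per column; undefined values are treated as zero variation
    coeff_var = (df.std() / df.mean().abs()).fillna(0)
    keep = (missing_pct <= max_missing_pct) & (coeff_var > min_coeff_var)
    return keep[keep].index.tolist()

# Hypothetical toy data: one sparse column, one constant column, one useful column
toy = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "informative": [1.0, 2.0, 3.0, 4.0],
})
kept = screen_features(toy)
print(kept)
```

Only the column that is both sufficiently populated and non-constant survives the screen. The remaining techniques in the list need either pairwise comparisons or the target variable.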

<h2 id="importing_libraries">Install and import libraries</h2>


In [None]:
%pip install matplotlib --upgrade
%pip install fancyimpute
%pip install boruta
%pip install imbalanced-learn
%pip install xgboost
%pip install "numpy<1.24.0"
%pip install missingno

In [None]:
%pip install scikit-learn --upgrade

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno

from scipy.spatial.distance import cdist
from sklearn.preprocessing import LabelEncoder


from sklearn.model_selection import train_test_split,cross_val_score

# Import Scaler (normalizer)
from sklearn.preprocessing import MinMaxScaler

# Import Missing value imputers
from sklearn.impute import KNNImputer
from fancyimpute import IterativeImputer

# Import Feature selection methods
from boruta import BorutaPy
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso

# Import balancing methods
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

# Import models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Import Pipeline
from imblearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Grid Search
from sklearn.model_selection import GridSearchCV

# Import model performance metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, roc_curve, fbeta_score

import warnings
import os

warnings.filterwarnings('ignore')
%matplotlib inline

# Data Understanding / Descriptive Analysis
1. Histogram of percentage of missing values of features
2. Histogram of volatilities of features
3. Frequency distribution of target values
4. Correlation heatmap

# Manufacturing Operation Data (a.k.a Feature Data/Sensor Data)

In [None]:
# Read Manufacturing Operation Data (Feature Data/Sensor Data)
sensor_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data",sep=" ", header=None)
sensor_data

In [None]:
# Data types in Feature Data
# Count columns per dtype (avoids groupby(..., axis=1), which is deprecated in recent pandas versions)
type_dct_features = {str(k): int(v) for k, v in sensor_data.dtypes.value_counts().items()}
type_dct_features

In [None]:
# Add prefix "feature" to each column
sensor_data = sensor_data.add_prefix("feature")

# Descriptive analysis of whole data

In [None]:
# create dataframe for descriptive analysis
descriptive_sensor = sensor_data.describe().transpose()

# add column for number of unique values of each column
descriptive_sensor["unique"] = sensor_data.nunique()

# add column for percentage of missing values of each column
descriptive_sensor["missing_percentage"] = sensor_data.isnull().sum() * 100 / len(sensor_data)

# Count outliers per column based on Z-score
def outliers_z_score(df,n):
    outliers_list = []
    threshold = n

    for i in df.columns:
        ys = df[i]
        try:
            mean_y = np.mean(ys)
            stdev_y = np.std(ys)
            # vectorized z-scores; NaN values stay NaN and are never counted as outliers
            z_scores = (ys - mean_y) / stdev_y
            idx_outliers = np.where(np.abs(z_scores) > threshold)
            outliers_list.append(len(idx_outliers[0]))
        except Exception:
            # z-scores cannot be computed for this column (e.g. non-numeric values)
            outliers_list.append(np.nan)
    return outliers_list

# add column for number of outliers of each column
outlierls3s = outliers_z_score(sensor_data,3)
descriptive_sensor["outliers(3s)"] = outlierls3s

outlierls4s = outliers_z_score(sensor_data,4)
descriptive_sensor["outliers(4s)"] = outlierls4s

# add column for coefficient of variation (std / |mean|) of each column
descriptive_sensor["coeff_var"] = descriptive_sensor["std"]/np.absolute(descriptive_sensor["mean"])

In [None]:
descriptive_sensor
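
The z-score rule used for the outlier columns can be checked on a small example. The sketch below is a vectorized variant written for illustration; the name `count_z_outliers` and the toy data are assumptions. It uses the population standard deviation (`ddof=0`), which is what `np.std` applies by default.

```python
import pandas as pd

def count_z_outliers(df, threshold):
    """Count, per column, the values whose absolute z-score exceeds `threshold`."""
    # Standardize every column; NaN values stay NaN and compare as False below
    z = (df - df.mean()) / df.std(ddof=0)
    return (z.abs() > threshold).sum()

# Hypothetical toy data: column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({"a": [0.0] * 19 + [100.0], "b": [1.0, 2.0] * 10})
counts = count_z_outliers(toy, 3)
print(counts)
```

In column `a` the value 100 lies about 4.4 standard deviations from the mean, so it is the only point flagged by the 3s rule.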

# Semiconductor Quality Data (a.k.a Target Data)

In [None]:
# Read semiconductor quality data (target)
target_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data",sep=" ",header=None)
target_data

In [None]:
# Data types in Label Data
# Count columns per dtype (avoids groupby(..., axis=1), which is deprecated in recent pandas versions)
type_dct = {str(k): int(v) for k, v in target_data.dtypes.value_counts().items()}
type_dct

In [None]:
# Change column names
target_data.columns = ["Label","Time"]

# Convert type of columns
target_data["Label"] = target_data["Label"].astype("category")

# Convert format of Time Column as datetime
target_data["Time"] = pd.to_datetime(target_data["Time"])

## Distribution of Target Labels

In [None]:
# Set size of chart
plt.figure(figsize = (10,10))

# Labels for data
keys = ['Pass','Fail']

# Plotting data on Pie chart
Piechart_Labels = plt.pie(target_data.Label.value_counts(), labels=keys, autopct='%.2f%%', textprops={'fontsize': 20})

# Add title to the chart
plt.title('Distribution of Target Labels (Whole Data)',fontdict={'size':24})

## Timeseries of Target Label Frequencies (Pass/Fail)

In [None]:
import datetime as dt

# Create a Date column from Time (timestamp) Column of Label Data
target_data["Date"] = target_data["Time"].dt.date

# check first and last dates of Label Data
print("first date = {}".format(target_data["Date"].min()))
print("last date = {}".format(target_data["Date"].max()))

# Group Labels by Date and calculate the frequencies (count) of Label values
timeseries_label_count = (
    target_data.groupby("Date")["Label"]
    .value_counts()
    .rename("Count")                 # name the calculated column "Count"
    .rename_axis(["Date", "Label"])  # name both index levels explicitly, independent of the pandas version
    .reset_index()
)

# Rename Label values as Pass and Fail
timeseries_label_count["Label"] = timeseries_label_count["Label"].astype(int).map({-1: "Pass", 1: "Fail"})


In [None]:
# Set size of chart
plt.figure(figsize = (13,8))

# create the scatter plot
timeseries_label_scatterplot = sns.scatterplot(data=timeseries_label_count, x="Date", y="Count", size = "Count", hue="Label", sizes=(20, 200))

timeseries_label_scatterplot.set_title('Target Frequencies by Time', fontdict={'size':24})
timeseries_label_scatterplot.set_xlabel('Date',fontdict={'size':15})
timeseries_label_scatterplot.set_ylabel('Frequency', fontdict={'size':15})

# Create train and test dataset

1. First we merge the data
2. Then we drop the Date and Time columns since we don't need them anymore.
3. According to dataset's description, target values are highly imbalanced, so we split it in a stratified fashion.

In [None]:
# Merge sensor and label data
merged_df = pd.concat([target_data,sensor_data],axis=1)
merged_df.drop(["Date","Time"], axis=1, inplace=True)

# Convert labels into text categories
merged_df["Label"] = merged_df["Label"].replace({-1:"PASS", 1:"FAIL"})

# Create training and test datasets
X = merged_df.drop(["Label"],axis=1)
Y = merged_df["Label"]


# Split data into train and test by 80%-20% in a stratified fashion
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42, stratify=Y)

In [None]:
merged_df
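
To see what the stratified split buys us, the following sketch splits a hypothetical 90/10 imbalanced label vector and compares the share of the minority class in both parts. The toy labels and variable names are assumptions for illustration; the call itself mirrors the `train_test_split` usage above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 PASS and 10 FAIL
y = np.array(["PASS"] * 90 + ["FAIL"] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

fail_share_train = (y_tr == "FAIL").mean()
fail_share_test = (y_te == "FAIL").mean()
print(fail_share_train, fail_share_test)
```

Without `stratify`, a small test set could by chance contain very few or no failures, which would make the evaluation metrics unreliable.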

# Descriptive Statistics of Target Train/Test Data

In [None]:
# Set size of chart
plt.figure(figsize = (10,10))

# Labels for data
keys = ['Pass','Fail']

# Plotting data on Pie chart
Piechart_Labels_train = plt.pie(Y_train.value_counts(), labels=keys, autopct='%.2f%%', textprops={'fontsize': 20})

# Add title to the chart
plt.title('Distribution of Target Labels (Train Set)',fontdict={'size':24})

In [None]:
# Set size of chart
plt.figure(figsize = (10,10))

# Labels for data
keys = ['Pass','Fail']

# Plotting data on Pie chart
Piechart_Labels_test = plt.pie(Y_test.value_counts(), labels=keys, autopct='%.2f%%', textprops={'fontsize': 20})

# Add title to the chart
plt.title('Distribution of Target Labels (Test Set)',fontdict={'size':24})

# Descriptive Statistics of Feature Train Set

In [None]:
print("shape of feature train set :{} and shape of feature test set: {}".format(X_train.shape, X_test.shape))
print("shape of target train set :{} and shape of target test set: {}".format(Y_train.shape, Y_test.shape))

### > <font color='green'>Descriptive Analysis of X_train</font>

In [None]:
# create dataframe for descriptive analysis
descriptive_train = X_train.describe().transpose()

# add column for number of unique values of each column
descriptive_train["unique"] = X_train.nunique()

# add column for percentage of missing values of each column
descriptive_train["missing_percentage"] = X_train.isnull().sum() * 100 / len(X_train)

# add column for number of outliers of each column
outlierls3s_ = outliers_z_score(X_train,3)
descriptive_train["outliers(3s)"] = outlierls3s_

outlierls4s_ = outliers_z_score(X_train,4)
descriptive_train["outliers(4s)"] = outlierls4s_

# add column for coefficient of variation (std / |mean|) of each column
descriptive_train["coeff_var"] = descriptive_train["std"]/np.absolute(descriptive_train["mean"])
descriptive_train["coeff_var"] = descriptive_train["coeff_var"].fillna(0)

In [None]:
descriptive_train[descriptive_train["coeff_var"]<=0.25]

In [None]:
descriptive_train[descriptive_train["unique"]==1]

## 1. Histogram of Missing Values of Feature Train Set (Percentage)

In [None]:
plt.figure(figsize = (15,8))

missingval_chart = sns.histplot(descriptive_train, x="missing_percentage", binwidth=5, stat='count',legend=True)
missingval_chart.set_title('Percentage of Missing Values of Feature Train Set (y-axis capped at 25)', fontdict={'size':24})
missingval_chart.set_xlabel('Missing Values (%)',fontdict={'size':15})
missingval_chart.set_ylabel('Frequency', fontdict={'size':15})
missingval_chart.set_xticks(range(0,100,5))

for c in missingval_chart.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    missingval_chart.bar_label(c, labels=labels, fontsize=12, padding=3,label_type='center')

plt.ylim(0, 25)
plt.xlim(0,100)

In [None]:
# Number of Features having 50% or more missing values
missing_50 = descriptive_train[descriptive_train["missing_percentage"]>=50]
missing_50_cols = missing_50.index
len(missing_50)

In [None]:
missing_50_cols

## 2. Histogram of Volatilities of Feature Train Set

In [None]:
plt.figure(figsize = (15,8))
volatilities_chart1 = sns.histplot(descriptive_train, x="coeff_var", kde=True, binwidth=0.25)
volatilities_chart1.set_title('Volatilities of Feature Train Set (x-axis capped at 10)', fontdict={'size':24})
volatilities_chart1.set_xlabel('Coefficient of Variation',fontdict={'size':15})
volatilities_chart1.set_ylabel('Frequency', fontdict={'size':15})
volatilities_chart1.set_xticks(range(0,10,1))

for c in volatilities_chart1.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    volatilities_chart1.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,10)

In [None]:
# Select features having 0.25 or less coefficient of variation
coeff_variance_lessthan25percent = descriptive_train[descriptive_train["coeff_var"]<=0.25]
coeff_variance_lessthan25percent

In [None]:
# Plot histogram of features having 0.25 or less coefficient of variation (bin width 0.01)
plt.figure(figsize = (15,8))
volatilities_chart2 = sns.histplot(coeff_variance_lessthan25percent, x="coeff_var", kde=True, binwidth=0.01)
volatilities_chart2.set_title('Volatilities of Feature Train Set', fontdict={'size':24})
volatilities_chart2.set_xlabel('Coefficient of Variance',fontdict={'size':15})
volatilities_chart2.set_ylabel('Frequency', fontdict={'size':15})

for c in volatilities_chart2.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    volatilities_chart2.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.ylim(0, 25)
plt.xlim(0,0.25)

## 3. Constant Features

In [None]:
# Select features having zero coefficient of variation
constant_columns = descriptive_train[descriptive_train["coeff_var"]==0].index
constant_columns

In [None]:
# Select features having zero standard deviation
constant_columns = descriptive_train[descriptive_train["std"]==0].index
len(constant_columns)

In [None]:
# Select features having a single unique value
constant_columns = descriptive_train[descriptive_train["unique"]==1].index
len(constant_columns)

## 4. Duplicated Columns

In [None]:
# Create Dataframe for duplicated columns in feature dataset(True/False)
duplicated_df = pd.DataFrame(X_train.transpose().duplicated())

# Change column name
duplicated_df.columns = ["duplicated"]

# Get only True values for duplicated columns
duplicated_columns = duplicated_df[duplicated_df["duplicated"]==True].index

print("Number of duplicated columns = {}".format(len(duplicated_columns)))

In [None]:
X_train[duplicated_columns]

In [None]:
# Check how much of duplicated features are contained in constant features
duplicated_columns.isin(constant_columns).sum()
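
The transpose trick used above can be verified on a tiny frame. In this sketch the column names `f0` to `f3` are hypothetical; `DataFrame.duplicated` works on rows, so transposing first makes it compare columns, and the first occurrence of each set of identical columns is kept.

```python
import pandas as pd

# Hypothetical toy data: f1 and f3 repeat f0 exactly
toy = pd.DataFrame({
    "f0": [1, 2, 3],
    "f1": [1, 2, 3],
    "f2": [3, 2, 1],
    "f3": [1, 2, 3],
})

# True for every column that repeats an earlier column
dup_mask = toy.T.duplicated()
duplicated = list(dup_mask[dup_mask].index)

# Keep only the first occurrence of each distinct column
deduplicated = toy.loc[:, ~dup_mask]
print(duplicated, list(deduplicated.columns))
```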

## 5. Histogram of Number of Outliers

In [None]:
plt.figure(figsize = (15,8))
outliers_chart3s = sns.histplot(descriptive_train, x="outliers(3s)", binwidth=5)
outliers_chart3s.set_title('Histogram of Outliers of Feature Train Set (3s Rule)', fontdict={'size':24})
outliers_chart3s.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart3s.set_ylabel('Frequency', fontdict={'size':15})
for c in outliers_chart3s.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart3s.bar_label(c, labels=labels, fontsize=12, padding=3,label_type='center')
plt.xlim(0)


In [None]:
plt.figure(figsize = (15,8))
outliers_chart4s = sns.histplot(descriptive_train, x="outliers(4s)", binwidth=5)
outliers_chart4s.set_title('Outliers of Feature Train Set Before Treatment (4s Rule)', fontdict={'size':24})
outliers_chart4s.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart4s.set_ylabel('Frequency', fontdict={'size':15})

for c in outliers_chart4s.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart4s.bar_label(c, labels=labels, fontsize=12, padding=3,label_type='center')

plt.ylim(0,110)

plt.xlim(0)


## 5. Correlation Heatmap of Features

## Drop constant features

In [None]:
# Drop constant features by excluding them from train set
constant_columns_list = list(constant_columns)
constants_dropped = X_train.drop(constant_columns_list,axis=1)
print("{} columns were dropped".format(X_train.shape[1] - constants_dropped.shape[1]))

In [None]:
corr = constants_dropped.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_constants_dropped = sns.heatmap(corr)
correlation_heatmap_constants_dropped.set_title('Correlation Heatmap of Features', fontdict={'size':24})
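
The heatmap shows where strongly correlated blocks sit, but it does not select anything by itself. One common way to turn the correlation matrix into a filter is sketched below: scan the upper triangle and mark the later column of every highly correlated pair. The function name `correlated_columns_to_drop`, the 0.95 threshold, and the toy data are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(df, threshold=0.95):
    """List columns whose absolute correlation with an earlier column exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so every pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical toy data: x_scaled is a linear function of x, noise is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "x": base,
    "x_scaled": 2 * base + 1,
    "noise": rng.normal(size=200),
})
to_drop = correlated_columns_to_drop(toy)
print(to_drop)
```

Which member of a pair to keep is a design choice. This sketch keeps the earlier column, whereas ranking the pair by correlation with the target would be an alternative.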


# 6. Feature Removal

In [None]:
# Remove the features with more than 50% of Missing Values
missing_perc = pd.DataFrame(constants_dropped.isnull().sum()/len(constants_dropped)*100)
missing_perc.columns = ["percentage"]

# Take the columns to drop from constants_dropped itself, so that every name exists in the frame being modified
missing_50_col_list = list(missing_perc[missing_perc["percentage"]>50].index)
X_train = constants_dropped.drop(missing_50_col_list,axis=1)


In [None]:
X_train

In [None]:
len(missing_50_col_list)

In [None]:
# Function to check the number of remaining features at every step
def remainingFeatures(df):
    name = [name for name, value in globals().items() if value is df][0]
    # To visualize dataframe name in printing
    print("Remaining Features of "+str(name)+": "+str(df.shape[1]))

# Function to check the number of remaining NaNs at every step
def naCounter(df):
    # Count total NaNs in a Dataframe
    na_count = df.isna().sum().sum()
    # To visualize dataframe name in printing
    name = [name for name, value in globals().items() if value is df][0]

    # Print total NaNs
    print("Total NaN of "+str(name)+": "+str(na_count))

    return na_count


In [None]:
remainingFeatures(X_train)
naCounter(X_train)


### > <font color='green'>Descriptive Analysis of X_train after feature removal </font>

In [None]:
# create dataframe for descriptive analysis
descriptive_train_after_removal = X_train.describe().transpose()

# add column for number of unique values of each column
descriptive_train_after_removal["unique"] = X_train.nunique()

# add column for percentage of missing values of each column
descriptive_train_after_removal["missing_percentage"] = X_train.isnull().sum() * 100 / len(X_train)

# add column for number of outliers of each column
outlierls3s_ = outliers_z_score(X_train,3)
descriptive_train_after_removal["outliers(3s)"] = outlierls3s_

outlierls4s_ = outliers_z_score(X_train,4)
descriptive_train_after_removal["outliers(4s)"] = outlierls4s_

# add column for coefficient of variation (std / |mean|) of each column
descriptive_train_after_removal["coeff_var"] = descriptive_train_after_removal["std"]/np.absolute(descriptive_train_after_removal["mean"])
descriptive_train_after_removal["coeff_var"] = descriptive_train_after_removal["coeff_var"].fillna(0)

In [None]:
plt.figure(figsize = (15,8))
volatilities_chart1 = sns.histplot(descriptive_train, x="coeff_var", kde=True, binwidth=0.25)
volatilities_chart1.set_title('Volatilities of Feature Train Set Before Dimensionality Reduction (x-axis capped at 10)', fontdict={'size':24})
volatilities_chart1.set_xlabel('Coefficient of Variation',fontdict={'size':15})
volatilities_chart1.set_ylabel('Frequency', fontdict={'size':15})
volatilities_chart1.set_xticks(range(0,10,1))

for c in volatilities_chart1.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    volatilities_chart1.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,10)

In [None]:
plt.figure(figsize = (15,8))
volatilities_chart1 = sns.histplot(descriptive_train_after_removal, x="coeff_var", kde=True, binwidth=0.25)
volatilities_chart1.set_title('Volatilities of Feature Train Set After Dimensionality Reduction (x-axis capped at 10)', fontdict={'size':24})
volatilities_chart1.set_xlabel('Coefficient of Variation',fontdict={'size':15})
volatilities_chart1.set_ylabel('Frequency', fontdict={'size':15})
volatilities_chart1.set_xticks(range(0,10,1))

for c in volatilities_chart1.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    volatilities_chart1.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,10)

In [None]:
descriptive_train[descriptive_train["coeff_var"]<=0.25]

In [None]:
# Select features having 0.25 or less coefficient of variation
coeff_variance_lessthan25percent_after_removal = descriptive_train_after_removal[descriptive_train_after_removal["coeff_var"]<=0.25]
coeff_variance_lessthan25percent_after_removal

In [None]:
# Plot histogram of features having 0.25 or less coefficient of variation (bin width 0.01)
plt.figure(figsize = (15,8))
volatilities_chart2 = sns.histplot(coeff_variance_lessthan25percent_after_removal, x="coeff_var", kde=True, binwidth=0.01)
volatilities_chart2.set_title('Volatilities of Feature Train Set After Dimensionality Reduction', fontdict={'size':24})
volatilities_chart2.set_xlabel('Coefficient of Variance',fontdict={'size':15})
volatilities_chart2.set_ylabel('Frequency', fontdict={'size':15})

for c in volatilities_chart2.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    volatilities_chart2.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,0.25)

In [None]:
# histogram of some of the features
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    # "%matplotlib inline" is already set once at import time, so it is not repeated inside the function
    plt.rcParams.update({'figure.figsize':(8,8), 'figure.dpi':100})


    nunique = df.nunique()
    df = df[nunique[(nunique>1)&(nunique<50)].index] # For displaying purposes, pick columns that have more than 1 and fewer than 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = int((nCol + nGraphPerRow - 1) / nGraphPerRow)
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')


    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

# 7. Outlier Treatment

In [None]:
# We define a function to count number of outliers
def outlierCounter(df,number_of_std):
    lower_limit = df.mean() - number_of_std * df.std()
    upper_limit = df.mean() + number_of_std * df.std()

    # Identify outliers using the limits defined by std number
    outliers = (df < lower_limit) | (df > upper_limit)

    # Here we count the total of "Trues" and "Falses"
    true_count = outliers.values.sum()
    false_count = np.logical_not(outliers.values).sum()

    # To visualize dataframe name in printing
    name = [name for name, value in globals().items() if value is df][0]

    # Print the results
    print("Total Outliers Data Points of "+str(name)+": "+str(true_count))
    print("Total Data Points of "+str(name)+": "+str(false_count+true_count))
    print("Share of Outlier Data Points of "+str(name)+": " +str(true_count/(false_count+true_count)))

    return outliers


In [None]:
outlierCounter(X_train,4)
naCounter(X_train)

In [None]:
def outliers_treatment(df,number_of_std):
    # Calculate lower and upper limits following the selected n*s rule
    inf_limit = df.mean() - number_of_std * df.std()
    sup_limit = df.mean() + number_of_std * df.std()

    df = df.copy()

    for col in df.columns:
        # Identify outliers using the selected n*s rule
        outliers = (df[col] < inf_limit[col]) | (df[col] > sup_limit[col])
        # Replace outliers with NaN so that they are imputed later together with the missing values
        df.loc[outliers, col] = np.nan
        # Alternatives (not used): cap at the upper limit, or replace with the median
        #df[col] = np.where(outliers, sup_limit[col], df[col])
        #df[col] = np.where(outliers, df[col].median(), df[col])

    return df
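
The effect of the n*s rule is easiest to see on a small example. The sketch below is a compact variant built on `DataFrame.where`; the name `mask_outliers` and the toy data are assumptions for illustration. As in the function above, values outside mean ± n·std become NaN, so they can be imputed in the next stage.

```python
import numpy as np
import pandas as pd

def mask_outliers(df, number_of_std):
    """Replace values outside mean +/- number_of_std * std with NaN, column by column."""
    lower = df.mean() - number_of_std * df.std()
    upper = df.mean() + number_of_std * df.std()
    # where() keeps values that satisfy the condition and sets the rest to NaN
    return df.where((df >= lower) & (df <= upper))

# Hypothetical toy data: column "a" has one extreme reading, column "b" has none
toy = pd.DataFrame({
    "a": [10.0] * 29 + [1000.0],
    "b": np.linspace(0.0, 1.0, 30),
})
masked = mask_outliers(toy, 4)
print(masked.isna().sum())
```

Note that the limits are computed from data that still contain the outliers, so a single extreme value inflates the standard deviation and widens its own acceptance band.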

### > <font color='green'>outliers treated X_train</font>

In [None]:
X_train_4s = outliers_treatment(X_train,4)

In [None]:
outliers_before_treatment = outlierCounter(X_train,4)
outliers_after_treatment = outlierCounter(X_train_4s,4)

In [None]:
na_before_treatment = naCounter(X_train)
na_after_treatment = naCounter(X_train_4s)


In [None]:
na_percentage_before_treatment = na_before_treatment/(X_train.shape[0]*X_train.shape[1])*100
na_percentage_before_treatment

In [None]:
na_percentage_after_treatment = na_after_treatment/(X_train_4s.shape[0]*X_train_4s.shape[1])*100
na_percentage_after_treatment

### > <font color='green'>Descriptive Analysis of outlier-treated X_train</font>

In [None]:
# create dataframe for descriptive analysis
descriptive_X_train_4s = X_train_4s.describe().transpose()

# add column for percentage of missing values of each column
descriptive_X_train_4s["missing_percentage"] = X_train_4s.isnull().sum() * 100 / len(X_train_4s)

# add column for number of unique values of each column
descriptive_X_train_4s["unique"] = X_train_4s.nunique()

# add column for outliers
outlierls_train_4s = outliers_z_score(X_train_4s,4)
descriptive_X_train_4s["outliers(4s)"] = outlierls_train_4s

# add column for coefficient of variation (std / |mean|) of each column
descriptive_X_train_4s["coeff_var"] = descriptive_X_train_4s["std"]/np.absolute(descriptive_X_train_4s["mean"])

In [None]:
plt.figure(figsize = (15,8))
outliers_chart4s = sns.histplot(descriptive_train_after_removal, x="outliers(4s)", binwidth=5)
outliers_chart4s.set_title('Outliers of Feature Train Set Before Treatment (4s Rule)', fontdict={'size':24})
outliers_chart4s.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart4s.set_ylabel('Frequency', fontdict={'size':15})

for c in outliers_chart4s.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart4s.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0)


In [None]:
plt.figure(figsize = (15,8))
outliers_chart_X_train_4s = sns.histplot(descriptive_X_train_4s, x="outliers(4s)", binwidth=5)
outliers_chart_X_train_4s.set_title('Outliers of Feature Train Set After Treatment (4s Rule)', fontdict={'size':24})
outliers_chart_X_train_4s.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart_X_train_4s.set_ylabel('Frequency', fontdict={'size':15})

for c in outliers_chart_X_train_4s.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart_X_train_4s.bar_label(c, labels=labels, fontsize=12, padding=3)


plt.xlim(0)


In [None]:
plt.figure(figsize = (15,8))

missingval_chart = sns.histplot(descriptive_train_after_removal, x="missing_percentage", binwidth=5, stat='count',legend=True)
missingval_chart.set_title('Percentage of Missing Values of Features Train Set Before Outlier Treatment (y-axis capped at 25)', fontdict={'size':24})
missingval_chart.set_xlabel('Missing Values (%)',fontdict={'size':15})
missingval_chart.set_ylabel('Frequency', fontdict={'size':15})
missingval_chart.set_xticks(range(0,100,5))

for c in missingval_chart.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    missingval_chart.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,100)

In [None]:
plt.figure(figsize = (15,8))

missingval_chart_X_train_4s = sns.histplot(descriptive_X_train_4s, x="missing_percentage", binwidth=5, stat='count',legend=True)
missingval_chart_X_train_4s.set_title('Percentage of Missing Values of Features Train Set After Outlier Treatment (y-axis capped at 25)', fontdict={'size':24})
missingval_chart_X_train_4s.set_xlabel('Missing Values (%)',fontdict={'size':15})
missingval_chart_X_train_4s.set_ylabel('Frequency', fontdict={'size':15})
missingval_chart_X_train_4s.set_xticks(range(0,100,5))

for c in missingval_chart_X_train_4s.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    missingval_chart_X_train_4s.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0,100)

In [None]:
X_train_4s

# 8. Missing Values Treatment

## X_train_4s normalization

In [None]:
def normalizer(df, scaler):
    # Fit the given scaler on df and return the scaled values as a DataFrame
    scaler.fit(df)
    scaled_df = pd.DataFrame(scaler.transform(df), index=df.index, columns=df.columns)

    return scaled_df

In [None]:
X_train_4s_normalized = normalizer(X_train_4s, MinMaxScaler())
X_train_4s_normalized
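
Scaling before distance-based imputation only makes sense if the scaler tolerates missing values and the scaling can be undone afterwards. The sketch below checks both on a hypothetical toy frame; the variable names are assumptions for this example. `MinMaxScaler` ignores NaN when fitting and passes it through when transforming, and keeping the fitted scaler object allows a round trip with `inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "a": [0.0, 5.0, np.nan, 10.0],
    "b": [100.0, 200.0, 300.0, np.nan],
})

# Keep a reference to the fitted scaler so the scaling can be inverted later
scaler = MinMaxScaler().fit(toy)
scaled = pd.DataFrame(scaler.transform(toy), index=toy.index, columns=toy.columns)
restored = pd.DataFrame(scaler.inverse_transform(scaled), index=toy.index, columns=toy.columns)

print(scaled)
print(restored)
```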

In [None]:
# create dataframe for descriptive analysis
descriptive_normalized_X_train_4s = X_train_4s_normalized.describe().transpose()

# add column for percentage of missing values of each column
descriptive_normalized_X_train_4s["missing_percentage"] = X_train_4s_normalized.isnull().sum() * 100 / len(X_train_4s_normalized)

# add column for number of unique values of each column
descriptive_normalized_X_train_4s["unique"] = X_train_4s_normalized.nunique()

# add column for outliers
outlierls_normalized_X_train_4s = outliers_z_score(X_train_4s_normalized,4)
descriptive_normalized_X_train_4s["outliers(4s)"] = outlierls_normalized_X_train_4s

# add column for coefficient of variation (std / |mean|) of each column
descriptive_normalized_X_train_4s["coeff_var"] = descriptive_normalized_X_train_4s["std"]/np.absolute(descriptive_normalized_X_train_4s["mean"])

## 8.1.1 Imputation Method #1: HOTDECK

In [None]:
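# We create a function to develop the Hotdeck imputation method
def imputeHOTDECK(df):

    # We copy the dataframe to preserve it
    df_imputed = df.copy()
    values = df.to_numpy(dtype=float)

    # We iterate over rows with missing values
    for pos in np.where(np.isnan(values).any(axis=1))[0]:
        missing = np.isnan(values[pos])
        # Donor candidates: rows with observed values in every column that is missing here
        donors = np.where(~np.isnan(values[:, missing]).any(axis=1))[0]
        if donors.size == 0:
            continue  # no suitable donor available, leave the row as it is
        # Squared differences on the columns observed in the current row (NaN where the donor is missing)
        sq_diff = (values[donors][:, ~missing] - values[pos, ~missing]) ** 2
        # Mean squared difference over shared columns; donors sharing no column get infinite distance
        shared = (~np.isnan(sq_diff)).sum(axis=1)
        distances = np.where(shared > 0, np.nansum(sq_diff, axis=1) / np.maximum(shared, 1), np.inf)
        # Find the closest donor and copy its values into the missing cells
        most_similar_row = donors[np.argmin(distances)]
        df_imputed.iloc[pos, np.where(missing)[0]] = values[most_similar_row, missing]

    return df_imputed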

## 8.1.2 Imputation Method #2: kNN

In [None]:
def imputeKNN(df, nn):
    # We create the kNN imputer object with "nn" neighbors (5 is the scikit-learn default)
    knn_imputer = KNNImputer(n_neighbors=nn)
    # We fit the imputer and impute the missing values (returns a NumPy array)
    df_imputed = knn_imputer.fit_transform(df)
    # Turn the array into a new dataframe with the original index and columns
    df_imputed = pd.DataFrame(df_imputed, index=df.index, columns=df.columns)
    # Return imputed dataframe
    return df_imputed

In [None]:
X_train_4s_normalized_KNN = imputeKNN(X_train_4s_normalized, 5)
naCounter(X_train_4s_normalized_KNN)

In [None]:
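# normalizer() is a function and does not return the fitted scaler, so we refit an identical MinMaxScaler on X_train_4s to invert the scaling
scaler_4s = MinMaxScaler().fit(X_train_4s)
inverse_X_train_4s_normalized_KNN = pd.DataFrame(scaler_4s.inverse_transform(X_train_4s_normalized_KNN),index=X_train_4s_normalized_KNN.index, columns=X_train_4s_normalized_KNN.columns)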
inverse_X_train_4s_normalized_KNN

## 8.1.3 Imputation Method #3: MICE

%conda install -c conda-forge imbalanced-learn

%pip install fancyimpute

In [None]:
def imputeMICE(df):

    # We copy the dataframe to preserve it
    df_imputed = df.copy()
    # We create object IterativeImputer
    mice_imputer = IterativeImputer(sample_posterior=False, random_state=100)
    # We make MICE Imputation
    df_imputed.iloc[:, :] = mice_imputer.fit_transform(df)
    # Return imputed DataFrame
    return df_imputed

In [None]:
X_train_4s_MICE = imputeMICE(X_train_4s)

In [None]:
naCounter(X_train_4s_MICE)

In [None]:
Y_train

## 8.2 Evaluation of imputation

In [None]:
# create dataframe for descriptive analysis
descriptive_X_train_4s_normalized_KNN = X_train_4s_normalized_KNN.describe().transpose()

# add column for number of unique values of each column
descriptive_X_train_4s_normalized_KNN["unique"] = X_train_4s_normalized_KNN.nunique()

# add column for outliers
outlierls_X_train_4s_normalized_KNN = outliers_z_score(X_train_4s_normalized_KNN,4)
descriptive_X_train_4s_normalized_KNN["outliers(4s)"] = outlierls_X_train_4s_normalized_KNN

# add column for coefficient of variation (std / |mean|) of each column
descriptive_X_train_4s_normalized_KNN["coeff_var"] = descriptive_X_train_4s_normalized_KNN["std"]/np.absolute(descriptive_X_train_4s_normalized_KNN["mean"])

In [None]:
# create dataframe for descriptive analysis
descriptive_inverse_X_train_4s_normalized_KNN = inverse_X_train_4s_normalized_KNN.describe().transpose()

# add column for number of unique values of each column
descriptive_inverse_X_train_4s_normalized_KNN["unique"] = inverse_X_train_4s_normalized_KNN.nunique()

# add column for outliers
outlierls_inverse_X_train_4s_normalized_KNN = outliers_z_score(inverse_X_train_4s_normalized_KNN,4)
descriptive_inverse_X_train_4s_normalized_KNN["outliers(4s)"] = outlierls_inverse_X_train_4s_normalized_KNN

# add column for coefficient of variation (std / |mean|) of each column
descriptive_inverse_X_train_4s_normalized_KNN["coeff_var"] = descriptive_inverse_X_train_4s_normalized_KNN["std"]/np.absolute(descriptive_inverse_X_train_4s_normalized_KNN["mean"])

In [None]:
# create dataframe for descriptive analysis
descriptive_X_train_4s_MICE = X_train_4s_MICE.describe().transpose()

# add column for percentage of missing values of each column
descriptive_X_train_4s_MICE["missing_percentage"] = X_train_4s_MICE.isnull().sum() * 100 / len(X_train_4s_MICE)

# add column for number of unique values of each column
descriptive_X_train_4s_MICE["unique"] = X_train_4s_MICE.nunique()

# add column for outliers
outlierls_train_4s_mice = outliers_z_score(X_train_4s_MICE,4)
descriptive_X_train_4s_MICE["outliers(4s)"] = outlierls_train_4s_mice

# add column for coefficient of variation (std / |mean|) of each column
descriptive_X_train_4s_MICE["coeff_var"] = descriptive_X_train_4s_MICE["std"]/np.absolute(descriptive_X_train_4s_MICE["mean"])

In [None]:
plt.figure(figsize = (15,8))
outliers_chart_X_train_4s_normalized_KNN = sns.histplot(descriptive_X_train_4s_normalized_KNN, x="outliers(4s)", binwidth=5)
outliers_chart_X_train_4s_normalized_KNN.set_title('Outliers of Feature Train Set After Outlier Treatment & Missing Value Imputation (KNN)', fontdict={'size':24})
outliers_chart_X_train_4s_normalized_KNN.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart_X_train_4s_normalized_KNN.set_ylabel('Frequency', fontdict={'size':15})

for c in outliers_chart_X_train_4s_normalized_KNN.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart_X_train_4s_normalized_KNN.bar_label(c, labels=labels, fontsize=12, padding=3)


plt.xlim(0)


In [None]:
plt.figure(figsize = (15,8))
outliers_chart_X_train_4s_MICE = sns.histplot(descriptive_X_train_4s_MICE, x="outliers(4s)", binwidth=5)
outliers_chart_X_train_4s_MICE.set_title('Outliers of Feature Train Set After Outlier Treatment & Missing Value Imputation (MICE)', fontdict={'size':24})
outliers_chart_X_train_4s_MICE.set_xlabel('Number of Outliers',fontdict={'size':15})
outliers_chart_X_train_4s_MICE.set_ylabel('Frequency', fontdict={'size':15})

for c in outliers_chart_X_train_4s_MICE.containers:

    # customize the label to account for cases when there might not be a bar section
    labels = [f'{h:0.0f}' if (h := v.get_height()) != 0 else '' for v in c ]

    # set the bar label
    outliers_chart_X_train_4s_MICE.bar_label(c, labels=labels, fontsize=12, padding=3)

plt.xlim(0)


In [None]:
descriptive_normalized_X_train_4s["missing_percentage"].sort_values()

In [None]:
# Visualization of a variable with most missing values
plt.figure(figsize = (20,12))
X_train_4s['feature345'].plot(kind='kde',c='red',linewidth=3)
inverse_X_train_4s_normalized_KNN['feature345'].plot(kind='kde')
labels = ['Baseline', 'KNN']
plt.legend(labels)
plt.xlabel('feature345')
plt.gca().set(title='Density plot of feature345');

In [None]:
# Visualization of a variable with most missing values
plt.figure(figsize = (20,12))
X_train_4s['feature346'].plot(kind='kde',c='red',linewidth=3)
X_train_4s_MICE['feature346'].plot(kind='kde')
labels = ['Baseline','MICE']
plt.legend(labels)
plt.xlabel('feature346')
plt.gca().set(title='Density plot of feature346');

In [None]:
# Visualization of a variable with most missing values
plt.figure(figsize = (20,12))
X_train_4s['feature346'].plot(kind='kde',c='red',linewidth=3)
inverse_X_train_4s_normalized_KNN['feature346'].plot(kind='kde')
X_train_4s_MICE['feature346'].plot(kind='kde')
labels = ['Baseline', 'KNN','MICE']
plt.legend(labels)
plt.xlabel('feature346')
plt.gca().set(title='Density plot of feature346')

In [None]:
# Visualization of a variable with most missing values
plt.figure(figsize = (20,12))
X_train_4s['feature346'].plot(kind='kde',c='red',linewidth=3)
inverse_X_train_4s_normalized_KNN['feature346'].plot(kind='kde')
X_train_4s_MICE['feature346'].plot(kind='kde')
labels = ['Baseline', 'KNN','MICE']
plt.legend(labels)
plt.xlabel('feature346')
plt.gca().set(title='Density plot of feature346');

In [None]:
corr_X_train_4s = X_train_4s.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_X_train_4s= sns.heatmap(corr_X_train_4s, annot=False)
correlation_heatmap_X_train_4s.set_title('Correlation Heatmap of Features (Before Imputation of Missing Values)', fontdict={'size':24})


In [None]:
corr_X_train_4s_normalized_KNN = X_train_4s_normalized_KNN.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
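# Use a separate name for the plot so the correlation matrix is not overwritten
correlation_heatmap_X_train_4s_normalized_KNN = sns.heatmap(corr_X_train_4s_normalized_KNN, annot=False)
correlation_heatmap_X_train_4s_normalized_KNN.set_title('Correlation Heatmap of Features (After KNN Imputation)', fontdict={'size':24})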


In [None]:
corr_X_train_4s_MICE = X_train_4s_MICE.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_X_train_4s_MICE= sns.heatmap(corr_X_train_4s_MICE, annot=False)
correlation_heatmap_X_train_4s_MICE.set_title('Correlation Heatmap of Features (After MICE Imputation)', fontdict={'size':24})


Since 450 features remain, we need further methods to reduce the number of features before fitting the models, both to keep computation manageable and to focus on the most relevant signals.

# 9. Feature Selection

## 9.1. Feature Selection Method #1: BORUTA (Wrapper)

Boruta is a feature selection algorithm.

It works as a wrapper algorithm around Random Forest.

In Boruta, features do not compete with one another. Instead, they compete against a randomized version of themselves called 'shadow features'.
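The shadow-feature idea can be illustrated with a single iteration written by hand. This is only a minimal sketch on synthetic data, not the SECOM set, and all variable names are assumptions made for the example. It is also not the BorutaPy implementation, which repeats the comparison over many iterations and applies a statistical test to the accumulated hits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical): with shuffle=False the 3 informative
# features are the first 3 columns and the remaining 5 are pure noise
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rng = np.random.default_rng(0)

# Shadow features: each column is permuted independently, which keeps its
# distribution but destroys any relationship with the target
X_shadow = np.column_stack([rng.permutation(col) for col in X.T])

# Fit one forest on the real and shadow features together
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X, X_shadow]), y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_importance = forest.feature_importances_[n_real:]

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_importance > shadow_importance.max()
print(hits)
```

A feature that cannot beat the best shadow feature is, in this single run, no more useful than random noise.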



In [None]:
def implementBoruta(df_X, df_y):

    # BorutaPy expects NumPy arrays, so we pass the underlying values
    X_values = df_X.values
    y_values = np.ravel(df_y.values)

    # We need to create a Random Forest Classifier
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=100)

    # We create the Boruta object
    boruta = BorutaPy(rf, n_estimators='auto', random_state=100, max_iter=100)

    # We start the feature selection process
    boruta.fit(X_values, y_values)

    # We get the list of selected features
    selected = df_X.columns[boruta.support_].tolist()
    print(selected)

    # We return the feature names (as implementRFE does) so that they can be
    # used to select columns from any dataframe, e.g. df[selected]
    return selected

In [None]:
Y_train_reset_encoded = Y_train_reset.replace({"PASS":0, "FAIL":1})

In [None]:
# Boruta feature selection from the normalized trainset imputed with KNN
borutafeatures_X_train_4s_normalized_KNN = implementBoruta(X_train_4s_normalized_KNN, Y_train_reset)
borutafeatures_X_train_4s_normalized_KNN

In [None]:
# Boruta feature selection from the trainset (not scaled) imputed with MICE
borutafeatures_X_train_4s_MICE = implementBoruta(X_train_4s_MICE, Y_train)

# dataframe of features selected
pd.DataFrame(borutafeatures_X_train_4s_MICE)

In [None]:
Y_train

## 9.2. Feature Selection Method #2: RFE (Wrapper)

RFE (Recursive Feature Elimination) is a greedy optimization technique that looks for the highest performing feature subset.

It fits a model over and over again, setting aside the least important feature(s) at each iteration.

It builds the next model using the leftover features until only the requested number of features remains.

The features are then ranked based on the order of their elimination.
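The elimination loop and the resulting ranking can be seen on a small synthetic example. This is only a sketch with made-up data and illustrative parameter values, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (hypothetical): 10 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# Drop up to 2 features per iteration until 3 features are left
rfe = RFE(estimator=forest, n_features_to_select=3, step=2)
rfe.fit(X, y)

# support_ is a boolean mask of the kept features;
# ranking_ is 1 for kept features and grows the earlier a feature was dropped
print(rfe.support_)
print(rfe.ranking_)
```

A larger `step` speeds up the search on wide datasets such as this one, at the cost of a coarser ranking, since all features dropped in the same iteration share a rank.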

In [None]:
def implementRFE(df_X, df_y):

    # scikit-learn works on NumPy arrays, so we pass the underlying values
    X_values = df_X.values
    y_values = np.ravel(df_y.values)

    # We need to create a Random Forest Classifier
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=100)

    # We create the RFE object: keep 20 features, dropping 10 per iteration
    rfe = RFE(estimator = rf, n_features_to_select=20, step=10)

    # We start the feature selection process
    rfe.fit(X_values, y_values)

    # We get the list of selected features
    selected = df_X.columns[rfe.support_].tolist()
    print(selected)
    return selected

In [None]:
# RFE feature selection from trainset imputed with KNN
RFEfeatures_X_train_4s_normalized_KNN = implementRFE(X_train_4s_normalized_KNN, Y_train_reset)

# dataframe of features selected
pd.DataFrame(RFEfeatures_X_train_4s_normalized_KNN)

In [None]:
# RFE feature selection from trainset imputed with MICE
RFEfeatures_X_train_4s_MICE = implementRFE(X_train_4s_MICE, Y_train)

# dataframe of features selected
pd.DataFrame(RFEfeatures_X_train_4s_MICE)

## 9.3. Feature Selection Method #3: LASSO (Embedded Method)

In [None]:
# Assuming you have your features stored in X and the target variable in y

def implementLasso(X, y):
    # Create Lasso regression model
    lasso = Lasso(alpha=0.1)  # Set the regularization parameter alpha

    # Perform feature selection using Lasso
    feature_selector = SelectFromModel(lasso)
    selected_features = feature_selector.fit_transform(X, y)

    # Get the selected feature indices
    feature_indices = feature_selector.get_support(indices=True)

    # Print the selected feature names
    selected_feature_names = X.columns[feature_indices]
    print("Selected features:", selected_feature_names)

    return selected_feature_names

In [None]:
# Encode the text labels of the training examples as numbers
Y_train_encoded = Y_train.replace({"PASS":0, "FAIL":1})

In [None]:
# Lasso feature selection from trainset imputed with KNN
Lassofeatures_X_train_4s_normalized_KNN = implementLasso(X_train_4s_normalized_KNN, Y_train_reset_encoded)

# dataframe of features selected
pd.DataFrame(Lassofeatures_X_train_4s_normalized_KNN)

In [None]:
# Encode the text labels of the training examples as numbers
Y_train_encoded = Y_train.replace({"PASS":0, "FAIL":1})

# Lasso feature selection from trainset imputed with MICE
Lassofeatures_X_train_4s_MICE = implementLasso(X_train_4s_MICE, Y_train_encoded)

# dataframe of features selected
pd.DataFrame(Lassofeatures_X_train_4s_MICE)

# Correlation Matrix of Selected Features

In [None]:
corr_borutafeatures_X_train = X_train[borutafeatures_X_train_4s_normalized_KNN].corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
heatmap_borutafeatures_X_train= sns.heatmap(corr_borutafeatures_X_train, annot=True)
heatmap_borutafeatures_X_train.set_title('Correlation Heatmap of Features with BORUTA (Before Outlier Treatment)', fontdict={'size':24})



In [None]:
corr_borutafeatures_X_train_4s = X_train_4s[borutafeatures_X_train_4s_normalized_KNN].corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
heatmap_borutafeatures_X_train_4s= sns.heatmap(corr_borutafeatures_X_train_4s, annot=True)
heatmap_borutafeatures_X_train_4s.set_title('Correlation Heatmap of Features with BORUTA (After Outlier Treatment)', fontdict={'size':24})



In [None]:
corr_borutafeatures_X_train_4s_normalized_KNN = X_train_4s_normalized_KNN[borutafeatures_X_train_4s_normalized_KNN].corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_borutafeatures_X_train_4s_normalized_KNN= sns.heatmap(corr_borutafeatures_X_train_4s_normalized_KNN, annot=True)
correlation_heatmap_borutafeatures_X_train_4s_normalized_KNN.set_title('Correlation Heatmap of Features with BORUTA (After KNN)', fontdict={'size':24})



# 10. Class Balancing Methods   

In [None]:
Y_train.value_counts()

## 10.1 Balancing Method #1: SMOTE

### 10.1.1 Knn + Boruta + SMOTE

In [None]:
oversampler = SMOTE(random_state=88)
X_train_normalized_smote_boruta_knn, Y_train_smote = oversampler.fit_resample(X_train_4s_normalized_KNN[borutafeatures_X_train_4s_normalized_KNN], Y_train_reset)

In [None]:
Y_train_smote.value_counts()

In [None]:
corr_X_train_normalized_smote_boruta_knn = X_train_normalized_smote_boruta_knn.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_X_train_normalized_smote_boruta_knn = sns.heatmap(corr_X_train_normalized_smote_boruta_knn, annot=True)
correlation_heatmap_X_train_normalized_smote_boruta_knn.set_title('Correlation Heatmap of Features (Knn + Boruta + SMOTE)', fontdict={'size':24})

### 10.1.2 Mice + Boruta + SMOTE

In [None]:
oversampler = SMOTE(random_state=88)
# Use a separate target name so the KNN-based Y_train_smote is not overwritten
X_train_normalized_smote_boruta_mice, Y_train_smote_mice = oversampler.fit_resample(X_train_4s_MICE[borutafeatures_X_train_4s_MICE], Y_train)

## 10.2. Balancing Method #2: ROSE

### 10.2.1 Knn + Boruta + ROSE

In [None]:

rose = RandomOverSampler(random_state=88)
X_train_normalized_rose_boruta_knn, Y_train_rose = rose.fit_resample(X_train_4s_normalized_KNN[borutafeatures_X_train_4s_normalized_KNN], Y_train_reset)

In [None]:
Y_train_rose.value_counts()

In [None]:
corr = X_train_normalized_rose_boruta_knn.corr()
# Correlation Heatmap
plt.figure(figsize = (20,8))
correlation_heatmap_X_train_rose = sns.heatmap(corr, annot=True)
correlation_heatmap_X_train_rose.set_title('Correlation Heatmap of Features (Knn + Boruta + ROSE)', fontdict={'size':24})

# 11. Model Deployment

### Preprocessing Test Dataset

In [None]:
# Drop the same columns that were dropped from the Train Set: constant columns and columns with more than 50% missing values

X_test.drop(constant_columns,axis=1,inplace=True)
X_test.drop(missing_50_col_list, axis=1, inplace=True)

In [None]:
# We apply the same outlier treatment but over TEST dataset
X_test_2 = outliers_treatment(X_test,4)

In [None]:
# Since we are going to use the KNN imputation method for missing values,
# we apply Min-Max scaling to X_test_2, reusing the scaler fitted on the
# training set (`normalizer`) so that no test-set statistics leak into it
X_test_scaled = pd.DataFrame(normalizer.transform(X_test_2), index=X_test_2.index, columns=X_test_2.columns)

In [None]:
# Over X_test_scaled we apply KNN imputation
X_test_imputed_KNN = imputeKNN(X_test_scaled, 5)
# We reset the index of the test target values,
# as was done for the training target (Y_train_reset)
Y_test_reset = Y_test.reset_index(drop=True)
# Then, we filter to only get the features selected by Boruta+KNN in the previous step.
X_test_final = X_test_imputed_KNN[borutafeatures_X_train_4s_normalized_KNN]

### Prediction & Results

In [None]:
# STEP 1: in order to organize the information better, we rename the variables with the treated datasets.
X_test = X_test_final
y_train = Y_train_smote
y_test = Y_test_reset

# STEP 2: We create and train the RandomForestClassifier model
randfor = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=100)
randfor.fit(X_train_normalized_smote_boruta_knn, y_train)

# STEP 3: We create the predictor to use it over the test dataset
y_pred = randfor.predict(X_test_final)

# STEP 4: We convert the labels into numeric values, as the metric functions require
y_test_numeric = np.where(y_test == 'FAIL', 1, 0)
y_pred_numeric = np.where(y_pred == 'FAIL', 1, 0)

# STEP 5: Now we convert these values into dataframes, also as the libraries require
y_pred = pd.DataFrame(y_pred)
y_test_numeric = pd.DataFrame(y_test_numeric)
y_pred_numeric = pd.DataFrame(y_pred_numeric)

# STEP 6: We calculate the confusion matrix and print it
confusion = confusion_matrix(y_test_numeric, y_pred_numeric)
print("Confusion Matrix:")
print(confusion)

# STEP 7: We calculate accuracy and print it
accuracy = accuracy_score(y_test_numeric, y_pred_numeric)
print("Accuracy:", accuracy)

# STEP 8: We calculate precision and print it
precision = precision_score(y_test_numeric, y_pred_numeric)
print("Precision:", precision)

# STEP 9: We calculate Recall index and print it
recall = recall_score(y_test_numeric, y_pred_numeric)
print("Recall:", recall)

# STEP 10: We calculate F1-score and print it
f1 = f1_score(y_test_numeric, y_pred_numeric)
print("F1-score:", f1)

# STEP 11: We calculate Area Under the Curve of ROC score (ROC-AUC) and print it
roc_auc = roc_auc_score(y_test_numeric, y_pred_numeric)
print("ROC AUC:", roc_auc)

## GridSearch for parameters

In [None]:
# Define your Boruta transformer
class BorutaFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, n_estimators=100, random_state=None):
        self.estimator = estimator
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.selector = None

    def fit(self, X, y):
        self.selector = BorutaPy(estimator=self.estimator,
                                 n_estimators=self.n_estimators,
                                 random_state=self.random_state)
        self.selector.fit(X, y)
        return self

    def transform(self, X):
        return X[:, self.selector.support_]

In [None]:
# Encode the train and test labels as numeric values
Y_test_encoded = Y_test.replace({"PASS":0, "FAIL":1})
Y_train_encoded = Y_train.replace({"PASS":0, "FAIL":1})

In [None]:
# Create the pipeline with imbalance handling, imputation, scaling, and classifier
pipeline_knn = Pipeline([

    ('scaler', MinMaxScaler()),
    ('imputer', KNNImputer()),
    ('selector', BorutaFeatureSelector(estimator=RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=100))),
    ('sampler', SMOTE()),
    ('classifier', RandomForestClassifier())
])

# Define the parameter grid for grid search
param_grid = {'selector': [BorutaPy(estimator=RandomForestClassifier()), RFE(estimator=RandomForestClassifier())],
              'sampler': [SMOTE(), RandomOverSampler()],
    'classifier': [RandomForestClassifier(), SVC(), LogisticRegression(), XGBClassifier()],
    'imputer__n_neighbors': [4,5,6],
    'imputer__weights': ['uniform', 'distance'],
    'sampler__random_state': [42,100]
}


# Create the grid search object
grid_search = GridSearchCV(pipeline_knn, param_grid, cv=5, scoring='f1')

# Fit the grid search object on the data
grid_search.fit(X_train_4s, Y_train_encoded)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Fit the pipeline with the best parameters on the full training data
pipeline_knn.set_params(**best_params)
pipeline_knn.fit(X_train_4s, Y_train_encoded)

# Evaluate the pipeline on the test data
accuracy = pipeline_knn.score(X_test_2, Y_test_encoded)

In [None]:
# Create dataframe from gridsearch results
result_df = pd.DataFrame.from_dict(grid_search.cv_results_, orient='columns')
result_df

### Pipeline Prediction Results

In [None]:
def pipeline_results(pipeline, xtest, ytest, confusion_title):
    
    # STEP 1: in order to organize the information better, we rename the variables with the treated datasets.
    x_test = xtest
    y_test = ytest


    y_pred = pipeline.predict(x_test)

    # STEP 5: Now we convert these values into dataframes, also as the libraries require
    y_pred_numeric = pd.DataFrame(y_pred)
    y_test_numeric = pd.DataFrame(y_test)


    # STEP 6: We calculate the confusion matrix and print it
    confusion = confusion_matrix(y_test_numeric, y_pred_numeric)
    print("Confusion Matrix:")
    print(confusion)

    # STEP 7: We calculate accuracy and print it
    accuracy = accuracy_score(y_test_numeric, y_pred_numeric)
    print("Accuracy:", accuracy)

    # STEP 8: We calculate precision and print it
    precision = precision_score(y_test_numeric, y_pred_numeric)
    print("Precision:", precision)

    # STEP 9: We calculate Recall index and print it
    recall = recall_score(y_test_numeric, y_pred_numeric)
    print("Recall:", recall)

    # STEP 10: We calculate F1-score and print it
    f1 = f1_score(y_test_numeric, y_pred_numeric)
    print("F1-score:", f1)

    # STEP 11: We calculate Area Under the Curve of ROC score (ROC-AUC) and print it
    roc_auc = roc_auc_score(y_test_numeric, y_pred_numeric)
    print("ROC AUC:", roc_auc)

    # STEP 12: To visualize the confusion matrix, we create a heatmap using the existent matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues",xticklabels=["PASS", "FAIL"], yticklabels=["PASS", "FAIL"])
    plt.title(confusion_title)
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.show()

#### KNN+BORUTA+SMOTE+RandomForestClassifier

In [None]:
# Create the pipeline with imbalance handling, imputation, scaling, and classifier
pipeline_knn_boruta_smote_RandomForestClassifier = Pipeline([

    ('scaler', MinMaxScaler()),
    ('imputer', KNNImputer(n_neighbors=5,weights="uniform")),
    ('selector', BorutaFeatureSelector(estimator=RandomForestClassifier())),
    ('sampler', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
])

pipeline_knn_boruta_smote_RandomForestClassifier.fit(X_train_4s, Y_train_encoded)

pipeline_results(pipeline_knn_boruta_smote_RandomForestClassifier, X_test_2, Y_test_encoded, "Confusion Matrix - KNN+BORUTA+SMOTE+RandomForestClassifier")

####  KNN+RFE+SMOTE+SVC

In [None]:
# Create the pipeline with imbalance handling, imputation, scaling, and classifier
pipeline_knn_rfe_smote_SVC = Pipeline([

    ('scaler', MinMaxScaler()),
    ('imputer', KNNImputer(n_neighbors=4,weights="uniform")),
    ('selector', RFE(estimator=RandomForestClassifier())),
    ('sampler', SMOTE(random_state=100)),
    ('classifier', SVC())
])


pipeline_knn_rfe_smote_SVC.fit(X_train_4s, Y_train_encoded)

pipeline_results(pipeline_knn_rfe_smote_SVC, X_test_2, Y_test_encoded, "Confusion Matrix - KNN+RFE+SMOTE+SVC")

#### KNN+RFE+SMOTE+LogisticRegression

In [None]:
# Create the pipeline with imbalance handling, imputation, scaling, and classifier
pipeline_knn_rfe_smote_LogisticRegression = Pipeline([

    ('scaler', MinMaxScaler()),
    ('imputer', KNNImputer(n_neighbors=4,weights="distance")),
    ('selector', RFE(estimator=RandomForestClassifier())),
    ('sampler', SMOTE(random_state=100)),
    ('classifier', LogisticRegression())
])


pipeline_knn_rfe_smote_LogisticRegression.fit(X_train_4s, Y_train_encoded)

pipeline_results(pipeline_knn_rfe_smote_LogisticRegression, X_test_2, Y_test_encoded, "Confusion Matrix - KNN+RFE+SMOTE+LogisticRegression")

#### KNN+BORUTA+SMOTE+XGB

In [None]:
# Create the pipeline with imbalance handling, imputation, scaling, and classifier
pipeline_knn_boruta_smote_XGB = Pipeline([

    ('scaler', MinMaxScaler()),
    ('imputer', KNNImputer(n_neighbors=4,weights="distance")),
    ('selector', BorutaFeatureSelector(estimator=RandomForestClassifier(), random_state=42)),
    ('sampler', SMOTE(random_state=42)),
    ('classifier', XGBClassifier(seed = 42 , objective = 'binary:logistic', missing = 0))
])

pipeline_knn_boruta_smote_XGB.fit(X_train_4s, Y_train_encoded)

pipeline_results(pipeline_knn_boruta_smote_XGB, X_test_2, Y_test_encoded, "Confusion Matrix - KNN+BORUTA+SMOTE+XGB")

In [None]:
# Create the pipeline with imputation, feature selection, imbalance handling, and classifier
pipeline_knn_Lasso = Pipeline([

    ('imputer', KNNImputer()),
    # `normalize` was removed from Lasso in scikit-learn 1.2, so it is not passed here
    ('selector', SelectFromModel(Lasso(random_state=42))),
    ('sampler', SMOTE()),
    ('classifier', SVC())
])

# Define the parameter grid for grid search
param_grid = {
    'sampler': [SMOTE(), RandomOverSampler()],
    'classifier': [RandomForestClassifier(), SVC(),LogisticRegression(),XGBClassifier()],
    # KNNImputer is tuned through n_neighbors (n_nearest_features belongs to IterativeImputer)
    'imputer__n_neighbors': [4,5,6],
    'sampler__random_state': [42,100]
}

# Create the grid search object
grid_search_knn_Lasso = GridSearchCV(pipeline_knn_Lasso, param_grid, cv=5, scoring='f1',n_jobs=-1)

