# Wafer Fault Prediction

**Brief:** In electronics, a **wafer** (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate (the foundation on which other components are constructed) for microelectronic devices built in and upon the wafer.

It undergoes many microfabrication processes, such as doping, ion implantation, etching, thin-film deposition of various materials, and photolithographic patterning. Finally, the individual microcircuits are separated by wafer dicing and packaged as an integrated circuit.

## Problem Statement

**Data:** Wafers data


**Problem Statement:** Wafers are predominantly used to manufacture solar cells. They are deployed in bulk at remote locations, and each wafer carries a few hundred sensors. Wafers are fundamental to photovoltaic power generation, and producing them requires advanced technology. A photovoltaic power generation system converts sunlight directly into electrical energy.

The motive behind detecting faulty wafers automatically is to eliminate the need for manual inspection. Make no mistake: even when a wafer was merely suspected to be faulty, it had to be opened up from scratch to deal with the issue, and all the wafers in its vicinity had to be stopped, disrupting the whole process. And that is the case when the wafer was indeed faulty. When the suspicion turned out to be a false alarm, we can only imagine the waste of time, manpower and, of course, cost.

**Solution:** Sensor data from each wafer is passed through a machine learning pipeline that determines whether the wafer at hand is faulty, reducing the need for, and thus the cost of, manual inspection.

## # Import Required Libraries:

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
## Load the feature store dataset as dataframe

file_path = r"C:\ZZZ_PROGRAMING\DS ML Gen AI\ML Sensor Project\notebooks\wafer_23012020_041211.csv"
wafers = pd.read_csv(file_path)
print("Shape of the feature store dataset: ", wafers.shape)
wafers.head()

In [None]:
wafers.columns
## 592 columns

In [None]:
wafers.shape

In [None]:
wafers.drop(columns = ["Unnamed: 0", "Good/Bad"]).iloc[ : 100].to_csv("test.csv", index = False)

In [None]:
# Rename the column "Unnamed: 0" to "Wafer"

wafers.rename(columns = {"Unnamed: 0" : "Wafer"}, inplace = True)

In [None]:
## Train-Test Split

from sklearn.model_selection import train_test_split

wafers, wafers_test = train_test_split(wafers, test_size = .20, random_state = 42)

In [None]:
wafers.shape

In [None]:
## Wafers' Info

wafers.info()

In [None]:
## Description of `wafers`

wafers.describe()

### Insight:

Judging from only the columns shown, some of the features appear to have pretty bad outliers. One thing is for sure: the data must be scaled.

In [None]:
## Looking at the categories in our target feature
wafers['Good/Bad'].value_counts()

### Insight:

Heavily imbalanced. Definitely gonna need `resampling`.
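
To put a number on "heavily imbalanced", the class shares and their ratio can be computed directly. This is only a sketch: the label counts below are made up for illustration and are not taken from the wafers dataset.

```python
import pandas as pd

# Hypothetical labels (94 of one class, 6 of the other); NOT the real wafer counts
labels = pd.Series([-1] * 94 + [1] * 6, name="Good/Bad")

# Share of each class, and how many majority samples exist per minority sample
shares = labels.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

On the real data, the same two lines applied to `wafers['Good/Bad']` would give the actual ratio.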

## # Analyze Missing Data:

Firstly, we'll check for missing data in the target feature and drop those records. **A record without a label cannot be used for supervised training or evaluation, and imputing the label would mean fabricating the very thing we want the model to predict.** Therefore, the best way to deal with missing target entries is to delete them. For missing values in the other features, we can definitely use imputation strategies.
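
A minimal sketch of that rule on a toy frame (the values are invented; only the column name `Good/Bad` comes from the dataset): rows with a missing label are dropped, while gaps in the sensor columns are left alone for the imputer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `wafers`; values are illustrative only
toy = pd.DataFrame({
    "Sensor-1": [0.1, 0.4, np.nan, 0.9],
    "Good/Bad": [-1, np.nan, 1, -1],
})

# Drop only the rows whose label is missing; feature gaps stay for imputation
labelled = toy.dropna(subset=["Good/Bad"])
print(labelled)
```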

In [None]:
## Check missing values in target feature
wafers["Good/Bad"].isna().sum()

**=>** Woa, not even a single missing entry, I didn't see that coming.

In [None]:
wafers.isna().sum().sum()

In [None]:
## Check missing values in the independent feature variables
## Expressed as a fraction of all feature cells (multiply by 100 for a percentage)
wafers.isna().sum().sum() / (wafers.shape[0] * (wafers.shape[1] - 1))

In [None]:
wafers.shape[1]

In [None]:
wafers.shape[0]

**=>** Almost 4% of all the cells we have are missing.

We're gonna try all sorts of imputation strategies and choose the one that gives us the least overall error.
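
One way such a comparison could be run, sketched here on synthetic data rather than the wafers set: hide a few known cells, let each imputer fill them in, and score the fills against the hidden truth. The candidate list, the 4% masking rate and the RMSE metric are assumptions made for this illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic, correlated "sensor" matrix with no gaps (stand-in for real data)
base = rng.normal(size=(200, 1))
X_full = base + 0.1 * rng.normal(size=(200, 5))

# Hide roughly 4% of the cells so the true values are known for scoring
mask = rng.random(X_full.shape) < 0.04
X_missing = X_full.copy()
X_missing[mask] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn_3": KNNImputer(n_neighbors=3),
}

# RMSE between imputed and true values, measured on the hidden cells only
rmse = {}
for name, imputer in candidates.items():
    X_filled = imputer.fit_transform(X_missing)
    rmse[name] = float(np.sqrt(np.mean((X_filled[mask] - X_full[mask]) ** 2)))

best = min(rmse, key=rmse.get)
print(rmse)
print("Lowest-error strategy on this toy data:", best)
```

The ranking on toy data says nothing about the wafers set itself; the same loop would have to be run on rows of the real data where the values are known.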

## # Visualization of Sensors' distribution:

In [None]:
# let's have a look at the distribution of the first 50 sensors of Wafers

In [None]:
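# Select 50 distinct random sensors (column positions 1..590, same range as before)
# The earlier loop could pick the same position twice; sampling without replacement cannot
rng = np.random.default_rng(42)
random_50_sensors_idx = rng.choice(np.arange(1, 591), size = 50, replace = False)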

In [None]:
# let's now have a look at the distribution of the 50 random sensors
plt.figure(figsize = (15, 100))
for i, col in enumerate(wafers.columns[random_50_sensors_idx]):
    plt.subplot(60, 3, i + 1)
    # histplot with a KDE overlay stands in for the deprecated sns.distplot
    sns.histplot(x = wafers[col], kde = True, color = 'indianred')
    plt.xlabel(col, weight = 'bold')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

### Insight:

A good number of them (whether the first 50 or a random 50) are either constant (zero standard deviation) or skewed to the left or right. It isn't feasible to analyze each feature and deal with its outliers individually, so we'll have to rely on scaling.

For the **features with 0 standard deviation**, we can drop them straight away, and for the others that do have outliers, we'll go ahead with `Robust Scaling`.
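
Why robust scaling rather than standard scaling? A tiny sketch with one invented outlier shows the difference: `StandardScaler` centres on the mean and divides by the standard deviation, both of which the outlier drags along, while `RobustScaler` uses the median and the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary readings plus one extreme outlier (made-up values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x).ravel()      # (x - median) / IQR
standard = StandardScaler().fit_transform(x).ravel()  # (x - mean) / std

print("Robust  :", np.round(robust, 3))
print("Standard:", np.round(standard, 3))

# Spread of the four ordinary readings after each scaling
robust_spread = robust[:4].max() - robust[:4].min()
standard_spread = standard[:4].max() - standard[:4].min()
print(robust_spread, standard_spread)
```

With standard scaling the four ordinary readings get squashed into a sliver, whereas robust scaling keeps them clearly apart.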

### # Get Columns to Drop:

We will drop columns with zero standard deviation, as a constant feature cannot help tell one wafer from another.

In [None]:
def get_cols_with_zero_std_dev(df: pd.DataFrame):
    """
    Returns a list of the names of columns that have zero standard deviation.
    """
    cols_to_drop = []
    num_cols = [col for col in df.columns if df[col].dtype != 'O']  # numerical cols only
    for col in num_cols:
        if df[col].std() == 0:
            cols_to_drop.append(col)
    return cols_to_drop

def get_redundant_cols(df: pd.DataFrame, missing_thresh=.7):
    """
    Returns a list of columns having missing values more than certain thresh.
    """
    cols_missing_ratios = df.isna().sum().div(df.shape[0])
    cols_to_drop = list(cols_missing_ratios[cols_missing_ratios > missing_thresh].index)
    return cols_to_drop        

In [None]:
## Columns w missing vals more than 70%
cols_to_drop_1 = get_redundant_cols(wafers, missing_thresh=.7)
cols_to_drop_1

In [None]:
## Columns w 0 Standard Deviation
cols_to_drop_2 = get_cols_with_zero_std_dev(df = wafers)
cols_to_drop_2.append("Wafer")
cols_to_drop_2

In [None]:
## Cols to drop
cols_to_drop = cols_to_drop_1 + cols_to_drop_2

**=>** Features that are not gonna contribute to the ML algorithm in any way whatsoever.

## # Separate Features and Labels out:

In [None]:
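## Separate features and labels out
## Note: "Good/Bad" is not in `cols_to_drop`, so it stays in `X` here;
## the last column is sliced off later, just before resampling.
X, y = wafers.drop(cols_to_drop, axis = 1), wafers[["Good/Bad"]]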

In [None]:
## Independent feature variables
print("Shape of the features now: ", X.shape)
X.head()

**=>** Now, we have 475 contributing features.

In [None]:
## Dependent/Target variable
print("Shape of the labels: ", y.shape)
y.head()

## # Data Transformation:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
# from sklearn.preprocessing import StandardScaler

imputer = KNNImputer(n_neighbors = 3)
preprocessing_pipeline = Pipeline(
    steps = [('Imputer', imputer), ('Scaler', RobustScaler())])
preprocessing_pipeline

In [None]:
## Transform "Wafers" features
X_trans = preprocessing_pipeline.fit_transform(X)
print("Shape of transformed features set: ", X_trans.shape)
X_trans

## # Shall we cluster "Wafers" instances?

Let's see whether clustering the data instances does us any good.

In [None]:
%pip install kneed

In [None]:
from sklearn.cluster import KMeans
from kneed import KneeLocator
from typing import Tuple
from dataclasses import dataclass

@dataclass
class ClusterDataInstances:
    """Divides the given data instances into different clusters via the KMeans Clustering algorithm.
    Args:
        X (np.ndarray): Array of data instances to be clustered.
        desc (str): Description of the said array.
    """
    X: np.ndarray
    desc: str

    def _get_ideal_number_of_clusters(self):
        """Returns the ideal number of clusters the given data instances should be divided into by
        locating the elbow point in the number of clusters vs WCSS plot.

        Raises:
            e: Re-raises any exception that pops up while determining the ideal
            number of clusters.

        Returns:
            int: Ideal number of clusters the given data instances should be divided into.
        """
        try:
            print(
                f'Getting the ideal number of clusters to cluster "{self.desc} set" into..')
            ####################### Compute WCSS for shortlisted number of clusters ##########################
            print("computing WCSS for shortlisted number of clusters..")
            wcss = []  # Within-Cluster Sum of Squares
            for i in range(1, 11):
                kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
                kmeans.fit(self.X)
                wcss.append(kmeans.inertia_)
                print(f"WCSS for n_clusters = {i}: {kmeans.inertia_}")
            print("WCSS computed successfully for all shortlisted number of clusters!")
            ################### Finalize elbow point as the ideal number of clusters #####################
            print("Finding the ideal number of clusters (by locating the elbow point) via Elbow method..")
            knee_finder = KneeLocator(
                range(1, 11), wcss, curve = 'convex', direction = 'decreasing')  # range(1, 11) vs WCSS
            print(f"Ideal number of clusters to be formed: {knee_finder.knee}")
            return knee_finder.knee
        except Exception as e:
            print(e)
            raise e

    def create_clusters(self) -> Tuple:
        """Divides the given data instances into the ideal number of clusters (as located by the
        Elbow method) via the KMeans Clustering algorithm.
        Raises:
            e: Re-raises any exception that pops up while dividing the given data instances into
            clusters.
        Returns:
            (KMeans, np.ndarray): The fitted KMeans object and the given dataset with the cluster
            labels appended as the last column, respectively.
        """
        try:
            ideal_clusters = self._get_ideal_number_of_clusters()
            print(f"Dividing the \"{self.desc}\" instances into {ideal_clusters} clusters via KMeans Clustering algorithm..")
            kmeans = KMeans(n_clusters=ideal_clusters, init = 'k-means++', random_state = 42)
            y_kmeans = kmeans.fit_predict(self.X)
            print(f"..said data instances divided into {ideal_clusters} clusters successfully!")
            return kmeans, np.c_[self.X, y_kmeans]
        except Exception as e:
            print(e)
            raise e

In [None]:
## Cluster `Wafer` instances
cluster_wafers = ClusterDataInstances(X = X_trans, desc = "wafers features")
clusterer, X_clus = cluster_wafers.create_clusters()
X_clus

In [None]:
## Clusters
np.unique(X_clus[ :, -1])

**=>** So the dataset was divided into 3 optimal clusters.

Let's have a look at their shapes..

In [None]:
## Configure "Clustered" array along with target features
wafers_clus = np.c_[X_clus, y]
## Cluster_1 data
wafers_1 = wafers_clus[wafers_clus[ :, -2] == 0]
wafers_1.shape

**=>** Perhaps we were wrong about dividing the `Wafers` dataset into clusters, as pretty much all the datapoints lie in the first cluster itself.

Let's take a look at the other clusters anyway..

In [None]:
## Cluster_2 data
wafers_clus[wafers_clus[ :, -2] == 1].shape

**=>** Man, seriously?!

In [None]:
## Cluster_3 data
wafers_clus[wafers_clus[ :, -2] == 2].shape

**=>** Thus we mustn't divide the dataset into clusters. Not a good idea!
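
The per-cluster shape checks above can also be done in a single call. Here's a sketch on synthetic data (a big blob plus a handful of far-away points, both invented) that reproduces the same lopsided picture: `np.unique(..., return_counts=True)` gives every cluster's size at once.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# 295 ordinary points plus 5 far-away ones (toy data, not the wafers set)
bulk = rng.normal(loc=0.0, scale=1.0, size=(295, 4))
far = rng.normal(loc=50.0, scale=1.0, size=(5, 4))
X_toy = np.vstack([bulk, far])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_toy)

# Size and share of every cluster in one go
cluster_ids, sizes = np.unique(labels, return_counts=True)
shares = sizes / sizes.sum()
for cid, size, share in zip(cluster_ids, sizes, shares):
    print(f"cluster {cid}: {size} instances ({share:.1%})")
```

When one cluster swallows nearly everything, as here, training a separate model per cluster has little to offer.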

## # Resampling of Training Instances:

Resampling is imperative in this case as the target variable is highly imbalanced.

In [None]:
# %pip install imbalanced-learn

In [None]:
from imblearn.combine import SMOTETomek

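## Drop the last column of `X_trans` (which should be the scaled "Good/Bad" target that was
## still part of `X`) and flatten the labels to 1-D, as the resampler expects
X, y = X_trans[ :, :-1], np.ravel(y)
resampler = SMOTETomek(sampling_strategy = "auto", random_state = 42)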
X_res, y_res = resampler.fit_resample(X, y)

In [None]:
print("Before resampling, Shape of training instances: ", np.c_[X, y].shape)
print("After resampling, Shape of training instances: ", np.c_[X_res, y_res].shape)

In [None]:
## Target categories after resampling
labels, counts = np.unique(np.ravel(y_res), return_counts = True)
print(labels)
print("Value Counts:", dict(zip(labels.tolist(), counts.tolist())))

**=>** Exactly what we wanted!

### # Prepare the Test set:

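Do to the test set exactly what's been done to the training set. Note that the cell below instead splits the resampled training data, so the resulting test set contains synthetic SMOTE samples; the untouched `wafers_test` holdout from the earlier split remains the stricter benchmark.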
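
As a sketch of what "the same treatment" means for preprocessing (toy data, hypothetical variable names): the pipeline is fitted on the training rows only, and the held-out rows are passed through `transform`, never `fit_transform`, so they are imputed and scaled with statistics learned from training data.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)

# Toy features with about 5% missing cells, plus random labels (illustrative only)
X_toy = rng.normal(size=(100, 3))
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan
y_toy = rng.choice([-1, 1], size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[("Imputer", KNNImputer(n_neighbors=3)),
                       ("Scaler", RobustScaler())])

X_tr_t = pipe.fit_transform(X_tr)  # statistics learned from training rows only
X_te_t = pipe.transform(X_te)      # same statistics reused, nothing refitted

print(X_tr_t.shape, X_te_t.shape)
```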

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size = 1/3, random_state = 42)

print(f"train set: {X_train.shape, y_train.shape}")
print(f"test set: {X_test.shape, y_test.shape}")

In [None]:
# # fetch only features that were used in training
# X_test, y_test = wafers_test[preprocessing_pipeline.feature_names_in_], wafers_test.iloc[:, -1]

# ## Transform the Test features
# X_test_trans = preprocessing_pipeline.transform(X_test)
# print(X_test_trans.shape, y_test.shape)

# ## Cluster the test features
# y_test_kmeans = clusterer.predict(X_test_trans)

# ## Configure the test array
# test_arr = np.c_[X_test_trans, y_test, y_test_kmeans]
# np.unique(y_test_kmeans)

In [None]:
# # Prepare the test features and test labels for cluster one

# X_test_prep, y_test_prep = test_arr[test_arr[:, -2] == ], test_arr[:, -1]
# print(X_test_prep.shape)

## # Model Selection and Training:

In [None]:
# pip install xgboost==0.90

In [None]:
## Prepared training sets
# X_prep = wafers_1[:, :-2]
# y_prep = wafers_1[:, -1]
# print(X_prep.shape, y_prep.shape)

In [None]:
# Prepared training and test sets
X_prep = X_train
y_prep = y_train
X_test_prep = X_test
y_test_prep = y_test

print(X_prep.shape, y_prep.shape)
print(X_test_prep.shape, y_test_prep.shape)

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Shortlisted base Models
svc_clf = SVC(kernel = 'linear')
svc_rbf_clf = SVC(kernel = 'rbf')
random_clf = RandomForestClassifier(random_state = 42)
xgb_clf = XGBClassifier(objective = 'binary:logistic')

In [None]:
## A function to display Scores
def display_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())

### # Evaluating `SVC (kernel='linear')` using cross-validation:

In [None]:
## SVC Scores
svc_scores = cross_val_score(svc_clf, X_prep, y_prep, scoring='roc_auc', cv=10, verbose=2)

In [None]:
display_scores(svc_scores)

In [None]:
## Performance on test set using cross-validation

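# Decision scores using cross-validation (AUC ranks samples, so continuous scores
# are more informative than hard labels)
svc_preds = cross_val_predict(svc_clf, X_test_prep, y_test_prep, cv = 5, method = 'decision_function')

# AUC score
svc_auc = roc_auc_score(y_test_prep, svc_preds)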
svc_auc
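
A note on feeding `roc_auc_score`: it measures how well positives are ranked above negatives, so it is most informative with continuous scores (e.g. from `decision_function`). Given hard labels, it can only see two distinct values and generally comes out different. A tiny sketch with invented scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and classifier scores, purely for illustration
y_true = np.array([-1, -1, -1, 1, 1, 1])
scores = np.array([-2.0, -0.5, 0.3, -0.1, 0.8, 1.5])

# Thresholding at 0 throws the ranking information away
hard_labels = np.where(scores > 0, 1, -1)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("AUC from scores     :", round(auc_from_scores, 4))
print("AUC from hard labels:", round(auc_from_labels, 4))
```

This is also why `cross_val_score(..., scoring='roc_auc')` and an AUC computed from `predict` output need not agree for the same model.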

### # Evaluating `SVC (kernel='rbf')` using cross-validation:

In [None]:
## SVC rbf Scores
svc_rbf_scores = cross_val_score(svc_rbf_clf, X_prep, y_prep, scoring = 'roc_auc', cv = 10, verbose = 2)

In [None]:
display_scores(svc_rbf_scores)

In [None]:
## Performance on test set using cross-validation

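# Decision scores using cross-validation (AUC ranks samples, so continuous scores
# are more informative than hard labels)
svc_rbf_preds = cross_val_predict(svc_rbf_clf, X_test_prep, y_test_prep, cv = 5, method = 'decision_function')

# AUC score
svc_rbf_auc = roc_auc_score(y_test_prep, svc_rbf_preds)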
svc_rbf_auc

### # Evaluating `RandomForestClassifier` using cross-validation:

In [None]:
## Random Forest Scores
random_clf_scores = cross_val_score(random_clf, X_prep, y_prep, scoring = 'roc_auc', cv = 10, verbose = 2)

In [None]:
display_scores(random_clf_scores)

In [None]:
## Performance on test set using cross-validation

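# Cross-validated probability of the positive class (label 1); AUC ranks samples,
# so probabilities are more informative than hard labels
random_clf_preds = cross_val_predict(random_clf, X_test_prep, y_test_prep, cv = 5, method = 'predict_proba')[ :, 1]

# AUC score
random_clf_auc = roc_auc_score(y_test_prep, random_clf_preds)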
random_clf_auc