# Wafer Fault Prediction

**Brief:** In electronics, a **wafer** (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate, that is, the foundation on which other components are constructed, for microelectronic devices built in and upon it.

It undergoes many microfabrication processes, such as doping, ion implantation, etching, thin-film deposition of various materials, and photolithographic patterning. Finally, the individual microcircuits are separated by wafer dicing and packaged as an integrated circuit.

## Problem Statement

**Data:** Wafers data


**Problem Statement:** Wafers are predominantly used to manufacture solar cells. They are deployed in bulk at remote locations, and each wafer carries a few hundred sensors. Wafers are fundamental to photovoltaic power generation, which converts sunlight directly into electrical energy, and producing them requires advanced technology.

The goal of detecting faulty wafers automatically is to remove the need for manual inspection. Today, when a wafer is suspected to be faulty, it has to be opened up from scratch to deal with the issue, and all the wafers in its vicinity have to be stopped, which disrupts the whole process. That cost is incurred even when the wafer really is faulty. When the suspicion turns out to be a false alarm, the time, manpower, and money spent are simply wasted.

**Solution:** Data collected from the wafers is passed through a machine learning pipeline that determines whether the wafer at hand is faulty, removing the need for, and the cost of, manual inspection.

## Import Required Libraries

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [2]:
df=pd.read_csv('wafer_23012020_041211.csv')

# Overview

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2968.33,2476.58,2216.7333,1748.0885,1.1127,100.0,97.5822,0.1242,1.53,...,,0.5004,0.012,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2961.04,2506.43,2170.0666,1364.5157,1.5447,100.0,96.77,0.123,1.3953,...,,0.4994,0.0115,0.0031,2.302,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,3072.03,2500.68,2205.7445,1363.1048,1.0518,100.0,101.8644,0.122,1.3896,...,,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,3021.83,2419.83,2205.7445,1363.1048,1.0518,100.0,101.8644,0.122,1.4108,...,,0.4934,0.0123,0.004,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,3006.95,2435.34,2189.8111,1084.6502,1.1993,100.0,104.8856,0.1234,1.5094,...,,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1


In [4]:
df.describe()

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,Sensor-10,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
count,99.0,100.0,97.0,97.0,97.0,97.0,97.0,97.0,100.0,100.0,...,34.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,3017.301212,2487.1803,2202.168281,1484.362181,1.180367,100.0,97.449088,0.122195,1.461516,0.000243,...,74.331709,0.49939,0.013615,0.003549,2.727297,0.02351,0.014875,0.004685,77.430241,-0.88
std,71.819707,66.954212,30.350606,460.985871,0.349654,0.0,5.553324,0.002006,0.0713,0.01061,...,41.857728,0.003431,0.004344,0.000873,0.875848,0.011991,0.007557,0.002527,55.106166,0.477367
min,2825.67,2254.99,2114.6667,978.7832,0.7531,100.0,83.4233,0.116,1.3179,-0.0279,...,20.3091,0.4925,0.0076,0.0021,1.5152,0.0099,0.0048,0.0017,20.3091,-1.0
25%,2973.04,2446.595,2189.9667,1111.5436,0.8373,100.0,95.1089,0.1208,1.407375,-0.006925,...,47.356,0.4973,0.0113,0.003075,2.270425,0.0134,0.009475,0.0027,33.7876,-1.0
50%,3004.39,2493.89,2200.9889,1244.2899,1.1569,100.0,99.5133,0.1222,1.4537,0.001,...,65.12755,0.4994,0.01275,0.0034,2.5464,0.0218,0.0139,0.00385,62.0595,-1.0
75%,3070.385,2527.525,2213.2111,1963.8016,1.383,100.0,101.4578,0.1234,1.507425,0.008125,...,99.41905,0.501525,0.0147,0.003825,2.95375,0.028025,0.0192,0.0059,104.3034,-1.0
max,3221.21,2664.52,2315.2667,2363.6412,2.2073,100.0,107.1522,0.1262,1.6411,0.025,...,223.1018,0.5087,0.0437,0.0089,8.816,0.0545,0.0401,0.015,223.1018,1.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 592 entries, Unnamed: 0 to Good/Bad
dtypes: float64(494), int64(97), object(1)
memory usage: 462.6+ KB


In [6]:
df.isnull().sum().sum()

2306

In [7]:
df=df.dropna(axis=1)
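
`dropna(axis=1)` removes every column that contains at least one missing value, which is a strict criterion. As a point of comparison only, here is a minimal sketch on a made-up toy frame that measures the share of missing values per column and drops only the columns above a cutoff. The column names, the values, and the 50% cutoff (`max_missing`) are all assumptions for illustration, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wafer data (all values are made up).
toy = pd.DataFrame({
    "Sensor-A": [1.0, 2.0, np.nan, 4.0],     # 25% missing
    "Sensor-B": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "Sensor-C": [0.5, 0.6, 0.7, 0.8],        # complete
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Hypothetical cutoff: keep columns with at most 50% missing values.
max_missing = 0.5
kept = toy.loc[:, missing_share <= max_missing]

# For comparison: dropna(axis=1) keeps only fully populated columns.
strict = toy.dropna(axis=1)

print(missing_share)
print("threshold-based:", list(kept.columns))
print("dropna(axis=1): ", list(strict.columns))
```

With a cutoff like this, the remaining gaps would still need to be filled by an imputer before modelling.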

In [8]:
df.duplicated().sum()

0

In [9]:
df

Unnamed: 0.1,Unnamed: 0,Sensor-2,Sensor-9,Sensor-10,Sensor-11,Sensor-12,Sensor-13,Sensor-14,Sensor-15,Sensor-16,...,Sensor-578,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2476.58,1.5300,-0.0279,-0.0040,0.9468,198.1219,0,6.0959,416.5950,...,17.6552,0.5004,0.0120,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2506.43,1.3953,0.0084,0.0062,0.9461,204.6134,0,5.1756,406.3290,...,11.8075,0.4994,0.0115,0.0031,2.3020,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,2500.68,1.3896,0.0138,0.0000,0.9656,199.5093,0,4.8205,414.1385,...,17.6552,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,2419.83,1.4108,-0.0046,-0.0024,0.9589,199.6262,0,13.3691,411.8383,...,17.6552,0.4934,0.0123,0.0040,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,2435.34,1.5094,-0.0046,0.0121,0.9674,202.6499,0,3.4480,397.9388,...,15.1082,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Wafer-896,2526.44,1.5337,0.0090,0.0058,0.9685,196.6199,0,8.6805,408.5064,...,17.0523,0.5013,0.0076,0.0021,1.5152,0.0153,0.0048,0.0017,31.0176,-1
96,Wafer-897,2477.01,1.4695,0.0071,0.0215,0.9782,198.1811,0,6.8175,405.6568,...,13.2830,0.5003,0.0106,0.0028,2.1263,0.0153,0.0048,0.0017,31.0176,1
97,Wafer-898,2387.42,1.3603,-0.0031,0.0086,0.9575,197.5712,0,9.9421,409.1935,...,23.1735,0.5016,0.0130,0.0028,2.5865,0.0153,0.0048,0.0017,31.0176,-1
98,Wafer-899,2541.89,1.4493,-0.0194,-0.0018,0.9673,202.8336,0,8.6085,414.2447,...,14.6551,0.5023,0.0140,0.0033,2.7810,0.0153,0.0048,0.0017,31.0176,-1


In [10]:
df.rename(columns={"Unnamed: 0":"Wafer"},inplace=True)

In [11]:
df

Unnamed: 0,Wafer,Sensor-2,Sensor-9,Sensor-10,Sensor-11,Sensor-12,Sensor-13,Sensor-14,Sensor-15,Sensor-16,...,Sensor-578,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2476.58,1.5300,-0.0279,-0.0040,0.9468,198.1219,0,6.0959,416.5950,...,17.6552,0.5004,0.0120,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2506.43,1.3953,0.0084,0.0062,0.9461,204.6134,0,5.1756,406.3290,...,11.8075,0.4994,0.0115,0.0031,2.3020,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,2500.68,1.3896,0.0138,0.0000,0.9656,199.5093,0,4.8205,414.1385,...,17.6552,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,2419.83,1.4108,-0.0046,-0.0024,0.9589,199.6262,0,13.3691,411.8383,...,17.6552,0.4934,0.0123,0.0040,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,2435.34,1.5094,-0.0046,0.0121,0.9674,202.6499,0,3.4480,397.9388,...,15.1082,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Wafer-896,2526.44,1.5337,0.0090,0.0058,0.9685,196.6199,0,8.6805,408.5064,...,17.0523,0.5013,0.0076,0.0021,1.5152,0.0153,0.0048,0.0017,31.0176,-1
96,Wafer-897,2477.01,1.4695,0.0071,0.0215,0.9782,198.1811,0,6.8175,405.6568,...,13.2830,0.5003,0.0106,0.0028,2.1263,0.0153,0.0048,0.0017,31.0176,1
97,Wafer-898,2387.42,1.3603,-0.0031,0.0086,0.9575,197.5712,0,9.9421,409.1935,...,23.1735,0.5016,0.0130,0.0028,2.5865,0.0153,0.0048,0.0017,31.0176,-1
98,Wafer-899,2541.89,1.4493,-0.0194,-0.0018,0.9673,202.8336,0,8.6085,414.2447,...,14.6551,0.5023,0.0140,0.0033,2.7810,0.0153,0.0048,0.0017,31.0176,-1


In [12]:
df.describe()

Unnamed: 0,Sensor-2,Sensor-9,Sensor-10,Sensor-11,Sensor-12,Sensor-13,Sensor-14,Sensor-15,Sensor-16,Sensor-17,...,Sensor-578,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,2487.1803,1.461516,0.000243,0.001867,0.964654,199.754021,0.0,8.197215,410.867133,9.751596,...,17.361958,0.49939,0.013615,0.003549,2.727297,0.02351,0.014875,0.004685,77.430241,-0.88
std,66.954212,0.0713,0.01061,0.007686,0.00983,2.549577,0.0,2.040945,8.273853,0.462981,...,14.676964,0.003431,0.004344,0.000873,0.875848,0.011991,0.007557,0.002527,55.106166,0.477367
min,2254.99,1.3179,-0.0279,-0.0146,0.9278,192.5209,0.0,3.448,395.3408,8.9012,...,7.5578,0.4925,0.0076,0.0021,1.5152,0.0099,0.0048,0.0017,20.3091,-1.0
25%,2446.595,1.407375,-0.006925,-0.003425,0.958,198.026475,0.0,6.68425,404.6913,9.45765,...,11.666725,0.4973,0.0113,0.003075,2.270425,0.0134,0.009475,0.0027,33.7876,-1.0
50%,2493.89,1.4537,0.001,0.00255,0.9669,199.4029,0.0,8.16355,409.6891,9.73055,...,14.9266,0.4994,0.01275,0.0034,2.5464,0.0218,0.0139,0.00385,62.0595,-1.0
75%,2527.525,1.507425,0.008125,0.0067,0.970475,202.42495,0.0,9.374575,416.843825,10.074325,...,17.6552,0.501525,0.0147,0.003825,2.95375,0.028025,0.0192,0.0059,104.3034,-1.0
max,2664.52,1.6411,0.025,0.0215,0.9795,204.6134,0.0,13.3691,432.8914,10.8515,...,96.9601,0.5087,0.0437,0.0089,8.816,0.0545,0.0401,0.015,223.1018,1.0


In [13]:
df['Good/Bad'].value_counts()

Good/Bad
-1    94
 1     6
Name: count, dtype: int64

From this we can clearly see that the data is highly imbalanced: 94 of the 100 wafers are labeled -1 and only 6 are labeled 1.
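
To put a number on the imbalance, the class shares and the majority-to-minority ratio can be computed directly. The sketch below rebuilds the label column from the counts shown above (94 and 6) instead of loading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Labels rebuilt from the value_counts() output above: 94 x -1, 6 x 1.
labels = pd.Series([-1] * 94 + [1] * 6, name="Good/Bad")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)  # relative frequencies

# Ratio of the largest class to the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```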

In [14]:
df['Sensor-10'].std()

0.01061012719009056

In [15]:
plt.figure(figsize=(15,60))
for i,col in enumerate(df.columns[2:52]):
    plt.subplot(17,3,i+1)  # 50 plots fit in a 17x3 grid
    # histplot is axes-level, so it draws into the current subplot
    # (displot is figure-level and would open a new figure each time)
    sns.histplot(x=df[col],color="indianred")
    plt.xlabel(col,weight="bold")
plt.tight_layout()

In [16]:
def get_col_with_zero_dev(df:pd.DataFrame):
    col_to_drop=[]
    num_col=[i for i in df.columns if df[i].dtype !='O']
    for col in num_col:
        if df[col].std()==0:
            col_to_drop.append(col)
    return col_to_drop


In [17]:
colum_to_drop_1=get_col_with_zero_dev(df)

In [18]:
colum_to_drop_1

['Sensor-14',
 'Sensor-43',
 'Sensor-50',
 'Sensor-53',
 'Sensor-98',
 'Sensor-150',
 'Sensor-180',
 'Sensor-187',
 'Sensor-190',
 'Sensor-227',
 'Sensor-230',
 'Sensor-231',
 'Sensor-232',
 'Sensor-233',
 'Sensor-234',
 'Sensor-235',
 'Sensor-236',
 'Sensor-237',
 'Sensor-238',
 'Sensor-241',
 'Sensor-242',
 'Sensor-243',
 'Sensor-244',
 'Sensor-257',
 'Sensor-258',
 'Sensor-259',
 'Sensor-260',
 'Sensor-261',
 'Sensor-262',
 'Sensor-263',
 'Sensor-264',
 'Sensor-265',
 'Sensor-266',
 'Sensor-267',
 'Sensor-285',
 'Sensor-316',
 'Sensor-323',
 'Sensor-326',
 'Sensor-365',
 'Sensor-370',
 'Sensor-371',
 'Sensor-372',
 'Sensor-373',
 'Sensor-374',
 'Sensor-375',
 'Sensor-376',
 'Sensor-379',
 'Sensor-380',
 'Sensor-381',
 'Sensor-382',
 'Sensor-395',
 'Sensor-396',
 'Sensor-397',
 'Sensor-398',
 'Sensor-399',
 'Sensor-400',
 'Sensor-401',
 'Sensor-402',
 'Sensor-403',
 'Sensor-404',
 'Sensor-405',
 'Sensor-423',
 'Sensor-452',
 'Sensor-459',
 'Sensor-462',
 'Sensor-499',
 'Sensor-502',


These columns have zero standard deviation, so they carry no information and cannot contribute to any ML algorithm.
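
The same idea is available in scikit-learn as `VarianceThreshold`. The following is a minimal sketch on a made-up toy frame, shown as an alternative to the hand-written helper rather than as what this notebook uses; the column names and values are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame with one informative and two constant columns (values made up).
toy = pd.DataFrame({
    "Sensor-A": [1.0, 2.0, 3.0, 4.0],
    "Sensor-B": [0.0, 0.0, 0.0, 0.0],  # constant
    "Sensor-C": [5.0, 5.0, 5.0, 5.0],  # constant
})

# threshold=0.0 removes features whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

mask = selector.get_support()  # True for columns that are kept
kept_cols = toy.columns[mask].tolist()
dropped_cols = toy.columns[~mask].tolist()

print("kept:   ", kept_cols)
print("dropped:", dropped_cols)
```

One practical difference: a selector like this can be placed inside a `Pipeline`, so the same columns are removed automatically from any new data passed through it.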

In [19]:
df.drop(columns=(colum_to_drop_1),inplace=True)

In [20]:
df.shape

(100, 335)

In [21]:
df

Unnamed: 0,Wafer,Sensor-2,Sensor-9,Sensor-10,Sensor-11,Sensor-12,Sensor-13,Sensor-15,Sensor-16,Sensor-17,...,Sensor-578,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2476.58,1.5300,-0.0279,-0.0040,0.9468,198.1219,6.0959,416.5950,9.5431,...,17.6552,0.5004,0.0120,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2506.43,1.3953,0.0084,0.0062,0.9461,204.6134,5.1756,406.3290,10.7168,...,11.8075,0.4994,0.0115,0.0031,2.3020,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,2500.68,1.3896,0.0138,0.0000,0.9656,199.5093,4.8205,414.1385,10.0666,...,17.6552,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,2419.83,1.4108,-0.0046,-0.0024,0.9589,199.6262,13.3691,411.8383,10.6553,...,17.6552,0.4934,0.0123,0.0040,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,2435.34,1.5094,-0.0046,0.0121,0.9674,202.6499,3.4480,397.9388,9.4594,...,15.1082,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Wafer-896,2526.44,1.5337,0.0090,0.0058,0.9685,196.6199,8.6805,408.5064,9.5013,...,17.0523,0.5013,0.0076,0.0021,1.5152,0.0153,0.0048,0.0017,31.0176,-1
96,Wafer-897,2477.01,1.4695,0.0071,0.0215,0.9782,198.1811,6.8175,405.6568,8.9468,...,13.2830,0.5003,0.0106,0.0028,2.1263,0.0153,0.0048,0.0017,31.0176,1
97,Wafer-898,2387.42,1.3603,-0.0031,0.0086,0.9575,197.5712,9.9421,409.1935,10.0975,...,23.1735,0.5016,0.0130,0.0028,2.5865,0.0153,0.0048,0.0017,31.0176,-1
98,Wafer-899,2541.89,1.4493,-0.0194,-0.0018,0.9673,202.8336,8.6085,414.2447,9.4609,...,14.6551,0.5023,0.0140,0.0033,2.7810,0.0153,0.0048,0.0017,31.0176,-1


In [22]:
X=df.drop("Good/Bad",axis=1)
y=df['Good/Bad']

In [23]:
X.shape

(100, 334)

In [24]:
y.shape

(100,)

In [25]:
X.drop("Wafer",inplace=True,axis=1)

# Data Transformation

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer # fills missing values from the nearest neighbours
# from sklearn.impute import SimpleImputer  # alternative imputer
from sklearn.preprocessing import RobustScaler # scales with median and IQR, robust to outliers
# from sklearn.preprocessing import StandardScaler  # alternative scaler

imputer=KNNImputer(n_neighbors=3)
preprocessing_pipeline=Pipeline(
    steps=[("Imputer",imputer),('Scaler',RobustScaler())]
)
preprocessing_pipeline
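
To see what the two steps do, here is a minimal sketch that runs the same kind of pipeline on a tiny made-up array with one missing value and one outlier. The data and `n_neighbors=2` are assumptions chosen to keep the toy example small; the notebook itself uses `n_neighbors=3`.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Toy data: one NaN in the second column, one outlier in the first.
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],
])

pipe = Pipeline(
    steps=[
        ("Imputer", KNNImputer(n_neighbors=2)),  # fill NaN from the 2 nearest rows
        ("Scaler", RobustScaler()),              # (x - median) / IQR per column
    ]
)
out = pipe.fit_transform(toy)

print(out)
```

After the transform, each column has median 0, and the outlier in the first column stays large instead of compressing the other values, which is the reason to prefer a median/IQR scaler on noisy sensor data.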


In [27]:
X_trans=preprocessing_pipeline.fit_transform(X)
X_trans

array([[-0.21388855,  0.76261869, -1.92026578, ...,  0.46272494,
         0.515625  , -0.40093   ],
       [ 0.15494872, -0.58370815,  0.49169435, ...,  0.46272494,
         0.515625  , -0.40093   ],
       [ 0.08389967, -0.64067966,  0.85049834, ...,  0.46272494,
         0.515625  , -0.40093   ],
       ...,
       [-1.31558137, -0.93353323, -0.27242525, ..., -0.93573265,
        -0.671875  , -0.44021198],
       [ 0.59310515, -0.04397801, -1.35548173, ..., -0.93573265,
        -0.671875  , -0.44021198],
       [-0.35561596,  1.27936032,  0.71760797, ..., -0.93573265,
        -0.671875  , -0.44021198]])

# Resampling of Training Instances

In [28]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.




In [31]:
from imblearn.combine import SMOTETomek
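# NOTE: X_trans contains only features (the target was dropped earlier), so
# the [:, :-1] slice also discards the last sensor column (Sensor-590).
X,y=X_trans[:,:-1],y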
resample=SMOTETomek(sampling_strategy='auto')
X_res,y_res=resample.fit_resample(X,y)

In [33]:
y_res.value_counts()

Good/Bad
-1    94
 1    94
Name: count, dtype: int64

## Prepare the Test Set

In [34]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.3,random_state=42)

In [35]:
X_train.shape,y_train.shape

((131, 332), (131,))
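
Because the split above happens after resampling, synthetic minority samples can end up in the test set, which tends to make the evaluation optimistic. One common alternative, sketched here on made-up data and not taken from this notebook, is to split first with `stratify` so both parts keep the original class ratio, and then resample only the training part. The feature matrix and its shape are assumptions; the 94/6 label counts mirror the ones shown earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features; labels mirror the 94 / 6 split seen in the data.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([-1] * 94 + [1] * 6)

# stratify=y_toy keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print("train minority count:", int((y_tr == 1).sum()))
print("test minority count: ", int((y_te == 1).sum()))

# A resampler (for example SMOTETomek.fit_resample) would then be applied
# to (X_tr, y_tr) only, leaving (X_te, y_te) untouched for evaluation.
```

With so few minority samples left in the training part, SMOTE-based resamplers may need a smaller `k_neighbors` than their default to run at all.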