# Wafer Fault Detection

**Brief:** In electronics, a **wafer** (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate (the foundation on which other components are built) for microelectronic devices built in and upon the wafer.

It undergoes many microfabrication processes, such as doping, ion implantation, etching, thin-film deposition of various materials, and photolithographic patterning. Finally, the individual microcircuits are separated by wafer dicing and packaged as an integrated circuit.

## Problem Statement

**Data:** Wafers data


**Problem Statement:** Wafers are predominantly used to manufacture solar cells. They are installed in bulk at remote locations, and each wafer carries a few hundred sensors. Wafers are fundamental to photovoltaic power generation, which converts sunlight directly into electrical energy, and producing them requires advanced technology.

The motivation for identifying faulty wafers automatically is to remove the need for manual inspection. When a wafer is suspected to be faulty, it has to be opened up from scratch and examined, and all wafers in its vicinity have to be stopped, which disrupts the whole process. That cost is incurred even when the wafer really is faulty; when the suspicion turns out to be a false alarm, the time, manpower, and cost are wasted entirely.

**Solution:** The sensor data collected from each wafer is passed through a machine learning pipeline that determines whether the wafer is faulty, reducing the need for, and therefore the cost of, manual inspection.

In [1]:
import pandas as pd
import numpy as np
import warnings as warn
warn.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
wafers_data = pd.read_csv('wafers.csv')

In [3]:
wafers_data

Unnamed: 0.1,Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2968.33,2476.58,2216.7333,1748.0885,1.1127,100.0,97.5822,0.1242,1.5300,...,,0.5004,0.0120,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2961.04,2506.43,2170.0666,1364.5157,1.5447,100.0,96.7700,0.1230,1.3953,...,,0.4994,0.0115,0.0031,2.3020,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,3072.03,2500.68,2205.7445,1363.1048,1.0518,100.0,101.8644,0.1220,1.3896,...,,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,3021.83,2419.83,2205.7445,1363.1048,1.0518,100.0,101.8644,0.1220,1.4108,...,,0.4934,0.0123,0.0040,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,3006.95,2435.34,2189.8111,1084.6502,1.1993,100.0,104.8856,0.1234,1.5094,...,,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Wafer-896,3013.66,2526.44,2185.2111,1141.6306,0.8447,100.0,100.5978,0.1217,1.5337,...,,0.5013,0.0076,0.0021,1.5152,0.0153,0.0048,0.0017,31.0176,-1
96,Wafer-897,2982.87,2477.01,2315.2667,2360.1325,1.1259,100.0,90.1144,0.1160,1.4695,...,,0.5003,0.0106,0.0028,2.1263,0.0153,0.0048,0.0017,31.0176,1
97,Wafer-898,3084.82,2387.42,2171.5000,1028.4440,0.7899,100.0,101.5122,0.1224,1.3603,...,,0.5016,0.0130,0.0028,2.5865,0.0153,0.0048,0.0017,31.0176,-1
98,Wafer-899,2955.87,2541.89,,,,,,,1.4493,...,,0.5023,0.0140,0.0033,2.7810,0.0153,0.0048,0.0017,31.0176,-1


In [4]:
wafers_data.dtypes

Unnamed: 0     object
Sensor-1      float64
Sensor-2      float64
Sensor-3      float64
Sensor-4      float64
               ...   
Sensor-587    float64
Sensor-588    float64
Sensor-589    float64
Sensor-590    float64
Good/Bad        int64
Length: 592, dtype: object

In [5]:
wafers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 592 entries, Unnamed: 0 to Good/Bad
dtypes: float64(494), int64(97), object(1)
memory usage: 462.6+ KB


In [6]:
wafers_data.describe()

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,Sensor-10,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
count,99.0,100.0,97.0,97.0,97.0,97.0,97.0,97.0,100.0,100.0,...,34.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,3017.301212,2487.1803,2202.168281,1484.362181,1.180367,100.0,97.449088,0.122195,1.461516,0.000243,...,74.331709,0.49939,0.013615,0.003549,2.727297,0.02351,0.014875,0.004685,77.430241,-0.88
std,71.819707,66.954212,30.350606,460.985871,0.349654,0.0,5.553324,0.002006,0.0713,0.01061,...,41.857728,0.003431,0.004344,0.000873,0.875848,0.011991,0.007557,0.002527,55.106166,0.477367
min,2825.67,2254.99,2114.6667,978.7832,0.7531,100.0,83.4233,0.116,1.3179,-0.0279,...,20.3091,0.4925,0.0076,0.0021,1.5152,0.0099,0.0048,0.0017,20.3091,-1.0
25%,2973.04,2446.595,2189.9667,1111.5436,0.8373,100.0,95.1089,0.1208,1.407375,-0.006925,...,47.356,0.4973,0.0113,0.003075,2.270425,0.0134,0.009475,0.0027,33.7876,-1.0
50%,3004.39,2493.89,2200.9889,1244.2899,1.1569,100.0,99.5133,0.1222,1.4537,0.001,...,65.12755,0.4994,0.01275,0.0034,2.5464,0.0218,0.0139,0.00385,62.0595,-1.0
75%,3070.385,2527.525,2213.2111,1963.8016,1.383,100.0,101.4578,0.1234,1.507425,0.008125,...,99.41905,0.501525,0.0147,0.003825,2.95375,0.028025,0.0192,0.0059,104.3034,-1.0
max,3221.21,2664.52,2315.2667,2363.6412,2.2073,100.0,107.1522,0.1262,1.6411,0.025,...,223.1018,0.5087,0.0437,0.0089,8.816,0.0545,0.0401,0.015,223.1018,1.0


In [7]:
wafers_data.shape

(100, 592)


In [14]:
wafers_data['Good/Bad'].value_counts()

Good/Bad
-1    94
 1     6
Name: count, dtype: int64

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train_data, test_data = train_test_split(wafers_data,test_size=0.20, random_state=42)

In [17]:
train_data

Unnamed: 0.1,Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
55,Wafer-856,,2532.45,2191.1333,2197.6570,1.1569,100.0,89.7222,0.1251,1.5762,...,,0.4936,0.0113,0.0033,2.2874,0.0133,0.0139,0.0038,104.3034,-1
88,Wafer-889,3221.21,2391.20,2189.9667,1046.6212,0.8662,100.0,102.3622,0.1208,1.4756,...,,0.4940,0.0123,0.0033,2.4860,0.0280,0.0078,0.0022,27.7601,-1
26,Wafer-827,2951.85,2525.00,2189.5777,1320.3197,1.3459,100.0,100.7744,0.1234,1.5590,...,53.8577,0.5025,0.0178,0.0045,3.5361,0.0286,0.0154,0.0056,53.8577,-1
42,Wafer-843,2982.07,2447.06,2199.6334,1242.8420,1.4083,100.0,99.2178,0.1221,1.4542,...,,0.4993,0.0151,0.0038,3.0214,0.0117,0.0262,0.0089,223.1018,-1
69,Wafer-870,3058.08,2524.60,2192.3778,1110.5453,0.8147,100.0,99.2922,0.1226,1.4958,...,24.6547,0.4974,0.0171,0.0040,3.4352,0.0218,0.0054,0.0020,24.6547,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,Wafer-861,3071.05,2642.15,2200.9889,1054.5240,1.3830,100.0,100.1800,0.1201,1.4532,...,,0.4973,0.0139,0.0039,2.7851,0.0122,0.0131,0.0039,107.5257,-1
71,Wafer-872,3043.18,2545.53,2192.3778,1110.5453,0.8147,100.0,99.2922,0.1226,1.3824,...,,0.4989,0.0131,0.0036,2.6253,0.0218,0.0054,0.0020,24.6547,-1
14,Wafer-815,3001.26,2519.92,2224.6778,1308.6479,1.3907,100.0,101.1333,0.1208,1.5172,...,48.4818,0.4959,0.0142,0.0037,2.8609,0.0278,0.0135,0.0042,48.4818,-1
92,Wafer-893,3007.00,2572.62,2213.2111,2070.7147,1.9705,100.0,87.7411,0.1232,1.4446,...,,0.4987,0.0172,0.0041,3.4417,0.0195,0.0149,0.0047,76.0035,-1


In [18]:
train_data['Good/Bad'].value_counts()

Good/Bad
-1    74
 1     6
Name: count, dtype: int64
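With only 6 positive labels among 100 wafers, the counts above show that all 6 ended up in the training split, which leaves the 20-row test split with a single class. One possible remedy, sketched here on synthetic labels rather than the wafer data, is the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in both parts. The frame `demo` and its column `feature` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the wafer data: 94 of -1 and 6 of 1.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Good/Bad": [-1] * 94 + [1] * 6,
})

# stratify keeps the -1/1 ratio approximately equal in the train and test parts.
demo_train, demo_test = train_test_split(
    demo, test_size=0.20, random_state=42, stratify=demo["Good/Bad"]
)

print(demo_train["Good/Bad"].value_counts())
print(demo_test["Good/Bad"].value_counts())
```

With so few positives, even a stratified test split holds very few of them, so any score computed on it should be read with caution.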

In [19]:
train_data['Good/Bad'].isna().sum()

0

In [20]:
train_data.isna().sum().sum()

1822

In [21]:
train_data.shape

(80, 592)

In [22]:
(1822/(80*592))*100

3.847128378378378
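The percentage above was typed in by hand from the two previous outputs. A small sketch of computing the same ratio directly from a frame, shown on a toy frame (`toy` is an assumption for illustration, not the wafer data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Total missing cells divided by total cells (rows * columns), as a percentage.
missing_pct = toy.isna().sum().sum() / toy.size * 100
print(missing_pct)  # 3 missing cells out of 12 -> 25.0

# Share of missing values per column, useful for spotting mostly-empty sensors.
print(toy.isna().mean())
```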

In [23]:
# plt.figure(figsize=(15,100))
# for i, col in enumerate(train_data.columns[1:51]):
#     plt.subplot(20,3,i+1)
#     sns.distplot(x=train_data[col], color='indianred')
#     plt.xlabel(col,weight='bold')
#     plt.tight_layout()


In [24]:
# random_set = []
# for i in range(50):
#     random = np.random.randint(1,592)
#     if random not in random_set:
#         random_set.append(random)
# print(random_set)

In [25]:
# plt.figure(figsize=(15,100))
# for i,col in enumerate(train_data.columns[random_set]):
#     plt.subplot(20,3,i+1)
#     sns.distplot(x=train_data[col],color='indianred')
#     plt.xlabel(col)
#     plt.tight_layout()

In [26]:
def cols_to_be_dropped_wrt_std_zero(df):
    """Return columns (skipping the first, the wafer ID) whose standard deviation is zero."""
    drop_cols_wrt_std_zero = []
    for col in df.columns[1:]:
        if df[col].std() == 0:
            drop_cols_wrt_std_zero.append(col)
    return drop_cols_wrt_std_zero


def cols_to_be_dropped_wrt_null_vals(df):
    """Return columns (skipping the first, the wafer ID) with more than 70% missing values."""
    drop_cols_wrt_null_vals = []
    for col in df.columns[1:]:
        if (df[col].isna().sum() / df.shape[0]) > 0.7:
            drop_cols_wrt_null_vals.append(col)
    return drop_cols_wrt_null_vals

In [27]:
drop_cols = cols_to_be_dropped_wrt_std_zero(train_data) + cols_to_be_dropped_wrt_null_vals(train_data)
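The two helpers above loop over the columns one at a time. The same selection can also be written with vectorized pandas operations. The sketch below is one way to do it; the function name `columns_to_drop` and the toy frame are illustrative assumptions. It picks numeric columns by dtype instead of skipping the first column by position, and it returns each column at most once even if it is both constant and mostly missing.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, null_threshold=0.7):
    """Columns that are constant (zero standard deviation) or mostly missing."""
    numeric = df.select_dtypes(include="number")
    std = numeric.std()
    zero_std = set(std[std == 0].index)
    null_share = df.isna().mean()
    mostly_null = set(null_share[null_share > null_threshold].index)
    return sorted(zero_std | mostly_null)

toy = pd.DataFrame({
    "id": ["w1", "w2", "w3", "w4", "w5"],
    "constant": [100.0] * 5,
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "useful": [0.1, 0.4, 0.2, 0.9, 0.5],
})
print(columns_to_drop(toy))  # ['constant', 'mostly_missing']
```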

In [28]:
drop_cols

['Sensor-6',
 'Sensor-14',
 'Sensor-43',
 'Sensor-50',
 'Sensor-53',
 'Sensor-70',
 'Sensor-75',
 'Sensor-98',
 'Sensor-142',
 'Sensor-150',
 'Sensor-179',
 'Sensor-180',
 'Sensor-187',
 'Sensor-190',
 'Sensor-191',
 'Sensor-192',
 'Sensor-193',
 'Sensor-194',
 'Sensor-195',
 'Sensor-207',
 'Sensor-210',
 'Sensor-227',
 'Sensor-230',
 'Sensor-231',
 'Sensor-232',
 'Sensor-233',
 'Sensor-234',
 'Sensor-235',
 'Sensor-236',
 'Sensor-237',
 'Sensor-238',
 'Sensor-241',
 'Sensor-242',
 'Sensor-243',
 'Sensor-244',
 'Sensor-257',
 'Sensor-258',
 'Sensor-259',
 'Sensor-260',
 'Sensor-261',
 'Sensor-262',
 'Sensor-263',
 'Sensor-264',
 'Sensor-265',
 'Sensor-266',
 'Sensor-267',
 'Sensor-277',
 'Sensor-285',
 'Sensor-314',
 'Sensor-315',
 'Sensor-316',
 'Sensor-323',
 'Sensor-326',
 'Sensor-327',
 'Sensor-328',
 'Sensor-329',
 'Sensor-330',
 'Sensor-331',
 'Sensor-343',
 'Sensor-348',
 'Sensor-365',
 'Sensor-370',
 'Sensor-371',
 'Sensor-372',
 'Sensor-373',
 'Sensor-374',
 'Sensor-375',
 'Se

In [30]:
X= train_data.drop(drop_cols,axis=1)

In [31]:
X = X.drop('Good/Bad',axis=1)

In [32]:
X

Unnamed: 0.1,Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-7,Sensor-8,Sensor-9,Sensor-10,...,Sensor-581,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590
55,Wafer-856,,2532.45,2191.1333,2197.6570,1.1569,89.7222,0.1251,1.5762,0.0028,...,,,0.4936,0.0113,0.0033,2.2874,0.0133,0.0139,0.0038,104.3034
88,Wafer-889,3221.21,2391.20,2189.9667,1046.6212,0.8662,102.3622,0.1208,1.4756,-0.0025,...,,,0.4940,0.0123,0.0033,2.4860,0.0280,0.0078,0.0022,27.7601
26,Wafer-827,2951.85,2525.00,2189.5777,1320.3197,1.3459,100.7744,0.1234,1.5590,-0.0032,...,0.0056,53.8577,0.5025,0.0178,0.0045,3.5361,0.0286,0.0154,0.0056,53.8577
42,Wafer-843,2982.07,2447.06,2199.6334,1242.8420,1.4083,99.2178,0.1221,1.4542,0.0142,...,,,0.4993,0.0151,0.0038,3.0214,0.0117,0.0262,0.0089,223.1018
69,Wafer-870,3058.08,2524.60,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.4958,0.0004,...,0.0020,24.6547,0.4974,0.0171,0.0040,3.4352,0.0218,0.0054,0.0020,24.6547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,Wafer-861,3071.05,2642.15,2200.9889,1054.5240,1.3830,100.1800,0.1201,1.4532,0.0049,...,,,0.4973,0.0139,0.0039,2.7851,0.0122,0.0131,0.0039,107.5257
71,Wafer-872,3043.18,2545.53,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.3824,-0.0001,...,,,0.4989,0.0131,0.0036,2.6253,0.0218,0.0054,0.0020,24.6547
14,Wafer-815,3001.26,2519.92,2224.6778,1308.6479,1.3907,101.1333,0.1208,1.5172,-0.0135,...,0.0042,48.4818,0.4959,0.0142,0.0037,2.8609,0.0278,0.0135,0.0042,48.4818
92,Wafer-893,3007.00,2572.62,2213.2111,2070.7147,1.9705,87.7411,0.1232,1.4446,-0.0050,...,,,0.4987,0.0172,0.0041,3.4417,0.0195,0.0149,0.0047,76.0035


In [33]:
y = train_data['Good/Bad']

In [34]:
y

55   -1
88   -1
26   -1
42   -1
69   -1
     ..
60   -1
71   -1
14   -1
92   -1
51   -1
Name: Good/Bad, Length: 80, dtype: int64

In [35]:
X.shape[1]

465

In [36]:
len(y)

80

In [37]:
X = X.drop('Unnamed: 0', axis=1)

In [38]:
X

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-7,Sensor-8,Sensor-9,Sensor-10,Sensor-11,...,Sensor-581,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590
55,,2532.45,2191.1333,2197.6570,1.1569,89.7222,0.1251,1.5762,0.0028,-0.0066,...,,,0.4936,0.0113,0.0033,2.2874,0.0133,0.0139,0.0038,104.3034
88,3221.21,2391.20,2189.9667,1046.6212,0.8662,102.3622,0.1208,1.4756,-0.0025,0.0025,...,,,0.4940,0.0123,0.0033,2.4860,0.0280,0.0078,0.0022,27.7601
26,2951.85,2525.00,2189.5777,1320.3197,1.3459,100.7744,0.1234,1.5590,-0.0032,0.0135,...,0.0056,53.8577,0.5025,0.0178,0.0045,3.5361,0.0286,0.0154,0.0056,53.8577
42,2982.07,2447.06,2199.6334,1242.8420,1.4083,99.2178,0.1221,1.4542,0.0142,-0.0064,...,,,0.4993,0.0151,0.0038,3.0214,0.0117,0.0262,0.0089,223.1018
69,3058.08,2524.60,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.4958,0.0004,0.0037,...,0.0020,24.6547,0.4974,0.0171,0.0040,3.4352,0.0218,0.0054,0.0020,24.6547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,3071.05,2642.15,2200.9889,1054.5240,1.3830,100.1800,0.1201,1.4532,0.0049,-0.0048,...,,,0.4973,0.0139,0.0039,2.7851,0.0122,0.0131,0.0039,107.5257
71,3043.18,2545.53,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.3824,-0.0001,-0.0050,...,,,0.4989,0.0131,0.0036,2.6253,0.0218,0.0054,0.0020,24.6547
14,3001.26,2519.92,2224.6778,1308.6479,1.3907,101.1333,0.1208,1.5172,-0.0135,0.0070,...,0.0042,48.4818,0.4959,0.0142,0.0037,2.8609,0.0278,0.0135,0.0042,48.4818
92,3007.00,2572.62,2213.2111,2070.7147,1.9705,87.7411,0.1232,1.4446,-0.0050,-0.0007,...,,,0.4987,0.0172,0.0041,3.4417,0.0195,0.0149,0.0047,76.0035


In [39]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler


In [40]:
preprocessor = Pipeline(steps=[('imputer',KNNImputer(n_neighbors=3)),('scaler',RobustScaler()) ])

In [41]:
preprocessor

In [42]:
X_trans = preprocessor.fit_transform(X)

In [43]:
X_trans

array([[-0.02781221,  0.37395233, -0.4289214 , ...,  0.08510638,
         0.        ,  0.75955556],
       [ 2.50431022, -1.38644649, -0.47986463, ..., -0.56382979,
        -0.51612903, -0.52610857],
       [-0.60204699,  0.28110298, -0.49685153, ...,  0.24468085,
         0.58064516, -0.08775867],
       ...,
       [-0.03223295,  0.21779093,  1.03590393, ...,  0.04255319,
         0.12903226, -0.17805529],
       [ 0.03396281,  0.87459106,  0.53517467, ...,  0.19148936,
         0.29032258,  0.28421459],
       [ 0.86164048,  0.3813055 , -0.59146288, ..., -0.08510638,
        -0.16129032, -0.20782888]])
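The preprocessor above chains `KNNImputer`, which fills each missing value from the nearest rows, with `RobustScaler`, which subtracts each column's median and divides by its interquartile range. A minimal sketch on a toy array (the values are made up for illustration) shows both effects:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

toy = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 110.0],
    [3.0, 30.0, 120.0],
    [4.0, 40.0, np.nan],
    [5.0, 50.0, 140.0],
])

toy_preprocessor = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=2)),
    ("scaler", RobustScaler()),
])
toy_trans = toy_preprocessor.fit_transform(toy)

print(np.isnan(toy_trans).sum())     # count of missing values left after imputation
print(np.median(toy_trans, axis=0))  # column medians after scaling
```

Because the scaler centres on the median rather than the mean, a few extreme sensor readings have less influence on the scaled values than they would with standard scaling.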

In [44]:
# !pip install kneed

In [45]:
# from sklearn.cluster import KMeans
# from kneed import KneeLocator
# from typing import Tuple
# from dataclasses import dataclass

In [46]:
# @dataclass
# class ClusterDataInstances:
#     X: np.array
#     desc: str
#     def get_ideal_number_of_clusters(self):
#         wcss = []
#         for i in range(1,11):
#             kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
#             kmeans.fit(self.X)
#             wcss.append(kmeans.inertia_)
#         print(wcss)
#         knee_finder = KneeLocator(range(1,11),wcss,curve='convex',direction='decreasing')
#         return knee_finder.knee
#     def create_clusters(self)->tuple:
#         ideal_clusters = self.get_ideal_number_of_clusters()
#         kmeans = KMeans(n_clusters=ideal_clusters,init='k-means++',random_state=42)
#         y_kmeans = kmeans.fit_predict(self.X)
#         return kmeans, np.c_[self.X,y_kmeans]


In [48]:
# clusters_algo = ClusterDataInstances(X=X_trans,desc='WaferFeatures')
# clusterer,X_clustered = clusters_algo.create_clusters()

In [49]:
# X_clustered

In [50]:
# np.unique(X_clustered[:,-1])

In [51]:
# wafers_clus = np.c_[X_clustered,y]

In [52]:
# wafers_clus[wafers_clus[:,-2]==2].shape

In [53]:
# X

In [54]:
# x = X

In [55]:
# @dataclass 
# class ClusterDataInstances:
#     x: np.array
#     desc: str
#     def finding_best_K(self):
#         wcss = []
#         for i in range(1,11):
#             kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
#             kmeans.fit(self.x)
#             wcss.append(kmeans.inertia_)
#         print(wcss)
#         k = KneeLocator(range(1,11),wcss,curve='convex',direction='decreasing')
#         return k.knee
#     def createClusters(self)-> tuple:
#         k = self.finding_best_K()
#         kmeans = KMeans(n_clusters=k,init='k-means++',random_state=42)
#         y = kmeans.fit_predict(self.x)
#         return kmeans,np.c_[self.x,y]
    
    



In [56]:
# cluster_algorithm = ClusterDataInstances(x=X_trans,desc='KMeans')
# clusterer,clustered_data = cluster_algorithm.createClusters()
# print(clustered_data)

In [57]:
# X

In [58]:
X = pd.concat([X,y],axis=1)

In [59]:
X

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-7,Sensor-8,Sensor-9,Sensor-10,Sensor-11,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
55,,2532.45,2191.1333,2197.6570,1.1569,89.7222,0.1251,1.5762,0.0028,-0.0066,...,,0.4936,0.0113,0.0033,2.2874,0.0133,0.0139,0.0038,104.3034,-1
88,3221.21,2391.20,2189.9667,1046.6212,0.8662,102.3622,0.1208,1.4756,-0.0025,0.0025,...,,0.4940,0.0123,0.0033,2.4860,0.0280,0.0078,0.0022,27.7601,-1
26,2951.85,2525.00,2189.5777,1320.3197,1.3459,100.7744,0.1234,1.5590,-0.0032,0.0135,...,53.8577,0.5025,0.0178,0.0045,3.5361,0.0286,0.0154,0.0056,53.8577,-1
42,2982.07,2447.06,2199.6334,1242.8420,1.4083,99.2178,0.1221,1.4542,0.0142,-0.0064,...,,0.4993,0.0151,0.0038,3.0214,0.0117,0.0262,0.0089,223.1018,-1
69,3058.08,2524.60,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.4958,0.0004,0.0037,...,24.6547,0.4974,0.0171,0.0040,3.4352,0.0218,0.0054,0.0020,24.6547,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,3071.05,2642.15,2200.9889,1054.5240,1.3830,100.1800,0.1201,1.4532,0.0049,-0.0048,...,,0.4973,0.0139,0.0039,2.7851,0.0122,0.0131,0.0039,107.5257,-1
71,3043.18,2545.53,2192.3778,1110.5453,0.8147,99.2922,0.1226,1.3824,-0.0001,-0.0050,...,,0.4989,0.0131,0.0036,2.6253,0.0218,0.0054,0.0020,24.6547,-1
14,3001.26,2519.92,2224.6778,1308.6479,1.3907,101.1333,0.1208,1.5172,-0.0135,0.0070,...,48.4818,0.4959,0.0142,0.0037,2.8609,0.0278,0.0135,0.0042,48.4818,-1
92,3007.00,2572.62,2213.2111,2070.7147,1.9705,87.7411,0.1232,1.4446,-0.0050,-0.0007,...,,0.4987,0.0172,0.0041,3.4417,0.0195,0.0149,0.0047,76.0035,-1


In [60]:
X.shape

(80, 465)

In [61]:

y.shape

(80,)

In [62]:
# pip install imbalanced-learn


In [64]:
# from imblearn.combine import SMOTETomek

In [65]:
# resampler = SMOTETomek(sampling_strategy='auto')


In [67]:
y

55   -1
88   -1
26   -1
42   -1
69   -1
     ..
60   -1
71   -1
14   -1
92   -1
51   -1
Name: Good/Bad, Length: 80, dtype: int64

In [68]:
# from imblearn.combine import SMOTETomek

In [69]:
X_trans.shape

(80, 464)

In [70]:
# import numpy as np
# import pandas as pd
# from sklearn.utils import resample

# # Example data

# # Separate majority and minority classes
# majority_class = X[X['Good/Bad'] == -1]
# minority_class = X[X['Good/Bad'] == 1]
# # Undersample majority class
# undersampled_majority = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)

# # Combine minority class with undersampled majority class
# undersampled_data = pd.concat([undersampled_majority, minority_class])

# # Shuffle the undersampled data
# undersampled_data = undersampled_data.sample(frac=1, random_state=42)

# # Check the class distribution in the undersampled data
# print(undersampled_data['Good/Bad'].value_counts())

import numpy as np
import pandas as pd
from sklearn.utils import resample


# Separate majority and minority classes
majority_class = X[X['Good/Bad'] == -1]
minority_class = X[X['Good/Bad'] == 1]

# Upsample minority class
upsampled_minority = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)

# Combine majority class with upsampled minority class
upsampled_data = pd.concat([majority_class, upsampled_minority])

# Shuffle the upsampled data
upsampled_data = upsampled_data.sample(frac=1, random_state=42)

# Check the class distribution in the upsampled data
print(upsampled_data['Good/Bad'].value_counts())


Good/Bad
 1    74
-1    74
Name: count, dtype: int64
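Upsampling with replacement duplicates minority rows, so if the balanced frame is split afterwards, copies of the same wafer can land on both sides of the split and inflate the scores. A sketch of the alternative order, splitting first and upsampling only the training part, on synthetic data (the frame `demo` and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Sensor-A": rng.normal(size=100),
    "Sensor-B": rng.normal(size=100),
    "Good/Bad": [-1] * 90 + [1] * 10,
})

# 1. Hold out the test rows before any resampling.
demo_train, demo_test = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["Good/Bad"]
)

# 2. Upsample the minority class inside the training part only.
majority = demo_train[demo_train["Good/Bad"] == -1]
minority = demo_train[demo_train["Good/Bad"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_train = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)

print(balanced_train["Good/Bad"].value_counts())
# Row labels shared by the balanced training set and the test set (expected to be empty).
print(set(balanced_train.index) & set(demo_test.index))
```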


In [73]:
X = upsampled_data

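In [74]:
# X still includes the 'Good/Bad' column at this point, so the label is imputed and
# scaled along with the sensor features (it is the last column of the output below).
preprocessor.fit_transform(X)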

array([[-0.64852356, -0.16074564,  0.0902825 , ..., -0.47368421,
        -0.26399102,  0.5       ],
       [-0.33479957,  0.87640146, -0.40810708, ..., -0.47368421,
        -0.26399102, -0.5       ],
       [ 0.70475061, -1.15426179,  4.33797763, ..., -0.47368421,
        -0.26399102,  0.5       ],
       ...,
       [ 0.03125   ,  0.81926246,  0.11051176, ...,  0.07894737,
         1.13692318, -0.5       ],
       [-0.64852356, -0.16074564,  0.0902825 , ..., -0.47368421,
        -0.26399102,  0.5       ],
       [-0.12825122,  0.        ,  4.33797763, ..., -0.47368421,
        -0.26399102,  0.5       ]])

In [75]:
X['Good/Bad'].value_counts()

Good/Bad
 1    74
-1    74
Name: count, dtype: int64

In [76]:
X

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-7,Sensor-8,Sensor-9,Sensor-10,Sensor-11,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
99,2914.86,2465.11,2210.2778,2120.5760,1.0700,95.1089,0.1230,1.5817,0.0118,0.0000,...,,0.5026,0.0121,0.0032,2.4064,0.0153,0.0048,0.0017,31.0176,1
98,2955.87,2541.89,,,,,,1.4493,-0.0194,-0.0018,...,,0.5023,0.0140,0.0033,2.7810,0.0153,0.0048,0.0017,31.0176,-1
94,3091.76,2391.56,2315.2667,2360.1325,1.1259,90.1144,0.1160,1.6107,0.0250,0.0125,...,31.0176,0.5087,0.0116,0.0032,2.2764,0.0153,0.0048,0.0017,31.0176,1
49,2998.89,2532.66,2189.3556,2363.6412,2.1415,83.4233,0.1246,1.4108,0.0095,-0.0026,...,,0.5022,0.0097,0.0027,1.9367,0.0147,0.0095,0.0028,65.0365,-1
41,3212.46,2522.41,2200.2333,1173.8377,1.3281,101.6111,0.1211,1.4650,0.0035,0.0053,...,,0.4936,0.0131,0.0032,2.6457,0.0117,0.0262,0.0089,223.1018,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14,3001.26,2519.92,2224.6778,1308.6479,1.3907,101.1333,0.1208,1.5172,-0.0135,0.0070,...,48.4818,0.4959,0.0142,0.0037,2.8609,0.0278,0.0135,0.0042,48.4818,-1
94,3091.76,2391.56,2315.2667,2360.1325,1.1259,90.1144,0.1160,1.6107,0.0250,0.0125,...,31.0176,0.5087,0.0116,0.0032,2.2764,0.0153,0.0048,0.0017,31.0176,1
5,3003.72,2537.66,2210.7778,2008.9216,1.1351,91.1078,0.1240,1.3940,-0.0073,0.0006,...,114.2878,0.5033,0.0154,0.0043,3.0647,0.0099,0.0113,0.0038,114.2878,-1
99,2914.86,2465.11,2210.2778,2120.5760,1.0700,95.1089,0.1230,1.5817,0.0118,0.0000,...,,0.5026,0.0121,0.0032,2.4064,0.0153,0.0048,0.0017,31.0176,1


In [77]:
from sklearn.model_selection import train_test_split

In [78]:
X_Trans = preprocessor.fit_transform(X)


In [80]:
X_data = X.drop('Good/Bad',axis=1)
y_data = X['Good/Bad']

In [81]:
x_data_trans = preprocessor.fit_transform(X_data)

In [82]:
x_train,x_test,y_train,y_test = train_test_split(x_data_trans,y_data,test_size=0.33,random_state=42)

In [83]:
x_train

array([[-0.64852356, -0.16074564,  0.0902825 , ..., -0.53676471,
        -0.47368421, -0.26399102],
       [-0.12825122,  0.        ,  4.33797763, ..., -0.53676471,
        -0.47368421, -0.26399102],
       [ 1.62809823,  0.61326489, -0.31610306, ...,  1.03676471,
         1.42105263,  2.96757902],
       ...,
       [-0.17093788, -0.07442929,  0.27279492, ...,  0.        ,
         0.        ,  0.73600898],
       [ 0.70475061, -1.15426179,  4.33797763, ..., -0.53676471,
        -0.47368421, -0.26399102],
       [ 0.65380202,  0.31230582, -0.34037413, ...,  1.03676471,
         1.42105263,  2.96757902]])

In [84]:
x_prep = x_train
y_prep = y_train
x_test_prep = x_test
y_test_prep = y_test

print(x_prep.shape, y_prep.shape)
print(x_test_prep.shape, y_test_prep.shape)

(99, 464) (99,)
(49, 464) (49,)


In [85]:
# y_prep



In [89]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score


In [90]:
svc_clf = SVC(kernel='linear')
svc_rbf_clf = SVC(kernel='rbf')
random_clf = RandomForestClassifier(random_state=42)
xgb_clf = XGBClassifier(objective='binary:logistic')

In [91]:
def display_scores(scores):
    print('scores: ', scores)
    print('mean: ',scores.mean())
    print('std: ',scores.std())

   



In [92]:
svc_clf_score = cross_val_score(svc_clf,x_prep,y_prep,scoring='roc_auc',verbose=2,cv=10)
display_scores(svc_clf_score)
svc_pred = cross_val_predict(svc_clf,x_test_prep,y_test_prep,cv=5)
svc_auc_score = roc_auc_score(y_test_prep,svc_pred)
print(svc_auc_score)

[CV] END .................................................... total time=   0.4s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
mean:  1.0
std:  0.0
0.9782608695652174
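`cross_val_predict` on the test rows fits fresh models on folds of the test set, so the last number above does not measure the classifier trained on `x_prep`. Passing hard class labels to `roc_auc_score` also reduces the ROC curve to a single threshold. A sketch of the more conventional hold-out evaluation, fitting on the training part and scoring the continuous `decision_function` output on the test part, using synthetic data as a stand-in (all names with the `_demo` suffix are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Fit once, on the training part only.
svc_demo = SVC(kernel="linear")
svc_demo.fit(X_train_demo, y_train_demo)

# Rank the held-out rows by their signed distance to the separating hyperplane.
test_scores_demo = svc_demo.decision_function(X_test_demo)
auc_demo = roc_auc_score(y_test_demo, test_scores_demo)
print(auc_demo)
```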


In [97]:
svc_rbf_clf_score = cross_val_score(svc_rbf_clf,x_prep,y_prep,scoring='roc_auc',cv=10)
display_scores(svc_rbf_clf_score)
svc_rbf_pred = cross_val_predict(svc_rbf_clf,x_test_prep,y_test_prep,cv=5)
svc_rbf_auc_score = roc_auc_score(y_test_prep,svc_rbf_pred)
print(svc_rbf_auc_score)

scores:  [0.96       1.         0.96       0.68       0.92       1.
 1.         1.         0.91666667 1.        ]
mean:  0.9436666666666665
std:  0.0933862944976403
0.6086956521739131


In [98]:
random_clf_score = cross_val_score(random_clf,x_prep,y_prep,scoring='roc_auc',verbose=2,cv=10)
display_scores(random_clf_score)
random_pred = cross_val_predict(random_clf,x_test_prep,y_test_prep,cv=5)
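# Score the random forest's own predictions. The value recorded below came from a run
# that scored svc_pred here, so it repeats the linear SVC result.
random_auc_score = roc_auc_score(y_test_prep,random_pred)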
print(random_auc_score)

[CV] END .................................................... total time=   0.6s
[CV] END .................................................... total time=   0.3s
[CV] END .................................................... total time=   0.3s
[CV] END .................................................... total time=   0.3s
[CV] END .................................................... total time=   0.3s
[CV] END .................................................... total time=   0.3s
[CV] END .................................................... total time=   0.4s
[CV] END .................................................... total time=   0.2s
[CV] END .................................................... total time=   0.4s
[CV] END .................................................... total time=   0.3s
scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
mean:  1.0
std:  0.0
0.9782608695652174


In [100]:
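# Evaluate xgb_clf itself. Recent XGBoost versions expect class labels 0/1, so -1/1 is remapped.
# The output recorded below came from a run that passed svc_clf here, so it repeats the linear SVC scores.
xgb_clf_score = cross_val_score(xgb_clf,x_prep,(y_prep == 1).astype(int),scoring='roc_auc',verbose=2,cv=10)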
display_scores(xgb_clf_score)


[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s
scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
mean:  1.0
std:  0.0
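Repeating the same evaluation cell for each model makes it easy to pass the wrong estimator or prediction variable. A sketch of scoring several classifiers in one loop keyed by name, on synthetic data (the dictionary `candidates` and the data are assumptions; XGBoost is left out so the sketch depends on scikit-learn only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=150, n_features=12, random_state=42)

candidates = {
    "svc_linear": SVC(kernel="linear"),
    "svc_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Each estimator is scored under its own key, so results cannot be mixed up.
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, scoring="roc_auc", cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```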
