# Sensor Fault Detection

**Brief:** In electronics, a **wafer** (also called a slice or substrate) is a thin slice of semiconductor, such as a crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate(serves as foundation for contruction of other components) for microelectronic devices built in and upon the wafer. 

It undergoes many microfabrication processes, such as doping, ion implantation, etching, thin-film deposition of various materials, and photolithographic patterning. Finally, the individual microcircuits are separated by wafer dicing and packaged as an integrated circuit.

## Problem Statement

**Data:** Wafers data


**Problem Statement:** Wafers are predominantly used to manufacture solar cells and are located at remote locations in bulk and they themselves consist of few hundreds of sensors. Wafers are fundamental of photovoltaic power generation, and production thereof requires high technology. Photovoltaic power generation system converts sunlight energy directly to electrical energy.

The motto behind figuring out the faulty wafers is to obliterate the need of having manual man-power doing the same. And make no mistake when we're saying this, even when they suspect a certain wafer to be faulty, they had to open the wafer from the scratch and deal with the issue, and by doing so all the wafers in the vicinity had to be stopped disrupting the whole process and stuff anf this is when that certain wafer was indeed faulty, however, when their suspicion came outta be false negative, then we can only imagine the waste of time, man-power and ofcourse, cost incurred.

**Solution:** Data fetched by wafers is to be passed through the machine learning pipeline and it is to be determined whether the wafer at hand is faulty or not apparently obliterating the need and thus cost of hiring manual labour.

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [10]:
file_path = r'wafer_23012020_041211.csv'
wafer_df = pd.read_csv(file_path)
print(wafer_df.shape)
wafer_df.head()

(100, 592)


Unnamed: 0.1,Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
0,Wafer-801,2968.33,2476.58,2216.7333,1748.0885,1.1127,100.0,97.5822,0.1242,1.53,...,,0.5004,0.012,0.0033,2.4069,0.0545,0.0184,0.0055,33.7876,-1
1,Wafer-802,2961.04,2506.43,2170.0666,1364.5157,1.5447,100.0,96.77,0.123,1.3953,...,,0.4994,0.0115,0.0031,2.302,0.0545,0.0184,0.0055,33.7876,1
2,Wafer-803,3072.03,2500.68,2205.7445,1363.1048,1.0518,100.0,101.8644,0.122,1.3896,...,,0.4987,0.0118,0.0036,2.3719,0.0545,0.0184,0.0055,33.7876,-1
3,Wafer-804,3021.83,2419.83,2205.7445,1363.1048,1.0518,100.0,101.8644,0.122,1.4108,...,,0.4934,0.0123,0.004,2.4923,0.0545,0.0184,0.0055,33.7876,-1
4,Wafer-805,3006.95,2435.34,2189.8111,1084.6502,1.1993,100.0,104.8856,0.1234,1.5094,...,,0.4987,0.0145,0.0041,2.8991,0.0545,0.0184,0.0055,33.7876,-1


In [11]:
wafer_df.describe()

Unnamed: 0,Sensor-1,Sensor-2,Sensor-3,Sensor-4,Sensor-5,Sensor-6,Sensor-7,Sensor-8,Sensor-9,Sensor-10,...,Sensor-582,Sensor-583,Sensor-584,Sensor-585,Sensor-586,Sensor-587,Sensor-588,Sensor-589,Sensor-590,Good/Bad
count,99.0,100.0,97.0,97.0,97.0,97.0,97.0,97.0,100.0,100.0,...,34.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,3017.301212,2487.1803,2202.168281,1484.362181,1.180367,100.0,97.449088,0.122195,1.461516,0.000243,...,74.331709,0.49939,0.013615,0.003549,2.727297,0.02351,0.014875,0.004685,77.430241,-0.88
std,71.819707,66.954212,30.350606,460.985871,0.349654,0.0,5.553324,0.002006,0.0713,0.01061,...,41.857728,0.003431,0.004344,0.000873,0.875848,0.011991,0.007557,0.002527,55.106166,0.477367
min,2825.67,2254.99,2114.6667,978.7832,0.7531,100.0,83.4233,0.116,1.3179,-0.0279,...,20.3091,0.4925,0.0076,0.0021,1.5152,0.0099,0.0048,0.0017,20.3091,-1.0
25%,2973.04,2446.595,2189.9667,1111.5436,0.8373,100.0,95.1089,0.1208,1.407375,-0.006925,...,47.356,0.4973,0.0113,0.003075,2.270425,0.0134,0.009475,0.0027,33.7876,-1.0
50%,3004.39,2493.89,2200.9889,1244.2899,1.1569,100.0,99.5133,0.1222,1.4537,0.001,...,65.12755,0.4994,0.01275,0.0034,2.5464,0.0218,0.0139,0.00385,62.0595,-1.0
75%,3070.385,2527.525,2213.2111,1963.8016,1.383,100.0,101.4578,0.1234,1.507425,0.008125,...,99.41905,0.501525,0.0147,0.003825,2.95375,0.028025,0.0192,0.0059,104.3034,-1.0
max,3221.21,2664.52,2315.2667,2363.6412,2.2073,100.0,107.1522,0.1262,1.6411,0.025,...,223.1018,0.5087,0.0437,0.0089,8.816,0.0545,0.0401,0.015,223.1018,1.0


### Insights
Many columns seem to have 0 standard deviation and many outliers too. So need many steps of preprocessing before modelling

In [29]:
wafer_df.isna().sum().sum() 

2306

In [23]:
na_columns = list(wafer_df.columns[wafer_df.isna().sum()>0])
print(len(na_columns))

167


#### Insights
we have 167 columns with na values and 2306 cells with na values


In [33]:
wafer_df.isna().sum().sum()/(wafer_df.shape[0]*(wafer_df.shape[1]-1))

0.03901861252115059

### Insights
Close to 4 percent of cells are missing so imputation would need to be done properly

In [25]:
wafer_df['Good/Bad'].value_counts()

Good/Bad
-1    94
 1     6
Name: count, dtype: int64

### Insights
Tarrget column is heavily imbalanced so resampling would also be required

In [27]:
# Checking Missing columns in target column
wafer_df['Good/Bad'].isna().sum()

0

### Insights
we have zero rows with target columns missing.

## Visualization

In [66]:
import random
def plot_random_50_columns(dataframe):
    """
    Plots distplot for 50 columns which are selected randomly
    """
    # Select 50 random sensors

    random_50_sensors_idx = set()
    total_columns = len(wafer_df.columns)
    num_columns = 50
    
    while len(random_50_sensors_idx)<num_columns:
        random_index = random.randint(0, total_columns - 1)
        random_50_sensors_idx.add(random_index)
    print(random_50_sensors_idx)
    random_50_sensors_idx
    

In [67]:
plot_random_50_columns(wafer_df)

592
[393, 412, 221, 441, 300, 340, 414, 65, 145, 537, 356, 193, 19, 571, 18, 557, 194, 178, 188, 265, 502, 462, 62, 71, 191, 307, 44, 37, 279, 361, 212, 237, 456, 109, 234, 357, 149, 59, 257, 417, 534, 271, 172, 440, 297, 266, 424, 55, 265, 197, 155]
