Capstone Project Report

Predicting Yield in Semiconductor Manufacturing

Domain: Semiconductor Manufacturing Process

1. Context

A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signal variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process Engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process, enabling an increase in process throughput, decreased time to learning, and reduced per unit production costs. These signals can be used as features to predict the yield type. By analyzing and trying out different combinations of features, essential signals that impact the yield type can be identified.

2. Data Description

The dataset, sensor-data.csv, consists of 1567 examples, each with 591 features. Each example represents a single production entity with its associated measured features, and the label records a simple pass/fail yield from in-house line testing. In the target column, “-1” corresponds to a pass and “1” corresponds to a fail.
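
Because the label uses -1 for pass and 1 for fail, it can help to check how the two classes are distributed before modelling. The following is a minimal sketch; the `labels` series is a hypothetical stand-in for the Pass/Fail column, and its values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the "Pass/Fail" column (illustrative values only).
labels = pd.Series([-1, -1, -1, 1, -1, -1, 1, -1], name="Pass/Fail")

# Count the examples in each class (-1 = pass, 1 = fail).
class_counts = labels.value_counts()

# Fraction of entities that failed the line test.
fail_rate = (labels == 1).mean()

print(class_counts)
print(f"Fail rate: {fail_rate:.2%}")
```

On the real data, the same two calls would be applied to `data['Pass/Fail']` after loading the CSV.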



3. Project Objective

The objective of this project is to build a classifier to predict the Pass/Fail yield of a particular process entity and analyze whether all the features are required to build the model or not.

4. Steps and Tasks

Step 1: Import and Explore the Data

Task 1.1: Import the data from sensor-data.csv using pandas.

Task 1.2: Inspect the first few rows of the dataset to understand its structure and contents.

Task 1.3: Check for the presence of any missing values.

Task 1.4: Explore the basic statistics of the dataset (mean, median, standard deviation, etc.).
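
Tasks 1.2 to 1.4 can be sketched as follows. This is an illustrative example on a tiny synthetic frame, not the project data: the frame `df` and its values are assumptions, chosen only so that the snippet runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the sensor table (hypothetical values).
df = pd.DataFrame({
    "0": [3030.9, 3095.8, np.nan, 2988.7],
    "1": [2564.0, np.nan, 2559.9, 2479.9],
    "Pass/Fail": [-1, -1, 1, -1],
})

# Task 1.2: inspect the first few rows.
print(df.head())

# Task 1.3: count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Task 1.4: basic statistics; describe() reports the median as the 50% row.
summary = df.describe()
medians = df.median(numeric_only=True)
print(summary)
print(medians)
```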


In [2]:
import pandas as pd

# Load the data
data = pd.read_csv('/content/drive/MyDrive/uci-secom.csv')

# Check for missing values
missing_values = data.isnull().sum()
print(f"Missing values in each column:\n{missing_values[missing_values > 0]}")

# Separate numeric and non-numeric columns
numeric_cols = data.select_dtypes(include=['number']).columns
non_numeric_cols = data.select_dtypes(exclude=['number']).columns

# Fill missing values in numeric columns with the mean
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

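# Fill missing values in non-numeric columns with the mode.
# Assign the result back instead of using inplace=True on a column selection,
# which may operate on a temporary copy in recent pandas versions.
for col in non_numeric_cols:
    data[col] = data[col].fillna(data[col].mode()[0])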

# Verify there are no more missing values
missing_values_after = data.isnull().sum()
print(f"Missing values after filling:\n{missing_values_after[missing_values_after > 0]}")


Missing values in each column:
0       6
1       7
2      14
3      14
4      14
       ..
585     1
586     1
587     1
588     1
589     1
Length: 538, dtype: int64
Missing values after filling:
Series([], dtype: int64)


Step 2: Data Cleansing

Data cleansing is a crucial step to ensure that the data is accurate and ready for analysis. This step involves handling missing values, removing irrelevant attributes, and making necessary modifications to the data.

Task 2.1: Handle Missing Values

Identify columns with missing values.

Decide on an appropriate strategy to handle missing values (e.g., mean/mode/median imputation, or removing rows/columns).
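
One possible strategy, shown here only as a sketch, is to drop columns that are mostly empty and impute the remaining gaps with the column median. The 50% threshold (`max_missing_fraction`) and the synthetic frame are assumptions made for illustration; they are not values taken from this project.

```python
import numpy as np
import pandas as pd

# Synthetic example frame (hypothetical values).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],        # 25% missing
    "b": [np.nan, np.nan, np.nan, 2.0],  # 75% missing
    "c": [5.0, 6.0, 7.0, 8.0],           # complete
})

# Assumed threshold: drop columns with more than half of their values missing.
max_missing_fraction = 0.5

# Fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Keep only the columns at or below the threshold.
kept = df.loc[:, missing_fraction <= max_missing_fraction]

# Fill the remaining gaps with each column's median.
imputed = kept.fillna(kept.median())
print(imputed)
```

The median is less sensitive to extreme sensor readings than the mean, which is one reason it might be preferred here.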

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print(f"Missing values in each column:\n{missing_values}")

# Display basic statistics of the dataset
print(data.describe())

Missing values in each column:
Time         0
0            0
1            0
2            0
3            0
            ..
586          0
587          0
588          0
589          0
Pass/Fail    0
Length: 592, dtype: int64
                 0            1            2            3            4  \
count  1567.000000  1567.000000  1567.000000  1567.000000  1567.000000   
mean   3014.452896  2495.850231  2200.547318  1396.376627     4.197013   
std      73.480613    80.227793    29.380932   439.712852    56.103066   
min    2743.240000  2158.750000  2060.660000     0.000000     0.681500   
25%    2966.665000  2452.885000  2181.099950  1083.885800     1.017700   
50%    3011.840000  2498.910000  2200.955600  1287.353800     1.317100   
75%    3056.540000  2538.745000  2218.055500  1590.169900     1.529600   
max    3356.350000  2846.440000  2315.266700  3715.041700  1114.536600   

            5            6            7            8            9  ...  \
count  1567.0  1567.000000  1567.0000

Task 2.2: Remove Constant Features
Remove features that have the same value in all rows.

In [6]:
# Remove constant features
constant_features = [col for col in data.columns if data[col].nunique() == 1]
data.drop(columns=constant_features, inplace=True)
print(f"Constant features dropped: {constant_features}")


Constant features dropped: ['5', '13', '42', '49', '52', '69', '97', '141', '149', '178', '179', '186', '189', '190', '191', '192', '193', '194', '226', '229', '230', '231', '232', '233', '234', '235', '236', '237', '240', '241', '242', '243', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '276', '284', '313', '314', '315', '322', '325', '326', '327', '328', '329', '330', '364', '369', '370', '371', '372', '373', '374', '375', '378', '379', '380', '381', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '414', '422', '449', '450', '451', '458', '461', '462', '463', '464', '465', '466', '481', '498', '501', '502', '503', '504', '505', '506', '507', '508', '509', '512', '513', '514', '515', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538']
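
A slightly more defensive variant of the same idea is sketched below. It leaves the target column out of the check so that it can never be dropped by accident, and it counts NaN as a value via `nunique(dropna=False)`. The frame and the `target` name are assumptions for illustration.

```python
import pandas as pd

# Synthetic example frame (hypothetical values).
df = pd.DataFrame({
    "s1": [1.0, 1.0, 1.0],       # constant feature
    "s2": [0.1, 0.2, 0.3],       # varying feature
    "Pass/Fail": [-1, -1, -1],   # constant here only because the sample is tiny
})
target = "Pass/Fail"

# Check only the feature columns, never the target.
feature_cols = [c for c in df.columns if c != target]

# Count distinct values per feature, treating NaN as a value of its own.
unique_counts = df[feature_cols].nunique(dropna=False)

# A feature with at most one distinct value carries no information.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
reduced = df.drop(columns=constant_cols)
print(f"Constant features dropped: {constant_cols}")
```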


Task 2.3: Remove Duplicate Rows
Remove any duplicate rows in the dataset.

In [7]:
# Remove duplicate rows
data.drop_duplicates(inplace=True)
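
The cell above does not report how many rows were removed. A minimal sketch of counting duplicates before dropping them, on a small synthetic frame with assumed values, could look like this:

```python
import pandas as pd

# Synthetic example frame with one repeated row (hypothetical values).
df = pd.DataFrame({"x": [1, 2, 2, 3], "y": [10, 20, 20, 30]})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(f"Duplicate rows removed: {n_duplicates}")
```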


Task 2.4: Identify and Remove Outliers
Use statistical methods or visualization techniques to identify outliers.
Remove or handle outliers based on the chosen method.

In [8]:
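# Using the IQR method to identify outliers (numeric columns only)
numeric = data.select_dtypes(include=['number'])
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Filtering out the outliers: a row is dropped if ANY of its columns
# falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
outlier_mask = ((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~outlier_mask]

# Display the shape of the data after outlier removal
print(f"Data shape after outlier removal: {data.shape}")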


Data shape after outlier removal: (0, 475)
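
The shape (0, 475) means that every row was removed: with several hundred columns, each row has at least one value outside the IQR fences of some column, so the "any column" rule discards the whole dataset. One possible alternative, sketched below, is to keep all rows and clip each feature to its own fences, leaving the target column untouched. This is an illustrative substitute and not the method used in the cell above; the frame and the `target` name are assumptions.

```python
import pandas as pd

# Synthetic example frame (hypothetical values); s1 contains one extreme reading.
df = pd.DataFrame({
    "s1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "s2": [10.0, 11.0, 12.0, 13.0, 14.0],
    "Pass/Fail": [-1, -1, -1, -1, 1],
})
target = "Pass/Fail"

# Compute the IQR fences on the features only, never on the label.
features = df.drop(columns=[target])
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Clip each column to its own fences instead of dropping rows.
clipped = features.clip(lower=lower, upper=upper, axis=1)
clipped[target] = df[target]

print(f"Data shape after clipping: {clipped.shape}")
```

Clipping preserves all examples, including the failing ones, which would otherwise be flagged as outliers in the label column itself whenever fails are rare.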
