<a href="https://colab.research.google.com/github/Shriyansh-Gupta-8786/semiconductor-yield-prediction/blob/main/Major_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Semiconductor Manufacturing Process Yield Prediction***



> Project Objective: Build a classifier to predict the Pass/Fail yield of a particular process entity, and determine whether all features are required to build the model.



# **1.) Importing Necessary Libraries**

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, f1_score, precision_score, recall_score

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, mutual_info_classif

# Machine learning algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Handling imbalanced data
from imblearn.over_sampling import SMOTE

# Saving the model
import joblib

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Setting plot style
sns.set_style('whitegrid')


# **2.) Loading and Exploring the Data**

2.1) Load the Dataset

In [None]:
#Load the dataset
data = pd.read_csv('/content/drive/MyDrive/Corizo/Signal-Data.csv')

2.2) Initial Exploration

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.dtypes

In [None]:
data.describe()

2.3) Checking Target Variable Distribution

In [None]:
# Name of the target column. '583' is the value used in this notebook;
# replace it if the Pass/Fail label column is named differently in your file.
target_column = '583'

# Check the distribution of the target variable
data[target_column].value_counts()


# **3.) Data Cleaning and Preprocessing**

3.1) Handling Missing Values

In [None]:
# Check for missing values
missing_values = data.isnull().sum()

# Display columns with missing values
missing_values[missing_values > 0]




> Dropping Sparse Columns and Imputing the Remainder



In [None]:
# Calculate the percentage of missing values for each column
missing_percentage = (data.isnull().sum() / len(data)) * 100

# Display columns with more than 50% missing values
high_missing = missing_percentage[missing_percentage > 50]
high_missing


In [None]:
# Drop columns with more than 50% missing values
data_cleaned = data.drop(columns=high_missing.index)


In [None]:
# For remaining missing values, impute with mean (for numerical features)
numerical_features = data_cleaned.select_dtypes(include=[np.number]).columns.tolist()
data_cleaned[numerical_features] = data_cleaned[numerical_features].fillna(data_cleaned[numerical_features].mean())


In [None]:
# Verify no missing values remain
data_cleaned.isnull().sum().any()
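As a small illustration of the two steps in this section (dropping columns that are more than 50% missing, then mean-imputing what remains), here is a sketch on a hypothetical toy frame. The names `toy`, `toy_missing_pct`, and `toy_cleaned` are invented for this example and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'b' is 75% missing, 'a' is 25% missing, 'c' is complete
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 3.0],
    'b': [np.nan, np.nan, np.nan, 4.0],
    'c': [10.0, 20.0, 30.0, 40.0],
})

# Percentage of missing values per column
toy_missing_pct = toy.isnull().mean() * 100

# Drop columns above the 50% cutoff, then fill remaining gaps with the column mean
toy_cleaned = toy.drop(columns=toy_missing_pct[toy_missing_pct > 50].index)
toy_cleaned = toy_cleaned.fillna(toy_cleaned.mean())

print(toy_cleaned)
```

Column `b` is removed, and the single gap in `a` is filled with the mean of its observed values.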


3.2) Handling Duplicate Rows

In [None]:
# Check for duplicate rows
duplicate_rows = data_cleaned.duplicated().sum()
print(f'Number of duplicate rows: {duplicate_rows}')


In [None]:
# Drop duplicate rows if any
data_cleaned = data_cleaned.drop_duplicates()


3.3) Handling Constant and Quasi-Constant Features

In [None]:
# Convert date-like object columns to a numerical representation (Unix timestamp in seconds)
for col in data_cleaned.select_dtypes(include='object').columns:
    try:
        data_cleaned[col] = pd.to_datetime(data_cleaned[col]).astype('int64') / 10**9
    except (ValueError, TypeError):
        pass  # Not a date column; leave it unchanged

# Use VarianceThreshold to remove constant and quasi-constant features
var_thresh = VarianceThreshold(threshold=0.01)  # Threshold can be adjusted
var_thresh.fit(data_cleaned.drop(columns=[target_column]))

In [None]:
# Get list of features to keep
features_to_keep = data_cleaned.drop(columns=[target_column]).columns[var_thresh.get_support()]

# Create a new dataframe with selected features
data_cleaned = data_cleaned[features_to_keep.tolist() + [target_column]]
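One point worth keeping in mind with a fixed variance threshold is that variance depends on the scale of each feature. The sketch below, on synthetic data, suggests how a threshold of 0.01 removes a constant column but also a column that does vary, only on a very small scale. The array `X_vt_demo` and its three columns are assumptions made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical features with very different spreads
X_vt_demo = np.column_stack([
    np.full(100, 5.0),            # constant: variance is exactly 0
    rng.normal(0.0, 0.01, 100),   # varies, but variance is around 1e-4
    rng.normal(0.0, 1.0, 100),    # varies on a unit scale
])

vt_demo = VarianceThreshold(threshold=0.01).fit(X_vt_demo)

# Boolean mask of the columns that survive the threshold
print(vt_demo.get_support())
```

If small-scale sensors matter, the threshold may need to be lowered or chosen per feature scale.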


# **4.) Exploratory Data Analysis (EDA)**

4.1) Univariate Analysis

> Example: Distribution of Target Variable



In [None]:
# Plotting the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(x=target_column, data=data_cleaned)
plt.title('Distribution of Target Variable')
plt.show()




> Example: Distribution of a Numerical Feature



In [None]:
# Select a numerical feature for analysis
feature = data_cleaned.columns[0]  # Replace with specific feature name if desired

plt.figure(figsize=(8,6))
sns.histplot(data_cleaned[feature], kde=True, bins=30)
plt.title(f'Distribution of {feature}')
plt.show()


4.2) Bivariate Analysis


> Example: Feature vs. Target Variable


In [None]:
# Boxplot of a feature against target variable
plt.figure(figsize=(8,6))
sns.boxplot(x=target_column, y=feature, data=data_cleaned)
plt.title(f'{feature} vs {target_column}')
plt.show()




> Correlation Heatmap


In [None]:
# Calculate correlation matrix
corr_matrix = data_cleaned.corr()

# Plot heatmap
plt.figure(figsize=(12,10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


4.3) Multivariate Analysis

> Principal Component Analysis (PCA) for Visualization



In [None]:
# Standardize the data before PCA
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cleaned.drop(columns=[target_column]))

# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Create a dataframe with PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
pca_df[target_column] = data_cleaned[target_column].values

# Plot PCA results
plt.figure(figsize=(8,6))
sns.scatterplot(x='PC1', y='PC2', hue=target_column, data=pca_df, palette='viridis')
plt.title('PCA Result')
plt.show()


# **5.) Feature Selection**







>Using Mutual Information



In [None]:
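from sklearn.feature_selection import mutual_info_regression  # SelectKBest is already imported in section 1

# Define feature and target variables
X = data_cleaned.drop(columns=[target_column])
y = data_cleaned[target_column]

# Apply SelectKBest with mutual information.
# Note: mutual_info_regression treats the target as continuous. If the target column
# holds discrete Pass/Fail labels, mutual_info_classif (imported in section 1) is the matching scorer.
selector = SelectKBest(score_func=mutual_info_regression, k=50)  # Select top 50 features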
selector.fit(X, y)

# Get columns to keep
cols = selector.get_support(indices=True)
features_selected = X.columns[cols]

# Create new dataframe with selected features
X_selected = X[features_selected]
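Since the stated goal is a Pass/Fail classifier, a classification-oriented scorer may be the closer fit whenever the target holds discrete labels. The following is a minimal sketch on synthetic data, not the notebook's actual selection step: `X_mi_demo`, `y_mi_demo`, and `mi_selector` are hypothetical names, and `functools.partial` is used only to fix `random_state` so the scores are repeatable.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical binary labels and three features; only the first depends on the label
y_mi_demo = np.array([0, 1] * 150)
X_mi_demo = np.column_stack([
    y_mi_demo + rng.normal(0.0, 0.3, 300),  # informative feature
    rng.normal(0.0, 1.0, 300),              # pure noise
    rng.normal(0.0, 1.0, 300),              # pure noise
])

# Score each feature by its mutual information with the class label and keep the best one
mi_scorer = partial(mutual_info_classif, random_state=0)
mi_selector = SelectKBest(score_func=mi_scorer, k=1).fit(X_mi_demo, y_mi_demo)

print(mi_selector.scores_)
print(mi_selector.get_support())
```

The informative column receives a much higher score than the two noise columns and is the one retained.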


# **6.) Data Preparation**

6.1) Handling Imbalanced Data

In [None]:
# Check target variable distribution
y.value_counts()




> Applying SMOTE



In [None]:
# SMOTE requires discrete class labels. If 'y' holds continuous values,
# binarize it with a threshold first (labels coded as -1/1 are mapped to 0/1).

threshold = 0.01  # Example threshold, adjust as needed
y_discrete = (y > threshold).astype(int)

# Now use SMOTE on the discretized target variable
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_selected, y_discrete)

# Check the distribution after resampling
pd.Series(y_resampled).value_counts()

6.2) Train-Test Split

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y_resampled)
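The cells in this section resample first and split afterwards, which means rows that end up in the test set can influence the synthetic training samples. A common alternative is to split first and oversample only the training portion. The sketch below shows that ordering on synthetic data; it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and all names (`X_os_demo`, `y_os_demo`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Hypothetical imbalanced data: 20 minority rows (label 1) and 180 majority rows (label 0)
X_os_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f0', 'f1', 'f2'])
y_os_demo = pd.Series([1] * 20 + [0] * 180)

# 1) Split first, so the test set keeps the original class balance
X_os_tr, X_os_te, y_os_tr, y_os_te = train_test_split(
    X_os_demo, y_os_demo, test_size=0.2, random_state=42, stratify=y_os_demo)

# 2) Oversample the minority class in the training portion only
majority = X_os_tr[y_os_tr == 0]
minority = X_os_tr[y_os_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)

X_os_tr_bal = pd.concat([majority, minority_up])
y_os_tr_bal = pd.Series([0] * len(majority) + [1] * len(minority_up), index=X_os_tr_bal.index)

print(y_os_tr_bal.value_counts())
print(y_os_te.value_counts())
```

With SMOTE, the same ordering would apply: call `fit_resample` on the training split and leave the test split untouched, so the reported metrics reflect the original class distribution.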


6.3) Feature Scaling

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Fit and transform training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform testing data
X_test_scaled = scaler.transform(X_test)


# **7.) Model Training and Evaluation**

## 7.1) Baseline Models

7.1.1) Random Forest Classifier

In [None]:
# Initialize Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model
rf_classifier.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_rf = rf_classifier.predict(X_test_scaled)

# Evaluate the model
print("Random Forest Classifier Report:")
print(classification_report(y_test, y_pred_rf))


7.1.2) Support Vector Machine (SVM) Classifier

In [None]:
# Initialize SVM classifier
svm_classifier = SVC(random_state=42)

# Train the model
svm_classifier.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_svm = svm_classifier.predict(X_test_scaled)

# Evaluate the model
print("SVM Classifier Report:")
print(classification_report(y_test, y_pred_svm))


7.1.3) Naive Bayes Classifier

In [None]:
# Initialize Naive Bayes classifier
nb_classifier = GaussianNB()

# Train the model
nb_classifier.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_nb = nb_classifier.predict(X_test_scaled)

# Evaluate the model
print("Naive Bayes Classifier Report:")
print(classification_report(y_test, y_pred_nb))


## 7.2) Hyperparameter Tuning

7.2.1) Hyperparameter Tuning for Random Forest

In [None]:
# Define parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 50],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Initialize Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize Grid Search
rf_grid_search = GridSearchCV(estimator=rf_classifier,
                              param_grid=rf_param_grid,
                              cv=skf,
                              scoring='f1',
                              n_jobs=-1,
                              verbose=1)

# Fit Grid Search to the data
rf_grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best Parameters for Random Forest:", rf_grid_search.best_params_)
print("Best F1 Score for Random Forest:", rf_grid_search.best_score_)




> Evaluate Tuned Random Forest Model



In [None]:
# Predict on test data with best estimator
y_pred_rf_best = rf_grid_search.best_estimator_.predict(X_test_scaled)

# Evaluation metrics
print("Tuned Random Forest Classifier Report:")
print(classification_report(y_test, y_pred_rf_best))

# Confusion matrix
conf_mat_rf = confusion_matrix(y_test, y_pred_rf_best)
sns.heatmap(conf_mat_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


7.2.2) Hyperparameter Tuning for SVM

In [None]:
# Define parameter grid for SVM
svm_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Initialize Grid Search
svm_grid_search = GridSearchCV(estimator=svm_classifier,
                               param_grid=svm_param_grid,
                               cv=skf,
                               scoring='f1',
                               n_jobs=-1,
                               verbose=1)

# Fit Grid Search to the data
svm_grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best Parameters for SVM:", svm_grid_search.best_params_)
print("Best F1 Score for SVM:", svm_grid_search.best_score_)




> Evaluate Tuned SVM Model



In [None]:
# Predict on test data with best estimator
y_pred_svm_best = svm_grid_search.best_estimator_.predict(X_test_scaled)

# Evaluation metrics
print("Tuned SVM Classifier Report:")
print(classification_report(y_test, y_pred_svm_best))

# Confusion matrix
conf_mat_svm = confusion_matrix(y_test, y_pred_svm_best)
sns.heatmap(conf_mat_svm, annot=True, fmt='d', cmap='Greens')
plt.title('Confusion Matrix - SVM')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


7.2.3) Hyperparameter Tuning for Naive Bayes

> Note: Naive Bayes has few hyperparameters. The grid below tunes only var_smoothing; priors can also be adjusted.





In [None]:
# Define parameter grid for Naive Bayes
nb_param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7]
}

# Initialize Grid Search
nb_grid_search = GridSearchCV(estimator=nb_classifier,
                              param_grid=nb_param_grid,
                              cv=skf,
                              scoring='f1',
                              n_jobs=-1,
                              verbose=1)

# Fit Grid Search to the data
nb_grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best Parameters for Naive Bayes:", nb_grid_search.best_params_)
print("Best F1 Score for Naive Bayes:", nb_grid_search.best_score_)
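The note at the start of this subsection mentions priors as a second tunable parameter of Gaussian Naive Bayes. A possible way to include it in the search is sketched here on synthetic data. The candidate prior values and the names `X_nb_demo`, `y_nb_demo`, and `nb_demo_search` are assumptions chosen for illustration, not values taken from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced binary problem (roughly 80% / 20%)
X_nb_demo, y_nb_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.8, 0.2], random_state=42)

# Search over both smoothing and class priors; None lets the model estimate priors from the data
nb_demo_grid = {
    'var_smoothing': np.logspace(-9, -1, 5),
    'priors': [None, [0.5, 0.5], [0.8, 0.2]],
}

nb_demo_search = GridSearchCV(
    estimator=GaussianNB(),
    param_grid=nb_demo_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
)
nb_demo_search.fit(X_nb_demo, y_nb_demo)

print(nb_demo_search.best_params_)
```

Each list under `priors` must have one entry per class and sum to 1.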




> Evaluate Tuned Naive Bayes Model



In [None]:
# Predict on test data with best estimator
y_pred_nb_best = nb_grid_search.best_estimator_.predict(X_test_scaled)

# Evaluation metrics
print("Tuned Naive Bayes Classifier Report:")
print(classification_report(y_test, y_pred_nb_best))

# Confusion matrix
conf_mat_nb = confusion_matrix(y_test, y_pred_nb_best)
sns.heatmap(conf_mat_nb, annot=True, fmt='d', cmap='Oranges')
plt.title('Confusion Matrix - Naive Bayes')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


## 7.3) Model Comparison

In [None]:
# Create a dataframe to compare models
model_comparison = pd.DataFrame({
    'Model': ['Random Forest', 'SVM', 'Naive Bayes'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_rf_best),
        accuracy_score(y_test, y_pred_svm_best),
        accuracy_score(y_test, y_pred_nb_best)
    ],
    'F1 Score': [
        f1_score(y_test, y_pred_rf_best),
        f1_score(y_test, y_pred_svm_best),
        f1_score(y_test, y_pred_nb_best)
    ],
    'ROC AUC': [
        roc_auc_score(y_test, rf_grid_search.best_estimator_.predict_proba(X_test_scaled)[:,1]),
        roc_auc_score(y_test, svm_grid_search.best_estimator_.decision_function(X_test_scaled)),
        roc_auc_score(y_test, nb_grid_search.best_estimator_.predict_proba(X_test_scaled)[:,1])
    ]
})

model_comparison




> Visual Comparison



In [None]:
# Plotting model comparison (set_index returns a copy, so this cell can be re-run safely)
model_comparison.set_index('Model').plot.bar(figsize=(10,6))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.ylim(0,1)
plt.legend(loc='lower right')
plt.show()


# **8.) Saving the Best Model**

In [None]:
# Assuming Random Forest performed the best
best_model = rf_grid_search.best_estimator_

# Save the model to a file
joblib.dump(best_model, 'best_model.pkl')

# Save the scaler as well
joblib.dump(scaler, 'scaler.pkl')




> Loading the Saved Model



In [None]:
# Load the saved model
loaded_model = joblib.load('best_model.pkl')

# Load the saved scaler
loaded_scaler = joblib.load('scaler.pkl')

# Example prediction
sample_data = X_test.iloc[[0]]  # One-row DataFrame; replace with an actual sample
sample_data_scaled = loaded_scaler.transform(sample_data)
prediction = loaded_model.predict(sample_data_scaled)
print(f'Predicted Class: {prediction[0]}')
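Saving the model and the scaler as two separate files works, but the two artifacts have to be kept in sync by hand. One alternative, sketched here under the assumption that a single scikit-learn `Pipeline` is acceptable for deployment, is to bundle both steps and persist one object. The data, the file name `yield_pipeline.pkl`, and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the prepared training data
X_pipe_demo, y_pipe_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Scaling and classification bundled into one estimator
demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
demo_pipeline.fit(X_pipe_demo, y_pipe_demo)

# Persist and reload a single artifact (a temporary directory keeps the example tidy)
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, 'yield_pipeline.pkl')
    joblib.dump(demo_pipeline, pipeline_path)
    restored_pipeline = joblib.load(pipeline_path)

# The restored pipeline scales and predicts in one call
predictions_match = (restored_pipeline.predict(X_pipe_demo) == demo_pipeline.predict(X_pipe_demo)).all()
print(predictions_match)
```

Raw, unscaled samples can then be passed straight to `restored_pipeline.predict`.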


# **9.) Conclusion and Future Work**

Conclusions:

* The Random Forest classifier with optimized hyperparameters achieved the best performance with an accuracy of X%, F1 score of Y%, and ROC AUC of Z%.

* Feature selection significantly reduced dimensionality while maintaining model performance, indicating that not all features are necessary for effective prediction.

* SMOTE effectively handled class imbalance, improving the model's ability to correctly predict minority class instances.

Future Work:

* Feature Engineering: Explore creating new features or transforming existing ones to capture more complex relationships.

* Advanced Models: Experiment with other advanced algorithms like Gradient Boosting Machines (e.g., XGBoost, LightGBM) for potential performance improvements.

* Ensemble Methods: Combine predictions from multiple models to enhance robustness and accuracy.

* Real-Time Deployment: Integrate the model into production systems for real-time yield prediction and monitoring.

* Continuous Learning: Implement mechanisms for the model to learn from new data over time, maintaining and improving performance.
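As one possible starting point for the ensemble idea listed under Future Work, the three model families used in this notebook could be combined with soft voting. This is only a sketch on synthetic data: the names `X_ens_demo`, `y_ens_demo`, and `voting_demo` are hypothetical, the hyperparameters are library defaults rather than the tuned values, and `SVC` needs `probability=True` for soft voting to work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for the prepared data
X_ens_demo, y_ens_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_ens_tr, X_ens_te, y_ens_tr, y_ens_te = train_test_split(
    X_ens_demo, y_ens_demo, test_size=0.25, random_state=42, stratify=y_ens_demo)

# Soft voting averages the predicted class probabilities of the three members
voting_demo = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
        ('nb', GaussianNB()),
    ],
    voting='soft',
)
voting_demo.fit(X_ens_tr, y_ens_tr)

# One probability row per test sample, one column per class
ens_proba = voting_demo.predict_proba(X_ens_te)
print(ens_proba[:3])
```

Whether such an ensemble improves on the best single model would need to be checked on held-out data.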

