# Semiconductor Manufacturing - Test Result Classification Project

## Model Development

This notebook covers model development and saves the results to the /results folder.

Models used: Logistic Regression / Support Vector Classifier / Random Forest Classifier / XGBoost Classifier

ML pipeline structure:

- Data preprocessing
- Standardization

- For each random state:

    - Stratified train/test split
    - Stratified 3-fold CV inside GridSearchCV
      - Optimized for F1.5
  
    - Save mean/std scores
    - Hyperparameter tuning
  
- Output the 5 best models (one per random state) for each ML algorithm -> /results

### Environment Set-up

In [2]:
# data wrangling
import numpy as np 
import pandas as pd 

# plot
import matplotlib.pyplot as plt
import seaborn as sns

# data prep
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, make_scorer, f1_score, recall_score, roc_auc_score, accuracy_score, fbeta_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# models
import joblib
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, OneClassSVM
from sklearn.linear_model import LogisticRegression, Lasso
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# to avoid warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

### Data Source Description:

- Data: **UCI SECOM Dataset (Kaggle)**

    - Data from a semiconductor manufacturing process

    - Number of Instances: 1567

    - Number of Attributes: 591
    
    - Missing Values? Yes

- UCI Link: https://archive.ics.uci.edu/dataset/179/secom
  
- Kaggle Link: https://www.kaggle.com/datasets/paresh2047/uci-semcom

- Associated Tasks: **Classification**

- **Imbalanced data**: 93.4% (Pass) / 6.6% (Fail)


### Load data

In [3]:
# reading the data
data = pd.read_csv('../data/uci-secom.csv')

# we have 1,567 rows and 592 columns
# print(data.shape)

# checking the first 5 rows
# data.head(5)

### Data Preprocess

1. **Drop constant features**: 

    - Drop 116 features that contain a single constant value (dropped before the split)

2. **Drop strongly correlated features**

    - Drop any feature whose absolute correlation with another feature exceeds 0.9

3. **Replace missing data with 0**:  

    - According to the data source provider, the absence of a signal (the feature value) is assumed to mean **NO** signal, so I replace the null values with 0.  

    - Missing data is replaced with 0 only for RF, Logistic Regression, and SVC.

        - Note: For XGBoost, I leave it to the model itself to handle the missing values. 
        
4. **Feature Selection**: 

    - Use built-in feature selection: XGBoost, RF


In [56]:
# 1. Drop constant columns (columns with a single unique value)
unique_value_columns = data.columns[data.nunique() == 1]
data_cleaned = data.drop(columns=unique_value_columns) 
print("Data Shape after dropping univariate columns:", data_cleaned.shape)

Data Shape after dropping univariate columns: (1567, 476)
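
One detail of the constant-column filter is worth keeping in mind: `DataFrame.nunique()` ignores missing values by default, so a column with one distinct value plus NaNs counts as constant, while a column that is entirely NaN has zero distinct values and is kept. The sketch below illustrates this on a toy frame (the column names are made up, not taken from SECOM) and shows `<= 1` as a possible variant if all-NaN columns should be dropped as well.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are hypothetical and only illustrate the behavior.
toy = pd.DataFrame({
    "constant": [5.0, 5.0, 5.0],
    "constant_with_nan": [5.0, np.nan, 5.0],
    "all_nan": [np.nan, np.nan, np.nan],
    "varying": [1.0, 2.0, 3.0],
})

n_unique = toy.nunique()  # dropna=True by default, so NaN is not counted

# Filter used in the notebook: exactly one distinct non-missing value.
dropped_eq_1 = list(toy.columns[n_unique == 1])

# Possible variant: also drop columns with no observed value at all.
dropped_le_1 = list(toy.columns[n_unique <= 1])

print(dropped_eq_1)
print(dropped_le_1)
```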


In [57]:
# 2. Drop features whose absolute correlation with another feature exceeds 0.9
corr_matrix = data_cleaned.iloc[:, 1:].corr()
strong_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            strong_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

# Drop the second feature of every strongly correlated pair (|corr| > 0.9)
features_to_drop = {feature2 for _, feature2, _ in strong_corr_pairs}

data_cleaned = data_cleaned.drop(columns=features_to_drop)
print("New data shape after dropping strongly correlated features:", data_cleaned.shape)


New data shape after dropping strongly correlated features: (1567, 256)
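
The nested loop above visits every column pair once and marks the later column of each strongly correlated pair. A vectorized version of the same rule can be sketched with the strict upper triangle of the absolute correlation matrix. The helper name `correlated_columns_to_drop` and the toy data are assumptions made for illustration; this is not the code that produced the shape printed above.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(df, threshold=0.9):
    """Return columns whose |corr| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair (i < j) is seen once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # Column j is dropped if any earlier column i is strongly correlated with it.
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,              # perfectly correlated with "a"
    "c": rng.normal(size=200),   # independent noise
    "d": -a,                     # perfectly anti-correlated with "a"
})

to_drop = correlated_columns_to_drop(toy)
print(to_drop)
```

As in the loop, the first column of a correlated group is kept and the later ones are dropped, so the result depends on column order.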


In [None]:
# 3. Replace missing values with 0 for RF, Logistic Regression, and SVC
df_na_0 = data_cleaned.fillna(0)  # for RF, Logistic Regression, and SVC
df_na = data_cleaned  # for XGBoost (missing values are kept)

In [None]:
# 4. Select X,Y data
y = data_cleaned['Pass/Fail']
y = np.where(y == -1, 0, 1)  # map -1 (pass) to 0 and 1 (fail) to 1
X_0 = df_na_0.drop(columns=['Pass/Fail', 'Time'])
X_XGB = df_na.drop(columns=['Pass/Fail', 'Time'])  # for XGBoost
print("Final X shape:",X_0.shape)

Final X shape: (1567, 254)


### ML Pipeline Function

1. **Split & Cross Validation Strategy**:

    - This function splits the data into other/test sets (80/20), stratified by the target.
  
    - Stratified K-fold with 3 folds is applied to 'other'.

2. **Evaluation Metrics**: 

    - Use the F_1.5 score: the F_1.5 score is maximized through cross-validation during grid search.
  
      - Why F_1.5: It puts more weight on recall, since failure detection matters more in this situation. 

3. **Function Parameters**:

    - X: Features dataframe

    - y: Target variable

    - ML_algo: Model to train

    - param_grid: Hyperparameter grid for GridSearchCV 
        
4. **Returns**:

    - ml_results: Contains the best-model information (*iter_results*) from each of the 5 iterations for one specific ML algorithm and parameter grid.
  
    - iter_results: 
        ```
            iter_results['grid'] = grid
            iter_results['best_params'] = grid.best_params_
            iter_results['validation_score'] = grid.best_score_
            iter_results['test_scores'] = fbeta_score(y_test, y_test_pred,beta=1.5, pos_label=1)
            iter_results['y_test'] = y_test
            iter_results['y_test_pred'] = y_test_pred
            iter_results['cv_results'] = grid.cv_results_
        ```
5. **Reproducibility and Randomness**:

    - Random states: `range(5)`
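
For reference, the F-beta score combines precision P and recall R as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), so beta = 1.5 weights recall more heavily than precision. The sketch below checks the formula against `fbeta_score` on a small set of invented labels (1 = fail, 0 = pass); the data is illustrative only.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels (1 = fail, 0 = pass); values are made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])

beta = 1.5
precision = precision_score(y_true, y_pred, pos_label=1)  # 3 TP / (3 TP + 3 FP)
recall = recall_score(y_true, y_pred, pos_label=1)        # 3 TP / (3 TP + 1 FN)

# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library = fbeta_score(y_true, y_pred, beta=beta, pos_label=1)

print(manual, library)
```

With precision 0.5 and recall 0.75, both values come out to 0.65, which is higher than the plain F1 of 0.6 because recall exceeds precision here.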

In [8]:
def ML_StratifiedKFold_f_score(X, y, ML_algo, param_grid):

    ml_results = []
    final_models = []   

    random_states = range(5) # Random states for reproducibility
    for random_state in random_states:
        iter_results = {'random_state': random_state}        
        print("Random State:", random_state + 1) 

        X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state, stratify=y)

        skf = StratifiedKFold(n_splits=3,shuffle=True,random_state=random_state)

        steps = [('scaler', StandardScaler())]
        steps.append(('model', ML_algo))
        pipe = Pipeline(steps)
        score = make_scorer(fbeta_score, beta=1.5, pos_label=1)

        grid = GridSearchCV(pipe, param_grid=param_grid, cv=skf, scoring=score, return_train_score=True, verbose=True, n_jobs=-1)
        grid.fit(X_other, y_other)

        iter_results['grid'] = grid
        iter_results['best_params'] = grid.best_params_
        iter_results['validation_score'] = grid.best_score_

        final_models.append(grid)
        # Predict on the held-out test set with the refit best pipeline
        y_test_pred = grid.best_estimator_.predict(X_test)
        iter_results['test_scores'] = fbeta_score(y_test, y_test_pred,beta=1.5, pos_label=1)
        iter_results['y_test'] = y_test
        iter_results['y_test_pred'] = y_test_pred
        iter_results['cv_results'] = grid.cv_results_
        ml_results.append(iter_results)

        print("Best Model Parameters:", iter_results['best_params'])
        print('validation score:',iter_results['validation_score'])
        print("Test F1.5-score:", iter_results['test_scores'])
        
    return ml_results
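
One way to obtain mean/std scores across random states from the returned list is sketched below. `summarize_results` is a hypothetical helper that is not part of this notebook, and the scores in the example are invented stand-ins for real `ml_results` entries.

```python
import numpy as np

def summarize_results(ml_results):
    """Mean and standard deviation of validation and test scores across runs."""
    val = np.array([r["validation_score"] for r in ml_results])
    test = np.array([r["test_scores"] for r in ml_results])
    return {
        "validation_mean": val.mean(),
        "validation_std": val.std(),
        "test_mean": test.mean(),
        "test_std": test.std(),
    }

# Stand-in for the list returned by ML_StratifiedKFold_f_score; scores are invented.
fake_results = [
    {"validation_score": 0.30, "test_scores": 0.25},
    {"validation_score": 0.32, "test_scores": 0.35},
    {"validation_score": 0.28, "test_scores": 0.30},
]

summary = summarize_results(fake_results)
print(summary)
```

Note that `np.ndarray.std` defaults to the population standard deviation (`ddof=0`); pass `ddof=1` if the sample standard deviation is preferred.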

### Training & Hyperparameter Tuning

- Logistic Regression
- SVC
- Random Forest Classifier
- XGBoost

#### Logistic Regression

In [47]:
# Logistic Regression parameter grid 
lr_param_grid = [
    {
        'model__solver': ['liblinear'],  # liblinear supports only l1 and l2 penalties
        'model__penalty': ['l1', 'l2'], 
        'model__C': [0.01, 0.1, 1, 10, 100] 
    },
    {
        'model__solver': ['saga'],  # saga supports elastic net
        'model__penalty': ['elasticnet'],  
        'model__C': [0.01, 0.1, 1, 10, 100],
        'model__l1_ratio': [0.5, 0.7]  # elastic net mixing ratio (share of l1)
    }]

lr_model = LogisticRegression(max_iter=100000, class_weight='balanced')

lr_results = ML_StratifiedKFold_f_score(X_0, y, lr_model, lr_param_grid)

Random State: 1
Fitting 3 folds for each of 20 candidates, totalling 60 fits




Best Model Parameters: {'model__C': 0.01, 'model__l1_ratio': 0.7, 'model__penalty': 'elasticnet', 'model__solver': 'saga'}
validation score: 0.310185169150392
Test F1.5-score: 0.2765957446808511
Random State: 2
Fitting 3 folds for each of 20 candidates, totalling 60 fits




Best Model Parameters: {'model__C': 0.01, 'model__l1_ratio': 0.7, 'model__penalty': 'elasticnet', 'model__solver': 'saga'}
validation score: 0.30145158118948445
Test F1.5-score: 0.3797752808988764
Random State: 3
Fitting 3 folds for each of 20 candidates, totalling 60 fits




Best Model Parameters: {'model__C': 0.01, 'model__l1_ratio': 0.7, 'model__penalty': 'elasticnet', 'model__solver': 'saga'}
validation score: 0.2811287267115584
Test F1.5-score: 0.3603411513859275
Random State: 4
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Model Parameters: {'model__C': 0.01, 'model__l1_ratio': 0.5, 'model__penalty': 'elasticnet', 'model__solver': 'saga'}
validation score: 0.27369263989300846
Test F1.5-score: 0.2392638036809816
Random State: 5
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Model Parameters: {'model__C': 0.01, 'model__penalty': 'l1', 'model__solver': 'liblinear'}
validation score: 0.29093883467154485
Test F1.5-score: 0.2914798206278027


In [48]:
import joblib
joblib.dump(lr_results, '../results/lr_results.pkl')

['../results/lr_results.pkl']
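
The saved list can be read back with `joblib.load` for later comparison. The sketch below shows the round trip on a small stand-in list written to a temporary directory; the file name and the dictionary contents are placeholders, not the notebook's actual results.

```python
import os
import tempfile

import joblib

# Stand-in for ml_results; keys mirror iter_results, values are invented.
results = [
    {"random_state": 0, "best_params": {"model__C": 0.01}, "test_scores": 0.28},
    {"random_state": 1, "best_params": {"model__C": 0.1}, "test_scores": 0.38},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_results.pkl")  # hypothetical file name
    joblib.dump(results, path)
    reloaded = joblib.load(path)

# Pick the run with the highest test score.
best = max(reloaded, key=lambda r: r["test_scores"])
print(best["random_state"], best["best_params"])
```

Because the real result lists also hold fitted `GridSearchCV` objects, loading them is only expected to work reliably with the same scikit-learn (and XGBoost) versions that were used to save them.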

#### Support Vector Classifier



In [42]:
svc_param_grid = [{
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}]

svc_model = SVC(probability=True, class_weight='balanced')

svc_results = ML_StratifiedKFold_f_score(X_0, y, svc_model, svc_param_grid)

Random State: 1
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best Model Parameters: {'model__C': 1, 'model__gamma': 0.001}
validation score: 0.25761848109349733
Test F1.5-score: 0.2869757174392936
Random State: 2
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best Model Parameters: {'model__C': 1, 'model__gamma': 0.001}
validation score: 0.2612216093747098
Test F1.5-score: 0.24940047961630696
Random State: 3
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best Model Parameters: {'model__C': 0.01, 'model__gamma': 0.001}
validation score: 0.1873506159220445
Test F1.5-score: 0.18892733564013842
Random State: 4
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best Model Parameters: {'model__C': 1, 'model__gamma': 0.001}
validation score: 0.24175515305130857
Test F1.5-score: 0.3341902313624679
Random State: 5
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best Model Parameters: {'model__C': 1, 'model__gamma': 0.001}
valida

In [49]:
joblib.dump(svc_results, '../results/svc_results.pkl')

['../results/svc_results.pkl']

#### Random Forest Classifier

In [41]:
rf_param_grid = {
    'model__max_depth': [2, 3, 4, 5, 10, 100],
    # note: the integer 1 means one feature per split; the float 1.0 would mean all features
    'model__max_features': [0.25, 0.5, 0.75, 1]}
rf_model = RandomForestClassifier(class_weight='balanced', n_jobs=-1)
rf_results = ML_StratifiedKFold_f_score(X_0, y, rf_model, rf_param_grid)

Random State: 1
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Model Parameters: {'model__max_depth': 2, 'model__max_features': 0.5}
validation score: 0.2824915983879206
Test F1.5-score: 0.3302540415704388
Random State: 2
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Model Parameters: {'model__max_depth': 2, 'model__max_features': 0.75}
validation score: 0.2361911755954796
Test F1.5-score: 0.41379310344827586
Random State: 3
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Model Parameters: {'model__max_depth': 2, 'model__max_features': 0.75}
validation score: 0.24690804600493868
Test F1.5-score: 0.3117505995203837
Random State: 4
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Model Parameters: {'model__max_depth': 2, 'model__max_features': 0.75}
validation score: 0.26107174372418684
Test F1.5-score: 0.2805755395683453
Random State: 5
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Model Parame

In [50]:
joblib.dump(rf_results, '../results/rf_results.pkl')

['../results/rf_results.pkl']

#### XGBoost

In [52]:
# Ratio of negative (pass, 0) to positive (fail, 1) samples, used to up-weight the minority class
negative_instances = len(y) - sum(y)
positive_instances = sum(y)
scale_pos_weight = negative_instances / positive_instances

# XGBoost parameter grid
xgb_param_grid = {
    'model__max_depth': [1, 2, 3, 5, 10],
    'model__learning_rate': [0.01, 0.1, 0.3, 1],
    'model__subsample': [0.5, 0.7, 0.9],
    'model__colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9]
    }

xgb_model = XGBClassifier(scale_pos_weight=scale_pos_weight, eval_metric='logloss')

xgb_results = ML_StratifiedKFold_f_score(X_XGB, y, xgb_model, xgb_param_grid)

Random State: 1
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best Model Parameters: {'model__colsample_bytree': 0.5, 'model__learning_rate': 0.01, 'model__max_depth': 1, 'model__subsample': 0.5}
validation score: 0.32369811566415657
Test F1.5-score: 0.3023255813953488
Random State: 2
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best Model Parameters: {'model__colsample_bytree': 0.7, 'model__learning_rate': 0.1, 'model__max_depth': 1, 'model__subsample': 0.9}
validation score: 0.32334400456116735
Test F1.5-score: 0.3797752808988764
Random State: 3
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best Model Parameters: {'model__colsample_bytree': 0.9, 'model__learning_rate': 0.01, 'model__max_depth': 1, 'model__subsample': 0.7}
validation score: 0.2949882939889226
Test F1.5-score: 0.3443708609271523
Random State: 4
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best Model Parameters: {'model__colsample_bytree': 0.5, 'model

In [53]:
joblib.dump(xgb_results, '../results/xgb_results.pkl')

['../results/xgb_results.pkl']
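
In the XGBoost cell above, `scale_pos_weight` is the ratio of negative to positive labels computed on the full label vector. Because the splits are stratified, the ratio inside each training split should be close to this value, but it could also be computed per split. The sketch below shows the calculation on toy labels; `pos_weight` is a hypothetical helper name and the label counts are invented.

```python
import numpy as np

def pos_weight(labels):
    """Ratio of negative (0) to positive (1) samples in a binary label array."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos

# Toy labels: 14 passes (0) and 1 fail (1); counts are invented.
toy_y = np.array([0] * 14 + [1])
weight = pos_weight(toy_y)
print(weight)
```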