## Predictive Modeling for Defect Reduction
#### Introduction

In modern manufacturing, minimizing defects is essential for product quality, customer satisfaction, and operational efficiency. Data science techniques address this challenge by using historical data to predict and mitigate defects in production processes. This report presents an approach to predicting manufacturing defects from a dataset covering production metrics, supply chain factors, quality control assessments, maintenance schedules, workforce productivity indicators, energy consumption patterns, and additive manufacturing specifics.


In [2]:
#pip install scikit-learn

In [3]:
#pip install imbalanced-learn

In [None]:
pip install matplotlib

In [19]:
# import all necessary libraries
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [5]:
# loading the file 
file_path = "manufacturing_defect_dataset.csv"
manufacturing_data = pd.read_csv(file_path)

In [6]:
# displaying basic information of the data 
manufacturing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ProductionVolume      3240 non-null   int64  
 1   ProductionCost        3240 non-null   float64
 2   SupplierQuality       3240 non-null   float64
 3   DeliveryDelay         3240 non-null   int64  
 4   DefectRate            3240 non-null   float64
 5   QualityScore          3240 non-null   float64
 6   MaintenanceHours      3240 non-null   int64  
 7   DowntimePercentage    3240 non-null   float64
 8   InventoryTurnover     3240 non-null   float64
 9   StockoutRate          3240 non-null   float64
 10  WorkerProductivity    3240 non-null   float64
 11  SafetyIncidents       3240 non-null   int64  
 12  EnergyConsumption     3240 non-null   float64
 13  EnergyEfficiency      3240 non-null   float64
 14  AdditiveProcessTime   3240 non-null   float64
 15  AdditiveMaterialCost 

In [7]:
# display the first five rows of the data
manufacturing_data.head()

Unnamed: 0,ProductionVolume,ProductionCost,SupplierQuality,DeliveryDelay,DefectRate,QualityScore,MaintenanceHours,DowntimePercentage,InventoryTurnover,StockoutRate,WorkerProductivity,SafetyIncidents,EnergyConsumption,EnergyEfficiency,AdditiveProcessTime,AdditiveMaterialCost,DefectStatus
0,202,13175.403783,86.648534,1,3.121492,63.463494,9,0.052343,8.630515,0.081322,85.042379,0,2419.616785,0.468947,5.551639,236.439301,1
1,535,19770.046093,86.310664,4,0.819531,83.697818,20,4.908328,9.296598,0.038486,99.657443,7,3915.566713,0.119485,9.080754,353.957631,1
2,960,19060.820997,82.132472,0,4.514504,90.35055,1,2.464923,5.097486,0.002887,92.819264,2,3392.385362,0.496392,6.562827,396.189402,1
3,370,5647.606037,87.335966,5,0.638524,67.62869,8,4.692476,3.577616,0.055331,96.887013,8,4652.400275,0.183125,8.097496,164.13587,1
4,206,7472.222236,81.989893,3,3.867784,82.728334,9,2.746726,6.851709,0.068047,88.315554,7,1581.630332,0.263507,6.406154,365.708964,1


In [8]:
# count missing values per column
print(manufacturing_data.isnull().sum()) 

ProductionVolume        0
ProductionCost          0
SupplierQuality         0
DeliveryDelay           0
DefectRate              0
QualityScore            0
MaintenanceHours        0
DowntimePercentage      0
InventoryTurnover       0
StockoutRate            0
WorkerProductivity      0
SafetyIncidents         0
EnergyConsumption       0
EnergyEfficiency        0
AdditiveProcessTime     0
AdditiveMaterialCost    0
DefectStatus            0
dtype: int64


### There are no missing values in the dataset 

In [9]:
# Check the percentage of missing values 
missing_values_percentage = (manufacturing_data.isnull().mean() * 100 )
print(missing_values_percentage)

ProductionVolume        0.0
ProductionCost          0.0
SupplierQuality         0.0
DeliveryDelay           0.0
DefectRate              0.0
QualityScore            0.0
MaintenanceHours        0.0
DowntimePercentage      0.0
InventoryTurnover       0.0
StockoutRate            0.0
WorkerProductivity      0.0
SafetyIncidents         0.0
EnergyConsumption       0.0
EnergyEfficiency        0.0
AdditiveProcessTime     0.0
AdditiveMaterialCost    0.0
DefectStatus            0.0
dtype: float64


In [10]:
# check for data imbalance 
defects_counts = manufacturing_data['DefectStatus'].value_counts()
print("Defect counts: ", defects_counts)

Defect counts:  DefectStatus
1    2723
0     517
Name: count, dtype: int64


#### The dataset contains 2723 defective and 517 non-defective records, so it is imbalanced. I can use SMOTE (Synthetic Minority Oversampling Technique) to resolve the issue
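
To put the imbalance in numbers, the short sketch below recomputes the class shares from the two counts printed above. It is only illustrative arithmetic; the variable names are not part of the original notebook.

```python
# Counts taken from the value_counts() output above.
counts = {1: 2723, 0: 517}
total = sum(counts.values())

for label, n in counts.items():
    print(f"class {label}: {n} rows ({n / total:.1%})")

# How many defective rows exist for each non-defective row.
imbalance_ratio = counts[1] / counts[0]
# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
print(f"accuracy of always predicting class 1: {majority_baseline:.1%}")
```

Because always predicting class 1 would already be right for about 84% of the rows, accuracy alone says little here, and the per-class recall values deserve more attention.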

In [11]:
from imblearn.over_sampling import SMOTE

In [12]:
# separate features (X) and target (y)
X = manufacturing_data.drop('DefectStatus', axis=1)
y = manufacturing_data['DefectStatus']

In [13]:
# feature selection: keep the 10 features with the highest chi-squared scores
# (chi2 requires non-negative feature values)
from sklearn.feature_selection import SelectKBest, chi2
best_features = SelectKBest(score_func=chi2, k=10)
X_best = best_features.fit_transform(X,y)

In [14]:
# split the selected features into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_best, y, test_size=0.2, random_state=42)
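
In the cells above, the feature selector is fitted on the full dataset before the split, so the test rows influence which features are kept. One possible way to avoid this is to split first and wrap every preprocessing step in a pipeline. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the CSV, swaps `chi2` for `f_classif` (because standardized features can be negative), and leaves SMOTE out to stay within scikit-learn. With imbalanced-learn installed, `imblearn.pipeline.Pipeline` accepts a sampler as a step in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the manufacturing data (assumption: 16 numeric
# features and roughly 84% positives, mirroring the class balance above).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=6,
    weights=[0.16, 0.84], random_state=42,
)

# Split first, stratifying so both sets keep the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Every preprocessing step lives inside the pipeline, so it is fitted on
# the training rows only when .fit() is called.
leak_free_model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
leak_free_model.fit(X_tr, y_tr)

test_accuracy = leak_free_model.score(X_te, y_te)
n_selected = int(leak_free_model.named_steps["select"].get_support().sum())
print(f"held-out accuracy: {test_accuracy:.3f}, features kept: {n_selected}")
```

The same pipeline object can be passed to `RandomizedSearchCV`, in which case scaling and selection are refitted inside each cross-validation fold.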

In [15]:
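# scale features, rebalance the training set, and train the model
scaler = StandardScaler()  # initialize scaler
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training set only
X_test_scaled = scaler.transform(X_test)  # reuse the training-set statistics

# apply SMOTE to the training set only, so the test set keeps its real class balance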
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_scaled, y_train)

# initialize Random forest classifier 
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_resampled,y_resampled)
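
For intuition about what SMOTE does to the training set, the sketch below reproduces its core idea in plain Python: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours. This is a simplified illustration, not the imbalanced-learn implementation: it uses only the single nearest neighbour (the library picks randomly among several), and the three data points are hypothetical. The size arithmetic assumes SMOTE's default behaviour of oversampling the minority class up to the majority count, and derives the training counts from the test-set supports (102 and 546) reported in the evaluation output.

```python
import random

def nearest_neighbor(point, candidates):
    """Closest other sample by squared Euclidean distance."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

# Hypothetical scaled minority-class rows (two features each).
minority = [[0.0, 0.0], [1.0, 0.5], [0.2, 1.0]]
rng = random.Random(42)

base = minority[0]
neighbor = nearest_neighbor(base, minority)
synthetic = smote_like_sample(base, neighbor, rng)
print("neighbor:", neighbor, "synthetic:", synthetic)

# Expected size of the resampled training set in this notebook.
train_minority = 517 - 102    # non-defective rows left for training
train_majority = 2723 - 546   # defective rows left for training
synthetic_needed = train_majority - train_minority
resampled_rows = 2 * train_majority
print(f"synthetic rows: {synthetic_needed}, resampled rows: {resampled_rows}")
```

Under these assumptions, about 1762 of the 4354 rows used to train the forest are interpolated rather than observed.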

In [None]:
# evaluate the model 
# predictions 
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
y_pred = rf_classifier.predict(X_test_scaled)
# print classification report 
print("Classification report:\n", classification_report(y_test, y_pred))
#print confusion matrix 
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# print accuracy score 
print("Accuracy score:\n", accuracy_score(y_test, y_pred))

Classification report:
               precision    recall  f1-score   support

           0       0.91      0.77      0.84       102
           1       0.96      0.99      0.97       546

    accuracy                           0.95       648
   macro avg       0.93      0.88      0.90       648
weighted avg       0.95      0.95      0.95       648

Confusion matrix:
 [[ 79  23]
 [  8 538]]
Accuracy score:
 0.9521604938271605


#### The model identifies class 1 (defective parts) well, with a recall of 99% and an F1-score of 97%. It still needs improvement on class 0 (non-defective parts): recall is only 77%, meaning 79 of the 102 non-defective parts were recognized and 23 were misclassified as defective.
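
The figures in the classification report can be recomputed by hand from the confusion matrix, which makes it easier to see where each percentage comes from. The helper below is a small illustrative function (its name is not from scikit-learn); it assumes the usual scikit-learn layout with true labels on rows and predicted labels on columns.

```python
# Confusion matrix from the output above: rows are true labels, columns are
# predicted labels, in the order [class 0, class 1].
cm = [[79, 23],
      [8, 538]]

def per_class_metrics(matrix, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    actual = sum(matrix[cls])                    # row sum: true members of cls
    predicted = sum(row[cls] for row in matrix)  # column sum: predicted as cls
    recall = tp / actual
    precision = tp / predicted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f1 = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```

For class 0, recall is 79 / (79 + 23) and precision is 79 / (79 + 8), which reproduces the 0.77 and 0.91 shown in the report.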

### Feature importance 

In [17]:
# features importance 
rf_feature_importance = rf_classifier.feature_importances_
print("Feature importance:", rf_feature_importance)

Feature importance: [0.11414529 0.03844907 0.03995547 0.22802099 0.1520152  0.29363539
 0.03034909 0.03467295 0.03046117 0.03829539]


In [26]:
# get the original features names from the dataset 
original_features_names = X.columns
# get the indices of the selected features 
selected_indices = best_features.get_support(indices=True)
# retrieve actual names from the selected features
selected_features_names = [original_features_names[i] for i in selected_indices ]
print("Selected features:", selected_features_names)

Selected features: ['ProductionVolume', 'ProductionCost', 'SupplierQuality', 'DefectRate', 'QualityScore', 'MaintenanceHours', 'SafetyIncidents', 'EnergyConsumption', 'EnergyEfficiency', 'AdditiveMaterialCost']


#### Here are the 10 Selected features: ['ProductionVolume', 'ProductionCost', 'SupplierQuality', 'DefectRate', 'QualityScore', 'MaintenanceHours', 'SafetyIncidents', 'EnergyConsumption', 'EnergyEfficiency', 'AdditiveMaterialCost']

In [31]:
# map feature names to feature importances
importance_dict = dict(zip(selected_features_names, rf_feature_importance))
# sort by importance 
sorted_importance = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)
# print sorted features with importance 
print("Features importances:")
for feature, importance in sorted_importance: 
    print(f"{feature}:{importance}")

Features importances:
MaintenanceHours:0.2936353877720894
DefectRate:0.2280209887615997
QualityScore:0.15201519644262912
ProductionVolume:0.11414528754502296
SupplierQuality:0.039955469177281905
ProductionCost:0.0384490725576763
AdditiveMaterialCost:0.03829538772705802
EnergyConsumption:0.03467295241796534
EnergyEfficiency:0.030461168597693402
SafetyIncidents:0.030349089000983865
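
Random forest importances sum to 1, so they can be read as shares of the total. The sketch below copies the values printed above and adds a running total, to show how concentrated the importance is in the first few features. The formatting choices are illustrative only.

```python
# Importances copied from the sorted output above.
importances = {
    "MaintenanceHours": 0.2936353877720894,
    "DefectRate": 0.2280209887615997,
    "QualityScore": 0.15201519644262912,
    "ProductionVolume": 0.11414528754502296,
    "SupplierQuality": 0.039955469177281905,
    "ProductionCost": 0.0384490725576763,
    "AdditiveMaterialCost": 0.03829538772705802,
    "EnergyConsumption": 0.03467295241796534,
    "EnergyEfficiency": 0.030461168597693402,
    "SafetyIncidents": 0.030349089000983865,
}

total_importance = sum(importances.values())
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

cumulative = 0.0
for name, value in ranked:
    cumulative += value
    print(f"{name:<22}{value:6.1%}  cumulative {cumulative:6.1%}")

# Combined share of the four highest-ranked features.
top4_share = sum(value for _, value in ranked[:4])
print(f"top 4 features: {top4_share:.1%} of total importance")
```

The four leading features carry roughly 79% of the total, while the remaining six contribute between 3% and 4% each.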


#### Model optimization

In [21]:
# define hyperparameter grid for random forest classifier 
param_dist = {
    'n_estimators':[100, 200, 300, 400],
    'max_features':['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    'max_depth': [10, 20, 30, 40,None],
    'min_samples_split': [2, 5, 10], 
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
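
To see how much of the search space a randomized search actually covers, the sketch below counts the possible combinations and compares them with the 100 sampled candidates and 3 folds used in the next cell. It assumes `'auto'` is left out of `max_features`, since that value was removed in scikit-learn 1.3; with `'auto'` included the total would be 1080 instead of 720.

```python
from math import prod

# Search space as defined above, without the removed 'auto' option (assumption).
param_dist = {
    "n_estimators": [100, 200, 300, 400],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 20, 30, 40, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

n_combinations = prod(len(values) for values in param_dist.values())
n_iter = 100  # candidates sampled by RandomizedSearchCV
cv = 3        # cross-validation folds

print(f"possible combinations: {n_combinations}")
print(f"share explored: {n_iter / n_combinations:.1%}")
print(f"model fits (excluding the final refit): {n_iter * cv}")
```

An exhaustive grid search over the same space would need 720 × 3 = 2160 fits, which is why sampling 100 candidates is a reasonable compromise here.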

In [22]:
# initialize the model
rf = RandomForestClassifier(random_state=42)

# initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist,n_iter=100, scoring='accuracy', cv=3, verbose=2, 
    random_state=42, n_jobs=-1)

# fit RandomizedSearchCV
random_search.fit(X_resampled,y_resampled)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


##### The randomized search takes more than 4 minutes to run

In [23]:
# best hyperparameters 
print("Best hyperparameters:", random_search.best_params_)

Best hyperparameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False}


In [24]:
# retrieve the best model (already refit on the full resampled training set, since refit=True by default)
best_rf = random_search.best_estimator_

# predict on test set 
y_pred_rf = best_rf.predict(X_test_scaled)

In [25]:
# evaluate the model 
print("Classification report:\n", classification_report(y_test, y_pred_rf))
print("Accuracy score:\n", accuracy_score(y_test, y_pred_rf))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_rf))

Classification report:
               precision    recall  f1-score   support

           0       0.93      0.77      0.84       102
           1       0.96      0.99      0.97       546

    accuracy                           0.96       648
   macro avg       0.94      0.88      0.91       648
weighted avg       0.95      0.96      0.95       648

Accuracy score:
 0.9552469135802469
Confusion matrix:
 [[ 79  23]
 [  6 540]]
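
Comparing the two confusion matrices directly shows what tuning changed. The sketch below computes accuracy and balanced accuracy (the mean of the per-class recalls) for both models from the matrices printed in this notebook. The helper function is illustrative and is not the scikit-learn implementation.

```python
def balanced_accuracy(matrix):
    """Mean of per-class recall for a confusion matrix with true labels on rows."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)

baseline_cm = [[79, 23], [8, 538]]  # untuned forest
tuned_cm = [[79, 23], [6, 540]]     # after RandomizedSearchCV

for name, cm in (("baseline", baseline_cm), ("tuned", tuned_cm)):
    acc = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
    print(f"{name}: accuracy={acc:.4f} balanced_accuracy={balanced_accuracy(cm):.4f}")

# Number of test predictions that differ in outcome between the two models.
changed_predictions = baseline_cm[1][0] - tuned_cm[1][0]
print(f"defective parts newly caught after tuning: {changed_predictions}")
```

Tuning recovered two additional defective parts but left the 23 misclassified non-defective parts unchanged. If class 0 recall is the priority, one option worth trying is `scoring='balanced_accuracy'` in the search instead of `scoring='accuracy'`.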


### Conclusion

In this project, I developed a predictive model to reduce manufacturing defects by leveraging a dataset containing key production metrics. Through exploratory data analysis and feature selection, I identified the most relevant features influencing defect prediction. The dataset's inherent class imbalance was addressed using SMOTE to enhance the model's ability to detect minority class instances.

A Random Forest classifier was fine-tuned using `RandomizedSearchCV`, achieving an overall accuracy of 95.52%, a marginal gain over the 95.22% of the untuned model. The model demonstrated high precision (96%) and recall (99%) for detecting defective products, ensuring reliable identification of quality issues. However, the recall for non-defective products (77%) means that 23 of the 102 non-defective test items were flagged as defective, indicating over-rejection of valid items and room for further optimization.

The feature importances point to where process improvements may matter most: `MaintenanceHours`, `DefectRate`, `QualityScore`, and `ProductionVolume` together account for about 79% of the model's total importance. Future enhancements may include advanced ensemble techniques, improved handling of class imbalance, and feature engineering to further refine predictive accuracy.