# Steel Ingot Defect Detection – Full Pipeline

This notebook reproduces the workflow from *Data-Driven Approach for Defect Identification in Steel Ingot Casting via Machine Learning*. It walks through data understanding, exploratory analysis, model training (Random Forest, XGBoost, SVM, MLP), ensemble optimization, and explainability using SHAP and a linear SVM decision boundary.

**Notebook outline**
1. Imports & configuration
2. Data loading & overview
3. Exploratory data analysis
4. Preprocessing & splitting
5. Model training & ensemble tuning
6. Evaluation (confusion matrices, ROC, PR, comparisons)
7. Explainability (SHAP + linear SVM)
8. Conclusions

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.config import PATHS, TRAINING
from src.utils import configure_plotting, plot_confusion_matrix
from src.data_loading import load_dataset, describe_target_distribution
from src.eda import plot_density_by_class, plot_correlation_heatmap
from src.preprocessing import split_and_scale
from src.training import train_and_optimize
from src.evaluation import (
    plot_roc_curves,
    plot_pr_curves,
    plot_metric_comparison,
    metrics_to_dataframe,
)
from src.explainability import compute_shap_values, plot_shap_summary, plot_shap_importance, explain_linear_svm

configure_plotting()
np.random.seed(TRAINING.random_state)

## 1. Data Loading & Overview

In [None]:
df, target_column, numeric_features = load_dataset()
print(f'Target column: {target_column}')
print(f'Numeric features ({len(numeric_features)}): {numeric_features}')
df.head()

In [None]:
print('Dataset shape:', df.shape)
print('Missing values per column (should be zero):')
print(df.isna().sum())
print('
Target distribution:')
print(describe_target_distribution(df, target_column))

## 2. Exploratory Data Analysis – Density & Frequency Distributions
The paper highlights overlapping distributions between defective and non-defective ingots for many parameters. Below we overlay kernel density estimates for each numeric feature split by the defect label.

In [None]:
density_figs = plot_density_by_class(df, numeric_features, target_column)
for fig in density_figs:
    display(fig)
    plt.close(fig)

## 3. Correlation Matrix & Heatmap
We reproduce the correlation analysis between alloying elements and process parameters.

In [None]:
heatmap_fig = plot_correlation_heatmap(df, numeric_features)
display(heatmap_fig)
plt.close(heatmap_fig)

## 4. Preprocessing & Train/Test Split
We perform a stratified split and standardize numeric inputs for margin-based models.

In [None]:
preprocessed = split_and_scale(df, numeric_features, target_column)
print(f'Train shape: {preprocessed.X_train.shape}, Test shape: {preprocessed.X_test.shape}')
print('Train target distribution:
', preprocessed.y_train.value_counts(normalize=True))
print('Test target distribution:
', preprocessed.y_test.value_counts(normalize=True))

## 5. Model Training & Ensemble Optimization
We train Random Forest, XGBoost, RBF-SVM, and MLP models, then tune class weights and the ensemble decision threshold as described in the paper.

In [None]:
artifacts = train_and_optimize(preprocessed)
print(f'Best class weight: {artifacts.best_class_weight}')
print(f'Best ensemble threshold: {artifacts.best_threshold:.2f}')
metrics_df = metrics_to_dataframe(artifacts.metrics)
metrics_df

## 6. Confusion Matrices
The confusion matrices illustrate the balance between precision and recall for each base model and the optimized ensemble.

In [None]:
for name, metric in artifacts.metrics.items():
    fig = plot_confusion_matrix(metric.confusion_matrix, labels=['Non-defect', 'Defect'])
    fig.suptitle(f'Confusion Matrix – {name}')
    display(fig)
    plt.close(fig)

## 7. ROC Curves

In [None]:
roc_fig = plot_roc_curves(artifacts.metrics)
display(roc_fig)
plt.close(roc_fig)

## 8. Precision–Recall Curves

In [None]:
pr_fig = plot_pr_curves(artifacts.metrics)
display(pr_fig)
plt.close(pr_fig)

## 9. Model Comparison Bar Chart

In [None]:
comparison_fig = plot_metric_comparison(artifacts.metrics)
display(comparison_fig)
plt.close(comparison_fig)

## 10. SHAP Global Explainability
Tree-based SHAP values highlight which parameters most influence the ensemble's tree component (Random Forest).

In [None]:
rf_model = artifacts.models['Random Forest']
background = preprocessed.X_train.values
X_target = preprocessed.X_test.values
_, shap_values = compute_shap_values(
    rf_model,
    background,
    X_target,
    feature_names=preprocessed.X_train.columns,
)
plot_shap_summary(shap_values, X_target, feature_names=preprocessed.X_train.columns)
plot_shap_importance(shap_values, feature_names=preprocessed.X_train.columns)

## 11. Linear SVM Decision Boundary
We derive the explicit linear decision function and visualize coefficient-driven importance, mirroring the paper's analysis.

In [None]:
linear_expl = explain_linear_svm(preprocessed, class_weight=artifacts.best_class_weight)
print('Linear SVM decision function:')
print(linear_expl.equation)
display(linear_expl.coefficients)
display(linear_expl.figure)
plt.close(linear_expl.figure)

## 12. Conclusions & Next Steps
- The dataset is clean and balanced enough for stratified modeling, but overlapping densities confirm that multivariate ML is necessary.
- Correlation analysis highlights linked thermal parameters (superheat, casting temperature, teeming speed) that co-vary with defect tendencies.
- The tuned ensemble (class weights {0:62, 1:12}, threshold 0.62) balances precision (~0.81) and recall (~0.97), aligning with the paper.
- SHAP and the linear SVM both point to temperature management and select alloying elements (e.g., Mn, S, Al) as key levers for reducing defects.
- Future work: cross-validation, streaming inference hooks, and integration with process-control dashboards.