# 🔍 Steel Plates Defects Analysis: Initial Exploration
## Phase 1: The Journey Begins - Discovering the Other_Faults Mystery

**Author**: 七七 | QA Engineer  
**Date**: 2025-07-28  
**Objective**: Understanding the nature of Other_Faults - the largest unidentified defect category  

---

## 🎯 Project Background

As a Quality Assurance engineer, I've always been fascinated by the challenge of defect classification in steel plate manufacturing. When I came across the Steel Plates Faults dataset in the UCI Machine Learning Repository, one category immediately caught my attention:

**Other_Faults - representing 34.7% of all defects!**

This means that over one-third of all defects couldn't be properly classified into known categories like K_Scratch, Bumps, or Stains. From a quality management perspective, this represents both a significant challenge and an opportunity for improvement.

## 🤔 Initial Questions

- What makes Other_Faults different from known defect types?
- Are there hidden patterns that could help us understand these "unknown" defects?
- Could we develop better classification methods to reduce this 34.7% uncertainty?
- What would be the business impact of solving this classification problem?

## 💡 Initial Hypothesis

Based on my experience in quality management, I hypothesize that Other_Faults might be related to specific manufacturing conditions, particularly **steel plate thickness**. Thicker plates often require different processing parameters and might produce unique defect patterns that are harder to classify.

Let's begin our exploration...

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style for professional visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Ensure reproducibility
np.random.seed(42)

print("✅ Environment setup complete!")
print("🔍 Ready to explore the Other_Faults mystery...")

## 📊 Data Loading and First Look

Let's load the steel plates faults dataset and get our first glimpse of the data structure.

In [None]:
# Load the steel plates faults dataset
print("📥 Loading UCI Steel Plates Faults dataset...")

steel_plates_faults = fetch_ucirepo(id=198)
X = steel_plates_faults.data.features 
y = steel_plates_faults.data.targets

# Combine features and targets for easier analysis
df = pd.concat([X, y], axis=1)

print(f"✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"🔧 Features: {len(X.columns)}")
print(f"🎯 Target variables: {list(y.columns)}")

In [None]:
# Display basic information about the dataset
print("📋 Dataset Overview:")
print("=" * 50)
df.info()

In [None]:
# Display first few rows to understand the data structure
print("👀 First 5 rows of the dataset:")
print("=" * 50)
df.head()

## 🎯 Defect Types Distribution - The Big Picture

Now let's examine the distribution of different defect types to confirm our initial observation about Other_Faults.

In [None]:
# Calculate defect type statistics
defect_columns = ['Pastry', 'Z_Scratch', 'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
total_samples = len(df)

defect_stats = {}
for defect in defect_columns:
    count = df[defect].sum()
    percentage = (count / total_samples) * 100
    defect_stats[defect] = {'count': count, 'percentage': percentage}

# Display the statistics
print("📊 Defect Types Distribution:")
print("=" * 50)
for defect, stats in defect_stats.items():
    indicator = "👑" if defect == 'Other_Faults' else "📍"
    print(f"{indicator} {defect:15s}: {stats['count']:4d} samples ({stats['percentage']:5.1f}%)")

print(f"\n🔍 Total samples: {total_samples}")
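
The percentages above only add up to 100% if every row carries exactly one defect flag. The following is a minimal sketch of how that assumption could be checked before trusting the shares. The helper name `check_one_hot` and the toy frame are hypothetical and stand in for the real target columns.

```python
import pandas as pd

def check_one_hot(frame: pd.DataFrame, label_columns: list) -> pd.Series:
    """Return one label per row; raise if any row has zero or several flags."""
    flags_per_row = frame[label_columns].sum(axis=1)
    bad_rows = flags_per_row[flags_per_row != 1]
    if not bad_rows.empty:
        raise ValueError(f"{len(bad_rows)} rows do not have exactly one defect flag")
    # idxmax returns, for each row, the name of the column that holds the 1
    return frame[label_columns].idxmax(axis=1)

# Toy data with made-up values, standing in for the real targets
toy = pd.DataFrame({
    "Bumps":        [1, 0, 0, 0],
    "Stains":       [0, 1, 0, 0],
    "Other_Faults": [0, 0, 1, 1],
})
toy_labels = check_one_hot(toy, ["Bumps", "Stains", "Other_Faults"])
toy_share = toy_labels.value_counts(normalize=True) * 100
print(toy_share.round(1))
```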

In [None]:
# Create visualization of defect distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart
defect_names = list(defect_stats.keys())
defect_counts = [stats['count'] for stats in defect_stats.values()]
defect_percentages = [stats['percentage'] for stats in defect_stats.values()]

# Highlight Other_Faults
colors = ['red' if defect == 'Other_Faults' else 'lightblue' for defect in defect_names]

bars = ax1.bar(defect_names, defect_counts, color=colors)
ax1.set_title('Steel Plate Defects Distribution\n(Other_Faults Highlighted)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Defect Type')
ax1.set_ylabel('Number of Samples')
ax1.tick_params(axis='x', rotation=45)

# Add percentage labels on bars
for bar, percentage in zip(bars, defect_percentages):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 10,
             f'{percentage:.1f}%', ha='center', va='bottom', fontweight='bold')

# Pie chart focusing on top categories
top_categories = [(name, stats['percentage']) for name, stats in defect_stats.items() 
                  if stats['percentage'] > 5]  # Only show categories > 5%
other_small = sum([stats['percentage'] for name, stats in defect_stats.items() 
                   if stats['percentage'] <= 5])

pie_labels = [name for name, _ in top_categories]
pie_values = [percentage for _, percentage in top_categories]
pie_colors = ['red' if label == 'Other_Faults' else 'lightblue' for label in pie_labels]

if other_small > 0:
    pie_labels.append('Others (< 5%)')
    pie_values.append(other_small)
    pie_colors.append('lightgray')

wedges, texts, autotexts = ax2.pie(pie_values, labels=pie_labels, colors=pie_colors, 
                                   autopct='%1.1f%%', startangle=90)
ax2.set_title('Defect Types Distribution\n(Other_Faults Dominates)', fontsize=14, fontweight='bold')

# Make the Other_Faults percentage label bold
for label, autotext in zip(pie_labels, autotexts):
    if label == 'Other_Faults':
        autotext.set_fontweight('bold')
        autotext.set_fontsize(12)

plt.tight_layout()
plt.show()

print("\n🎯 Key Observation:")
other_faults_pct = defect_stats['Other_Faults']['percentage']
print(f"   Other_Faults represents {other_faults_pct:.1f}% of all defects - the largest category!")
print(f"   This means {defect_stats['Other_Faults']['count']} samples fall outside the six named defect types.")

## 🔍 Initial Hypothesis Testing: Thickness Analysis

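Let's test my initial hypothesis that Other_Faults might be related to steel plate thickness. We'll compare the thickness distribution of Other_Faults samples with that of the remaining samples. Note that every row in this dataset is a defect, so the comparison group consists of the six named defect types rather than defect-free plates.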

In [None]:
# Split Other_Faults samples from the rest. Every row in this dataset is a defect,
# so `normal_samples` holds the six named defect types, not defect-free plates.
other_faults_samples = df[df['Other_Faults'] == 1]
normal_samples = df[df['Other_Faults'] == 0]

print("📊 Sample Distribution:")
print(f"   Other_Faults samples: {len(other_faults_samples)}")
print(f"   Named-defect samples: {len(normal_samples)}")
print(f"   Other_Faults ratio: {len(other_faults_samples)/len(df)*100:.1f}%")
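
Pooling all named defect types into one comparison group can hide differences between them. As a rough sketch, a per-class breakdown of a single feature might look like the following. It assumes each row carries exactly one defect flag; `per_class_summary` and the toy values are hypothetical.

```python
import pandas as pd

def per_class_summary(frame: pd.DataFrame, label_columns: list, feature: str) -> pd.DataFrame:
    """Summarise one feature per defect class (assumes exactly one flag per row)."""
    defect = frame[label_columns].idxmax(axis=1).rename("defect")
    return frame.groupby(defect)[feature].agg(["count", "mean", "median"])

# Toy data with made-up thickness values
toy_plates = pd.DataFrame({
    "Bumps":                 [1, 1, 0, 0, 0],
    "Stains":                [0, 0, 1, 0, 0],
    "Other_Faults":          [0, 0, 0, 1, 1],
    "Steel_Plate_Thickness": [40, 60, 70, 100, 200],
})
toy_summary = per_class_summary(
    toy_plates, ["Bumps", "Stains", "Other_Faults"], "Steel_Plate_Thickness"
)
print(toy_summary)
```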

In [None]:
# Compare thickness statistics
of_thickness = other_faults_samples['Steel_Plate_Thickness']
normal_thickness = normal_samples['Steel_Plate_Thickness']

print("🔧 Steel Plate Thickness Comparison:")
print("=" * 50)
print(f"Other_Faults Thickness:")
print(f"   Mean: {of_thickness.mean():.1f}mm")
print(f"   Median: {of_thickness.median():.1f}mm")
print(f"   Std: {of_thickness.std():.1f}mm")
print(f"   Range: {of_thickness.min():.1f}mm - {of_thickness.max():.1f}mm")

print("\nNamed-Defect Samples Thickness:")
print(f"   Mean: {normal_thickness.mean():.1f}mm")
print(f"   Median: {normal_thickness.median():.1f}mm")
print(f"   Std: {normal_thickness.std():.1f}mm")
print(f"   Range: {normal_thickness.min():.1f}mm - {normal_thickness.max():.1f}mm")

# Calculate difference
mean_diff = abs(of_thickness.mean() - normal_thickness.mean())
mean_diff_pct = (mean_diff / normal_thickness.mean()) * 100

print(f"\n📈 Statistical Difference:")
print(f"   Mean difference: {mean_diff:.1f}mm ({mean_diff_pct:.1f}%)")

# Note: 20% is a heuristic cutoff on the relative mean gap, not a statistical significance test
if mean_diff_pct > 20:
    print("   🎯 Initial hypothesis seems SUPPORTED! Large thickness difference detected.")
else:
    print("   🤔 Initial hypothesis might need revision. Difference is moderate.")
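
A percentage gap between two means says nothing about whether the gap could arise by chance. One way to check this, sketched here under the assumption that a two-sided permutation test on the difference in means is acceptable, uses only NumPy. The function name `permutation_p_value` and its defaults are illustrative and not part of the original analysis.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=5000, seed=42):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and re-split them into groups of the original sizes
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Made-up thickness values for two clearly separated groups
thick_group = [100, 110, 120, 130, 140, 150, 160, 170]
thin_group = [40, 45, 50, 55, 60, 65, 70, 75]
p_value = permutation_p_value(thick_group, thin_group, n_permutations=2000)
print(f"Permutation p-value: {p_value:.4f}")
```

In the notebook, the same function could be called with `of_thickness` and `normal_thickness` in place of the made-up lists.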

## 🔍 Key Features Analysis

Let's examine some key features to understand what distinguishes Other_Faults from the samples with a named defect type.

In [None]:
# Select key features for comparison
key_features = ['Steel_Plate_Thickness', 'Sum_of_Luminosity', 'Pixels_Areas', 
                'X_Perimeter', 'Y_Perimeter', 'Outside_X_Index']

print("📊 Key Features Comparison (Other_Faults vs Named Defects):")
print("=" * 70)
print(f"{'Feature':<24} {'OF_Mean':<12} {'Named_Mean':<12} {'Difference%':<12}")
print("-" * 70)

feature_differences = []
for feature in key_features:
    of_mean = other_faults_samples[feature].mean()
    normal_mean = normal_samples[feature].mean()
    # Relative gap between group means; abs() in the denominator guards against negative means
    diff_pct = abs(of_mean - normal_mean) / abs(normal_mean) * 100
    
    feature_differences.append({
        'feature': feature,
        'of_mean': of_mean,
        'normal_mean': normal_mean,
        'diff_pct': diff_pct
    })
    
    print(f"{feature:<24} {of_mean:<12.1f} {normal_mean:<12.1f} {diff_pct:<12.1f}%")

# Find the most different features
feature_differences.sort(key=lambda x: x['diff_pct'], reverse=True)
print(f"\n🔍 Most Distinctive Features:")
for i, feat in enumerate(feature_differences[:3], 1):
    print(f"   {i}. {feat['feature']}: {feat['diff_pct']:.1f}% difference")
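
Relative mean differences depend on each feature's scale and are easily inflated by skewed features such as pixel areas. A scale-free alternative, sketched below as an assumption rather than the method used above, is the standardized mean difference (mean gap divided by the pooled standard deviation), reported next to the median gap. The helper `standardized_differences` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def standardized_differences(frame: pd.DataFrame, features: list, mask: pd.Series) -> pd.DataFrame:
    """Rank features by standardized mean difference between mask and non-mask rows."""
    group = frame.loc[mask, features]
    rest = frame.loc[~mask, features]
    n_g, n_r = len(group), len(rest)
    # Pooled standard deviation per feature, using sample variances (ddof=1)
    pooled_sd = np.sqrt(
        ((n_g - 1) * group.var(ddof=1) + (n_r - 1) * rest.var(ddof=1)) / (n_g + n_r - 2)
    )
    result = pd.DataFrame({
        "smd": (group.mean() - rest.mean()) / pooled_sd,
        "median_gap": group.median() - rest.median(),
    })
    return result.sort_values("smd", key=lambda s: s.abs(), ascending=False)

# Toy data with made-up values; `is_other` plays the role of the Other_Faults flag
toy_features = pd.DataFrame({
    "x":        [2, 4, 6, 1, 3, 5],
    "y":        [1, 2, 3, 1, 2, 3],
    "is_other": [1, 1, 1, 0, 0, 0],
})
toy_ranking = standardized_differences(toy_features, ["x", "y"], toy_features["is_other"] == 1)
print(toy_ranking)
```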

## 📈 Initial Conclusions and Next Steps

Based on this initial exploration, let me summarize what we've discovered and outline our next steps.

In [None]:
print("📋 INITIAL EXPLORATION SUMMARY")
print("=" * 50)

print(f"\n🎯 Key Findings:")
print(f"   • Other_Faults is indeed the largest defect category at {other_faults_pct:.1f}%")
print(f"   • {len(other_faults_samples)} samples fall outside the six named defect types")
print(f"   • Thickness hypothesis shows {mean_diff_pct:.1f}% difference - {'LARGE' if mean_diff_pct > 20 else 'MODERATE'}")

top_diff_feature = feature_differences[0]
print(f"   • Most distinctive feature: {top_diff_feature['feature']} ({top_diff_feature['diff_pct']:.1f}% difference)")

print(f"\n🤔 Questions Raised:")
print(f"   • Why do Other_Faults have such different characteristics?")
print(f"   • Are there hidden patterns within the Other_Faults category?")
print(f"   • Can we develop better classification methods?")

print(f"\n🚀 Next Steps:")
print(f"   1. Deep dive into visual analysis - create distribution plots")
print(f"   2. Perform clustering analysis on Other_Faults samples")
print(f"   3. Analyze correlations with other defect types")
print(f"   4. Investigate the root cause of these differences")

print(f"\n💡 Initial Hypothesis Status:")
if mean_diff_pct > 30:
    status = "STRONGLY SUPPORTED"
elif mean_diff_pct > 15:
    status = "PARTIALLY SUPPORTED"
else:
    status = "NEEDS REVISION"
    
print(f"   Thickness-related hypothesis: {status}")
print(f"   But we need deeper analysis to understand the full picture...")

---

## 🎯 End of Phase 1

This initial exploration has confirmed that Other_Faults is indeed a significant challenge, representing over one-third of all defects. We've also compared these samples with the named-defect samples and found differences in their average characteristics, particularly in thickness and other key features.

However, statistical averages can sometimes be misleading. In the next notebook (`02_visual_insights.ipynb`), we'll create detailed visualizations to better understand the true distribution patterns and see if our statistical findings hold up under visual scrutiny.

**The mystery deepens, but we're on the right track!** 🔍

---

*Continue to: [02_visual_insights.ipynb](./02_visual_insights.ipynb)*