# 🔍 Steel Plates Defects Analysis: Initial Exploration | 鋼板缺陷分析：初步探索
## Phase 1: The Journey Begins - Discovering the Other_Faults Mystery | 第一階段：分析之旅啟程 - 探索Other_Faults的奧秘

**Author**: Yu-Ching, Chou | QA Engineer (品質保證工程師)  
**Date**: 2025-07-28  
**Objective**: Understanding the nature of Other_Faults - the largest unidentified defect category  
**目標**: 了解Other_Faults的本質 - 最大的未識別缺陷類別

---

## 🎯 Project Background | 專案背景

As a Quality Assurance engineer, I've always been fascinated by the challenge of defect classification in steel plate manufacturing. When I discovered this UCI Machine Learning dataset on steel plate faults, one category immediately caught my attention:

身為品質保證工程師，我一直對鋼板製造過程中的缺陷分類問題深感興趣。當我發現這個UCI機器學習鋼板缺陷資料集時，有一個類別立刻吸引了我的注意：

**Other_Faults - representing 34.7% of all defects!**
**Other_Faults - 佔所有缺陷的34.7%！**

According to the dataset documentation, Other_Faults encompasses "a broader range of faults or defects not explicitly categorized in the other fault types." This means:

根據資料集說明，Other_Faults包含了「無法明確歸類到其他已知缺陷類型的各種表面瑕疵」，這意味著：

- **K_Scratch**: Perpendicular to rolling direction ✓ Identifiable | K型刮痕：垂直軋制方向 ✓ 可識別
- **Z_Scratch**: Parallel to rolling direction ✓ Identifiable | Z型刮痕：平行軋制方向 ✓ 可識別  
- **Bumps**: Raised surface areas ✓ Identifiable | 凸起：表面凸起區域 ✓ 可識別
- **Stains**: Discolored contaminated areas ✓ Identifiable | 污漬：變色污染區域 ✓ 可識別
- **Other_Faults**: Over one-third of defects ❌ Unclassifiable | 其他缺陷：超過三分之一 ❌ 無法分類

From a quality management perspective, this represents both a significant challenge and an opportunity for improvement.

從品質管理角度來看，這代表重大挑戰，也是改善機會。

## 🤔 Initial Questions | 初步問題

- What makes Other_Faults different from known defect types? | 是什麼讓Other_Faults與已知缺陷類型不同？
- Are there hidden patterns that could help us understand these "unknown" defects? | 是否存在隱藏模式幫助理解這些「未知」缺陷？
- Could we develop better classification methods to reduce this 34.7% uncertainty? | 能否開發更好的分類方法來降低34.7%的不確定性？
- What would be the business impact of solving this classification problem? | 解決分類問題的商業影響為何？

## 💡 Initial Hypothesis | 初步假設

Based on my experience in quality management, I hypothesize that Other_Faults might be related to specific manufacturing conditions, particularly **steel plate thickness**. Thicker plates often require different processing parameters and might produce unique defect patterns that are harder to classify.

基於品質管理經驗，我假設Other_Faults可能與特定製造條件有關，特別是**鋼板厚度**。較厚鋼板通常需要不同加工參數，可能產生較難分類的獨特缺陷模式。

Let's begin our exploration... | 讓我們開始探索...

In [None]:
# Import necessary libraries | 匯入必要函式庫
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style for professional visualizations
# 設定專業視覺化樣式
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Ensure reproducibility | 確保結果可重現
np.random.seed(42)

print("✅ Environment setup complete! | 環境設定完成！")
print("🔍 Ready to explore the Other_Faults mystery... | 準備探索Other_Faults的奧秘...")

## 📊 Data Loading and First Look | 資料載入與初步觀察

Let's load the steel plates faults dataset and get our first glimpse of the data structure.

讓我們載入鋼板缺陷資料集，並初步了解資料結構。

In [None]:
# Load the steel plates faults dataset | 載入鋼板缺陷資料集
print("📥 Loading UCI Steel Plates Faults dataset... | 載入UCI鋼板缺陷資料集...")

steel_plates_faults = fetch_ucirepo(id=198)
X = steel_plates_faults.data.features 
y = steel_plates_faults.data.targets

# Combine features and targets for easier analysis
# 合併特徵與目標變數以便分析
df = pd.concat([X, y], axis=1)

print(f"✅ Dataset loaded successfully! | 資料集載入成功！")
print(f"📊 Dataset shape | 資料集維度: {df.shape}")
print(f"🔧 Features | 特徵數量: {len(X.columns)}")
print(f"🎯 Target variables | 目標變數: {list(y.columns)}")

In [None]:
# Display basic information about the dataset | 顯示資料集基本資訊
print("📋 Dataset Overview | 資料集概覽:")
print("=" * 50)
df.info()

In [None]:
# Display first few rows to understand the data structure
# 顯示前幾列以了解資料結構
print("👀 First 5 rows of the dataset | 資料集前5列:")
print("=" * 50)
df.head()
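
Beyond `info()` and `head()`, a first look usually also checks for missing values, duplicated rows, and columns that never vary. The sketch below shows one way to collect these checks; the helper name `quick_quality_report` and the `toy` frame are illustrative assumptions, not part of the dataset or of `ucimlrepo`.

```python
import pandas as pd

def quick_quality_report(frame: pd.DataFrame) -> dict:
    """Collect a few basic data-quality indicators for a DataFrame."""
    return {
        "n_rows": len(frame),
        # Total number of missing cells across all columns
        "n_missing": int(frame.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "n_duplicate_rows": int(frame.duplicated().sum()),
        # Columns holding a single value carry no information
        "constant_columns": [
            col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1
        ],
    }

# Hypothetical toy data, used only to illustrate the helper
toy = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, None],
    "b": [5, 6, 6, 7],
    "c": [0, 0, 0, 0],
})
report = quick_quality_report(toy)
print(report)
```

In this notebook the same call would be `quick_quality_report(df)`.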

## 🎯 Defect Types Distribution - The Big Picture | 缺陷類型分布 - 全貌觀察

Now let's examine the distribution of different defect types to confirm our initial observation about Other_Faults.

現在讓我們檢視不同缺陷類型的分布，確認對Other_Faults的初步觀察。

In [None]:
# Calculate defect type statistics | 計算缺陷類型統計
defect_columns = ['Pastry', 'Z_Scratch', 'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
total_samples = len(df)

defect_stats = {}
for defect in defect_columns:
    count = df[defect].sum()
    percentage = (count / total_samples) * 100
    defect_stats[defect] = {'count': count, 'percentage': percentage}

# Display the statistics | 顯示統計結果
print("📊 Defect Types Distribution | 缺陷類型分布:")
print("=" * 50)
for defect, stats in defect_stats.items():
    indicator = "👑" if defect == 'Other_Faults' else "📍"
    print(f"{indicator} {defect:15s}: {stats['count']:4d} samples ({stats['percentage']:5.1f}%)")

print(f"\n🔍 Total samples | 總樣本數: {total_samples}")
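
The percentages above divide by the total row count, which only reads as a share "of all defects" if every row carries exactly one fault label. Below is a minimal sketch of that sanity check combined with the count table. The helper name `summarize_labels` and the `toy` targets are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def summarize_labels(targets: pd.DataFrame) -> pd.DataFrame:
    """Verify one-hot exclusivity, then return count and percentage per label."""
    labels_per_row = targets.sum(axis=1)
    if not (labels_per_row == 1).all():
        n_bad = int((labels_per_row != 1).sum())
        raise ValueError(f"{n_bad} rows do not have exactly one fault label")
    counts = targets.sum(axis=0).astype(int)
    summary = pd.DataFrame({
        "count": counts,
        "percentage": counts / len(targets) * 100,
    })
    return summary.sort_values("count", ascending=False)

# Hypothetical one-hot targets, used only to illustrate the helper
toy = pd.DataFrame({
    "Bumps":        [1, 0, 0, 0, 1, 0],
    "Stains":       [0, 1, 0, 0, 0, 0],
    "Other_Faults": [0, 0, 1, 1, 0, 1],
})
summary = summarize_labels(toy)
print(summary)
```

With the notebook's variables this would be `summarize_labels(y)`, which also takes the label names from the data instead of a hard-coded list.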

In [None]:
# Create visualization of defect distribution | 建立缺陷分布視覺化
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart | 長條圖
defect_names = list(defect_stats.keys())
defect_counts = [stats['count'] for stats in defect_stats.values()]
defect_percentages = [stats['percentage'] for stats in defect_stats.values()]

# Highlight Other_Faults | 突顯Other_Faults
colors = ['red' if defect == 'Other_Faults' else 'lightblue' for defect in defect_names]

bars = ax1.bar(defect_names, defect_counts, color=colors)
ax1.set_title('Steel Plate Defects Distribution\n(Other_Faults Highlighted)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Defect Type')
ax1.set_ylabel('Number of Samples')
ax1.tick_params(axis='x', rotation=45)

# Add percentage labels on bars | 在長條上加入百分比標籤
for bar, percentage in zip(bars, defect_percentages):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 10,
             f'{percentage:.1f}%', ha='center', va='bottom', fontweight='bold')

# Pie chart focusing on top categories | 圓餅圖聚焦主要類別
top_categories = [(name, stats['percentage']) for name, stats in defect_stats.items() 
                  if stats['percentage'] > 5]  # Only show categories > 5%
other_small = sum([stats['percentage'] for name, stats in defect_stats.items() 
                   if stats['percentage'] <= 5])

pie_labels = [name for name, _ in top_categories]
pie_values = [percentage for _, percentage in top_categories]
pie_colors = ['red' if label == 'Other_Faults' else 'lightblue' for label in pie_labels]

if other_small > 0:
    pie_labels.append('Others (< 5%)')
    pie_values.append(other_small)
    pie_colors.append('lightgray')

wedges, texts, autotexts = ax2.pie(pie_values, labels=pie_labels, colors=pie_colors, 
                                   autopct='%1.1f%%', startangle=90)
ax2.set_title('Defect Types Distribution\n(Other_Faults Dominates)', fontsize=14, fontweight='bold')

# Make Other_Faults percentage bold | 讓Other_Faults的百分比加粗
for label, autotext in zip(pie_labels, autotexts):
    if label == 'Other_Faults':
        autotext.set_fontweight('bold')
        autotext.set_fontsize(12)

plt.tight_layout()
plt.show()

print("\n🎯 Key Observation | 關鍵觀察:")
other_faults_pct = defect_stats['Other_Faults']['percentage']
print(f"   Other_Faults represents {other_faults_pct:.1f}% of all defects - the largest category!")
print(f"   Other_Faults佔所有缺陷的 {other_faults_pct:.1f}% - 是最大的類別！")
print(f"   This means {defect_stats['Other_Faults']['count']} samples have no specific defect type assigned.")
print(f"   這意味著有 {defect_stats['Other_Faults']['count']} 個樣本目前沒有被歸入具體的缺陷類型。")

## 🔍 Initial Hypothesis Testing: Thickness Analysis | 初步假設驗證：厚度分析

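Let's test my initial hypothesis that Other_Faults might be related to steel plate thickness. Every sample in this dataset carries a fault label, so there is no defect-free group. We'll therefore compare the thickness distribution of Other_Faults samples with that of all remaining samples, i.e. those with a named defect type (stored in the variable `normal_samples` below).

讓我們測試初步假設，即Other_Faults可能與鋼板厚度有關。此資料集的每個樣本都帶有缺陷標籤，並沒有無缺陷的群組；因此我們將比較Other_Faults樣本與其餘樣本（即具有具名缺陷類型的樣本，存於下方的變數 `normal_samples`）的厚度分布。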

In [None]:
# Separate Other_Faults samples from all remaining samples.
# Note: "normal" here means "not Other_Faults" (a named defect type), not defect-free.
# 分離Other_Faults樣本與其餘樣本（「normal」指非Other_Faults，並非無缺陷）
other_faults_samples = df[df['Other_Faults'] == 1]
normal_samples = df[df['Other_Faults'] == 0]

print(f"📊 Sample Distribution | 樣本分布:")
print(f"   Other_Faults samples | Other_Faults樣本: {len(other_faults_samples)}")
print(f"   Named-defect samples | 具名缺陷樣本: {len(normal_samples)}")
print(f"   Other_Faults ratio | Other_Faults比例: {len(other_faults_samples)/len(df)*100:.1f}%")

In [None]:
# Compare thickness statistics | 比較厚度統計
of_thickness = other_faults_samples['Steel_Plate_Thickness']
normal_thickness = normal_samples['Steel_Plate_Thickness']

print("🔧 Steel Plate Thickness Comparison | 鋼板厚度比較:")
print("=" * 50)
print(f"Other_Faults Thickness | Other_Faults厚度:")
print(f"   Mean | 平均值: {of_thickness.mean():.1f}mm")
print(f"   Median | 中位數: {of_thickness.median():.1f}mm")
print(f"   Std | 標準差: {of_thickness.std():.1f}mm")
print(f"   Range | 範圍: {of_thickness.min():.1f}mm - {of_thickness.max():.1f}mm")

print(f"\nNamed-Defect Samples Thickness | 具名缺陷樣本厚度:")
print(f"   Mean | 平均值: {normal_thickness.mean():.1f}mm")
print(f"   Median | 中位數: {normal_thickness.median():.1f}mm")
print(f"   Std | 標準差: {normal_thickness.std():.1f}mm")
print(f"   Range | 範圍: {normal_thickness.min():.1f}mm - {normal_thickness.max():.1f}mm")

# Calculate difference | 計算差異
mean_diff = abs(of_thickness.mean() - normal_thickness.mean())
mean_diff_pct = (mean_diff / normal_thickness.mean()) * 100

print(f"\n📈 Statistical Difference | 統計差異:")
print(f"   Mean difference | 平均值差異: {mean_diff:.1f}mm ({mean_diff_pct:.1f}%)")

if mean_diff_pct > 20:
    print(f"   🎯 Initial hypothesis seems SUPPORTED! Mean thickness differs by more than 20%.")
    print(f"   🎯 初步假設似乎獲得支持！平均厚度差異超過20%。")
else:
    print(f"   🤔 Initial hypothesis might need revision. Difference is moderate.")
    print(f"   🤔 初步假設可能需要修正。差異程度為中等。")
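
A fixed percentage threshold on the gap between two means is a rule of thumb, not a statistical test. One way to check whether a gap could plausibly arise by chance is a permutation test, which needs no normality assumption. The sketch below is an illustrative addition on synthetic data, not the method used in this notebook. The function name `permutation_test_mean_diff` and the generated groups are assumptions.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=5000, seed=42):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an approximate p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the gap
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic groups, used only to illustrate the call
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80.0, scale=20.0, size=200)
group_b = rng.normal(loc=70.0, scale=20.0, size=300)
observed, p_value = permutation_test_mean_diff(group_a, group_b)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

In this notebook the call would be `permutation_test_mean_diff(of_thickness, normal_thickness)`. A small p-value indicates only that the gap is unlikely under random group assignment. It says nothing about why the groups differ.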

## 🔍 Key Features Analysis | 關鍵特徵分析

Let's examine some key features to understand what distinguishes Other_Faults from samples with a named defect type.

讓我們檢視一些關鍵特徵，了解Other_Faults與具名缺陷類型樣本的區別。

In [None]:
# Select key features for comparison | 選擇關鍵特徵進行比較
key_features = ['Steel_Plate_Thickness', 'Sum_of_Luminosity', 'Pixels_Areas', 
                'X_Perimeter', 'Y_Perimeter', 'Outside_X_Index']

print("📊 Key Features Comparison (Other_Faults vs named defects) | 關鍵特徵比較:")
print("=" * 70)
# Column widths are wide enough for the longest feature name (21 characters)
print(f"{'Feature':<24} {'OF_Mean':<14} {'Named_Mean':<14} {'Difference%':<12}")
print("-" * 70)

feature_differences = []
for feature in key_features:
    of_mean = other_faults_samples[feature].mean()
    normal_mean = normal_samples[feature].mean()
    # Relative difference of means, using the named-defect mean as the baseline
    diff_pct = abs(of_mean - normal_mean) / abs(normal_mean) * 100

    feature_differences.append({
        'feature': feature,
        'of_mean': of_mean,
        'normal_mean': normal_mean,
        'diff_pct': diff_pct
    })

    print(f"{feature:<24} {of_mean:<14.1f} {normal_mean:<14.1f} {diff_pct:.1f}%")

# Find the most different features | 找出差異最大的特徵
feature_differences.sort(key=lambda x: x['diff_pct'], reverse=True)
print(f"\n🔍 Most Distinctive Features | 最具區別性的特徵:")
for i, feat in enumerate(feature_differences[:3], 1):
    print(f"   {i}. {feat['feature']}: {feat['diff_pct']:.1f}% difference | {feat['diff_pct']:.1f}% 差異")
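
Ranking features by the percentage difference of their means depends on each feature's baseline. A feature with a large offset shows a tiny percentage even when the groups are well separated. A standardized mean difference (Cohen's d with a pooled standard deviation) avoids this. The sketch below is an illustrative alternative, not the ranking used above. The helper name and the `toy` frame are assumptions.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(group_a: pd.Series, group_b: pd.Series) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return float((group_a.mean() - group_b.mean()) / np.sqrt(pooled_var))

# Hypothetical data: feature_y is feature_x shifted by a constant offset of 1000
toy = pd.DataFrame({
    "is_other":  [1, 1, 1, 1, 0, 0, 0, 0],
    "feature_x": [2.0, 4.0, 4.0, 6.0, 1.0, 3.0, 3.0, 5.0],
    "feature_y": [1002.0, 1004.0, 1004.0, 1006.0, 1001.0, 1003.0, 1003.0, 1005.0],
})
group_a = toy[toy["is_other"] == 1]
group_b = toy[toy["is_other"] == 0]

results = {}
for feature in ["feature_x", "feature_y"]:
    d = standardized_mean_difference(group_a[feature], group_b[feature])
    pct = abs(group_a[feature].mean() - group_b[feature].mean()) / abs(group_b[feature].mean()) * 100
    results[feature] = {"cohens_d": d, "diff_pct": pct}
    print(f"{feature}: d = {d:.3f}, percent difference = {pct:.2f}%")
```

Both toy features have the same separation between groups, so they share the same d, while their percentage differences are far apart. In this notebook the call would be `standardized_mean_difference(other_faults_samples[feature], normal_samples[feature])`.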

## 📈 Initial Conclusions and Next Steps | 初步結論與下一步

Based on this initial exploration, let me summarize what we've discovered and outline our next steps.

基於這次初步探索，讓我總結發現並規劃後續步驟。

In [None]:
print("📋 INITIAL EXPLORATION SUMMARY | 初步探索總結")
print("=" * 50)

print(f"\n🎯 Key Findings | 主要發現:")
print(f"   • Other_Faults is indeed the largest defect category at {other_faults_pct:.1f}%")
print(f"   • Other_Faults確實是最大的缺陷類別，佔 {other_faults_pct:.1f}%")
print(f"   • {len(other_faults_samples)} samples have no specific defect type assigned")
print(f"   • {len(other_faults_samples)} 個樣本目前沒有被歸入具體的缺陷類型")
print(f"   • Thickness hypothesis shows {mean_diff_pct:.1f}% difference in means - {'LARGE' if mean_diff_pct > 20 else 'MODERATE'}")
print(f"   • 厚度假設顯示平均值有 {mean_diff_pct:.1f}% 差異 - {'大' if mean_diff_pct > 20 else '中等'}")

top_diff_feature = feature_differences[0]
print(f"   • Most distinctive feature: {top_diff_feature['feature']} ({top_diff_feature['diff_pct']:.1f}% difference)")
print(f"   • 最具區別性特徵: {top_diff_feature['feature']} ({top_diff_feature['diff_pct']:.1f}% 差異)")

print(f"\n🤔 Questions Raised | 引發的問題:")
print(f"   • Why do Other_Faults have such different characteristics?")
print(f"   • 為什麼Other_Faults具有如此不同的特性？")
print(f"   • Are there hidden patterns within the Other_Faults category?")
print(f"   • Other_Faults類別內是否存在隱藏的模式？")
print(f"   • Can we develop better classification methods?")
print(f"   • 我們能否開發更好的分類方法？")

print(f"\n🚀 Next Steps | 下一步計畫:")
print(f"   1. Deep dive into visual analysis - create distribution plots")
print(f"   1. 深入視覺化分析 - 建立分布圖")
print(f"   2. Perform clustering analysis on Other_Faults samples")
print(f"   2. 對Other_Faults樣本進行聚類分析")
print(f"   3. Analyze correlations with other defect types")
print(f"   3. 分析與其他缺陷類型的相關性")
print(f"   4. Investigate the root cause of these differences")
print(f"   4. 調查這些差異的根本原因")

print(f"\n💡 Initial Hypothesis Status | 初步假設狀態:")
if mean_diff_pct > 30:
    status = "STRONGLY SUPPORTED | 強力支持"
elif mean_diff_pct > 15:
    status = "PARTIALLY SUPPORTED | 部分支持"
else:
    status = "NEEDS REVISION | 需要修正"
    
print(f"   Thickness-related hypothesis: {status}")
print(f"   厚度相關假設: {status}")
print(f"   But we need deeper analysis to understand the full picture...")
print(f"   但我們需要更深入的分析來了解全貌...")

---

## 🎯 End of Phase 1 | 第一階段結束

This initial exploration has confirmed that Other_Faults is a significant challenge, representing over one-third of all defects. We've also compared Other_Faults samples with samples of the named defect types on thickness and other key features. The size of those differences is reported by the cells above, and a gap between group means alone does not establish a statistically significant difference.

這次初步探索確認了Other_Faults確實是個重大挑戰，佔所有缺陷的三分之一以上。我們也比較了Other_Faults樣本與具名缺陷類型樣本在厚度及其他關鍵特徵上的差異；差異的大小以上方各儲存格的輸出為準，而僅憑群組平均值的差距並不足以證明統計上的顯著差異。

However, statistical averages can sometimes be misleading. In the next notebook (`02_visual_insights.ipynb`), we'll create detailed visualizations to better understand the true distribution patterns and see if our statistical findings hold up under visual scrutiny.

然而，統計平均值有時可能會誤導。在下一個notebook (`02_visual_insights.ipynb`)中，我們將建立詳細的視覺化圖表，更好地理解真實的分布模式，並檢驗統計發現是否經得起視覺化驗證。

**The mystery deepens, but we're on the right track!** 🔍
**謎團加深了，但我們走在正確的道路上！** 🔍

---

## 💼 Skills Demonstrated | 展示技能

**Data Analysis Skills | 數據分析技能:**
- 📊 Data Loading & Preprocessing | 資料載入與前處理
- 🔍 Exploratory Data Analysis (EDA) | 探索性資料分析
- 📈 Statistical Analysis & Hypothesis Testing | 統計分析與假設驗證
- 🎯 Business Problem Identification | 業務問題識別
- 📋 Project Planning & Phase Management | 專案規劃與階段管理

**QA Engineering Perspective | 品保工程師觀點:**
- 🔧 Applied quality management thinking to identify key issues | 運用品質管理思維識別關鍵問題
- 📊 Combined manufacturing experience with data interpretation | 結合製造業經驗解讀資料模式
- 🎯 Focused on actionable business problems | 聚焦於可執行的業務問題
- 📈 Used scientific hypothesis-driven approach | 採用假設驗證的科學方法

---

*Continue to: [02_visual_insights.ipynb](./02_visual_insights.ipynb)*