# 📈 Visual Insights: When Charts Contradict Statistics
## Phase 2: The First Turning Point - Discovering the Truth Behind Averages

**Author**: Yu-Ching, CHou | QA Engineer 
**Date**: 2025-07-28  
**Previous Phase**: Initial exploration suggested thickness-related hypothesis  
**Current Objective**: Visualize distributions to validate our statistical findings  

---

## 🔙 Recap: What We Found in Phase 1

Our initial exploration revealed some interesting patterns:

### **Key Statistics from Phase 1:**
- **Other_Faults**: 34.7% of all defects (673 samples)
- **Average thickness difference**: 55.3% (Other_Faults: 102.6mm vs Normal: 66.1mm)
- **Initial conclusion**: Other_Faults might be related to thicker steel plates

### **The Question That Emerged:**
*If the statistics show such a clear thickness difference, why do I have a nagging feeling that something doesn't add up?*

## 💡 The Power of Visualization

As a quality engineer, I've learned that **numbers can lie, but distributions don't**. Statistical averages can be misleading, especially when dealing with skewed distributions or outliers. 
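
To make that claim concrete, here is a minimal sketch with made-up numbers (they are not taken from the steel plates dataset). It assumes a hypothetical batch of 80 thin plates and 20 very thick ones, and shows how far the mean can drift from the value a typical sample actually has.

```python
from statistics import mean, median

# Synthetic, made-up thicknesses (NOT from the dataset):
# 80 thin plates and 20 very thick ones.
thicknesses = [40] * 80 + [300] * 20

avg = mean(thicknesses)    # pulled upward by the 20 thick plates
mid = median(thicknesses)  # the thickness of a typical plate in this batch

print(f"mean = {avg}, median = {mid}")
```

The mean lands at a thickness that no plate in this toy batch actually has, while the median stays with the majority. That gap is the kind of signal the plots below are meant to expose.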

Let's create detailed visualizations to see if our statistical findings hold up under visual scrutiny.

> *"The greatest value of a picture is when it forces us to notice what we never expected to see."* - John Tukey

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style for professional visualizations
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
sns.set_palette("husl")

# Ensure reproducibility
np.random.seed(42)

print("✅ Visualization environment ready!")
print("📊 Let's see what the data really looks like...")

In [None]:
# Reload data (for notebook independence)
print("📥 Reloading dataset for analysis...")

steel_plates_faults = fetch_ucirepo(id=198)
X = steel_plates_faults.data.features 
y = steel_plates_faults.data.targets
df = pd.concat([X, y], axis=1)

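# Separate samples. Every row in this dataset is a faulty plate, so "normal"
# here means "any fault class other than Other_Faults".
# .copy() lets us add columns later without writing to a view of df.
other_faults_data = df[df['Other_Faults'] == 1].copy()
normal_data = df[df['Other_Faults'] == 0].copy()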

print(f"✅ Data loaded: {len(df)} total samples")
print(f"📊 Other_Faults: {len(other_faults_data)} samples")
print(f"📊 Normal: {len(normal_data)} samples")

## 📊 The Truth Revelation: Steel Plate Thickness Distribution

Let's start with the most important visualization - the thickness distribution that challenged my initial hypothesis.

In [None]:
# Create comprehensive thickness distribution analysis
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 14))
fig.suptitle('Steel Plate Thickness Analysis: Statistics vs Reality', fontsize=16, fontweight='bold')

# 1. Overlapping histograms
ax1.hist(normal_data['Steel_Plate_Thickness'], bins=30, alpha=0.7, 
         label=f'Normal Samples (n={len(normal_data)})', color='lightblue', density=True)
ax1.hist(other_faults_data['Steel_Plate_Thickness'], bins=30, alpha=0.7, 
         label=f'Other_Faults (n={len(other_faults_data)})', color='red', density=True)

# Add mean lines
normal_mean = normal_data['Steel_Plate_Thickness'].mean()
of_mean = other_faults_data['Steel_Plate_Thickness'].mean()

ax1.axvline(normal_mean, color='blue', linestyle='--', linewidth=2, 
           label=f'Normal Mean: {normal_mean:.1f}mm')
ax1.axvline(of_mean, color='darkred', linestyle='--', linewidth=2,
           label=f'Other_Faults Mean: {of_mean:.1f}mm')

ax1.set_title('Thickness Distribution: The Statistical Story', fontsize=12, fontweight='bold')
ax1.set_xlabel('Steel Plate Thickness (mm)')
ax1.set_ylabel('Density')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Box plots for better distribution understanding
thickness_data = [normal_data['Steel_Plate_Thickness'], other_faults_data['Steel_Plate_Thickness']]
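box_plot = ax2.boxplot(thickness_data, patch_artist=True, notch=True)
# Set tick labels explicitly; boxplot's `labels` keyword is deprecated in newer Matplotlib
ax2.set_xticklabels(['Normal', 'Other_Faults'])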
box_plot['boxes'][0].set_facecolor('lightblue')
box_plot['boxes'][1].set_facecolor('red')
box_plot['boxes'][1].set_alpha(0.7)

ax2.set_title('Box Plot: Distribution Shape Comparison', fontsize=12, fontweight='bold')
ax2.set_ylabel('Steel Plate Thickness (mm)')
ax2.grid(True, alpha=0.3)

# 3. Separate histograms with focus on frequency
ax3.hist(other_faults_data['Steel_Plate_Thickness'], bins=25, alpha=0.8, 
         color='red', edgecolor='darkred')
ax3.set_title('Other_Faults Only: Where Do They Really Concentrate?', fontsize=12, fontweight='bold')
ax3.set_xlabel('Steel Plate Thickness (mm)')
ax3.set_ylabel('Frequency (Count)')
ax3.grid(True, alpha=0.3)

# Add median line
of_median = other_faults_data['Steel_Plate_Thickness'].median()
ax3.axvline(of_median, color='darkred', linestyle=':', linewidth=3,
           label=f'Median: {of_median:.1f}mm')
ax3.legend()

# 4. Cumulative distribution
normal_sorted = np.sort(normal_data['Steel_Plate_Thickness'])
of_sorted = np.sort(other_faults_data['Steel_Plate_Thickness'])

ax4.plot(normal_sorted, np.arange(1, len(normal_sorted)+1)/len(normal_sorted), 
         label='Normal Samples', color='blue', linewidth=2)
ax4.plot(of_sorted, np.arange(1, len(of_sorted)+1)/len(of_sorted), 
         label='Other_Faults', color='red', linewidth=2)

ax4.set_title('Cumulative Distribution: The Full Picture', fontsize=12, fontweight='bold')
ax4.set_xlabel('Steel Plate Thickness (mm)')
ax4.set_ylabel('Cumulative Probability')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed statistics
print("📊 DETAILED THICKNESS ANALYSIS")
print("=" * 50)
print(f"\n📈 Statistical Summary:")
print(f"   Other_Faults Mean: {of_mean:.1f}mm")
print(f"   Normal Mean: {normal_mean:.1f}mm")
print(f"   Difference: {of_mean - normal_mean:.1f}mm ({(of_mean-normal_mean)/normal_mean*100:.1f}%)")

print(f"\n📊 Distribution Reality:")
print(f"   Other_Faults Median: {of_median:.1f}mm")
print(f"   Normal Median: {normal_data['Steel_Plate_Thickness'].median():.1f}mm")
# Approximate mode: round to whole millimetres, then take the most frequent value
of_mode = other_faults_data['Steel_Plate_Thickness'].round().mode().iloc[0]
print(f"   Other_Faults Mode (approx): {of_mode:.0f}mm")
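
The cumulative panel above is built by sorting the values and pairing each one with its rank fraction. As a small sketch of that same idea, the helper below (the name `ecdf` and the sample values are my own, purely for illustration) wraps the computation and shows how to read off the share of samples at or below a threshold.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical thicknesses, not from the dataset
x, y = ecdf([40, 300, 50, 40])

# Share of samples at or below 60: count how many sorted values are <= 60
share_at_or_below_60 = np.searchsorted(x, 60, side="right") / len(x)
print(share_at_or_below_60)
```

Reading the curve at a threshold this way answers "what fraction of plates is this thin or thinner?" directly, which a mean cannot do.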

## 🚨 The First Major Discovery: Distribution vs Statistics

**Wait a minute!** Looking at the histograms, something doesn't match our statistical conclusion...

In [None]:
# Detailed frequency analysis to highlight the contradiction
print("🔍 FREQUENCY ANALYSIS: Where do Other_Faults Really Occur?")
print("=" * 60)

# Define thickness categories
def categorize_thickness(thickness):
    if thickness <= 60:
        return "Thin Plates (≤60mm)"
    elif thickness <= 100:
        return "Medium Plates (60-100mm)"
    elif thickness <= 150:
        return "Thick Plates (100-150mm)"
    else:
        return "Very Thick Plates (>150mm)"

# Apply categorization
other_faults_data['Thickness_Category'] = other_faults_data['Steel_Plate_Thickness'].apply(categorize_thickness)
normal_data['Thickness_Category'] = normal_data['Steel_Plate_Thickness'].apply(categorize_thickness)

# Count by category for Other_Faults
of_thickness_dist = other_faults_data['Thickness_Category'].value_counts()
normal_thickness_dist = normal_data['Thickness_Category'].value_counts()

print(f"Other_Faults Distribution by Thickness Category:")
for category, count in of_thickness_dist.items():
    percentage = count / len(other_faults_data) * 100
    print(f"   {category:<25}: {count:3d} samples ({percentage:5.1f}%)")

print(f"\nNormal Samples Distribution by Thickness Category:")
for category, count in normal_thickness_dist.items():
    percentage = count / len(normal_data) * 100
    print(f"   {category:<25}: {count:3d} samples ({percentage:5.1f}%)")

# Find the most frequent category for Other_Faults
most_frequent_category = of_thickness_dist.idxmax()
most_frequent_count = of_thickness_dist.max()
most_frequent_pct = most_frequent_count / len(other_faults_data) * 100

print(f"\n🎯 KEY OBSERVATION:")
print(f"   Most frequent Other_Faults category: {most_frequent_category}")
print(f"   Count: {most_frequent_count} samples ({most_frequent_pct:.1f}%)")

# The contradiction!
print(f"\n🚨 THE CONTRADICTION:")
print(f"   📊 Statistics say: Other_Faults average {of_mean:.1f}mm (suggesting thick plates)")
print(f"   📈 Distribution shows: Most Other_Faults are in {most_frequent_category}!")
print(f"   🤔 How can this be?")
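
The same binning can also be expressed with `pd.cut` instead of a row-by-row `apply`. This is only a sketch of an alternative, not a change to the analysis above: the cut-offs mirror `categorize_thickness`, and the sample values are hypothetical ones picked to sit on and around the bin edges.

```python
import numpy as np
import pandas as pd

# Same cut-offs as categorize_thickness; right=True makes each bin include its upper edge
bins = [-np.inf, 60, 100, 150, np.inf]
labels = [
    "Thin Plates (≤60mm)",
    "Medium Plates (60-100mm)",
    "Thick Plates (100-150mm)",
    "Very Thick Plates (>150mm)",
]

# Hypothetical values (not from the dataset) chosen to test the bin edges
sample = pd.Series([40, 60, 61, 100, 150, 151, 300])
binned = pd.cut(sample, bins=bins, labels=labels, right=True)

# The result is an ordered categorical, so sort_index() lists thin-to-thick
counts = binned.value_counts().sort_index()
print(counts)
```

One practical benefit is that the categories carry their order, so tables and bar charts come out thin-to-thick rather than sorted by count.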

## 🔍 Deep Dive: Understanding the Distribution Pattern

Let's explore this contradiction further with more detailed visualizations.

In [None]:
# Create a detailed visualization to show the distribution pattern
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))

# 1. Grouped bar chart by thickness category
categories = ['Thin Plates\n(≤60mm)', 'Medium Plates\n(60-100mm)', 'Thick Plates\n(100-150mm)', 'Very Thick Plates\n(>150mm)']

# Tick labels match the category names except for the line break, so swap it back to look up counts
of_counts = [of_thickness_dist.get(cat.replace('\n', ' '), 0) for cat in categories]
normal_counts = [normal_thickness_dist.get(cat.replace('\n', ' '), 0) for cat in categories]

x = np.arange(len(categories))
width = 0.35

bars1 = ax1.bar(x - width/2, normal_counts, width, label='Normal Samples', 
                color='lightblue', alpha=0.8)
bars2 = ax1.bar(x + width/2, of_counts, width, label='Other_Faults', 
                color='red', alpha=0.8)

ax1.set_title('Sample Distribution by Thickness Category\n(The Frequency Truth)', 
              fontsize=14, fontweight='bold')
ax1.set_xlabel('Thickness Category')
ax1.set_ylabel('Number of Samples')
ax1.set_xticks(x)
ax1.set_xticklabels(categories)
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        if height > 0:
            ax1.text(bar.get_x() + bar.get_width()/2., height + 5,
                     f'{int(height)}', ha='center', va='bottom', fontweight='bold')

# 2. Percentage composition
of_percentages = [count/len(other_faults_data)*100 for count in of_counts]
normal_percentages = [count/len(normal_data)*100 for count in normal_counts]

bars3 = ax2.bar(x - width/2, normal_percentages, width, label='Normal Samples', 
                color='lightblue', alpha=0.8)
bars4 = ax2.bar(x + width/2, of_percentages, width, label='Other_Faults', 
                color='red', alpha=0.8)

ax2.set_title('Percentage Distribution by Thickness Category\n(Revealing the Pattern)', 
              fontsize=14, fontweight='bold')
ax2.set_xlabel('Thickness Category')
ax2.set_ylabel('Percentage of Samples (%)')
ax2.set_xticks(x)
ax2.set_xticklabels(categories)
ax2.legend()
ax2.grid(True, alpha=0.3, axis='y')

# Add percentage labels
for bars, percentages in [(bars3, normal_percentages), (bars4, of_percentages)]:
    for bar, pct in zip(bars, percentages):
        if pct > 0:
            ax2.text(bar.get_x() + bar.get_width()/2., pct + 1,
                     f'{pct:.1f}%', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Highlight the key finding
max_of_category = categories[np.argmax(of_percentages)]
max_of_pct = max(of_percentages)

print(f"\n🎯 VISUAL REVELATION:")
print(f"   The bar chart shows Other_Faults concentrate in: {max_of_category.replace(chr(10), ' ')}")
print(f"   This category represents {max_of_pct:.1f}% of all Other_Faults")
print(f"   But the statistical mean suggested thick plates!")
print(f"\n💡 LESSON LEARNED:")
print(f"   📊 Statistical averages can be misleading when distributions are skewed")
print(f"   📈 For skewed data, the shape of the distribution tells you more than the mean alone")
print(f"   🔍 Always visualize your data: a single summary statistic can hide the shape!")
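
One way to see how a mean can point at thick plates while most samples are thin is to split the overall mean into per-category contributions: the overall mean equals the sum of (share of samples × category mean). The sketch below uses hypothetical numbers, not the dataset, and the column names `category` and `thickness` are placeholders of my own.

```python
import pandas as pd

# Hypothetical data (NOT from the dataset): 70 thin plates and 30 very thick plates
demo = pd.DataFrame({
    "category": ["thin"] * 70 + ["very_thick"] * 30,
    "thickness": [40.0] * 70 + [200.0] * 30,
})

grouped = demo.groupby("category")["thickness"]
share = grouped.size() / len(demo)   # fraction of samples per category
group_mean = grouped.mean()          # mean thickness per category
contribution = share * group_mean    # each category's part of the overall mean

overall_mean = demo["thickness"].mean()
print(contribution)
print(f"overall mean = {overall_mean:.1f}")
```

In this toy case the minority of very thick plates contributes more to the mean than the thin majority does. Running the same decomposition on `other_faults_data` grouped by `Thickness_Category` would show whether that is what happens in the real data.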

## 📊 Extended Visual Analysis: Other Key Features

Now that we've discovered the thickness distribution truth, let's examine other key features to see if similar patterns exist.

In [None]:
# Analyze other key features with the same visual approach
key_features = ['Sum_of_Luminosity', 'Pixels_Areas', 'X_Perimeter', 'Y_Perimeter']

fig, axes = plt.subplots(2, 2, figsize=(18, 14))
fig.suptitle('Other Key Features: Statistical vs Visual Analysis', fontsize=16, fontweight='bold')

for i, feature in enumerate(key_features):
    row = i // 2
    col = i % 2
    ax = axes[row, col]
    
    # Create overlapping histograms
    ax.hist(normal_data[feature], bins=30, alpha=0.7, 
            label='Normal Samples', color='lightblue', density=True)
    ax.hist(other_faults_data[feature], bins=30, alpha=0.7, 
            label='Other_Faults', color='red', density=True)
    
    # Add mean lines
    # Use feature-specific names so the thickness means from earlier cells are not overwritten
    feat_normal_mean = normal_data[feature].mean()
    feat_of_mean = other_faults_data[feature].mean()
    
    ax.axvline(feat_normal_mean, color='blue', linestyle='--', linewidth=2, alpha=0.8)
    ax.axvline(feat_of_mean, color='darkred', linestyle='--', linewidth=2, alpha=0.8)
    
    # Calculate difference percentage
    diff_pct = abs(feat_of_mean - feat_normal_mean) / feat_normal_mean * 100
    
    ax.set_title(f'{feature}\nDifference: {diff_pct:.1f}%', fontsize=12, fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Add statistics text
    ax.text(0.02, 0.98, f'Normal Mean: {feat_normal_mean:.0f}\nOF Mean: {feat_of_mean:.0f}', 
            transform=ax.transAxes, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

print("📊 FEATURE ANALYSIS SUMMARY:")
print("=" * 50)
for feature in key_features:
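    # Feature-specific names keep the thickness means from earlier cells intact
    feat_normal_mean = normal_data[feature].mean()
    feat_of_mean = other_faults_data[feature].mean()
    diff_pct = abs(feat_of_mean - feat_normal_mean) / feat_normal_mean * 100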
    
    print(f"{feature:20s}: {diff_pct:6.1f}% difference")
    
    # Check if visual pattern might be different from statistical pattern
    if diff_pct > 50:
        print(f"                     ⚠️  Large statistical difference - verify with visual analysis!")
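
The summary above compares means only, which is the very habit this phase warns about. As a hedged sketch of a complementary check, the helper below (the name `center_differences` and the sample values are hypothetical) reports the percent difference by mean and by median side by side, so a large gap between the two can flag a skewed feature.

```python
import numpy as np

def center_differences(group, reference):
    """Percent difference of group vs reference, by mean and by median."""
    group = np.asarray(group, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mean_pct = (group.mean() - reference.mean()) / reference.mean() * 100
    median_pct = (np.median(group) - np.median(reference)) / np.median(reference) * 100
    return mean_pct, median_pct

# Hypothetical values: same typical value, but one group has a long right tail
reference = [10, 10, 10, 10, 10]
group = [10, 10, 10, 10, 110]

mean_pct, median_pct = center_differences(group, reference)
print(f"mean difference: {mean_pct:.1f}%, median difference: {median_pct:.1f}%")
```

When the two percentages disagree strongly, as they do here, the mean-based difference is mostly describing the tail and the histogram deserves a closer look.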

## 🎯 Phase 2 Conclusions: The Visualization Lesson

This phase has taught us a crucial lesson about data analysis and revealed an important contradiction.

In [None]:
print("📋 PHASE 2: VISUAL INSIGHTS SUMMARY")
print("=" * 60)

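# Recompute the thickness mean here so this cell does not depend on variables reused elsewhere
thickness_of_mean = other_faults_data['Steel_Plate_Thickness'].mean()

print(f"\n🔍 Major Discovery:")
print(f"   📊 Statistical Analysis Said: Other_Faults tend toward thicker plates (mean: {thickness_of_mean:.1f}mm)")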
print(f"   📈 Visual Analysis Revealed: Other_Faults actually concentrate in {most_frequent_category}")
print(f"   🤯 This contradiction demands deeper investigation!")

print(f"\n💡 Key Learnings:")
print(f"   1. Statistical averages can be misleading with skewed distributions")
print(f"   2. Visualization reveals patterns that statistics might hide")
print(f"   3. Distribution shape is often more important than central tendency")
print(f"   4. Always question your initial findings - data exploration is iterative")

print(f"\n🔬 What This Means for Our Analysis:")
print(f"   • We need to reconsider our thickness hypothesis")
print(f"   • The real patterns might be more complex than we initially thought")
print(f"   • Other_Faults might not be simply about manufacturing processes")

print(f"\n❓ New Questions Emerged:")
print(f"   • Why do statistics and distributions tell different stories?")
print(f"   • What's causing the concentration in thin-to-medium plates?")
print(f"   • Are there hidden relationships we haven't discovered yet?")

print(f"\n🚀 Next Phase Preview:")
print(f"   We need to dive deeper into the relationships between Other_Faults")
print(f"   and other defect types. Maybe correlation analysis will provide")
print(f"   more clues about what's really happening...")

print(f"\n🎯 The Most Important Lesson:")
print(f"   In data science, the moment you think you understand the data")
print(f"   is exactly when you need to look deeper. Contradictions are")
print(f"   not problems - they're opportunities for breakthrough insights!")

---

## 🎯 End of Phase 2: The Plot Thickens

This phase has fundamentally changed our understanding of the data. What started as a confirmation exercise for our thickness hypothesis has turned into a revelation about the misleading nature of statistical averages.

**The contradiction we discovered raises a crucial question**: If Other_Faults don't simply correlate with thickness as statistics suggested, what IS the real pattern?

The distribution clearly shows concentration in thin-to-medium plates, but the statistical mean suggests otherwise. This paradox hints at something deeper - perhaps the relationship between Other_Faults and other variables is more complex than we initially imagined.

In the next notebook (`03_correlation_mystery.ipynb`), we'll explore the correlations between Other_Faults and other defect types. Maybe these relationships will provide the missing piece of the puzzle...

**The mystery deepens, but we're getting closer to the truth!** 🔍

---

*Continue to: [03_correlation_mystery.ipynb](./03_correlation_mystery.ipynb)*