# ❓ The Correlation Mystery: When Data Doesn't Make Sense
## Phase 3: The Deepest Puzzle - Negative Correlations and Logical Contradictions

**Author**: Yu-Ching, Chou | QA Engineer  
**Date**: 2025-07-28  
**Previous Discovery**: Visual analysis contradicted statistical averages  
**Current Mission**: Understand the relationships between defect types  

---

## 🔙 The Story So Far

Our journey has taken an unexpected turn:

### **Phase 1 Discovery:**
- Other_Faults represents 34.7% of all defects (highest category)
- Statistical analysis suggested thickness-related patterns

### **Phase 2 Revelation:**
- Visual analysis revealed the **contradiction**: statistics said "thick plates," but distributions showed **concentration in thin-medium plates**
- Learned that averages can be misleading with skewed distributions

### **The Current Puzzle:**
If Other_Faults are concentrated in the same thickness ranges as other defects (like K_Scratch and Bumps), shouldn't they show **positive correlations**? 

Let's investigate the relationships between different defect types and see if we can solve this mystery...

> *"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'"* - Isaac Asimov

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
from scipy import stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("RdBu_r")

# Ensure reproducibility
np.random.seed(42)

print("✅ Correlation analysis environment ready!")
print("🔍 Time to uncover the relationship mysteries...")

In [None]:
# Load data (keeping notebooks independent)
print("📥 Loading dataset for correlation analysis...")

steel_plates_faults = fetch_ucirepo(id=198)
X = steel_plates_faults.data.features 
y = steel_plates_faults.data.targets
df = pd.concat([X, y], axis=1)

# Define defect columns
defect_columns = ['Pastry', 'Z_Scratch', 'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']

print(f"✅ Dataset loaded: {len(df)} samples")
print(f"🎯 Analyzing {len(defect_columns)} defect types")
print(f"📊 Defect columns: {defect_columns}")

## 🔍 Initial Correlation Analysis: The Shocking Discovery

Let's start by examining the correlations between all defect types, with special focus on Other_Faults.
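
One thing worth keeping in mind before we read the matrix: every defect column is a 0/1 flag, and Pearson correlation between two 0/1 columns is the same number as the phi coefficient of their 2x2 table. The sketch below checks this on synthetic flags (not the steel plates data). The names `phi_coefficient`, `toy`, `flag_a`, and `flag_b` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def phi_coefficient(a: pd.Series, b: pd.Series) -> float:
    """Phi coefficient for two 0/1 series, computed from the 2x2 cell counts."""
    n11 = int(((a == 1) & (b == 1)).sum())
    n10 = int(((a == 1) & (b == 0)).sum())
    n01 = int(((a == 0) & (b == 1)).sum())
    n00 = int(((a == 0) & (b == 0)).sum())
    # Product of the four marginal totals (row totals times column totals)
    denom = np.sqrt(float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else float("nan")


# Synthetic toy flags (an assumption for illustration, NOT the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "flag_a": rng.integers(0, 2, size=500),
    "flag_b": rng.integers(0, 2, size=500),
})

pearson_r = toy["flag_a"].corr(toy["flag_b"])
phi = phi_coefficient(toy["flag_a"], toy["flag_b"])

print(f"Pearson r: {pearson_r:.6f}")
print(f"Phi:       {phi:.6f}")
```

In other words, `.corr()` on indicator columns measures how often two flags coincide on the same sample, not how similar the affected plates are in thickness or any other feature.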

In [None]:
# Calculate correlation matrix for all defect types
defect_correlations = df[defect_columns].corr()

print("📊 DEFECT CORRELATION MATRIX")
print("=" * 50)
print(defect_correlations.round(3))

# Focus on Other_Faults correlations
other_faults_corr = defect_correlations['Other_Faults'].drop('Other_Faults')

print(f"\n🎯 OTHER_FAULTS CORRELATIONS WITH OTHER DEFECTS:")
print("=" * 55)
for defect, correlation in other_faults_corr.items():
    direction = "📈 Positive" if correlation > 0 else "📉 Negative" if correlation < 0 else "➡️ Zero"
    strength = "Strong" if abs(correlation) > 0.3 else "Moderate" if abs(correlation) > 0.1 else "Weak"
    print(f"Other_Faults vs {defect:12s}: {correlation:6.3f} ({direction}, {strength})")

# Count negative correlations
negative_correlations = int((other_faults_corr < 0).sum())
total_correlations = len(other_faults_corr)

print(f"\n🚨 SHOCKING OBSERVATION:")
print(f"   {negative_correlations} out of {total_correlations} correlations are NEGATIVE!")
print(f"   That's {negative_correlations/total_correlations*100:.0f}% negative correlations!")
print(f"\n🤔 This raises a fundamental question...")

In [None]:
# Create a comprehensive correlation heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# 1. Full correlation heatmap
mask = np.triu(np.ones_like(defect_correlations, dtype=bool))
sns.heatmap(defect_correlations, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.3f', cbar_kws={"shrink": .8}, ax=ax1)
ax1.set_title('Defect Types Correlation Matrix\n(Lower Triangle Only)', fontsize=14, fontweight='bold')

# 2. Focus on Other_Faults correlations
other_faults_data_viz = other_faults_corr.values.reshape(-1, 1)
labels = other_faults_corr.index.tolist()

im = ax2.imshow(other_faults_data_viz, cmap='RdBu_r', aspect='auto', vmin=-0.5, vmax=0.5)
ax2.set_xticks([])
ax2.set_yticks(range(len(labels)))
ax2.set_yticklabels(labels)
ax2.set_title('Other_Faults Correlations\n(All Negative!)', fontsize=14, fontweight='bold')

# Add correlation values as text
for i, (label, corr) in enumerate(zip(labels, other_faults_corr.values)):
    color = 'white' if abs(corr) > 0.2 else 'black'
    ax2.text(0, i, f'{corr:.3f}', ha='center', va='center', 
             color=color, fontweight='bold', fontsize=12)

# Add colorbar for the second plot
cbar = plt.colorbar(im, ax=ax2, shrink=0.8)
cbar.set_label('Correlation Coefficient', rotation=270, labelpad=15)

plt.tight_layout()
plt.show()

# Highlight the most negative correlations
most_negative = other_faults_corr.nsmallest(3)
print(f"\n🔍 MOST NEGATIVE CORRELATIONS:")
for i, (defect, corr) in enumerate(most_negative.items(), 1):
    print(f"   {i}. Other_Faults vs {defect}: {corr:.3f}")

## 🤔 The Logical Contradiction: This Doesn't Make Sense!

Wait a minute... Let me think about this logically:

In [None]:
print("🧠 LOGICAL REASONING ANALYSIS")
print("=" * 50)

# Let's examine the thickness distributions again with this new information
other_faults_samples = df[df['Other_Faults'] == 1]
k_scratch_samples = df[df['K_Scratch'] == 1]
bumps_samples = df[df['Bumps'] == 1]

print(f"🔍 SAMPLE SIZES:")
print(f"   Other_Faults: {len(other_faults_samples)} samples ({len(other_faults_samples)/len(df)*100:.1f}%)")
print(f"   K_Scratch: {len(k_scratch_samples)} samples ({len(k_scratch_samples)/len(df)*100:.1f}%)")
print(f"   Bumps: {len(bumps_samples)} samples ({len(bumps_samples)/len(df)*100:.1f}%)")

# Check thickness distributions
print(f"\n📏 THICKNESS COMPARISONS:")
print(f"   Other_Faults median thickness: {other_faults_samples['Steel_Plate_Thickness'].median():.1f}mm")
print(f"   K_Scratch median thickness: {k_scratch_samples['Steel_Plate_Thickness'].median():.1f}mm")
print(f"   Bumps median thickness: {bumps_samples['Steel_Plate_Thickness'].median():.1f}mm")

# The contradiction!
print(f"\n🚨 THE CONTRADICTION:")
print(f"   📊 OBSERVATION 1: Other_Faults and K_Scratch both concentrate in similar thickness ranges")
print(f"   📊 OBSERVATION 2: Other_Faults vs K_Scratch correlation = {other_faults_corr['K_Scratch']:.3f} (NEGATIVE!)")
print(f"   📊 OBSERVATION 3: Other_Faults vs Bumps correlation = {other_faults_corr['Bumps']:.3f} (NEGATIVE!)")

print(f"\n🤔 LOGICAL PROBLEM:")
print(f"   IF two defect types occur in similar conditions (same thickness ranges),")
print(f"   THEN shouldn't they show POSITIVE correlation?")
print(f"   BUT we see strong NEGATIVE correlations instead!")

print(f"\n❓ CRITICAL QUESTIONS:")
print(f"   1. Why are correlations negative when distributions overlap?")
print(f"   2. Do these defects rarely occur together on the same sample?")
print(f"   3. Are we missing something fundamental about the data?")

## 🔍 Deep Dive: Co-occurrence Analysis

Let's investigate whether these defects actually occur together or if they're mutually exclusive.
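
The comparison rests on one formula: if two defects were independent, the expected number of samples showing both would be `count_1 * count_2 / N`. Here is a minimal sketch on a hand-made toy table (not the real data; `toy`, `flag_a`, and `flag_b` are made-up names), using `pd.crosstab` to build the 2x2 table in one call:

```python
import pandas as pd

# Hand-made toy labels (NOT the steel plates data): 10 samples, two 0/1 flags
toy = pd.DataFrame({
    "flag_a": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "flag_b": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
})

# 2x2 table of observed counts: rows = flag_a, columns = flag_b
observed = pd.crosstab(toy["flag_a"], toy["flag_b"])

# Under independence P(a and b) = P(a) * P(b), so expected count = n_a * n_b / N
n_a = int(toy["flag_a"].sum())
n_b = int(toy["flag_b"].sum())
expected_both = n_a * n_b / len(toy)
actual_both = int(observed.loc[1, 1])

print(observed)
print(f"Expected both (if independent): {expected_both:.1f}")
print(f"Actual both: {actual_both}")
```

In this toy table no row carries both flags, so the actual count is 0 against an expected 1.2. A ratio well below 1 signals that two flags avoid each other; a ratio of exactly 0 means they never appear together in the sample.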

In [None]:
# Analyze co-occurrence patterns
print("🔬 CO-OCCURRENCE ANALYSIS")
print("=" * 50)

# Focus on the most interesting pairs
key_defects = ['Other_Faults', 'K_Scratch', 'Bumps']

for i, defect1 in enumerate(key_defects):
    for defect2 in key_defects[i+1:]:
        # Create contingency table
        both_present = ((df[defect1] == 1) & (df[defect2] == 1)).sum()
        only_defect1 = ((df[defect1] == 1) & (df[defect2] == 0)).sum()
        only_defect2 = ((df[defect1] == 0) & (df[defect2] == 1)).sum()
        neither = ((df[defect1] == 0) & (df[defect2] == 0)).sum()
        
        total_defect1 = (df[defect1] == 1).sum()
        total_defect2 = (df[defect2] == 1).sum()
        
        # Calculate expected co-occurrence if independent
        expected_both = (total_defect1 * total_defect2) / len(df)
        
        print(f"\n📊 {defect1} vs {defect2}:")
        print(f"   Both present: {both_present} samples")
        print(f"   Only {defect1}: {only_defect1} samples")
        print(f"   Only {defect2}: {only_defect2} samples")
        print(f"   Neither: {neither} samples")
        print(f"   Expected co-occurrence (if independent): {expected_both:.1f}")
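        # Guard the ratio in case a defect never occurs (expected count of 0)
        ratio = both_present / expected_both if expected_both > 0 else 0
        print(f"   Actual vs Expected ratio: {ratio:.2f}")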
        
        if both_present < expected_both * 0.5:
            print(f"   🚨 MUTUAL EXCLUSIVITY DETECTED! Much fewer co-occurrences than expected")
        elif both_present > expected_both * 1.5:
            print(f"   🤝 POSITIVE ASSOCIATION: More co-occurrences than expected")
        else:
            print(f"   ➡️ INDEPENDENCE: Co-occurrence close to expected")

In [None]:
# Create a visualization of co-occurrence patterns
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

defect_pairs = [('Other_Faults', 'K_Scratch'), ('Other_Faults', 'Bumps'), ('K_Scratch', 'Bumps')]

for idx, (defect1, defect2) in enumerate(defect_pairs):
    # Create 2x2 contingency table
    both_present = ((df[defect1] == 1) & (df[defect2] == 1)).sum()
    only_defect1 = ((df[defect1] == 1) & (df[defect2] == 0)).sum()
    only_defect2 = ((df[defect1] == 0) & (df[defect2] == 1)).sum()
    neither = ((df[defect1] == 0) & (df[defect2] == 0)).sum()
    
    # Create contingency matrix
    contingency = np.array([[neither, only_defect2], [only_defect1, both_present]])
    
    # Plot heatmap
    sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues',
                xticklabels=[f'No {defect2}', f'{defect2}'],
                yticklabels=[f'No {defect1}', f'{defect1}'],
                ax=axes[idx])
    
    axes[idx].set_title(f'{defect1} vs {defect2}\nCo-occurrence Matrix', 
                        fontsize=12, fontweight='bold')
    
    # Calculate correlation for reference
    correlation = df[defect1].corr(df[defect2])
    axes[idx].text(0.5, -0.15, f'Correlation: {correlation:.3f}', 
                   transform=axes[idx].transAxes, ha='center',
                   bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

# Summary of findings
print("\n📋 CO-OCCURRENCE SUMMARY:")
print("=" * 40)

for defect1, defect2 in defect_pairs:
    both_present = ((df[defect1] == 1) & (df[defect2] == 1)).sum()
    total_defect1 = (df[defect1] == 1).sum()
    total_defect2 = (df[defect2] == 1).sum()
    expected_both = (total_defect1 * total_defect2) / len(df)
    
    ratio = both_present / expected_both if expected_both > 0 else 0
    
    print(f"{defect1} & {defect2}:")
    print(f"   Actual co-occurrence: {both_present}")
    print(f"   Expected if independent: {expected_both:.1f}")
    print(f"   Ratio: {ratio:.2f} {'(Mutually exclusive!)' if ratio < 0.5 else '(Independent)' if ratio < 1.5 else '(Positively associated)'}")
    print()

## 🔍 Statistical Significance Testing

Let's use statistical tests to confirm whether these patterns are significant.
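
Two details of `chi2_contingency` are easy to miss. For 2x2 tables it applies Yates' continuity correction by default, and a small p-value only tells us the defects are not independent, not how strong the association is. The sketch below uses made-up counts (an assumption for illustration, not the steel plates data) to compare the corrected and uncorrected statistics and to derive an effect size from the uncorrected one.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 counts (NOT from the steel plates data):
# rows = defect A absent/present, columns = defect B absent/present
table = np.array([[50, 30],
                  [20, 0]])

# Default call: Yates' continuity correction is applied because dof == 1
chi2_yates, p_yates, dof, expected = chi2_contingency(table)

# Same test without the correction
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

# Effect size for a 2x2 table: |phi| = sqrt(chi2 / N), using the uncorrected statistic
n = table.sum()
phi_abs = np.sqrt(chi2_raw / n)

print(f"Chi-square (Yates): {chi2_yates:.3f}, p = {p_yates:.4f}")
print(f"Chi-square (raw):   {chi2_raw:.3f}, p = {p_raw:.4f}")
print(f"Expected 'both present' count: {expected[1, 1]:.1f}")
print(f"|phi| effect size: {phi_abs:.3f}")
```

Reporting `|phi|` next to the p-value keeps "significant" and "strong" apart: with a couple of thousand samples, even a weak association can produce a tiny p-value.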

In [None]:
# Perform chi-square tests for independence
print("📊 CHI-SQUARE INDEPENDENCE TESTS")
print("=" * 50)

for defect1, defect2 in defect_pairs:
    # Create contingency table
    both_present = ((df[defect1] == 1) & (df[defect2] == 1)).sum()
    only_defect1 = ((df[defect1] == 1) & (df[defect2] == 0)).sum()
    only_defect2 = ((df[defect1] == 0) & (df[defect2] == 1)).sum()
    neither = ((df[defect1] == 0) & (df[defect2] == 0)).sum()
    
    contingency_table = np.array([[neither, only_defect2], [only_defect1, both_present]])
    
    # Perform chi-square test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    print(f"\n🧪 {defect1} vs {defect2}:")
    print(f"   Chi-square statistic: {chi2:.3f}")
    print(f"   P-value: {p_value:.6f}")
    print(f"   Degrees of freedom: {dof}")
    
    if p_value < 0.001:
        significance = "Highly significant (p < 0.001)"
    elif p_value < 0.01:
        significance = "Very significant (p < 0.01)"
    elif p_value < 0.05:
        significance = "Significant (p < 0.05)"
    else:
        significance = "Not significant (p ≥ 0.05)"
    
    print(f"   Result: {significance}")
    
    if p_value < 0.05:
        print(f"   📊 These defects are NOT independent!")
        if both_present < expected[1,1]:
            print(f"   🚨 They show MUTUAL EXCLUSIVITY (negative association)")
        else:
            print(f"   🤝 They show POSITIVE ASSOCIATION")
    else:
        print(f"   ➡️ These defects appear to be independent")

## 🎯 The Mystery Deepens: Formulating Hypotheses

Based on our correlation and co-occurrence analysis, several hypotheses emerge...

In [None]:
print("🔬 HYPOTHESIS FORMULATION")
print("=" * 50)

# Summarize key findings
total_samples = len(df)
other_faults_count = (df['Other_Faults'] == 1).sum()
other_faults_rate = other_faults_count / total_samples * 100

print(f"🔍 KEY FINDINGS SUMMARY:")
print(f"   • Other_Faults: {other_faults_count} samples ({other_faults_rate:.1f}%)")
print(f"   • ALL correlations with Other_Faults are negative")
print(f"   • Co-occurrence rates are lower than expected if independent")
print(f"   • Chi-square tests reject independence: the associations are negative")

print(f"\n💡 POSSIBLE HYPOTHESES:")
print(f"\n🔬 Hypothesis 1: Classification System Logic")
print(f"   Maybe the classification system works like:")
print(f"   IF defect matches K_Scratch pattern → classify as K_Scratch")
print(f"   ELSE IF defect matches Bumps pattern → classify as Bumps")
print(f"   ELSE → classify as Other_Faults")
print(f"   This would explain mutual exclusivity!")

print(f"\n🏭 Hypothesis 2: Manufacturing Process Exclusivity")
print(f"   Different manufacturing conditions might produce different defect types:")
print(f"   • Condition A → K_Scratch defects")
print(f"   • Condition B → Other_Faults")
print(f"   • These conditions rarely occur simultaneously")

print(f"\n🤖 Hypothesis 3: Machine Learning Training Dataset")
print(f"   What if this dataset was designed for training classification algorithms?")
print(f"   • Each sample is labeled with ONE primary defect type")
print(f"   • 'Other_Faults' represents the 'catch-all' category")
print(f"   • This would naturally create mutual exclusivity")

print(f"\n❓ CRITICAL QUESTION:")
print(f"   Which hypothesis explains our observations best?")
print(f"   To find out, we need to investigate the original purpose")
print(f"   and design of this dataset...")

print(f"\n🎯 NEXT INVESTIGATION PRIORITY:")
print(f"   Research the dataset's original documentation")
print(f"   and understand its intended purpose!")

## 📊 Final Correlation Insights

Let's create one final comprehensive view of all the relationships we've discovered.

In [None]:
# Create a comprehensive summary visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 16))

# 1. Correlation heatmap with annotations
sns.heatmap(defect_correlations, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.3f', cbar_kws={"shrink": .8}, ax=ax1)
ax1.set_title('Complete Defect Correlation Matrix\n(Notice the Other_Faults Pattern)', 
              fontsize=14, fontweight='bold')

# 2. Other_Faults correlation bar chart
colors = ['red' if corr < 0 else 'green' for corr in other_faults_corr.values]
bars = ax2.bar(range(len(other_faults_corr)), other_faults_corr.values, color=colors, alpha=0.7)
ax2.set_xticks(range(len(other_faults_corr)))
ax2.set_xticklabels(other_faults_corr.index, rotation=45, ha='right')
ax2.set_ylabel('Correlation Coefficient')
ax2.set_title('Other_Faults Correlations\n(All Negative!)', fontsize=14, fontweight='bold')
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, value in zip(bars, other_faults_corr.values):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + (0.01 if height > 0 else -0.03),
             f'{value:.3f}', ha='center', va='bottom' if height > 0 else 'top', 
             fontweight='bold')

# 3. Sample size comparison
defect_counts = [df[defect].sum() for defect in defect_columns]
colors_size = ['red' if defect == 'Other_Faults' else 'lightblue' for defect in defect_columns]

bars_size = ax3.bar(defect_columns, defect_counts, color=colors_size)
ax3.set_title('Defect Type Sample Sizes\n(Other_Faults is the Largest)', 
              fontsize=14, fontweight='bold')
ax3.set_ylabel('Number of Samples')
ax3.tick_params(axis='x', rotation=45)

# Add percentage labels
for bar, count in zip(bars_size, defect_counts):
    percentage = count / total_samples * 100
    ax3.text(bar.get_x() + bar.get_width()/2., count + 10,
             f'{percentage:.1f}%', ha='center', va='bottom', fontweight='bold')

# 4. Co-occurrence summary
co_occurrence_data = []
pair_labels = []

for defect1, defect2 in defect_pairs:
    both_present = ((df[defect1] == 1) & (df[defect2] == 1)).sum()
    total_defect1 = (df[defect1] == 1).sum()
    total_defect2 = (df[defect2] == 1).sum()
    expected_both = (total_defect1 * total_defect2) / len(df)
    
    ratio = both_present / expected_both if expected_both > 0 else 0
    co_occurrence_data.append(ratio)
    pair_labels.append(f'{defect1[:6]}\nvs\n{defect2[:6]}')

colors_co = ['red' if ratio < 0.5 else 'yellow' if ratio < 1.5 else 'green' 
             for ratio in co_occurrence_data]

bars_co = ax4.bar(pair_labels, co_occurrence_data, color=colors_co, alpha=0.7)
ax4.set_title('Co-occurrence Ratios\n(Actual/Expected)', fontsize=14, fontweight='bold')
ax4.set_ylabel('Ratio (Actual/Expected)')
ax4.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='Expected (Independent)')
ax4.axhline(y=0.5, color='red', linestyle=':', alpha=0.5, label='Mutual Exclusivity Threshold')
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')

# Add ratio labels
for bar, ratio in zip(bars_co, co_occurrence_data):
    ax4.text(bar.get_x() + bar.get_width()/2., ratio + 0.05,
             f'{ratio:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("🎯 COMPREHENSIVE ANALYSIS COMPLETE!")
print("All evidence points to systematic mutual exclusivity between defect types.")
print("The next phase must investigate the fundamental nature of this dataset...")

## 🎯 Phase 3 Conclusions: The Deepest Mystery Yet

This correlation analysis has revealed the most puzzling aspect of our investigation.

In [None]:
print("📋 PHASE 3: CORRELATION MYSTERY SUMMARY")
print("=" * 60)

print(f"\n🔍 MAJOR DISCOVERIES:")
print(f"   📊 ALL Other_Faults correlations are negative (100% negative!)")
print(f"   🚨 Strongest negative correlations:")
for defect, corr in other_faults_corr.nsmallest(3).items():
    print(f"      • vs {defect}: {corr:.3f}")

print(f"   🔬 Chi-square tests reject independence (negative association)")
print(f"   📈 Co-occurrence rates much lower than expected")

print(f"\n🤔 THE FUNDAMENTAL CONTRADICTION:")
print(f"   • Similar thickness distributions → Should correlate positively")
print(f"   • Actual correlations → All negative!")
print(f"   • This suggests something deeper than manufacturing processes")

print(f"\n💡 LEADING HYPOTHESIS:")
print(f"   The mutual exclusivity pattern strongly suggests this dataset")
print(f"   was designed for CLASSIFICATION TRAINING, where each sample")
print(f"   is labeled with exactly ONE defect type, and 'Other_Faults'")
print(f"   represents the 'catch-all' category for unclassifiable defects.")

print(f"\n❓ CRITICAL QUESTIONS FOR NEXT PHASE:")
print(f"   1. What was the original purpose of this dataset?")
print(f"   2. Is this a machine learning training dataset?")
print(f"   3. How does this change our approach to the Other_Faults problem?")

print(f"\n🚀 NEXT INVESTIGATION:")
print(f"   Time to investigate the dataset's original documentation")
print(f"   and discover the truth about its intended purpose!")

print(f"\n🎯 THE BREAKTHROUGH AWAITS:")
print(f"   If our hypothesis is correct, this completely reframes the problem.")
print(f"   Other_Faults isn't a manufacturing issue - it's a classification")
print(f"   system limitation! The solution lies in improving the AI model,")
print(f"   not the manufacturing process!")

print(f"\n🔍 DETECTIVE WORK CONTINUES...")
print(f"   The correlation mystery has given us the biggest clue yet.")
print(f"   Time to verify our hypothesis and discover the truth!")

---

## 🎯 End of Phase 3: The Plot Thickens Further

The correlation analysis has revealed the most significant clue in our investigation: **systematic mutual exclusivity** between Other_Faults and every other defect type. A pattern this consistent is hard to explain by manufacturing processes alone.

**The negative correlations (-0.366 with K_Scratch, -0.372 with Bumps), combined with the co-occurrence analysis, point to one likely conclusion**: this dataset follows a **classification system logic** where each sample is assigned exactly one defect label.

If this hypothesis is correct, it completely changes our understanding of the problem:
- ❌ **Wrong assumption**: Other_Faults is a manufacturing process issue
- ✅ **Likely reality**: Other_Faults represents classification system limitations
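
If each sample really carries exactly one label, negative correlations are not merely plausible but guaranteed. For two mutually exclusive flags with rates `p` and `q`, the covariance is `-p*q`, so Pearson's r works out to `-sqrt(p*q / ((1-p)*(1-q)))` regardless of thickness or any other feature. The sketch below checks this on synthetic one-label-per-sample data; the class names, the class proportions, and the sample size are assumptions for illustration, not values from the steel plates dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic labels (NOT the steel plates data): each sample gets exactly one class
class_names = ["class_a", "class_b", "catch_all"]
labels = rng.choice(class_names, size=2000, p=[0.2, 0.3, 0.5])
one_hot = pd.get_dummies(pd.Series(labels)).astype(int)

# Check 1: every row carries exactly one label
rows_with_one_label = int((one_hot.sum(axis=1) == 1).sum())

# Check 2: closed-form correlation for two mutually exclusive flags
p = one_hot["class_a"].mean()
q = one_hot["catch_all"].mean()
predicted_r = -np.sqrt(p * q / ((1 - p) * (1 - q)))
observed_r = one_hot["class_a"].corr(one_hot["catch_all"])

print(f"Rows with exactly one label: {rows_with_one_label} of {len(one_hot)}")
print(f"Predicted r: {predicted_r:.4f}")
print(f"Observed r:  {observed_r:.4f}")
```

On the real data, the analogous first check would be `df[defect_columns].sum(axis=1).value_counts()`: if every row sums to 1, the negative correlations follow from the label encoding alone, and the larger the two categories, the more negative the value.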

In the next notebook (`04_truth_discovery.ipynb`), we'll investigate the dataset's original documentation to verify this hypothesis. If confirmed, we'll need to completely reframe our approach from **process improvement** to **AI model enhancement**.

**The truth is within reach!** 🔍

---

*Continue to: [04_truth_discovery.ipynb](./04_truth_discovery.ipynb)*