# Assignment 3: Statistical Analysis and Visualization of Wine Quality Dataset

---  
**Student Name:** Akanksha Anand     
**Registration Number:** 2401730035  
**Program:** B.Tech CSE (Artificial Intelligence & Machine Learning) Section A 

---

## Executive Summary
This notebook presents a comprehensive exploratory data analysis (EDA) of the Wine Quality dataset sourced from the UCI Machine Learning Repository. The dataset includes physicochemical measurements of Portuguese 'Vinho Verde' wine samples, encompassing both red and white variants.

**Dataset Specifications:**
- **Data Source:** UCI Machine Learning Repository / Kaggle
- **Sample Size:** 1,599 red wine samples + 4,898 white wine samples
- **Variables:** 11 physicochemical properties + 1 quality rating variable
- **Application:** Classification/Regression based on sensory evaluation

**Analytical Objectives:**
1. Analyze the distribution patterns of wine quality ratings
2. Determine relationships between chemical composition and quality
3. Identify anomalies and outliers in the measurements
4. Uncover patterns distinguishing premium-quality wines

---

In [None]:
# Importing essential libraries for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from scipy import stats
import warnings

# Suppress warning messages for cleaner output
warnings.filterwarnings('ignore')

# Configure visualization parameters
sns.set_style('darkgrid')
sns.set_palette('Set2')
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 11

print("✅ All required libraries loaded successfully!")
print(f"Pandas Version: {pd.__version__}")
print(f"NumPy Version: {np.__version__}")
print(f"Matplotlib Version: {matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")

---
## Section 1: Data Import and Preliminary Investigation

In this section, we import the Wine Quality dataset and run initial checks on its structure and composition. The UCI files are CSVs with semicolon separators, and the loading cell in this section reads the white-wine file (`winequality-white.csv`) by default.

**Feature Descriptions:**
- `fixed acidity`: Concentration of non-volatile acids (tartaric acid)
- `volatile acidity`: Acetic acid concentration (vinegar-like characteristics)
- `citric acid`: Citric acid content (adds freshness and flavor complexity)
- `residual sugar`: Sugar content post-fermentation process
- `chlorides`: Salt concentration in wine
- `free sulfur dioxide`: Free form SO2 (antimicrobial properties)
- `total sulfur dioxide`: Combined SO2 content (preservative function)
- `density`: Wine density measurement
- `pH`: Acidity/alkalinity measure (pH scale 0-14)
- `sulphates`: Sulfate additive concentration
- `alcohol`: Ethanol percentage by volume
- `quality`: Expert sensory evaluation score (0-10 scale)
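As a quick illustration of the semicolon-separated layout, the sketch below parses a tiny in-memory sample. The two rows are made-up values rather than records from the dataset, and `sample_csv` and `sample_df` are hypothetical names used only here.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the UCI files (illustrative values only)
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"quality"\n'
    '7.0;0.27;6\n'
    '6.3;0.30;5\n'
)

# sep=';' tells pandas that fields are separated by semicolons, not commas
sample_df = pd.read_csv(sample_csv, sep=';')

print(sample_df.shape)
print(sample_df.dtypes)
```

If the separator is left at its default (a comma), each row would be read as a single column, which is a common symptom of a wrong delimiter.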


In [None]:
# Loading the wine quality dataset
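# Note: Update the filename to match your data file location
# Common filenames: 'winequality-white.csv', 'winequality-red.csv', 'WineQT.csv'
# The UCI files use ';' as separator; if your copy is comma-separated, remove the delimiter argument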

wine_data = pd.read_csv('winequality-white.csv', delimiter=';')

print("="*75)
print("DATASET SUCCESSFULLY LOADED")
print("="*75)
print(f"Total number of samples (rows): {wine_data.shape[0]}")
print(f"Total number of features (columns): {wine_data.shape[1]}")
print(f"Total data points in dataset: {wine_data.shape[0] * wine_data.shape[1]}")
print(f"Memory consumption: {wine_data.memory_usage(deep=True).sum() / 1024:.2f} KB")

---
### 1.1 Data Sample Inspection

Let's examine sample records from both ends of the dataset to verify successful loading and understand value distributions.


In [None]:
print("="*75)
print("FIRST 10 ROWS OF DATASET")
print("="*75)
display(wine_data.head(10))

print("\n" + "="*75)
print("LAST 10 ROWS OF DATASET")
print("="*75)
display(wine_data.tail(10))

---
### 1.2 Data Type Analysis

Understanding column data types is essential for:
- Selecting appropriate statistical methods
- Detecting potential data quality issues
- Planning data transformation strategies

Expected: All chemical measurements should be numerical (float/int), quality as integer.


In [None]:
print("="*75)
print("DETAILED DATASET INFORMATION")
print("="*75)
wine_data.info()

print("\n" + "="*75)
print("DATA TYPE SUMMARY")
print("="*75)
print(wine_data.dtypes.value_counts())

print("\n" + "="*75)
print("COMPLETE LIST OF FEATURES")
print("="*75)
for idx, col_name in enumerate(wine_data.columns, 1):
    print(f"{idx}. {col_name}")

---
### 1.3 Statistical Summary

The `.describe()` function provides essential statistical measures:
- **Count:** Number of valid (non-null) observations
- **Mean:** Average value (arithmetic mean)
- **Std:** Standard deviation (measure of spread)
- **Min/Max:** Minimum and maximum values
- **25%, 50%, 75%:** First, second (median), and third quartiles

This overview reveals central tendencies and variability in each feature.


In [None]:
print("="*75)
print("COMPREHENSIVE STATISTICAL SUMMARY")
print("="*75)
statistical_summary = wine_data.describe().T
statistical_summary['range'] = statistical_summary['max'] - statistical_summary['min']
display(statistical_summary)

# Inspect categorical variables if present
categorical_features = wine_data.select_dtypes(include='object').columns
if len(categorical_features) > 0:
    print("\n" + "="*75)
    print("CATEGORICAL VARIABLES DETECTED")
    print("="*75)
    for cat_col in categorical_features:
        print(f"\n{cat_col}:")
        print(wine_data[cat_col].value_counts())

---
## Section 2: Data Quality and Cleaning

### 2.1 Missing Value Analysis

Missing data can significantly impact analysis outcomes. Our approach:
1. Identify columns containing missing values
2. Calculate missing data percentage
3. Determine appropriate handling strategy

**Common Strategies:**
- **Deletion:** When >50% missing or feature is irrelevant
- **Imputation:** Fill with mean/median (numerical) or mode (categorical)
- **Retain:** When missingness provides information

Note: The Wine Quality dataset is documented as having no missing values, so this check mainly serves as a safeguard.


In [None]:
print("="*75)
print("MISSING VALUE DETECTION")
print("="*75)

# Compute missing value statistics
missing_count = wine_data.isnull().sum()
missing_percent = (missing_count / len(wine_data)) * 100
missing_df = pd.DataFrame({
    'Column_Name': wine_data.columns,
    'Missing_Values': missing_count.values,
    'Missing_Percentage': missing_percent.values
})
missing_df = missing_df[missing_df['Missing_Values'] > 0].sort_values('Missing_Percentage', ascending=False)

if len(missing_df) > 0:
    display(missing_df)
    
    # Visualize missing data
    plt.figure(figsize=(14, 6))
    plt.bar(missing_df['Column_Name'], missing_df['Missing_Percentage'], color='salmon')
    plt.xlabel('Column Names', fontweight='bold')
    plt.ylabel('Missing Percentage (%)', fontweight='bold')
    plt.title('Missing Values Distribution Across Features', fontsize=14, fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("✅ EXCELLENT! Dataset contains no missing values.")
    print("   This indicates high-quality data collection procedures.")

---
### 2.2 Missing Value Imputation Strategy

Even though Wine Quality data is typically complete, we implement a robust strategy for handling missing values:

**Implementation:**
1. Remove columns with >50% missing data (excessive information loss)
2. Impute numerical features with **median** (robust against outliers)
3. Impute categorical features with **mode** (most frequent category)
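The three steps above can be sketched as a single helper that returns a new DataFrame instead of modifying columns in place. This is a minimal sketch on a toy frame with deliberately injected gaps; `impute_simple` and the column values are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def impute_simple(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, then fill numeric gaps with the median and others with the mode."""
    # Step 1: keep only columns whose missing fraction does not exceed the threshold
    keep = df.isnull().mean() <= drop_threshold
    out = df.loc[:, keep].copy()
    # Steps 2 and 3: assign the filled column back instead of filling in place
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

toy = pd.DataFrame({
    'alcohol': [9.0, np.nan, 11.0, 10.0],      # 25% missing -> median fill
    'type': ['white', None, 'white', 'red'],   # 25% missing -> mode fill
    'notes': [None, None, None, 'ok'],         # 75% missing -> dropped
})
imputed = impute_simple(toy)
print(imputed)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` applied to a column selection, which newer pandas versions warn about.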


In [None]:
print("="*75)
print("MISSING VALUE TREATMENT")
print("="*75)

# Strategy 1: Remove columns with excessive missingness
threshold_pct = 0.5
missing_fraction = wine_data.isnull().sum() / len(wine_data)
columns_to_drop = missing_fraction[missing_fraction > threshold_pct].index.tolist()

if columns_to_drop:
    print(f"Dropping {len(columns_to_drop)} columns exceeding {threshold_pct*100}% missing threshold:")
    print(columns_to_drop)
    wine_data = wine_data.drop(columns=columns_to_drop)
else:
    print(f"✅ No columns exceed the {threshold_pct*100}% missing threshold")

# Strategy 2: Impute numerical columns with median
numerical_columns = wine_data.select_dtypes(include=[np.number]).columns
imputation_count = 0

for num_col in numerical_columns:
    n_missing = wine_data[num_col].isnull().sum()  # count before filling
    if n_missing > 0:
        median_value = wine_data[num_col].median()
        # Assign back rather than using inplace=True on a column selection
        wine_data[num_col] = wine_data[num_col].fillna(median_value)
        print(f"  Imputed {num_col}: filled {n_missing} values with median={median_value:.2f}")
        imputation_count += 1

# Strategy 3: Impute categorical columns with mode
categorical_columns = wine_data.select_dtypes(include=['object', 'category']).columns
for cat_col in categorical_columns:
    if wine_data[cat_col].isnull().sum() > 0:
        mode_value = wine_data[cat_col].mode()[0]
        # Assign back rather than using inplace=True on a column selection
        wine_data[cat_col] = wine_data[cat_col].fillna(mode_value)
        print(f"  Imputed {cat_col}: filled with mode={mode_value}")
        imputation_count += 1

if imputation_count == 0:
    print("✅ No imputation necessary - dataset is complete!")

print(f"\nFinal count of missing values: {wine_data.isnull().sum().sum()}")

---
### 2.3 Duplicate Records Detection

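Duplicate rows can artificially inflate sample size and bias analysis, although in this dataset identical rows may also be distinct wines that happen to share the same measurements. Common causes: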
- Data entry errors
- Multiple measurements of same sample
- Database merge issues

**Action:** Identify and remove exact duplicates, keeping only first occurrence.


In [None]:
print("="*75)
print("DUPLICATE RECORDS ANALYSIS")
print("="*75)

duplicate_rows = wine_data.duplicated().sum()
print(f"Number of duplicate rows identified: {duplicate_rows}")
print(f"Percentage of total dataset: {(duplicate_rows/len(wine_data)*100):.2f}%")

if duplicate_rows > 0:
    print(f"\n⚠ Action Required: Removing {duplicate_rows} duplicate rows...")
    wine_data = wine_data.drop_duplicates()
    print(f"✅ Dataset dimensions after duplicate removal: {wine_data.shape[0]} rows × {wine_data.shape[1]} columns")
else:
    print("✅ No duplicate records detected - data integrity confirmed!")

---
### 2.4 Outlier Detection Using Boxplots

**Understanding Outliers:**
Outliers are observations that deviate significantly from other data points. Categories:
- **Legitimate:** Rare but valid measurements (e.g., exceptional wine quality)
- **Errors:** Measurement or data entry mistakes

**Detection Method: Interquartile Range (IQR)**
- Calculate Q1 (25th percentile) and Q3 (75th percentile)
- IQR = Q3 - Q1
- Lower Boundary = Q1 - 1.5 × IQR
- Upper Boundary = Q3 + 1.5 × IQR
- Values outside boundaries are flagged as outliers

Boxplots effectively visualize outliers for all chemical properties.
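The IQR rule above can be checked by hand on a small example. The sketch below uses nine made-up values with one obvious extreme; `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obviously extreme value
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print(f"boundaries: [{lower}, {upper}]")
print(outliers.tolist())
```

Here Q1 = 3 and Q3 = 7, so IQR = 4 and the boundaries are -3 and 13; only the value 100 falls outside them.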


In [None]:
print("="*75)
print("OUTLIER DETECTION USING BOXPLOT METHOD")
print("="*75)

# Identify numerical columns
numeric_cols = wine_data.select_dtypes(include=[np.number]).columns.tolist()
num_features = len(numeric_cols)

print(f"Analyzing {num_features} numerical features for outlier detection...\n")

# Generate boxplot grid
num_rows = (num_features + 2) // 3  # 3 plots per row
fig, axes = plt.subplots(num_rows, 3, figsize=(16, 5*num_rows))
axes = np.atleast_1d(axes).flatten()  # subplots(num_rows, 3) always returns an array of axes

for i, feature in enumerate(numeric_cols):
    box_plot = axes[i].boxplot(wine_data[feature].dropna(), vert=True, patch_artist=True)
    axes[i].set_title(f'Boxplot: {feature}', fontweight='bold')
    axes[i].set_ylabel(feature)
    axes[i].grid(axis='y', alpha=0.4)
    
    # Apply color scheme
    for patch in box_plot['boxes']:
        patch.set_facecolor('lightcoral')

# Remove unused subplot areas
for i in range(num_features, len(axes)):
    axes[i].axis('off')

plt.suptitle('Outlier Detection via Boxplot Visualization', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
### 2.5 Outlier Quantification

Calculate precise outlier counts and percentages using IQR method to understand:
- Which features contain most extreme values
- Whether outliers are isolated or systemic
- If outlier treatment is necessary


In [None]:
print("="*75)
print("OUTLIER STATISTICS - IQR METHODOLOGY")
print("="*75)

outlier_summary = []

for feature in numeric_cols:
    q1 = wine_data[feature].quantile(0.25)
    q3 = wine_data[feature].quantile(0.75)
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    
    # Identify outliers
    outlier_records = wine_data[(wine_data[feature] < lower_boundary) | (wine_data[feature] > upper_boundary)]
    outlier_count = len(outlier_records)
    outlier_percentage = (outlier_count / len(wine_data)) * 100
    
    outlier_summary.append({
        'Feature': feature,
        'Q1': round(q1, 3),
        'Q3': round(q3, 3),
        'IQR': round(iqr, 3),
        'Lower_Bound': round(lower_boundary, 3),
        'Upper_Bound': round(upper_boundary, 3),
        'Outlier_Count': outlier_count,
        'Outlier_Pct': round(outlier_percentage, 2)
    })

outlier_df = pd.DataFrame(outlier_summary)
display(outlier_df)

# Summary statistics
total_outliers_detected = outlier_df['Outlier_Count'].sum()
print(f"\n📊 ANALYSIS SUMMARY:")
print(f"  • Total outlier flags summed over features (a row can count more than once): {total_outliers_detected}")
print(f"  • Features with highest outlier counts: {outlier_df.nlargest(3, 'Outlier_Count')['Feature'].tolist()}")

---
### 2.6 Outlier Treatment Strategy

**Strategy: Winsorization (Value Capping)**

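Rather than deleting outliers (which loses information), we cap extreme values at the IQR boundaries. This is often loosely called **winsorization**, although classical winsorization caps at fixed percentiles rather than IQR boundaries. Benefits:
- Maintains sample size
- Reduces impact of extreme values
- Leaves the central part of each distribution unchanged (only the tails are compressed onto the boundaries)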

**Note:** For wine quality, some "outliers" may represent legitimate exceptional wines. We create a cleaned version while preserving the original.
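The sketch below applies the capping rule to a made-up array and, for comparison, runs percentile-based winsorization via `scipy.stats.mstats.winsorize`, which replaces a fixed fraction of values in each tail instead of using IQR boundaries. The values and the 10% limits are assumptions for illustration, and the SciPy call is not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Made-up values; 100 lies far above the rest
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# IQR-boundary capping, as used in this section
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_capped = values.clip(lower=lower, upper=upper)

# Percentile-based winsorization: replace the lowest and highest 10% of values
# with their nearest remaining neighbours (the 10% limits are an arbitrary choice)
pct_winsorized = np.asarray(winsorize(values.to_numpy(), limits=(0.1, 0.1)))

print(iqr_capped.tolist())
print(pct_winsorized.tolist())
```

Both approaches keep all ten observations; they differ in which values are changed and what they are replaced with.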


In [None]:
print("="*75)
print("OUTLIER TREATMENT - WINSORIZATION METHOD")
print("="*75)

# Create copy for cleaned data
wine_data_cleaned = wine_data.copy()

for feature in numeric_cols:
    q1 = wine_data_cleaned[feature].quantile(0.25)
    q3 = wine_data_cleaned[feature].quantile(0.75)
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    
    # Count outliers before treatment
    outliers_before = ((wine_data_cleaned[feature] < lower_boundary) | (wine_data_cleaned[feature] > upper_boundary)).sum()
    
    # Apply winsorization (cap values)
    wine_data_cleaned[feature] = wine_data_cleaned[feature].clip(lower=lower_boundary, upper=upper_boundary)
    
    if outliers_before > 0:
        print(f"✓ {feature}: Capped {outliers_before} outliers to range [{lower_boundary:.2f}, {upper_boundary:.2f}]")

print(f"\n✓ Outlier treatment completed!")
print(f"  Original dataset shape: {wine_data.shape}")
print(f"  Cleaned dataset shape: {wine_data_cleaned.shape}")
print("\nNote: Using ORIGINAL dataset for further analysis to preserve natural wine characteristics.")

---
## Section 3: Univariate Analysis

Univariate analysis examines individual variables to understand their distributions and characteristics.

### 3.1 Central Tendency Measures

**Central tendency** identifies the "center" or "typical" value:

- **Mean (μ):** Arithmetic average, affected by outliers
- **Median:** Middle value when sorted, outlier-resistant
- **Trimmed Mean:** Mean after discarding the most extreme 10% of values from each end (20% in total)
- **Mode:** Most frequently occurring value

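For each feature, comparing the mean and median indicates:
- Roughly symmetric distribution (mean ≈ median), which is necessary but not sufficient for normality
- Right skew (mean > median)
- Left skew (mean < median)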
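A small made-up example shows how these measures react to a single extreme value. Note that the second argument of `scipy.stats.trim_mean` is the proportion cut from each tail, so `0.1` removes 10% of observations at both ends.

```python
import numpy as np
from scipy import stats

# Ten made-up values with one extreme observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

mean_value = values.mean()
median_value = np.median(values)
# proportiontocut=0.1 removes 10% of observations from EACH tail (here: 1 and 100)
trimmed = stats.trim_mean(values, 0.1)

print(mean_value, median_value, trimmed)
```

The mean is pulled well above the median by the value 100, while the trimmed mean stays close to the median because the extreme value is discarded before averaging.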


In [None]:
print("="*75)
print("CENTRAL TENDENCY MEASURES")
print("="*75)

central_stats = []

for feature in numeric_cols:
    # Calculate measures
    mean_value = wine_data[feature].mean()
    median_value = wine_data[feature].median()
    trimmed_mean = stats.trim_mean(wine_data[feature].dropna(), 0.1)  # Trims 10% from each tail (20% total)
    mode_values = wine_data[feature].mode()
    mode_value = mode_values[0] if len(mode_values) > 0 else np.nan
    
    central_stats.append({
        'Feature': feature,
        'Mean': round(mean_value, 3),
        'Median': round(median_value, 3),
        'Trimmed_Mean': round(trimmed_mean, 3),
        'Mode': round(mode_value, 3) if not pd.isna(mode_value) else 'N/A'
    })

central_df = pd.DataFrame(central_stats)
display(central_df)

print("\n📖 INTERPRETATION GUIDELINES:")
print("  • Mean ≈ Median → Symmetric distribution")
print("  • Mean > Median → Right-skewed (positive skew)")
print("  • Mean < Median → Left-skewed (negative skew)")
print("  • Trimmed Mean → Outlier-resistant measure for skewed data")

---
### 3.2 Dispersion Measures

**Variability measures** describe data spread:

- **Range:** Max - Min (simple, outlier-sensitive)
- **Variance (σ²):** Average squared deviation from the mean (pandas `.var()` uses the sample formula, dividing by n - 1)
- **Standard Deviation (σ):** Square root of variance
- **Coefficient of Variation (CV%):** (σ/μ) × 100 - relative variability

**Wine Quality Implications:**
- Low variability → Consistent measurements
- High variability → Large inter-sample differences
- CV% enables cross-scale comparison
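To illustrate why CV% allows comparison across scales, the sketch below expresses the same made-up measurements in two units. `coefficient_of_variation` is a hypothetical helper name.

```python
import pandas as pd

def coefficient_of_variation(values: pd.Series) -> float:
    """CV% = sample standard deviation / mean * 100."""
    return values.std() / values.mean() * 100

# Same made-up measurements expressed in two units (grams and milligrams)
grams = pd.Series([2.0, 4.0, 6.0])
milligrams = grams * 1000

std_g, std_mg = grams.std(), milligrams.std()
cv_g = coefficient_of_variation(grams)
cv_mg = coefficient_of_variation(milligrams)

print(std_g, std_mg)   # the standard deviation depends on the unit
print(cv_g, cv_mg)     # the CV% does not
```

CV% is only meaningful for features measured on a ratio scale with a mean well away from zero.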


In [None]:
print("="*75)
print("DISPERSION MEASURES")
print("="*75)

dispersion_stats = []

for feature in numeric_cols:
    data_range = wine_data[feature].max() - wine_data[feature].min()
    variance_val = wine_data[feature].var()
    std_dev = wine_data[feature].std()
    cv_percent = (std_dev / wine_data[feature].mean()) * 100 if wine_data[feature].mean() != 0 else np.nan
    
    dispersion_stats.append({
        'Feature': feature,
        'Min': round(wine_data[feature].min(), 3),
        'Max': round(wine_data[feature].max(), 3),
        'Range': round(data_range, 3),
        'Variance': round(variance_val, 3),
        'Std_Deviation': round(std_dev, 3),
        'CV_Percent': round(cv_percent, 2)
    })

dispersion_df = pd.DataFrame(dispersion_stats)
display(dispersion_df)

print("\n📖 INTERPRETATION GUIDELINES:")
print("  • CV% < 15% → Low variability (highly consistent)")
print("  • CV% 15-30% → Moderate variability")
print("  • CV% > 30% → High variability (substantial differences)")

---
### 3.3 Distribution Visualization

**Histograms + KDE plots** reveal:
- **Shape:** Normal, skewed, bimodal, uniform?
- **Spread:** Narrow or wide distribution?
- **Peaks:** Single or multiple modes?

**For wine quality prediction:**
- Approximately normal features tend to work well with linear models (the formal assumption concerns residuals, not the features themselves)
- Skewed distributions may need transformation
- Bimodal distributions suggest distinct wine categories
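As one example of a transformation for skewed data, the sketch below measures sample skewness with `scipy.stats.skew` before and after a `log1p` transform. The values are made up, and `log1p` is just one common option for non-negative, right-skewed data; it is not necessarily the best choice for any particular feature here.

```python
import numpy as np
from scipy import stats

# Made-up right-skewed sample (most values small, one long right tail)
raw = np.array([1, 1, 2, 2, 3, 4, 5, 8, 13, 50], dtype=float)

# log1p(x) = log(1 + x) compresses large values more than small ones
transformed = np.log1p(raw)

skew_before = stats.skew(raw)
skew_after = stats.skew(transformed)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution; positive values indicate a longer right tail.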


In [None]:
print("="*75)
print("DISTRIBUTION ANALYSIS - HISTOGRAMS WITH KDE")
print("="*75)

num_features = len(numeric_cols)
num_rows = (num_features + 2) // 3

fig, axes = plt.subplots(num_rows, 3, figsize=(17, 5*num_rows))
axes = np.atleast_1d(axes).flatten()  # subplots(num_rows, 3) always returns an array of axes

for idx, feature in enumerate(numeric_cols):
    # Histogram
    axes[idx].hist(wine_data[feature].dropna(), bins=30, alpha=0.6, color='skyblue', edgecolor='black', density=True, label='Histogram')
    
    # KDE overlay
    wine_data[feature].dropna().plot(kind='kde', ax=axes[idx], color='red', linewidth=2.5, label='KDE')
    
    # Add statistical reference lines
    mean_val = wine_data[feature].mean()
    median_val = wine_data[feature].median()
    axes[idx].axvline(mean_val, color='green', linestyle='--', linewidth=2, label=f'Mean={mean_val:.2f}')
    axes[idx].axvline(median_val, color='orange', linestyle='--', linewidth=2, label=f'Median={median_val:.2f}')
    
    axes[idx].set_title(f'Distribution: {feature}', fontweight='bold', fontsize=11)
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Density')
    axes[idx].legend(fontsize=8)
    axes[idx].grid(alpha=0.3)

# Remove unused subplots
for idx in range(num_features, len(axes)):
    axes[idx].axis('off')

plt.suptitle('Wine Quality Dataset - Distribution Analysis', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
### 3.4 Target Variable Analysis - Wine Quality

The **quality** variable is our prediction target. Understanding its distribution is critical:
- Is quality normally distributed or skewed?
- Are all quality levels represented?
- Is there class imbalance?

This informs modeling strategy and evaluation metrics.


In [None]:
print("="*75)
print("WINE QUALITY DISTRIBUTION ANALYSIS")
print("="*75)

# Frequency distribution
quality_distribution = wine_data['quality'].value_counts().sort_index()
quality_percentages = (quality_distribution / len(wine_data) * 100).round(2)

quality_summary_table = pd.DataFrame({
    'Quality_Score': quality_distribution.index,
    'Frequency': quality_distribution.values,
    'Percentage': quality_percentages.values
})

display(quality_summary_table)

# Statistical measures
print(f"\n📊 QUALITY STATISTICS:")
print(f"  • Average quality: {wine_data['quality'].mean():.2f}")
print(f"  • Median quality: {wine_data['quality'].median():.0f}")
print(f"  • Most common quality: {wine_data['quality'].mode()[0]}")
print(f"  • Quality range: {wine_data['quality'].min()} - {wine_data['quality'].max()}")
print(f"  • Standard deviation: {wine_data['quality'].std():.2f}")

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar chart
colors_gradient = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(quality_distribution)))
ax1.bar(quality_distribution.index, quality_distribution.values, color=colors_gradient, edgecolor='black', linewidth=1.5)
ax1.set_xlabel('Quality Score', fontweight='bold')
ax1.set_ylabel('Frequency', fontweight='bold')
ax1.set_title('Wine Quality Distribution', fontweight='bold', fontsize=12)
ax1.set_xticks(quality_distribution.index)
ax1.grid(axis='y', alpha=0.3)

# Pie chart
ax2.pie(quality_distribution.values, labels=quality_distribution.index, autopct='%1.1f%%', 
        startangle=90, colors=colors_gradient, textprops={'fontweight': 'bold'})
ax2.set_title('Wine Quality Proportions', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

# Class imbalance check
print(f"\n⚠ CLASS BALANCE ASSESSMENT:")
majority_percentage = (quality_distribution.max() / len(wine_data) * 100)
if majority_percentage > 50:
    print(f"  Dataset shows IMBALANCE - majority class: {majority_percentage:.1f}%")
else:
    print(f"  Dataset is relatively BALANCED")

---
## Section 4: Bivariate Analysis

Bivariate analysis examines relationships between **two variables**. For wine quality:
- Which chemical properties influence quality?
- Are properties correlated with each other?
- Do relationships follow linear or non-linear patterns?

### 4.1 Correlation Analysis

**Correlation coefficient (r)** quantifies linear relationship strength:
- **r = +1:** Perfect positive correlation
- **r = 0:** No linear correlation
- **r = -1:** Perfect negative correlation

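**For wine quality:**
- Positive correlation → Higher values are associated with higher quality
- Negative correlation → Higher values are associated with lower quality

Correlation describes association only; it does not by itself show that a property causes a change in quality.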
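For reference, the Pearson coefficient can be computed directly from its definition and compared with pandas. The paired values below are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson r: sum of centered products divided by the product of centered norms."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up paired measurements (not taken from the dataset)
alcohol = pd.Series([9.0, 10.0, 11.0, 12.0, 13.0])
score = pd.Series([5.0, 5.0, 6.0, 6.0, 7.0])

r_manual = pearson_r(alcohol, score)
r_pandas = alcohol.corr(score)   # pandas uses Pearson by default
print(round(r_manual, 4), round(r_pandas, 4))
```

Because `quality` is an ordinal score, a rank-based coefficient such as Spearman (`method='spearman'` in `DataFrame.corr`) may be worth checking alongside Pearson.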


In [None]:
print("="*75)
print("CORRELATION ANALYSIS")
print("="*75)

# Calculate correlation matrix
correlation_matrix = wine_data[numeric_cols].corr()

# Display correlations with quality
if 'quality' in numeric_cols:
    quality_correlations = correlation_matrix['quality'].sort_values(ascending=False)
    print("CORRELATIONS WITH WINE QUALITY (Target Variable):\n")
    print(quality_correlations)
    print("\n" + "="*60)

# Visualize correlation heatmap
plt.figure(figsize=(15, 13))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=1.5, cbar_kws={"shrink": 0.8},
            mask=mask, vmin=-1, vmax=1)
plt.title('Correlation Heatmap - Wine Quality Dataset', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\n📖 INTERPRETATION GUIDELINES:")
print("  • |r| > 0.7 → Strong correlation")
print("  • 0.4 ≤ |r| ≤ 0.7 → Moderate correlation")
print("  • |r| < 0.4 → Weak correlation")

---
### 4.2 Scatter Plot Analysis

Scatter plots visualize relationships between continuous variables. For wine quality:
- **Quality vs. Chemical Properties:** Identify influential properties
- **Chemical Property Pairs:** Detect multicollinearity

We'll plot:
1. Quality against top correlated chemical properties
2. Highly correlated feature pairs
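One way to list correlated feature pairs without nested loops is to mask everything except the upper triangle of the correlation matrix. This is a sketch on a made-up frame; `strong_pairs` is a hypothetical helper, and the 0.6 threshold is an adjustable default.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Return each unordered column pair once, keeping those with |r| above the threshold."""
    corr = df.corr()
    # k=1 keeps only entries strictly above the diagonal: no self-pairs, no mirrored duplicates
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Flatten to a Series indexed by (row, column); masked cells are NaN and fail the comparison
    pairs = upper.stack()
    return pairs[pairs.abs() > threshold]

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 9.9],   # nearly proportional to 'a'
    'c': [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related to 'a'
})
pairs = strong_pairs(toy)
print(pairs)
```

The result is a Series with a two-level index, so it can be sorted by absolute value or converted to a table with `reset_index()`.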


In [None]:
print("="*75)
print("SCATTER PLOT ANALYSIS")
print("="*75)

# Identify strong correlations with quality
if 'quality' in numeric_cols:
    quality_corr = correlation_matrix['quality'].drop('quality').abs().sort_values(ascending=False)
    top_4_features = quality_corr.head(4).index.tolist()
    
    print(f"Top {len(top_4_features)} features correlated with quality:")
    for i, feat in enumerate(top_4_features, 1):
        corr_value = correlation_matrix.loc[feat, 'quality']
        print(f"  {i}. {feat}: r = {corr_value:.3f}")
    
    # Create scatter plots: Quality vs Top Features
    fig, axes = plt.subplots(2, 2, figsize=(15, 13))
    axes = axes.flatten()
    
    for idx, feature in enumerate(top_4_features):
        axes[idx].scatter(wine_data[feature], wine_data['quality'], alpha=0.5, s=30, edgecolors='black', linewidth=0.5)
        axes[idx].set_xlabel(feature, fontweight='bold')
        axes[idx].set_ylabel('Quality', fontweight='bold')
        axes[idx].set_title(f'Quality vs {feature}\n(Correlation: {correlation_matrix.loc[feature, "quality"]:.3f})', fontweight='bold')
        axes[idx].grid(alpha=0.3)
        
        # Add trend line
        paired = wine_data[[feature, 'quality']].dropna()  # drop NaNs jointly so x and y stay aligned
        z = np.polyfit(paired[feature], paired['quality'], 1)
        p = np.poly1d(z)
        axes[idx].plot(wine_data[feature].sort_values(), p(wine_data[feature].sort_values()), "r--", linewidth=2.5, label='Trend Line')
        axes[idx].legend()
    
    plt.suptitle('Wine Quality vs Chemical Properties', fontsize=16, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()

# Find highly correlated feature pairs
print("\n" + "="*75)
print("HIGHLY CORRELATED FEATURE PAIRS (|r| > 0.6)")
print("="*75)

high_correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.6:
            high_correlation_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], corr_val))

if len(high_correlation_pairs) > 0:
    for pair in high_correlation_pairs:
        print(f"  • {pair[0]} ↔ {pair[1]}: r = {pair[2]:.3f}")
    
    # Plot top 4 pairs
    pairs_to_plot = min(4, len(high_correlation_pairs))
    fig, axes = plt.subplots(2, 2, figsize=(15, 13))
    axes = axes.flatten()
    
    for idx, (feat1, feat2, corr) in enumerate(high_correlation_pairs[:pairs_to_plot]):
        scatter = axes[idx].scatter(wine_data[feat1], wine_data[feat2], alpha=0.5, s=30, c=wine_data['quality'], cmap='RdYlGn', edgecolors='black', linewidth=0.5)
        axes[idx].set_xlabel(feat1, fontweight='bold')
        axes[idx].set_ylabel(feat2, fontweight='bold')
        axes[idx].set_title(f'{feat1} vs {feat2}\n(r = {corr:.3f})', fontweight='bold')
        axes[idx].grid(alpha=0.3)
        
        # Add regression line
        paired = wine_data[[feat1, feat2]].dropna()  # drop NaNs jointly so x and y stay aligned
        z = np.polyfit(paired[feat1], paired[feat2], 1)
        p = np.poly1d(z)
        axes[idx].plot(wine_data[feat1].sort_values(), p(wine_data[feat1].sort_values()), "r--", linewidth=2.5)
        
        # Colorbar
        cbar = plt.colorbar(scatter, ax=axes[idx])
        cbar.set_label('Quality', fontweight='bold')
    
    # Hide unused plots
    for idx in range(pairs_to_plot, 4):
        axes[idx].axis('off')
    
    plt.suptitle('Highly Correlated Feature Pairs (Colored by Quality)', fontsize=16, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()
else:
    print("  No feature pairs with |r| > 0.6 detected.")

---
### 4.3 Quality-Based Grouped Analysis

To understand how chemical properties vary across quality levels, we create **grouped boxplots**:
- Distribution of each property for different quality scores
- Whether high-quality wines have distinct chemical profiles
- Overlap or clear separation between quality levels


In [None]:
print("="*75)
print("GROUPED ANALYSIS BY QUALITY LEVELS")
print("="*75)

if 'quality' in wine_data.columns:
    # Select top 4 features most correlated with quality
    quality_corr = correlation_matrix['quality'].drop('quality').abs().sort_values(ascending=False)
    top_4_features = quality_corr.head(4).index.tolist()
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 13))
    axes = axes.flatten()
    
    for idx, feature in enumerate(top_4_features):
        sns.boxplot(data=wine_data, x='quality', y=feature, palette='RdYlGn', ax=axes[idx])
        axes[idx].set_xlabel('Quality Score', fontweight='bold')
        axes[idx].set_ylabel(feature, fontweight='bold')
        axes[idx].set_title(f'{feature} Distribution by Quality Level', fontweight='bold')
        axes[idx].grid(axis='y', alpha=0.3)
    
    plt.suptitle('Chemical Properties by Wine Quality', fontsize=16, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()
    
    # Statistical summary by quality
    print("\nMEAN VALUES BY QUALITY LEVEL:\n")
    grouped_statistics = wine_data.groupby('quality')[top_4_features].mean().round(3)
    display(grouped_statistics)

---
### 4.4 Multivariate Pairplot

A **pairplot** creates a grid of scatter plots for all variable combinations:
- Complete overview of relationships
- Diagonal KDE curves showing each feature's distribution
- Color-coding by quality level

To keep the plot readable and quick to render, we select the 5 features most strongly correlated with quality and color the points by quality group.
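Quality groups can be built with `pd.cut`, whose bins are right-closed by default. The sketch below shows how a few example scores fall into three groups; the scores and the short labels are chosen for illustration.

```python
import pandas as pd

scores = pd.Series([3, 5, 6, 7, 9])

# Bins are right-closed by default: (0, 5], (5, 6], (6, 10]
groups = pd.cut(scores, bins=[0, 5, 6, 10], labels=['Low', 'Medium', 'High'])

print(groups.tolist())
```

A score of exactly 5 lands in the first bin and a score of exactly 6 in the second, because each bin includes its right edge but not its left.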


In [None]:
print("="*75)
print("MULTIVARIATE ANALYSIS - PAIRPLOT")
print("="*75)

# Select top 5 features
if 'quality' in numeric_cols:
    quality_corr = correlation_matrix['quality'].drop('quality').abs().sort_values(ascending=False)
    top_5_features = quality_corr.head(5).index.tolist()
    
    print(f"Creating pairplot for: {top_5_features}\n")
    
    # Create quality categories
    wine_data['quality_group'] = pd.cut(wine_data['quality'], bins=[0, 5, 6, 10], labels=['Low (3-5)', 'Medium (6)', 'High (7-9)'])
    
    # Generate pairplot
    pairplot_subset = wine_data[top_5_features + ['quality_group']].copy()
    g = sns.pairplot(pairplot_subset, hue='quality_group', palette='RdYlGn',
                     diag_kind='kde', plot_kws={'alpha': 0.6, 's': 30, 'edgecolor': 'black', 'linewidth': 0.3},
                     height=2.5, aspect=1.1)
    g.fig.suptitle('Pairplot - Top Features by Quality Category', fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    # Remove temporary column
    wine_data.drop('quality_group', axis=1, inplace=True)

---
## Section 5: Key Insights and Conclusions

### Summary of Findings

Based on comprehensive exploratory data analysis of the Wine Quality dataset, we present the most significant discoveries:


In [None]:
print("="*75)
print("KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS")
print("="*75)

findings = []

# Insight 1: Dataset Overview
findings.append(f"**1. Dataset Overview**\n"
                f"  • Total wine samples: {wine_data.shape[0]}\n"
                f"  • Features analyzed: {wine_data.shape[1]} ({len(numeric_cols)} numerical)\n"
                f"  • Quality range: {wine_data['quality'].min()}-{wine_data['quality'].max()}, Mean: {wine_data['quality'].mean():.2f}\n"
                f"  • Data completeness: {(1 - wine_data.isnull().sum().sum()/(wine_data.shape[0]*wine_data.shape[1]))*100:.1f}%")

# Insight 2: Quality Distribution
mode_quality = wine_data['quality'].mode()[0]
mode_frequency = (wine_data['quality'] == mode_quality).sum()
findings.append(f"\n**2. Wine Quality Distribution**\n"
                f"  • Most frequent quality: {mode_quality} ({mode_frequency} samples, {mode_frequency/len(wine_data)*100:.1f}%)\n"
                f"  • Distribution: {'imbalanced' if (wine_data['quality'].value_counts().max()/len(wine_data)) > 0.5 else 'balanced'}\n"
                f"  • Share of wines rated 5-6: {wine_data['quality'].isin([5, 6]).mean()*100:.1f}%\n"
                f"  • Samples rated 4 or lower, or 8 or higher: {((wine_data['quality'] <= 4) | (wine_data['quality'] >= 8)).sum()}")

# Insight 3: Outliers
total_outliers = outlier_df['Outlier_Count'].sum()
top_outlier_feature = outlier_df.nlargest(1, 'Outlier_Count').iloc[0]
findings.append(f"\n**3. Data Quality & Outliers**\n"
                f"  • Total outlier flags: {total_outliers} (summed over features; a row can count more than once)\n"
                f"  • Highest outlier feature: {top_outlier_feature['Feature']} ({top_outlier_feature['Outlier_Count']} outliers)\n"
                f"  • Treatment: capped copy stored in wine_data_cleaned; analysis uses the original data\n"
                f"  • Missing values remaining: {wine_data.isnull().sum().sum()}")

# Insight 4: Correlations
if 'quality' in numeric_cols:
    top_positive = correlation_matrix['quality'].drop('quality').sort_values(ascending=False).head(2)
    top_negative = correlation_matrix['quality'].drop('quality').sort_values(ascending=True).head(2)
    
    findings.append(f"\n**4. Key Quality Predictors**\n"
                    f"  • Strongest positive correlations:\n"
                    f"    - {top_positive.index[0]}: r={top_positive.values[0]:.3f}\n"
                    f"    - {top_positive.index[1]}: r={top_positive.values[1]:.3f}\n"
                    f"  • Strongest negative correlations:\n"
                    f"    - {top_negative.index[0]}: r={top_negative.values[0]:.3f}\n"
                    f"    - {top_negative.index[1]}: r={top_negative.values[1]:.3f}")

# Insight 5: Multicollinearity
findings.append(f"\n**5. Feature Relationships**\n"
                f"  • Highly correlated pairs detected: {len(high_correlation_pairs)}\n"
                f"  • Indicates potential multicollinearity\n"
                f"  • Consider dimensionality reduction techniques")

for finding in findings:
    print(finding)

print("\n" + "="*75)
print("RECOMMENDATIONS FOR MACHINE LEARNING")
print("="*75)
print("1. Model Selection:")
print("   • Ensemble methods (Random Forest, Gradient Boosting) recommended")
print("   • Neural networks for complex non-linear relationships")
print("\n2. Data Preprocessing:")
print("   • Feature scaling necessary due to varying scales")
print("   • Consider polynomial features for interactions")
print("   • Address class imbalance using SMOTE or class weights")
print("\n3. Feature Engineering:")
print("   • Create ratio features (e.g., free SO2 / total SO2)")
print("   • Interaction terms for correlated features")
print("   • Consider PCA for dimensionality reduction")
print("\n4. Model Validation:")
print("   • Use stratified k-fold cross-validation")
print("   • Evaluate using multiple metrics (accuracy, F1, confusion matrix)")
print("   • Test on holdout dataset for final performance")

---
## Conclusion

This comprehensive exploratory data analysis has provided deep insights into the Wine Quality dataset. The analysis revealed:

1. **Data Quality:** The dataset is complete with no missing values, though outliers exist in several features
2. **Target Distribution:** Quality scores show concentration around average ratings (5-6)
3. **Feature Relationships:** Individual chemical properties show weak-to-moderate linear correlations with quality (alcohol is typically the strongest); exact values are printed in Section 4.1
4. **Modeling Implications:** Class imbalance and feature correlations require careful handling

These patterns suggest that physicochemical properties carry usable signal for predicting wine quality, though how well a model performs can only be established by training and validating one. The analysis provides a starting point for that work, including the preprocessing and model-selection choices listed above.

---