# 🍷 Wine Dataset EDA - 30-Minute Exercise

This notebook walks through a structured 30-minute exploratory data analysis of the `wine` dataset from `sklearn`.

## 📋 Objectives
- Load and inspect the data
- Perform univariate analysis
- Visualize class distributions and correlations
- Comment on findings relevant to ML modeling

In [None]:
# 📦 Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine

# 🔧 Configure plots
sns.set_theme(style="whitegrid")  # sns.set is an older alias of set_theme
plt.rcParams['figure.figsize'] = (10, 6)

## 1️⃣ Data Loading & Initial Inspection

In [None]:
# Load the wine dataset
wine_dataset = load_wine()

# Build a DataFrame of the features and append the integer class label
df = pd.DataFrame(wine_dataset.data, columns=wine_dataset.feature_names)
df['target'] = wine_dataset.target

print("Dataset shape:", df.shape)
print("\nTarget classes:", wine_dataset.target_names)
print("\nFirst few rows:")
df.head()

In [None]:
# Basic info about the dataset
print("Dataset Info:")
df.info()

print("\n" + "="*50)
print("Total missing values (all columns):")
print(df.isnull().sum().sum())

print("\n" + "="*50)
print("Target distribution:")
print(df['target'].value_counts().sort_index())

## 2️⃣ Univariate Analysis

In [None]:
# Descriptive statistics
print("Descriptive Statistics:")
df.describe()

In [None]:
# Distribution of target classes
plt.figure(figsize=(8, 5))
target_counts = df['target'].value_counts().sort_index()
plt.bar(range(len(target_counts)), target_counts.values)
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.title('Distribution of Wine Classes')
plt.xticks(range(len(wine_dataset.target_names)), wine_dataset.target_names)
plt.show()

print("Class distribution:")
for i, name in enumerate(wine_dataset.target_names):
    count = target_counts[i]
    percentage = (count / len(df)) * 100
    print(f"{name}: {count} samples ({percentage:.1f}%)")

## 3️⃣ Feature Distributions

In [None]:
# Plot histograms for all features
fig, axes = plt.subplots(4, 4, figsize=(15, 12))
axes = axes.ravel()

feature_cols = [col for col in df.columns if col != 'target']

for i, col in enumerate(feature_cols[:16]):
    axes[i].hist(df[col], bins=20, alpha=0.7, edgecolor='black')
    axes[i].set_title(col, fontsize=10)
    axes[i].tick_params(labelsize=8)

# Hide extra subplots if we have fewer than 16 features
for i in range(len(feature_cols), 16):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()

## 4️⃣ Correlation Analysis

In [None]:
# Correlation matrix (excluding target)
plt.figure(figsize=(12, 10))
correlation_matrix = df.drop('target', axis=1).corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Find highly correlated pairs
print("Highly correlated feature pairs (|correlation| > 0.8):")
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(float(corr_val)) > 0.8:
            high_corr_pairs.append((correlation_matrix.columns[i], 
                                   correlation_matrix.columns[j], 
                                   corr_val))

for pair in high_corr_pairs:
    print(f"{pair[0]} <-> {pair[1]}: {pair[2]:.3f}")

## 5️⃣ Class-wise Analysis

In [None]:
# Box plots for key features by wine class
key_features = ['alcohol', 'flavanoids', 'color_intensity', 'proline']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, feature in enumerate(key_features):
    sns.boxplot(data=df, x='target', y=feature, ax=axes[i])
    axes[i].set_title(f'{feature} by Wine Class')
    axes[i].set_xlabel('Wine Class')

plt.tight_layout()
plt.show()

In [None]:
# Pairplot for selected features
selected_features = ['alcohol', 'flavanoids', 'color_intensity', 'target']
sns.pairplot(df[selected_features], hue='target', diag_kind='hist')
plt.suptitle('Pairplot of Key Features by Wine Class', y=1.02)
plt.show()

## 6️⃣ Summary & ML Modeling Insights

In [None]:
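# Rough feature ranking by absolute Pearson correlation with the integer-coded target
# (a heuristic only: the class labels are nominal, so the 0/1/2 ordering is arbitrary)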
feature_target_corr = df.drop('target', axis=1).corrwith(df['target']).abs().sort_values(ascending=False)

print("Features most correlated with target (absolute correlation):")
print(feature_target_corr.head(10))

# Visualize top correlated features
plt.figure(figsize=(10, 6))
top_features = feature_target_corr.head(10)
plt.barh(range(len(top_features)), top_features.values)
plt.yticks(range(len(top_features)), [str(label) for label in top_features.index])
plt.xlabel('Absolute Correlation with Target')
plt.title('Top 10 Features by Correlation with Wine Class')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 📝 Key Findings & ML Modeling Recommendations

**Dataset Characteristics:**
- 178 samples, 13 features, 3 wine classes
- No missing values - clean dataset
- Mildly imbalanced classes (59, 71, and 48 samples)

**Feature Insights:**
- Some feature pairs are strongly correlated (for example, `total_phenols` and `flavanoids`), which points to potential multicollinearity
- Features like flavanoids, proline, and alcohol show good class separation
- Some features have skewed distributions (for example, `malic_acid` and `magnesium` appear right-skewed in the histograms)
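
The observations above can be checked numerically. The sketch below is one possible way to do it, not part of the original analysis: it uses sample skewness for distribution shape and the one-way ANOVA F-statistic (`sklearn.feature_selection.f_classif`) as a stand-in measure of class separation. The variable names `skewness` and `separation` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif

# Reload the data so this cell is self-contained
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

# Sample skewness per feature; values far from 0 indicate an asymmetric distribution
skewness = X.skew().sort_values(ascending=False)

# One-way ANOVA F-statistic per feature; larger values mean the class means differ more
# relative to the within-class spread (assumption: used here as a proxy for "separation")
f_scores, p_values = f_classif(X, y)
separation = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

print("Most skewed features:")
print(skewness.head())
print("\nFeatures with the largest F-statistic:")
print(separation.head())
```

Unlike a Pearson correlation with the 0/1/2 target, the F-statistic does not depend on how the classes happen to be numbered.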

**ML Modeling Recommendations:**
1. **Feature Selection**: Consider removing highly correlated features
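2. **Scaling**: Features span very different scales (`proline` is in the hundreds to thousands, `hue` is close to 1), so standardize before fitting scale-sensitive models such as SVM or logistic regression; tree-based models do not need it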
3. **Class Balance**: Classes are reasonably balanced, no special handling needed
4. **Model Choice**: Given clear class separation, tree-based models or SVM should work well
5. **Cross-validation**: Small dataset - use stratified k-fold CV
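
As a minimal sketch of recommendations 2, 4, and 5, the cell below wraps scaling and the model in a pipeline and scores it with stratified k-fold cross-validation. The choice of 5 folds, the random seed, and the hyperparameters are assumptions made for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Stratified folds keep the class proportions roughly equal in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline so it is fit on the training folds only
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),  # no scaling needed
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting `StandardScaler` inside the pipeline avoids leaking statistics from the validation fold into the training step.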

**Next Steps:**
- Feature scaling/normalization
- Feature selection (remove highly correlated features)
- Try multiple algorithms (Random Forest, SVM, Logistic Regression)
- Hyperparameter tuning with cross-validation
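
The feature-selection step could be sketched as follows. This is one simple heuristic, assumed here for illustration: for every pair with |correlation| above a threshold (0.8, matching the correlation analysis), drop the feature that appears later in the column order.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine(as_frame=True).data

# Absolute correlations, keeping only the strict upper triangle so each pair appears once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # same cutoff as the correlation analysis; adjust as needed

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining features:", X_reduced.shape[1])
```

Which member of a correlated pair gets dropped is arbitrary in this sketch. In practice it may be better to keep the feature that separates the classes more clearly, or to compare model performance with and without each candidate.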