In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")

In [None]:
from sklearn.datasets import load_wine

In [None]:
df = load_wine(as_frame=True).frame

In [None]:
df.head()

In [None]:
wine_data=load_wine()

In [None]:
df['target'] = wine_data.target

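Why? We load the data and make sure the target column is present. With as_frame=True, the .frame attribute already includes target, so the assignment above simply overwrites it with identical values. This column holds the three types of wine (cultivars 0, 1, and 2), which is what we ultimately want to predict.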

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

What to Look For:

    Shape: You should see 178 rows and 14 columns (13 features + 1 target).

    Info: Confirm that all columns are non-null (it's a clean dataset, like Iris).

    Describe: Look closely at the mean and std (standard deviation). You'll notice a massive difference in scale:

        proline has a mean in the hundreds (roughly 750).

        magnesium has a mean of around 100.

        color_intensity has a mean of around 5.

        ash has a mean of around 2.

    This difference in scale is a key challenge for this dataset: distance-based and gradient-based models will need Feature Scaling.
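The scale differences described above are usually handled by standardizing each feature to zero mean and unit variance. The following is a minimal sketch, assuming scikit-learn's StandardScaler is an acceptable choice here (the notebook itself does not prescribe a scaler). The names wine_df, feature_cols, and scaled are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Reload the data so this cell is self-contained (wine_df is a hypothetical name)
wine_df = load_wine(as_frame=True).frame

# Scale only the 13 features, never the class label
feature_cols = [c for c in wine_df.columns if c != "target"]

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(wine_df[feature_cols]),
    columns=feature_cols,
)

# After scaling, every feature should have mean ~0 and std ~1
print(scaled[["proline", "magnesium", "ash"]].describe().loc[["mean", "std"]].round(2))
```

In a modeling workflow the scaler would normally be fit on the training split only and then applied to the test split, to avoid leaking test statistics into training.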

#### Task 3: Target Balance Check (Univariate)

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=df)
plt.title('Distribution of Wine Cultivars (Target)')
plt.show()

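Why? If one class is much smaller than the others (e.g., only 10 samples), a model built later may be biased toward the larger classes. Here the three cultivars have 59, 71, and 48 samples, so the dataset is only mildly imbalanced.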
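The balance check can also be done numerically rather than by reading bar heights. A small sketch, assuming the same data as above; wine_df, class_counts, and class_share are illustrative names:

```python
from sklearn.datasets import load_wine

# Reload the data so this cell is self-contained (wine_df is a hypothetical name)
wine_df = load_wine(as_frame=True).frame

# Absolute number of samples per cultivar, ordered by class label
class_counts = wine_df["target"].value_counts().sort_index()

# Fraction of the dataset that each cultivar represents
class_share = (class_counts / len(wine_df)).round(3)

print(class_counts)
print(class_share)
```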

#### Phase 3: Bivariate Analysis (Finding Predictors)

In [None]:
'''
Task 4: Numerical Correlation (Heatmap)

Since there are 13 features, visualizing all correlations is best done with a Heatmap.
'''

# Calculate the correlation matrix
corrmat = df.corr()

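plt.figure(figsize=(10, 8))
# Pin the color scale to [-1, 1] so the neutral midpoint of the colormap marks zero correlation
sns.heatmap(corrmat, annot=False, cmap='coolwarm', vmin=-1, vmax=1)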
plt.title('Correlation Matrix of Wine Features')
plt.show()

Correlation is a statistical measure of how strongly two variables are linearly related. In the wine dataset, it describes how the chemical properties relate to each other and, crucially, to the wine's target class.

The relationship is measured with a correlation coefficient, a value between -1 and +1.

Interpreting correlation coefficients:

    Positive Correlation (\(r>0\)): As the value of one variable increases, the other tends to increase as well. For example, total_phenols and flavanoids are strongly positively correlated, so wines with more total phenols tend to have more flavanoids.

    Negative Correlation (\(r<0\)): As one variable's value increases, the other's value tends to decrease. For instance, malic_acid and hue are negatively correlated, so wines with more malic acid tend to have a lower hue value.

    No Correlation (\(r\approx 0\)): There is no discernible linear relationship between the variables (a nonlinear relationship may still exist).

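What to Look For: Look for areas where the colors are very dark red (high positive correlation ≈+1) or very dark blue (high negative correlation ≈−1). This shows which features measure similar things. For example, flavanoids and od280/od315_of_diluted_wines are strongly positively correlated, whereas alcohol and od280/od315_of_diluted_wines are almost uncorrelated. Keep in mind that target is a class label stored as an integer, so its correlations with the features are only a rough guide.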
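With 13 features the heatmap has 78 distinct feature pairs, so it can help to list the strongest ones explicitly instead of judging colors by eye. The sketch below is one way to do that, assuming Pearson correlation (the pandas default); wine_df, pairs, and strongest are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine

# Reload the data so this cell is self-contained (wine_df is a hypothetical name)
wine_df = load_wine(as_frame=True).frame

# Feature-to-feature correlations only; the class label is left out
corr = wine_df.drop(columns="target").corr()

# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Sort by absolute value so strong negative correlations rank alongside positive ones
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(strongest.head(5))
```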


In [None]:
'''
Task 5: Feature vs. Target (Box Plot)

Let's find the best single predictor for the wine class.
'''

plt.figure(figsize=(12, 6))
# Analyze 'alcohol' vs. the 'target' class
sns.boxplot(x='target', y='alcohol', data=df)
plt.title('Alcohol Content Distribution by Wine Cultivar')
plt.show()


Why? We check whether alcohol content (a key feature) is enough on its own to visually separate the three classes.

Standard Interpretation:

    Class 0 (High): Generally has the highest median alcohol content.

    Class 2 (Middle): Generally has the middle median alcohol content.

    Class 1 (Low): Generally has the lowest median alcohol content.

    The distributions overlap, particularly between classes 0 and 2, so alcohol alone does not cleanly separate the cultivars.
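The ordering of the classes in the box plot can be checked numerically by summarizing alcohol per class. A minimal sketch, assuming the same data as above; wine_df and alcohol_by_class are illustrative names:

```python
from sklearn.datasets import load_wine

# Reload the data so this cell is self-contained (wine_df is a hypothetical name)
wine_df = load_wine(as_frame=True).frame

# Summary statistics of alcohol for each cultivar
alcohol_by_class = wine_df.groupby("target")["alcohol"].agg(
    ["median", "mean", "min", "max"]
)

# Sorting by median makes the ranking of the classes explicit
print(alcohol_by_class.sort_values("median", ascending=False).round(2))
```

Comparing the min and max columns across classes also shows how much the ranges overlap, which is what limits alcohol as a single predictor.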

#### Phase 4: Multivariate Analysis (Finding Feature Interaction)

In [None]:
'''
Task 6: Visualize Feature Interactions (Pair Plot) 

A Pair Plot shows every selected numerical feature plotted against every other selected feature,
colored by the target variable. This helps us find combinations of features that best separate the three classes.

Create the Pair Plot: Since there are 13 features, we'll plot only four of them to keep the plot manageable.
Let's use alcohol, malic_acid, proanthocyanins, and color_intensity.
'''

# Select a subset of features for a manageable pair plot
subset_features = ['alcohol', 'malic_acid', 'proanthocyanins', 'color_intensity', 'target']

sns.pairplot(df[subset_features], hue='target', height=2.5)
plt.suptitle('Pair Plot of Key Wine Features', y=1.02)
plt.show()