# 🍷 Wine Classification Workshop — Answer Key

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

plt.rcParams['figure.figsize'] = (6, 4)


## Load the dataset

In [None]:
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
df.head()


## EDA

In [None]:
print("Shape:", df.shape)
df.describe()


In [None]:
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
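
Before modeling, it helps to know how many samples each class has, since accuracy is easier to interpret when the class sizes are known. A minimal sketch of that check; the names `class_counts` and `class_share` are introduced here for illustration and are not part of the workshop code.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
target = pd.Series(wine.target, name="target")

# Count samples per class, ordered by class label
class_counts = target.value_counts().sort_index()
print(class_counts)

# Share of each class in the full dataset
class_share = (class_counts / class_counts.sum()).round(3)
print(class_share)
```

The classes are not perfectly balanced, so per-class metrics are worth reading alongside overall accuracy.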


## 🍇 The Three Wine Cultivars (Target Classes)

The Wine dataset contains chemical analyses of wines from the same region in Italy, but from **three different grape varieties (cultivars)**.  
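In scikit‑learn these are labeled `0`, `1`, and `2`. scikit‑learn does not name the cultivars; the names below follow the attribution commonly given for the original study, so treat them as context rather than ground truth.

- **Class 0 → Barolo (Nebbiolo grape)**: bold, high tannin and acidity, ages well; in this data it has the highest average **alcohol**, **flavanoids**, and **proline**.
- **Class 1 → Grignolino**: light and pale in color; in this data it has the lowest average **alcohol**, **color intensity**, and **proline**.
- **Class 2 → Barbera**: fruit‑forward with higher acidity; in this data it has the highest average **malic acid** and **color intensity**, and the lowest **flavanoids**.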
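
The per-class tendencies listed above can be checked directly against the data. The sketch below prints the class means for a handful of features; the `features` selection is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

# Features mentioned in the class descriptions
features = ["alcohol", "malic_acid", "flavanoids", "color_intensity", "proline"]

# Mean of each feature within each class
class_means = df.groupby("target")[features].mean().round(2)
print(class_means)
```

Reading down each column shows which class is highest or lowest on that feature.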


## 🔍 PCA (Principal Component Analysis)

We have 13 features. **PCA** finds new axes (*principal components*) that capture the most variance and lets us project the data to **2D** for visualization.

**Your task:** reduce the dataset to 2 components and plot them colored by class.


In [None]:
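# Separate the 13 features from the class label
X = df.drop('target', axis=1)

# Project the features onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Scatter plot of the projection, one color per class
scatter = plt.scatter(components[:, 0], components[:, 1], c=df['target'])
plt.legend(*scatter.legend_elements(), title="Class")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Wine Data PCA (2D)")
plt.show()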
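
PCA is driven by variance, so features measured on large scales (such as `proline`) can dominate the components when the data is left unscaled. The sketch below compares the explained variance ratio with and without standardization. Using `StandardScaler` in a pipeline is one common option, added here as an assumption rather than as part of the workshop task.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA on raw features: variance is dominated by large-scale columns
raw_pca = PCA(n_components=2).fit(X)
print("Raw explained variance ratio:   ", raw_pca.explained_variance_ratio_.round(3))

# Standardize each feature to zero mean and unit variance, then apply PCA
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scaled_components = scaled_pca.fit_transform(X)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print("Scaled explained variance ratio:", scaled_ratio.round(3))
```

`scaled_components` has the same shape as the unscaled projection, so it can be passed to the same scatter plot to compare how well the classes separate.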


## 🌳 Decision Tree Classifier & Metrics

A **Decision Tree** predicts by asking if/else questions on features and splitting data to increase purity.
- Common criteria: **Gini** (default), **Entropy**
- Key hyperparameters: `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`
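
To make "purity" concrete, the sketch below computes Gini impurity and entropy for a small set of labels and for a candidate split. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are hypothetical; this is a hand-rolled illustration, not scikit-learn's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Toy node with two equally frequent classes
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]

print("Parent Gini:   ", gini(parent))
print("Parent entropy:", entropy(parent))
print("Split Gini:    ", weighted_impurity(left, right))
```

A split is attractive when the weighted impurity of the children is well below the impurity of the parent; here the split separates the two classes completely.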

### Evaluation metrics (per class unless noted)
- **Precision = TP / (TP + FP)**: of the samples predicted as this class, how many were correct?
- **Recall = TP / (TP + FN)**: of the samples that actually belong to this class, how many did we find?
- **F1 = 2·(P·R)/(P+R)**: harmonic mean of precision and recall
- **Support**: number of true instances of the class
- **Accuracy** (overall): fraction of all predictions that are correct
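
The formulas above can be applied by hand to a small example. The labels below are invented purely for illustration; the loop treats each class in turn as the "positive" class and then cross-checks the result against `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels, invented for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

manual = {}
for k in np.unique(y_true):
    tp = np.sum((y_pred == k) & (y_true == k))  # predicted k, truly k
    fp = np.sum((y_pred == k) & (y_true != k))  # predicted k, truly another class
    fn = np.sum((y_pred != k) & (y_true == k))  # truly k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = int(np.sum(y_true == k))
    manual[int(k)] = (precision, recall, f1, support)
    print(f"class {k}: precision={precision:.3f} recall={recall:.3f} "
          f"f1={f1:.3f} support={support}")

accuracy = np.mean(y_true == y_pred)
print(f"accuracy={accuracy:.3f}")

# Cross-check against scikit-learn's per-class values
p, r, f, s = precision_recall_fscore_support(y_true, y_pred)
```

Precision and recall happen to coincide for each class in this toy example; on real predictions they usually differ.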


In [None]:
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


In [None]:
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
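
Once the tree is fitted, its if/else questions can be printed and the most influential features ranked. The sketch below refits the same model so that it runs on its own; `rules` and `importances` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Print the learned if/else rules as indented text
rules = export_text(dt, feature_names=list(X.columns))
print(rules)

# Rank features by impurity-based importance (values sum to 1)
importances = pd.Series(dt.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Features with an importance of zero were never used for a split at this depth.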


In [None]:
y_pred = dt.predict(X_test)
# Reuse the dataset loaded earlier instead of calling load_wine() again
print(classification_report(y_test, y_pred, target_names=wine.target_names))


## 📊 Confusion Matrix

- **Rows**: true classes, **Columns**: predicted classes
- Diagonal = correct predictions; off‑diagonal = misclassifications.

Use this to see **which classes** are confused.


In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dt.classes_)
disp.plot(cmap='Blues', values_format='d')
plt.title("Decision Tree - Confusion Matrix")
plt.show()
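
Raw counts can be hard to compare when classes have different sizes. As a sketch, the matrix can be normalized by row with `normalize="true"`, so each row sums to 1 and the diagonal equals per-class recall. The code refits the same tree so that it runs on its own, and it skips plotting.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Raw counts and row-normalized proportions
cm = confusion_matrix(y_test, y_pred)
cm_norm = confusion_matrix(y_test, y_pred, normalize="true")
print(cm_norm.round(2))

# List every off-diagonal cell with a nonzero count
off_diagonal = cm - np.diag(np.diag(cm))
for true_k, pred_k in zip(*np.nonzero(off_diagonal)):
    print(f"true class {true_k} predicted as {pred_k}: {cm[true_k, pred_k]} sample(s)")

# The diagonal of the row-normalized matrix is per-class recall
per_class_recall = recall_score(y_test, y_pred, average=None)
```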
