# Wine Classification with RandomForest — Step by Step

This notebook walks you through a complete **multiclass classification** workflow using scikit-learn's built-in **Wine** dataset and a **RandomForestClassifier**.

**What you'll do:**
1. Install and import dependencies
2. Load and inspect the dataset
3. Split data into train/test sets
4. Train a RandomForest model
5. Evaluate accuracy and inspect predictions
6. Visualize the confusion matrix
7. Print a classification report
8. Analyze feature importances
9. Run 5-fold cross-validation
10. Plot ROC curves (one-vs-rest) with AUC for each class

## Step 0 — (Optional) Install dependencies

Run this cell **only if** you don't have the required packages installed.

In [None]:
# If needed, uncomment and run:
# !pip install scikit-learn matplotlib

## Step 1 — Import libraries

We import NumPy for numerical operations, Matplotlib for plotting, and scikit-learn for data loading, modeling, and evaluation.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve
)

## Step 2 — Load and inspect the dataset

We use scikit-learn's `load_wine()` to get features `X`, labels `y`, and metadata like feature and target names.

In [None]:
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

print(f"Shape of X: {X.shape}")
print(f"Number of classes: {len(wine_data.target_names)}")
print("Classes:", wine_data.target_names)
print("First 5 feature names:", wine_data.feature_names[:5])

## Step 3 — Train/Test split

We split the dataset into training and test sets so we can evaluate the model on unseen data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")

## Step 4 — Initialize and train the RandomForest

We create a `RandomForestClassifier` with a fixed random seed for reproducibility and fit it to the training data.

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

## Step 5 — Make predictions and compute accuracy

We predict on the test set and measure **accuracy** (fraction of correct predictions).

In [None]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

## Step 6 — Confusion matrix (numeric and heatmap)

The confusion matrix counts predictions per class: rows are the true labels, columns are the predicted labels, and correct predictions fall on the diagonal.
We print the raw matrix and then visualize it as a heatmap using **Matplotlib**.

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix (numeric):\n", conf_matrix)

# Plot confusion matrix with Matplotlib (no seaborn)
plt.figure(figsize=(8, 6))
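# Use a sequential colormap (light = low, dark = high) so the white/black cell labels below stay readable
im = plt.imshow(conf_matrix, interpolation='nearest', cmap='Blues')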
plt.title("Confusion Matrix")
plt.colorbar(im, fraction=0.046, pad=0.04)
tick_marks = np.arange(len(wine_data.target_names))
plt.xticks(tick_marks, wine_data.target_names, rotation=45, ha='right')
plt.yticks(tick_marks, wine_data.target_names)

# Add counts to each cell
thresh = conf_matrix.max() / 2.0 if conf_matrix.max() > 0 else 0.5
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        plt.text(j, i, format(conf_matrix[i, j], 'd'),
                 ha="center", va="center",
                 color="white" if conf_matrix[i, j] > thresh else "black")

plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.tight_layout()
plt.show()

## Step 7 — Classification report

The classification report provides **precision**, **recall**, and **F1-score** for each class, along with overall accuracy and macro/weighted averages.

In [None]:
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=wine_data.target_names))

## Step 8 — Feature importances

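Random forests expose impurity-based **feature importances** through `feature_importances_`: each feature's share of the total impurity reduction across all trees, normalized to sum to 1. We print the values in descending order and plot them as a bar chart.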

In [None]:
feature_importances = model.feature_importances_
indices = np.argsort(feature_importances)[::-1]

print("Feature Importances:")
for i in range(X.shape[1]):
    print(f"{wine_data.feature_names[indices[i]]}: {feature_importances[indices[i]]:.4f}")

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), feature_importances[indices], align="center")
plt.xticks(range(X.shape[1]), [wine_data.feature_names[i] for i in indices], rotation=90)
plt.tight_layout()
plt.show()

## Step 9 — 5-fold cross-validation

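We evaluate the model with **5-fold cross-validation** on the full dataset to assess stability across different splits. `cross_val_score` fits a fresh clone of the model on each fold, so the fit from Step 4 is not reused.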

In [None]:
cross_val = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {cross_val.mean():.2f} ± {cross_val.std():.2f}")

## Step 10 — ROC curves and AUC (one-vs-rest)

For multiclass problems, we compute **one-vs-rest** ROC curves: for each class, treat it as positive vs. all others as negative.
We use predicted probabilities from the RandomForest to compute **AUC** per class and plot the ROC curves.

In [None]:
# Predicted probabilities for test set
y_prob = model.predict_proba(X_test)

# Compute ROC and AUC per class
fpr = {}
tpr = {}
roc_auc = {}
n_classes = len(wine_data.target_names)

for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test == i, y_prob[:, i])
    roc_auc[i] = roc_auc_score(y_test == i, y_prob[:, i])

# Plot ROC curves
plt.figure(figsize=(10, 6))
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label=f"Class {wine_data.target_names[i]} (AUC = {roc_auc[i]:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves — Wine Classification (One-vs-Rest)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

## 📘 What each code block does

**Step 0 — Install dependencies (optional)**  
- Provides a `pip install` command (commented out) in case required libraries are missing.

**Step 1 — Import libraries**  
- Imports NumPy for numeric operations and Matplotlib for plotting.  
- Imports scikit-learn utilities: dataset loader, model selection helpers, the RandomForest model, and metrics for evaluation.

**Step 2 — Load and inspect the dataset**  
- Loads the Wine dataset into `X` (features) and `y` (labels).  
- Prints dataset shape, class names, and some feature names to quickly verify what we’re working with.
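
As a quick addition to the Step 2 inspection, the sketch below counts the samples per class so that class imbalance is visible before splitting. It is not part of the original notebook; the variable names `wine` and `class_counts` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(wine.target)

# Pair each count with its class name and show its share of the dataset
for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples ({count / wine.target.size:.1%})")
```

The classes are not perfectly balanced, which is worth keeping in mind when reading per-class metrics later.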

**Step 3 — Train/Test split**  
- Splits the data into a training set (70%) and a test set (30%).  
- `random_state=42` ensures reproducibility of the split.
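
The notebook's split is purely random. One possible variant, sketched below, passes `stratify=y` so that the test set keeps roughly the same class proportions as the full dataset. This is an assumption-level alternative, not what the notebook does, and it would change the reported results; the `_s` suffixed names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# stratify=y is the only change relative to the notebook's split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Compare class fractions in the full dataset and in the stratified test set
full_frac = np.bincount(y) / y.size
test_frac = np.bincount(y_test_s) / y_test_s.size
print("Class fractions (full):", np.round(full_frac, 3))
print("Class fractions (test):", np.round(test_frac, 3))
```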

**Step 4 — Initialize and train the RandomForest**  
- Creates a `RandomForestClassifier` with 100 trees.  
- Fits the model on the training data (`X_train`, `y_train`).
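
Because each tree in a random forest is trained on a bootstrap sample, the samples left out of a tree can serve as a built-in validation set. The sketch below enables scikit-learn's `oob_score=True` option to obtain this out-of-bag estimate. The notebook itself does not use it; `rf_oob` is an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# oob_score=True is an addition for illustration; other settings match Step 4
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)

# The OOB estimate uses only training data; the test score uses held-out data
print(f"OOB accuracy estimate: {rf_oob.oob_score_:.3f}")
print(f"Test accuracy:         {rf_oob.score(X_test, y_test):.3f}")
```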

**Step 5 — Make predictions and compute accuracy**  
- Uses the trained model to predict `y_pred` on `X_test`.  
- Computes overall **accuracy** — the share of correct predictions.
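
To make the definition concrete, the toy example below computes accuracy by hand and compares it with `accuracy_score`. The label arrays are made up for illustration and are unrelated to the Wine data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: three of the four predictions are correct
y_true_toy = np.array([0, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 2])

# Accuracy is the mean of the element-wise "prediction equals truth" mask
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)
```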

**Step 6 — Confusion matrix (numeric and heatmap)**  
- Computes a confusion matrix to see per-class correctness and error types.  
- Plots the matrix using Matplotlib and annotates each cell with counts.
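
The confusion matrix also contains everything needed for per-class precision and recall. The sketch below derives both from a small made-up example (not the Wine results) and can be checked against scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels with a few deliberate mistakes
y_true_toy = [0, 0, 1, 1, 1, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2]

cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows = true, columns = predicted
correct = np.diag(cm)                          # correctly classified count per class

# Recall: correct / all samples that truly belong to the class (row sums)
recall_per_class = correct / cm.sum(axis=1)
# Precision: correct / all samples predicted as the class (column sums)
precision_per_class = correct / cm.sum(axis=0)

print(cm)
print("Recall:   ", recall_per_class)
print("Precision:", precision_per_class)
```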

**Step 7 — Classification report**  
- Prints **precision**, **recall**, **F1-score**, and **support** for each class, plus macro/weighted averages.
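
The difference between the macro and weighted averages only shows up when classes differ in size and in score. The sketch below uses made-up, imbalanced labels and reads the averages from `classification_report(..., output_dict=True)`; the data is purely illustrative.

```python
from sklearn.metrics import classification_report

# Hypothetical imbalanced labels: class 0 is large and easy, class 2 is small and missed
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 0, 1, 2, 1]

# output_dict=True returns the report as nested dictionaries instead of text
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Macro: plain mean over classes. Weighted: mean weighted by each class's support.
macro_f1 = report["macro avg"]["f1-score"]
weighted_f1 = report["weighted avg"]["f1-score"]

print(f"Macro F1:    {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

The weighted average is pulled toward the large, well-classified class, while the macro average treats all three classes equally.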

**Step 8 — Feature importances**  
- Extracts `model.feature_importances_` to see which features the RandomForest considered most informative.  
- Sorts and displays the list, then plots a bar chart of importances.
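
Impurity-based importances are computed from the training data and can favor some features over others. As a cross-check, the sketch below computes permutation importance on the test set with `sklearn.inspection.permutation_importance`. This is a different technique from the one the notebook uses, added only for comparison; `n_repeats=10` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the accuracy drop
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Show the five features with the largest mean accuracy drop
order = np.argsort(perm.importances_mean)[::-1]
for idx in order[:5]:
    print(
        f"{wine.feature_names[idx]}: "
        f"impurity={rf.feature_importances_[idx]:.4f}, "
        f"permutation={perm.importances_mean[idx]:.4f}"
    )
```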

**Step 9 — 5-fold cross-validation**  
- Evaluates the model using 5-fold CV over the **entire dataset**: each fold serves once as the held-out set while a fresh copy of the model is fit on the other four.  
- Reports mean accuracy and its standard deviation as a robustness check.
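
With `cv=5` and a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below makes the splitter explicit and adds shuffling with a fixed seed, which is one possible variation rather than the notebook's setup; the names `cv` and `scores` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Explicit splitter: stratified folds, shuffled once with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print(f"Mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```

Printing the per-fold scores alongside the mean makes it easier to spot a single weak fold.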

**Step 10 — ROC curves and AUC (one-vs-rest)**  
- Gets class probability estimates with `predict_proba`.  
- For each class, computes an ROC curve by treating that class as positive vs. the others as negative.  
- Plots ROC curves and shows **AUC** per class (area under the curve).
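
The per-class AUC values can be summarized in one number. The sketch below asks `roc_auc_score` for the macro-averaged one-vs-rest AUC directly and compares it with the mean of the per-class values computed as in Step 10. The model setup repeats Steps 3 and 4 so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)

# Per-class AUC: class k is positive, every other class is negative
per_class_auc = [
    roc_auc_score(y_test == k, y_prob[:, k]) for k in range(y_prob.shape[1])
]

# Same quantity averaged over classes in a single call
macro_auc = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")

print("Per-class AUC:", np.round(per_class_auc, 3))
print(f"Macro OvR AUC: {macro_auc:.3f}")
```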

## ✅ Results & Interpretation (from your run)

**Overall accuracy:** `1.00`  
The model predicted **all 54 test samples correctly**.

**Confusion Matrix (perfect classification):**
```
[[19  0  0]
 [ 0 21  0]
 [ 0  0 14]]
```
- Every class (0, 1, 2) is predicted without any mistakes.

**Classification Report (all 1.00):**
- Precision, recall, and F1-score are **1.00 for each class**, and so are the macro and weighted averages.

**Feature Importances (top features):**
- `color_intensity`: 0.1802  
- `flavanoids`: 0.1659  
- `alcohol`: 0.1420  
- `proline`: 0.1261  
- (`od280/od315_of_diluted_wines`, `hue`, `total_phenols` follow)  

**5-fold Cross-Validation:** `0.97 ± 0.02`  
- A high mean with a small standard deviation indicates **strong generalization** across different splits.

**ROC / AUC:**  
- AUC = **1.00 for each class** → ROC curves hug the top-left corner.

### Visuals from your run
The screenshots below were saved from this run.  
<br>

<img src="https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/wine-quality/Confusion_Matrix.png" width="520" alt="Confusion Matrix" />  
<br>

<img src="https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/wine-quality/feature_importances.png" width="720" alt="Feature Importances" />  
<br>

<img src="https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/wine-quality/ROC_Curves_for_Wine_Classification.png" width="720" alt="ROC Curves" />  

### Takeaways
- The **Wine** dataset is well-separated; RandomForest can achieve **perfect** test accuracy on some splits.  
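- Cross-validation at ~**97%** suggests the result is not just luck on one split, although a perfect score on 54 test samples likely overstates the accuracy to expect on new data.  
- The features with the highest impurity-based importance in this run are **color intensity**, **flavanoids**, **alcohol**, and **proline**.  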
- For comparison or classroom use, try also **LogisticRegression** and **SVM** to see performance vs. model complexity.
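
The comparison suggested in the last takeaway could be sketched as follows. This is a minimal example under assumptions not found in the notebook: the linear and SVM models are wrapped in a pipeline with `StandardScaler` because they are sensitive to feature scale, and `max_iter=1000` and the default RBF kernel are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
}

# Evaluate every candidate with the same 5-fold protocol as Step 9
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```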