# 🍷 Wine Classification Notebook

Binary classification of red wines using the **Wine Quality (Red Wine)** dataset.

- **1** = good wine (quality ≥ 7)
- **0** = average or below-average wine (quality < 7)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    RocCurveDisplay
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (6, 4)

## 📥 Load the wine dataset

In [2]:
df = pd.read_csv('../Datasets/processed/winequality-red.csv')
df.head()

In [3]:
df.info()

## 🎯 Create the binary target variable `good_wine`

- `good_wine = 1` if `quality ≥ 7`
- `good_wine = 0` otherwise

In [4]:
df['good_wine'] = (df['quality'] >= 7).astype(int)
df['good_wine'].value_counts(), df['good_wine'].value_counts(normalize=True).round(3)
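
To make the thresholding rule concrete, here is a minimal sketch on a made-up `quality` column. The `toy` frame and its values are assumptions for illustration, not rows from the dataset. It applies the same compare-and-cast idiom and derives the share of positives, which is worth checking because a `quality >= 7` cutoff tends to leave the positive class in the minority.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only (not taken from the real CSV)
toy = pd.DataFrame({'quality': [5, 6, 7, 4, 8, 6, 5, 7]})

# Same rule as in the notebook: 1 when quality >= 7, otherwise 0
toy['good_wine'] = (toy['quality'] >= 7).astype(int)

counts = toy['good_wine'].value_counts()   # absolute counts per class
share_good = toy['good_wine'].mean()       # fraction of rows labelled 1

print(counts.to_dict())
print(round(share_good, 3))
```

Because the label is 0/1, its mean is the positive rate, which gives the same number as `value_counts(normalize=True)` for class 1.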

## 🧩 Define X and y

We use all the physicochemical variables as features and `good_wine` as the target.

In [5]:
feature_cols = [
    'fixed acidity', 'volatile acidity', 'citric acid',
    'residual sugar', 'chlorides', 'free sulfur dioxide',
    'total sulfur dioxide', 'density', 'pH',
    'sulphates', 'alcohol'
]

X = df[feature_cols].copy()
y = df['good_wine'].copy()

## 🧪 Train/Test Split (80/20, stratified)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape
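
The effect of `stratify=y` can be checked on synthetic labels. The sketch below assumes an 80/20 class imbalance chosen only for illustration (`X_toy` and `y_toy` are hypothetical names) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20 positives, 80 negatives (an assumed ratio)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,
)

# With stratification, both splits keep the positive rate of the full set
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer positives than the training set, which makes precision and recall harder to compare across runs.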

## 📏 Scaling the numeric features

We standardize the features with `StandardScaler` for KNN, the MLP, and logistic regression. The scaler is fit on the training split only and then applied to the test split.

In [7]:
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
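
As a rough illustration of what the fit/transform pair does, the sketch below standardizes two tiny made-up matrices (`train` and `test` are hypothetical) and reproduces the test transform by hand from the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices; values are illustrative only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

sc = StandardScaler().fit(train)   # learns mean and std from the training rows only
train_s = sc.transform(train)
test_s = sc.transform(test)        # reuses the training statistics

# Manual equivalent: subtract the training mean, divide by the population std (ddof=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_s.mean(axis=0), train_s.std(axis=0))
print(test_s, manual_test)
```

Calling `fit_transform` on the test split instead would rescale it with its own statistics, so the model would see test features on a different scale than the one it was trained on.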

## 🤖 Define the models to compare

1. Logistic regression
2. Decision tree
3. kNN
4. MLP (neural network)

In [8]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, class_weight='balanced'),
    'DecisionTree': DecisionTreeClassifier(max_depth=6, class_weight='balanced', random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=15),
    'MLP': MLPClassifier(hidden_layer_sizes=(32,16), max_iter=500, random_state=42)
}

models
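
Two of the models above use `class_weight='balanced'`. As a sketch of what that setting computes, the snippet below derives the weights by hand with the formula documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, and compares them with `compute_class_weight`. The 80/20 label vector is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed imbalance for illustration: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)

classes = np.unique(y_toy)

# Balanced weights: inversely proportional to class frequency
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

from_sklearn = compute_class_weight(
    class_weight='balanced', classes=classes, y=y_toy
)

print(dict(zip(classes.tolist(), manual.tolist())))
print(from_sklearn)
```

`KNeighborsClassifier` and `MLPClassifier` do not accept a `class_weight` argument, so in this comparison only the logistic regression and the decision tree are reweighted toward the minority class.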

## 📊 Model training and evaluation

In [10]:
results = []
probas = {}

for name, clf in models.items():
    print(f"\nTraining {name}...")

    # Scale-sensitive models (KNN, logistic regression, MLP) get the standardized features
    if name in ['KNN', 'LogisticRegression', 'MLP']:
        clf.fit(X_train_s, y_train)
        y_pred = clf.predict(X_test_s)
        y_proba = clf.predict_proba(X_test_s)[:,1]
    else:  # DecisionTree does not need scaling, so it uses the raw features
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        y_proba = clf.predict_proba(X_test)[:,1]

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc = roc_auc_score(y_test, y_proba)

    results.append({
        'modelo': name,
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'roc_auc': auc
    })

    probas[name] = y_proba

results_df = pd.DataFrame(results)
results_df.sort_values(by='f1', ascending=False)
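
To show how the reported metrics relate to each other, here is a small sketch that rebuilds precision, recall, F1 and accuracy from confusion-matrix counts and checks them against scikit-learn. The label vectors are made up for illustration.

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
)

# Hypothetical labels: 3 positives, 7 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)          # of the predicted positives, how many are right
recall = tp / (tp + fn)             # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(precision, recall, f1, accuracy)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat),
      f1_score(y_true, y_hat), accuracy_score(y_true, y_hat))
```

In this toy case accuracy is higher than F1 because the many true negatives inflate it, which is one reason to rank models by F1 when the positive class is rare.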

## 📈 ROC curves for the models

In [11]:
plt.figure(figsize=(7,5))
for name, y_proba in probas.items():
    RocCurveDisplay.from_predictions(y_test, y_proba, name=name, ax=plt.gca())

plt.plot([0,1],[0,1],'k--', label='Chance')
plt.title('ROC – Wine Classification')
plt.legend()
plt.show()
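
The area under a ROC curve can be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below checks that reading on made-up scores (`y_true` and `scores` are hypothetical) by counting correctly ordered pairs and comparing with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# Fraction of pairs ranked correctly; ties count as half
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Because AUC depends only on the ranking of the scores, it does not change with the 0.5 decision threshold that `predict` applies, unlike precision, recall and F1.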

## 🧮 Confusion matrix of the best model (by F1)

In [12]:
best_model_name = results_df.sort_values(by='f1', ascending=False).iloc[0]['modelo']
print('Best model by F1:', best_model_name)

best_clf = models[best_model_name]

if best_model_name in ['KNN', 'LogisticRegression', 'MLP']:
    best_clf.fit(X_train_s, y_train)
    y_pred_best = best_clf.predict(X_test_s)
else:
    best_clf.fit(X_train, y_train)
    y_pred_best = best_clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred_best)
cm_norm = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title(f'Confusion matrix – {best_model_name}')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')

sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Greens', ax=ax[1])
ax[1].set_title('Normalized matrix')
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()
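
The row normalization used above (`cm / cm.sum(axis=1, keepdims=True)`) divides each row by the number of true samples in that class, so the diagonal holds the per-class recall. A minimal sketch on made-up labels, also comparing against the built-in `normalize='true'` option:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 6 negatives, 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# Manual row normalization: each row sums to 1
cm_norm = cm / cm.sum(axis=1, keepdims=True)

# Built-in equivalent
cm_builtin = confusion_matrix(y_true, y_hat, normalize='true')

# Diagonal of the row-normalized matrix equals recall per class
per_class_recall = recall_score(y_true, y_hat, average=None)

print(cm)
print(cm_norm)
print(per_class_recall)
```

Reading the normalized matrix this way makes it easy to see how many good wines the model misses (bottom-left cell) independently of how many wines of each class are in the test set.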