# Wine Classification with Supervised Learning

This project builds and evaluates supervised learning models to classify wines into three classes (cultivars) based on their chemical properties.

We follow a complete **ML workflow**:

1. Load and explore the data  
2. Perform exploratory data analysis (EDA)  
3. Engineer features and handle correlations  
4. Train and tune multiple models using cross-validation  
5. Select a final model and evaluate it on a held-out test set  
6. Compare with additional baseline models and discuss results


## 1. Problem and Dataset

We use the classic **Wine** dataset, where each row represents a single wine sample with numeric features such as alcohol, malic acid, magnesium, flavanoids, and more.  
The target variable `target` encodes the wine class (three possible classes).

Our goal is to learn a model that, given the chemical composition of a wine, predicts its class as accurately as possible.

Key characteristics:

- **Task type:** Multiclass classification  
- **Features:** 13 continuous chemical measurements  
- **Target:** `target` ∈ {0, 1, 2} (wine classes)  
- **Metric focus:** F1-macro and accuracy
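
The characteristics listed above can be checked with a short sketch. It assumes that the CSV files used in this notebook were derived from scikit-learn's bundled copy of the Wine data (`load_wine`), which the notebook itself does not confirm.

```python
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled Wine data stands in for the project's CSV files.
wine = load_wine(as_frame=True)
df = wine.frame  # 13 feature columns plus a 'target' column

n_samples, n_columns = df.shape
n_features = n_columns - 1  # everything except 'target'
class_counts = df['target'].value_counts().sort_index()

print(f"{n_samples} samples, {n_features} features")
print(class_counts)
```

The per-class counts give a first impression of class balance before any plotting.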


## 2. Imports and configuration

In [None]:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_validate, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, precision_score, recall_score
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')


## 3. Data loading

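We load separate training and test sets.  
In a real project or in production code, you would typically:

- Store the CSV files in the project repository, or  
- Download them from a known URL.

Here we assume the CSVs are available on disk. The relative paths below follow the layout of a Google Drive mount (as in Colab), so adjust them to your environment.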
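
If the CSV files are not at hand, a stand-in pair can be produced with a stratified split. This is only a sketch: the 80/20 ratio, the random seed, and the temporary output directory are hypothetical, and how the original `wine_train.csv` and `wine_test.csv` were actually created is not known.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Wine data stands in for the original source.
df = load_wine(as_frame=True).frame

# Hypothetical 80/20 split, stratified so each class keeps its proportion.
train_part, test_part = train_test_split(
    df, test_size=0.2, stratify=df['target'], random_state=42
)

# Write to a temporary directory (hypothetical location) and read back.
out_dir = Path(tempfile.mkdtemp())
train_part.to_csv(out_dir / 'wine_train.csv', index=False)
test_part.to_csv(out_dir / 'wine_test.csv', index=False)

df_train_demo = pd.read_csv(out_dir / 'wine_train.csv')
df_test_demo = pd.read_csv(out_dir / 'wine_test.csv')
print(df_train_demo.shape, df_test_demo.shape)
```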


In [None]:
# Load the training and test sets
df_train = pd.read_csv('drive/MyDrive/DataBasesForMachineLearning/wine_train.csv')
df_test = pd.read_csv('drive/MyDrive/DataBasesForMachineLearning/wine_test.csv')

# Show the first 5 rows of each set
print("TRAIN SET:")
display(df_train.head())

print("\nTEST SET:")
display(df_test.head())

# Column types and non-null counts of the training set
df_train.info()

# Working copies that later cells modify; df_train and df_test stay untouched
df_train_original = df_train.copy()
df_test_original = df_test.copy()

## 4. Exploratory Data Analysis (EDA)

First, we explore the training data to understand:

- Feature correlations  
- Feature distributions and potential outliers  
- Class balance of the target variable
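
As a numeric complement to the plots, potential outliers can be counted with the common 1.5 × IQR rule. This is a minimal sketch: the helper name `iqr_outlier_counts` and the toy frame are illustrative, and the rule is one convention among several rather than the method this notebook relies on.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes('number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data: column 'a' contains one obvious outlier, column 'b' none.
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data, the call would be `iqr_outlier_counts(df_train.drop(columns='target'))`.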


In [None]:
# Pairwise correlations between all features and the target
plt.figure(figsize=(12, 10))
sns.heatmap(df_train.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation matrix')
plt.show()

In [None]:
df_train.hist(bins=15, figsize=(15, 10), color='skyblue', edgecolor='black')
plt.suptitle('Distribution of Features')
plt.show()
# Skewed or long-tailed histograms point to features that may contain outliers.

In [None]:
# Features suspected of containing outliers
suspicious_features = ['malic_acid', 'magnesium', 'nonflavanoid_phenols',
                       'od280/od315_of_diluted_wines', 'proline']

# Draw a box plot per wine class for each suspicious feature
for feature in suspicious_features:
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df_train, x='target', y=feature)
    plt.title(f'{feature} distribution by wine class')
    plt.xlabel('Wine Class (target)')
    plt.ylabel(feature)
    plt.grid(True)
    plt.tight_layout()
    plt.show()


In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df_train, palette='pastel')
plt.title('Distribution of Wine Classes (target)')
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.grid(True, axis='y')
plt.show()
# Check that every class has enough samples (class balance)

## 5. Feature engineering and correlation handling

From the correlation heatmap, we identify highly correlated features.  
In particular, `flavanoids` is strongly correlated with `total_phenols` and also with the target.

To reduce redundancy and potential overfitting, we drop `flavanoids` from the feature set and recompute the correlation matrix.
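
Reading pairs off a heatmap is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below is illustrative: the helper name `correlated_pairs`, the 0.8 threshold, and the toy columns are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: x and y move together, z is only weakly related to both.
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 11],
    'z': [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to the training frame, `correlated_pairs(df_train.drop(columns='target'))` would list the candidates for removal.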


In [None]:
df_train_original.drop(['flavanoids'], axis=1, inplace=True)
plt.figure(figsize=(12, 10))
sns.heatmap(df_train_original.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation after dropping flavanoids')
plt.show()
# We saw a high correlation between flavanoids and total_phenols, so we removed flavanoids because it has a higher correlation with the target.

## 6. Model selection and hyperparameter tuning

We frame model selection as an experiment between:

- **K-Nearest Neighbors (KNN)** with standardized features  
- **Decision Tree** with entropy criterion  

We use `GridSearchCV` with 5-fold cross-validation and optimize for **F1-macro**, the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of its size.
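
To make the metric concrete, the sketch below computes F1-macro on a small set of made-up labels, both as the mean of the per-class scores and through scikit-learn's `average='macro'` option. The label arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# One F1 score per class, in label order 0, 1, 2.
per_class = f1_score(y_true, y_pred, average=None)

# F1-macro is their plain average, independent of class sizes.
macro_by_hand = per_class.mean()
macro = f1_score(y_true, y_pred, average='macro')

print(per_class, macro_by_hand, macro)
```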


In [None]:
# Features and target
X = df_train_original.drop('target', axis=1)
y = df_train_original['target']

# pipeline for each model
pipeline_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

pipeline_tree = Pipeline([
    ('tree', DecisionTreeClassifier(random_state=42, criterion='entropy'))
])

# Hyperparameter grids to search
param_grid_knn = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

param_grid_tree = {
    'tree__max_depth': [3, 5, 7],
    'tree__min_samples_leaf': [1, 3, 5]
}

# Grid search for the best hyperparameters
grid_knn = GridSearchCV(pipeline_knn, param_grid_knn, cv=5, scoring='f1_macro')
grid_tree = GridSearchCV(pipeline_tree, param_grid_tree, cv=5, scoring='f1_macro')

grid_knn.fit(X, y)
grid_tree.fit(X, y)

# Final table and outcomes
comparison_df = pd.DataFrame([
    {
        'Model': 'K-Nearest Neighbors',
        'Best Params': grid_knn.best_params_,
        'F1-macro Score': round(grid_knn.best_score_, 4)
    },
    {
        'Model': 'Decision Tree',
        'Best Params': grid_tree.best_params_,
        'F1-macro Score': round(grid_tree.best_score_, 4)
    }
])
print("\nBest Parameters for KNN:")
print(grid_knn.best_params_)

print("\nBest Parameters for Decision Tree:")
print(grid_tree.best_params_)

print("\nModel Comparison Table:")
print(comparison_df)

best_model_row = comparison_df.loc[comparison_df['F1-macro Score'].idxmax()]
print(f"\nBest Model: {best_model_row['Model']}")
print(f"Best F1-macro Score: {best_model_row['F1-macro Score']}")
print(f"Best Parameters: {best_model_row['Best Params']}")


## 7. Final model training

Based on the cross-validation experiment, we select the best-performing configuration.  
Here, we train a final `KNeighborsClassifier` pipeline on the full training set. The hyperparameters are hardcoded and should match the `grid_knn.best_params_` values printed in the previous step.
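
As an alternative to typing the chosen hyperparameters by hand, the fitted search object can supply the final model directly: with the default `refit=True`, `GridSearchCV` retrains the best configuration on all the data passed to `fit`. The sketch below uses scikit-learn's bundled Wine data as a stand-in for the training CSV, which is an assumption, and the names `search` and `final_model` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Wine data stands in for the project's training set.
X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
param_grid = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan'],
}

# refit=True (the default) retrains the best configuration on all of X_demo.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
search.fit(X_demo, y_demo)

# The refitted pipeline, ready for predict(); no hardcoded values needed.
final_model = search.best_estimator_
print(search.best_params_)
```

This keeps the final model in sync with the search if the grid or the data changes later.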


In [None]:
# Training features and target
X_train = df_train_original.drop('target', axis=1)
y_train = df_train_original['target']

# Build the final pipeline with the chosen hyperparameters
final_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(
        n_neighbors=3,
        weights='uniform',
        metric='manhattan'
    ))
])
final_knn.fit(X_train, y_train)


## 8. Evaluation on held-out test set

We apply the same preprocessing step (dropping `flavanoids`) to the test set, evaluate the final model, and report:

- Classification report (per-class precision, recall, F1)  
- Overall F1-macro  
- Overall accuracy  
- Confusion matrix
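
The confusion matrix and the per-class numbers in the classification report are two views of the same counts. In scikit-learn's convention, rows are true classes and columns are predicted classes, so per-class recall and precision follow from the diagonal. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for three classes (illustrative only).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Recall: correct predictions divided by the number of true samples per class.
recall_from_cm = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class.
precision_from_cm = cm.diagonal() / cm.sum(axis=0)

print(cm)
print(recall_from_cm, precision_from_cm)
```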


In [None]:
# Drop flavanoids from the test set as well, to match the training features
df_test_original.drop(['flavanoids'], axis=1, inplace=True)

# Test features and target
X_test = df_test_original.drop('target', axis=1)
y_test = df_test_original['target']

# Predict on the test samples with the trained model
y_pred = final_knn.predict(X_test)

#results:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print(f"\nF1-macro Score: {round(f1_score(y_test, y_pred, average='macro'), 4)}")
print(f"Accuracy Score: {round(accuracy_score(y_test, y_pred), 4)}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nFirst 5 predictions vs true labels:")
for pred, true in zip(y_pred[:5], y_test.iloc[:5]):
    print(f"Predicted: {pred}, True: {true}")

## 9. Additional baseline models and comparison

To better understand how our KNN model performs, we compare it with two common baselines:

- **Logistic Regression** with feature scaling  
- **Random Forest** (tree-based ensemble)

We train these models on the same training data and evaluate them on the same test set, then summarize the results in a comparison table.
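
It can also help to know the floor that any useful model must beat. The sketch below scores a majority-class predictor (`DummyClassifier`), which is an addition for illustration and not one of the baselines compared in this notebook. The bundled Wine data and the 80/20 stratified split are stand-in assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Assumption: bundled Wine data and a hypothetical 80/20 stratified split.
X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Always predicts the most frequent class seen during fit.
floor_model = DummyClassifier(strategy='most_frequent')
floor_model.fit(X_tr, y_tr)
y_pred_floor = floor_model.predict(X_te)

floor_accuracy = accuracy_score(y_te, y_pred_floor)
# zero_division=0 scores never-predicted classes as 0 without a warning.
floor_f1 = f1_score(y_te, y_pred_floor, average='macro', zero_division=0)

print(f"Floor accuracy: {floor_accuracy:.4f}, floor F1-macro: {floor_f1:.4f}")
```

Because two of the three classes are never predicted, F1-macro for this floor sits well below its accuracy, which illustrates why F1-macro is the stricter of the two metrics here.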


In [None]:
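from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Reuse X_train, y_train, X_test, y_test from previous cells

# Logistic Regression pipeline with scaling.
# One-vs-rest is made explicit with OneVsRestClassifier because the
# multi_class='ovr' argument is deprecated in recent scikit-learn releases.
logreg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', OneVsRestClassifier(LogisticRegression(max_iter=1000, random_state=42)))
])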
logreg_pipeline.fit(X_train, y_train)
y_pred_logreg = logreg_pipeline.predict(X_test)

# Random Forest (no scaling needed)
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42
)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Collect metrics
def compute_metrics(y_true, y_pred, model_name):
    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1-macro": f1_score(y_true, y_pred, average='macro'),
        "Precision-macro": precision_score(y_true, y_pred, average='macro'),
        "Recall-macro": recall_score(y_true, y_pred, average='macro')
    }

results = []
results.append(compute_metrics(y_test, y_pred, "KNN (final model)"))
results.append(compute_metrics(y_test, y_pred_logreg, "Logistic Regression"))
results.append(compute_metrics(y_test, y_pred_rf, "Random Forest"))

results_df = pd.DataFrame(results)
print("Test set performance comparison:")
display(results_df.sort_values(by="F1-macro", ascending=False))



## 10. Conclusions and future work

In this project we implemented a complete supervised learning workflow for wine classification:

- Performed EDA to understand feature distributions, correlations, and class balance  
- Reduced redundancy by removing a highly correlated feature  
- Tuned KNN and Decision Tree models with cross-validation  
- Trained a final KNN model and evaluated it on a held-out test set  
- Compared the final model with Logistic Regression and Random Forest baselines

Possible next steps:

- Try more advanced models such as Gradient Boosting or XGBoost  
- Perform more systematic hyperparameter tuning (wider search space)  
- Apply dimensionality reduction (e.g. PCA) and compare results  
- Explore model-agnostic explainability tools (e.g. SHAP) to better understand feature impact
