<a href="https://colab.research.google.com/github/Demosthene-OR/Student-AI-and-Data-Management/blob/main/Wine Classification Competition/Wine Classification Starter Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://prof.totalenergies.com/wp-content/uploads/2024/09/TotalEnergies_TPA_picto_DegradeRouge_RVB-1024x1024.png" height="150" width="150">

<hr style="border-width:2px;border-color:#75DFC1">
<center><h1>🍷 Wine Quality Classification Starter Guide</h1></center>
<hr style="border-width:2px;border-color:#75DFC1">


## 🎯 Objective
You are a Data Scientist at a winery. Your task is to **predict the quality of wine** (an integer score, for example from 3 to 8) based on its physicochemical properties.  
You will experiment with **multiple classification techniques** and compare their performance.

---
This notebook is a **starter guide** for the Wine Quality Classification competition.  
It provides the basic structure you can build upon:
1. Load and explore the dataset  
2. Preprocess and split the data  
3. Train a simple model  
4. Generate a submission file  

## 📥 1. Import Libraries
In this section, we import the necessary Python libraries for:
- Data manipulation and visualization (`pandas`, `numpy`, `matplotlib`, `seaborn`)  
- Machine learning models and preprocessing (`sklearn`)  

In [2]:
# 📚 Libraries

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from scipy.stats import mode

import warnings
warnings.filterwarnings("ignore")



## 📊 2. Load and Explore the Dataset
- Load the training data using `pd.read_csv`.  
- Use `head()` and `info()` to preview the data.  
- Visualize the target distribution and feature correlations to understand patterns.  



In [9]:
# === a. Load the training data ===
url = "https://raw.githubusercontent.com/Demosthene-OR/Student-AI-and-Data-Management/main/Wine%20Classification%20Competition/"
# url = "/kaggle/input/wine-itba2025/"

df = pd.read_csv(url+"train.csv", sep=',', index_col='id')  # Replace with your dataset



## 🔎 3. Exploratory Data Analysis
Here we quickly analyze:
- The distribution of the target variable `quality`.  
- Correlations between features using a heatmap.  

This helps to identify important variables and relationships.  
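
The two checks above could be sketched as follows. This is a minimal illustration, not the reference solution: `df_demo` and its `feature_*` columns are hypothetical stand-in data generated with `make_classification`, so that the cell runs without the competition files. Replace `df_demo` with the `df` loaded in section 2.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

# ASSUMPTION: synthetic stand-in for the competition data (not the real wine dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
df_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(6)])
df_demo["quality"] = y_demo + 5  # fake quality scores 5, 6 and 7

# Number of wines per quality score, ordered by score
quality_counts = df_demo["quality"].value_counts().sort_index()
print(quality_counts)

# Pairwise Pearson correlations, target included
corr = df_demo.corr()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x="quality", data=df_demo, ax=axes[0])
axes[0].set_title("Target distribution")
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[1])
axes[1].set_title("Feature correlations")
plt.tight_layout()
plt.show()
```

On the real data, an unbalanced count plot is a hint that accuracy alone may be misleading, and the `quality` row of the heatmap shows which features move with the target.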


## 🧹 4. Data Preparation
Steps:
- Split the dataset into features (`X`) and target (`y`).  
- Use `train_test_split` to create training and validation sets.  
- Standardize features using `StandardScaler` for better model performance.  
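
A minimal sketch of these three steps, assuming hypothetical stand-in data (`df_demo`, generated with `make_classification`) in place of the real `df`. The 80/20 split, the `stratify` option and `random_state=42` are illustrative choices, not requirements of the competition.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
df_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(6)])
df_demo["quality"] = y_demo + 5  # fake quality scores 5, 6 and 7

# Features and target
X = df_demo.drop(columns="quality")
y = df_demo["quality"]

# Hold out 20% for validation; stratify keeps class proportions similar in both sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the validation set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
```

Fitting the scaler on the training split only keeps validation statistics out of the model. Note that `stratify=y` raises an error if a class has fewer than two samples, which may happen with rare quality scores; drop the option in that case.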


## 🤖 5. Train a First Model
We will:
- Use, for example, **Logistic Regression** as a baseline classifier.  
- Fit the model to the training set.  
- Evaluate it on the validation set using accuracy.  
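
One possible baseline, sketched on hypothetical stand-in data (`X_demo`, `y_demo` from `make_classification`) so that it runs on its own. With the real data, reuse the scaled training and validation arrays from section 4. `max_iter=1000` is an assumption made here to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_demo = y_demo + 5  # fake quality scores 5, 6 and 7

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Baseline classifier
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Accuracy = share of validation wines whose quality is predicted exactly
val_preds = log_reg.predict(X_val_scaled)
val_accuracy = accuracy_score(y_val, val_preds)
print(f"Validation accuracy: {val_accuracy:.3f}")
```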


## 🤖 6. Train a Second Model
We will:
- Use another classifier.  
- Fit the model to the training set.  
- Evaluate it on the validation set using accuracy.  
- Compare its performance with the first classifier.  
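
As an illustration, the comparison could look like the sketch below, which uses a **Random Forest** as the second classifier. Both the choice of model and its settings (`n_estimators=200`) are assumptions, and the data is again a hypothetical stand-in generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_demo = y_demo + 5  # fake quality scores 5, 6 and 7

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Candidate models, evaluated on the same validation split
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    results[name] = accuracy_score(y_val, model.predict(X_val_scaled))
    print(f"{name}: validation accuracy = {results[name]:.3f}")

best_model_name = max(results, key=results.get)
print(f"Best on this split: {best_model_name}")
```

Evaluating every model on the same split keeps the comparison fair. A small gap between two scores on a single split is not strong evidence that one model is better.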


## 📤 7. Generate Submission File
Steps:
- Load the test dataset.  
- Scale the test features with the scaler already fitted on the training data (call `transform`, not `fit_transform`).  
- Predict wine quality for the test data.  
- Create a DataFrame with columns `id` and `quality`.  
- Save predictions as `submission.csv`.  
This file can now be uploaded to the competition for scoring.  
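
The scaling and prediction steps might be sketched as below. Everything here is a hypothetical stand-in: `train_demo` and `test_demo` replace the real `df` and `test_df`, the `id` values are made up, and Logistic Regression stands for whichever model you selected. Saving the file is left to the cell that follows.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the train and test files.
X_all, y_all = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_all = y_all + 5  # fake quality scores 5, 6 and 7
cols = [f"feature_{i}" for i in range(6)]
train_demo = pd.DataFrame(X_all[:240], columns=cols)
y_train_demo = y_all[:240]
test_demo = pd.DataFrame(
    X_all[240:], columns=cols, index=pd.RangeIndex(1000, 1060, name="id")
)

# Fit the scaler on training data, then only transform the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train_demo)
X_test_scaled = scaler.transform(test_demo)

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train_demo)
test_preds = model.predict(X_test_scaled)

# One row per test wine: its id and the predicted quality
submission_demo = pd.DataFrame({"id": test_demo.index, "quality": test_preds})
print(submission_demo.head())
```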


In [11]:
# === Load the test dataset ===
test_df = pd.read_csv(url+"test.csv", sep=',', index_col='id')  

# .....


# Prepare submission (include 'id' as the first column)
submission = pd.DataFrame({
    "id": test_df.index,
    # "quality": test_preds    # replace test_preds with your prediction variable
})

# Save submission file
submission.to_csv("submission.csv", index=False)
# If running from Colab, add the following two lines to download the file
# from google.colab import files
# files.download('submission.csv')  # Change the file name according to the model you want to download

print("✅ Submission file saved as submission.csv")

✅ Submission file saved as submission.csv


## 💡 8. Tips and Next Steps
- Try different models: **Random Forest**, **SVM**, **kNN**, or **Neural Networks**.  
- Experiment with feature engineering or hyperparameter tuning to improve results.  
- Use cross-validation for more robust evaluation.  
- Document your process with Markdown cells to explain your approach.  
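
As a starting point for the cross-validation tip, here is a minimal sketch using `cross_val_score` with a scaling-plus-model pipeline. The data is a hypothetical stand-in generated with `make_classification`, and the 5 folds and Logistic Regression are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_demo = y_demo + 5  # fake quality scores 5, 6 and 7

# The pipeline refits the scaler inside each fold, so no fold sees the others' statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The spread across folds gives a rough idea of how much a single validation score can vary by chance.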
