# Project Title: Red Wine Quality Prediction
## Group Details: 
-  **Team Lead:** Henry Mbata, X00185535, X00185535@mytudublin.ie
-  **Member 2:** Jose Lo Mantilla, X00220311, X00220311@mytudublin.ie
-  **Member 3:** Richard Idowu, X00215256, X00215256@mytudublin.ie

## Project Summary / Proposal
-  **Objective:** Predict red wine quality (0-10) from physicochemical properties.
-  **Dataset:** Red Wine Quality Dataset (Cortez et al., 2009)
-  **Demo:** The user enters a new wine sample (acidity, pH, alcohol, etc.) and the model predicts a quality score (0-10) and a category (Poor/Good/Excellent).

## Dataset Details
-  **Source:** UCI Machine Learning Repository / Kaggle mirror
-  **Rows:** 1599
-  **Columns:** 12 (11 features + 1 output)


# Importing Libraries 
 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # baseline regressor
from sklearn.ensemble import RandomForestRegressor  # ensemble regressor
from sklearn.metrics import r2_score, mean_squared_error  # regression metrics
import joblib

# Data Cleaning and Wrangling
   - Loading dataset
   - Skipping header row and removing all zero rows



In [None]:
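# The Kaggle mirror is comma-separated; the original UCI file uses ";" instead
data = np.genfromtxt("winequality-red.csv", delimiter=",", skip_header=1)  # skip_header=1 drops the column-name row
data = data[~np.all(data == 0, axis=1)]  # Remove all-zero rows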
X = data[:, :-1]
y = data[:, -1]

print("Data shape:", data.shape)
print("Feature shape:", X.shape, "Target shape:", y.shape)

# Feature Selection and Pre-Processing
- Select the 11 physicochemical features as inputs (X).
- Separate features and target: use wine quality as the output (y).
- Split dataset into train, validation and test sets.

In [None]:

#Separate features and target
X = data[:, :-1]  
y = data[:, -1]  

print("Data shape:", data.shape)
print("Feature shape:", X.shape, "Target shape:", y.shape)

#Split into train (60%) and temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

#Split temp into validation (20%) and test (20%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

#Print shapes
print("Train set:", X_train.shape, y_train.shape)
print("Validation set:", X_val.shape, y_val.shape)
print("Test set:", X_test.shape, y_test.shape)

## Models and Model Comparison
### Linear Regression
- Simple linear model to predict wine quality
- Good baseline and good for interpretable models

In [None]:
# TODO: fit a LinearRegression model on X_train, y_train and score it on X_val, y_val
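
The cell above is still a placeholder. A minimal sketch of the baseline, assuming scikit-learn's `LinearRegression`, could look like the following. The helper name `fit_linear_baseline` is hypothetical, and the synthetic arrays are stand-ins so the snippet runs on its own; in the notebook, `X_train`, `y_train`, `X_val` and `y_val` from the split would be passed in instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def fit_linear_baseline(X_train, y_train, X_val, y_val):
    """Fit ordinary least squares and score it on the validation set."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    return model, r2_score(y_val, y_val_pred), mean_squared_error(y_val, y_val_pred)


# Stand-in data (NOT the wine dataset): 11 features and a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 11))
y_demo = X_demo @ rng.normal(size=11) + rng.normal(scale=0.1, size=200)

lr, lr_r2_val, lr_mse_val = fit_linear_baseline(
    X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:]
)
print("Linear Regression - Validation R²:", lr_r2_val, "MSE:", lr_mse_val)
```

The fitted coefficients in `lr.coef_` (one per feature) are what make this model easy to interpret.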

### Random Forest Regressor
- Ensemble of decision trees
- Handles nonlinear relationships and feature interactions
- Usually better performance on numeric structured data

In [None]:
# TODO: fit a RandomForestRegressor on X_train, y_train and keep it as `rf` for the evaluation cell
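
This cell is also a placeholder. One possible sketch, assuming scikit-learn's `RandomForestRegressor` with its default of 100 trees, is shown below. The helper name `fit_random_forest` and the synthetic stand-in data are assumptions made so the snippet is self-contained; in the notebook the model would be fitted on `X_train`, `y_train` and stored as `rf`, the name the evaluation and saving cells expect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def fit_random_forest(X_train, y_train, n_estimators=100, random_state=42):
    """Fit a random forest regressor with a fixed seed for repeatable results."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    return model


# Stand-in data (NOT the wine dataset): 11 features and a nonlinear target
# with a feature interaction, the kind of pattern a linear model cannot fit.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 11))
y_demo = X_demo[:, 0] * X_demo[:, 1] + np.sin(3 * X_demo[:, 2])

rf_demo = fit_random_forest(X_demo[:225], y_demo[:225])
y_val_pred_demo = rf_demo.predict(X_demo[225:])
rf_r2_demo = r2_score(y_demo[225:], y_val_pred_demo)
rf_mse_demo = mean_squared_error(y_demo[225:], y_val_pred_demo)
print("Random Forest - Validation R²:", rf_r2_demo, "MSE:", rf_mse_demo)
```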

# Model Comparison

### Approach
We trained both **Linear Regression** and **Random Forest** on our training data and tested them using **R²** and **Mean Squared Error (MSE)**. This lets us see which model predicts wine quality more accurately and consistently.  

### Observations
Linear Regression is simple and easy to interpret, but it struggles with non-linear relationships between the wine features and quality. Random Forest handles complex patterns better and gives more reliable predictions.

### Outcome
Based on the results, Random Forest clearly performs better than Linear Regression, so we will use it in our final demo to predict wine quality.

In [None]:
# TODO: compare R² and MSE for both models on the validation set
# A bar chart with one bar per model would work well here
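
The bar chart suggested above could be sketched as follows, assuming matplotlib and a small dictionary of already-computed scores. The function name `plot_model_comparison` is hypothetical, and the numbers passed in are placeholders that only make the snippet runnable; they are not results from the wine data.

```python
import matplotlib.pyplot as plt


def plot_model_comparison(scores):
    """Draw one panel for R² and one for MSE, with one bar per model.

    `scores` maps a model name to an (r2, mse) tuple.
    """
    names = list(scores)
    r2_values = [scores[name][0] for name in names]
    mse_values = [scores[name][1] for name in names]

    fig, (ax_r2, ax_mse) = plt.subplots(1, 2, figsize=(8, 3))
    ax_r2.bar(names, r2_values)
    ax_r2.set_title("R² (higher is better)")
    ax_mse.bar(names, mse_values)
    ax_mse.set_title("MSE (lower is better)")
    fig.tight_layout()
    return fig


# Placeholder values only (NOT measured results); replace them with the
# validation R² and MSE computed for each fitted model.
placeholder_scores = {
    "Model A (placeholder)": (0.25, 0.50),
    "Model B (placeholder)": (0.50, 0.25),
}
fig = plot_model_comparison(placeholder_scores)
```

In a notebook the figure renders inline; in a script, `plt.show()` would be needed after the call.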

## Evaluation

To check how well our models predict wine quality, we evaluate them using **two performance measures**:

-  **R² Score**: Shows how much of the variation in wine quality the model can explain. Higher is better.  
-  **Mean Squared Error (MSE)**: The average of the squared differences between the model's predictions and the actual values. Lower is better.
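
To make the two measures concrete, here is a small worked example that computes them by hand in plain Python, using the standard definitions MSE = mean((y − ŷ)²) and R² = 1 − SS_res / SS_tot. The function names and the four sample values are illustrative assumptions; the notebook itself uses `r2_score` and `mean_squared_error` from scikit-learn.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative quality scores, not taken from the dataset.
y_true_demo = [5, 6, 7, 5]
y_pred_demo = [5.5, 6.0, 6.5, 5.0]

print(mean_squared_error_manual(y_true_demo, y_pred_demo))  # 0.125
print(round(r2_score_manual(y_true_demo, y_pred_demo), 3))  # 0.818
```

An R² of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean quality.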

We apply these measures to both Linear Regression and Random Forest models on the validation and test sets. This helps us see which model performs best and which should be used in the demo.


In [None]:
from sklearn.metrics import r2_score, mean_squared_error

# rf must be the fitted Random Forest model from the "Random Forest Regressor" cell
y_val_pred_rf = rf.predict(X_val)
y_test_pred_rf = rf.predict(X_test)

r2_val = r2_score(y_val, y_val_pred_rf)
r2_test = r2_score(y_test, y_test_pred_rf)

mse_val = mean_squared_error(y_val, y_val_pred_rf)
mse_test = mean_squared_error(y_test, y_test_pred_rf)

print("Random Forest - Validation R²:", r2_val, "MSE:", mse_val)
print("Random Forest - Test R²:", r2_test, "MSE:", mse_test)

## Saving the Model for Demo

Save the best model (Random Forest) using **joblib**.  
This allows the model to be loaded later in the demo without training again, ensuring fast and consistent predictions.

In [None]:
# Save the trained Random Forest model
joblib.dump(rf, "random_forest_wine_model.pkl")

# Code to load in demo
# rf_loaded = joblib.load("random_forest_wine_model.pkl")
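
For the demo step described in the proposal, a rough sketch of turning a predicted score into a category might look like this. The function names, the `ConstantModel` stand-in, and the category thresholds (below 5 is Poor, 5 up to 7 is Good, 7 and above is Excellent) are all assumptions, since the notebook does not define them. In the real demo, the model would come from `joblib.load("random_forest_wine_model.pkl")`.

```python
def categorize_quality(score):
    """Map a predicted score to a label; the thresholds are hypothetical."""
    clamped = max(0.0, min(10.0, float(score)))  # keep the score within 0-10
    if clamped < 5:
        return "Poor"
    if clamped < 7:
        return "Good"
    return "Excellent"


def predict_sample(model, features):
    """Predict one wine sample given its 11 feature values in column order."""
    score = float(model.predict([list(features)])[0])
    return round(score, 1), categorize_quality(score)


class ConstantModel:
    """Stand-in with a scikit-learn style predict(); NOT the trained model."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]


# Illustrative feature values in the dataset's column order.
sample = [7.4, 0.70, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]
print(predict_sample(ConstantModel(6.2), sample))  # (6.2, 'Good')
```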