# 🍷 Wine Quality Prediction: Logistic Regression vs Random Forest


This notebook compares two models, **Logistic Regression** and **Random Forest**, on a binary version of wine quality prediction (is a wine rated 7 or higher?) using the UCI Red Wine Quality dataset.

We go through data inspection, visualization, model training, evaluation, and comparison.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

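# Load dataset
# Note: the original UCI file is semicolon-delimited. If df ends up with a
# single column, reload it with pd.read_csv("winequality-red.csv", sep=";").
df = pd.read_csv("winequality-red.csv")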
df.head()

In [None]:
df.describe()

In [None]:
df.info()

## 📊 Feature Distributions

In [None]:
df.hist(bins=15, figsize=(15, 10), color='salmon', edgecolor='black')
plt.suptitle("Feature Distributions", fontsize=16)
plt.tight_layout()
plt.show()

## 🔗 Correlation Heatmap

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()

## 🎯 Target Column Creation

In [None]:
# Binary classification: quality >= 7 is 'good'
df['quality_label'] = (df['quality'] >= 7).astype(int)
df['quality_label'].value_counts(normalize=True)
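
Thresholding the score at 7 is unlikely to give two equally sized classes, so any accuracy reported later is best read against a majority-class baseline. The sketch below shows the idea on a small made-up series; `toy_quality` and its values are hypothetical stand-ins for `df['quality']`, not real data.

```python
import pandas as pd

# Hypothetical toy scores standing in for df['quality']; not the real dataset.
toy_quality = pd.Series([5, 6, 5, 7, 6, 5, 8, 6, 5, 6])

# Same rule as the notebook: quality >= 7 is labeled 1 ('good').
toy_label = (toy_quality >= 7).astype(int)

# Share of each class, and the accuracy of always predicting the majority class.
class_share = toy_label.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

A model is only informative if it beats this baseline, so the same two lines can be run on `df['quality_label']` before training.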

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(['quality', 'quality_label'], axis=1)
y = df['quality_label']
# stratify=y keeps the share of 'good' wines the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
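
With an imbalanced label, a purely random split can leave the test set with a different share of 'good' wines than the training set. The following is a minimal sketch of how the `stratify` argument of `train_test_split` handles this, using made-up arrays (`X_toy`, `y_toy`) rather than the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 100 samples.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate, train: {y_tr.mean():.2f}")
print(f"Positive rate, test:  {y_te.mean():.2f}")
```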

## 🤖 Model 1: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred_lr)

print(f"Logistic Regression Accuracy: {acc_lr:.2%}")
print(classification_report(y_test, y_pred_lr))
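
Logistic regression with the default L2 penalty is sensitive to feature scale, and unscaled inputs can also slow solver convergence. One common option, sketched here on synthetic data, is to wrap the model in a pipeline with `StandardScaler`. The data from `make_classification` and the names `X_demo`, `y_demo`, and `scaled_logreg` are assumptions for illustration, not part of the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (11 columns, minority positive class).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused on the test split.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

print(f"Test accuracy (scaled pipeline): {scaled_logreg.score(X_te, y_te):.2%}")
```

Swapping `X_tr`, `y_tr` for `X_train`, `y_train` would apply the same pipeline to the wine data.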

## 🌲 Model 2: Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)

print(f"Random Forest Accuracy: {acc_rf:.2%}")
print(classification_report(y_test, y_pred_rf))
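
A fitted random forest also exposes impurity-based feature importances, which can hint at which inputs drive its predictions. This is a rough sketch on synthetic data; the column names (`feature_0`, ...) are placeholders and say nothing about the actual wine features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are placeholders, not wine features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

On the notebook's own model, the equivalent would be `pd.Series(rf.feature_importances_, index=X.columns)`.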

## 🔍 Confusion Matrices

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(confusion_matrix(y_test, y_pred_lr), annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title("Logistic Regression")
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("Actual")

sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title("Random Forest")
axes[1].set_xlabel("Predicted")
axes[1].set_ylabel("Actual")

plt.tight_layout()
plt.show()
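
The four cells of a binary confusion matrix map directly onto precision and recall for the 'good' class. As a small worked example, with hypothetical labels and predictions (`y_true`, `y_hat`) chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of wines predicted 'good', the share that were good
recall = tp / (tp + fn)     # of truly good wines, the share that were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Rows of the matrix are actual classes and columns are predicted classes, matching the axis labels used in the heatmaps.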

## 📊 Accuracy Comparison

In [None]:
models = ['Logistic Regression', 'Random Forest']
accuracies = [acc_lr * 100, acc_rf * 100]

plt.figure(figsize=(6,4))
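# plt.bar avoids the seaborn >= 0.13 deprecation warning for palette without hue
plt.bar(models, accuracies, color=sns.color_palette('Set2', n_colors=len(models)))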
plt.ylabel("Accuracy (%)")
plt.title("Model Accuracy Comparison")
plt.ylim(0, 100)
plt.show()
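
A single train/test split yields one accuracy number per model, which may not be enough to separate them reliably. The sketch below compares the two model types with stratified 5-fold cross-validation and the F1 score of the positive class. It runs on synthetic data from `make_classification`; the names `candidates` and `cv_scores` are hypothetical helpers introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the wine data (11 features, minority positive class).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same folds are used for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    cv_scores[name] = scores
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing `X` and `y` from the notebook instead of the synthetic arrays would give fold-wise scores for the wine data.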

## ✅ Conclusion


- Logistic Regression achieved an accuracy of approximately **X%**.
- Random Forest performed better with an accuracy of **Y%**.
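- This comparison illustrates how an ensemble method like Random Forest can outperform a simpler linear model on this task. If the 'good' class is a small minority (check the `value_counts` output above), compare the per-class precision and recall in the classification reports as well, not accuracy alone.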

You can use this notebook as a project demonstrating both a **data science workflow** and **model evaluation**.
