# Wine Quality Prediction: A Machine Learning Case Study

## 1. Business Problem
A winery wants to predict the quality of their red wine based on its physicochemical properties. This allows them to understand which factors contribute most to quality, potentially informing the winemaking process and pricing strategies. Our task is to build a classification model to distinguish between 'good' and 'poor' quality wines.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Set plot style
sns.set_style('whitegrid')

## 2. Data Loading and Initial Exploration

In [None]:
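# Note: the UCI copy of this file is semicolon-delimited; if everything lands in one column, pass sep=';'
df = pd.read_csv('winequality-red.csv')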
df.head()

In [None]:
df.info()

The output of `df.info()` shows no missing values in any column, so no imputation is needed. The `quality` column is our target variable.
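The completeness check can also be made explicit rather than read off `df.info()`. The sketch below is illustrative only: `toy` is a hypothetical stand-in for `df`, with one missing value and one duplicated row planted on purpose so that both checks have something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine DataFrame (not the real data)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 9.8],
    "pH": [3.51, 3.20, 3.26, 3.20],
    "quality": [5, 5, 6, 5],
})

# Count of NaN values in each column
missing_per_column = toy.isna().sum()

# Count of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On the real data, the same two lines applied to `df` would confirm the "no missing values" statement and also reveal whether any rows are repeated.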

## 3. Exploratory Data Analysis (EDA)
Let's visualize the data to find patterns.

In [None]:
# Distribution of wine quality scores
plt.figure(figsize=(10, 6))
sns.countplot(x='quality', data=df, palette='viridis')
plt.title('Distribution of Red Wine Quality Scores')
plt.show()

In [None]:
# Correlation Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Wine Properties')
plt.show()

**Key Insight:** `alcohol`, `sulphates`, and `citric acid` are positively correlated with quality, while `volatile acidity` shows the most pronounced negative correlation.
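Reading individual cells off a heatmap is error-prone, so it may help to rank the correlations with the target directly. This is a minimal sketch on a hypothetical `toy` frame with made-up values; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the ranking step
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.90, 0.70, 0.65, 0.50, 0.40, 0.30],
    "quality": [4, 5, 5, 6, 7, 7],
})

# Pearson correlation of every feature with the target, most negative first
corr_with_quality = toy.corr()["quality"].drop("quality").sort_values()
print(corr_with_quality)
```

Sorting makes the strongest negative and positive relationships appear at the two ends of the series.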

## 4. Data Preprocessing & Feature Engineering

The 'quality' score ranges from 3 to 8. To make this a classification problem, we will binarize the target variable: wines with a quality score of 7 or higher will be classified as 'good' (1), and the rest as 'poor' (0).
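A threshold of 7 is likely to leave the 'good' class in the minority, so it is worth checking the class balance right after binarizing. The sketch below uses a hypothetical `quality` series rather than the real column.

```python
import pandas as pd

# Hypothetical quality scores standing in for df['quality']
quality = pd.Series([3, 5, 5, 6, 6, 6, 7, 8], name="quality")

# 1 = 'good' (score >= 7), 0 = 'poor'
quality_category = (quality >= 7).astype(int)

# Share of each class, indexed by class label
class_share = quality_category.value_counts(normalize=True).sort_index()
print(class_share)
```

If the positive share is small, accuracy alone becomes misleading, which is one reason to read the per-class precision and recall in the classification reports later on.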

In [None]:
df['quality_category'] = (df['quality'] >= 7).astype(int)
X = df.drop(['quality', 'quality_category'], axis=1)
y = df['quality_category']

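# Scaling is deferred until after the train/test split so that the scaler
# is fit on the training data only and no test-set statistics leak into training.

## 5. Model Implementation & Comparison
We will split the data and train three different models: Logistic Regression, SVM, and Random Forest.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training split, then apply the same transform to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)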
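A single split gives one estimate of performance. One way to get a more stable estimate while keeping the scaler confined to the training folds is to wrap scaling and the model in a scikit-learn pipeline and cross-validate it. This is a sketch under assumptions: `X_demo` and `y_demo` are synthetic, imbalanced stand-in data from `make_classification`, not the wine dataset, and the choice of five folds and F1 scoring is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 features, roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# The scaler is re-fit inside each training fold, so no fold sees test statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print("mean F1:", scores.mean().round(3))
```

The same pattern would apply to the notebook's unscaled `X` and `y`, with any of the three classifiers in place of `LogisticRegression`.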

### Model 1: Logistic Regression

In [None]:
log_model = LogisticRegression(random_state=42)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)
print("Logistic Regression Report:\n", classification_report(y_test, y_pred_log))

### Model 2: Support Vector Machine (SVM)

In [None]:
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
print("SVM Report:\n", classification_report(y_test, y_pred_svm))

### Model 3: Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=200)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Report:\n", classification_report(y_test, y_pred_rf))
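To compare the three models side by side, it can help to collect one headline metric and a confusion matrix per model instead of scanning three separate reports. The sketch below again uses synthetic stand-in data (`X_demo`, `y_demo`); the `results` dictionary and the choice of positive-class F1 as the headline metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale-sensitive models get a scaler; tree ensembles do not need one
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[name] = {
        # F1 for the positive ('good') class
        "f1_good": f1_score(y_te, y_hat, zero_division=0),
        # Rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(y_te, y_hat, labels=[0, 1]),
    }

for name, res in results.items():
    print(name, "F1 (class 1):", round(res["f1_good"], 3))
    print(res["confusion"])
```

The bottom row of each confusion matrix shows how many truly 'good' samples were missed versus found, which is the comparison that matters most for the minority class.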

## 6. Feature Importance
The Random Forest model allows us to see which features were most influential in the prediction.

In [None]:
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances, y=feature_importances.index, palette='mako')
plt.title('Feature Importance for Wine Quality')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()
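The `feature_importances_` attribute is impurity-based and computed on the training data. As a cross-check, permutation importance on held-out data measures how much the test score drops when a feature's values are shuffled. The sketch below shows the technique on synthetic stand-in data; the feature names are placeholders, and `n_repeats=10` is an arbitrary illustrative choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and record the mean drop in accuracy
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=feature_names).sort_values(ascending=False)

print(perm_importances)
```

If the two rankings broadly agree, that adds some confidence to the list of key drivers; large disagreements would be worth investigating before drawing conclusions.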

## 7. Conclusion & Actionable Insights

- **Best Model:** The **Random Forest Classifier** provided the best overall performance, particularly in identifying 'good' quality wines (class 1).
- **Key Drivers of Quality:** The most important physicochemical properties for predicting wine quality are **alcohol content**, **sulphates**, **volatile acidity**, and **total sulfur dioxide**.
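- **Business Application:** A winery can use this insight to prioritize monitoring and controlling these specific variables during production. Because feature importance reflects predictive association rather than a proven causal effect, process changes should be validated on trial batches before they inform quality targets or pricing.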