# **Wine Quality Prediction**
This notebook trains several regression models to predict wine quality from chemical properties. We evaluate each model on a held-out split, then save every fitted model, along with the scaler, for future use.

## Importing Required Libraries

In [5]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pickle
import os


## Creating a Directory for Saved Models
We ensure that a directory exists to store our trained models and the scaler.

In [6]:

os.makedirs('models', exist_ok=True)


## Loading and Exploring the Dataset
We load the dataset and display basic information such as shape, column names, summary statistics, and missing values to understand the data.

In [7]:

# Load the wine quality dataset
print("Loading dataset...")
data = pd.read_csv('data/train.csv')

# Display basic information
print("\nDataset Information:")
print(f"Shape: {data.shape}")
print("\nColumns:", data.columns.tolist())
print("\nSummary Statistics:")
print(data.describe())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())


Loading dataset...

Dataset Information:
Shape: (15000, 13)

Columns: ['id', 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

Summary Statistics:
                 id  fixed acidity  volatile acidity   citric acid  \
count  15000.000000   15000.000000      15000.000000  15000.000000   
mean    7499.500000       8.150753          0.504877      0.232211   
std     4330.271354       1.420983          0.135287      0.176862   
min        0.000000       4.000000          0.180000      0.000000   
25%     3749.750000       7.200000          0.400000      0.050000   
50%     7499.500000       7.800000          0.500000      0.240000   
75%    11249.250000       8.900000          0.600000      0.380000   
max    14999.000000      37.000000          1.340000      0.760000   

       residual sugar     chlorides  free sulfur dioxide  \
count    15000.000000  15000.

## Splitting Features and Target Variable
We drop the `id` column, separate the features (`X`) from the target variable (`quality`), and hold out 20% of the rows as a test set. A fixed `random_state` keeps the split reproducible.

In [8]:

X = data.drop(columns=['id', 'quality'])
y = data['quality']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
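
As a quick sanity check on the split, the sketch below runs the same `train_test_split` call on synthetic stand-in data with the same row and feature counts as the training file (15000 rows, 11 features after dropping `id` and `quality`). The names `X_demo`, `y_demo`, and the random values are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(15000, 11)))
y_demo = pd.Series(rng.integers(3, 9, size=15000))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
split_shapes = (X_tr.shape, X_te.shape)
print(split_shapes)

# Repeating the call with the same random_state selects the same rows.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = X_tr.index.equals(X_tr_again.index)
print(same_split)
```

With `test_size=0.2`, 3000 of the 15000 rows go to the test set and 12000 remain for training.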


## Standardizing the Features
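We standardize the features with `StandardScaler`, fitting it on the training set only and reusing those statistics for the test set so that no test information leaks into training. Scaling mainly matters for models that are sensitive to feature magnitudes, such as the regularized linear model; the tree-based models are largely unaffected by it.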

In [9]:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler for future use
with open('models/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
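
To make the scaling step concrete, here is a minimal sketch on synthetic data (the arrays `train_demo` and `test_demo` are assumptions, not the wine features). It checks that `StandardScaler` applies `(x - mean) / std` using the training-set statistics, and that a pickled scaler reproduces the same transform after being restored.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (made-up values).
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
test_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

demo_scaler = StandardScaler().fit(train_demo)

# Manual standardization with the training mean and (population) std.
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
matches_manual = np.allclose(demo_scaler.transform(test_demo), manual)

# Round trip through pickle, in memory here instead of a file on disk.
restored = pickle.loads(pickle.dumps(demo_scaler))
matches_restored = np.allclose(
    restored.transform(test_demo), demo_scaler.transform(test_demo)
)

train_scaled = demo_scaler.transform(train_demo)
print(matches_manual, matches_restored)
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1; the scaled test columns generally do not, because they are transformed with the training statistics.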


## Training Regression Models
We train multiple regression models to predict wine quality and evaluate their performance using Mean Squared Error (MSE) and R² score.

### Define Models

In [10]:

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}


### Train and Evaluate Models

In [11]:

model_results = {}
trained_models = {}
for name, model in models.items():
    print(f'Training {name}...')
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    model_results[name] = {'MSE': mse, 'R2': r2}
    trained_models[name] = model

    # Build the filename first: reusing the same quote type inside an f-string
    # is a syntax error before Python 3.12.
    filename = name.lower().replace(' ', '_')
    with open(f'models/{filename}.pkl', 'wb') as f:
        pickle.dump(model, f)

print('\nModel Performance Comparison:')
for name, metrics in model_results.items():
    print(f"{name}: MSE = {metrics['MSE']:.4f}, R2 = {metrics['R2']:.4f}")

# Select the model with the lowest test-set MSE
best_model = min(model_results.items(), key=lambda x: x[1]['MSE'])
print(f"\nBest Model: {best_model[0]} (MSE = {best_model[1]['MSE']:.4f}, R2 = {best_model[1]['R2']:.4f})")


Training Linear Regression...
Training Ridge Regression...
Training Random Forest...
Training Gradient Boosting...

Model Performance Comparison:
Linear Regression: MSE = 0.5350, R2 = 0.1696
Ridge Regression: MSE = 0.5350, R2 = 0.1696
Random Forest: MSE = 0.5380, R2 = 0.1649
Gradient Boosting: MSE = 0.5286, R2 = 0.1795

Best Model: Gradient Boosting (MSE = 0.5286, R2 = 0.1795)
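
To help read these numbers, the sketch below computes MSE, RMSE, and R² directly from their definitions, which are the quantities that `mean_squared_error` and `r2_score` report. The arrays are made-up values chosen for easy arithmetic, not wine data, and the `demo_` names are illustrative only.

```python
import numpy as np

# Made-up values for illustration only.
y_true_demo = np.array([5.0, 6.0, 7.0, 5.0])
y_pred_demo = np.array([5.5, 6.0, 6.0, 5.0])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

demo_mse = ss_res / len(y_true_demo)
demo_rmse = np.sqrt(demo_mse)
demo_r2 = 1.0 - ss_res / ss_tot

# Always predicting the mean of y_true scores R2 = 0 by construction.
baseline = np.full_like(y_true_demo, y_true_demo.mean())
baseline_r2 = 1.0 - np.sum((y_true_demo - baseline) ** 2) / ss_tot

print(f"MSE = {demo_mse:.4f}, RMSE = {demo_rmse:.4f}, R2 = {demo_r2:.4f}")
print(f"Baseline R2 = {baseline_r2:.4f}")
```

Applied to the results above, the best MSE of 0.5286 corresponds to an RMSE of about 0.73 quality points, and an R² of 0.1795 means the model explains roughly 18% of the variance in the test-set quality scores.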


## Loading Test Data & Making Predictions
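We load the test dataset from `data/test.csv`, apply the scaler fitted on the training data, and predict with every trained model. The submitted `quality` value is the mean of the four models' predictions, rounded to three decimals and written to `submission.csv`.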

In [12]:

data_eval = pd.read_csv('data/test.csv')
X_eval = data_eval.drop(columns=['id'])
X_eval_scaled = scaler.transform(X_eval)

predictions_df = pd.DataFrame({'id': data_eval['id']})

for model_name, model in trained_models.items():
    predictions_df[model_name] = model.predict(X_eval_scaled)

predictions_df['quality'] = predictions_df.iloc[:, 1:].mean(axis=1).round(3)
predictions_df[['id', 'quality']].to_csv('submission.csv', index=False)
print('\nSubmission file generated successfully.')


Submission file generated successfully.
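
Since the scaler and models were pickled for future use, a later session could restore them along the lines of the sketch below. To stay self-contained it writes stand-in artifacts to a temporary directory and fits a small `Ridge` model on synthetic data; the helper `load_pickle` and all `demo_` names are hypothetical. In the notebook itself, the files would be `models/scaler.pkl` and, for example, `models/gradient_boosting.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def load_pickle(path):
    """Load one pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Synthetic stand-in data (made-up values, not the wine dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 5.0

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)

    # Save a fitted scaler and model using the notebook's naming scheme.
    demo_scaler = StandardScaler().fit(X_demo)
    demo_model = Ridge(alpha=1.0).fit(demo_scaler.transform(X_demo), y_demo)
    for filename, obj in [("scaler.pkl", demo_scaler),
                          ("ridge_regression.pkl", demo_model)]:
        with open(model_dir / filename, "wb") as f:
            pickle.dump(obj, f)

    # Later: restore both artifacts and predict on unseen rows.
    loaded_scaler = load_pickle(model_dir / "scaler.pkl")
    loaded_model = load_pickle(model_dir / "ridge_regression.pkl")
    new_rows = rng.normal(size=(5, 3))
    reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
    original_pred = demo_model.predict(demo_scaler.transform(new_rows))

print(np.allclose(reloaded_pred, original_pred))
```

The scaler must be loaded together with the model, because the model was trained on standardized features. Pickle files are also tied to the library version that produced them, so it is safest to load them with the same scikit-learn version used for training.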
