# Assignment 4 - Wine Quality Prediction

<b>Objective:</b> Improve the ml-wines-tf-start.ipynb so that the accuracy is above 90%. Create a new notebook and begin from the data import part. You may use any preprocessing, optimizer, neural network structure you prefer. Provide detailed markdown.

<b>Approach</b>

We are applying both classification and regression techniques to understand whether predicting discrete categories or continuous scores leads to better accuracy and more reliable predictions for wine quality.

- Classification Approach:
    First, we will treat wine quality as a categorical variable, dividing it into distinct quality categories (e.g., 3 to 8). This allows us to predict specific categories of wine quality and evaluate how well the model classifies wines into these predefined groups.

- Regression Approach:
    In parallel, we will treat wine quality as a continuous variable, approaching it as a regression problem. This will allow the model to predict a more precise quality score, instead of limiting it to fixed categories.

At the end of both approaches, we will compare the accuracy, loss, and general performance to determine which method is more effective for predicting wine quality, providing insight into which approach better captures the complexity of the dataset.

In this cell, a variety of essential libraries are imported for data analysis and machine learning tasks:
- `numpy` and `pandas` for numerical operations and data manipulation.
- `plotly` and `matplotlib` for visualizations, with `seaborn` for more advanced statistical plots.
- `scipy.stats` for statistical analysis.
- `sklearn` modules such as `train_test_split` and various preprocessing methods for preparing data.
- TensorFlow and Keras for building neural networks, and `keras_tuner` for hyperparameter optimization.

## Classification Approach

In [1]:
import numpy as np
import pandas as pd
import math

import plotly
import plotly.express as px
plotly.offline.init_notebook_mode(connected=True)

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as st

pd.set_option("display.max_columns", None)

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import cohen_kappa_score, accuracy_score

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import keras_tuner as kt



### Loading the Dataset

The red wine quality dataset (`winequality-red.csv`) is loaded into a DataFrame and the first few rows are displayed to inspect its structure. It contains eleven chemical properties, such as acidity, sulfur dioxide levels, and alcohol, plus the target variable `quality`.

In [2]:
import pandas as pd

wine_data = pd.read_csv('data/winequality-red.csv')
wine_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
pip install plotly

Note: you may need to restart the kernel to use updated packages.


Here, the dataset is split into features (X) and the target variable (y). The target is the quality column, and the features consist of the other variables.

In [4]:
X = wine_data.drop(columns=['quality'])
y = wine_data['quality']


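This cell flags a row as an outlier when any of seven features exceeds a hand-picked upper threshold (for example, volatile acidity above 1.4 or alcohol above 14). The flagged rows are concatenated into one DataFrame, and rows that were flagged by more than one threshold are deduplicated.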

In [5]:
outliers = pd.DataFrame()

# Concatenate the outliers based on different thresholds for each column
outliers = pd.concat([outliers, wine_data.loc[wine_data['volatile acidity'] > 1.4]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['citric acid'] > 0.9]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['chlorides'] > 0.5]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['free sulfur dioxide'] > 60]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['total sulfur dioxide'] > 200]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['sulphates'] > 1.75]], axis=0)
outliers = pd.concat([outliers, wine_data.loc[wine_data['alcohol'] > 14]], axis=0)

# Remove duplicates from the outliers DataFrame
outliers = outliers[~outliers.duplicated()]

# Display the outliers
outliers

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1299,7.6,1.58,0.0,2.1,0.137,5.0,9.0,0.99476,3.5,0.4,10.9,3
151,9.2,0.52,1.0,3.4,0.61,32.0,69.0,0.9996,2.74,2.0,9.4,4
258,7.7,0.41,0.76,1.8,0.611,8.0,45.0,0.9968,3.06,1.26,9.4,5
396,6.6,0.735,0.02,7.9,0.122,68.0,124.0,0.9994,3.47,0.53,9.9,5
1244,5.9,0.29,0.25,13.4,0.067,72.0,160.0,0.99721,3.33,0.54,10.3,6
1558,6.9,0.63,0.33,6.7,0.235,66.0,115.0,0.99787,3.22,0.56,9.5,5
1079,7.9,0.3,0.68,8.3,0.05,37.5,278.0,0.99316,3.01,0.51,12.3,7
1081,7.9,0.3,0.68,8.3,0.05,37.5,289.0,0.99316,3.01,0.51,12.3,7
86,8.6,0.49,0.28,1.9,0.11,20.0,136.0,0.9972,2.93,1.95,9.9,6
92,8.6,0.49,0.29,2.0,0.11,19.0,133.0,0.9972,2.93,1.98,9.8,5


In this cell, we drop the indices of the outliers from the independent features as well as the target feature.

In [6]:
X = X.drop(outliers.index)
y = y.drop(outliers.index)
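The threshold filter and the drop step can also be expressed as a single boolean mask. The following is a minimal sketch, not the notebook's code: the helper `outlier_mask` and the three-row `toy` frame are hypothetical and only illustrate the idea, while the threshold values are the ones used in the cell above.

```python
import pandas as pd

# Upper thresholds copied from the outlier cell above
THRESHOLDS = {
    "volatile acidity": 1.4,
    "citric acid": 0.9,
    "chlorides": 0.5,
    "free sulfur dioxide": 60,
    "total sulfur dioxide": 200,
    "sulphates": 1.75,
    "alcohol": 14,
}

def outlier_mask(df: pd.DataFrame, thresholds: dict) -> pd.Series:
    """Return True for rows that exceed at least one column threshold."""
    mask = pd.Series(False, index=df.index)
    for column, upper in thresholds.items():
        mask |= df[column] > upper
    return mask

# Tiny illustrative frame (hypothetical, not the real dataset)
toy = pd.DataFrame({
    "volatile acidity": [0.70, 1.58, 0.41],
    "citric acid": [0.00, 0.00, 0.76],
    "chlorides": [0.076, 0.137, 0.611],
    "free sulfur dioxide": [11.0, 5.0, 8.0],
    "total sulfur dioxide": [34.0, 9.0, 45.0],
    "sulphates": [0.56, 0.40, 1.26],
    "alcohol": [9.4, 10.9, 9.4],
    "quality": [5, 3, 5],
})

mask = outlier_mask(toy, THRESHOLDS)
toy_clean = toy.loc[~mask]   # keep only rows below every threshold
print(mask.tolist())         # [False, True, True]
print(len(toy_clean))        # 1
```

Because each row appears at most once in the mask, no separate deduplication step is needed with this formulation.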

Because this is the classification approach, the target is one-hot encoded: each quality value becomes its own 0/1 column, and the resulting matrix is used as the training target.

In [7]:
y_clean = pd.get_dummies(y)

In [8]:
y_clean.shape

(1588, 6)
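As a small illustration of what the encoding does and how to reverse it, the sketch below one-hot encodes a handful of made-up labels and then decodes them with `argmax`. The series `y_toy` is hypothetical and stands in for the real `quality` column.

```python
import numpy as np
import pandas as pd

# Toy labels standing in for the quality column (not the real data)
y_toy = pd.Series([5, 6, 3, 8, 5, 7, 4])

# One column per distinct label, in ascending order
y_onehot = pd.get_dummies(y_toy)
print(y_onehot.columns.tolist())   # [3, 4, 5, 6, 7, 8]
print(y_onehot.shape)              # (7, 6)

# Decode: the position of the 1 in each row maps back to a column label
class_labels = y_onehot.columns.to_numpy()
decoded = class_labels[np.argmax(y_onehot.to_numpy(), axis=1)]
print(decoded.tolist())            # [5, 6, 3, 8, 5, 7, 4]
```

The same `argmax` decoding applies to softmax outputs, which is how predicted probabilities are turned back into quality labels.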

Feature Scaling:
The features in X are standardized using StandardScaler to ensure they have a mean of 0 and a standard deviation of 1, which improves the performance of many machine learning algorithms.

Data Splitting:
The dataset is split into training (80%) and testing (20%) sets using train_test_split, ensuring the model can be trained and later evaluated on unseen data.

Reproducibility:
The random_state is set to 42 to ensure that the data split is consistent and reproducible across different runs.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Scaling the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_clean, test_size=0.2, random_state=42)
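Note that the cell above fits the scaler on all rows before splitting, so the test rows contribute to the mean and standard deviation. A common variant, shown here only as a sketch on synthetic data, splits first and fits the scaler on the training rows alone. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(200, 11))  # synthetic features
y_toy = rng.integers(3, 9, size=200)                     # synthetic labels 3..8

# Split first, then fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.shape, X_te_scaled.shape)  # (160, 11) (40, 11)
```

With this ordering the test set stays fully unseen until evaluation; whether it changes the results noticeably on this dataset is not something the notebook measures.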


In [10]:
EPOCHS = 500
BATCH_SIZE = 32

METRICS = [
    keras.metrics.MeanSquaredError(),
]
metrics=METRICS

normalize = layers.Normalization(axis=-1)

Model Architecture:
The model consists of a Gaussian noise layer (standard deviation 0.05), two hidden ReLU layers (156 and 256 units), each followed by dropout (rate 0.3) for regularization, and a 6-unit softmax output layer for multi-class classification.

Optimization:
Adam optimizer with a learning rate of 0.001 is used, along with the CategoricalCrossentropy loss function for classification tasks.

Metrics:
The model tracks performance using CategoricalAccuracy, which measures how often predictions match the true labels.

Training:
The model is trained for 500 epochs with a batch size of 32. The test set is passed as `validation_data`, so validation metrics are reported on it after every epoch.

In [11]:
import tensorflow as tf
from tensorflow.keras import layers

# Parameters
EPOCHS = 500
BATCH_SIZE = 32
LEARNING_RATE = 0.001

# Metrics
METRICS = [tf.keras.metrics.CategoricalAccuracy()]

# Model definition for categorical output
def get_compiled_model():
    model = tf.keras.Sequential([
        layers.GaussianNoise(0.05),  # Gaussian noise (stddev 0.05), active only during training
        layers.Dense(156, activation='relu'),  # First hidden layer: 156 units
        layers.Dropout(0.3),  # Dropout with rate 0.3
        layers.Dense(256, activation = 'relu'),
        layers.Dropout(0.3),
        layers.Dense(6, activation='softmax')  # Output layer with 6 nodes for 6 categories
    ])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),  # Adam optimizer with learning rate 0.001
        loss=tf.keras.losses.CategoricalCrossentropy(),  # Loss function for categorical classification
        metrics=METRICS  # Categorical Accuracy metric
    )
    return model

# Initialize and compile the model
model = get_compiled_model()

# Display model summary
model.summary()

# Train the model, using the test set as validation data
history = model.fit(
    X_train, y_train,  # Training data (one-hot encoded)
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_test, y_test),  # Testing data (one-hot encoded)
    verbose=1
)



Epoch 1/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - categorical_accuracy: 0.4406 - loss: 1.4035 - val_categorical_accuracy: 0.5535 - val_loss: 0.9346
Epoch 2/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - categorical_accuracy: 0.6127 - loss: 1.0187 - val_categorical_accuracy: 0.5755 - val_loss: 0.9014
Epoch 3/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - categorical_accuracy: 0.5744 - loss: 1.0345 - val_categorical_accuracy: 0.5755 - val_loss: 0.8893
Epoch 4/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - categorical_accuracy: 0.6326 - loss: 0.9399 - val_categorical_accuracy: 0.6164 - val_loss: 0.8747
Epoch 5/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - categorical_accuracy: 0.6408 - loss: 0.9123 - val_categorical_accuracy: 0.6101 - val_loss: 0.8712
Epoch 6/500
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s

The model is evaluated on both the test and training sets using the evaluate() function, which returns the loss and accuracy.

In [12]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
train_loss, train_accuracy = model.evaluate(X_train, y_train)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')
print(f'Train Loss: {train_loss}, Train Accuracy: {train_accuracy}')

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 724us/step - categorical_accuracy: 0.6161 - loss: 1.7676
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 866us/step - categorical_accuracy: 0.9955 - loss: 0.0482
Test Loss: 1.420691967010498, Test Accuracy: 0.6666666865348816
Train Loss: 0.05206191912293434, Train Accuracy: 0.9921259880065918


The model reached a training accuracy of 99.2% but a test accuracy of only 66.7%. A gap this large, together with a test loss (1.42) far above the training loss (0.05), indicates that the model overfits: it has largely memorized the training data and does not generalize well to unseen wines.
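To see where the classifier goes wrong, and not only how often, the one-hot targets and softmax outputs can be decoded and passed to a confusion matrix. This is a sketch under assumptions: `y_true_onehot` and `y_prob` are hypothetical stand-ins for `y_test` and `model.predict(X_test)`, and `labels` assumes the one-hot columns are ordered 3 to 8.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = np.array([3, 4, 5, 6, 7, 8])  # assumed class order of the one-hot columns

# Hypothetical stand-ins for y_test (one-hot) and model.predict(X_test)
y_true_onehot = np.array([
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
])
y_prob = np.array([
    [0.05, 0.05, 0.60, 0.20, 0.05, 0.05],
    [0.02, 0.03, 0.55, 0.30, 0.05, 0.05],
    [0.01, 0.04, 0.15, 0.70, 0.05, 0.05],
    [0.01, 0.01, 0.08, 0.30, 0.55, 0.05],
])

# Decode both to quality labels via argmax
y_true = labels[y_true_onehot.argmax(axis=1)]
y_pred = labels[y_prob.argmax(axis=1)]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
print(acc)  # 0.75
print(cm)
```

In the toy data the single error is a quality-6 wine predicted as 5, which shows up as a 1 in row 6, column 5 of the matrix.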

## Regression Approach


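The `winequality-red.csv` dataset is loaded again into a fresh pandas DataFrame, `wines`, so the regression approach starts from all 1599 rows (the outlier removal above is not applied here). `wines.info()` summarizes the columns, their data types, and the non-null counts.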

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam

# Load dataset
wines = pd.read_csv('data/winequality-red.csv',header=0)
wines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [14]:
wines.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5



Dropping any rows with missing values. The `info()` output above shows 1599 non-null entries in every column, so this step leaves the data unchanged.

In [15]:
wines = wines.dropna(axis=0)

Creating the training (70%) and test (30%) splits

In [16]:
df_train = wines.sample(frac=0.7, random_state=0)
df_test = wines.drop(df_train.index)
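The `sample`/`drop` pair above yields two disjoint partitions. A minimal sketch on a made-up ten-row frame (the `toy` DataFrame is hypothetical) shows the resulting sizes and that no row lands in both parts:

```python
import pandas as pd

# Made-up frame with a default RangeIndex, standing in for `wines`
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 11.2, 12.0, 9.0, 10.0],
    "quality": [5, 5, 5, 6, 5, 6, 7, 7, 4, 5],
})

train = toy.sample(frac=0.7, random_state=0)  # 70% of the rows, shuffled
test = toy.drop(train.index)                   # the remaining 30%

print(len(train), len(test))                        # 7 3
print(train.index.intersection(test.index).empty)   # True
```

This relies on the index being unique, which holds for the default `RangeIndex` produced by `read_csv`.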

Computes the per-column minimum and maximum on the training split only. These statistics are used in the next step for min-max normalization, which rescales values into the range 0 to 1 and helps neural networks train effectively.

In [17]:
min_val = df_train.min(axis=0)
max_val = df_train.max(axis=0)

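### Feature scaling
Min-max normalization is applied to every column using the training-split minimum and maximum (`min_val`, `max_val`). Because `quality` is still a column of `df_train` and `df_test` at this point, the target is rescaled together with the features. Test values outside the training range can fall slightly outside [0, 1].

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
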

In [18]:
# Scaled data
df_train = (df_train - min_val) / (max_val - min_val)
df_test = (df_test - min_val) / (max_val - min_val)
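A small worked example may make the scaling and its inverse concrete. The two frames below are made up for illustration (they are not rows from the dataset), and `quality_back` is a hypothetical name for the un-scaled target.

```python
import pandas as pd

# Made-up train and test frames, not the real data
train = pd.DataFrame({"alcohol": [9.0, 10.0, 12.0], "quality": [3, 5, 8]})
test = pd.DataFrame({"alcohol": [11.0, 13.0], "quality": [6, 7]})

# Statistics come from the training frame only
min_val = train.min(axis=0)
max_val = train.max(axis=0)

train_scaled = (train - min_val) / (max_val - min_val)
test_scaled = (test - min_val) / (max_val - min_val)

print(train_scaled["quality"].tolist())  # [0.0, 0.4, 1.0]
print(test_scaled["alcohol"].tolist())   # the second value is above 1

# Undo the scaling for the target column
q_range = max_val["quality"] - min_val["quality"]
quality_back = test_scaled["quality"] * q_range + min_val["quality"]
```

The second test wine has more alcohol (13.0) than any training wine, so its scaled value exceeds 1. The inverse step is what converts a prediction or an error on the scaled target back into quality points.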


Separates the feature columns (input data) from the target column (quality), creating training and testing datasets for both features and targets.

In [19]:
x_train = df_train.drop('quality', axis=1)
x_test = df_test.drop('quality', axis=1)
y_train = df_train['quality']
y_test = df_test['quality']

### Neural Network
A sequential neural network is built with four hidden layers of 512 units each, all using ReLU activation. The output layer has 6 units and no activation function, so it produces unbounded real values as needed for regression. Since the target is a single scaled quality value, one output unit would be the conventional choice; the 6-unit layer is the one that produced the results reported here.

In [20]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Building the Deep Neural Network model
model = keras.Sequential([
    layers.Dense(512, input_shape=[11], activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(6)  # Output layer: 6 linear units (no activation); one unit is the usual choice for a single regression target
])


Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.



In [21]:

# Summary of the model
model.summary()

Sets up early stopping to prevent overfitting. The callback monitors the validation loss (its default) and stops training if the loss fails to improve by at least 0.001 for 20 consecutive epochs (the patience). The best weights are restored afterward.

In [22]:
# Early stopping to prevent overfitting
early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,  # Minimum amount of change to count as improvement
    patience=20,  # Number of epochs to wait for improvement
    restore_best_weights=True  # Restore the best weights after stopping
)
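The interplay of `patience` and `min_delta` can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not Keras's implementation: the function `early_stopping_epoch` and the loss values in `toy_val_losses` are hypothetical.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the stopping rule never triggers.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement of more than min_delta: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses; 0.3495 improves on 0.35 by less than min_delta
toy_val_losses = [0.50, 0.40, 0.35, 0.36, 0.3495, 0.37, 0.38]
result = early_stopping_epoch(toy_val_losses, patience=3, min_delta=0.001)
print(result)  # (6, 3)
```

Training stops at epoch 6, and the weights from epoch 3 would be the ones restored when `restore_best_weights=True`. With `patience=20` and at most 50 epochs, as configured above, stopping can only trigger if the loss stalls within the first 30 epochs.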

In [23]:
# Compiling the model
model.compile(
    optimizer='adam',  # Adam optimizer
    loss='mae'  # Mean Absolute Error for regression-type problems
)

The model is trained for up to 50 epochs on the training data (`x_train`, `y_train`), with the test data (`x_test`, `y_test`) used for validation. The batch size is 256, and early stopping halts training if the validation loss stops improving. Training progress is logged with `verbose=1`; the reported loss is the MAE on the min-max-scaled target.

In [24]:

# Training the model
history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    batch_size=256,
    epochs=50,
    callbacks=[early_stopping],  # Using early stopping as a callback
    verbose=1  # Turning on the training log
)

Epoch 1/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 34ms/step - loss: 0.4140 - val_loss: 0.1889
Epoch 2/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - loss: 0.1591 - val_loss: 0.1567
Epoch 3/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - loss: 0.1391 - val_loss: 0.1317
Epoch 4/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - loss: 0.1267 - val_loss: 0.1256
Epoch 5/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - loss: 0.1199 - val_loss: 0.1085
Epoch 6/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.1097 - val_loss: 0.1051
Epoch 7/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.1101 - val_loss: 0.1114
Epoch 8/50
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - loss: 0.1090 - val_loss: 0.1024
Epoch 9/50
5/5 ━━━━━━━━━━━━━━━━━━━━

Creating a history dataframe to track progress

In [25]:
history_df = pd.DataFrame(history.history)
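Once the history is in a DataFrame, the best epoch can be read off directly. The sketch below rebuilds a small history frame by hand from the first eight epochs of the training log shown earlier (the later epochs are not available in the log excerpt), so it illustrates the lookup and is not the full result.

```python
import pandas as pd

# Loss values copied from the first eight epochs of the training log
history_df = pd.DataFrame({
    "loss":     [0.4140, 0.1591, 0.1391, 0.1267, 0.1199, 0.1097, 0.1101, 0.1090],
    "val_loss": [0.1889, 0.1567, 0.1317, 0.1256, 0.1085, 0.1051, 0.1114, 0.1024],
})
history_df.index = history_df.index + 1  # 1-based epoch numbers
history_df.index.name = "epoch"

# Epoch with the lowest validation loss among those listed
best_epoch = history_df["val_loss"].idxmin()
best_val_loss = history_df.loc[best_epoch, "val_loss"]
print(best_epoch)      # 8
print(best_val_loss)   # 0.1024
```

Calling `history_df.plot()` on the real frame gives the usual training and validation loss curves.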

Checking the training loss of the model

In [26]:
train_loss = model.evaluate(x_train, y_train, verbose=0)
print(f"Train Loss: {train_loss}")

Train Loss: 0.08607341349124908


Checking the testing loss

In [27]:
test_loss = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Loss: {test_loss}")

Test Loss: 0.0939880907535553


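Deriving train and test accuracy-style scores from the loss as `100 - (loss * 100)`. Because the target was min-max scaled, the MAE is a fraction of the training quality range, so this score expresses the average error as a percentage of that range and is not a count of exactly correct predictions.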

In [28]:
train_accuracy = 100 - (train_loss * 100)
test_accuracy = 100 - (test_loss * 100)

In [29]:
# Printing Train and Test Accuracy
print(f"Train Accuracy: {train_accuracy:.2f}%")
print(f"Test Accuracy: {test_accuracy:.2f}%")

Train Accuracy: 91.39%
Test Accuracy: 90.60%


Under this loss-based measure, the model scores 91.39% on the training data and 90.60% on the test data. The small gap between the two indicates that the model generalizes well to unseen data and is not overfitting. This measure is not the same quantity as the categorical accuracy reported for the classifier: it says that predictions are off by about 9.4% of the scaled quality range on average.
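To relate the scaled loss to the original quality scale, the MAE can be multiplied by the quality range, and predictions can be rounded to the nearest integer score to get an exact-match accuracy. The sketch below rests on two assumptions that the notebook does not confirm: that quality in `df_train` spans 3 to 8, and the example arrays `y_true_scaled` and `y_pred_scaled`, which are made up.

```python
import numpy as np

test_mae_scaled = 0.0939880907535553  # test loss reported above
q_min, q_max = 3, 8                   # ASSUMPTION: quality range in df_train

# Average error expressed in quality points
mae_points = test_mae_scaled * (q_max - q_min)
print(round(mae_points, 2))  # 0.47

def to_quality(scaled):
    """Map scaled values back to quality points and round to whole scores."""
    return np.rint(scaled * (q_max - q_min) + q_min).astype(int)

# Hypothetical scaled targets and predictions (not from the real model)
y_true_scaled = np.array([0.40, 0.60, 0.40, 0.80])
y_pred_scaled = np.array([0.45, 0.48, 0.38, 0.83])

y_true_q = to_quality(y_true_scaled)
y_pred_q = to_quality(y_pred_scaled)
exact_match = (y_true_q == y_pred_q).mean()
print(y_true_q.tolist(), y_pred_q.tolist())  # [5, 6, 5, 7] [5, 5, 5, 7]
print(exact_match)                           # 0.75
```

An exact-match figure computed this way on the real predictions would be directly comparable to the categorical accuracy of the classification model.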

## Summary

- Classification Accuracy:
    When we treated wine quality as a categorical variable, we achieved a test accuracy of 66.7% against a training accuracy of 99.2%, which shows that the classifier overfits the training data.

- Regression Accuracy:
    Treating wine quality as a continuous value (regression) gave a test score of 90.60%, computed as 100 minus the scaled MAE in percent, with a similar training score of 91.39%.

- Loss Comparison:
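    Classification had a test loss of 1.42 (categorical cross-entropy), while regression had a test loss of 0.094 (MAE on the scaled target). These are different loss functions on different scales, so the numbers are not directly comparable. The more telling signal is the gap between training and test loss, which is large for classification (0.05 vs. 1.42) and small for regression (0.086 vs. 0.094).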

- Approach Difference:
    Classification simplified the problem into discrete categories, while regression allowed more granularity, which is reflected in the improved accuracy and more precise predictions.

Overall, treating quality as a continuous value in regression generalized better in these experiments, with the caveat that the two accuracy figures are computed differently.