<h1>5. Classifying Wine Quality Using Decision Trees</h1>
<h3><b>Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values if any.</li>
    <li>Standardize features.</li>
    <li>Encode categorical variables if present.</li>
</ul>
<h3><b>Task:</b> Implement a decision tree classifier to classify wine quality (good/bad) and evaluate the model using accuracy and ROC-AUC.</h3>

In [33]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

In [25]:
# Loading the dataset
wine_dataset = pd.read_csv('..\\..\\Datasets\\WineQuality.csv')
print(wine_dataset.shape, '\n')
wine_dataset.head()

(1599, 12) 



Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [26]:
# Printing the basic statistics of the dataset
wine_dataset.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [27]:
# Printing the basic information of the dataset
wine_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [28]:
# Checking for missing values in the dataset
wine_dataset.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

-> Since there are no missing values in the dataset, we can proceed to the next preprocessing step, i.e., <b>Standardizing Features</b>.
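
-> Had the check above reported missing values, one common option would be median imputation. The sketch below shows the idea on a small made-up DataFrame; the `toy` frame and its values are assumptions for illustration and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from WineQuality.csv
toy = pd.DataFrame({
    'alcohol': [9.4, np.nan, 9.8, 10.1],
    'pH': [3.51, 3.20, np.nan, 3.16],
})

# Replace each missing value with the median of its own column
toy_filled = toy.fillna(toy.median())

# Every column should now report zero missing values
print(toy_filled.isnull().sum())
```

The median is less sensitive to outliers than the mean, which matters for skewed columns; dropping the affected rows is another option when only a few are incomplete.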

<h3>2. Standardizing Features</h3>

In [29]:
# All the features are numerical, so first separate the features from the target variable
X = wine_dataset.drop('quality', axis=1)
Y = wine_dataset['quality']

In [30]:
# Applying standardization to all feature columns in one call
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

In [31]:
# Printing basic statistics after scaling the features
X_scaled.describe().round(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-2.14,-2.28,-1.39,-1.16,-1.6,-1.42,-1.23,-3.54,-3.7,-1.94,-1.9
25%,-0.7,-0.77,-0.93,-0.45,-0.37,-0.85,-0.74,-0.61,-0.66,-0.64,-0.87
50%,-0.24,-0.04,-0.06,-0.24,-0.18,-0.18,-0.26,0.0,-0.01,-0.23,-0.21
75%,0.51,0.63,0.77,0.04,0.05,0.49,0.47,0.58,0.58,0.42,0.64
max,4.36,5.88,3.74,9.2,11.13,5.37,7.38,3.68,4.53,7.92,4.2


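-> All the features now have a mean of 0 and a standard deviation of 1, which confirms that the scaling was applied. Note that a decision tree splits on per-feature thresholds, so scaling is not expected to change its splits; the step is kept here because the task asks for it.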
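
-> The cells above fit the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. Below is a minimal sketch on synthetic data; the `X_demo` frame, its column names, and `demo_scaler` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption, not the wine data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(loc=10, scale=3, size=(200, 3)),
                      columns=['f1', 'f2', 'f3'])

# Split first, so the test rows stay unseen during fitting
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(2), X_tr_scaled.std(axis=0).round(2))
```

With this ordering the test features will generally not have a mean of exactly 0, which is expected: they are scaled with statistics they did not influence.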

<h3>3. Encoding Categorical Variables</h3>

In [32]:
# 'quality' is the only variable to encode: it is converted into a binary label, good (1) or bad (0).
# Setting the threshold: a rating greater than 5 is good (1), otherwise bad (0).
Y = Y.apply(lambda x: 1 if x > 5 else 0)

print('Categories of quality:', Y.nunique(), ': ', Y.unique())
print(Y.value_counts())

Categories of quality: 2 :  [0 1]
quality
1    855
0    744
Name: count, dtype: int64


-> There are no other variables to encode, so the data preprocessing is complete.

<h2>Model Training</h2>

In [34]:
# Splitting the dataset into train and test sets in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

In [35]:
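# Implementing the model
# Note: no random_state is set, so the fitted tree and the scores below can vary slightly between runs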
dtc_model = DecisionTreeClassifier()
dtc_model.fit(X_train, Y_train)
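
-> A default `DecisionTreeClassifier` keeps splitting until its leaves are pure, which usually means it memorizes the training set. The sketch below compares such a tree with a depth-limited one on synthetic data generated by `make_classification`. The data, the variable names, and `max_depth=4` are illustrative assumptions, not tuned values for the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the 11 wine features (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=11,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# random_state makes the fitted tree reproducible across runs
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# max_depth=4 is an arbitrary illustrative limit
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, tree in [('full', full_tree), ('shallow', shallow_tree)]:
    print(name,
          '| depth:', tree.get_depth(),
          '| train accuracy:', round(tree.score(X_tr, y_tr), 3),
          '| test accuracy:', round(tree.score(X_te, y_te), 3))
```

A large gap between training and test accuracy for the full tree is a sign of overfitting; whether the shallower tree actually scores better on the test set depends on the data.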

In [37]:
# Predicting the target variable
Y_pred = dtc_model.predict(X_test)
print(Y_pred.shape, '\n')

(320,) 



In [38]:
# Predicting the probability of the positive class (good = 1), which ROC-AUC needs
Y_pred_proba = dtc_model.predict_proba(X_test)[:, 1]
print(Y_pred_proba.shape, '\n')

(320,) 



<h2>Model Evaluation</h2>

<h3>1. Accuracy Score</h3>

In [40]:
# Calculating the accuracy of the model
accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy of the model:', accuracy)

Accuracy of the model: 0.725


<h3>2. ROC-AUC Score</h3>

In [41]:
# Calculating the ROC-AUC score of the model
roc_auc = roc_auc_score(Y_test, Y_pred_proba)
print('ROC-AUC score of the model:', roc_auc)

ROC-AUC score of the model: 0.7225722096755023
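
-> ROC-AUC can be read as the probability that a randomly chosen "good" sample receives a higher score than a randomly chosen "bad" one, with ties counting as half. The sketch below computes it that way in plain Python; `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are made up.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and graded scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))       # 3 of 4 pairs are ordered correctly

# The same labels with hard 0/1 scores, as a fully grown tree tends to produce
hard_scores = [0, 1, 1, 1]
print(pairwise_auc(labels, hard_scores))  # ties now carry half of the weight
```

When the scores are only 0 or 1, the ROC curve has a single operating point and the AUC reduces to the average of the true-positive and true-negative rates. That is one reason an unpruned tree's ROC-AUC often lands close to its accuracy on roughly balanced classes.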


-> The model's accuracy of 0.725 means it correctly classifies wine quality as "good" or "bad" for 72.5% of the 320 test samples. The ROC-AUC score of 0.723 indicates a fair ability to distinguish between the two classes, where 0.5 corresponds to random guessing and 1.0 to perfect separation. This is a reasonable baseline, but there is room for improvement, possibly through hyperparameter tuning (for example, limiting the tree depth), feature engineering, or more complex models.
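
-> One inexpensive way to look for that improvement is to tune the tree's own hyperparameters with cross-validation. The sketch below uses `GridSearchCV` on synthetic data; the data, the variable names, and the grid values are assumptions chosen for illustration, not recommendations for the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the wine features (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=11,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

# Illustrative grid: tree depth and minimum number of samples per leaf
param_grid = {'max_depth': [2, 4, 6, None],
              'min_samples_leaf': [1, 5, 20]}

# 5-fold cross-validation on the training split, scored by ROC-AUC
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Cross-validated ROC-AUC:', round(search.best_score_, 3))
print('Test ROC-AUC:', round(search.score(X_te, y_te), 3))
```

Because the search only sees the training split, the test score remains an honest estimate of how the selected tree performs on unseen data.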

<hr>