# Wine dataset project
In this practice project, I worked with the well-known wine quality dataset from Kaggle. I used this notebook primarily as a testing playground. The cross-validation scores I obtained with the different models are listed at the bottom of this notebook.

Importing Libraries

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score,KFold,train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline

Loading Dataset

In [3]:
# Install dependencies first if not installed
# pip install kagglehub pandas

import kagglehub
import pandas as pd

# Download the dataset
path = kagglehub.dataset_download("yasserh/wine-quality-dataset")

# Load the CSV file (KaggleHub returns the dataset folder path)
df = pd.read_csv(f"{path}/WineQT.csv")  # The file name in that dataset is WineQT.csv


## Preprocessing

In [4]:
df.shape

(1143, 13)

In [5]:
df.isna().value_counts()

fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH     sulphates  alcohol  quality  Id   
False          False             False        False           False      False                False                 False    False  False      False    False    False    1143
Name: count, dtype: int64

No missing values were found: every column is non-null in all 1143 rows. I also checked with Data Wrangler, and `df` looks clean.
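A per-column count is often easier to read than `value_counts()` on the boolean frame. The following is a minimal sketch on a toy frame (the values and the `toy` name are made up for illustration), not the check that was actually run here:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data; values are invented for illustration
toy = pd.DataFrame({
    "pH": [3.29, np.nan, 3.55],
    "alcohol": [11.6, 11.4, 10.9],
})

missing_per_column = toy.isna().sum()        # number of NaN cells in each column
has_missing = bool(toy.isna().any().any())   # True if any cell at all is missing

print(missing_per_column)
print(has_missing)
```

On the real data, `df.isna().sum()` should report zero for every column if the frame is as clean as the output above suggests.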

In [6]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'Id'],
      dtype='object')

In [7]:
df["quality"].value_counts()

quality
5    483
6    462
7    143
4     33
8     16
3      6
Name: count, dtype: int64

Encoding the target classes. The following rule is applied:
- Wine with a quality of **3** or **4** becomes Class **0**
- Wine with a quality of **5** or **6** becomes Class **1**
- Wine with a quality of **7** or **8** becomes Class **2**

In [8]:
def modification(x):
  # Map the raw quality score to one of the three classes described above
  if x in [3,4]:
    return 0
  elif x in [5,6]:
    return 1
  elif x in [7,8]:
    return 2

df["label_quality"] = df["quality"].apply(modification)
df.tail(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id,label_quality
1133,6.7,0.32,0.44,2.4,0.061,24.0,34.0,0.99484,3.29,0.8,11.6,7,1584,2
1134,7.5,0.31,0.41,2.4,0.065,34.0,60.0,0.99492,3.34,0.85,11.4,6,1586,1
1135,5.8,0.61,0.11,1.8,0.066,18.0,28.0,0.99483,3.55,0.66,10.9,6,1587,1
1136,6.3,0.55,0.15,1.8,0.077,26.0,35.0,0.99314,3.32,0.82,11.6,6,1590,1
1137,5.4,0.74,0.09,1.7,0.089,16.0,26.0,0.99402,3.67,0.56,11.6,6,1591,1
1138,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,1592,1
1139,6.8,0.62,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6,1593,1
1140,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,1594,1
1141,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,1595,1
1142,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,1597,1
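The same encoding rule can also be written as a lookup table with `Series.map`. This is a sketch of an alternative, not the method used above; the `QUALITY_TO_CLASS` name and the sample scores are hypothetical:

```python
import pandas as pd

# Hypothetical lookup table mirroring the encoding rule above
QUALITY_TO_CLASS = {3: 0, 4: 0, 5: 1, 6: 1, 7: 2, 8: 2}

quality = pd.Series([3, 5, 6, 8, 4, 7], name="quality")  # made-up sample scores
label_quality = quality.map(QUALITY_TO_CLASS)

# A score outside 3-8 would map to NaN, so it is worth checking for that
unmapped = int(label_quality.isna().sum())

print(label_quality.tolist())  # [0, 1, 1, 2, 0, 2]
print(unmapped)                # 0
```

One advantage of the lookup form is that an unexpected score shows up as `NaN` and can be counted, whereas the function above silently returns `None` for it.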


Train Test Split

In [9]:
X = df.iloc[:,:-1].drop(["quality","Id"],axis=1)
names = X.columns
X = X.to_numpy()
y = df.iloc[:,-1].to_numpy()

x_train,x_test,y_train,y_test = train_test_split(
  X,y,test_size=1/3,stratify=y,random_state=91,shuffle=True)
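To see what `stratify=y` does, the sketch below rebuilds labels with the same class counts as the encoded target (39 / 945 / 159, derived from the `value_counts()` output above) and compares class shares in both splits. The feature column is a placeholder and `class_shares` is a hypothetical helper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts of the encoded target: 39 / 945 / 159
y_demo = np.repeat([0, 1, 2], [39, 945, 159])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, stratify=y_demo, random_state=91, shuffle=True)

def class_shares(labels):
    """Hypothetical helper: fraction of samples in each of the three classes."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()

print(class_shares(y_tr))  # shares in the training split
print(class_shares(y_te))  # shares in the test split
```

With stratification, both splits should show nearly the same class shares as the full label vector.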


Pipeline building

In [None]:
# RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000,random_state=101,class_weight="balanced")
sfs = SequentialFeatureSelector(estimator=model,n_features_to_select="auto",tol=1e-5,direction="backward",cv=3)

preprocessor = Pipeline([
  ("imputation",SimpleImputer(strategy="median")),
  ("scaling",StandardScaler())
])
pipe = Pipeline([
  ("preprocessing",preprocessor),
  ("sfs",sfs),
  ("model",model)
])

In [None]:
# # SVC 
# model = SVC(C=10**5,kernel="rbf",class_weight="balanced",random_state=21)
# sfs = SequentialFeatureSelector(estimator=model,n_features_to_select="auto",tol=1e-5,direction="backward",cv=3)

# preprocessor = Pipeline([
#   ("imputation",SimpleImputer(strategy="median")),
#   ("scaling",StandardScaler())
# ])
# pipe = Pipeline([
#   ("preprocessing",preprocessor),
#   ("sfs",sfs),
#   ("model",model)
# ]) 

In [None]:
# # KNeighborsClassifier
# model = KNeighborsClassifier(n_neighbors=10,metric="minkowski",p=2)
# sfs = SequentialFeatureSelector(estimator=model,n_features_to_select="auto",tol=1e-5,direction="backward",cv=3)

# preprocessor = Pipeline([
#   ("imputation",SimpleImputer(strategy="median")),
#   ("scaling",StandardScaler())
# ])
# pipe = Pipeline([
#   ("preprocessing",preprocessor),
#   ("sfs",sfs),
#   ("model",model)
# ]) 

In [None]:
# # LogisticRegression
# model = LogisticRegression(solver="lbfgs",penalty="l2",C=10**5,class_weight="balanced",random_state=213)
# sfs = SequentialFeatureSelector(estimator=model,n_features_to_select="auto",tol=1e-5,direction="backward",cv=3)

# preprocessor = Pipeline([
#   ("imputation",SimpleImputer(strategy="median")),
#   ("scaling",StandardScaler())
# ])
# pipe = Pipeline([
#   ("preprocessing",preprocessor),
#   ("sfs",sfs),
#   ("model",model)
# ]) 

To try one of these, comment out the active block (currently RandomForestClassifier) and uncomment the one you want.

## Evaluation

In [None]:
kfold = KFold(n_splits=4,shuffle=True,random_state=1001)
score = cross_val_score(pipe,X,y,cv=kfold)
print(score)
print(f"Average = {score.mean()}")  # Output shown below matches the Logistic Regression run

[0.56643357 0.61188811 0.5        0.57192982]
Average = 0.5625628757207705
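The target classes are imbalanced, so plain `KFold` may leave a fold with very few minority samples, and accuracy alone hides how the small classes are handled. The sketch below shows one possible variation using `StratifiedKFold` and `cross_validate` with several metrics. It runs on synthetic data with made-up class weights and a plain logistic regression, so the numbers it prints say nothing about the wine results:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced 3-class problem (weights chosen only for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.05, 0.80, 0.15], random_state=0)

# Stratified folds keep the class proportions roughly equal in every fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1001)
metrics = ["accuracy", "balanced_accuracy", "f1_macro"]

results = cross_validate(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X_demo, y_demo, cv=skf, scoring=metrics)

for name in metrics:
    print(name, results[f"test_{name}"].mean())
```

In the notebook itself, `pipe`, `X`, and `y` would take the place of the synthetic estimator and data.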


### Cross-validation score comparison (4 folds, mean accuracy):

RandomForestClassifier 
=> 0.8696386946386946

SVC (with RBF kernel) 
=> 0.8223990921359342

Logistic Regression (lbfgs solver, l2 penalty, C = 10^5) 
=> 0.5625628757207705

KNeighborsClassifier 
=> 0.8451570359465096
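For context, 945 of the 1143 wines fall into Class 1 (483 + 462 from the `value_counts()` output), so always predicting Class 1 already gives an accuracy of about 0.827. The sketch below reproduces that baseline with `DummyClassifier` on labels rebuilt from those counts. The feature column is a placeholder, since the dummy model ignores it:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

# Labels rebuilt from the class counts of the encoded target: 39 / 945 / 159
y_demo = np.repeat([0, 1, 2], [39, 945, 159])
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the dummy model

kfold = KFold(n_splits=4, shuffle=True, random_state=1001)
baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=kfold)

print(baseline_scores.mean())  # close to 945 / 1143
```

Scores in the table above are easiest to interpret relative to this majority-class baseline rather than relative to zero.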