# Introduction

This notebook builds an AI pipeline to train a model on the Wine Quality dataset from the UC Irvine Machine Learning Repository. The purpose is to use the features defined in the dataset to predict the quality of the wine.

In [1]:
pip install ucimlrepo
pip install scikit-learn

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


# Dataset Overview

First, the data was fetched from the database and split into features and target values. There are 6497 data points with 11 features, and the target is a quality score between 3 and 9.

In [47]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Fetch dataset
wine_quality = fetch_ucirepo(id=186)

# Separate the features and target values
X = wine_quality.data.features
y = wine_quality.data.targets
y = y['quality']

# 11 features with 6497 instances
print(X.shape)
print(y.shape)
print(X.columns)


print(' The first few rows of features:')
print(X.head())
print(' The first few rows of target values:')
print(y.head())

(6497, 11)
(6497,)
Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
 The first few rows of features:
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54

In [48]:
print(f" The max value for quality is: {np.max(y)}")
print(f" The min value for quality is: {np.min(y)}")

 The max value for quality is: 9
 The min value for quality is: 3


In [49]:
# Splitting the dataset into training and testing sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, stratify=y, test_size=0.1, random_state=20)
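
Stratifying on `y` matters here because the quality classes are imbalanced: the lowest and highest scores are rare. The sketch below shows how `stratify` keeps class proportions nearly equal in both splits. It uses synthetic labels as a stand-in for the wine data; the class probabilities and the names `X_demo` and `y_demo` are assumptions for illustration, not the real class counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical stand-in, NOT the wine data)
rng = np.random.default_rng(0)
y_demo = pd.Series(
    rng.choice([4, 5, 6, 7], size=1000, p=[0.05, 0.40, 0.45, 0.10]),
    name="quality",
)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])

# Same split settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.1, random_state=20
)

# Compare the share of each class in the two splits
proportions = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(proportions.round(3))

# Largest difference in class share between train and test
max_gap = (proportions["train"] - proportions["test"]).abs().max()
print(f"Largest gap in class share: {max_gap:.4f}")
```

On the real data, the same check can be run by replacing `y_tr` and `y_te` with `yTrain` and `yTest`.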

# Pipeline

The model was trained using a KNeighbors classifier to predict quality. Preprocessing steps included mean imputation for missing values and min-max normalization of the features. The database confirms that there are no missing values, but I included the imputation step as good practice.

In [50]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    SimpleImputer(strategy = 'mean'),
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors = 3)
)

pipe.fit(XTrain,yTrain)
yPred = pipe.predict(XTest)
report = classification_report(yTest,yPred)
print(report)


              precision    recall  f1-score   support

           3       0.00      0.00      0.00         3
           4       0.13      0.09      0.11        22
           5       0.58      0.66      0.62       214
           6       0.60      0.61      0.60       284
           7       0.59      0.47      0.53       108
           8       0.50      0.26      0.34        19

    accuracy                           0.57       650
   macro avg       0.40      0.35      0.37       650
weighted avg       0.57      0.57      0.57       650



# Discussion

I tried different values of k to see if this would improve the model, but it seems that the KNeighbors algorithm is ineffective at predicting quality without further adjustment, such as SMOTE for oversampling. I will implement SMOTE to see if this helps the algorithm, and will also try random forests as an alternative method to compare and contrast the two models on this dataset.
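
One way to make the comparison of k values and models systematic is cross-validation scored with macro F1, which gives rare classes the same weight as common ones. The sketch below is an illustration on synthetic data from `make_classification`, a stand-in for the wine dataset. The candidate k values, the 5-fold setup, and the `class_weight="balanced"` random forest are assumptions, not settings from this notebook. SMOTE is omitted because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic imbalanced multi-class data (stand-in, NOT the wine features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=20,
)

# Stratified folds keep class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
scores = {}

# KNN with the same preprocessing as the pipeline above, for several k
for k in (1, 3, 5, 11):
    knn = make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        KNeighborsClassifier(n_neighbors=k),
    )
    scores[f"knn_k={k}"] = cross_val_score(
        knn, X_demo, y_demo, cv=cv, scoring="f1_macro"
    ).mean()

# Random forest; class_weight="balanced" reweights classes by frequency
forest = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=20
    ),
)
scores["random_forest"] = cross_val_score(
    forest, X_demo, y_demo, cv=cv, scoring="f1_macro"
).mean()

# Report mean macro F1 per model, best first
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: macro F1 = {score:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `XTrain` and `yTrain`, with `XTest` held out for the final report. Very rare quality scores may trigger a scikit-learn warning when a class has fewer members than there are folds.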