# Wine Quality DataSet Prediction

## Introduction: 

- Judging wine well is a respected skill, and a small number of people make a career of it. The 'quality' of a wine is ultimately a matter of human preference, but those preferences are often influenced by physicochemical and sensory variables (Cortez et al., 2009). We want to see whether we can build a more data-driven approach to classifying wine quality. Similar models have been built before, and they ranked wines very similarly to experts (Petropoulos et al., 2017).

- Our model will answer the question: what quality rating will a wine receive based on its pH and alcohol content?

- The data set we will be using is the ‘Wine Quality Data Set’, hosted by the UCI Machine Learning Repository and created by researchers at the University of Minho in Portugal. We use the file covering red Portuguese ‘Vinho Verde’ wines. Its input variables come from physicochemical tests (acidity, pH, alcohol level, etc.), and its output variable is a quality score from 0 to 10.

- Input variables (based on physicochemical tests):
1. **fixed acidity:** most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
2. **volatile acidity:** the amount of acetic acid in wine
3. **citric acid:** found in small quantities, citric acid can add 'freshness' and flavor to wines
4. **residual sugar:** the amount of sugar remaining after fermentation stops
5. **chlorides:** the amount of salt in the wine
6. **free sulfur dioxide:** the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion
7. **total sulfur dioxide:** the amount of free and bound forms of SO2
8. **density:** the density of wine is close to that of water, depending on the percent alcohol and sugar content
9. **pH:** describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)
10. **sulphates:** a wine additive that can contribute to sulfur dioxide gas (SO2) levels
11. **alcohol:** the percent alcohol content of the wine
- Output variable (based on sensory data):
12. **quality:** a score between 0 and 10



## Preliminary exploratory data analysis:

In [2]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier


In [3]:
# The file is semicolon-separated
wine_quality_data = pd.read_csv("winequality-red.csv", sep=";")
# Replace spaces in column names with underscores so they are easier to reference
wine_quality_data.columns = wine_quality_data.columns.str.replace(" ", "_", regex=False)
wine_quality_data

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [4]:
wine_train, wine_test = train_test_split(
    wine_quality_data, train_size = 0.75, stratify = wine_quality_data["quality"]
)
wine_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1199 entries, 1381 to 604
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         1199 non-null   float64
 1   volatile_acidity      1199 non-null   float64
 2   citric_acid           1199 non-null   float64
 3   residual_sugar        1199 non-null   float64
 4   chlorides             1199 non-null   float64
 5   free_sulfur_dioxide   1199 non-null   float64
 6   total_sulfur_dioxide  1199 non-null   float64
 7   density               1199 non-null   float64
 8   pH                    1199 non-null   float64
 9   sulphates             1199 non-null   float64
 10  alcohol               1199 non-null   float64
 11  quality               1199 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 121.8 KB


In [5]:
wine_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 227 to 131
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         400 non-null    float64
 1   volatile_acidity      400 non-null    float64
 2   citric_acid           400 non-null    float64
 3   residual_sugar        400 non-null    float64
 4   chlorides             400 non-null    float64
 5   free_sulfur_dioxide   400 non-null    float64
 6   total_sulfur_dioxide  400 non-null    float64
 7   density               400 non-null    float64
 8   pH                    400 non-null    float64
 9   sulphates             400 non-null    float64
 10  alcohol               400 non-null    float64
 11  quality               400 non-null    int64  
dtypes: float64(11), int64(1)
memory usage: 40.6 KB
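
The split above is not seeded, so the rows that land in each set change every time the cell runs. The sketch below shows one way to make a stratified 75/25 split repeatable by passing `random_state`. The seed value `123`, the helper name `split_wine`, and the synthetic `demo` frame (1599 rows with an imbalanced label column, not the real measurements) are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 1599 rows and an imbalanced label column.
# These are NOT the real wine measurements.
rng = np.random.default_rng(0)
labels = np.repeat([3, 4, 5, 6, 7, 8], [10, 53, 681, 638, 199, 18])
demo = pd.DataFrame(
    {
        "pH": rng.normal(3.3, 0.15, size=labels.size),
        "alcohol": rng.normal(10.4, 1.0, size=labels.size),
        "quality": labels,
    }
)


def split_wine(data, seed=123):
    """Stratified 75/25 split; `seed` is an arbitrary, assumed value."""
    return train_test_split(
        data, train_size=0.75, stratify=data["quality"], random_state=seed
    )


train_a, test_a = split_wine(demo)
train_b, test_b = split_wine(demo)

print(len(train_a), len(test_a))           # sizes of the two sets
print(train_a.index.equals(train_b.index))  # same seed, same rows
```

With a fixed seed, the summary statistics and any accuracy reported later refer to the same rows on every run, which makes the analysis easier to reproduce.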


In [6]:
# Count how many times each quality score appears in the training set
wine_train['quality'].value_counts()

5    511
6    478
7    149
4     40
8     13
3      8
Name: quality, dtype: int64
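
These counts are heavily imbalanced, which matters when judging a classifier's accuracy later. As a rough reference point, the sketch below computes each class's share of the training set and the accuracy of a trivial model that always predicts the most common score. The counts are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Training-set class counts copied from the value_counts() output above
train_counts = pd.Series({5: 511, 6: 478, 7: 149, 4: 40, 8: 13, 3: 8}, name="quality")

# Share of the training set held by each quality score
class_share = train_counts / train_counts.sum()

# A classifier that always predicts the most common score
majority_class = train_counts.idxmax()
baseline_accuracy = class_share.max()

print(class_share.round(3))
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Scores 5 and 6 together make up about 82% of the training rows, and always guessing 5 would already be right about 43% of the time on this set. Only scores 3 through 8 appear, so a classifier trained on these data cannot predict any other score.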

In [7]:
predictor_means = wine_train[['pH', 'alcohol']].mean()
predictor_means

pH          3.308766
alcohol    10.422241
dtype: float64

In [8]:
# Count missing values per column; every count is 0, so no training rows have missing data
wine_train.isnull().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [96]:
# Mean of every physicochemical variable within each quality score
wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
               'residual_sugar','chlorides','free_sulfur_dioxide',
               'total_sulfur_dioxide','density','pH','sulphates','alcohol']
# add_prefix gives plain string column names such as 'mean_pH'
mean_sum_table = wine_train.groupby('quality')[wine_vars].mean().add_prefix('mean_')
mean_sum_table


quality,mean_fixed_acidity,mean_volatile_acidity,mean_citric_acid,mean_residual_sugar,mean_chlorides,mean_free_sulfur_dioxide,mean_total_sulfur_dioxide,mean_density,mean_pH,mean_sulphates,mean_alcohol
3,8.2375,0.928125,0.155,2.84375,0.127375,7.0,19.0,0.997409,3.41125,0.53625,9.99375
4,7.6825,0.67525,0.1765,2.8275,0.095425,11.05,32.325,0.996468,3.38275,0.5985,10.34375
5,8.134247,0.575665,0.242114,2.523483,0.093708,16.988258,56.617417,0.99706,3.302779,0.623014,9.899804
6,8.387238,0.495973,0.274247,2.479812,0.084073,15.803347,40.715481,0.996615,3.31772,0.678389,10.67371
7,8.822819,0.404228,0.369396,2.830201,0.077067,14.550336,35.697987,0.996082,3.293624,0.741544,11.455034
8,8.784615,0.415385,0.399231,2.761538,0.067769,11.230769,27.923077,0.995446,3.266923,0.778462,12.023077


**Training Data Visualization**

In [23]:
# To see how the data are distributed, we create a distribution plot for each potential predictor variable

wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
               'residual_sugar','chlorides','free_sulfur_dioxide',
               'total_sulfur_dioxide','density','pH','sulphates','alcohol']

var_plots = []
for var in wine_vars:
    var_plot = (
        alt.Chart(wine_train)
        .mark_bar()
        .encode(
            # `var` is a plain Python string, so use str.replace rather than the pandas .str accessor.
            # Binning the x-axis turns the bar chart into a histogram.
            x=alt.X(var, bin=alt.Bin(maxbins=30), title=var.replace('_', ' ') + " value"),
            y=alt.Y("count()", title="count"),
            opacity=alt.value(0.5),
            color=alt.value('purple'),
        )
    )
    var_plots.append(var_plot)

for var_plot in var_plots:
    var_plot.display()

In [160]:
# Let's see how the data are distributed for the two predictor variables we plan to use in the analysis.
# For pH:
pH_plot = (
    alt.Chart(wine_train)
    .mark_bar()
    .encode(
        x=alt.X("pH", bin=alt.Bin(maxbins=30), title="pH value"),
        y=alt.Y("count()", title="count for pH"),
        opacity=alt.value(0.2),
        color=alt.value('red'),
    )
)
# For alcohol:
alcohol_plot = (
    alt.Chart(wine_train)
    .mark_bar()
    .encode(
        x=alt.X("alcohol", bin=alt.Bin(maxbins=30), title="alcohol value"),
        y=alt.Y("count()", title="count for alcohol"),
        opacity=alt.value(0.2),
        color=alt.value('green'),
    )
)


In [22]:
# Show the two histograms side by side; they have different x-axes, so they are concatenated rather than layered
pH_plot | alcohol_plot

## Methods: 
- We will use a K-nearest neighbors (KNN) classifier to predict wine quality from the pH and alcohol columns, since they are two common factors that contribute to wine quality. Although quality is stored as a number in the data set, it is really an ordinal rating (integers from 0 to 10), so we will treat each score as a class and use a classifier rather than a regression. We will choose the best k-value between 2 and 14 using grid search with cross-validation on the training set. We will then use that k-value to build a model on the entire training set and predict on the test set to determine our classifier's accuracy.

- As an intermediate step, we can visualize which k-value gave us the best accuracy during cross-validation with grid search by creating a graph of mean test score vs. k-value. This will allow us to easily determine the best k-value as well as see how the accuracy changes with different k-values.

- We will visualize our results using a confusion matrix to see when and how many times we have predicted the correct label vs. the incorrect label.
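
The plan above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the final analysis: it runs on a small synthetic `demo` frame instead of the real wine data, it standardizes the two predictors before KNN (a step not mentioned in the plan, added here because KNN relies on distances), and the 5-fold setting and the random seeds are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (NOT the real measurements)
rng = np.random.default_rng(0)
n = 600
quality = rng.choice([5, 6, 7], size=n, p=[0.45, 0.40, 0.15])
demo = pd.DataFrame(
    {
        "pH": rng.normal(3.3, 0.15, size=n),
        "alcohol": 10.4 + 0.8 * (quality - 6) + rng.normal(0, 0.8, size=n),
        "quality": quality,
    }
)
demo_train, demo_test = train_test_split(
    demo, train_size=0.75, stratify=demo["quality"], random_state=123
)
X_train, y_train = demo_train[["pH", "alcohol"]], demo_train["quality"]
X_test, y_test = demo_test[["pH", "alcohol"]], demo_test["quality"]

# Assumption: scale the predictors so neither dominates the distance
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try every k from 2 to 14 with 5-fold cross-validation (fold count assumed)
param_grid = {"kneighborsclassifier__n_neighbors": list(range(2, 15))}
knn_grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_grid.fit(X_train, y_train)

# Mean cross-validation accuracy for each k
cv_results = pd.DataFrame(knn_grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
best_k = knn_grid.best_params_["kneighborsclassifier__n_neighbors"]

# GridSearchCV refits the best model on the whole training set by default,
# so it can be scored on the held-out test set directly
test_accuracy = knn_grid.score(X_test, y_test)

# Rows are true labels, columns are predicted labels
labels = sorted(demo["quality"].unique())
cm = confusion_matrix(y_test, knn_grid.predict(X_test), labels=labels)

print(cv_results)
print("best k:", best_k)
print("test accuracy:", round(test_accuracy, 3))
print(pd.DataFrame(cm, index=labels, columns=labels))
```

The `cv_results` table holds the values needed for the mean test score vs. k-value graph, and the labelled confusion matrix shows which quality scores are confused with which.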



## Expected Outcomes and Significance:
- Our expected outcome is a model that predicts the quality of a Portuguese “Vinho Verde” wine as closely to the ratings of wine experts as possible.

- Using a data mining approach to classify wine quality could be highly significant for the wine industry. When new wines are certified, many countries legally require the sensory analysis to be done by human testers. However, every tester has their own unique experience, so their analysis is inherently biased. A model-based classification applies the same criteria to every wine. Some researchers suggest these data-driven approaches could make wine evaluation more efficient; for example, an expert would have to repeat their evaluation only if there is a significant difference between their classification and the model’s (Cortez et al., 2009). Looking to the future, could classification models like ours help new winemakers legitimize their products without the need for expensive evaluations?


## Citations:

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Modelling wine preference by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

Petropoulos, S., Karavas, C. S., Balafoutis, A. T., Paraskevopoulos, I., Kallithraka, S., Kotseridis, Y. (2017). Fuzzy logic tool for wine quality classification. Computers and Electronics in Agriculture, 142(Part B), 552-562. https://doi.org/10.1016/j.compag.2017.11.015
