# Wine Quality Classification

## 1. Overview

### Multinomial Logistic Regression

<strong>Type</strong>: Classification<br>
<strong>UCI Open Source Dataset</strong>: [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality)

This dataset contains red and white vinho verde wine samples, from the north of Portugal, and wine quality data based on physicochemical tests [Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/wine/). 

<strong>Problem</strong>: Imagine you are a wine specialist who is looking for an automated way to categorize the wines you find based on wine quality data from physicochemical tests. You could use a machine learning algorithm to train a model that would be able to predict the quality of a wine based on its physicochemical properties. This would allow you to quickly and easily categorize new wines that you find, without having to manually taste them.

Here are some of the benefits of using an automated wine categorization system:

- <strong>Speed</strong>: An automated system can categorize wines much faster than a human can. This is especially beneficial for wine retailers and distributors who need to quickly categorize large numbers of wines.
- <strong>Accuracy</strong>: An automated system can be more accurate than a human when it comes to categorizing wines. This is because the system is not influenced by personal biases or preferences.
- <strong>Consistency</strong>: An automated system will consistently categorize wines in the same way, which can help to ensure that customers are getting the wines they expect.

If you are a wine specialist who is looking for an efficient and accurate way to categorize wines, then an automated system may be the perfect solution for you.

## 2. Setup

In [1]:
from google.cloud import storage, aiplatform, exceptions
from pyspark.sql import SparkSession

### Download the dataset

In [2]:
TRAIN_DATASET_URL = "http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip"
BUCKET_NAME = "dataproc-workspaces-notebooks-wine-quality"

#### Create a bucket to handle training

In [3]:
# Create the Cloud Storage bucket that will hold the dataset
def create_bucket_class_location(bucket_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    bucket.storage_class = "STANDARD"
    try:
        new_bucket = storage_client.create_bucket(bucket, location="us")
        print(
            "Created bucket {} in {} with storage class {}".format(
                new_bucket.name, new_bucket.location, new_bucket.storage_class
            )
        )
    except exceptions.Conflict:
        # The bucket already exists, so it can be reused as-is.
        print(
            "Bucket {} already created".format(bucket_name)
        )
    # Return the name in both cases; a `return` inside `finally` would also
    # silently swallow any unexpected exception raised above.
    return bucket_name

bucket_name = create_bucket_class_location(BUCKET_NAME)

Bucket dataproc-workspaces-notebooks-wine-quality already created


In [4]:
# Download the raw .zip archive, extract it, and copy the files to the Cloud Storage bucket.
!wget $TRAIN_DATASET_URL
!unzip -o winequality.zip 
!gsutil cp winequality/* gs://$BUCKET_NAME

--2023-06-16 02:06:39--  http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip
Resolving www3.dsi.uminho.pt (www3.dsi.uminho.pt)... 193.136.11.133
Connecting to www3.dsi.uminho.pt (www3.dsi.uminho.pt)|193.136.11.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96005 (94K) [application/x-zip-compressed]
Saving to: ‘winequality.zip.18’


2023-06-16 02:06:40 (396 KB/s) - ‘winequality.zip.18’ saved [96005/96005]

Archive:  winequality.zip
  inflating: winequality/winequality-names.txt  
  inflating: winequality/winequality-names.txt.bak  
  inflating: winequality/winequality-red.csv  
  inflating: winequality/winequality-white.csv  
Copying file://winequality/winequality-names.txt [Content-Type=text/plain]...
Copying file://winequality/winequality-names.txt.bak [Content-Type=application/x-trash]...
Copying file://winequality/winequality-red.csv [Content-Type=text/csv]...       
Copying file://winequality/winequality-white.csv [Content-Type=text/csv]...     

## 3. Exploratory Data Analysis

In [5]:
spark = SparkSession.builder \
    .appName("Multinomial logistic regression Wine Quality") \
    .enableHiveSupport() \
    .getOrCreate()

### Import and parse the training dataset

In [6]:
df = (
    spark.read
    .options(inferSchema="True", delimiter=";", header="True")
    .csv("gs://{}/winequality-white.csv".format(BUCKET_NAME))
)

                                                                                

In [7]:
df.show(3)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|
|          6.3|             0.3|       0.34|           1.6|    0.049|               14.0|               132.0|  0.994| 3.3|     0.49|    9.5|      6|
|          8.1|            0.28|        0.4|           6.9|     0.05|               30.0|                97.0| 0.9951|3.26|     0.44|   10.1|      6|
+-------------+----------------+-----------+--------------+---------+-------------------+-----------

### DataFrame Column Data Types

DataFrames may have heterogeneous or "mixed" data types: some columns are numbers, some are strings, some are dates, and so on. Because CSV files carry no information about the data type of each column, Spark infers the types when loading the data if `inferSchema` is enabled, as it was in the read above. For example, a column that contains only numbers is given a numeric type: integer or double.

Run the next cell to see information on the DataFrame.

In [8]:
df.printSchema()

root
 |-- fixed acidity: double (nullable = true)
 |-- volatile acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- residual sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free sulfur dioxide: double (nullable = true)
 |-- total sulfur dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)



### Summary Statistics 

At this point, all columns contain numerical values. For numerical features, we are often interested in summary statistics such as the count, mean, standard deviation, minimum, and maximum.

In [9]:
df.describe().show()

+-------+------------------+-------------------+-------------------+-----------------+--------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+
|summary|     fixed acidity|   volatile acidity|        citric acid|   residual sugar|           chlorides|free sulfur dioxide|total sulfur dioxide|             density|                 pH|          sulphates|           alcohol|           quality|
+-------+------------------+-------------------+-------------------+-----------------+--------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+
|  count|              4898|               4898|               4898|             4898|                4898|               4898|                4898|                4898|               4898|               4898|              4898|              4898|
|   mean

                                                                                

Let's investigate a bit more of our target data by using the .groupby() function.

In [10]:
from pyspark.sql.functions import col, countDistinct, isnan, sum, when, count
import pyspark.sql.functions as F

In [11]:
df.groupby(
    col('quality')).\
    count().\
    show()

+-------+-----+
|quality|count|
+-------+-----+
|      6| 2198|
|      3|   20|
|      5| 1457|
|      9|    5|
|      4|  163|
|      8|  175|
|      7|  880|
+-------+-----+



We can see here that the data is <b>imbalanced</b> for our target. <b>Imbalanced</b> data is a common problem in machine learning, where the number of samples in one class is much larger than the number of samples in another class. This can make it difficult to train a model that can accurately predict the minority class. There are a number of techniques that can be used to handle imbalanced data, including:

- <b>Resampling</b>: This involves increasing the number of samples in the minority class or decreasing the number of samples in the majority class. This can be done by oversampling the minority class (creating new samples), undersampling the majority class (removing samples), or a combination of both.
- <b>Cost-sensitive learning</b>: This involves assigning different costs to misclassifications of different classes. This can help to focus the model on correctly classifying the minority class.
- <b>Ensemble learning</b>: This involves training multiple models on different subsets of the data and then combining the predictions of the models. This can help to improve the accuracy of the model on the minority class.

We need to <b>resample</b> the data to balance the dataset. However, before we do that, we need to check if there are any issues with the data that need to be resolved. For example, we need to make sure that there are no missing values in the data. We also need to make sure that the data is not corrupted. Once we have resolved any issues with the data, we can then resample it to balance the dataset.
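
As a rough illustration of the cost-sensitive option listed above, the sketch below derives inverse-frequency class weights from the class counts printed earlier. The weighting formula (total / (number of classes × class count)) is one common heuristic and an assumption here, not something this notebook prescribes; the helper name `class_weights` is hypothetical.

```python
import math

# Class counts copied from the groupby output above (quality -> count).
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    # math.fsum is used instead of the built-in sum, which this notebook
    # shadows with pyspark.sql.functions.sum.
    total = math.fsum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = class_weights(quality_counts)
for label in sorted(weights):
    print(f"quality {label}: weight {weights[label]:.2f}")
```

Weights like these could be attached to each row as an extra column and passed to a learner that accepts per-row weights, for example through the `weightCol` parameter of Spark ML's `LogisticRegression`.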

### Let's summarize our data by row, column, features, unique, and missing values.

In [16]:
# A Spark DataFrame has no pandas-style .shape attribute:
# df.count() gives the number of rows and len(df.columns) the number of columns.

print("Rows     : ", df.count())
print("Columns  : ", len(df.columns))
print("\nFeatures : \n", df.columns)
print("\n Count Distinct values : ", "")
# .show() prints the table itself and returns None, so it is not wrapped in print().
expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
print("\nMissing values :  ", "")
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

Rows     :  4898
Columns  :  12

Features : 
 ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

 Count Distinct values :  
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density| pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+
|           68|             125|         87|           310|      160|                132|                 251|    890|103|       79|    103|      7|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------

There are no missing values or other data issues, so we can resample the data.

### Rename a Feature Column 

Some of our column names contain spaces, and capitalization is inconsistent (for example, `pH`). The next cell replaces the spaces with underscores so that the columns are easier to reference; names without spaces are left unchanged.

In [17]:
# Replace spaces with underscores; names without spaces stay as they are.
df = df.toDF(*[c.replace(" ", "_") for c in df.columns])

In [18]:
df.printSchema()

root
 |-- fixed_acidity: double (nullable = true)
 |-- volatile_acidity: double (nullable = true)
 |-- citric_acid: double (nullable = true)
 |-- residual_sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free_sulfur_dioxide: double (nullable = true)
 |-- total_sulfur_dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)



### Resampling

In [19]:
#TODO
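
One possible way to fill in the resampling step is to undersample the larger classes. The sketch below only computes per-class sampling fractions in plain Python; the target of 880 rows per class and the helper name `undersampling_fractions` are assumptions made for illustration. Classes that are already at or below the target keep a fraction of 1.0.

```python
# Class counts copied from the groupby output above (quality -> count).
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def undersampling_fractions(counts, target):
    """Fraction of each class to keep so that large classes shrink to about `target` rows."""
    return {label: min(1.0, target / n) for label, n in counts.items()}

# target=880 is an arbitrary choice (the size of the third-largest class).
fractions = undersampling_fractions(quality_counts, target=880)
for label in sorted(fractions):
    print(f"quality {label}: keep {fractions[label]:.3f}")
```

A dictionary of this shape matches what PySpark's `DataFrame.sampleBy` expects, for example `df.sampleBy("quality", fractions, seed=42)`, where the seed value is arbitrary. Undersampling alone cannot help the rarest classes (only 5 samples have quality 9), so oversampling or class weights may still be needed.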

## 4. Feature engineering

<b>Feature engineering</b> is the process of transforming raw data into features that are more informative and useful for machine learning algorithms. This can involve a variety of tasks, such as:

- <b>Data transformation</b>: This involves transforming the data into a format that is more suitable for machine learning algorithms. For example, categorical data can be encoded as numerical data, and continuous data can be discretized.
- <b>Feature selection</b>: This involves selecting the most important features from the data set. This can be done using a variety of techniques, such as statistical significance tests and feature importance scores.
- <b>Feature creation</b>: This involves creating new features from the existing data. This can be done by combining existing features, or by creating derived features that are based on the relationships between different features.

In [25]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

In [44]:
@F.udf(returnType=T.StringType())
def create_quality_groups(score):
    if score in [1,2,3]:
        return 'poor'
    elif score in [4,5]:
        return 'normal'
    elif score in [6,7,8]:
        return 'good'
    elif score in [9]:
        return 'excellent'
    return 'not defined'

In [45]:
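# create_quality_groups is already a UDF (see the @F.udf decorator above),
# so it must not be wrapped in F.udf again; a plain alias is enough.
quality_group = create_quality_groups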

In [47]:
df_transf = df.withColumn("quality", create_quality_groups("quality"))
df_transf.show(20)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|   good|
|          6.3|             0.3|       0.34|           1.6|    0.049|               14.0|               132.0|  0.994| 3.3|     0.49|    9.5|   good|
|          8.1|            0.28|        0.4|           6.9|     0.05|               30.0|                97.0| 0.9951|3.26|     0.44|   10.1|   good|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|           
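
To see how the grouping changes the class balance, the per-score counts reported earlier can be aggregated in plain Python. This is a sketch that mirrors the UDF's score ranges outside Spark; `quality_group_for` is a hypothetical stand-in for the UDF body.

```python
from collections import Counter

# Class counts copied from the groupby output in the EDA section.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def quality_group_for(score):
    """Plain-Python mirror of the score-to-group ranges used by the UDF."""
    if score in (1, 2, 3):
        return "poor"
    if score in (4, 5):
        return "normal"
    if score in (6, 7, 8):
        return "good"
    if score == 9:
        return "excellent"
    return "not defined"

# Add up the row counts of all scores that fall into the same group.
group_counts = Counter()
for score, n in quality_counts.items():
    group_counts[quality_group_for(score)] += n

print(dict(group_counts))
```

With these counts the groups come out as 20 poor, 1,620 normal, 3,253 good, and 5 excellent, so the grouped target is still heavily imbalanced.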

## 5. Model Choice

<b>Multinomial logistic regression</b> is a type of logistic regression that can be used for multi-class classification problems. In the case of wine quality classification, there are 4 classes (poor, normal, good, and excellent), so multinomial logistic regression is a reasonable choice for modeling this problem.

The physicochemical tests can be used to measure the various properties of wine, such as acidity, alcohol content, and sugar content. These properties can then be used as features in the multinomial logistic regression model.

Here are some of the advantages of using multinomial logistic regression for wine quality classification:

- It is a relatively simple model that is easy to understand and interpret.
- It is a very flexible model that can be used to model a variety of different types of data.
- It is a very efficient model that can be estimated quickly and easily.

Here are some of the disadvantages of using multinomial logistic regression for wine quality classification:

- It may not be as accurate as some other models, such as support vector machines or decision trees.
- It may not be able to capture the nonlinear relationships between the features and the class labels.

In addition to multinomial logistic regression, there are a number of other models that could be used for wine quality classification. Some of these other models include support vector machines, decision trees, and random forests. However, multinomial logistic regression is a good starting point for wine quality classification because it is a simple, flexible, and efficient model.
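
For intuition, multinomial logistic regression assigns each class a linear score and converts the scores to probabilities with the softmax function. The sketch below uses made-up scores for the four quality groups; the numbers are illustrative assumptions, not fitted values.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that add up to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = math.fsum(exps)
    return [e / total for e in exps]

classes = ["poor", "normal", "good", "excellent"]
scores = [-1.0, 0.5, 2.0, -2.5]  # hypothetical linear scores, one per class
probabilities = softmax(scores)

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.3f}")

# The predicted class is the one with the highest probability.
predicted = classes[probabilities.index(max(probabilities))]
print("predicted class:", predicted)
```

Because each score is linear in the features, the decision boundaries between classes are linear as well, which is the limitation noted in the disadvantages above.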


## 6. Model Training

### Split data  

In [20]:
# Split training and test data
training, test = df.randomSplit([0.8, 0.2])

In [21]:
print ("training instances", training.count(), "test instances", test.count())

training instances 3929 test instances 969


In [22]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

training_data=training.rdd.map(lambda x:(Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
training_data.show()

                                                                                

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[3.8,0.31,0.02,11...|    6|
|[4.2,0.17,0.36,1....|    7|
|[4.2,0.215,0.23,5...|    3|
|[4.4,0.32,0.39,4....|    8|
|[4.4,0.46,0.1,2.8...|    6|
|[4.5,0.19,0.21,0....|    5|
|[4.6,0.445,0.0,1....|    5|
|[4.7,0.145,0.29,1...|    6|
|[4.7,0.335,0.14,1...|    5|
|[4.7,0.455,0.18,1...|    7|
|[4.7,0.67,0.09,1....|    5|
|[4.7,0.785,0.0,3....|    6|
|[4.8,0.13,0.32,1....|    7|
|[4.8,0.17,0.28,2....|    7|
|[4.8,0.21,0.21,10...|    7|
|[4.8,0.225,0.38,1...|    6|
|[4.8,0.26,0.23,10...|    7|
|[4.8,0.29,0.23,1....|    6|
|[4.8,0.33,0.0,6.5...|    5|
|[4.8,0.34,0.0,6.5...|    6|
+--------------------+-----+
only showing top 20 rows



### Train phase

In [23]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(regParam=0.3, elasticNetParam=0.1, family="multinomial")

# Fit the model
lrModel = lr.fit(training_data)

# Print the coefficients and intercept for multinomial logistic regression
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# for multiclass, we can inspect metrics on a per-label basis
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

23/06/16 02:07:19 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
23/06/16 02:07:20 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
                                                                                

KeyboardInterrupt: 
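
The per-label metrics requested in the training cell can be read as follows. This sketch computes precision, recall, and F-measure for each label from a small, invented confusion matrix in plain Python; the matrix values and the helper name `per_label_metrics` are assumptions for illustration and are not results from this model.

```python
def per_label_metrics(confusion):
    """confusion[i][j] = number of samples with true label i predicted as label j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_pos = confusion[k][k]
        predicted_k = 0  # column total: everything predicted as k
        actual_k = 0     # row total: everything whose true label is k
        for i in range(n):
            predicted_k += confusion[i][k]
            actual_k += confusion[k][i]
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Invented 3-class example (rows: true label, columns: predicted label).
confusion = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 8],
]
for label, (p, r, f) in enumerate(per_label_metrics(confusion)):
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The F-measure here is the F1 score, which is also what Spark's `fMeasureByLabel()` reports with its default `beta` of 1.0.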

## 7. Model Evaluation

## 8. Prediction