# Predicting Red Wine Quality
## Phase 1: Data Preparation & Visualisation

### Group Name: Group 21

### Name(s) & ID(s) of Group Members: Michael Xie (s3943224), Samuel Lausberg (s3948914)

## Table of Contents
<ul>
    <li>
    <a href='#introduction'>Introduction</a>
    <ul>
        <li>
        <a href='#data_source'>Dataset Source</a>
        </li>
        <li>
        <a href='#data_details'>Dataset Details</a>
        </li>
        <li>
        <a href='#data_features'>Dataset Features</a>
        </li>
        <li>
        <a href='#target_features'>Target Feature</a>
        </li>
    </ul>
    </li>
    <li>
    <a href='#goals'>Goals and Objectives</a>
    </li>
    <li>
    <a href='#data_cleaning'>Data Cleaning and Preprocessing</a>
    </li>
    <li>
    <a href='#data_exploration'>Data Exploration and Visualisation</a>
    </li>
    <li>
    <a href='#summary'>Summary and Conclusions</a>
    </li>
    <li>
    <a href='#references'>References</a>
    </li>
</ul>

## <u>Introduction</u><a id='introduction'></a>


### Dataset Source <a id='data_source'></a>
The Red Wine Quality dataset used in this study was sourced from the UCI Machine Learning Repository (P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 2009). The dataset contains 1,599 red wine samples, each with a quality score between 0 and 10.

### Dataset Details  <a id='data_details'></a>


In [1]:
# libraries

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# read csv
file_name = "Phase1_Group21.csv"
wine_df = pd.read_csv(file_name, sep=';')

In [3]:
# see a sample
wine_df.sample(10)

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
393,8.4,0.665,0.61,2.0,0.112,13.0,95.0,0.997,3.16,0.54,9.1,5
1223,10.5,0.36,0.47,2.2,0.074,9.0,23.0,0.99638,3.23,0.76,12.0,6
814,12.6,0.41,0.54,2.8,0.103,19.0,41.0,0.99939,3.21,0.76,11.3,6
1362,11.6,0.475,0.4,1.4,0.091,6.0,28.0,0.99704,3.07,0.65,10.033333,6
1253,7.9,0.66,0.0,1.4,0.096,6.0,13.0,0.99569,3.43,0.58,9.5,5
931,7.4,0.61,0.01,2.0,0.074,13.0,38.0,0.99748,3.48,0.65,9.8,5
104,7.2,0.49,0.24,2.2,0.07,5.0,36.0,0.996,3.33,0.48,9.4,5
402,12.2,0.48,0.54,2.6,0.085,19.0,64.0,1.0,3.1,0.61,10.5,6
1188,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5
738,9.0,0.46,0.23,2.8,0.092,28.0,104.0,0.9983,3.1,0.56,9.2,5
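
A quick structural check can confirm that the file was parsed into the expected columns and types. The following is a minimal sketch rather than part of the original analysis: it rebuilds three rows from the sample above in the semicolon-separated layout that `pd.read_csv(..., sep=';')` expects, and the name `sample_df` is a placeholder for illustration.

```python
import io
import pandas as pd

# Three rows copied from the sample output above, written in the
# semicolon-separated layout that read_csv(sep=';') is given in this notebook.
raw = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
8.4;0.665;0.61;2.0;0.112;13.0;95.0;0.997;3.16;0.54;9.1;5
10.5;0.36;0.47;2.2;0.074;9.0;23.0;0.99638;3.23;0.76;12.0;6
12.6;0.41;0.54;2.8;0.103;19.0;41.0;0.99939;3.21;0.76;11.3;6
"""
sample_df = pd.read_csv(io.StringIO(raw), sep=';')

# Expect 11 explanatory columns plus the target, all parsed as numbers.
n_rows, n_cols = sample_df.shape
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample_df.dtypes)
print(n_rows, n_cols, all_numeric)
```

If the wrong separator were passed, the whole header would typically collapse into a single column, so a column count other than 12 is an early warning sign.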


### Dataset Features  <a id='data_features'></a>

The features of the red wine dataset are described in the following table.
The units are taken from the research paper, "Modeling wine preferences by data mining from physicochemical properties" (P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis., 2009).
The descriptions were taken from the Kaggle data source.

In [4]:
from tabulate import tabulate

col_desc = [
    ['Name', 'Data Type', 'Units', 'Description'],
    ['Fixed acidity', 'Numeric', 'g(tartaric acid)/dm^3', 'Most acids involved with wine are fixed or\nnonvolatile (do not evaporate readily)'],
    ['Volatile acidity', 'Numeric', 'g(acetic acid)/dm^3', 'The amount of acetic acid in wine,\nwhich at too high of levels can lead to\nan unpleasant, vinegar taste'],
    ['Citric acid', 'Numeric', 'g/dm^3', "Found in small quantities, citric acid can\nadd 'freshness' and flavor to wines"],
    ['Residual sugar', 'Numeric', 'g/dm^3', "The amount of sugar remaining after\nfermentation stops, it's rare to find\nwines with less than 1 gram/liter and\nwines with greater than 45 grams/liter\nare considered sweet"],
    ['Chlorides', 'Numeric', 'g(sodium chloride)/dm^3', 'The amount of salt in the wine'],
    ['Free sulfur dioxide', 'Numeric', 'mg/dm^3', 'The free form of SO2 exists in equilibrium\nbetween molecular SO2 (as a dissolved gas)\nand bisulfite ion; it prevents microbial growth\nand the oxidation of wine'],
    ['Total sulfur dioxide', 'Numeric', 'mg/dm^3', 'Amount of free and bound forms of SO2;\nin low concentrations, SO2 is mostly undetectable\nin wine, but at free SO2 concentrations\nover 50 ppm, SO2 becomes evident\nin the nose and taste of wine'],
    ['Density', 'Numeric', 'g/cm^3', 'The density of wine is close to that of water\ndepending on the percent alcohol and sugar content'],
    ['pH', 'Numeric', 'pH', 'Describes how acidic or basic a wine is\non a scale from 0 (very acidic)\nto 14 (very basic); most wines\nare between 3-4 on the pH scale'],
    ['Sulphates', 'Numeric', 'g(potassium sulphate)/dm^3', 'A wine additive which can contribute\nto sulfur dioxide gas (SO2) levels,\nwhich acts as an antimicrobial and antioxidant'],
    ['Alcohol', 'Numeric', 'vol%', 'The percent alcohol content of the wine'],
    ['Quality', 'Ordinal categorical', 'N/A', 'Score between 0 and 10']
]

print(tabulate(col_desc, headers='firstrow', tablefmt='fancy_grid'))

╒══════════════════════╤═════════════════════╤════════════════════════════╤════════════════════════════════════════════════════╕
│ Name                 │ Data Type           │ Units                      │ Description                                        │
╞══════════════════════╪═════════════════════╪════════════════════════════╪════════════════════════════════════════════════════╡
│ Fixed acidity        │ Numeric             │ g(tartaric acid)/dm^3      │ Most acids involved with wine are fixed or         │
│                      │                     │                            │ nonvolatile (do not evaporate readily)             │
├──────────────────────┼─────────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ Volatile acidity     │ Numeric             │ g(acetic acid)/dm^3        │ The amount of acetic acid in wine,                 │
│                      │                     │                            │ which at too high of levels c

### Target Feature  <a id='target_features'></a>

The target feature for this project will be the red wines' quality score from 0 to 10. This value will be predicted with the 11 explanatory variables using a linear regression model.
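
Written out, a linear model over the 11 explanatory variables takes the generic form below. The symbols are our own notation for illustration and are not taken from the source paper: $\hat{q}$ is the predicted quality score, $x_j$ is the value of the $j$-th explanatory feature, and the coefficients $\beta_0, \dots, \beta_{11}$ are estimated from the data.

```latex
\hat{q} = \beta_0 + \sum_{j=1}^{11} \beta_j x_j
```

Under this form, the magnitude and sign of each $\beta_j$ (on comparably scaled features) indicate how strongly, and in which direction, a feature is associated with the predicted score.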

## <u>Goals and Objectives</u> <a id='goals'></a>

Red wine is an alcoholic beverage enjoyed by many around the world. The red wine industry was worth 182 billion dollars in 2020 and is predicted to reach 278.5 billion dollars by 2028 (Samriddhi Chauhana and Roshan Deshmukh "Wine Red Market"). \
Because this is such a lucrative business, a model that can confidently predict a red wine's quality would be of great value to red wine companies looking to increase sales. Optimising a red wine's quality with the help of an accurate model could increase the number of satisfied consumers buying the product, allowing red wine companies to achieve higher sales. Furthermore, such a model could benefit critical investors who wish to identify the factors that generally enhance the enjoyment of red wine. \
The objective of this project is to use a linear regression model to accurately predict a red wine's quality when only its features are known. Additionally, we want to identify which specific features carry the most weight in determining a red wine's quality. \
The linear regression model works under the assumptions that there is a linear relationship between each of the red wine's features and its quality, and that the observations are independent of one another.
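
As a rough first look at which features move together with quality, the Pearson correlation of each feature with the target can be ranked by absolute value. This is a minimal sketch, not the project's method: it uses only the ten rows printed in the Dataset Details sample (far too few for real conclusions), and the names `sample_df`, `corr_with_quality` and `ranked` are placeholders for illustration.

```python
import io
import pandas as pd

# The ten rows shown in the Dataset Details sample (index column dropped).
# They are used only to make the sketch runnable, not to draw conclusions.
raw = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
8.4,0.665,0.61,2.0,0.112,13.0,95.0,0.997,3.16,0.54,9.1,5
10.5,0.36,0.47,2.2,0.074,9.0,23.0,0.99638,3.23,0.76,12.0,6
12.6,0.41,0.54,2.8,0.103,19.0,41.0,0.99939,3.21,0.76,11.3,6
11.6,0.475,0.4,1.4,0.091,6.0,28.0,0.99704,3.07,0.65,10.033333,6
7.9,0.66,0.0,1.4,0.096,6.0,13.0,0.99569,3.43,0.58,9.5,5
7.4,0.61,0.01,2.0,0.074,13.0,38.0,0.99748,3.48,0.65,9.8,5
7.2,0.49,0.24,2.2,0.07,5.0,36.0,0.996,3.33,0.48,9.4,5
12.2,0.48,0.54,2.6,0.085,19.0,64.0,1.0,3.1,0.61,10.5,6
6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5
9.0,0.46,0.23,2.8,0.092,28.0,104.0,0.9983,3.1,0.56,9.2,5
"""
sample_df = pd.read_csv(io.StringIO(raw))

# Pearson correlation of every explanatory feature with the target.
corr_with_quality = sample_df.corr()["quality"].drop("quality")

# Reorder so the strongest linear associations (positive or negative) come first.
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

Correlation only captures pairwise linear association, so it is a screening aid rather than a substitute for the fitted regression coefficients.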


## <u>Data Cleaning and Preprocessing</u> <a id='data_cleaning'></a>

### Data Cleaning Steps

As shown below, none of the columns in the dataset contain missing values. The dataset was already cleaned and preprocessed before it was made available to the public, so no further cleaning or preprocessing is applied here.

In [25]:
wine_df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
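
A missing-value count is only one of several possible cleaning checks. The sketch below is hedged and illustrative, not part of the original analysis: it runs the same `isnull` check alongside a duplicate-row count and a range check on the target, using a toy frame (`check_df`, a hypothetical name) built from two sample rows with one row repeated on purpose.

```python
import pandas as pd

# Toy frame: values copied from two rows of the sample shown earlier, with the
# first row repeated deliberately so the duplicate check has something to find.
check_df = pd.DataFrame({
    "pH": [3.16, 3.23, 3.16],
    "alcohol": [9.1, 12.0, 9.1],
    "quality": [5, 6, 5],
})

n_missing = int(check_df.isnull().sum().sum())    # total missing cells
n_duplicates = int(check_df.duplicated().sum())   # rows identical to an earlier row
quality_in_range = bool(check_df["quality"].between(0, 10).all())

print(n_missing, n_duplicates, quality_in_range)
```

The same three lines could be run on `wine_df` to confirm that the quality scores fall within the stated 0 to 10 range and to see whether any rows are repeated.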

### Random Sampling

The dataset contains 1,599 rows of data, as can be seen below. Because the dataset is not overly large, there is no need to use random sampling to reduce the amount of data in our analysis.

In [26]:
wine_df.shape[0]

1599

## <u>Data Exploration and Visualisation</u> <a id='data_exploration'></a>

### Univariate Visualisation

### Two-Variable Visualisation

### Three-Variable Visualisation

## <u>Summary and Conclusions</u> <a id='summary'></a>

## <u>References</u> <a id='references'></a>
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. "Wine Quality Data Set" (UCI Machine Learning Repository). Retrieved September 27, 2022 from https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. "Modeling wine preferences by data mining from physicochemical properties" (ScienceDirect). Retrieved September 27, 2022 from https://www.sciencedirect.com/science/article/pii/S0167923609001377
- Samriddhi Chauhana and Roshan Deshmukh. "Wine Red Market" (Allied Market Research). Retrieved September 28, 2022 from https://www.alliedmarketresearch.com/red-wine-market-A13400