# Predicting Red Wine Quality
## Phase 1: Data Preparation & Visualisation

### Group Name: Group 21

### Name(s) & ID(s) of Group Members: Michael Xie (s3943224), Samuel Lausberg (s3948914)

## Table of Contents
<a href='#introduction'>Introduction</a>

<a href='#goals'>Goals and Objectives</a>

<a href='#data_cleaning'>Data Cleaning and Preprocessing</a>

<a href='#data_exploration'>Data Exploration and Visualisation</a>

## <u>Introduction</u><a id='introduction'></a>


### Dataset Source
The Red Wine Quality dataset used in this study was sourced from the UCI Machine Learning Repository (Cortez et al., 2009). The dataset contains 1,599 red wine samples, each with a quality score between 0 and 10.

### Dataset Details


In [12]:
# libraries

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [13]:
# read csv
file_name = "Phase1_Group21.csv"
wine_df = pd.read_csv(file_name, sep=';')

In [68]:
# see a sample
wine_df.sample(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
737,8.2,0.59,0.0,2.5,0.093,19.0,58.0,1.0002,3.5,0.65,9.3,6
542,9.3,0.715,0.24,2.1,0.07,5.0,20.0,0.9966,3.12,0.59,9.9,5
887,10.7,0.52,0.38,2.6,0.066,29.0,56.0,0.99577,3.15,0.79,12.1,7
811,12.9,0.5,0.55,2.8,0.072,7.0,24.0,1.00012,3.09,0.68,10.9,6
1267,10.4,0.43,0.5,2.3,0.068,13.0,19.0,0.996,3.1,0.87,11.4,6
669,11.3,0.34,0.45,2.0,0.082,6.0,15.0,0.9988,2.94,0.66,9.2,6
1074,7.5,0.77,0.2,8.1,0.098,30.0,92.0,0.99892,3.2,0.58,9.2,5
841,6.6,0.66,0.0,3.0,0.115,21.0,31.0,0.99629,3.45,0.63,10.3,5
704,9.1,0.765,0.04,1.6,0.078,4.0,14.0,0.998,3.29,0.54,9.7,4
506,10.4,0.24,0.46,1.8,0.075,6.0,21.0,0.9976,3.25,1.02,10.8,7
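
A quick structural check is a natural follow-up to eyeballing a sample. The sketch below is illustrative only: it rebuilds three of the rows shown above as an in-memory, semicolon-separated CSV so that it runs without `Phase1_Group21.csv`, and the names `csv_text`, `demo_df` and `summary` are hypothetical. It assumes the file has a header row and purely numeric columns, as the sample suggests.

```python
import io

import pandas as pd

# Three rows copied from the sample above, laid out the way
# pd.read_csv(..., sep=';') expects. This stands in for the real file.
csv_text = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
8.2;0.59;0.0;2.5;0.093;19.0;58.0;1.0002;3.5;0.65;9.3;6
9.3;0.715;0.24;2.1;0.07;5.0;20.0;0.9966;3.12;0.59;9.9;5
10.7;0.52;0.38;2.6;0.066;29.0;56.0;0.99577;3.15;0.79;12.1;7
"""
demo_df = pd.read_csv(io.StringIO(csv_text), sep=';')

# Collect the basic facts worth confirming before any cleaning step.
summary = {
    "shape": demo_df.shape,
    "missing_values": int(demo_df.isna().sum().sum()),
    "non_numeric_columns": [
        col for col in demo_df.columns
        if not pd.api.types.is_numeric_dtype(demo_df[col])
    ],
    "quality_in_range": bool(demo_df["quality"].between(0, 10).all()),
}
print(summary)
```

Running the same checks on `wine_df` would show whether the full file matches these assumptions.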


### Dataset Features

The features of the red wine dataset are described in the following table.
The units are taken from the research paper:
https://www.sciencedirect.com/science/article/pii/S0167923609001377?ref=cra_js_challenge&fr=RR-1
The descriptions are taken from Kaggle:
https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

<b style='color: red'>Need to do proper APA referencing here</b>

In [67]:
from tabulate import tabulate

col_desc = [
    ['Name', 'Data Type', 'Units', 'Description'],
    ['Fixed acidity', 'Numeric', 'g(tartaric acid)/dm^3', 'Most acids involved with wine are fixed or\nnonvolatile (do not evaporate readily)'],
    ['Volatile acidity', 'Numeric', 'g(acetic acid)/dm^3', 'The amount of acetic acid in wine,\nwhich at too high of levels can lead to\nan unpleasant, vinegar taste'],
    ['Citric acid', 'Numeric', 'g/dm^3', "Found in small quantities, citric acid can\nadd 'freshness' and flavor to wines"],
    ['Residual sugar', 'Numeric', 'g/dm^3', "The amount of sugar remaining after\nfermentation stops; it's rare to find\nwines with less than 1 gram/liter and\nwines with greater than 45 grams/liter\nare considered sweet"],
    ['Chlorides', 'Numeric', 'g(sodium chloride)/dm^3', 'The amount of salt in the wine'],
    ['Free sulfur dioxide', 'Numeric', 'mg/dm^3', 'The free form of SO2 exists in equilibrium\nbetween molecular SO2 (as a dissolved gas)\nand bisulfite ion; it prevents microbial growth\nand the oxidation of wine'],
    ['Total sulfur dioxide', 'Numeric', 'mg/dm^3', 'Amount of free and bound forms of SO2;\nin low concentrations, SO2 is mostly undetectable\nin wine, but at free SO2 concentrations\nover 50 ppm, SO2 becomes evident\nin the nose and taste of wine'],
    ['Density', 'Numeric', 'g/cm^3', 'The density of wine is close to that of water\ndepending on the percent alcohol and sugar content'],
    ['pH', 'Numeric', 'pH', 'Describes how acidic or basic a wine is\non a scale from 0 (very acidic)\nto 14 (very basic); most wines\nare between 3-4 on the pH scale'],
    ['Sulphates', 'Numeric', 'g(potassium sulphate)/dm^3', 'A wine additive which can contribute\nto sulfur dioxide gas (SO2) levels,\nwhich acts as an antimicrobial and antioxidant'],
    ['Alcohol', 'Numeric', 'vol%', 'The percent alcohol content of the wine'],
    ['Quality', 'Ordinal categorical', 'N/A', 'Score between 0 and 10']
]

print(tabulate(col_desc, headers='firstrow', tablefmt='fancy_grid'))

╒══════════════════════╤═════════════════════╤════════════════════════════╤════════════════════════════════════════════════════╕
│ Name                 │ Data Type           │ Units                      │ Description                                        │
╞══════════════════════╪═════════════════════╪════════════════════════════╪════════════════════════════════════════════════════╡
│ Fixed acidity        │ Numeric             │ g(tartaric acid)/dm^3      │ Most acids involved with wine are fixed or         │
│                      │                     │                            │ nonvolatile (do not evaporate readily)             │
├──────────────────────┼─────────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ Volatile acidity     │ Numeric             │ g(acetic acid)/dm^3        │ The amount of acetic acid in wine,                 │
│                      │                     │                            │ which at too high of levels c
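
The feature table lists Quality as an ordinal categorical feature. One way to make that ordering explicit in pandas is an ordered `CategoricalDtype`. This is a sketch of one option, not necessarily how later phases of this project encode the target. The scores are the ten values from the sample shown earlier, and the names `quality_scores`, `quality_dtype` and `quality_cat` are hypothetical.

```python
import pandas as pd

# Quality scores from the ten sampled rows shown earlier in this notebook.
quality_scores = pd.Series([6, 5, 7, 6, 6, 6, 5, 5, 4, 7], name="quality")

# Declare every possible score (0 to 10) as an ordered category, so levels
# that never occur in the data still exist and comparisons are meaningful.
quality_dtype = pd.CategoricalDtype(categories=list(range(11)), ordered=True)
quality_cat = quality_scores.astype(quality_dtype)

# Ordered categoricals support min/max and comparisons against a level.
print(quality_cat.min(), quality_cat.max())
print(int((quality_cat >= 6).sum()), "wines scored 6 or higher")

# Counts per level in score order, including levels with zero wines.
print(quality_cat.value_counts(sort=False))
```

Keeping all eleven levels makes it obvious in later plots which scores never occur, at the cost of some empty bars.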

### Target Feature

## <u>Goals and Objectives</u> <a id='goals'></a>

## <u>Data Cleaning and Preprocessing</u> <a id='data_cleaning'></a>

### Data Cleaning Steps

### Random Sampling

## <u>Data Exploration and Visualisation</u> <a id='data_exploration'></a>

### Univariate Visualisation

### Two-Variable Visualisation

### Three-Variable Visualisation

## Summary and Conclusions

## References
- Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality Data Set. UCI Machine Learning Repository. Retrieved September 27, 2021, from https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/