# Introduction to the Wine Quality dataset

This intro contains a summary (with a few additions) of the dataset description that you can find [here](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names).

We have also included a couple of examples of how to read the data using pandas. We assume you have already completed the introduction to pandas in this repository.

## Dataset description

The two datasets are related to the red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult [http://www.vinhoverde.pt/en/](http://www.vinhoverde.pt/en/). You can also take a look at the paper written by the dataset's donors, titled *"Modeling wine preferences"* (you can find it in this folder). Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). For the WWC, we have added a `name` column with fake wine names for indexing purposes (and to be able to discuss the results we will obtain during the meetup).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure that all input variables are relevant, so it could be interesting to test feature selection methods.
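
To see how unbalanced the classes are, you can count the wines per quality score. The following is a minimal sketch, not part of the original dataset description; the helper name `quality_distribution` is our own, and it assumes a DataFrame with a `quality` column, such as the ones loaded in the "Reading the data" section.

```python
import pandas as pd


def quality_distribution(wines: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of wines for each quality score.

    Hypothetical helper: assumes `wines` has an integer `quality` column.
    """
    # Count wines per score and order the result by score, not by frequency
    counts = wines["quality"].value_counts().sort_index()
    # Convert the counts to fractions of the total number of wines
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})
```

Once the data is loaded, `quality_distribution(red_wines)` should show most wines concentrated in the middle scores.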

Some relevant data:
1. **Number of Instances:** red wine - 1,599; white wine - 4,898.
1. **Number of Attributes:** 12 (11 physicochemical measurements plus the added name) + output attribute. *Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection*.
1. **Attribute information:**
  1. **Input variables (based on physicochemical tests):**  
    1. Fixed acidity.
    1. Volatile acidity.
    1. Citric acid.
    1. Residual sugar.
    1. Chlorides.
    1. Free sulfur dioxide.
    1. Total sulfur dioxide.
    1. Density.
    1. pH.
    1. Sulphates.
    1. Alcohol.
    1. Name (the fake wine name added for indexing; not a physicochemical measurement).
  1. **Output variable (based on sensory data):**  
    1. Quality (score between 0 and 10)
1. **Missing Attribute Values:** none.
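
The note above about correlated attributes can be explored with a pairwise correlation matrix. This is a rough sketch under our own assumptions: the helper name `correlated_pairs` and the 0.6 threshold are arbitrary choices for illustration, not values taken from the dataset description.

```python
import pandas as pd


def correlated_pairs(wines: pd.DataFrame, threshold: float = 0.6) -> list:
    """List (column_a, column_b, r) for input pairs with |r| >= threshold.

    Hypothetical helper; the default threshold is an arbitrary assumption.
    """
    # Keep numeric inputs only and leave the output variable out, if present
    inputs = wines.select_dtypes(include="number").drop(
        columns=["quality"], errors="ignore"
    )
    corr = inputs.corr()  # Pearson correlation by default
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))
```

A high correlation between two inputs only suggests redundancy; whether to drop one of them still depends on the model you plan to use.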

## File structure

The dataset is split into two files:
* The file **winequality-red.csv** contains 1,599 red wine records. All the red wine names start with 'r'.
* The file **winequality-white.csv** contains 4,898 white wine records. All the white wine names start with 'w'.

You can find the data in this repo within the `data/` directory.

**Important:** The data is in CSV (comma-separated values) format, but it uses ';' rather than ',' to separate the values. Please make sure you specify the right separator when ingesting the data.
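
To illustrate why the separator matters, here is a small sketch that parses the same text with and without `sep=';'`. The two rows in `sample` are made up to mimic the file layout; they are not real records from the dataset.

```python
import io

import pandas as pd

# Made-up sample that mimics the layout of the files (not real dataset rows)
sample = "name;pH;alcohol;quality\nx0000;3.2;9.5;5\nx0001;3.4;10.1;6\n"

# Default separator (','): each line ends up in one single column
wrong = pd.read_csv(io.StringIO(sample))

# Correct separator (';'): columns are split and 'name' becomes the index
right = pd.read_csv(io.StringIO(sample), sep=";", index_col="name")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

If a DataFrame comes back with a single column whose name contains semicolons, the separator is the first thing to check.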

## Reading the data

We will use pandas to read and manipulate the data.

Make sure you use the right separator for the data (;) and use the *name* column as the index of the data.

In [5]:
import pandas as pd

# We assume this notebook is in a first-level subdirectory of the repo, so the data will be in ../data
# Change the path according to the location of your notebook
red_wines = pd.read_csv("../data/winequality-red.csv", sep=';', index_col='name')
white_wines = pd.read_csv("../data/winequality-white.csv", sep=';', index_col='name')

And now we can check that the data was ingested as expected:

In [2]:
red_wines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1599 entries, r0000 to r1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 162.4+ KB


In [4]:
white_wines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4898 entries, w0000 to w4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 497.5+ KB
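
The checks done by eye on the `info()` output above can also be written as assertions. The following is a minimal sketch, assuming the frames were loaded with `index_col='name'` as shown earlier; the helper name `check_wines` is our own.

```python
import pandas as pd


def check_wines(wines: pd.DataFrame, expected_rows: int, prefix: str) -> None:
    """Raise AssertionError if the frame does not match the dataset description.

    Hypothetical helper: assumes the wine names are used as the index.
    """
    # Row count stated in the dataset description
    assert len(wines) == expected_rows, (
        f"expected {expected_rows} rows, got {len(wines)}"
    )
    # The description states that there are no missing values
    assert not wines.isna().any().any(), "unexpected missing values"
    # Red wine names start with 'r', white wine names with 'w'
    assert wines.index.str.startswith(prefix).all(), (
        f"some names do not start with {prefix!r}"
    )
    # Quality is a score between 0 and 10
    assert wines["quality"].between(0, 10).all(), "quality outside 0-10"
```

With the data loaded, you would call `check_wines(red_wines, 1599, "r")` and `check_wines(white_wines, 4898, "w")`; both calls return silently when everything matches.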
