---
# Wine Quality "Warm up" Challenge
### Physicochemical factors that predict good quality wine
---
The "warm up" challenge for this year is adapted from the well-known 'Wine Quality' challenge on Kaggle. In particular, given a dataset containing several attributes describing wine, your task is to make predictions on the quality of as-yet unlisted wine samples. Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation.

**IMPORTANT**: please refer to the AML course guidelines concerning grading rules. Pay special attention to the **presentation quality** item, which boils down to: don't dump a zillion lines of code and plots in this notebook. Produce a concise summary of your findings: this notebook can exist in two versions, a "scratch" version that you will use to work and debug, and a "presentation" version that you will submit. The "presentation" notebook should get to the point and convey the main findings of your work.

---
## Overview
Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist. In this regard, your notebook should be structured in such a way as to explore the following tasks, which are expected to be carried out whenever undertaking such a project. The description below each aspect should serve as a guide for your work, but you can also explore alternative options and directions. Thinking outside the box will be rewarded in these challenges.

### 1. Data preparation:
   
_Data exploration_: The first broad component of your work should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification. Among others, you can work on:
   
* Data cleaning, e.g. treatment of categorical variables;
* Data visualisation;
* Computing descriptive statistics, e.g. correlation.
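
As a starting point for the exploration steps above, here is a minimal sketch that reads a delimited file and computes per-column means, standard deviations, and one Pearson correlation using only the standard library. It assumes the CSV files are semicolon-delimited with a header row; check this against the actual files. The helpers `read_table` and `pearson` are hypothetical names, and the toy rows are made up for illustration.

```python
import csv
import io
import statistics


def read_table(handle, delimiter=";"):
    """Read a delimited text file into a dict mapping column name -> list of floats."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy rows (made up); replace with e.g. open("/mnt/datasets/wine/winequality-red.csv").
toy_file = io.StringIO(
    '"alcohol";"pH";"quality"\n'
    "9.4;3.51;5\n"
    "9.8;3.20;5\n"
    "11.2;3.16;6\n"
    "12.5;3.30;7\n"
)
table = read_table(toy_file)

# Descriptive statistics per column.
for name, values in table.items():
    print(name, round(statistics.fmean(values), 3), round(statistics.stdev(values), 3))

# Correlation between one feature and the target.
alcohol_quality_corr = pearson(table["alcohol"], table["quality"])
print("corr(alcohol, quality) =", round(alcohol_quality_corr, 3))
```

In practice a data-frame library would do the same in a few calls; the point here is only to show what the statistics are.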

_Data Pre-processing_: The previous step should give you a better understanding of which pre-processing is required for the data. This may include:

* Normalising and standardising the given data;
* Removing outliers;
* Carrying out feature selection, possibly using metrics derived from information theory;
* Handling missing information in the dataset;
* Augmenting the dataset with external information;
* Combining existing features.
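
Two of the pre-processing steps listed above, standardisation and outlier removal, can be sketched with NumPy as follows. The interquartile-range (IQR) rule with a factor of 1.5 is one common convention, chosen here as an assumption rather than a requirement of the challenge. The function names and the toy feature matrix are made up for illustration.

```python
import numpy as np


def standardise(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid dividing by zero
    return (X - mean) / std


def iqr_mask(X, k=1.5):
    """Boolean mask of rows whose every feature lies within k IQRs of the quartiles."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.all((X >= lower) & (X <= upper), axis=1)


# Toy feature matrix (made up): two columns, with two rows that contain extreme values.
X = np.array([
    [7.4, 0.70],
    [7.8, 0.88],
    [7.8, 0.76],
    [11.2, 0.28],
    [7.4, 0.66],
    [7.9, 0.60],
    [7.3, 0.65],
    [30.0, 0.58],
])

mask = iqr_mask(X)           # True for rows to keep
X_clean = X[mask]
Z = standardise(X_clean)     # standardise after dropping the outliers
print(mask)
print(Z.mean(axis=0), Z.std(axis=0))
```

When a separate test set is used, the mean, standard deviation, and quartiles would normally be computed on the training data only and then applied to the other splits, so that no information leaks from the held-out samples.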

Note that, as the name implies, this is a warm-up challenge, which essentially means that data is already put in a convenient format that requires minimal pre-processing.

### 2. Model selection
An important part of the work involves the selection of a model that can successfully handle the given data and yield sensible predictions. Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models. There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on. 

Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, such as cross-validation.
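
To make the tuning procedure concrete, the following sketch runs k-fold cross-validation by hand to choose the number of neighbours for a k-nearest-neighbours classifier. The classifier is only a stand-in (it is not one of the models suggested above), the data are synthetic, and all function names are hypothetical; the same loop applies to any model with a tunable parameter.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and cut them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def knn_predict_proba(X_train, y_train, X_query, k):
    """Fraction of positive labels among the k nearest training points."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def log_loss(y, p, eps=1e-6):
    """Binary log loss; probabilities are clipped so that log(0) never occurs."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def cv_score(X, y, k, n_folds=5):
    """Mean validation log loss over the folds for a given k."""
    folds = kfold_indices(len(X), n_folds)
    losses = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        p = knn_predict_proba(X[train_idx], y[train_idx], X[val_idx], k)
        losses.append(log_loss(y[val_idx], p))
    return float(np.mean(losses))


# Synthetic data (made up): the label depends noisily on the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

candidates = [1, 5, 15, 45]
scores = {k: cv_score(X, y, k) for k in candidates}
best_k = min(scores, key=scores.get)  # lower log loss is better
print(scores)
print("selected k:", best_k)
```

Every candidate is scored on the same folds (fixed seed), so differences in the scores reflect the parameter rather than the split.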

### 3. Performance evaluation
The evaluation metric for this project is "Log Loss". For the N wines in the test data set, the metric is calculated as:

\\(\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]\\)

where \\(y_i\\) is the true (but withheld) quality outcome for wine \\(i\\) in the test data set (1 for good quality, 0 otherwise), and \\(p_i\\) is the predicted probability of good quality for wine \\(i\\). Larger values of \\(\mathcal{L}\\) indicate poorer predictions.
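
For reference, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities. A minimal sketch in plain Python is given below; the clipping constant `eps` is an assumption added to keep the logarithm finite when a prediction is exactly 0 or 1, and the example labels and probabilities are made up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    if len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must have the same length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]

confident = log_loss(y_true, [0.9, 0.1, 0.8, 0.3])   # mostly right
uninformed = log_loss(y_true, [0.5, 0.5, 0.5, 0.5])  # equals ln(2)
wrong = log_loss(y_true, [0.1, 0.9, 0.2, 0.7])       # mostly wrong

print(confident, uninformed, wrong)
```

The three cases illustrate the ordering the metric imposes: well-calibrated confident predictions score lowest, a constant 0.5 scores ln(2), and confident mistakes are penalised most heavily.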

---
## Dataset description
You will be working on two data files, which will be available in ```/mnt/datasets/wine/```, one for red and one for white wines:

* winequality-red.csv 
* winequality-white.csv

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference Cortez et al., 2009. Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

### Tips
A possible trick is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. classifying wines with a score of 7 or higher as 'good/1' and the remainder as 'not good/0'. Note that this can be seen as a data preparation task.
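
A minimal sketch of this binarisation, assuming the quality scores are available as an integer array. The helper name `binarise_quality` and the toy scores are made up for illustration; the cutoff of 7 is the example value mentioned above.

```python
import numpy as np


def binarise_quality(quality, cutoff=7):
    """Map quality scores to 1 ('good') if score >= cutoff, else 0 ('not good')."""
    return (np.asarray(quality) >= cutoff).astype(int)


# Toy quality scores (made up).
quality = np.array([5, 6, 7, 4, 8, 6, 5, 7, 6, 3])

labels = binarise_quality(quality)
positive_rate = labels.mean()  # share of wines labelled 'good'

print(labels)
print("fraction labelled good:", positive_rate)
```

It is worth checking the resulting class balance: with a high cutoff the 'good' class may well be a minority, which affects both how the data should be split and how the scores should be interpreted.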

### Training and test sets
We leave it to the students to decide how to carve out training and test sets (and validation sets too, if relevant to your approach). This is not a competition in which the instructors hold a "private" test set to rank students' models.
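
One possible way to carve out the splits is a stratified partition, which keeps the proportion of 'good' wines roughly equal across the training, validation, and test sets. The sketch below is only one option; the 60/20/20 fractions, the function name `stratified_split`, and the synthetic labels are assumptions made for illustration.

```python
import numpy as np


def stratified_split(y, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return one index array per fraction, preserving class proportions."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in fractions]
    for label in np.unique(y):
        # Shuffle the indices of this class, then cut them at the cumulative fractions.
        idx = rng.permutation(np.flatnonzero(y == label))
        cuts = np.round(np.cumsum(fractions)[:-1] * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, cuts)):
            part.extend(chunk.tolist())
    return [np.array(sorted(part)) for part in parts]


# Synthetic binary labels (made up): 80 'not good' and 20 'good' wines.
y = np.array([0] * 80 + [1] * 20)

train_idx, val_idx, test_idx = stratified_split(y)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), "positive rate:", y[idx].mean())
```

The index arrays can then be used to slice both the feature matrix and the labels. The test indices should be set aside until the final evaluation.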



P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.

### Attributes

Input variables (based on physicochemical tests):

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (score between 0 and 10)