-- firstname lastname --

# Use physical characteristics to determine wine quality

 <b>The deadline for the assignment is 11/01/2021</b>.


## The dataset

You are asked to predict wine quality, based on its physical characteristics. The dataset is provided in the accompanying file 'winequality-white.csv'. A full description of the data set can be found in the file 'metadata.txt'.

The data set can be loaded using the following commands (make sure to put the dataset in the same directory as your IPython notebook):

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
%matplotlib inline

#read and randomly shuffle data
winequality = pd.read_csv('winequality-white.csv', sep=';')

features = winequality.columns[1:]

winequality = winequality.values
winequality = winequality[np.random.permutation(winequality.shape[0]),:]

#80% - 20% split for the training and testing sets
tr_set_size = int(len(winequality)*0.8)  #np.int was removed from recent NumPy releases; the built-in int does the same job

#assign train and test sets (in your experiments, you want to do cross-validation)
#the target 'quality' is the last column of the file; all other columns are the inputs
X_tr = winequality[0:tr_set_size,:-1]
y_tr = winequality[0:tr_set_size,-1]
X_test = winequality[tr_set_size:,:-1]
y_test = winequality[tr_set_size:,-1]
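
The comment in the cell above recommends cross-validation over a single 80/20 split. A minimal sketch of k-fold cross-validation in plain NumPy could look as follows. The helper names (`k_fold_indices`, `majority_class_predict`, `cross_val_error`) are hypothetical, and the data is a synthetic stand-in so that the block runs without the CSV file; in the assignment you would pass `X_tr` and `y_tr` instead.

```python
import numpy as np

def k_fold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def majority_class_predict(X_train, y_train, X_val):
    """Hypothetical baseline: always predict the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_val), labels[np.argmax(counts)])

def cross_val_error(fit_predict, X, y, k=5, seed=0):
    """Mean and standard deviation of the misclassification rate over k folds."""
    rng = np.random.default_rng(seed)
    errors = []
    for train_idx, val_idx in k_fold_indices(len(y), k, rng):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        errors.append(np.mean(y_pred != y[val_idx]))
    return float(np.mean(errors)), float(np.std(errors))

# Synthetic stand-in for X_tr / y_tr (assumption: 11 inputs, integer quality labels).
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(200, 11))
y_demo = demo_rng.choice([5, 6, 7], size=200, p=[0.3, 0.5, 0.2])

mean_err, std_err = cross_val_error(majority_class_predict, X_demo, y_demo, k=5)
print(f"baseline CV error: {mean_err:.3f} +/- {std_err:.3f}")
```

A majority-class baseline like this one gives a reference error that any trained model should beat. scikit-learn offers equivalent functionality through `sklearn.model_selection.cross_val_score`.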


## Minimum Requirements

You will need to train at least 2 different models on the data set. Make sure to include the reason for your choice (e.g., for dealing with categorical features).

* Train at least 2 models (e.g. decision trees and nearest neighbour) to predict the quality of the wine. You are allowed to use: Decision Trees, Perceptrons, Neural Networks, K Nearest Neighbours or Naive Bayes models (all of these are available in the scikit-learn library). You are also allowed to use other methods, as long as you motivate your choice.
* For each model, optimize the model's parameter settings (tree depth, hidden nodes/decay, number of neighbours, ...). Show which parameter setting gives the best expected error.
* Compare the best parameter settings for both models and estimate their errors on unseen data. Can you show that one of the models performs better?
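
One possible way to organise the three steps above is a cross-validated grid search per model, followed by a single evaluation on the held-out test set. The sketch below uses scikit-learn's `GridSearchCV`. The parameter grids are arbitrary illustrations, not recommended values, and the data comes from `make_classification` as a synthetic stand-in so that the block runs without the CSV file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: 11 inputs, 3 classes).
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One estimator and one (illustrative) parameter grid per model.
candidates = {
    "decision tree": (DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 8, None]}),
    "k nearest neighbours": (KNeighborsClassifier(),
                             {"n_neighbors": [1, 5, 15, 31]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validation on the training set only.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_tr, y_tr)
    results[name] = {
        "best_params": search.best_params_,
        "cv_error": 1.0 - search.best_score_,
        # The test set is touched once, with the refitted best estimator.
        "test_error": 1.0 - search.score(X_test, y_test),
    }
    print(name, results[name])
```

Keeping the test set out of the parameter search matters here: the cross-validated error of the selected setting is optimistically biased, so the test error is the fairer estimate of performance on unseen data.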

All results, plots and code should be handed in as an interactive <a href='http://ipython.org/notebook.html'>IPython notebook</a>. Simply providing code and plots does not suffice: you are expected to accompany each technical section with explanations and a discussion of your choices, results and observations. <b>The deadline for the assignment is 11/01/2021</b>.

## Optional Extensions

You are encouraged to try and see if you can further improve on the models you obtained above. This is not necessary to obtain a good grade on the assignment, but any extensions on the minimum requirements will count for extra credit. Some suggested possibilities to extend your approach are:

* Build and host an API for your best performing model. You can create an API using Python frameworks such as FastAPI, Flask, ... You can host an API for free on Heroku, using your student credit on Azure, ...
* Try to combine multiple models. Ensemble and boosting methods try to combine the predictions of many simple models. This typically works best with models that make different errors. Scikit-learn has some support for this, <a href="http://scikit-learn.org/stable/modules/ensemble.html">see here</a>. You can also try to combine the predictions of multiple models manually, i.e. train multiple models and average their predictions.
* You can always investigate whether all features are necessary to produce a good model. Feel free to look up additional resources and papers to find more information; see e.g. <a href='https://scikit-learn.org/stable/modules/feature_selection.html'>here</a> for the feature selection module provided by the scikit-learn library.
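
As an illustration of combining models manually, the sketch below averages the class probabilities of two models and picks the most probable class per sample (often called soft voting). The probability matrices are hand-made placeholders, not real model output. With scikit-learn classifiers that support it, such matrices would come from `predict_proba`, with columns ordered as in the model's `classes_` attribute.

```python
import numpy as np

def average_probabilities(prob_list):
    """Average class-probability matrices of shape (n_samples, n_classes)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# Quality labels corresponding to the columns of the matrices below.
classes = np.array([5, 6, 7])

# Placeholder probabilities for three samples from two hypothetical models.
probs_model_a = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3],
                          [0.1, 0.4, 0.5]])
probs_model_b = np.array([[0.4, 0.5, 0.1],
                          [0.1, 0.7, 0.2],
                          [0.2, 0.2, 0.6]])

avg = average_probabilities([probs_model_a, probs_model_b])
ensemble_pred = classes[np.argmax(avg, axis=1)]
print(ensemble_pred)
```

For the first sample the two models disagree (model A favours 5, model B favours 6), and the averaged probabilities settle the disagreement in favour of the more confident model. scikit-learn's `VotingClassifier` with `voting="soft"` implements the same idea.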

## Additional Remarks

* Depending on the model used, you may want to <a href='http://scikit-learn.org/stable/modules/preprocessing.html'>scale</a> or <a href='https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features'>encode</a> your (categorical) features X and/or outputs y.
* Refer to the <a href='http://scipy.org/docs.html'>SciPy</a> and <a href='http://scikit-learn.org/stable/documentation.html'>scikit-learn</a> documentation for more information on classifiers and data handling.
* You are allowed to use additional libraries, but provide references for these.
* The assignment is **individual**. All results should be your own. Plagiarism will not be tolerated.
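
Regarding the remark on scaling: distance-based models such as K Nearest Neighbours are sensitive to the scale of each feature. A minimal sketch of standardization in NumPy follows. The helper names are hypothetical and the data is synthetic, with column scales loosely chosen to differ by orders of magnitude. The key point is that the mean and standard deviation are estimated on the training data only and then reused for the test data.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-column mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any data set with statistics estimated on the training data."""
    return (X - mean) / std

# Synthetic stand-in: three columns on very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 50.0, 0.01])
X_train_demo = rng.normal(loc=[7.0, 140.0, 0.99], scale=scales, size=(100, 3))
X_test_demo = rng.normal(loc=[7.0, 140.0, 0.99], scale=scales, size=(25, 3))

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = apply_standardizer(X_train_demo, mean, std)
X_test_scaled = apply_standardizer(X_test_demo, mean, std)  # reuses training statistics

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

scikit-learn's `StandardScaler` performs the same computation, and wrapping it in a `Pipeline` together with the model ensures the statistics are re-estimated inside each cross-validation fold.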

In [29]:
winequality_csv = pd.read_csv('winequality-white.csv', sep=';')

winequality_csv = winequality_csv[features].dropna()

winequality_csv.head()

| | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


# Model 1: ...

# Model 2: ...

# Conclusion
### Compare the best parameter settings