# Challenge 4

## Reading Materials


## Activity 4: Wine Quality Linear Regression
Due: October 16, 2018

Activity 4 is based on the [Wine Quality Data](https://archive.ics.uci.edu/ml/datasets/Wine+Quality), which aggregates objective tests about various wines.
The output is based on sensory data (median of at least 3 evaluations made by wine experts).
Each expert graded wine quality between 0 (bad) and 10 (excellent).

The goal of this activity is to explore a linear regression to predict wine quality.
Relevant topics for this challenge include the following sections.

* Linear Regression (Section 9.2)


### Acknowledgement

This dataset is publicly available for research.
Additional details are available in [Cortez et al., 2009](http://dx.doi.org/10.1016/j.dss.2009.05.016).
This challenge is based, largely, on the version made available by the [Center for Machine Learning and Intelligent Systems](https://archive.ics.uci.edu/ml) at the University of California, Irvine.

 * P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.


### Linear Regression

Linear regression is a statistical approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
In this framework, the relationships between input and output are modeled using linear predictor functions whose unknown model parameters are estimated from the data.


### Data

This challenge uses a combined and cleaned-up version of `winequality-red.csv` and `winequality-white.csv`.
The 11 input variables are as follows.

 1. Fixed acidity
 2. Volatile acidity
 3. Citric acid
 4. Residual sugar
 5. Chlorides
 6. Free sulfur dioxide
 7. Total sulfur dioxide
 8. Density
 9. pH
 10. Sulphates
 11. Alcohol

The output variable is a quality score between 0 and 10.
Mathematically, wine quality is estimated as a linear combination of the input features,
\begin{equation*}
\hat{y}_i = \alpha_0 + \sum_{j=1}^{11} \alpha_j x_{i,j} ,
\end{equation*}
where $x_{i,j}$ is the value of input variable $j$ for wine $i$.
The coefficients $\{ \alpha_j \}$ are the same for every wine.
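
As a minimal sketch of this model, the coefficients can be estimated by ordinary least squares. The example below uses synthetic, noise-free data rather than the wine files, and the names `fit_least_squares` and `predict` are illustrative only; they are not part of the challenge materials.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return [alpha_0, alpha_1, ..., alpha_d] minimizing the squared error."""
    # Prepend a column of ones so that alpha_0 acts as the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    alpha, *_ = np.linalg.lstsq(A, y, rcond=None)
    return alpha


def predict(X, alpha):
    """Evaluate y_hat_i = alpha_0 + sum_j alpha_j * x_{i,j} for every row i."""
    return alpha[0] + X @ alpha[1:]


# Synthetic example (assumption: 200 samples, 11 features, no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 11))
true_alpha = np.arange(12, dtype=float)
y_demo = true_alpha[0] + X_demo @ true_alpha[1:]

alpha_hat = fit_least_squares(X_demo, y_demo)
y_hat = predict(X_demo, alpha_hat)
```

Because the synthetic targets are exactly linear in the features, `alpha_hat` recovers `true_alpha` up to floating-point error. With real wine data the fit would only be approximate.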


### Evaluation

The root-mean-square error (RMSE), also called the root-mean-square deviation (RMSD), is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. It is the square root of the second sample moment of the differences between predicted and observed values, that is, the quadratic mean of these differences.
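
For $n$ observations $y_i$ with predictions $\hat{y}_i$, this is $\sqrt{\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2}$. A minimal sketch of the computation follows; the helper name `rmse` and the three sample values are illustrative and are not taken from the challenge data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are 0, 0 and 3, so the mean squared error is 9 / 3 = 3
# and the RMSE is sqrt(3).
example_rmse = rmse([5, 6, 7], [5, 6, 4])
```

Taking the square root of `sklearn.metrics.mean_squared_error` should give the same value.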


### File Descriptions

 * `winequality-white-training.csv` – Training set
 * `winequality-white-testing.csv` – Test set
 * `winequality-white-sample.csv` – Sample submission



### Deliverables (Part 1)

Submissions are evaluated by the root-mean-square error between the submitted CSV and the ground-truth solution CSV.
Team numbers and compositions are available on GitHub under

* `ECEN689-Fall2018 -> Challenges -> 4Files -> README.md`

Documents to be submitted are as follows.

__Kaggle__: Every team should enter the Kaggle competition and submit a prediction file in the Kaggle format, as specified in `winequality-white-sample.csv`.

__GitHub__: Every team should commit and push the following files.
 1. A prediction file for the test set.
    * `ECEN689-Fall2018 -> Challenges -> 4Files -> Team## -> winequality-white-solution.csv`
 2. A parameter vector file, with one column.
    * `ECEN689-Fall2018 -> Challenges -> 3Files -> Team## -> population_parameters.csv`
 3. Jupyter notebook code or Python code within the same `Team##` directory.


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Full white-wine dataset
winequality_white_df = pd.read_csv('4Files/winequality-white.csv') # (4898, 12)

# Training set
winequality_white_training_df = pd.read_csv('4Files/winequality-white-training.csv')
print(winequality_white_training_df.shape)

# Test set
winequality_white_testing_df = pd.read_csv('4Files/winequality-white-testing.csv')
print(winequality_white_testing_df.shape)

# Sample submission, which shows the expected prediction file format
winequality_white_prediction_df = pd.read_csv('4Files/winequality-white-sample.csv')
print(winequality_white_prediction_df.shape)
```
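
One possible next step is to fit a regularized linear model on the training frame and predict on the test frame. The sketch below is not the required solution and rests on several assumptions:

* the target column is named `quality`;
* every other column in the training frame is a numeric feature that is also present in the test frame (any identifier column would need to be dropped first);
* the helper name `fit_and_predict` and the ridge penalty `alpha=1.0` are hypothetical choices.

It runs on a small synthetic frame so that it does not depend on the CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


def fit_and_predict(train_df, test_df, target="quality", alpha=1.0):
    """Fit ridge regression on train_df and predict the target for test_df."""
    # Assumption: every non-target column of train_df is a numeric feature.
    feature_cols = [c for c in train_df.columns if c != target]
    model = Ridge(alpha=alpha)
    model.fit(train_df[feature_cols], train_df[target])

    # Training RMSE, a rough and optimistic indication of fit quality.
    train_pred = model.predict(train_df[feature_cols])
    train_rmse = float(np.sqrt(mean_squared_error(train_df[target], train_pred)))

    test_pred = model.predict(test_df[feature_cols])
    return model, test_pred, train_rmse


# Synthetic stand-in for the wine data (hypothetical columns a, b, c).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo_df = features.assign(quality=1.0 + features["a"] - 2.0 * features["b"])
demo_train = demo_df.iloc[:80]
demo_test = demo_df.iloc[80:].drop(columns="quality")

model, demo_pred, demo_rmse = fit_and_predict(demo_train, demo_test)
```

With the real files, the call would look like `fit_and_predict(winequality_white_training_df, winequality_white_testing_df)`, provided the column assumptions above hold. The predictions would then need to be written in the format shown by `winequality-white-sample.csv`.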

