# Challenge 4

## Reading Materials


## Activity 4 (Part 1): Wine Quality Linear Regression
Due: October 16, 2018

Activity 4 is based on the [Wine Quality Data](https://archive.ics.uci.edu/ml/datasets/Wine+Quality), which aggregates objective tests about various wines.
The output is based on sensory data (median of at least 3 evaluations made by wine experts).
Each expert graded wine quality between 0 (bad) and 10 (excellent).

The goal of this activity is to explore a linear regression to predict wine quality.
Relevant topics for this challenge include the following sections.

* Linear Regression (Section 9.2)


### Acknowledgement

This dataset is publicly available for research.
Additional details are available in [Cortez et al., 2009](http://dx.doi.org/10.1016/j.dss.2009.05.016).
This challenge is based, largely, on the version made available by the [Center for Machine Learning and Intelligent Systems](https://archive.ics.uci.edu/ml) at the University of California, Irvine.

 * P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.  Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.


### Linear Regression

Linear regression is a statistical approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
In this framework, the relationships between input and output are modeled using linear predictor functions whose unknown model parameters are estimated from the data.


### Data

This challenge uses a combined and cleaned up version of `winequality-red.csv` and `winequality-white.csv`.
The 11 input variables are as follows.

 1. Fixed acidity
 2. Volatile acidity
 3. Citric acid
 4. Residual sugar
 5. Chlorides
 6. Free sulfur dioxide
 7. Total sulfur dioxide
 8. Density
 9. pH
 10. Sulphates
 11. Alcohol

The output variable is a quality score between 0 and 10.
Mathematically, wine quality is estimated as a linear combination of the input features,
\begin{equation*}
\hat{y}_i = \alpha_0 + \sum_j \alpha_j x_{i,j} ,
\end{equation*}
where $x_{i,j}$ denotes input feature $j$ of wine $i$.
The coefficients $\{ \alpha_j \}$ should be the same for every wine.
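
As an illustration of the equation above, the following is a minimal sketch of an ordinary least-squares fit with NumPy. The helper names `fit_linear_regression` and `predict` and the synthetic data are assumptions made for this example; they are not part of the challenge files, and other fitting approaches are equally acceptable.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return [alpha_0, alpha_1, ..., alpha_d] minimizing the squared error."""
    # Prepend a column of ones so that alpha_0 acts as the intercept.
    X_aug = np.column_stack([np.ones(len(X)), X])
    alpha, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return alpha

def predict(X, alpha):
    """Evaluate y_hat_i = alpha_0 + sum_j alpha_j * x_{i,j} for every row i."""
    return alpha[0] + X @ alpha[1:]

# Synthetic data (not the wine data): y = 1 + 2*x1 - 3*x2, without noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

alpha_demo = fit_linear_regression(X_demo, y_demo)
print(alpha_demo)
```

Because the synthetic response is noise-free, the recovered coefficients should match the ones used to generate it up to floating-point error. With the wine data, `X` would hold the 11 input variables and `y` the quality scores.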


### Evaluation

The root-mean-square error (RMSE) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. The RMSE is the square root of the second sample moment of these differences, that is, their quadratic mean.
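
For $n$ predictions, this criterion is $\sqrt{\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2}$. A minimal sketch of the computation follows; the function name `rmse` and the toy values are chosen for this example only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between predictions and observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy example: the differences are 0, 1, and -2, so the mean squared error is 5/3.
rmse_demo = rmse([5, 6, 7], [5, 7, 5])
print(rmse_demo)
```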


### File Descriptions

 * `winequality-white-training.csv` – Training set
 * `winequality-white-testing.csv` – Test set
 * `winequality-white-sample.csv` – Sample submission


### Deliverables (Part 1)

User submissions are evaluated by comparing their submission CSV to the ground truth solution CSV with respect to the root-mean-square error.
Team numbers and compositions are available on GitHub under

* `ECEN689-Fall2018 -> Challenges -> 4Files -> README.md`

Documents to be submitted are as follows.

__Kaggle__: Every team should enter the Kaggle competition and submit a prediction file in the Kaggle format, as specified in `winequality-white-sample.csv`.

__GitHub__: Every team should commit and push files.
 1. A prediction file for the test set.
   * `ECEN689-Fall2018 -> Challenges -> 4Files -> Team## -> winequality-white-solution.csv`
 2. A parameter vector file, with one column.
   * `ECEN689-Fall2018 -> Challenges -> 4Files -> Team## -> winequality-white-parameters.csv`
 3. Jupyter notebook code or Python code within the same `Team##` directory.

```python
import pandas as pd
import numpy as np

winequality_white_training_df = pd.read_csv('4Files/winequality-white-training.csv')
print(winequality_white_training_df.shape)

winequality_white_testing_df = pd.read_csv('4Files/winequality-white-testing.csv')
print(winequality_white_testing_df.shape)

winequality_white_prediction_df = pd.read_csv('4Files/winequality-white-sample.csv')
print(winequality_white_prediction_df.shape)
```
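
One possible way to go from a training data frame to the one-column parameter file is sketched below. The column names `Id`, `alcohol`, `pH`, `quality`, and `parameter` are hypothetical stand-ins, as is the tiny in-memory CSV; the real files should be inspected for their actual headers, and the ordering of the parameter vector (intercept first) is an assumption of this sketch.

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for the training CSV; column names and values are hypothetical.
demo_csv = io.StringIO(
    "Id,alcohol,pH,quality\n"
    "0,9.0,3.1,5\n"
    "1,10.5,3.3,6\n"
    "2,12.0,3.2,7\n"
    "3,11.0,3.0,6\n"
)
train_df = pd.read_csv(demo_csv)

# Every column except the identifier and the target is treated as an input feature.
feature_cols = [c for c in train_df.columns if c not in ("Id", "quality")]
X = train_df[feature_cols].to_numpy(dtype=float)
y = train_df["quality"].to_numpy(dtype=float)

# Least-squares fit with an explicit intercept column.
X_aug = np.column_stack([np.ones(len(X)), X])
alpha, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# One-column parameter file: intercept first, then one coefficient per feature.
params_df = pd.DataFrame({"parameter": alpha})
params_csv = params_df.to_csv(index=False)
print(params_csv)
```

Passing a file path to `to_csv` instead of calling it without arguments would write the same content to disk.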

## Activity 4 (Part 2): Wine Quality Decision Tree
Due: October 16, 2018

The goal of this activity is to explore a decision tree to classify the type of wine (red or white).


### Decision Tree

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
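
In the classification setting of this activity, each internal node of the tree tests one input variable against a threshold, and each leaf predicts a class. The sketch below fits a tree of depth one (a decision stump) by minimizing the weighted Gini impurity. It is an illustration only: the function names and the synthetic data are assumptions of this example, and a library implementation of a full decision tree is a reasonable alternative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(X, y):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                # Each leaf predicts its majority class.
                best = (score, j, t, int(np.mean(left) >= 0.5), int(np.mean(right) >= 0.5))
    _, feature, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def predict_stump(X, stump):
    """Route each row to the left or right leaf and return that leaf's label."""
    feature, threshold, left_label, right_label = stump
    return np.where(X[:, feature] <= threshold, left_label, right_label)

# Synthetic data: the class is 1 whenever the second feature exceeds 0.5.
X_demo = np.array([[0.3, 0.1], [0.9, 0.2], [0.5, 0.4],
                   [0.1, 0.6], [0.7, 0.8], [0.4, 0.9]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

stump_demo = fit_stump(X_demo, y_demo)
pred_demo = predict_stump(X_demo, stump_demo)
print(stump_demo, pred_demo)
```

A deeper tree applies the same split search recursively to the rows that reach each leaf.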


### Data

This challenge uses a combined and cleaned up version of `winequality-red.csv` and `winequality-white.csv`.
The 11 input variables are as follows.

 1. Fixed acidity
 2. Volatile acidity
 3. Citric acid
 4. Residual sugar
 5. Chlorides
 6. Free sulfur dioxide
 7. Total sulfur dioxide
 8. Density
 9. pH
 10. Sulphates
 11. Alcohol

The output variable is the type of wine (0 – white; 1 – red).


### Evaluation

In machine learning and pattern classification, each observation is called an instance, and the class it belongs to is its label; the labels fall into two or more classes. The empirical error rate of a classifier is the fraction of instances in a data set that it misclassifies, and the classification accuracy is one minus this error rate. The error rate of a classifier that knows the true class probabilities given the predictors is the Bayes error rate, which is the lowest achievable error rate.
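
A minimal sketch of computing the empirical error rate and the corresponding accuracy from true and predicted labels follows; the function names and the toy labels are chosen for this example only.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def accuracy(y_true, y_pred):
    """Fraction of instances classified correctly."""
    return 1.0 - error_rate(y_true, y_pred)

# Toy example: one of four instances is misclassified.
err_demo = error_rate([0, 0, 1, 1], [0, 1, 1, 1])
acc_demo = accuracy([0, 0, 1, 1], [0, 1, 1, 1])
print(err_demo, acc_demo)
```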


### File Descriptions

 * `winequality-combined-training.csv` – Training set
 * `winequality-combined-testing.csv` – Test set
 * `winequality-combined-sample.csv` – Sample submission


### Deliverables (Part 2)

User submissions are evaluated by comparing their submission CSV to the ground truth solution CSV with respect to categorization accuracy.

Documents to be submitted are as follows.

__Kaggle__: Every team should enter the Kaggle competition and submit a classification file in the Kaggle format, as specified in `winequality-combined-sample.csv`.

__GitHub__: Every team should commit and push files.
 1. A classification file for the test set.
   * `ECEN689-Fall2018 -> Challenges -> 4Files -> Team## -> winequality-combined-solution.csv`
 2. Jupyter notebook code or Python code within the same `Team##` directory.

```python
import pandas as pd
import numpy as np

winequality_combined_training_df = pd.read_csv('4Files/winequality-combined-training.csv')
print(winequality_combined_training_df.shape)

winequality_combined_testing_df = pd.read_csv('4Files/winequality-combined-testing.csv')
print(winequality_combined_testing_df.shape)

winequality_combined_prediction_df = pd.read_csv('4Files/winequality-combined-sample.csv')
print(winequality_combined_prediction_df.shape)
```


## Activity 4 (Part 3): Wine Quality Linear Regression
Due: October 16, 2018

The goal of this activity is to explore the application of a linear regression model trained in one context to a similar, yet different problem.


### Data

This challenge uses a combined and cleaned up version of `winequality-red.csv` and `winequality-white.csv`.
The 11 input variables are as follows.

 1. Fixed acidity
 2. Volatile acidity
 3. Citric acid
 4. Residual sugar
 5. Chlorides
 6. Free sulfur dioxide
 7. Total sulfur dioxide
 8. Density
 9. pH
 10. Sulphates
 11. Alcohol

The output variable is a quality score between 0 and 10.
Mathematically, wine quality is estimated based on the same linear combination as before,
\begin{equation*}
\hat{y}_i = \alpha_0 + \sum_j \alpha_j x_{i,j} .
\end{equation*}
The coefficients $\{ \alpha_j \}$ should be those derived for the white wine data set.
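
The sketch below illustrates what reusing coefficients looks like in code: a parameter vector is fitted on one data set and then applied, unchanged, to another. The two synthetic data sets are hypothetical stand-ins for the white and red wine data, generated with deliberately different relationships; they say nothing about how the real data behave.

```python
import numpy as np

def predict(X, alpha):
    """Evaluate y_hat_i = alpha_0 + sum_j alpha_j * x_{i,j} for every row i."""
    return alpha[0] + X @ alpha[1:]

def rmse(y_true, y_pred):
    """Root-mean-square error between observations and predictions."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

rng = np.random.default_rng(1)

# Synthetic "source" data set (stand-in for the data used for fitting).
X_source = rng.normal(size=(200, 3))
y_source = 5.0 + X_source @ np.array([0.5, -0.3, 0.2])

# Synthetic "target" data set with a different intercept and one different slope.
X_target = rng.normal(size=(200, 3))
y_target = 5.5 + X_target @ np.array([0.5, -0.1, 0.2])

# Fit on the source data only.
A = np.column_stack([np.ones(len(X_source)), X_source])
alpha_source, *_ = np.linalg.lstsq(A, y_source, rcond=None)

# Apply the same coefficients to both data sets, without refitting.
rmse_source = rmse(y_source, predict(X_source, alpha_source))
rmse_target = rmse(y_target, predict(X_target, alpha_source))
print(rmse_source, rmse_target)
```

In this synthetic setting the error on the target set is larger because its underlying relationship differs from the one the coefficients were fitted to.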


### File Descriptions

 * `winequality-red-training.csv` – Training set
 * `winequality-red-testing.csv` – Test set
 * `winequality-red-sample.csv` – Sample submission


### Deliverables (Part 3)

Documents to be submitted are as follows.

__Kaggle__: Every team should enter the Kaggle competition and submit a prediction file in the Kaggle format, as specified in `winequality-red-sample.csv`.

__GitHub__: Every team should commit and push files.
 1. A prediction file for the test set.
   * `ECEN689-Fall2018 -> Challenges -> 4Files -> Team## -> winequality-red-solution.csv`

```python
import pandas as pd
import numpy as np

winequality_red_training_df = pd.read_csv('4Files/winequality-red-training.csv')
print(winequality_red_training_df.shape)

winequality_red_testing_df = pd.read_csv('4Files/winequality-red-testing.csv')
print(winequality_red_testing_df.shape)

winequality_red_prediction_df = pd.read_csv('4Files/winequality-red-sample.csv')
print(winequality_red_prediction_df.shape)
```

### Deliverables (Part 4)

The fourth part of Challenge 4 is an attempt to draw insights from the linear regression model and the decision tree classifier in the context of model reuse.
One can use the file `winequality-white.csv` to fit a model.
A natural question is how well this model performs when applied to the `winequality-red.csv` data set.
You should reflect on this question.
Furthermore, you should describe how the accuracy of classifying wines as white or red can help predict the performance of model reuse in this context.
Findings should be submitted in a 2-page PDF report (single column, IEEE style).