# Multiple Linear Regression

Linear regression is a technique for predicting a real-valued output. Somewhat confusingly, problems in which a real value is predicted are called regression problems. Linear regression models the relationship between input and output values with a straight line. With more than one input, this line generalizes to **a plane or hyperplane**.

A prediction is a weighted combination of the input values. Each input attribute (x) is weighted by a coefficient (b), and the goal of the learning algorithm is to discover a set of coefficients that results in good predictions (y). Coefficients can be found using **gradient descent**.
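
As a minimal sketch of the weighted combination described above, the function below computes one prediction from plain Python lists. The name `predict` and the convention of storing the intercept as the first coefficient are assumptions made for illustration, not something the exercise prescribes.

```python
def predict(row, coefficients):
    """Return b0 + b1*x1 + ... + bn*xn for one row of inputs.

    coefficients[0] is the intercept (b0); coefficients[1:] line up with row.
    """
    yhat = coefficients[0]
    for x, b in zip(row, coefficients[1:]):
        yhat += b * x
    return yhat


# Two inputs with made-up coefficients: 0.5 + 2.0*1.0 + (-1.0)*3.0
print(predict([1.0, 3.0], [0.5, 2.0, -1.0]))  # -0.5
```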

In Simple Linear Regression, we used a statistical approach to find the coefficients. Here we will employ an iterative algorithm. Gradient descent minimizes a cost function by following its gradient downhill. This requires knowing the form of the cost function as well as its derivative, so that from any given point you can compute the gradient and step in the opposite direction, i.e. downhill towards the minimum value.
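
One common way to make this concrete, assuming a mean squared error cost (the text does not name a specific cost function), is shown below. The symbols are introduced here for illustration: $n$ is the number of training rows, $p$ the number of inputs, $\alpha$ the learning rate, and $x_{i0} = 1$ so that $b_0$ acts as the intercept.

$$
\hat{y}_i = \sum_{j=0}^{p} b_j x_{ij}, \qquad J(b) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

$$
\frac{\partial J}{\partial b_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_{ij}
$$

$$
b_j \leftarrow b_j - \alpha \, \frac{\partial J}{\partial b_j}
$$

The minus sign in the update is what moves the coefficients downhill: each coefficient is nudged against the slope of the cost at the current point.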

In machine learning, we can use stochastic gradient descent, a variant that evaluates and updates the coefficients at every iteration, to minimize the error of a model on our training data. The training instances are shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated, and the model is updated to reduce the error on the next prediction. This process is repeated for a fixed number of iterations.
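
The loop described above might look roughly like the following sketch in plain Python. The function names, the default learning rate, and the number of epochs are illustrative assumptions, and the toy data at the end is made up (it follows y = 1 + 2x exactly).

```python
def predict(row, coefficients):
    """Intercept is coefficients[0]; the rest line up with the inputs in row."""
    yhat = coefficients[0]
    for x, b in zip(row, coefficients[1:]):
        yhat += b * x
    return yhat


def coefficients_sgd(X, y, learning_rate=0.01, n_epochs=50):
    """Estimate coefficients by updating after every single training row."""
    coefficients = [0.0] * (len(X[0]) + 1)
    for _ in range(n_epochs):
        for row, target in zip(X, y):
            # Error of the current model on this one training instance
            error = predict(row, coefficients) - target
            # Step each coefficient against the gradient of the squared error
            coefficients[0] -= learning_rate * error
            for j, x in enumerate(row):
                coefficients[j + 1] -= learning_rate * error * x
    return coefficients


# Made-up toy data generated from y = 1 + 2x
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
print(coefficients_sgd(X, y, learning_rate=0.05, n_epochs=500))  # approximately [1.0, 2.0]
```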

## Exercise 1 - Explore the Data

After we develop our linear regression algorithm with gradient descent, we will use it to model the wine quality dataset. This dataset contains details of 4,898 white wines, including measurements such as acidity and pH. The goal is to use these objective measures to predict the wine quality on a scale from 0 to 10. You can learn more about the dataset on the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Wine+Quality).

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# read_csv accepts a path directly; the file uses ';' as its separator
data = pd.read_csv('winequality-white.csv', sep=';')
data.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
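
Beyond looking at the first rows, a few quick diagnostics can help with the exploration. The helper below is a hypothetical sketch (the name `summarize` and the choice of diagnostics are assumptions, not part of the exercise): it reports the shape, the number of missing values, the distribution of the target, and each feature's correlation with it.

```python
import pandas as pd


def summarize(data, target='quality'):
    """Return a few quick diagnostics for an all-numeric DataFrame."""
    return {
        # (rows, columns)
        'shape': data.shape,
        # Total count of missing cells across the whole frame
        'missing': int(data.isnull().sum().sum()),
        # How many rows fall at each value of the target
        'target_counts': data[target].value_counts().sort_index(),
        # Pearson correlation of every other column with the target
        'target_corr': data.corr()[target].drop(target).sort_values(),
    }
```

With the frame loaded above, `summarize(data)['target_corr']` would list which measurements move most strongly with quality, which is one possible starting point for plots.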


## Exercise 2 - Build Multiple Linear Regression 

For more information on the derivation, check out [these videos](https://www.coursera.org/learn/machine-learning/lecture/kCvQc/gradient-descent-for-linear-regression) or [this blog article](https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/).

The general steps are:
- Estimate coefficient values for the training data using gradient descent (try [batch, mini-batch or stochastic](http://sebastianruder.com/optimizing-gradient-descent/))
- Evaluate candidate coefficient values
- Tune learning rate and number of epochs
- Make predictions on out-of-sample data

The class or series of functions should do the following:
- Fit a set of X (wine measurements) and y (wine quality value)
- Predict y for new X based on the fitted coefficients
- Return the coefficients and intercept
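
One possible shape for such a class is sketched below using NumPy and batch gradient descent. The class name, the attribute names `coef_` and `intercept_` (borrowed from scikit-learn's naming convention), and the default hyperparameters are all assumptions for illustration. The sketch also assumes the inputs are on comparable scales; with raw wine measurements, which differ by orders of magnitude, the features would likely need to be standardized first or the updates may diverge.

```python
import numpy as np


class LinearRegressionGD:
    """Hypothetical sketch: linear regression fit by batch gradient descent."""

    def __init__(self, learning_rate=0.01, n_epochs=1000):
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.intercept_ = 0.0
        self.coef_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.intercept_ = 0.0
        self.coef_ = np.zeros(n_features)
        for _ in range(self.n_epochs):
            # Errors of the current model on the whole training set
            error = self.predict(X) - y
            # Gradient of half the mean squared error, applied to both parts
            self.intercept_ -= self.learning_rate * error.mean()
            self.coef_ -= self.learning_rate * (X.T @ error) / n_samples
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_


# Made-up data drawn from y = 3 + 1.5*x1 - 2*x2 to exercise the class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
model = LinearRegressionGD(learning_rate=0.1, n_epochs=500).fit(X_demo, y_demo)
print(model.intercept_, model.coef_)
```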

## Exercise 3 - Try it out on the Wine Data Set

- Split the data into training and testing sets
- Calculate the appropriate error metric
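
The two steps above could be sketched as follows. The helper names, the 80/20 split, and the choice of root mean squared error as the metric are assumptions for illustration; RMSE is a common choice for regression, but the exercise leaves the metric open.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```

Shuffling before splitting matters if the file happens to be ordered in some way, since an unshuffled split could leave the test set unrepresentative of the training set.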

## Exercise 4 - Check via Statsmodels or Scikit-learn
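
A minimal sketch of such a check with scikit-learn is shown below. It fits `LinearRegression` on the same data and compares its `coef_` and `intercept_` with values you supply from your own model. The helper name `matches_reference` and the tolerance are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def matches_reference(X, y, coef, intercept, atol=1e-3):
    """Compare hand-rolled coefficients against scikit-learn's least-squares fit."""
    reference = LinearRegression().fit(X, y)
    return bool(
        np.allclose(coef, reference.coef_, atol=atol)
        and np.isclose(intercept, reference.intercept_, atol=atol)
    )
```

Note that the comparison is only meaningful if both models see the same features; if you scaled the inputs before running gradient descent, fit the reference model on the scaled inputs as well.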

# Additional Optional Exercises

- Add proper documentation for class methods and attributes.
- Try additional values for learning rate and epochs. What happens if the learning rate is too large?
- How can you tell when the algorithm has converged (tip: try to use relative values instead of absolute ones)? Add an optional parameter to stop iterating once it has converged (or if it is diverging).
- Compare the analytical solution to Batch Gradient Descent and find an empirical threshold before and after which it would be better to use either one or the other approach.
- When would you use Mini-Batch or Stochastic Gradient Descent? Try implementing them.
- Can you write code that dynamically chooses an optimal learning rate given the data at hand?
- Read about [Total Error vs. Average Error](http://stats.stackexchange.com/a/155581)
- Try re-writing the code with NumPy's more usual `ndarray`s.
- Sort the dataset before separating it into training and test sets and see how it performs.
- Try improving the R2 metric by doing some feature engineering.
- Type A vs. Type B
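
For the convergence and analytical-solution items in the list above, one possible starting point is sketched below. It stops batch gradient descent when the relative improvement in cost falls under a tolerance, flags a run whose cost goes up, and computes the least-squares solution with `np.linalg.lstsq` for comparison. The function names, the status strings, and the default tolerance are assumptions for illustration, and the divergence check is deliberately crude.

```python
import numpy as np


def fit_batch_gd(X, y, learning_rate=0.1, max_epochs=10000, rel_tol=1e-9):
    """Batch gradient descent with a relative-improvement stopping rule.

    Returns (coefficients with the intercept first, epochs run, status).
    """
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    b = np.zeros(A.shape[1])
    previous_cost = None
    for epoch in range(1, max_epochs + 1):
        error = A @ b - y
        cost = float(np.mean(error ** 2))
        if previous_cost is not None:
            # A rising (or non-finite) cost suggests the learning rate is too large
            if not np.isfinite(cost) or cost > previous_cost:
                return b, epoch, 'diverging'
            # Stop once the improvement is tiny relative to the cost itself
            if previous_cost - cost <= rel_tol * previous_cost:
                return b, epoch, 'converged'
        previous_cost = cost
        b = b - learning_rate * (A.T @ error) / len(y)
    return b, max_epochs, 'max_epochs'


def fit_analytical(X, y):
    """Least-squares solution, intercept first, for comparison."""
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)[0]


# Made-up noisy data drawn around y = 2 + x1 - 3*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 2.0 + X_demo @ np.array([1.0, -3.0]) + rng.normal(scale=0.1, size=300)
b_gd, epochs_run, status = fit_batch_gd(X_demo, y_demo)
print(status, epochs_run, b_gd, fit_analytical(X_demo, y_demo))
```

Timing both functions while growing the number of rows and columns is one way to look for the empirical threshold mentioned in the list.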