# Machine Learning: Regression

# Table Of Contents
* [scikit-learn](#scikitlearn)
* [Math](#math)
* [Linear Models](#linearmodels)
    * [Linear Regression](#linearregression)
    * [Lasso](#lasso)
    * [ElasticNet](#elastic)
* [Support Vector Machines](#svm)
* [Stochastic Gradient Descent](#sgd)
* [K-Nearest Neighbors](#knn)

In [2]:
import pandas as pd
import numpy as np

# scikit-learn Basics <a class="anchor" id="scikitlearn"/>

It is generally not considered good practice to import the entire scikit-learn package in one statement; instead, import only the modules and classes that are needed at the time. For instance, to use `LinearRegression` you can do the following:

In [3]:
from sklearn.linear_model import LinearRegression

scikit-learn also includes a number of useful datasets, which can be imported directly. Once the dataset has been loaded, in this case the Boston Housing Prices dataset, the features are available as `boston.data` and the targets as `boston.target`; a `pandas.DataFrame` is a convenient way to hold and inspect them.

In [4]:
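# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older release.
from sklearn.datasets import load_boston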

boston = load_boston()
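
As a minimal sketch of holding a scikit-learn dataset in a `pandas.DataFrame`, the snippet below uses `load_diabetes` as a stand-in (an assumption made here, since `load_boston` is not available in scikit-learn 1.2 and later). The same pattern should apply to any dataset object exposing `data`, `target`, and `feature_names`; the column name `target` and the variable names are chosen for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): any Bunch with data/target/feature_names works.
dataset = load_diabetes()

# One column per feature, plus the regression target as a final column.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target

print(df.shape)
print(df.head())
```

Keeping the target in the same frame makes it easy to inspect with `df.describe()`, but the features and target still need to be passed separately to an estimator.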

# Math <a class="anchor" id="math"/>

There is no getting around the fact that machine learning is built heavily upon mathematics. For this series of meetups about machine learning, we decided that "a light dusting" would be the right amount.

# Machine Learning Workflow <a class="anchor" id="machinelearningworkflow"/>

When we work with datasets in machine learning, we have to partition the data in such a way that we know that the model is actually learning instead of simply memorizing the data that was previously given to it.

## Scikit-Learn's train_test_split
Scikit-Learn has a built-in function, `train_test_split`, for splitting a dataset into train and test sets, which is much more reliable than trying to split a dataset by hand. You can specify the train/test ratio and seed the random number generator so that the split is reproducible.

In [5]:
from sklearn.model_selection import train_test_split

Once loaded, we can create our train/test sets, with the convention that `X` holds the features and `y` holds the target values. We pass `boston.data` as the features and `boston.target` as the targets; `train_test_split` returns a training portion and a test portion of each.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.3, random_state=0)
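
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on synthetic stand-in data (100 random samples, an assumption made so the example does not depend on any particular dataset). The variable names are illustrative and deliberately differ from the ones used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = rng.rand(100)

# Same arguments as above: 30% held out, fixed seed.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
print(Xd_train.shape, Xd_test.shape)

# Re-running with the same seed reproduces the split exactly.
Xd_train2, Xd_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
print(np.array_equal(Xd_test, Xd_test2))
```

Fixing `random_state` is what makes scores comparable from one run to the next; without it, each run evaluates on a different test set.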

## SGDRegressor <a class="anchor" id="sgd"/>

According to the diagram, Stochastic Gradient Descent works best when we have greater than 100,000 samples. While the Boston Housing dataset does not, we can find a regression dataset that does. For this, we can turn to the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets.html?format=&task=reg&att=&area=&numAtt=&numIns=&type=&sort=instDown&view=table), in this case the [SGEMM GPU Kernel Performance Dataset](http://archive.ics.uci.edu/ml/datasets/SGEMM+GPU+kernel+performance).

In [30]:
from sklearn.linear_model import SGDRegressor

sgemm = pd.read_csv('./data/sgemm_product.csv')
sgemm.describe()

| | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | KWI | VWM | VWN | STRM | STRN | SA | SB | Run1 (ms) | Run2 (ms) | Run3 (ms) | Run4 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 | 241600.0 |
| mean | 80.415364 | 80.415364 | 25.513113 | 13.935894 | 13.935894 | 17.371126 | 17.371126 | 5.0 | 2.448609 | 2.448609 | 0.5 | 0.5 | 0.5 | 0.5 | 217.647852 | 217.579536 | 217.532756 | 217.527669 |
| std | 42.46922 | 42.46922 | 7.855619 | 7.873662 | 7.873662 | 9.389418 | 9.389418 | 3.000006 | 1.953759 | 1.953759 | 0.500001 | 0.500001 | 0.500001 | 0.500001 | 369.012422 | 368.677309 | 368.655118 | 368.677413 |
| min | 16.0 | 16.0 | 16.0 | 8.0 | 8.0 | 8.0 | 8.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.29 | 13.25 | 13.36 | 13.37 |
| 25% | 32.0 | 32.0 | 16.0 | 8.0 | 8.0 | 8.0 | 8.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40.66 | 40.71 | 40.66 | 40.64 |
| 50% | 64.0 | 64.0 | 32.0 | 8.0 | 8.0 | 16.0 | 16.0 | 5.0 | 2.0 | 2.0 | 0.5 | 0.5 | 0.5 | 0.5 | 69.825 | 69.93 | 69.79 | 69.82 |
| 75% | 128.0 | 128.0 | 32.0 | 16.0 | 16.0 | 32.0 | 32.0 | 8.0 | 4.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 228.53 | 228.31 | 228.32 | 228.32 |
| max | 128.0 | 128.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 8.0 | 8.0 | 8.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3339.63 | 3375.42 | 3397.08 | 3361.71 |
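
Fitting an `SGDRegressor` might look like the sketch below. It uses synthetic stand-in data rather than the SGEMM CSV, because the text does not say which run column to treat as the target; the feature scales, coefficients, and variable names are all assumptions for illustration. Features are standardized first, since gradient-based updates are sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three features on very different scales.
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(5000, 3)) * np.array([1.0, 10.0, 100.0])
y_syn = X_syn @ np.array([2.0, 0.3, -0.01]) + rng.normal(scale=0.1, size=5000)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# Standardize the features, then fit a linear model with stochastic gradient descent.
sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
sgd.fit(Xs_train, ys_train)

print("Training set score: {:.2f}".format(sgd.score(Xs_train, ys_train)))
print("Test set score: {:.2f}".format(sgd.score(Xs_test, ys_test)))
```

Wrapping the scaler and the regressor in one pipeline means the scaling statistics are learned from the training portion only and then reused on the test portion.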


# Linear Models <a class="anchor" id="linearmodels"/>
In regression, the goal is to predict a value for a given set of data. The equation that linear models try to solve for is:

$$\hat{y} = w[0] \cdot x[0] + w[1] \cdot x[1] + \dots + w[p] \cdot x[p] + b$$

Here, `x[0]` to `x[p]` denote the features of a single sample, and `w` and `b` are the parameters of the model that are learned.
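
As a quick worked example of the prediction equation, the sketch below evaluates it for one sample with three features. The weights, offset, and feature values are made up for illustration, not learned from any data.

```python
import numpy as np

# Hypothetical parameters for a model with three features.
w = np.array([0.5, -1.0, 2.0])
b = 4.0

# One sample with features x[0], x[1], x[2].
x = np.array([2.0, 1.0, 3.0])

# y_hat = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b
y_hat = np.dot(w, x) + b
print(y_hat)
```

For a whole dataset, the same computation is `X @ w + b`, which yields one prediction per row of `X`.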

### Linear Regression <a class="anchor" id="linearregression"/>
The simplest form of Linear Regression is known as Ordinary Least Squares (OLS), which minimizes the mean squared error between predictions and the true regression targets, on the training set. 

In [9]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The weights, or coefficients, are stored in the `coef_` attribute, while the intercept, or offset, is stored in `intercept_`.

In [10]:
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))

lr.coef_: [ -1.19858618e-01   4.44233009e-02   1.18612465e-02   2.51295058e+00
  -1.62710374e+01   3.84909910e+00  -9.85471557e-03  -1.50002715e+00
   2.41507916e-01  -1.10671867e-02  -1.01897720e+00   6.95273216e-03
  -4.88110587e-01]
lr.intercept_: 37.99259277034278
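
To connect `coef_` and `intercept_` back to ordinary least squares, the following sketch solves the same problem directly with `numpy.linalg.lstsq` on synthetic data and compares the result to `LinearRegression`. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: known weights plus an offset of 5.0 and a little noise.
rng = np.random.RandomState(0)
X_ols = rng.normal(size=(200, 4))
y_ols = X_ols @ np.array([1.5, -2.0, 0.0, 3.0]) + 5.0 + rng.normal(scale=0.1, size=200)

ols_model = LinearRegression().fit(X_ols, y_ols)

# Least squares with an explicit column of ones standing in for the intercept.
A = np.column_stack([X_ols, np.ones(len(X_ols))])
solution, *_ = np.linalg.lstsq(A, y_ols, rcond=None)

# The first entries match coef_, the last entry matches intercept_.
print(np.allclose(ols_model.coef_, solution[:-1]))
print(np.isclose(ols_model.intercept_, solution[-1]))
```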


Once we have fitted our model, we can get to the good stuff: finding out how well it actually performs. For regressors, `score` returns the coefficient of determination, $R^2$, where 1.0 is a perfect fit.

In [14]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.76
Test set score: 0.67
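
The value returned by `score` for a regressor is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on synthetic data (an assumption, so the example stands on its own) and checks it against `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X_r2 = rng.normal(size=(150, 2))
y_r2 = X_r2 @ np.array([1.0, -3.0]) + rng.normal(scale=0.5, size=150)

reg_r2 = LinearRegression().fit(X_r2, y_r2)
pred = reg_r2.predict(X_r2)

# Residual sum of squares versus total sum of squares around the mean.
ss_res = np.sum((y_r2 - pred) ** 2)
ss_tot = np.sum((y_r2 - y_r2.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, reg_r2.score(X_r2, y_r2)))
```

A model that always predicts the mean of `y` scores 0, and a model that does worse than that gets a negative score.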


### Ridge regression
The primary difference between OLS and ridge lies in the addition of a constraint: we want the magnitude of the coefficients to be as small as possible. This means that each feature should have a very small effect on the outcome, while the model still performs well overall. This constraint is known as regularization, which restricts a model to avoid overfitting. Ridge uses `L2` regularization.

In [13]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Training set score: 0.76
Test set score: 0.67


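We can adjust ridge's regularization strength, the `alpha` parameter, if we think it will give us better performance. By default, `alpha` is set to `1`. Lowering it weakens the regularization, which may give us better performance; on this split the scores are unchanged to two decimal places.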

In [17]:
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

Training set score: 0.76
Test set score: 0.67
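
To see what `alpha` does to the coefficients themselves, the sketch below fits `Ridge` at several strengths on synthetic data (an assumption for illustration) and prints the L2 norm of `coef_`. Larger `alpha` values should pull the coefficients toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_rdg = rng.normal(size=(50, 10))
y_rdg = X_rdg @ rng.normal(size=10) + rng.normal(scale=1.0, size=50)

alphas = [0.1, 1.0, 10.0, 100.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X_rdg, y_rdg).coef_
    # The L2 norm summarizes the overall magnitude of the coefficients.
    norms.append(np.linalg.norm(coef))
    print("alpha={:>6}: ||coef||_2 = {:.3f}".format(alpha, norms[-1]))
```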


### Lasso <a class="anchor" id="lasso"/>
Lasso is similar to Ridge in that it uses regularization, but instead of `L2` regularization Lasso uses `L1` regularization, which drives some coefficients to exactly zero.

In [20]:
from sklearn.linear_model import Lasso

lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

Training set score: 0.71
Test set score: 0.61
Number of features used: 10
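
The effect of `L1` regularization on the number of features used can be sketched on synthetic data in which only the first three of ten features influence the target (an assumption for illustration). As `alpha` grows, more coefficients are set to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_las = rng.normal(size=(500, 10))

# Only the first three features carry signal; the rest are noise.
true_coef = np.zeros(10)
true_coef[:3] = [3.0, -2.0, 1.5]
y_las = X_las @ true_coef + rng.normal(scale=0.5, size=500)

nonzero_counts = []
for alpha in [0.1, 1.0, 10.0]:
    lasso_demo = Lasso(alpha=alpha).fit(X_las, y_las)
    nonzero_counts.append(int(np.sum(lasso_demo.coef_ != 0)))
    print("alpha={}: features used = {}".format(alpha, nonzero_counts[-1]))
```

This is why Lasso is sometimes used as a form of feature selection: the features whose coefficients survive are the ones the model considers most useful at that regularization strength.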


### ElasticNet <a class="anchor" id="elastic"/>
ElasticNet uses both `L1` and `L2` regularization, with the `l1_ratio` hyperparameter controlling the ratio between `L1` and `L2`. 

In [23]:
from sklearn.linear_model import ElasticNet

elastic = ElasticNet().fit(X_train, y_train)
print("Training set score: {:.2f}".format(elastic.score(X_train, y_train)))
print("Test set score: {:.2f}".format(elastic.score(X_test, y_test)))

Training set score: 0.71
Test set score: 0.62
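
The role of `l1_ratio` can be checked at its extreme: with `l1_ratio=1.0` the penalty is pure `L1`, so `ElasticNet` should match `Lasso` at the same `alpha`. The sketch below verifies this on synthetic data (an assumption for illustration) and then fits a mix weighted toward `L2`. The default used in the cell above is `l1_ratio=0.5`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X_en = rng.normal(size=(200, 6))
y_en = X_en @ np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

# l1_ratio=1.0 means a pure L1 penalty, which is the Lasso objective.
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_en, y_en)
lasso_ref = Lasso(alpha=0.1).fit(X_en, y_en)
print(np.allclose(enet_l1.coef_, lasso_ref.coef_))

# A smaller l1_ratio shifts weight toward the L2 penalty.
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X_en, y_en)
print(enet_mix.coef_)
```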


## Support Vector Machines <a class="anchor" id="svm"/>
Support Vector Machines are another option for solving regression problems. The math behind them is complex, but the use of different kernels can allow them to work on many different types of problems.

In [34]:
from sklearn import svm

reg = svm.SVR(kernel='linear')
reg.fit(X_train, y_train)
print("Training set score: {:.2f}".format(reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(reg.score(X_test, y_test)))

Training set score: 0.74
Test set score: 0.62
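
To illustrate why the choice of kernel matters, the sketch below fits `SVR` with a linear and an RBF kernel on a noisy sine curve. The data is synthetic and `C=10.0` is an arbitrary choice, both assumptions for illustration. A straight line cannot follow the curve, while the RBF kernel can.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic nonlinear data: one period of a sine wave with a little noise.
rng = np.random.RandomState(0)
X_sin = rng.uniform(0.0, 2.0 * np.pi, size=(300, 1))
y_sin = np.sin(X_sin).ravel() + rng.normal(scale=0.05, size=300)

Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    X_sin, y_sin, test_size=0.3, random_state=0
)

scores = {}
for kernel in ["linear", "rbf"]:
    svr = SVR(kernel=kernel, C=10.0).fit(Xk_train, yk_train)
    scores[kernel] = svr.score(Xk_test, yk_test)
    print("{} kernel test score: {:.2f}".format(kernel, scores[kernel]))
```

On real data with features on different scales, it is usually worth standardizing the inputs before fitting an SVM, since the kernels depend on distances between samples.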


## K-Nearest Neighbors <a class="anchor" id="knn"/>
K-nearest neighbors regression predicts the target by local interpolation of the targets associated with the nearest neighbors in the training set.

In [40]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("Training set score: {:.2f}".format(knn.score(X_train, y_train)))
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Training set score: 0.69
Test set score: 0.52
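
With the default uniform weights, the "local interpolation" is simply the mean target of the `k` nearest training points. The sketch below reproduces a `KNeighborsRegressor` prediction by hand on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_knn = rng.normal(size=(100, 2))
y_knn = rng.normal(size=100)

knn_demo = KNeighborsRegressor(n_neighbors=5).fit(X_knn, y_knn)

# Query points that are not part of the training set.
X_query = rng.normal(size=(3, 2))

# Indices of the 5 nearest training points for each query point.
_, neighbor_idx = knn_demo.kneighbors(X_query)

# With uniform weights, the prediction is the mean target of those neighbors.
manual_pred = y_knn[neighbor_idx].mean(axis=1)
print(np.allclose(manual_pred, knn_demo.predict(X_query)))
```

Because predictions depend on distances, features with large numeric ranges dominate the neighbor search unless the inputs are scaled first.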


## Ensemble Methods 


### Random Forest Regressor

In [41]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor().fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(rfr.score(X_test, y_test)))

Training set score: 0.97
Test set score: 0.80
