# Table Of Contents
* [Supervised vs. Unsupervised Learning](#supervised)
* [scikit-learn](#scikitlearn)
* [Math](#math)
* [Machine Learning Workflow](#machinelearningworkflow)
    * [scikit-learn's train-test-split](#traintestsplit)
* [Accuracy and Context](#accuracy)
* [Generalization, Underfitting and Overfitting](#generalization)
* [Linear Models](#linearmodels)
    * [Linear Regression](#linearregression)
    * [Linear Regression Math Example](#mathexample)
    * [Lasso](#lasso)
    * [ElasticNet](#elastic)
* [Support Vector Machines](#svm)
* [Stochastic Gradient Descent](#sgd)
* [K-Nearest Neighbors](#knn)

In [1]:
import pandas as pd
import numpy as np

# Supervised vs. Unsupervised Learning <a class="anchor" id="supervised"/>

## Supervised
In supervised learning, we want to predict a certain outcome from a given input, and we have examples of input/output pairs. We can build a machine learning model based on those input/output pairs, which comprise our training set.

## Unsupervised
Unsupervised learning, on the other hand, is machine learning where there is no known ouput, no teacher to instruct the learning algorithm. Where we have the answers in supervised learning, we will just show the input data to algorithm and ask it to extract knowledge from that data.

# scikit-learn Basics <a class="anchor" id="scikitlearn"/>

It is generally not considered good practice to import the entire scikit-learn package as one statement, but rather to import only the packages that are needed at the time. For instance, if you wanted to use `LinearRegression` you can do the following:

In [2]:
from sklearn.linear_model import LinearRegression

scikit-learn also includes a number of useful datasets, which can be imported directly. Once the dataset has been loaded, in this case the Boston Housing Prices dataset, we use a `pandas.DataFrame` to hold the data.

In [3]:
from sklearn.datasets import load_boston
## display all datasets in a list

boston = load_boston()

# Math

There is no getting around the fact that machine learning is built heavily upon mathematics. For this series of meetups about machine learning, we decided that "a light dusting" would be the right amount.

# Machine Learning Workflow <a class="anchor" id="machinelearningworkflow"/>

When we work with datasets in supervised machine learning, we have to partition the data in such a way that we know that the model is actually learning instead of simply memorizing the data that was previously given to it.

## Scikit-Learn's train_test_split <a class="anchor" id="traintestsplit"/>
Scikit-Learn has a built-in method for splitting a dataset into train and test sets and is much better than trying to split a dataset by hand. You can specify the ratio of train/test, along with seeding a random number generator.

In [4]:
from sklearn.model_selection import train_test_split

Once loaded, we can create our train/test sets, with the convention that `X` is the training set and `y` is the test set. We will use the `boston.data` for our training set and `boston.target` for our test set. This is a supervised learning model and as such we have an input X and output y.

In [5]:
X = boston.data
X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.3, random_state=0)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (354, 13)
y_train shape: (354,)
X_test shape: (152, 13)
y_test shape: (152,)


In the previous split we end up with four numpy arrays; two are used for training `X_train` and `y_train` and two are used for testing `X_test` and `y_test`.

## K-Nearest Neighbors <a class="anchor" id="knn"/>
An algorithm that predicts either class membership or value based on surrounding data points. `k` refers to the number of neighbors that should be examined, generally a `k` value between 3 and 5 tends to work best.

#### Pros:
* Easily understood
* Gives reasonable performance without too many adjustments
* Very fast training speed

#### Cons:
* Prediction can be very slow on larger datasets
* Requires preprocessing of the data
* Performs badly on datasets with a high number of features (>100)
* Performs very badly on datasets where most features are 0

In [6]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("Training set score: {:.2f}".format(knn.score(X_train, y_train)))
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Training set score: 0.69
Test set score: 0.52


## Accuracy and Context <a class="anchor" id="accuracy"/>

While $R^2$ is the metric that is used for regression in this presentation, there are other metrics that might be more appropriate for your application. You should always be thinking about the high-level goal of the application, often called the _business metric_. An explanation of all of the different metrics is outside of the scope of a single meetup, but there are good resources out there.

## Generalization, Underfitting and Overfitting <a class="anchor" id="generalization"/>
When training a model, the goal isn't to have the model perform well on the training set, but rather to have the model perform well on examples it hasn't seen yet. If a model is able to make accurate predictions on unseen data, we say it is able to _generalize_ from the training set to the test set. 

Underfitting is scenario in which the model doesn't learn enough from the training data. This generally comes from building a model that is _too simple_. A symptom of a model that is too simple and thus underfitting is poor performance on the training set, and truly terrible performance on the testing set.

Overfitting, on the other hand, happens when the model is _too complex_ for data it was trained on. A model that has extremely good performance on the training set while having lackluster performance on the testing set is likely to be overfitting.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

<Figure size 1400x500 with 3 Axes>

# Linear Models <a class="anchor" id="linearmodels"/>
In regression, the goal is to predict a value for a given set of data. The equation that linear models try to solve for is:

$\hat{y} = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b$

Here, $\hat{y}$, called 'y-hat' is the value we are trying to predict, `x[0]` to `x[p]` denotes the features and `w` and `b` are the parameters of the models that are learned.

#### Pros:
* Very fast to both train and predict
* Easy to understand how a prediction was made
* Can scale to very large datasets
* Work well with sparse data
* Very few parameters to tune

#### Cons:
* Coefficients can be hard to interpret
* Datasets with few features might not perform well
* Require preprocessing of data

### Linear Regression
The simplest form of Linear Regression is known as Ordinary Least Squares (OLS), which minimizes the mean squared error between predictions and the true regression targets, on the training set. 

In [8]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The weights or coefficients, are stored in the `coef_` attribute, while the intercept or offset is stored in `intercept_`.

In [9]:
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))

lr.coef_: [-1.19858618e-01  4.44233009e-02  1.18612465e-02  2.51295058e+00
 -1.62710374e+01  3.84909910e+00 -9.85471557e-03 -1.50002715e+00
  2.41507916e-01 -1.10671867e-02 -1.01897720e+00  6.95273216e-03
 -4.88110587e-01]
lr.intercept_: 37.99259277034393


### Linear Regression Math Example <a class="anchor" id="mathexample"/>

In [10]:
y_hat = 0
row_sample = X_train[0]
for i,j in zip(lr.coef_, row_sample):
    y_hat += i * j

print("Our prediction: {:.2f}".format(y_hat + lr.intercept_))

print("Model prediction: {:.2f}".format(lr.predict(row_sample.reshape(1, -1))[0]))

Our prediction: 4.57
Model prediction: 4.57


Once we have fitted our model, we can get to the good stuff, finding out how well our model actually performs.

In [12]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.76


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

### Ridge regression
The primary difference between linear regression and ridge regression lies in the addition of a constraint; we want the coefficients to be as small as possible. This means that each feature should have a very small effect on the outcome, while still having good overall model performance. This constraint is known as regularization, which restricts a model to avoid overfitting. Ridge uses `L2` regularization, which squares the coefficient value. The parameter `alpha` controls how strong the the regularization effect is; a higher alpha value can help generalization by forcing coefficients to zero. 

In [13]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Training set score: 0.76


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

### Lasso <a class="anchor" id="lasso"/>
Lasso is similar to Ridge in that it uses regularization, but instead of `L2` regularization Lasso uses `L1` regularization, which takes the absolute value of the coefficient, instead of taking the square of the coefficient as `L2` does. We have the same alpha parameter as we did in Ridge which affects how strongly the regularization effect is.

In [14]:
from sklearn.linear_model import Lasso

lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

Training set score: 0.71


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

### ElasticNet
ElasticNet uses both `L1` and `L2` regularization, with the `l1_ratio` hyperparameter controlling the ratio between `L1` and `L2`. 

In [15]:
from sklearn.linear_model import ElasticNet

elastic = ElasticNet().fit(X_train, y_train)
print("Training set score: {:.2f}".format(elastic.score(X_train, y_train)))
print("Test set score: {:.2f}".format(elastic.score(X_test, y_test)))

Training set score: 0.71


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

## SGDRegressor <a class="anchor" id="sgd"/>


According to the diagram, Stochastic Gradient Descent works best when we have greater than 100,000 samples. While the Boston Housing dataset does not, we can find a regression dataset that does. For this, we can turn to the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets.html?format=&task=reg&att=&area=&numAtt=&numIns=&type=&sort=instDown&view=table), in this case the [SGEMM GPU Kernel Performance Dataset](http://archive.ics.uci.edu/ml/datasets/SGEMM+GPU+kernel+performance).

In [16]:
from sklearn.linear_model import SGDRegressor

sgemm = pd.read_csv('./data/sgemm_product.csv')
sgemm.describe()

Unnamed: 0,MWG,NWG,KWG,MDIMC,NDIMC,MDIMA,NDIMB,KWI,VWM,VWN,STRM,STRN,SA,SB,Run1 (ms),Run2 (ms),Run3 (ms),Run4 (ms)
count,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0,241600.0
mean,80.415364,80.415364,25.513113,13.935894,13.935894,17.371126,17.371126,5.0,2.448609,2.448609,0.5,0.5,0.5,0.5,217.647852,217.579536,217.532756,217.527669
std,42.46922,42.46922,7.855619,7.873662,7.873662,9.389418,9.389418,3.000006,1.953759,1.953759,0.500001,0.500001,0.500001,0.500001,369.012422,368.677309,368.655118,368.677413
min,16.0,16.0,16.0,8.0,8.0,8.0,8.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,13.29,13.25,13.36,13.37
25%,32.0,32.0,16.0,8.0,8.0,8.0,8.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,40.66,40.71,40.66,40.64
50%,64.0,64.0,32.0,8.0,8.0,16.0,16.0,5.0,2.0,2.0,0.5,0.5,0.5,0.5,69.825,69.93,69.79,69.82
75%,128.0,128.0,32.0,16.0,16.0,32.0,32.0,8.0,4.0,4.0,1.0,1.0,1.0,1.0,228.53,228.31,228.32,228.32
max,128.0,128.0,32.0,32.0,32.0,32.0,32.0,8.0,8.0,8.0,1.0,1.0,1.0,1.0,3339.63,3375.42,3397.08,3361.71


## Support Vector Machines <a class="anchor" id="svm"/>
Support Vector Machines are another option for solving regression problems. The math behind them is complex, but they use of different kernels can allow them to work on many different types of problems.

### Pros:
* Perform well on a variety of datasets
* Work well on both high and low dimensional datasets

### Cons:
* Poor performance on larger scale datasets (>100K)
* Require careful pre-processing of data
* Difficult to inspect why predictions are made

In [17]:
from sklearn import svm

reg = svm.SVR(kernel='linear')
reg.fit(X_train, y_train)
print("Training set score: {:.2f}".format(reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(reg.score(X_test, y_test)))

Training set score: 0.74


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

## Decision Tree
As the name implies, decision tree models split the features of the dataset into a tree structure that is followed for every sample in the dataset.

### Pros:
* Easy to visualize and understand
* No pre-processing needed

### Cons:
* Even with pre-pruning, tend to overfit

In [18]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
print("Training set score: {:.2f}".format(dtr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(dtr.score(X_test, y_test)))

Training set score: 0.93


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151516 0.16161616 0.17171717
 0.18181819 0.1919192  0.2020202  0.21212122 0.22222222 0.23232323
 0.24242425 0.25252524 0.26262626 0.27272728 0.28282827 0.2929293
 0.3030303  0.3131313  0.32323232 0.33333334 0.34343433 0.35353535
 0.36363637 0.37373737 0.3838384  0.3939394  0.4040404  0.41414142
 0.42424244 0.43434343 0.44444445 0.45454547 0.46464646 0.47474748
 0.4848485  0.4949495  0.5050505  0.5151515  0.5252525  0.53535354
 0.54545456 0.5555556  0.56565654 0.57575756 0.5858586  0.5959596
 0.6060606  0.61616164 0.6262626  0.6363636  0.64646465 0.65656567
 0.6666667  0.67676765 0.68686867 0.6969697  0.7070707  0.7171717
 0.72727275 0.7373737  0.74747473 0.75757575 0.7676768  0.7777778
 0.7878788  0.7979798  0.8080808  0.8181818  0.82828283 0.83838385
 0.8484849  0.85858583 0.86868685 0.8787879  0.8888889  0.8989899
 0.90909094 0.9191919  0.9292929  0.93939394 0.94949496 0.959596
 0.969697   0.97979796 0.989899   1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

### Visualizing a decision tree

In [19]:
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(dtr, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

ModuleNotFoundError: No module named 'pydotplus'

# Ensemble Models 
_Ensembles_ are methods that combine multiple machine learning models, also called _weak estimators_, to create more powerful models. While there are many different models that belong in this category, the two that we will look at are both based on decision trees, random forests and gradient boosted decision trees.

## Random Forest Regressor <a class="anchor" id="random_forest"/>
A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea is that each tree might do a relatively good job of predicting, but will likely overfit on some part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging the results.

#### Pros:
* Very powerful
* Often work well without heavy parameter tuning
* Doesn't require data scaling


#### Cons:
* More difficult to visualize than regular decision trees
* Don't work well on very high dimensional, sparse data
* Require more CPU/RAM than linear models
* Slower to train than linear models

In [20]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor().fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(rfr.score(X_test, y_test)))

Training set score: 0.98


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151516 0.16161616 0.17171717
 0.18181819 0.1919192  0.2020202  0.21212122 0.22222222 0.23232323
 0.24242425 0.25252524 0.26262626 0.27272728 0.28282827 0.2929293
 0.3030303  0.3131313  0.32323232 0.33333334 0.34343433 0.35353535
 0.36363637 0.37373737 0.3838384  0.3939394  0.4040404  0.41414142
 0.42424244 0.43434343 0.44444445 0.45454547 0.46464646 0.47474748
 0.4848485  0.4949495  0.5050505  0.5151515  0.5252525  0.53535354
 0.54545456 0.5555556  0.56565654 0.57575756 0.5858586  0.5959596
 0.6060606  0.61616164 0.6262626  0.6363636  0.64646465 0.65656567
 0.6666667  0.67676765 0.68686867 0.6969697  0.7070707  0.7171717
 0.72727275 0.7373737  0.74747473 0.75757575 0.7676768  0.7777778
 0.7878788  0.7979798  0.8080808  0.8181818  0.82828283 0.83838385
 0.8484849  0.85858583 0.86868685 0.8787879  0.8888889  0.8989899
 0.90909094 0.9191919  0.9292929  0.93939394 0.94949496 0.959596
 0.969697   0.97979796 0.989899   1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

## Gradient Boosted Regression Trees <a class="anchor" id="gbt"/>
A tree-based method that works by building trees in a serial manner, with each tree trying to correct errors made in previous trees. Gradient boosted trees are often very shallow of depth one to five, which makes the model smaller in memory usage and makes predictions faster. The basic idea is to combine many simple models or _weak learners_ into one stronger model. 

### Pros:
* Very powerful
* Widely used

### Cons:
* Requires careful tuning of the hyperparameters
* Doesn't work very well on high-dimensional sparse data
* May take a long time to train, compared to linear models

In [21]:
from sklearn.ensemble import GradientBoostingRegressor

gbt = GradientBoostingRegressor().fit(X_train, y_train)
print("Training set score: {:.2f}".format(gbt.score(X_train, y_train)))
print("Test set score: {:.2f}".format(gbt.score(X_test, y_test)))

Training set score: 0.98


ValueError: Expected 2D array, got 1D array instead:
array=[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151516 0.16161616 0.17171717
 0.18181819 0.1919192  0.2020202  0.21212122 0.22222222 0.23232323
 0.24242425 0.25252524 0.26262626 0.27272728 0.28282827 0.2929293
 0.3030303  0.3131313  0.32323232 0.33333334 0.34343433 0.35353535
 0.36363637 0.37373737 0.3838384  0.3939394  0.4040404  0.41414142
 0.42424244 0.43434343 0.44444445 0.45454547 0.46464646 0.47474748
 0.4848485  0.4949495  0.5050505  0.5151515  0.5252525  0.53535354
 0.54545456 0.5555556  0.56565654 0.57575756 0.5858586  0.5959596
 0.6060606  0.61616164 0.6262626  0.6363636  0.64646465 0.65656567
 0.6666667  0.67676765 0.68686867 0.6969697  0.7070707  0.7171717
 0.72727275 0.7373737  0.74747473 0.75757575 0.7676768  0.7777778
 0.7878788  0.7979798  0.8080808  0.8181818  0.82828283 0.83838385
 0.8484849  0.85858583 0.86868685 0.8787879  0.8888889  0.8989899
 0.90909094 0.9191919  0.9292929  0.93939394 0.94949496 0.959596
 0.969697   0.97979796 0.989899   1.        ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

# Preprocessing and Feature Engineering

## Preprocessing <a class="anchor" id="preprocess"/>

Earlier we talked about how some machine learning algorithms required some level of data scaling, which means that we need to reduce the magnitude of the distance between the largest values and smallest values. To do this, we can use scikit-learn's `MinMaxScaler`.


In [22]:
from sklearn.preprocessing import MinMaxScaler

X = boston.data

X = MinMaxScaler().fit_transform(boston.data)
print(X)

[[0.00000000e+00 1.80000000e-01 6.78152493e-02 ... 2.87234043e-01
  1.00000000e+00 8.96799117e-02]
 [2.35922539e-04 0.00000000e+00 2.42302053e-01 ... 5.53191489e-01
  1.00000000e+00 2.04470199e-01]
 [2.35697744e-04 0.00000000e+00 2.42302053e-01 ... 5.53191489e-01
  9.89737254e-01 6.34657837e-02]
 ...
 [6.11892474e-04 0.00000000e+00 4.20454545e-01 ... 8.93617021e-01
  1.00000000e+00 1.07891832e-01]
 [1.16072990e-03 0.00000000e+00 4.20454545e-01 ... 8.93617021e-01
  9.91300620e-01 1.31070640e-01]
 [4.61841693e-04 0.00000000e+00 4.20454545e-01 ... 8.93617021e-01
  1.00000000e+00 1.69701987e-01]]


In [23]:
print(boston.data)

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]


## Feature Engineering <a class="anchor" id="feature"/>
Once our data has been preprocessed, often times a good next step is to attempt to maximize the value of the data that we have. Linear models often need more features, especially given that we have only 13 features in our dataset. Using the `PolynomialFeatures` class in scikit-learn we can create additional features.

In [24]:
from sklearn.preprocessing import PolynomialFeatures

X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.3, random_state=0)

In [25]:
X_train.shape

(354, 104)

## Effectiveness

In [26]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [29]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.95
Test set score: 0.64


In [30]:
ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Training set score: 0.88
Test set score: 0.78
