# Welcome to the Dark Art of Coding:
## Introduction to Machine Learning
Linear Regression

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Cover an overview of Linear Regression
* Examine code samples that walk us through **The Process**:
   * Prep the data
   * Choose the model
   * Choose appropriate hyperparameters
   * Fit the model
   * Apply the model
   * Examine the results
* Explore a deep dive into this model
* Review some gotchas that might complicate things
* Review tips related to learning more

# Overview: Linear Regression
---

Linear Regression models are popular machine learning models because they:
* are often fast
* are often simple with few tunable hyperparameters
* are very easy to interpret
* can provide a nice baseline prediction to start with before considering more sophisticated models

The LinearRegression model that we will examine here relies upon the Ordinary Least Squares (OLS) method to calculate a linear function that fits the input data.

From [Wikipedia](https://en.wikipedia.org/wiki/Ordinary_least_squares): "Geometrically, this is seen as the sum of the squared distances, ... between each data point in the set and the corresponding point on the regression surface – **the smaller the differences, the better the model fits the data**."

The result of the simplest type of linear regression calculation is a formula for a line

$$y = mx + b$$

Where $m$ is the slope of the line and $b$ is the y-intercept.

Given some value of $x$, if we know $m$ and $b$, we can calculate $y$.
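
As a rough illustration of what "least squares" means, the sketch below fits a line to a handful of made-up points with NumPy's `np.polyfit` (not the scikit-learn model used in the rest of this session) and compares its sum of squared differences with that of an arbitrary alternative line. The data and the helper name `sum_squared_error` are invented for this example.

```python
import numpy as np

# Made-up points that lie roughly along y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# A degree-1 least-squares fit returns the slope (m) and y-intercept (b)
m, b = np.polyfit(x, y, 1)


def sum_squared_error(slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    predictions = slope * x + intercept
    return np.sum((y - predictions) ** 2)


print(f"fitted line: y = {m:.2f}x + {b:.2f}")
print("fitted line error:", sum_squared_error(m, b))
print("other line error: ", sum_squared_error(1.5, 2.0))
```

The fitted line should never have a larger error than any other straight line on the same points, which is the sense in which it is the "best" fit.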

Beyond that, we won't cover the math here. 😀

Scikit Learn has a number of Linear Models based on calculations besides OLS: 

* Ridge 
* Lasso
* Huber
* and many more...

Each one has slightly different approaches to calculating a line that fits the data.

**Ridge**: addresses some issues related to OLS by controlling the size of coefficients.

**Lasso**: encourages simple, sparse models (i.e. models with fewer parameters). Can be useful when you want to automate certain parts of model selection, like variable selection/parameter elimination. 

**Huber**: applies a linear loss (lower weight) to samples that are classified as outliers, thus minimizing the impact of random outliers.
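
To make these differences a little more concrete, here is a minimal sketch on made-up data. The `alpha` values, the injected outliers and all variable names are assumptions chosen only for illustration; they are not recommendations.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two made-up features; only the first one actually influences y
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=50.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)  # shrunk toward zero
print("Lasso coefficients:", lasso.coef_)  # the unused feature can drop to 0

# Corrupt a few targets to see how an outlier-robust loss behaves
y_outliers = y_demo.copy()
y_outliers[:10] += 50.0

ols_out = LinearRegression().fit(X_demo, y_outliers)
huber = HuberRegressor().fit(X_demo, y_outliers)

print("OLS intercept with outliers:  ", ols_out.intercept_)
print("Huber intercept with outliers:", huber.intercept_)
```

The true intercept of this made-up data is 0, so the intercept printed for each model gives a quick sense of how far the outliers dragged the fit.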

With this background, let's apply **The Process™** on a LinearRegression model.

## Prep the data

We start with a set of standard imports...

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

# NOTE: during the Choose the Model step, we will import the 
#     model we want, but there is no reason you can't import it here.
# from sklearn.linear_model import LinearRegression

### Prep the training Data

In [None]:
df = pd.read_csv('../universal_datasets/linreg_train.csv',
                     names=['x', 'y'])
df.head()

In [None]:
length = len(df)
X_train = df['x'].values.reshape(length, 1)
y_train = df['y'].values.reshape(length, 1)
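
Scikit Learn models expect the features as a 2-D array with one row per sample and one column per feature, which is why the column is reshaped. A minimal sketch with a made-up array (the names `values` and `column` are only for illustration) shows that `reshape(-1, 1)` lets NumPy work out the row count, and that it matches the `[:, np.newaxis]` indexing used later in this session:

```python
import numpy as np

values = np.array([1.5, 2.5, 3.5, 4.5])  # 1-D: shape (4,)

# -1 tells NumPy to infer the number of rows from the array's length
column = values.reshape(-1, 1)            # 2-D: shape (4, 1)

print(values.shape, column.shape)
print(np.array_equal(column, values[:, np.newaxis]))
```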

It can be really useful to take a look at the features matrix and target array of the training data. 

* In the raw form
* In a visualization tool

For this dataset, let's use a scatter plot.

In [None]:
X_train[:5]

In [None]:
y_train[:5]

In [None]:
plt.scatter(X_train, y_train)
plt.title("Dots in a box");

### Prep the test data

In [None]:
X_test = np.linspace(0, 30, 100).reshape(100, 1)
X_test[:5]

## Choose the Model

In this case, we have already decided upon using the LinearRegression model, so importing it is straightforward. But if we aren't sure what model we want we can always refer back to the [API Reference](https://scikit-learn.org/stable/modules/classes.html).

In [None]:
from sklearn.linear_model import LinearRegression

## Choose Appropriate Hyperparameters

For our purposes, this model doesn't require any hyperparameters, so we simply call the `LinearRegression` class.

In [None]:
model = LinearRegression()

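If we were to look at the possible hyperparameters, we would see something like this (the list shown is from an older scikit-learn release; the `normalize` argument has since been removed):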

```python
LinearRegression(
    fit_intercept=True,
    normalize=False,
    copy_X=True,
    n_jobs=None,
)
```
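
One way to check which hyperparameters your installed version offers is the estimator's `get_params()` method, which returns the current settings as a dictionary. A minimal sketch:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Print each hyperparameter and its current (default) value
for name, value in sorted(model.get_params().items()):
    print(f"{name} = {value!r}")
```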

**Yeah, but what do these even mean?**

Some hyperparameters can be tricky to understand. Good places to start are the documentation:

> [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

A number of these items are also explained on Stack Overflow:

> [how fit intercept parameter impacts linear regression with scikit learn](https://stackoverflow.com/questions/46510242/how-fit-intercept-parameter-impacts-linear-regression-with-scikit-learn)

It might take:

* several readings
* multiple sources
* some tests and examples

...before you start to wrap your head around the expected outcomes.

*This is OK. You are just like the rest of us!*

<img src='../universal_images/so_confused.jpg' width='300'>
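
A small test of your own can help here. As a sketch, the cell below fits the same made-up points (lying exactly on $y = 2x + 5$) with and without `fit_intercept`; the data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on y = 2x + 5
X_line = np.arange(10).reshape(-1, 1)
y_line = 2 * X_line.ravel() + 5

with_intercept = LinearRegression(fit_intercept=True).fit(X_line, y_line)
without_intercept = LinearRegression(fit_intercept=False).fit(X_line, y_line)

# With an intercept the model can recover both the slope and the offset
print(with_intercept.coef_, with_intercept.intercept_)

# Without one the line is forced through the origin, so the slope shifts
print(without_intercept.coef_, without_intercept.intercept_)
```

Comparing the two printed lines shows what the hyperparameter does: turning it off pins the intercept at zero and the slope has to compensate.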


## Fit the Model

In [None]:
model.fit(X_train, y_train)
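
Once a LinearRegression model is fitted, the slope and y-intercept from $y = mx + b$ are available as the `coef_` and `intercept_` attributes. The sketch below uses made-up data as a stand-in for the workshop CSV (the names `X_fake`, `y_fake` and `fitted` are assumptions), with `y` shaped as a column just like `y_train`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Made-up stand-in for the workshop CSV: roughly y = 3x + 4 plus noise
X_fake = np.linspace(0, 30, 50).reshape(-1, 1)
y_fake = 3.0 * X_fake + 4.0 + rng.normal(scale=2.0, size=(50, 1))

fitted = LinearRegression().fit(X_fake, y_fake)

# Because y is 2-D here, coef_ has shape (1, 1) and intercept_ has shape (1,)
m = fitted.coef_[0, 0]
b = fitted.intercept_[0]
print(f"y = {m:.2f}x + {b:.2f}")
```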

## Apply the Model

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred.shape

In [None]:
y_pred[:5]

## Examine the results

In [None]:
plt.title("Red and Purple Results")
plt.scatter(X_train, y_train, color='rebeccapurple')
plt.plot(X_test, y_pred, color='red');
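
A plot is a good first check, but it can also help to put a number on the fit. The sketch below is one possible approach, not part of the original walkthrough: it holds back some made-up samples with `train_test_split` and scores the predictions with `r2_score` and `mean_squared_error`. All data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up data: roughly y = 3x + 4 plus noise
X_all = rng.uniform(0, 30, size=(200, 1))
y_all = 3.0 * X_all.ravel() + 4.0 + rng.normal(scale=2.0, size=200)

# Hold back a quarter of the samples so the score uses unseen data
X_fit, X_held, y_fit, y_held = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

scored = LinearRegression().fit(X_fit, y_fit)
y_held_pred = scored.predict(X_held)

print("R^2:", r2_score(y_held, y_held_pred))
print("MSE:", mean_squared_error(y_held, y_held_pred))
```

An $R^2$ close to 1 and a small mean squared error both suggest the line tracks the held-back points well.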

# Gotchas
---

A risk in machine learning is using a model that either doesn't match the data well enough (**underfitting**) or matches the training data so closely that it doesn't apply well to test data (**overfitting**).

For this example, we will look at three graphs. This example comes from the Scikit Learn [Underfitting/Overfitting documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html).

## Prep the data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import sklearn

### Prep the training data

In the example, they create a function (`true_fun`) that generates a series of points on a graph in the shape of a Cosine.

In [None]:
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

With 30 random values as `X` inputs, they use the function to generate 30 related `y` values.

In [None]:
np.random.seed(0)

n_samples = 30

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

Let's look at X and y.

In [None]:
X[:5]

In [None]:
y[:5]

In [None]:
plt.scatter(X, y)
plt.title("Cosine Dots");

### Prep the test data

In [None]:
X_test = np.linspace(0.05, 1, 100)

## Choose Appropriate Hyperparameters

To model the results, the example sets up something called a Pipeline. Pipelines allow you to feed inputs into one "end" of a series of models and get predictions out the other end, without having to manually take the output of one model and drop it into the inputs of the next model.

This example uses the PolynomialFeatures model to transform inputs from a degree 1 polynomial into higher degree polynomials. It takes the results of those transformations and then feeds them into the LinearRegression model. 

The Pipeline simplifies things so that we only have to call `.fit()` once on the pipeline.

We will do this three times using degrees of 1, 4, and 15 to demonstrate underfitting, a good fit, and overfitting.

We will dive a little deeper into the Pipeline and the PolynomialFeatures components later.

Two of these cases will generate linear regressions that are not straight lines.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

Let's start with **degree of 1**

In [None]:
polynomial_features = PolynomialFeatures(degree=1,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

## Fit the Model

We only have to call `.fit()` on the pipeline, not on each of the components in the pipeline.

In [None]:
pipeline.fit(X[:, np.newaxis], y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test[:, np.newaxis])

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.title("Underfit")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend();  # show the labels assigned above

## Choose Appropriate Hyperparameters

Repeating the process to generate polynomial features of **degree 4**:

In [None]:
polynomial_features = PolynomialFeatures(degree=4,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

## Fit the Model

In [None]:
pipeline.fit(X[:, np.newaxis], y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test[:, np.newaxis])

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.title("Good match")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend();  # show the labels assigned above

## Choose Appropriate Hyperparameters

Lastly, let's generate polynomial features of **degree 15**:

In [None]:
polynomial_features = PolynomialFeatures(degree=15,
                                         include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

## Fit the Model

In [None]:
pipeline.fit(X[:, np.newaxis], y)

## Apply the Model

In [None]:
y_test = pipeline.predict(X_test[:, np.newaxis])

## Examine the results

In [None]:
plt.plot(X_test, y_test, label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.title("Overfit")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend();  # show the labels assigned above
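
The three plots can also be compared numerically. The linked scikit-learn example scores each degree with cross-validation; a sketch along those lines follows. It regenerates the same cosine data under new names (`X_cos`, `y_cos`) so that it runs on its own, and the dictionary `mse_by_degree` is just a convenience for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(X):
    return np.cos(1.5 * np.pi * X)


# Same data generation as above, under different names
np.random.seed(0)
n_samples = 30
X_cos = np.sort(np.random.rand(n_samples))
y_cos = true_fun(X_cos) + np.random.randn(n_samples) * 0.1

mse_by_degree = {}
for degree in (1, 4, 15):
    candidate = Pipeline([
        ("polynomial_features",
         PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_regression", LinearRegression()),
    ])
    # Scikit Learn reports negated MSE so that "higher is better"
    scores = cross_val_score(candidate, X_cos[:, np.newaxis], y_cos,
                             scoring="neg_mean_squared_error", cv=10)
    mse_by_degree[degree] = -scores.mean()
    print(f"degree {degree:>2}: mean cross-validated MSE = "
          f"{mse_by_degree[degree]:.3g}")
```

A lower mean squared error on held-out folds indicates a model that generalizes better; both the underfit and the overfit models should score worse than the degree 4 model.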

# Deep Dive
---

Let's explore PolynomialFeatures and Pipelines in a bit more depth:

## PolynomialFeatures

The PolynomialFeatures class has a `.fit_transform()` method that transforms input values into a series of output values, which are often used as inputs to other models.

In [None]:
X = np.arange(3).reshape(3, 1)
X

In [None]:
poly = PolynomialFeatures(1)
poly.fit_transform(X)

Yields $1, a$ for each element in the X matrix

In [None]:
poly = PolynomialFeatures(2)
poly.fit_transform(X)

Yields $1, a, a^2$ for each element in the X matrix

In [None]:
poly = PolynomialFeatures(4)
poly.fit_transform(X)

Yields $1, a, a^2, a^3, a^4$ for each element in the X matrix

In [None]:
X2 = np.arange(6).reshape(3, 2)
X2

In [None]:
poly = PolynomialFeatures(1)
poly.fit_transform(X2)

Yields $1, a, b$ for each row $[a, b]$ in the X2 matrix

In [None]:
poly = PolynomialFeatures(2)
poly.fit_transform(X2)

Yields $1, a, b, a^2, ab, b^2$ for each row $[a, b]$ in the X2 matrix

In [None]:
poly = PolynomialFeatures(3)
poly.fit_transform(X2)

#         1     a     b     a^2   ab   b^2   a^3  a^2*b a*b^2 b^3

Yields $1, a, b, a^2, ab, b^2, a^3, a^2b, ab^2, b^3$ for each row $[a, b]$ in the X2 matrix

Thus for any degree that we feed into the PolynomialFeatures class, we can transform an input matrix into a wider matrix of higher-order terms. A linear model fitted on those terms can follow curved relationships between `x` and `y` more closely than a straight line can.

## Pipeline

The Pipeline class accepts any number of models as input and creates a sequence of steps.

All models except the last must implement `.fit()` and `.transform()` so that they output an appropriate matrix to feed into the next model in the pipeline.

Once a pipeline is created, the user only needs to call the `.fit()` and `.predict()` methods once on the pipeline.
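
To see what the Pipeline saves us, the sketch below runs the same two steps by hand and then through a Pipeline, and compares the predictions. The data (`X_curve`, `y_curve`, `X_new`) and the other variable names are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: y = x^2 exactly
X_curve = np.linspace(-3, 3, 20).reshape(-1, 1)
y_curve = X_curve.ravel() ** 2
X_new = np.array([[0.5], [1.5]])

# By hand: transform, fit, then remember to transform again before predicting
expander = PolynomialFeatures(degree=2, include_bias=False)
regressor = LinearRegression()
regressor.fit(expander.fit_transform(X_curve), y_curve)
manual_pred = regressor.predict(expander.transform(X_new))

# With a Pipeline: the same two steps behind a single fit/predict pair
pipe = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("linear_regression", LinearRegression()),
])
pipe.fit(X_curve, y_curve)
pipe_pred = pipe.predict(X_new)

print(manual_pred, pipe_pred)
```

Both approaches should print matching predictions; the Pipeline simply takes care of passing each step's output to the next one.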



To create a Pipeline, we first instantiate any of the models we want to use, just as if we were creating standalone models.

```python
polynomial_features = PolynomialFeatures(degree=15,
                                         include_bias=False)
linear_regression = LinearRegression()
```

Next we provide a `list` of `tuples` to the Pipeline class. Each tuple is a (name, model) pair: the name is what we want to call that step of the pipeline, and the model is what we want to use at that step:

```python
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])
```

With a Pipeline in hand, we simply call `.fit()` just as we would for any model.

```python
pipeline.fit(X[:, np.newaxis], y)
```

Jupyter will output the Pipeline parameters for us. We can see each of the steps we defined, in the correct order, along with the hyperparameters that we provided. The exact display varies by scikit-learn version; an older release prints something like this:

```python
Pipeline(memory=None,
     steps=[('polynomial_features', PolynomialFeatures(degree=15,
             include_bias=False, interaction_only=False)), 
            ('linear_regression', LinearRegression(copy_X=True,
             fit_intercept=True, n_jobs=None,
             normalize=False))])
```
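
The step names are more than labels: a fitted Pipeline lets us look a step up by name through its `named_steps` attribute. A minimal sketch on made-up data (the names `X_small`, `y_small` and `inspected` are assumptions for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data, just to have something to fit
rng = np.random.default_rng(1)
X_small = rng.uniform(0, 1, size=(30, 1))
y_small = np.cos(1.5 * np.pi * X_small.ravel())

inspected = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=4, include_bias=False)),
    ("linear_regression", LinearRegression()),
])
inspected.fit(X_small, y_small)

# Look a step up by the name given in its (name, model) tuple
regression_step = inspected.named_steps["linear_regression"]
print(regression_step.coef_)       # one coefficient per polynomial feature
print(regression_step.intercept_)
```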

# How to learn more: tips and hints
---

**Read the outputs**: Pay close attention to the outputs that Scikit Learn prints to the screen. Regular exposure to these outputs will familiarize you with terms, arguments, vocabulary and grammar that are fundamental to understanding the inner workings of the models specifically and machine learning more generally.

**Do outside research**: When you find a new word OR a word used in ways that you are not used to, look it up, read articles about that concept, read Stack Overflow answers about that concept, and of course read the documentation. The word **regression** has been a thorn in my side since I first saw it. I just couldn't put my finger on what it means. I know what is happening in a regression calculation, but the **meaning** just escaped me. Why that word, to describe that phenomenon?

> "The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean)." 

> Source: [Wikipedia: Regression Analysis](https://en.wikipedia.org/wiki/Regression_analysis)

**Tear apart the examples**: The [original example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) showing underfitting/overfitting was a bit more complicated than what I showed here, because they opted to create a three-panel chart and to automate the processing by putting the degrees into a list and cycling through the list using a for loop to generate all the charts...

I took individual lines, looked at each one, and stripped away as much of the extraneous complication as I could so that only the machine learning components remained. That greatly helped clarify what was going on.

# Experience Points!
---

# Read the docs...

Explore the docs related to Support Vector Machines for about 3 - 4 minutes, in particular the section related to Support Vector Classifiers.

[**SVC (link)**](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

Find answers to the following:

* What is the general limit on the number of samples that can be fed into this model?
* What is the default kernel?

---
When you complete this exercise, please put your **green** post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# References
---

Below are references that may assist you in learning more:
    
|Title (link)|Comments|
|---|---|
|[API docs on linear models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)||
|[sklearn description of overfitting](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html)||
|[Wikipedia article on overfitting](https://en.wikipedia.org/wiki/Overfitting)||
|[Wikipedia article on regression analysis](https://en.wikipedia.org/wiki/Regression_analysis)||