# Math 1376: Programming for Data Science
---

## Module 06: A prologue to machine learning
---

> Before machine learning, there was statistics. 

In this notebook, we briefly review linear regression to help set the stage and mindset for learning models from data. 

Linear regression is a commonly studied topic in statistics and computational science. 
The concept of linear regression is also at the core of many machine learning algorithms, which makes it a great place to start our studies.

## Introduction to Linear Regression
---

This notebook is adapted from https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb, and that notebook was adapted from Chapter 3 of An Introduction to Statistical Learning (for which I no longer have a working link).

## Motivation
---

Why are we exploring linear regression?
- It is widely used
- It runs fast
- It is easy to use (not a lot of tuning required)
- It is highly interpretable
- It is the basis for many other methods (e.g., we see how artificial neurons are related to linear regression in the next lecture notebook)


## Libraries
---

We use [scikit-learn](http://scikit-learn.org/stable/) to perform linear regression. 

- We learn a bit more about scikit-learn in this module, but you could devote an entire course to this library (and still only scratch the surface of what it has to offer).

- It is a good library to focus most of your energy and attention on as you learn machine learning and data science, since it provides much of the machine learning functionality you will generally need.

- We do a targeted dive into classifiers available within scikit-learn in the part (b) lecture notebook that is provided as part of the extra course materials.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Learning Objectives
---

- Be able to apply linear regression models to data

- Be able to interpret linear regression model coefficients

- Be able to apply linear regression to polynomial models of data

- Be able to apply multiple linear regression to multiple features

- Apply dummy variables to categorical variables in order to perform linear regression and interpret results.

## Notebook contents <a name='Contents'></a>
---

- [An advertisement for linear regression](#advertising)

- [Activity 1: Linear *Polynomial* Regression](#activity-polynomial)
    
- [Activity 2:  Handling Categorical Predictors with Categories](#activity-categories) 
    
- [Activity: Summary](#activity-summary)

## An advertisement for linear regression <a name='advertising'></a>

<mark> Run the code cell below and click the "play" button to see the first recorded lecture associated with this notebook.</mark>

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('VSZRQeLHe7U', width=800, height=450)

Let's take a look at some data, ask some questions about that data, and then use linear regression to answer those questions!

In [None]:
# read data into a DataFrame
url = 'https://github.com/CU-Denver-MathStats-OER/Programming-for-Data-Science/blob/main/Lectures-and-Assignments/06-Machine-Learning/lectures/Advertising.csv?raw=True'
data = pd.read_csv(url, index_col=0)

In [None]:
# print the shape of the DataFrame
data.shape

In [None]:
data  # To get a "big picture" view of all data

In [None]:
data.head()  # just the first 5 rows

What are the **features**?

- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)

- Radio: advertising dollars spent on Radio

- Newspaper: advertising dollars spent on Newspaper

What is the **response**?

- Sales: sales of a single product in a given market (in thousands of widgets)

There are 200 **observations**, and thus 200 **markets** in the dataset.

In [None]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
plt.rcParams.update({'font.size': 22})
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])

## Questions About the Advertising Data
---

Let's pretend you work for the company that manufactures and markets this widget. The company might ask you the following: On the basis of this data, how should we spend our advertising money in the future to maximize profit?

This general question might lead you to more specific questions:

1. Is there a relationship between ads and sales?

2. How strong is that relationship?

3. Which ad types contribute the most to sales?

4. What is the effect of each ad type on sales?

5. Given ad spending in a particular market, can sales be predicted?

We will explore these questions below!

## Simple Linear Regression
---

Simple linear regression is an approach for predicting a **quantitative response** using a **single feature** (or "predictor" or "input variable"). It takes the following form:

$y = \beta_0 + \beta_1x$

What does each term represent?

- $y$ is the response

- $x$ is the feature

- $\beta_0$ is the intercept

- $\beta_1$ is the coefficient for x

Together, $\beta_0$ and $\beta_1$ are called the **model coefficients**. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict sales!

## Estimating ("Learning") Model Coefficients
---

Generally speaking, coefficients are estimated using the **least squares criterion**, which means we find the line (mathematically) which minimizes the **sum of squared residuals** (or "sum of squared errors"):

<img src="https://github.com/CU-Denver-MathStats-OER/Programming-for-Data-Science/blob/main/Lectures-and-Assignments/06-Machine-Learning/lectures/06_estimating_coefficients.png?raw=1">

What elements are present in the diagram?

- The black dots are the **observed values** of x and y.

- The blue line is our **least squares line**.

- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

- $\beta_0$ is the **intercept** (the value of $y$ when $x$=0)

- $\beta_1$ is the **slope** (the change in $y$ divided by change in $x$)

Here is a graphical depiction of those calculations:

<img src="https://github.com/CU-Denver-MathStats-OER/Programming-for-Data-Science/blob/main/Lectures-and-Assignments/06-Machine-Learning/lectures/06_slope_intercept.png?raw=1">
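For a single feature, the least squares line has a well-known closed form: $\beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below applies these formulas to made-up toy data (the names `x_toy` and `y_toy` and the "true" line $2 + 3x$ are assumptions for illustration, not part of the advertising data) and computes the sum of squared residuals that the line minimizes.

```python
import numpy as np

# Hypothetical toy data (not the advertising data): y is roughly 2 + 3x plus noise
rng = np.random.default_rng(0)
x_toy = np.linspace(0, 10, 50)
y_toy = 2.0 + 3.0 * x_toy + rng.normal(scale=1.0, size=x_toy.size)

# Closed-form least squares estimates for a single feature
x_bar, y_bar = x_toy.mean(), y_toy.mean()
beta_1 = np.sum((x_toy - x_bar) * (y_toy - y_bar)) / np.sum((x_toy - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# Residuals (observed minus fitted) and the sum of squared residuals
residuals = y_toy - (beta_0 + beta_1 * x_toy)
ssr = np.sum(residuals ** 2)

print(beta_0, beta_1, ssr)
```

Any other choice of intercept or slope gives a larger sum of squared residuals on the same data, which is what "least squares" means.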

We now estimate the model coefficients for sales as a function of TV advertising data using linear regression. 

Check out the code documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
# create feature vector x and response vector y
feature_cols = ['TV']
x = data[feature_cols]
y = data.Sales

In [None]:
# follow the usual sklearn pattern: (1) import, (2) instantiate, (3) fit

# First we import
from sklearn.linear_model import LinearRegression

# Then, we instantiate a "linear regression" object
lm = LinearRegression()

# Finally, we call the fit method in the object to "learn" the linear relationship
lm.fit(x, y)

In [None]:
# print intercept and coefficients that were "learned" by the fit method and stored
# as data attributes within the object
print(lm.intercept_) 
print(lm.coef_)  # Shows up as an array because there are usually many features

## Interpreting Model Coefficients
---

How do we interpret the TV coefficient ($\beta_1$)?

- A "unit" increase in TV ad spending is **associated with** a 0.047537 "unit" increase in Sales.

- Or more clearly: An additional $1,000 spent on TV ads is **associated with** an increase in sales of 47.537 widgets.

Note that if an increase in TV ad spending was associated with a **decrease** in sales, $\beta_1$ would be **negative**.

## Using the Model for Prediction
---

Let's say that there was a new market where the TV advertising spend was **$50,000**. What would we predict for the Sales in that market?

$$y = \beta_0 + \beta_1x$$
$$y = 7.032594 + 0.047537 \times 50$$

In [None]:
# manually calculate the prediction
7.032594 + 0.047537*50

In [None]:
# manually calculate the prediction in a slightly better way
lm.intercept_ + lm.coef_[0]*50

Thus, we would predict Sales of **9,409 widgets** in that market.

Of course, we could (and *should*) also use the linear model object to make the prediction! 

***In general, the model expects multiple features (as we see later), so the input must be formatted as either a 2-D array or a DataFrame.***

In [None]:
import numpy as np

In [None]:
x_predict_array = np.array([[50]])  # Notice the double brackets

In [None]:
lm.predict(x_predict_array)

In [None]:
# This is probably the better way to do it since it aligns
# with the whole idea of using DataFrames to begin with.
x_predict_df = pd.DataFrame({'TV': [50]})
x_predict_df.head()

In [None]:
# use the model to make predictions on a new value
lm.predict(x_predict_df)

## Plotting the Least Squares Line
---

Let's make predictions for the **smallest and largest observed values of x**, and then use the predicted values to plot the least squares line:

In [None]:
# create a DataFrame with the minimum and maximum values of TV
x_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
x_new.head()

In [None]:
# make predictions for those x values and store them
linear_preds = lm.predict(x_new)
linear_preds

In [None]:
# first, plot the observed data
data.plot(kind='scatter', x='TV', y='Sales')

# then, plot the least squares line
plt.plot(x_new, linear_preds, c='red', linewidth=2)

## How Well Does the Model Fit the data?
---

The most common way to evaluate the overall fit of a linear model is by the **R-squared** value. R-squared is the **proportion of variance explained**, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the **null model**. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

For a least squares fit evaluated on the data it was trained on, R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model. Here's an example of what R-squared "looks like":

<img src="https://github.com/CU-Denver-MathStats-OER/Programming-for-Data-Science/blob/main/Lectures-and-Assignments/06-Machine-Learning/lectures/06_r_squared.png?raw=1">

You can see that the **blue line** explains some of the variance in the data (R-squared=0.54), the **green line** explains more of the variance (R-squared=0.64), and the **red line** fits the training data even more closely (R-squared=0.66). (Does the red line look like it's overfitting?)

Let's calculate the R-squared value for our simple linear model using the score method:

In [None]:
# print the R-squared value for the model
lm.score(x, y)

Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely on the domain. Therefore, it's most useful as a tool for **comparing different models**.
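To make the "reduction in error over the null model" concrete, here is a minimal sketch that computes R-squared by hand as $1 - SS_{res}/SS_{tot}$ and compares it with the `score` method. The toy data and the names `x_toy`, `y_toy`, and `model` are assumptions for illustration, not the advertising data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature, response roughly 1 + 0.5x plus noise
rng = np.random.default_rng(1)
x_toy = rng.uniform(0, 10, size=(100, 1))  # 2-D, as scikit-learn expects
y_toy = 1.0 + 0.5 * x_toy[:, 0] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(x_toy, y_toy)
y_hat = model.predict(x_toy)

# Error of the fitted model vs. error of the null model (always predict the mean)
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x_toy, y_toy))
```

The two printed values should agree up to floating-point rounding.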

Let's try some higher-order polynomial models in the following activity and compare them to the linear model.

---

## <mark>Activity 1: Linear *Polynomial* Regression</mark> <a name='activity-polynomial'></a>

The trick to adapt linear regression to nonlinear relationships between variables is to transform the feature data.
The basic idea is to create a linear model that looks like this:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots
$$

where the $x_1, x_2, x_3,$ and so on are just functions applied to the single-dimensional input $x$.
That is, we let $x_n = f_n(x)$, where $f_n()$ is some function that transforms the features.

If we are working with polynomials so that $f_n(x) = x^n$ for some $n$, then a polynomial regression problem takes the form of:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots
$$

Notice that this is *still a linear model*: the linearity refers to the fact that $y$ is a linear function of the coefficients $\beta_n$ (the coefficients never multiply or divide each other), even though $y$ is nonlinear in $x$.
What we have effectively done is taken the one-dimensional $x$ values and projected them into a higher dimension. 
Then, a linear fit can be applied to try and account for more complicated relationships between $x$ and $y$.

This is related to the idea of applying kernels to features to transform the feature space into a new space where relationships between features and response variables are more easily understood or explained. 
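As an alternative to building the transformed columns by hand, scikit-learn provides `PolynomialFeatures` in `sklearn.preprocessing`. The sketch below is one possible way to use it, on made-up data whose true relationship is assumed to be $y = 1 - 2x + 0.5x^2$ plus a little noise; the names `x_toy`, `y_toy`, `poly`, and `quad_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data with a known quadratic relationship
rng = np.random.default_rng(2)
x_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = (1.0 - 2.0 * x_toy[:, 0] + 0.5 * x_toy[:, 0] ** 2
         + rng.normal(scale=0.1, size=200))

# Expand the single column x into the two columns [x, x^2];
# include_bias=False because LinearRegression fits the intercept itself
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x_toy)

# An ordinary linear fit on the expanded features is a quadratic fit in x
quad_model = LinearRegression().fit(x_poly, y_toy)
print(quad_model.intercept_, quad_model.coef_)
```

The fitted intercept and coefficients should land close to the assumed values 1, -2, and 0.5, since the regression is still linear in those three unknowns.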

-  I show you how to transform the data by hand to fit a quadratic.<mark> Add comments to each of the code cells below to explain the individual lines of code in each cell (you do not need to add comments to the lines I have already provided comments for). </mark>

In [None]:
import copy as cp  # So we can make a copy of x without changing x

In [None]:
x_2 = cp.deepcopy(x)  # Now create a copy of x

In [None]:
x_2.insert(1, "TV^2", x['TV']**2, False)

In [None]:
x_2

In [None]:
lm2 = LinearRegression()
lm2.fit(x_2, y)

In [None]:
print(lm2.intercept_)
print(lm2.coef_)

In [None]:
lm2.score(x_2,y)

- Well, the value is *slightly* better for the quadratic. What does this look like? <mark> Add comments to each of the code cells below to explain the individual lines of code in each cell (you do not need to add comments to the lines I have already provided comments for). </mark>

In [None]:
x_new_2 = pd.DataFrame({'TV': np.linspace(data.TV.min(), data.TV.max(), 20),
                        'TV^2': np.linspace(data.TV.min(), data.TV.max(), 20)**2})
x_new_2.head()

In [None]:
quadratic_preds = lm2.predict(x_new_2)

In [None]:
data.plot(kind='scatter', x='TV', y='Sales')  # first, plot the observed data

plt.plot(x_new, linear_preds, c='red', linewidth=2)
plt.plot(x_new_2['TV'], quadratic_preds, c='k', linewidth=2 )

- <mark> Create cells below to repeat the above process for a cubic approximation. Then, try something silly like a 6th or 10th order polynomial approximation. What do you think is the best model and why? </mark>

End of Activity 1.

---

## Multiple Linear Regression
---

Simple linear regression can easily be extended to include multiple features. This is called **multiple linear regression**:

$y = \beta_0 + \beta_1x_1 + \cdots + \beta_nx_n$

Each $x$ represents a different feature, and each feature has its own coefficient. In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

We estimate these coefficients below.

In [None]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales

lm_multiple = LinearRegression()
lm_multiple.fit(X, y)

# print intercept and coefficients
print(lm_multiple.intercept_)
print(lm_multiple.coef_)

How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an **increase of $1000 in TV ad spending** is associated with an **increase in Sales of 45.765 widgets**. This interprets the first coefficient. How would you interpret the other coefficients?
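The phrase "for a given amount of the other features" can be checked numerically: raise one feature by one unit, hold the rest fixed, and the prediction changes by exactly that feature's coefficient. This is a minimal sketch on made-up data; the feature names `a` and `b` and all numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with two features
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'a': rng.uniform(0, 100, 150),
    'b': rng.uniform(0, 50, 150),
})
toy_y = 3.0 + 0.05 * toy['a'] + 0.2 * toy['b'] + rng.normal(scale=1.0, size=150)

model = LinearRegression().fit(toy, toy_y)

# Two inputs that differ only in 'a', by exactly one unit
base = pd.DataFrame({'a': [40.0], 'b': [10.0]})
bumped = pd.DataFrame({'a': [41.0], 'b': [10.0]})

change = model.predict(bumped)[0] - model.predict(base)[0]
print(change, model.coef_[0])  # the change in prediction equals the coefficient of 'a'
```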

## Feature Selection
---

How should we decide **which features to include** in a linear model to balance simplicity with a goodness of fit? Here's one idea:

- Try different models, and only keep the features in the model that result in larger coefficients.

- Check whether the R-squared value goes up when you add new features.

What are the **drawbacks** to this approach?

- Linear models rely upon a lot of **assumptions** (such as the features being independent), and if those assumptions are violated (which they usually are), R-squared and other statistical values used to assess goodness of fit or importance of a coefficient are less reliable.

- R-squared is susceptible to **overfitting**, and thus there is no guarantee that a model with a high R-squared value will generalize.

**R-squared computed on the training data never decreases as you add more features to the model**, even if they are unrelated to the response (and in practice it almost always increases). Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.

There is an alternative to R-squared called **adjusted R-squared** that penalizes model complexity (to control for overfitting), but it generally [under-penalizes complexity](http://scott.fortmann-roe.com/docs/MeasuringError.html).
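The sketch below illustrates both points on made-up data: adding a pure-noise feature does not lower the training R-squared, and adjusted R-squared applies the standard penalty $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ for $n$ observations and $p$ features. The helper `adjusted_r2` and all variable names are hypothetical; scikit-learn's `LinearRegression` does not report adjusted R-squared itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one informative feature
rng = np.random.default_rng(4)
n = 100
x_real = rng.uniform(0, 10, size=(n, 1))
y_toy = 2.0 + 1.5 * x_real[:, 0] + rng.normal(scale=2.0, size=n)

# A second feature that is pure noise, unrelated to the response by construction
x_noise = rng.normal(size=(n, 1))
x_both = np.hstack([x_real, x_noise])


def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


r2_one = LinearRegression().fit(x_real, y_toy).score(x_real, y_toy)
r2_two = LinearRegression().fit(x_both, y_toy).score(x_both, y_toy)

print(r2_one, r2_two)  # training R-squared with 1 feature, then with 2
print(adjusted_r2(r2_one, n, 1), adjusted_r2(r2_two, n, 2))
```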

In [None]:
# Let's look at the R-squared value for the multiple linear regression model above
# calculate the R-squared
lm_multiple.score(X, y)

So is there a better approach to feature selection? **Cross-validation.** It provides a more reliable estimate of out-of-sample error. It is therefore a better way to choose which of your models will best **generalize** to out-of-sample data. 

There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.

We will talk more about cross-validation in a future (extra) lecture, but it is a topic that is best investigated in depth in a more advanced course.
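As a small preview, here is a minimal sketch of 5-fold cross-validation using `cross_val_score` from `sklearn.model_selection`. It compares a model with one informative feature against a model that also includes five noise features. The data is made up and every variable name is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: one informative feature plus five pure-noise features
rng = np.random.default_rng(5)
n = 120
x_real = rng.uniform(0, 10, size=(n, 1))
y_toy = 2.0 + 1.5 * x_real[:, 0] + rng.normal(scale=2.0, size=n)
x_with_noise = np.hstack([x_real, rng.normal(size=(n, 5))])

# Each score is an R-squared computed on a fold that was held out during fitting
scores_small = cross_val_score(LinearRegression(), x_real, y_toy, cv=5, scoring='r2')
scores_large = cross_val_score(LinearRegression(), x_with_noise, y_toy, cv=5, scoring='r2')

print(scores_small.mean(), scores_large.mean())
```

Because each score comes from data the model did not see while fitting, extra features only help the average score if they improve out-of-sample predictions.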

---
## <mark>Activity 2:  Handling Categorical Predictors with Categories</mark> <a name='activity-categories'></a>

<mark>***Read through this section carefully, and look for the highlighted questions/problems that you need to answer.***</mark>

Up to now, all of our predictors have been numeric. What if one of our predictors was categorical?

Let's create a new feature called **Size**, and randomly assign observations to be **small or large** defining the size of the market:

In [None]:
import numpy as np

# set a seed for reproducibility
np.random.seed(12345)

# create a Series of booleans in which roughly half are True
nums = np.random.rand(len(data))
mask_large = nums > 0.5

# initially set Size to small, then change roughly half to be large
data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
data.head()

- <mark> What is the map method in the code cell below doing?</mark>

For scikit-learn, we need to represent all data **numerically**. If the feature only has two categories, we can simply create a **dummy variable** that represents the categories as a binary value:

In [None]:
# create a new Series called IsLarge
data['IsLarge'] = data.Size.map({'small':0, 'large':1})
data.head()

Let's redo the multiple linear regression and include the **IsLarge** predictor:

In [None]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge']
X = data[feature_cols]
y = data.Sales

# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)

# print coefficients
for i, j in zip(feature_cols, lm.coef_):
    print(i, j)

How do we interpret the **IsLarge coefficient**? For a given amount of TV/Radio/Newspaper ad spending, being a large market is associated with an average **increase** in Sales of 57.42 widgets (as compared to a small market, which is called the **baseline level**).

What if we had reversed the 0/1 coding and created the feature 'IsSmall' instead? The coefficient would be the same, except it would be **negative instead of positive**. As such, your choice of category for the baseline does not matter; all that changes is your **interpretation** of the coefficient.
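The sign flip can be verified directly. In this minimal sketch on made-up data (the column `spend` and all numbers are assumptions for illustration), the same response is fit once with `IsLarge` and once with `IsSmall = 1 - IsLarge`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with one numeric feature and one binary category
rng = np.random.default_rng(6)
toy = pd.DataFrame({'spend': rng.uniform(0, 100, 80)})
toy['IsLarge'] = rng.integers(0, 2, size=80)
toy['IsSmall'] = 1 - toy['IsLarge']  # reversed 0/1 coding
toy_y = (5.0 + 0.1 * toy['spend'] + 2.0 * toy['IsLarge']
         + rng.normal(scale=0.5, size=80))

fit_large = LinearRegression().fit(toy[['spend', 'IsLarge']], toy_y)
fit_small = LinearRegression().fit(toy[['spend', 'IsSmall']], toy_y)

print(fit_large.coef_[1], fit_small.coef_[1])      # same magnitude, opposite sign
print(fit_large.intercept_, fit_small.intercept_)  # the intercept absorbs the shift
```

The two models make identical predictions. Only the baseline level changes, so the intercept of the `IsSmall` model equals the intercept of the `IsLarge` model plus the `IsLarge` coefficient.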

### Handling Categorical Predictors with More than Two Categories
---

- <mark> Print out and comment on what the `mask_suburban` and `mask_urban` variables are and how they are being used.</mark>

Let's create a new feature called **Area**, and randomly assign observations to be **rural, suburban, or urban**:

In [None]:
# set a seed for reproducibility
np.random.seed(123456)

# assign roughly one third of observations to each group
nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()

We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban because that would imply an **ordered relationship** between suburban and urban (and thus urban is somehow "twice" the suburban category).

Instead, we create **additional dummy variables**:

In [None]:
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()

Here is how we interpret the coding:

- **rural** is coded as Area_suburban=0 and Area_urban=0

- **suburban** is coded as Area_suburban=1 and Area_urban=0

- **urban** is coded as Area_suburban=0 and Area_urban=1

Why do we only need **two dummy variables, not three?** Because two dummies capture all of the information about the Area feature, and implicitly define rural as the baseline level. (In general, if you have a categorical feature with k levels, you create k-1 dummy variables.)

If this is confusing, think about why we only needed one dummy variable for Size (IsLarge), not two dummy variables (IsSmall and IsLarge).
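One way to see the redundancy: with all k dummy columns, every row sums to exactly one, so any one column can be recovered from the others. The sketch below uses a tiny made-up Series (`areas` is a hypothetical name) and the `drop_first=True` option of `pd.get_dummies`, which drops the first level alphabetically and is one way to get k-1 columns.

```python
import pandas as pd

# Hypothetical toy categorical feature with k = 3 levels
areas = pd.Series(['rural', 'suburban', 'urban', 'rural', 'urban'], name='Area')

# All k dummy columns: each row has exactly one "on" entry
all_dummies = pd.get_dummies(areas, prefix='Area')

# k-1 dummy columns: the first level (rural) is dropped and becomes the baseline
k_minus_1 = pd.get_dummies(areas, prefix='Area', drop_first=True)

print(all_dummies.columns.tolist())
print(k_minus_1.columns.tolist())
print(all_dummies.sum(axis=1).tolist())  # every row sums to 1
```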

Let's include the two new dummy variables in the model:

In [None]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban']
X = data[feature_cols]
y = data.Sales

# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)

# print coefficients
for i, j in zip(feature_cols, lm.coef_):
    print(i, j)

How do we interpret the coefficients?

- Holding all other variables fixed, being a **suburban** area is associated with an average **decrease** in Sales of 106.56 widgets (as compared to the baseline level, which is rural).

- Being an **urban** area is associated with an average **increase** in Sales of 268.13 widgets (as compared to rural).

**A final note about dummy encoding:** If you have categories that can be ranked (e.g., strongly disagree, disagree, neutral, agree, strongly agree), you can potentially use a single numeric variable instead of dummy variables and represent the categories numerically (such as 1, 2, 3, 4, 5).
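A minimal sketch of such an ordinal encoding, using the same `map` approach as for a binary category. The Series `responses` and the dictionary `likert_order` are made-up names for illustration.

```python
import pandas as pd

# Hypothetical survey responses on a ranked (ordinal) scale
responses = pd.Series(['agree', 'neutral', 'strongly disagree',
                       'strongly agree', 'disagree'])

# Map each category to its rank so that the numeric order matches the category order
likert_order = {'strongly disagree': 1, 'disagree': 2, 'neutral': 3,
                'agree': 4, 'strongly agree': 5}
encoded = responses.map(likert_order)

print(encoded.tolist())
```

Note that this coding assumes equal spacing between adjacent categories, which is a modeling choice rather than a fact about the data.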

- <mark> In code and markdown cells below (you should add more), repeat the above analysis assuming there are four new categories to the data that you need to consider in the analysis. Call these four new categories **ET**, **CT**, **MT** and **PT** to indicate the typical four timezones most people reside in within the United States. You get to choose how to randomly assign these categories to the data. You may want to look up the general population distribution in the USA to estimate how to do this.  </mark>

End of Activity 2.

---

## What Did We NOT Cover?
---

- Detecting collinearity

- Diagnosing model fit

- Transforming predictors to fit non-linear relationships

- Interaction terms

- Assumptions of linear regression

- ...and so much more!

You could certainly go very deep into linear regression, and learn how to apply it really, really well. It's an excellent way to **start your modeling process** when working a regression problem. However, it is limited by the fact that it can only make good predictions if there is a **linear relationship** between the features and the response, which is why more complex methods (with higher variance and lower bias) will often outperform linear regression.

Therefore, we want you to understand linear regression conceptually, understand its strengths and weaknesses, be familiar with the terminology, and know how to apply it. However, we also want to spend time on many other machine learning models, which is why we aren't going deeper here.

## Other Resources
---

- To go much more in-depth on linear regression, see Dr. Joshua French's OER textbook [A Progressive Introduction to Linear Models](https://cu-denver-mathstats-oer.github.io/Applied-Regression-Analysis/), which is developed for the MATH 4733/5733 Applied Regression Analysis course at CU Denver.

- Alternatively, watch some [videos](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/) or read this [quick reference guide](http://www.dataschool.io/applying-and-interpreting-linear-regression/) to various key points.

---

## <mark>Activity: Summary</mark> <a name='activity-summary'></a>

Summarize some of the key takeaways/points from this notebook in a list below and prepare a few code examples related to these takeaways/points in the code cells below. You need to have at least one example for each of your summary points and you need at least three summary points.


- [Your summary point 1 goes here]




- [Your summary point 2 goes here]




- [Your summary point 3 goes here]

End of Summary Activity.

---

### [Click here to return to Notebook Contents](#Contents)