# Module 9: Introduction to Machine Learning Part 2

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Learning Outcomes</a></span></li><li><span><a href="#Readings-and-Resources" data-toc-modified-id="Readings-and-Resources-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Readings and Resources</a></span></li><li><span><a href="#The-Space-Shuttle-Challenger-Explosion-and-the-O-Rings" data-toc-modified-id="The-Space-Shuttle-Challenger-Explosion-and-the-O-Rings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The Space Shuttle Challenger Explosion and the O-Rings</a></span></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature Engineering</a></span><ul class="toc-item"></ul></li><li><span><a href="#Dummy-Variables" data-toc-modified-id="Dummy-Variables-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Dummy Variables</a></span></li><li><span><a href="#Training-vs-Test-datasets" data-toc-modified-id="Training-vs-Test-datasets-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Training vs Test datasets</a></span></li><li><span><a href="#Introduction-to-Other-Modeling-Techniques" data-toc-modified-id="Introduction-to-Other-Modeling-Techniques-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Introduction to Other Modeling Techniques</a></span></li><li><span><a href="#References" data-toc-modified-id="References-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>References</a></span></li></ul></div>

## Learning Outcomes

In this module you will continue to learn and practice:
* Linear Regression
* Feature Engineering
* Dummy Variables
* Testing / Training Models
* Other Modeling Techniques

## Readings and Resources

The majority of the notebook content draws from the recommended readings. We invite you to further supplement this notebook with the following recommended texts.

Geron, A. (2017) *Hands-on Machine Learning with Scikit-Learn and TensorFlow*. O'Reilly Media.

Witten, I.H, Frank, E. (2005) *Data Mining. Practical Machine Learning Tools and Techniques* (2nd edition). Elsevier.

`statsmodels` Documentation can be found at https://www.statsmodels.org/dev/index.html.

`scikit-learn` Documentation can be found at http://scikit-learn.org/stable/documentation.html.

This module continues the introduction to Machine Learning. It covers the qualitative aspects of Machine Learning, including feature engineering, dummy variables, model validation, and an introduction to other modeling techniques.

We will begin by refreshing our knowledge of linear regressions by applying the concepts to a new problem.

## The Space Shuttle Challenger Explosion and the O-Rings

We will use a data set about the Challenger space shuttle disaster of 1986, when one of the rocket boosters exploded. After the data was analyzed, the commission determined that the explosion was caused by the failure of an O-ring in the rocket booster, which was unacceptably sensitive to the outside temperature and other factors. The dataset contains data from 23 flights prior to the disaster.

The dataset home page can be found at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring).
You can also read about the event on [Wikipedia](https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster).

First, we will import the relevant modeling packages and the data set. The definitions for each column are as follows:

* ORings: Number of O-rings at risk on a given flight
* DistressedOrings: Number experiencing thermal distress
* Temp: Launch temperature (degrees F)
* Pressure: Leak-check pressure (psi)
* TempOrderOfFlight: Temporal order of flight

In [1]:
# Render plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# import formula api as alias smf
import statsmodels.formula.api as smf

cols = ['ORings', 'DistressedOrings', 'Temp', 'Pressure', 'TempOrderOfFlight']

df = pd.read_csv('o-ring-erosion-or-blowby.csv', names=cols)
df

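| | ORings | DistressedOrings | Temp | Pressure | TempOrderOfFlight |
|---|---|---|---|---|---|
| 0 | 6 | 0 | 66 | 50 | 1 |
| 1 | 6 | 1 | 70 | 50 | 2 |
| 2 | 6 | 0 | 69 | 50 | 3 |
| 3 | 6 | 0 | 68 | 50 | 4 |
| 4 | 6 | 0 | 67 | 50 | 5 |
| 5 | 6 | 0 | 72 | 50 | 6 |
| 6 | 6 | 0 | 73 | 100 | 7 |
| 7 | 6 | 0 | 70 | 100 | 8 |
| 8 | 6 | 1 | 57 | 200 | 9 |
| 9 | 6 | 1 | 63 | 200 | 10 |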


Now, we will run an Ordinary Least Squares (OLS) Regression, which is one of the simplest methods of linear regression. But what does it mean?

Well, we know there is rarely a perfect relationship when we are modeling one or more independent variables and their impact on a dependent variable. In this example, there is no *exact* temperature that will *always* result in a distressed O-ring. The difference between an observed value and the value the model predicts for it is called the error (or residual). And because there is no *perfect* relationship, there are many candidate lines which could represent the relationship between two or more variables.

The objective of an OLS regression is to find the one straight line that minimizes the sum of the squared errors, which gives us the most accurate prediction possible with the linear regression method.
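To make "least squared errors" concrete, here is a minimal sketch using only the `Temp` and `DistressedOrings` values of the ten flights displayed above (not the full data set) and NumPy's `np.linalg.lstsq`. The helper `sum_squared_errors` and the "other" line are illustrative names and choices, not part of the original analysis.

```python
import numpy as np

# Temp and DistressedOrings for the ten flights displayed above
temp = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63], dtype=float)
distressed = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=float)

def sum_squared_errors(intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    predicted = intercept + slope * temp
    return float(np.sum((distressed - predicted) ** 2))

# Design matrix with a column of ones so the fit includes an intercept
X = np.column_stack([np.ones_like(temp), temp])
(ols_intercept, ols_slope), *_ = np.linalg.lstsq(X, distressed, rcond=None)

sse_ols = sum_squared_errors(ols_intercept, ols_slope)
# Any other line, e.g. a slightly steeper one, has a larger sum of squared errors
sse_other = sum_squared_errors(ols_intercept, ols_slope - 0.01)

print("OLS line: intercept =", ols_intercept, "slope =", ols_slope, "SSE =", sse_ols)
print("Other line: SSE =", sse_other)
```

Whatever alternative intercept or slope you try, its sum of squared errors cannot be smaller than that of the OLS line.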

Let's review the variables which make up our model:

* **ORings:** This is a constant (it's always 6), so there is no point using it as a predictor. It doesn't vary, so it can't contribute to different cases having different outcomes.

* **DistressedOrings:** This is what we're trying to predict, so this is our target variable.

* **Temp:** Our most important predictor.

* **Pressure:** Might or might not be predictive. Include it and see what happens.

* **TempOrderOfFlight:** This is just the order of the flights (Flight #1, #2, etc.). If we were interested in whether the situation was getting better or worse over time, we would want to include this as a predictor. Since we are only interested in the effects of temperature (and possibly test pressure), including it might lead the model to attribute the change in the number of distressed O-rings to the passage of time and mask the relationship we are really interested in.


In [2]:
X = df[['Temp', 'Pressure']]
y = df['DistressedOrings']

# Add a constant so the model will choose an intercept. (Otherwise the model will fit a line through the origin).
X = sm.add_constant(X)

# Fit the OLS model
est = sm.OLS(y, X).fit()

# Check the results
est.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | DistressedOrings | R-squared: | 0.354 |
| Model: | OLS | Adj. R-squared: | 0.29 |
| Method: | Least Squares | F-statistic: | 5.49 |
| Date: | Thu, 16 Jan 2020 | Prob (F-statistic): | 0.0126 |
| Time: | 01:44:21 | Log-Likelihood: | -17.408 |
| No. Observations: | 23 | AIC: | 40.82 |
| Df Residuals: | 20 | BIC: | 44.22 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.3298 | 1.188 | 2.803 | 0.011 | 0.851 | 5.808 |
| Temp | -0.0487 | 0.017 | -2.910 | 0.009 | -0.084 | -0.014 |
| Pressure | 0.0029 | 0.002 | 1.699 | 0.105 | -0.001 | 0.007 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.324 | Durbin-Watson: | 2.39 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 23.471 |
| Skew: | 1.782 | Prob(JB): | 8e-06 |
| Kurtosis: | 6.433 | Cond. No. | 1840.0 |


Now that we have our regression results, we can use the `params` attribute to find the values of our slopes and intercept.

In [3]:
est.params

const       3.329831
Temp       -0.048671
Pressure    0.002939
dtype: float64

We can then use these values and the line which best represents the relationship to predict outcomes based on different levels of pressure.

In [4]:
# Intercept (look parameters up by label rather than by position)
constant = est.params['const']
# Coefficient for Temp
coef1 = est.params['Temp']
# Coefficient for Pressure
coef2 = est.params['Pressure']

# No. of O rings in distress when temperature = 31 and pressure is 0, 50, 100, and 200
for pressure in [0, 50, 100, 200]:
    print("Temp=31 Pressure=", pressure, " Predicted # of O-Rings in distress:", constant + coef1 * 31 + coef2 * pressure)

Temp=31 Pressure= 0  Predicted # of O-Rings in distress: 1.8210269508611583
Temp=31 Pressure= 50  Predicted # of O-Rings in distress: 1.9679931836796445
Temp=31 Pressure= 100  Predicted # of O-Rings in distress: 2.114959416498131
Temp=31 Pressure= 200  Predicted # of O-Rings in distress: 2.4088918821351033


In [5]:
# Or using predict()
est.predict([[1, 31, 0], [1, 31, 50], [1, 31, 100], [1, 31, 200]])

array([1.82102695, 1.96799318, 2.11495942, 2.40889188])

### Conclusion

If we assume the overall relationship is linear, the analysis suggests that approximately 2 O-rings will experience distress if the temperature on the day of the launch is 31F. See the notes below for important cautions.

### Notes

Extrapolating results outside the range of actual observations, as we are doing here, is always a very dicey proposition. It assumes that the overall relationship is truly linear, but any smooth curve looked at over a small enough range will look straight (this is why calculus works), so a good linear fit inside the observed range tells us little about the shape of the relationship outside it.

In the absence of any better way to do this (there were only so many actual flights) it can provide some insight but must be thought of as indicative only, not an actual prediction with any precision.

## Fitting models using R-style formulas

The `statsmodels.api` module allows us to fit our regression model by passing the data in the form of a matrix. Using `statsmodels.formula.api` allows us to fit the same model using formula arguments. This is comparable to other programming languages, including R.

You can find more information about this library here: http://www.statsmodels.org/dev/example_formulas.html

In [6]:
import statsmodels.formula.api as smf

# formula: response ~ predictors
est = smf.ols(formula='DistressedOrings ~ Temp+Pressure', data=df).fit()

In [7]:
est.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | DistressedOrings | R-squared: | 0.354 |
| Model: | OLS | Adj. R-squared: | 0.29 |
| Method: | Least Squares | F-statistic: | 5.49 |
| Date: | Thu, 16 Jan 2020 | Prob (F-statistic): | 0.0126 |
| Time: | 01:44:38 | Log-Likelihood: | -17.408 |
| No. Observations: | 23 | AIC: | 40.82 |
| Df Residuals: | 20 | BIC: | 44.22 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.3298 | 1.188 | 2.803 | 0.011 | 0.851 | 5.808 |
| Temp | -0.0487 | 0.017 | -2.910 | 0.009 | -0.084 | -0.014 |
| Pressure | 0.0029 | 0.002 | 1.699 | 0.105 | -0.001 | 0.007 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.324 | Durbin-Watson: | 2.39 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 23.471 |
| Skew: | 1.782 | Prob(JB): | 8e-06 |
| Kurtosis: | 6.433 | Cond. No. | 1840.0 |


The rest of the calculations are the same as in the matrix-based example above, except that the fitted intercept is labelled `Intercept` instead of `const`.
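As a rough sketch of what those calculations look like with the formula interface, the example below refits the model on just the ten flights displayed earlier in this module, so its numbers will differ from the 23-flight summary above. The names `flights`, `fit`, and `new_launch` are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The ten flights displayed earlier in this module (not the full data set)
flights = pd.DataFrame({
    'DistressedOrings': [0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    'Temp': [66, 70, 69, 68, 67, 72, 73, 70, 57, 63],
    'Pressure': [50, 50, 50, 50, 50, 50, 100, 100, 200, 200],
})

fit = smf.ols(formula='DistressedOrings ~ Temp + Pressure', data=flights).fit()

# The formula interface labels the intercept 'Intercept' rather than 'const'
manual = fit.params['Intercept'] + fit.params['Temp'] * 31 + fit.params['Pressure'] * 50

# predict() takes named columns; the intercept term is added automatically
new_launch = pd.DataFrame({'Temp': [31], 'Pressure': [50]})
predicted = fit.predict(new_launch)

print(manual, predicted.iloc[0])
```

Unlike the matrix-based model, `predict()` here does not need a leading column of ones, because the formula adds the intercept for us.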

## Feature Engineering

All models use input data to produce a certain output. Feature engineering refers to the act of creating an input dataset which is relevant and compatible with the learning algorithm requirements. For example, let's assume you're building a model to predict house prices based on the following data points: number of rooms, area, latitude, and longitude. Perhaps you can use number of rooms or area as individual variables; however, latitude and longitude don't have much meaning on their own. This is where feature engineering comes in: you can combine the latitude and longitude values to create defined location coordinates. This information is significantly more useful than either data point on its own.

There are a number of techniques we can use to engineer features. We will review them briefly in this module and you will learn to apply them as you progress through the certificate.

* **Imputation:** Missing values are a common problem as you prepare your data for machine learning. The simplest solution to dealing with missing values is to drop rows or columns that have a significant number of missing values. Numerical imputation is another method of handling missing values, where you assign a numerical value through assumption, or by using the mean, median, or mode (most frequent value) of a column.

* **Detecting Outliers:** Outliers, or individual data points that lie outside of the data distribution, can skew your results. This is why it is important to remove any outliers that do not fit your data distribution. In the next course, you will learn more methods of visualizing a data set to help you quickly identify and remove outliers.

* **Binning:** This method refers to categorizing ranges of data in logical bins or groups - for example: low, medium, or high - to make it more meaningful for your analysis.

* **Variable Transformation:** In the next course, you will learn the various distributions that may describe your data set. Typically, we want our data to be normally distributed which makes it easier to analyze. In the cases where data is not normally distributed, there are a variety of methods that can be applied to correct for this, including logarithmic, Box-Cox or exponential transformations.

* **Feature Creation:** You can use a variety of mathematical functions to create new features. In the house price example mentioned above, we combined two columns to create a more meaningful data point. You can use this method, along with addition, subtraction, mean calculation, min/max, product or any other relevant methods that can create a more meaningful variable.

These are just some common methods of feature engineering. The concept is both a science and an art, in that you can choose the method that makes the most sense for the objectives of your model.

In Python, you can use NumPy and Pandas for most of these methods.
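The techniques listed above can be sketched in a few lines of Pandas and NumPy. The `houses` data frame below is hypothetical and invented purely for illustration, and the 1.5 × IQR rule and the bin edges are one common convention among several, not a recommendation from the readings.

```python
import numpy as np
import pandas as pd

# Hypothetical house data, invented for illustration only
houses = pd.DataFrame({
    'rooms': [2, 3, 4, 3, 6],
    'area': [60.0, 85.0, np.nan, 90.0, 200.0],
    'price': [200000.0, 260000.0, 310000.0, 275000.0, 2500000.0],
})

# Imputation: fill the missing area with the column median
houses['area'] = houses['area'].fillna(houses['area'].median())

# Detecting outliers: flag prices more than 1.5 * IQR outside the quartiles
q1, q3 = houses['price'].quantile([0.25, 0.75])
iqr = q3 - q1
houses['is_outlier'] = (houses['price'] < q1 - 1.5 * iqr) | (houses['price'] > q3 + 1.5 * iqr)

# Binning: group the number of rooms into low / medium / high
houses['room_bin'] = pd.cut(houses['rooms'], bins=[0, 2, 4, np.inf],
                            labels=['low', 'medium', 'high'])

# Variable transformation: log-transform the right-skewed price
houses['log_price'] = np.log(houses['price'])

# Feature creation: combine two columns into a new one
houses['price_per_area'] = houses['price'] / houses['area']

print(houses)
```

Each step adds or modifies a column, so you can inspect the effect of one technique at a time.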

## Dummy Variables

In statistics, we can use both quantitative (numerical) and qualitative (categorical) data to predict outcomes. In the last module, we learned about linear regression models. These models work easily with numerical data; however, it is more difficult to use them for categorical data. For example, let's consider that we want to predict the political vote of a group. "Republican", "Democrat" and "Independent" are the categories of a categorical variable. In order to use them in a regression, or machine learning model, we would need to transform them into numerical values. This is where we introduce Dummy Variables.

Dummy variables take the value 0 or 1 and are used to represent mutually exclusive categories in your analysis, for example Republican vs. Other.

The number of dummy variables you should use depends on the number of categories a value can assume. For example, if you would like to categorize political affiliation, you may set your dummy variables as follows:

$$
\begin{aligned}
x_{1} &= 1 \text{ if Republican, and } x_{1} = 0 \text{ otherwise} \\
x_{2} &= 1 \text{ if Democrat, and } x_{2} = 0 \text{ otherwise} \\
x_{3} &= 1 \text{ if Independent, and } x_{3} = 0 \text{ otherwise}
\end{aligned}
$$

However, you should be cautious about the "Dummy Variable Trap". This refers to defining too many dummy variables and causing a multicollinearity problem in your model. Multicollinearity occurs when two or more independent variables are so strongly related that one can be predicted from the others, which makes the estimated coefficients unreliable. For example, let's consider the case of Smoker vs. Non-Smoker. Suppose you were to assign dummy variables to both, such that:

$$
\begin{aligned}
x_{1} &= 1 \text{ if Smoker, and } x_{1} = 0 \text{ if Non-Smoker} \\
x_{2} &= 1 \text{ if Non-Smoker, and } x_{2} = 0 \text{ if Smoker}
\end{aligned}
$$

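Including both of these variables in a model would be redundant, because if we know that someone is a "Smoker", that automatically means they are NOT a "Non-Smoker". Thus, we only need one variable. More generally, a categorical variable with $k$ categories needs only $k-1$ dummy variables when the model includes an intercept; in the political affiliation example above, two of the three dummies are enough.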

In Python, you can create dummy variables by using Pandas. The function is `pd.get_dummies(df['column name'])`, and you can merge the resulting dummy columns with your existing data frame.
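Here is a minimal sketch of that workflow. The `voters` data frame is hypothetical and invented for illustration; passing `drop_first=True` to `pd.get_dummies` is one way to avoid the dummy variable trap, since it omits the first category and leaves it as the implicit baseline.

```python
import pandas as pd

# Hypothetical voters, invented for illustration only
voters = pd.DataFrame({
    'age': [34, 51, 27, 45],
    'party': ['Republican', 'Democrat', 'Independent', 'Democrat'],
})

# drop_first=True omits one category (here 'Democrat', the first alphabetically)
dummies = pd.get_dummies(voters['party'], prefix='party', drop_first=True, dtype=int)

# Merge the dummy columns with the existing data frame
voters = pd.concat([voters.drop(columns='party'), dummies], axis=1)
print(voters)
```

A row with zeros in both dummy columns is a Democrat, so no information is lost by dropping that column.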

## Training vs Test datasets

Machine learning models can be used to predict outcomes. We build them using an existing data set, and typically apply them to new data sets once complete. But, how do we know if the prediction will be accurate when applied to a new data set? What if the existing data was biased, and we didn't realize?

Well, this is where the concept of **cross-validation** comes in. The purpose of cross-validation is to ensure that the model we build is accurate for existing/known data, but also for independent or new data. There are several methods of cross-validation, but for purposes of this class we will focus on the simplest method: splitting your original data set into a **training** data set and a **testing** data set.

The **training** subset is used to train the model and fit the model parameters.
The **test** subset is used to test how well your model performs on the new data.

Why do we need to split the data into two subsets? Let's imagine that we created a model and fine-tuned the parameters so well that the accuracy of the model is almost 100% and it perfectly fits the data which we have. What will the accuracy of the model be on new data? Will the model work for data which was not used to train it? The accuracy of the model can only be determined by considering how well it performs on new data that was not used when fitting the model.

Below are two examples in which the model is **overfitted**, meaning that the model performs very well on the training data but does not generalize well. Both examples are used in [this Wikipedia article](https://en.wikipedia.org/wiki/Overfitting) on overfitting.

**Example 1:** The green line represents an overfitted classification model and the black line represents a better generalized model. While the green line perfectly describes the training data, it is likely to have a high error rate on new data, compared to the black line.

<img src='https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg' height="300" width="300" alt="Overfitted classification model (green line)">

**Image source:** Image by [Chabacano - Own work, CC BY-SA 4.0](https://commons.wikimedia.org/w/index.php?curid=3610704)

**Example 2:** In this example, noisy data is perfectly fitted to a polynomial function (blue line). Even though the polynomial function provides a perfect fit, it won't perform well on new data. The simple linear function would be a better fit for this data.

<img src='https://upload.wikimedia.org/wikipedia/commons/6/68/Overfitted_Data.png' height="400" width="400" alt="Overfitted model">

**Image source:** Image by [Ghiles - Own work, CC BY-SA 4.0](https://commons.wikimedia.org/w/index.php?curid=47471056)
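The effect in Example 2 can be reproduced with a small experiment. The sketch below uses synthetic data (a straight line plus random noise, invented for illustration) and NumPy's `np.polyfit`; the polynomial degree and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: a straight line plus noise
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_new = np.linspace(0.04, 0.96, 12)
y_new = 2 * x_new + rng.normal(scale=0.3, size=x_new.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A simple linear fit and a much more flexible polynomial fit
linear = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=6)

for name, coeffs in [('degree 1', linear), ('degree 6', flexible)]:
    print(name,
          'training MSE:', mse(coeffs, x_train, y_train),
          'new-data MSE:', mse(coeffs, x_new, y_new))
```

The more flexible polynomial always matches the training points at least as closely as the straight line; whether it also does worse on the new points depends on the noise, but that is the typical outcome and the signature of overfitting.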


Usually, the original dataset is split 70/30 or 80/20. This means that 70% (or 80%) of all data points are used to train the model and the remaining data is reserved for testing.

Each model trained on the training subset can be evaluated on the test subset to see its predictive performance. Comparing these results allows fine tuning of the prediction model.

As you progress through the certificate, you will learn other methods of cross-validation, including K-fold and repeated random sub-sampling.
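As a minimal sketch of an 80/20 split, the example below uses Pandas on the ten flights displayed earlier in this module. The names `flights`, `train`, and `test` and the `random_state` value are illustrative; scikit-learn's `train_test_split` is commonly used for the same purpose.

```python
import pandas as pd

# The ten flights displayed earlier in this module (not the full data set)
flights = pd.DataFrame({
    'Temp': [66, 70, 69, 68, 67, 72, 73, 70, 57, 63],
    'Pressure': [50, 50, 50, 50, 50, 50, 100, 100, 200, 200],
    'DistressedOrings': [0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Randomly pick 80% of the rows for training; random_state makes the split repeatable
train = flights.sample(frac=0.8, random_state=0)

# Everything not selected for training is held out for testing
test = flights.drop(train.index)

print(len(train), "training rows,", len(test), "test rows")
```

The model would then be fitted on `train` only, and its predictions compared with the known outcomes in `test`.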

## Introduction to Other Modeling Techniques

There are a number of other modeling techniques which you will learn in later courses of this certificate. These include:

* **Logistic Regression:** This method is largely used for classification problems, and this is where the concept of dummy variables applies. For example, you can use a logistic regression to predict O-ring damage in our Challenger problem (e.g. 1 = damaged, 0 = not damaged). Logistic regression models compute the probability that an instance belongs to the positive class (e.g. 1 = damaged), and make a classification decision based on a defined threshold value. You can use the scikit-learn library within Python for this type of regression. Below is an example of the output.

<img src='DmgTempLogReg.png' height="400" width="400" alt="Logistic Regression">

* **Support Vector Machines (SVM):** SVM algorithms are very powerful and versatile. They can be applied to both classification and regression problems, and are often used for outlier detection. They are not limited to linear problems, and are capable of performing non-linear classification and regression.

<img src='SVM.png' height="500" width="500" alt="SVM">

* **Decision Trees:** Decision Trees are a special family of Machine Learning algorithms. These algorithms are versatile and can perform both classification and regression tasks. Essentially, they break down data sets into smaller and smaller subsets until a decision is reached. They are very powerful and can be used to fit complex datasets, and they are also used as a component of **Random Forests** algorithms, which are among the most powerful Machine Learning algorithms available today.

<img src='DecisionTree.png' height="500" width="500" alt="Decision Tree">

As part of this certificate, in the Machine Learning course you will learn how to build the aforementioned types of models.
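To give a feel for how logistic regression turns a probability into a class, here is a small sketch of the probability-and-threshold step described above. The coefficients are hypothetical, chosen for illustration and not fitted to the Challenger data; `damage_probability` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical coefficients, chosen for illustration and NOT fitted to the Challenger data
intercept = 15.0
temp_coef = -0.23

def damage_probability(temp_f):
    """Logistic (sigmoid) function applied to a linear combination of the input."""
    return 1.0 / (1.0 + np.exp(-(intercept + temp_coef * temp_f)))

# Instances with a probability at or above the threshold are assigned to class 1
threshold = 0.5
for temp_f in [31, 60, 80]:
    p = damage_probability(temp_f)
    label = 1 if p >= threshold else 0
    print(f"Temp={temp_f}F  P(damaged)={p:.3f}  predicted class={label}")
```

In practice the coefficients would be estimated from data, and the threshold can be moved away from 0.5 when one kind of error is more costly than the other.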

-----------

You have reached the end of this module. 

----

## References

Connor Johnson blog (http://connor-johnson.com/2014/02/18/linear-regression-with-python/).

MNIST database (https://en.wikipedia.org/wiki/MNIST_database).

Ng, A. (2018)  _Machine Learning Yearning_ (electronic book).

Unsupervised Learning. (https://en.wikipedia.org/wiki/Unsupervised_learning#Approaches).

Witten, I.H, Frank, E. (2005) *Data Mining. Practical Machine Learning Tools and Techniques* (2nd edition). Elsevier.