# Machine learning with scikit-learn in Python

## Introduction

The scikit-learn library is a Python library designed to consolidate a bunch of machine learning techniques into one place. In this lab, we'll explore data about passengers on the Titanic and build a predictive model to use their information to try to predict if they survived or not.

When you see a code cell, run it by clicking on it and then hitting Ctrl-Enter. Unlike in the previous module in which you were reminded to do it each time, this module will simply expect that you will select each code chunk as you meet it and hit Ctrl-Enter.

Remember that markdown cells (the ones with text and not code) are treated a little differently. It is very important that you **double-click** on Markdown cells before trying to edit them! Double-clicking puts you in "edit" mode and allows you to type. You can tell you're in edit mode because the left edge of the cell is green. If you single-click only, the cell's left edge will be blue. This means you're still in "command" mode. In command mode, the keyboard is assigned to various tasks like creating or deleting cells. Doing this by accident is very bad because it messes up your document. Again, do not type anything on the keyboard until you're absolutely sure you're in edit mode. (Look for the green edge on the left!)

## Import libraries and data

First, we use the `import` command to grab the tools we need. For example, we'll import `pandas` so we can work with DataFrames.

In [81]:
import pandas as pd

One important reminder here is that whenever we want to use a function from `pandas`, we have to precede that function with `pd.` (note the dot). As in,

```
pd.some_function()
```

We'd like to import some other functions from other libraries. We could import the whole library, but then we'd have to type more to be able to use functions from that library. Besides, we may only need one or two functions from that library. So we can also do this:

In [82]:
from statistics import mean, stdev

This way, we can calculate means and standard deviations by just calling the functions `mean()` and `stdev()` instead of `statistics.mean()` and `statistics.stdev()`. This only causes a problem if you import two or more libraries that have different functions sharing the same name. (So, for example, if you also import the `numpy` library, it also has a `mean()` function that might work somewhat differently than the one from the `statistics` library.)
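One way to sidestep a name clash is to keep each library's name attached to its functions. The sketch below is only an illustration (the variables `data` and `table` are made up for this example); it shows the two `mean` functions side by side.

```python
import statistics
import numpy as np

data = [1, 2, 3, 4]

# Both functions agree on a simple list of numbers.
stat_result = statistics.mean(data)
np_result = np.mean(data)
print(stat_result, np_result)

# numpy's version also accepts an `axis` argument (here: one mean per column),
# which the version from the statistics library does not have.
table = [[1, 2], [3, 4]]
column_means = np.mean(table, axis=0)
print(column_means)
```

Because each call names its library explicitly, it is always clear which `mean` is being used.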

Here are some other functions we'll need in this module:

In [83]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

We can even give "shortcut" names to some of these tools to save time. For example,

In [84]:
ss = StandardScaler()
ohe = OneHotEncoder(categories = "auto", sparse = False)

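Note that `ohe` is a `OneHotEncoder` object created with the specific arguments set here (`categories = "auto"` and `sparse = False`). If we wanted to use different arguments, we'd have to call `OneHotEncoder` directly, or assign it to a new variable with the new settings. (In scikit-learn 1.2 and later, the `sparse` argument is named `sparse_output`, so on a newer installation you may need `sparse_output = False` instead.)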

Now we import the data by grabbing the url where the data is stored and then reading it in via the `read_csv()` function from `pandas`.

In [85]:
url = "https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv"
titanic_data = pd.read_csv(url)

## Explore data

#### In the code chunks below, use the `head()` and `tail()` commands to look at a few rows of `titanic_data`.

In [86]:
# Use the head() function here.
titanic_data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [87]:
# Use the tail() function here.
titanic_data.tail()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


*****

Explanations for most of the variables can be found here:

https://www.kaggle.com/c/titanic/data

The information about a data set, including a list of the variables and their meanings, is often called a "code book".

#### Visit the link above and familiarize yourself with each variable and its properties. (Keep in mind that not all the variables are described in the link.)

#### In the code chunk below, check the `shape` attribute of `titanic_data` to see how many rows and columns are present.

(Remember that `shape` is an "attribute" of a DataFrame, so it doesn't require parentheses like `head()` and `tail()` do. Attempting to use `shape()` will generate an error.)

In [88]:
# Use shape to print the number of rows and columns in titanic_data.
titanic_data.shape

(1309, 14)

*****

Let's examine some `info()` about this DataFrame:

In [89]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


The `survived` variable will be our response variable, meaning the variable we are trying to predict using the rest of the data. All other variables will be considered "features". (In machine learning, the term "feature" is used to mean the same thing as a bunch of other synonymous terms like "explanatory variable", "independent variable", or "predictor".)

Not all these variables will be useful for our analysis.

#### Based on what you read about the variables, which of the variables above should we exclude? Explain why.

name

#### Based on the `info()` above, which variables should we exclude due to a huge amount of missing data?

body, boat, home.dest, cabin

*****

We're also going to drop the `sibsp` and `parch` variables. They may have some predictive value, but they're sort of weird because they take on only a few discrete values, so they don't make very good numerical variables.

Now let's investigate the remaining variables.

The categorical variables can be summarized using `value_counts()`, which creates a frequency table.

Recall that the syntax for this is
```
DataFrame["variable"].value_counts()
```
but you'll replace `DataFrame` and `variable` with the correct words. 
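The bracket syntax above and the dot syntax (`DataFrame.variable`) give the same frequency table. Here is a small sketch on a made-up DataFrame (`ports` and its values are hypothetical, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical mini data set, made up for illustration.
ports = pd.DataFrame({"embarked": ["S", "C", "S", "Q", "S", "C"]})

# Bracket notation and dot notation select the same column.
counts_bracket = ports["embarked"].value_counts()
counts_dot = ports.embarked.value_counts()

print(counts_bracket)
```

Bracket notation is the safer habit, since dot notation fails for column names that contain characters like a period or a space.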

#### Use `value_counts()` to explore the response variable `survived` as well as the features `pclass`, `sex`, and `embarked`.

In [90]:
# Make a frequency table for survived
titanic_data.survived.value_counts()

0    809
1    500
Name: survived, dtype: int64

In [91]:
# Make a frequency table for pclass
titanic_data.pclass.value_counts()

3    709
1    323
2    277
Name: pclass, dtype: int64

In [92]:
# Make a frequency table for sex
titanic_data.sex.value_counts()

male      843
female    466
Name: sex, dtype: int64

In [93]:
# Make a frequency table for embarked
titanic_data.embarked.value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

#### Get summary statistics for numerical variables using `describe()`

The syntax is similar to `value_counts()`:
```
DataFrame["variable"].describe()
```

In [94]:
# Get summary statistics for age
titanic_data.age.describe()

count    1046.000000
mean       29.881135
std        14.413500
min         0.166700
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: age, dtype: float64

In [95]:
# Get summary statistics for fare
titanic_data.fare.describe()

count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: fare, dtype: float64

### Data Wrangling

After exploring the data, we need to clean it up a bit. First, let's create a new DataFrame called `titanic_data2` that selects only the variables we want to keep for our analysis.

In [96]:
titanic_data2 = titanic_data[["survived", "pclass", "sex", "embarked", "age", "fare"]]
titanic_data2.head()

Unnamed: 0,survived,pclass,sex,embarked,age,fare
0,1,1,female,S,29.0,211.3375
1,1,1,male,S,0.9167,151.55
2,0,1,female,S,2.0,151.55
3,0,1,male,S,30.0,151.55
4,0,1,female,S,25.0,151.55


Now let's drop any missing rows and reset the index so that the rows are ordered sequentially.

In the previous module we did this in two separate steps. We can actually do it in one line of code instead. This is an example of what's called "method chaining":

In [97]:
titanic_data2 = titanic_data2.dropna().reset_index(drop = True)
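To see that chaining gives the same result as two separate steps, here is a sketch on a tiny made-up DataFrame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each of two different rows.
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0],
                    "fare": [7.25, 71.28, np.nan, 8.05]})

# Two separate steps: drop rows with missing values, then renumber the rows.
step1 = toy.dropna()
two_step = step1.reset_index(drop=True)

# The same thing with method chaining: each method acts on the
# result of the one before it.
chained = toy.dropna().reset_index(drop=True)

print(list(step1.index))    # row labels left over after dropping
print(list(chained.index))  # row labels after resetting the index
```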

#### How many rows were dropped? (First figure out how many rows are in the new data frame and then just subtract.)

In [98]:
# Use a command you already know to find out the number of rows in titanic_data2
titanic_data2.shape

(1043, 6)

*Type your answer here*

#### In the previous steps, we (1) selected certain columns and then (2) dropped missing values. Why didn't we drop missing data first and then select the columns we wanted? Run the following code and explain below what is happening.

In [99]:
titanic_data.dropna().shape

(0, 14)

*Type your answer here*

*****

## Prepare data for analysis

The data is in a nice clean format as we can see below:

In [100]:
titanic_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043 entries, 0 to 1042
Data columns (total 6 columns):
survived    1043 non-null int64
pclass      1043 non-null int64
sex         1043 non-null object
embarked    1043 non-null object
age         1043 non-null float64
fare        1043 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 49.0+ KB


Every variable has all 1043 values. Now, some variables have the wrong data type: `survived` and `pclass` are categorical, but they are stored as integers. In some analyses that might matter, but as we'll see later, we have to do a different kind of manipulation on those variables for `scikit-learn` to work anyway. So we'll leave them alone for now.

Just because the data is "clean" doesn't mean it's in a form we can use. The functions of `scikit-learn` generally require that the response variable be stored separately from the features, so let's do that next.

In [101]:
y = titanic_data2["survived"]

Note that the response variable `y` needs to be a "Series", not a one-column data frame. Observe:

In [102]:
type(y)

pandas.core.series.Series

Compare with this:

In [103]:
y_incorrect = titanic_data2[["survived"]]
type(y_incorrect)

pandas.core.frame.DataFrame

So even small differences like using a single bracket versus two brackets can make or break the analysis.
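The difference also shows up in the shape of each object. A quick sketch on a made-up DataFrame (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"survived": [1, 0, 1], "age": [29.0, 2.0, 30.0]})

as_series = toy["survived"]    # single brackets: a one-dimensional Series
as_frame = toy[["survived"]]   # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a single dimension (just a length), while the one-column DataFrame still has both rows and columns.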

The explanatory variables can still be stored in a DataFrame. The easiest way to get what we want is to use the `drop()` method. One argument of `drop` is `axis`; by specifying `axis = 1`, we're telling `pandas` to drop a column and not a row.

In [104]:
X = titanic_data2.drop("survived", axis = 1)
X.head()

Unnamed: 0,pclass,sex,embarked,age,fare
0,1,female,S,29.0,211.3375
1,1,male,S,0.9167,151.55
2,1,female,S,2.0,151.55
3,1,male,S,30.0,151.55
4,1,female,S,25.0,151.55


## One hot encoding

Another peculiarity of `scikit-learn` is that it requires all data to be stored in a certain way. For example, categorical variables have to be transformed using a process called "one hot encoding". Here is how it works.

Take the `pclass` variable as an example. There are three classes: first class, second class, and third class. They are labeled with 1, 2, and 3. One hot encoding creates three columns to replace the one `pclass` column. Each new column represents one of the three classes, and the new values are either 0 or 1: 0 if the case does not belong to that class, and 1 if it does. It helps to see an example.

Suppose we have some passengers with the following classes:

| passenger | pclass |
|-----------|--------|
| A | 1 |
| B | 2 |
| C | 1 |
| D | 3 |
| E | 3 |
| F | 2 |

The one hot encoding of this data will appear as follows:

| passenger | pclass1 | pclass2 | pclass3 |
|-----------|---------|---------|---------|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 1 | 0 | 0 |
| D | 0 | 0 | 1 |
| E | 0 | 0 | 1 |
| F | 0 | 1 | 0 |

In other words, every row will be all zeroes except a single 1 in the column corresponding to that passenger's class.
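The table above can be reproduced by hand with a few lines of `numpy`. This is only a sketch of the idea, not the `scikit-learn` encoder itself, and the variable names are made up for this example:

```python
import numpy as np

passengers = ["A", "B", "C", "D", "E", "F"]
pclass = np.array([1, 2, 1, 3, 3, 2])
categories = np.array([1, 2, 3])

# Compare each passenger's class (as a column) against every category
# (as a row). The result is True exactly where the class matches;
# astype(int) turns True/False into 1/0.
one_hot = (pclass[:, None] == categories).astype(int)

for name, row in zip(passengers, one_hot):
    print(name, row)
```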

The one hot encoder from `scikit-learn` was assigned to a new name, `ohe`. We also need to apply the `fit_transform` method to convert the data to the one hot encoding format. Remember that `X` is the DataFrame containing just the features, so `X[["pclass"]]` is a DataFrame with only the one column `pclass`.

In [105]:
ohe.fit_transform(X[["pclass"]])

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

Check that the first three rows are, indeed, 1st class passengers and the last three rows are 3rd class passengers:

In [106]:
X[["pclass"]].head(n = 3)

Unnamed: 0,pclass
0,1
1,1
2,1


In [107]:
X[["pclass"]].tail(n = 3)

Unnamed: 0,pclass
1040,3
1041,3
1042,3


#### Do the same thing for `sex` and `embarked`. In other words, modify the code above to one hot encode the other two categorical features. Then check the first three rows and last three rows of the original data to see that they match the output from `ohe.fit_transform`.

In [108]:
# One hot encode the sex feature
ohe.fit_transform(X[["sex"]])

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [109]:
# Print the first three rows of X[["sex"]]
X[["sex"]].head(n=3)

Unnamed: 0,sex
0,female
1,male
2,female


In [110]:
# Print the last three rows of X[["sex"]]
X[["sex"]].tail(3)

Unnamed: 0,sex
1040,male
1041,male
1042,male


In [111]:
# One hot encode the embarked feature
ohe.fit_transform(X[["embarked"]])

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [112]:
# Print the first three rows of X[["embarked"]]
X[["embarked"]].head(3)

Unnamed: 0,embarked
0,S
1,S
2,S


In [113]:
# Print the last three rows of X[["embarked"]]
X[["embarked"]].tail(3)

Unnamed: 0,embarked
1040,C
1041,C
1042,S


#### Based on what you can see in the output above, which sex is coded `[1, 0]` and which is coded `[0, 1]`?

*Type your answer here*

#### Based on what you can see in the output above, identify the Port of Embarkation locations coded `[1, 0, 0]`, `[0, 1, 0]`, and `[0, 0, 1]`. List them by name, not just by their abbreviation in this data set. (Hint: remember to check the code book!)

*Type your answer here*

## Scaling numerical data

Most machine learning algorithms work best when numerical data is on a common scale.

To see why, consider the two numerical variables we have here: age and fare. Age has units of years and has values between 0 and 80 whereas fare is in pounds (I think) and ranges from 0 to 512. When these numbers enter into complex calculations, it's possible that higher values of fare might drive the algorithms just because they are larger numbers, and not because fare is necessarily an important variable.

The way to fix this is to use a formula to "standardize" the variables. This involves subtracting the mean and dividing by the standard deviation:

$$\frac{x - \operatorname{mean}(x)}{\operatorname{stdev}(x)}$$

This has the effect of centering the variables so that their new mean is zero. It also rescales the variables so that a value of 1 is always one standard deviation from the mean. Therefore, all variables will be more or less on the same scale after standardization.
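Here is a quick sketch of the formula in action, using `numpy` on a handful of made-up fares (the values are hypothetical):

```python
import numpy as np

# Made-up values, for illustration only.
fares = np.array([7.25, 71.28, 8.05, 53.10, 8.46])

# Subtract the mean, then divide by the standard deviation.
# (np.std divides by n by default, the "population" standard deviation.)
standardized = (fares - fares.mean()) / fares.std()

print(standardized)
print(standardized.mean())  # essentially zero, up to rounding error
print(standardized.std())   # essentially one, up to rounding error
```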

Let's try it in Python. The `StandardScaler` was given the new name `ss` for convenience. We apply the `fit_transform` method in the exact same way as we did before with one hot encoding, but using `ss` instead of `ohe`.

In [114]:
ss.fit_transform(X[["age"]])

array([[-0.05663194],
       [-2.01237899],
       [-1.93693697],
       ...,
       [-0.23073426],
       [-0.1959138 ],
       [-0.05663194]])

We can check manually that the first value is correct. First, let's extract the first value. (Remember that Python is a "zero-indexed" language, meaning you start counting at 0, not at 1.)

In [115]:
age0 = X["age"][0]
age0

29.0

We need to calculate the mean and the standard deviation of the age variable.

In [116]:
mean_age = mean(X["age"])
mean_age

29.813199137104505

In [117]:
stdev_age = stdev(X["age"])
stdev_age

14.36626096948101

And now we apply the standardization formula from above:

In [118]:
(age0 - mean_age)/stdev_age

-0.056604786647828934

This is pretty close: the first age in the DataFrame (29) corresponds to the first value listed in the transformed data (-0.05663194) to the fourth decimal place. (The numbers don't agree exactly because of a subtle difference in how the standard deviation is computed: the `stdev()` function from the `statistics` library divides by n - 1, the "sample" standard deviation, whereas `StandardScaler` divides by n, the "population" standard deviation.)
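The small mismatch can be checked directly. The sketch below uses only the first five ages from the data, so the numbers will differ from the output above; the variable names are made up for this example.

```python
import statistics
import numpy as np
from sklearn.preprocessing import StandardScaler

# The first five ages from the data, used as a small test case.
ages = np.array([29.0, 0.9167, 2.0, 30.0, 25.0])

# What StandardScaler produces (it expects a 2-D input, hence the reshape).
scaled = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

# By hand, dividing by the population standard deviation (divisor n).
z_population = (ages - ages.mean()) / ages.std(ddof=0)

# By hand, dividing by the sample standard deviation (divisor n - 1).
z_sample = (ages - ages.mean()) / statistics.stdev(ages.tolist())

print(np.allclose(scaled, z_population))  # do the two agree?
print(np.allclose(scaled, z_sample))      # do the two agree?
```

With over a thousand rows, dividing by n or by n - 1 makes very little difference, which is why the two values above agree to four decimal places.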

#### Go through the same process for `fare`. Apply `fit_transform` to standardize the data. (For `fit_transform`, you'll need to use `X[["fare"]]` with double brackets.) Then extract the first value from `X["fare"]` (this time with single brackets). Calculate the mean and the standard deviation of fare and then standardize the first value from `X["fare"]` to see if it matches the output of `StandardScaler`.

In [119]:
# Use `ss.fit_transform` to standardize X[["fare"]]
ss.fit_transform(X[["fare"]])

array([[ 3.13554913],
       [ 2.06268333],
       [ 2.06268333],
       ...,
       [-0.52717838],
       [-0.52717838],
       [-0.51551435]])

In [120]:
# Extract the first entry from X["fare"]
fare0 = X["fare"][0]
fare0

211.3375

In [121]:
# Calculate the mean fare
mean_fare = mean(X["fare"])
mean_fare

36.60302387344199

In [122]:
# Calculate the standard deviation of fare
stdev_fare = stdev(X["fare"])
stdev_fare

55.753647701308196

In [123]:
# Standardize the first value of X["fare"]
(fare0 - mean_fare)/stdev_fare

3.1340456334385824

Does the number above match (somewhat closely) to the value from the output of `fit_transform` above?

*Type your answer here*

## Build transformers (more than meets the eye)

We'll need to apply all the knowledge from above to `X`, the whole DataFrame of features. Rather than trying to apply each separately as we did above and then reconstruct the DataFrame, we can use some sophisticated functionality of `scikit-learn` to transform all the columns at once.

First we gather the names of the columns, both the categorical ones and the numerical ones:

In [124]:
features_cat = ["pclass", "sex", "embarked"]

In [125]:
features_num = ["age", "fare"]

Next we define a "transformer", which will be a set of instructions to apply to each set of columns. Note that we will be instructing Python to use `ohe` (`OneHotEncoder`) on the set of categorical features and `ss` (`StandardScaler`) on the numerical features.

In [126]:
trans_cat = ("cat", ohe, features_cat)
trans_num = ("num", ss, features_num)

Next we build an object called a `ColumnTransformer` that gathers everything we've done so far.

In [127]:
ct = ColumnTransformer(transformers = [trans_cat, trans_num])

Finally, we can actually apply the transformer to the data. We use `fit_transform` but not quite as before. This time we apply it to the whole DataFrame of features, `X`. We will re-assign the output to a new object called `X_trans`.

In [128]:
X_trans = ct.fit_transform(X)
X_trans

array([[ 1.        ,  0.        ,  0.        , ...,  1.        ,
        -0.05663194,  3.13554913],
       [ 1.        ,  0.        ,  0.        , ...,  1.        ,
        -2.01237899,  2.06268333],
       [ 1.        ,  0.        ,  0.        , ...,  1.        ,
        -1.93693697,  2.06268333],
       ...,
       [ 0.        ,  0.        ,  1.        , ...,  0.        ,
        -0.23073426, -0.52717838],
       [ 0.        ,  0.        ,  1.        , ...,  0.        ,
        -0.1959138 , -0.52717838],
       [ 0.        ,  0.        ,  1.        , ...,  1.        ,
        -0.05663194, -0.51551435]])

Note that `X_trans` is no longer a DataFrame. It's something called a "numpy array". We don't need to worry too much what this is. Suffice it to say that it now has the right form to apply `scikit-learn` machine learning algorithms to it.
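The same steps can be tried on a tiny made-up DataFrame to see what a `ColumnTransformer` returns. This is a sketch under a few assumptions: the DataFrame `toy` and its columns are hypothetical, and `sparse_threshold = 0` is set so that the result is always an ordinary (dense) numpy array regardless of the encoder's default output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical column and one numerical column.
toy = pd.DataFrame({
    "deck": ["upper", "lower", "upper", "lower"],
    "weight": [10.0, 20.0, 30.0, 40.0],
})

toy_ct = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ["deck"]),     # one hot encode the category
        ("num", StandardScaler(), ["weight"]),  # standardize the number
    ],
    sparse_threshold=0,  # always return a dense numpy array
)

toy_trans = toy_ct.fit_transform(toy)
print(type(toy_trans).__name__, toy_trans.shape)
print(toy_trans)
```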

In [129]:
type(X_trans)

numpy.ndarray

#### Observe the "shape" of the new `X_trans` array. In particular, note that there are 10 columns. We only had 5 features in `X` before, so can you explain why there are now 10 columns?

In [130]:
X_trans.shape

(1043, 10)

*Type your answer here*

## Train-test split

Recall that for machine learning, it is important to train the model on one set and then hold out a test set so that you can estimate how well the models you build will perform on new, unseen data.

This is accomplished by using the `train_test_split` function. Its input should be the transformed feature array (`X_trans`) and the response series (`y`). There will be four pieces of output generated: one training and one testing set for both the features and the response. We will call these `X_train`, `X_test`, `y_train`, and `y_test`.

Other arguments to the function are the size of the test set we desire (we'll use a 75/25 split, so 25% goes to the test set), and a "random state" that acts as a seed so that our results are reproducible.

In [131]:
X_train, X_test, y_train, y_test = train_test_split(X_trans, y, test_size = 0.25, random_state = 98765)
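To see what the "random state" buys us, here is a sketch on made-up data (the arrays `features` and `labels` are hypothetical). Running the split twice with the same seed gives identical pieces both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 rows, 2 feature columns, and one label per row.
features = np.arange(16).reshape(8, 2)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Same seed both times, so the shuffling is the same both times.
split_a = train_test_split(features, labels, test_size=0.25, random_state=0)
split_b = train_test_split(features, labels, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 75% of the rows, then 25% of the rows

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```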

In [132]:
X_train.shape

(782, 10)

In [133]:
X_test.shape

(261, 10)

In [134]:
y_train.shape

(782,)

In [135]:
y_test.shape

(261,)

#### How many rows ended up in the training and test sets?

*Type your answer here*

## Applying machine learning algorithms

The task of using features to predict the value of a categorical variable (in this case, whether the passenger survived) is called *classification*. So we often call a machine learning model built for this task a *classifier*.

Let's build a classifier (called `clf`) that will perform an algorithm called logistic regression.

**Important note:** for purposes of this tutorial, it would be hard to explain thoroughly how logistic regression works "under the hood" so to speak. So we won't really try. Having said that, however, I do not recommend the blind use of algorithms without knowing something about how and why they work. Some algorithms are more or less suited for certain types of data, and there are all sorts of assumptions and conditions that have to be met before we can be confident that a model is doing its job correctly. So before you go out into the world and start applying machine learning, take the time to learn about what algorithms are available and how to use them correctly.

We build a classifier object:

In [136]:
clf = LogisticRegression()

Next we fit the model using the training data.

In [137]:
clf.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

We're getting a warning, which is not the same thing as an error, but is still somewhat annoying. In this case, the warning is very explicit about why there is a warning (the default solver will be changing in a future version of this library) and how to fix it (specify a solver). Okay, sure, why not:

In [138]:
clf = LogisticRegression(solver = "lbfgs")

In [139]:
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Yay, the warning is gone.

Congratulations, you have just fit your first machine learning model!

Okay, now what? What do we do with it? Well, a model is used to make predictions. Let's start by seeing how well the model predicts on its own training data. Usually, the model should do reasonably well on the very data used to train the model.

The `predict` method will feed the data through the model and the output is a set of predictions.

In [140]:
y_pred_train = clf.predict(X_train)
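To see `fit` and `predict` in isolation, here is a toy sketch with one made-up feature (all the `toy_` names are hypothetical). Negative values are labeled 0 and positive values are labeled 1, and we then ask the model about two new values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature column and a 0/1 label per row.
toy_X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
toy_y = np.array([0, 0, 1, 1])

toy_clf = LogisticRegression(solver="lbfgs")
toy_clf.fit(toy_X, toy_y)

# Predict labels for two values the model has not seen.
toy_pred = toy_clf.predict(np.array([[-3.0], [3.0]]))
print(toy_pred)
```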

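Let's just look at the first few entries of `y_pred_train`. Remember 0 means the person didn't survive and 1 means they did survive. (A slice excludes its end point, so `0:9` returns nine predictions, numbered 0 through 8.)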

In [141]:
y_pred_train[0:9]

array([1, 1, 0, 1, 0, 1, 1, 1, 0])

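Compare this to the *actual* survival status. (Annoyingly, the output below is formatted differently. That's because `y_pred_train` is a numpy array and `y` is a Series. Slicing works the same way for both: the end point is excluded, so `0:10` returns ten items where `0:9` returned nine.) One caution: `train_test_split` shuffles the rows, so the entries of `y_pred_train` line up with the entries of `y_train`, not with the first rows of `y`. A strict row-by-row check of the predictions would compare them against `y_train`.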

In [142]:
y[0:10]

0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    0
8    1
9    0
Name: survived, dtype: int64
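Because `train_test_split` shuffles rows, a row-by-row comparison is safest when predictions are lined up with the training labels by position. Here is a minimal sketch with made-up values (`toy_true` and `toy_pred` are hypothetical stand-ins for `y_train` and `y_pred_train`):

```python
import numpy as np
import pandas as pd

# A Series whose index is out of order, as happens after a shuffled split.
toy_true = pd.Series([1, 0, 0, 1, 1, 0], index=[4, 0, 5, 2, 1, 3])
toy_pred = np.array([1, 0, 1, 1, 0, 0])

# Take the first four entries of each by position.
# .iloc ignores the index labels and counts rows from the top.
first_pred = toy_pred[0:4]
first_true = toy_true.iloc[0:4].to_numpy()

matches = first_pred == first_true
print(matches)
print(matches.sum(), "of", len(matches), "predicted correctly")
```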

#### Among the first 10 passengers, how many did our model predict correctly? Which row numbers were incorrectly predicted?

*Type your answer here*

*****

We can calculate the accuracy over all the training data using the `accuracy_score` function. It requires as input first the actual data followed by the predicted values.

In [143]:
accuracy_score(y_train, y_pred_train)

0.7864450127877238

We can also generate a confusion matrix that shows all four combinations of possible outcomes.

In [144]:
confusion_matrix(y_train, y_pred_train)

array([[388,  69],
       [ 98, 227]])

There are no labels, so it's not so clear what each of these numbers means. The true values are in the rows and the predicted values are in the columns. The following table should clarify:

Actual vs Predicted | Predicted 0 | Predicted 1 |
--------------------|-------------|-------------|
Actual 0            | 388         | 69          |
Actual 1            | 98          | 227         |



#### Using the numbers from the confusion matrix above, manually calculate the accuracy and confirm the 78.6% we got before.

In [145]:
# Calculate the accuracy manually
(388+227)/(388+227+69+98)

0.7864450127877238
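The same calculation can be written so that it works for a confusion matrix of any size: the correct predictions sit on the diagonal, so accuracy is the sum of the diagonal divided by the sum of all entries. A short sketch with `numpy`, using the numbers from the matrix above:

```python
import numpy as np

# Confusion matrix from the training data (rows: actual, columns: predicted).
cm = np.array([[388, 69],
               [98, 227]])

correct = np.trace(cm)  # sum of the diagonal: correct predictions
total = cm.sum()        # every prediction made
accuracy = correct / total

print(correct, total, accuracy)
```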

*****

Of course, the real test of a model is how well it predicts unseen data. It's easy for a model to predict well on its own training data, but it might be overfitting. So we need to do the same steps again, but for the test data.

#### Make the necessary changes to the code above to get the accuracy and confusion matrix on the test data.

In [146]:
# Use clf.predict on the test data features X_test. Store the result as y_pred_test
y_pred_test = clf.predict(X_test)

In [147]:
# Compute the accuracy on the test data. (y_test is the actual data and y_pred_test is the predicted values from the previous step)
accuracy_score(y_test, y_pred_test)

0.7854406130268199

In [148]:
# Calculate the confusion matrix
confusion_matrix(y_test, y_pred_test)

array([[132,  29],
       [ 27,  73]])

## Conclusion

We have just scratched the surface of the machine learning workflow in `scikit-learn`. To learn more, check out any number of amazing online resources and tutorials.

# Submission

Choose File -> Download as HTML 

Submit this HTML file to Canvas.