# Introduction to Machine Learning in Python 

This lesson provides an introduction to some of the common methods and terminologies used in machine learning research. We will cover areas such as data preparation and resampling, model building, and model evaluation. 

It is a prerequisite for the other lessons in the machine learning curriculum. In later lessons we explore tree-based models for prediction, neural networks for image classification, and responsible machine learning. 

## Goal - Predicting the outcome of critical care patients 

Critical care units are home to sophisticated monitoring systems, helping carers to support the lives of the sickest patients within a hospital. These monitoring systems produce large volumes of data that could be used to improve patient care.

Our goal is to predict the outcome of critical patients using physiological data available on the first day of admission to the intensive care unit. These predictions could be used for resource planning or to assist with family discussions. 

The dataset used in this lesson was extracted from the [eICU Collaborative Research Database](https://www.nature.com/articles/sdata2018178), a publicly available dataset comprising de-identified physiological data collected from critically ill patients. 

# Part 1 - Introduction

## Objectives

* What is machine learning?
* What is the relationship between machine learning, AI, and statistics?
* Understand the difference between supervised and unsupervised learning.

## Rule-based programming 

We are all familiar with the idea of applying rules to data to gain insights and make decisions. For example, we learn that human body temperature is ~37 °C, and that higher or lower temperatures can be cause for concern. 

As programmers we understand how to codify these rules. If we were developing software for a hospital to flag patients at risk of deterioration, we might create early-warning rules such as those below:

In [None]:
def has_fever(temp_c):
    if temp_c > 38:
        return True
    else:
        return False

## What is machine learning?

__Machine Learning (ML)__ is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance. 

__ML__ finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods. 

## Relationships to statistics

Machine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: __statistics draws population inferences from a sample, while machine learning finds generalisable predictive patterns__. 

Conventional statistical analyses require the a priori selection of a model most suitable for the study data set. In addition, only significant or theoretically relevant variables based on previous experience are included for analysis. In contrast, machine learning is not built on a pre-structured model; rather, the data shape the model by detecting underlying patterns. Adding informative input variables can improve the accuracy of the final model, although adding uninformative variables increases the risk of overfitting. 

## Relationships to AI and deep learning

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/AI_hierarchy.svg/399px-AI_hierarchy.svg.png)

## Machine learning approaches

Machine learning approaches are traditionally divided into three broad categories, which correspond to learning paradigms, depending on the nature of the "signal" or "feedback" available to the learning system:

* __Supervised learning:__ The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
* __Unsupervised learning:__ No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
* __Reinforcement learning:__ A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximise.

Although each algorithm has advantages and limitations, no single algorithm works for all problems. 
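
As a rough illustration of the first two paradigms, the sketch below fits a supervised classifier and an unsupervised clustering model to the same synthetic data. The data, the choice of `LogisticRegression` and `KMeans`, and all variable names are assumptions made for this example; they are not part of the lesson dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# synthetic data: two well-separated groups of 20 points (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
X_demo = np.vstack([group_a, group_b])
y_demo = np.array([0] * 20 + [1] * 20)

# supervised: the labels (y_demo) are provided during training
clf = LogisticRegression().fit(X_demo, y_demo)
supervised_pred = clf.predict(X_demo)

# unsupervised: only the inputs are provided; the algorithm finds groups itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_ids = km.labels_

print('Supervised accuracy:', np.mean(supervised_pred == y_demo))
print('Cluster ids in first group:', set(cluster_ids[:20].tolist()))
print('Cluster ids in second group:', set(cluster_ids[20:].tolist()))
```

Note that the cluster ids are arbitrary: the clustering model can recover the two groups, but it has no way of knowing which group corresponds to which label.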

## Machine learning models

A __machine learning model__ is a type of mathematical model that, after being "trained" on a given dataset, can be used to make predictions or classifications on new data. During training, a learning algorithm iteratively adjusts the model's internal parameters to minimise errors in its predictions. By extension, the term "model" can refer to several levels of specificity, from a general class of models and their associated learning algorithms to a fully trained model with all its internal parameters tuned. 

Various types of models have been used and researched for machine learning systems. Picking the best model for a task is called model selection.

* __Artificial neural networks (ANNs):__ are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/300px-Colored_neural_network.svg.png)

* __Decision trees:__ decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning.

![](https://i0.wp.com/why-change.com/wp-content/uploads/2021/11/Decision-Tree-elements-2.png?resize=715%2C450&ssl=1)

* __Support-vector machines (SVMs):__ also known as support-vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

![](https://i.ytimg.com/vi/ny1iZ5A8ilA/maxresdefault.jpg)

* __Regression analysis:__ encompasses a large variety of statistical methods to estimate the relationship between input variables (features) and an associated outcome. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularisation methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression, logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher-dimensional space.

![](https://www.questionpro.com/blog/wp-content/uploads/2019/03/Regression-Analysis_1.jpg)

* __Bayesian networks:__ a Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 

![](https://www.nbertagnolli.com/assets/Bayes_Nets/figure_01.png) 

* __Gaussian processes:__ a Gaussian process is a stochastic process in which every finite collection of the random variables in the process has a multivariate normal distribution, and it relies on a pre-defined covariance function, or kernel, that models how pairs of points relate to each other depending on their locations. Gaussian processes are popular surrogate models in Bayesian optimisation used to do hyperparameter optimisation.

![](https://www.researchgate.net/profile/Florent-Leclercq/publication/327613136/figure/fig1/AS:749406701776896@1555683889137/Illustration-of-Gaussian-process-regression-in-one-dimension-for-the-target-test.png)

* __Genetic algorithms:__ a genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.

![](https://miro.medium.com/v2/resize:fit:1400/1*BYDJpa6M2rzWNSurvspf8Q.png)

## Key points

* Machine learning borrows heavily from fields such as statistics and computer science.
* In machine learning, models learn rules from data.
* In supervised learning, the target in our training data is labelled.

# Part 2 - Data preparation

__Objectives:__

* What are the common steps in data preparation?
* Why do we partition data at the start of a project?
* What is the purpose of setting a random state when partitioning?
* Should we impute missing values before or after partitioning?
* Explore characteristics of our dataset.
* Partition data into training and test sets.
* Encode categorical values.
* Use scaling to pre-process features.

## Sourcing and accessing data

Machine learning helps us to find patterns in data, so sourcing and understanding data is key. Unsuitable or poorly managed data will lead to a poor project outcome, regardless of the modelling approach. 

We will use an open access subset of the [eICU Collaborative Research Database](https://eicu-crd.mit.edu/about/eicu/), a publicly available dataset comprising deidentified physiological data collected from critically ill patients. For simplicity, we will be working with a pre-prepared CSV file that comprises data extracted from a [demo version of the dataset](https://physionet.org/content/eicu-crd-demo/2.0.1/).

Let's begin by loading this data:

In [None]:
import pandas as pd

# load the data
cohort = pd.read_csv('./eicu_cohort.csv')
cohort.head()

## Knowing the data 

Before moving ahead on a project, it is important to understand the data. Having someone with domain knowledge - and ideally first hand knowledge of the data collection process - helps us to design a sensible task and to use data effectively. 

Summarising data is an important first step. We will want to know aspects of the data such as: extent of missingness; data types; numbers of observations. One common step is to view summary characteristics (for example, see [Table 1](https://www.nature.com/articles/s41746-018-0029-1/tables/1) of the paper by Rajkomar et al.).

Let's generate a similar table for ourselves:

In [None]:
#!pip install tableone

In [None]:
from tableone import tableone

# rename columns
rename = {
    "unabridgedhosplos": "length of stay",
    "meanbp": "mean blood pressure",
    "wbc": "white cell count"
}

# view summary 
t1 = tableone(cohort, groupby="actualhospitalmortality", rename=rename)
t1

__Exercise:__

* What is the approximate percent mortality in the eICU cohort?
* Which variables appear noticeably different in the "Alive" and "Expired" groups?
* How does the in-hospital mortality differ between the eICU cohort and the ones in [Rajkomar et al](https://www.nature.com/articles/s41746-018-0029-1/tables/1)?

## Encoding

It is often the case that our data includes categorical values. In our case, for example, the binary outcome we are trying to predict (in-hospital mortality) is recorded as "ALIVE" and "EXPIRED". Some models can cope with taking this text as input, but many cannot. We can use label encoding to convert the categorical values to numerical representations. 

In [None]:
# check current data types of each column
cohort.dtypes

In [None]:
cohort

* `object`: the default type to store text data in pandas
* `float64`: floating point numbers
* `int64`: integers

Check the documentation for [pandas.DataFrame.dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
# convert column "actualhospitalmortality" to a categorical type
categories = ['ALIVE', 'EXPIRED']
cohort['actualhospitalmortality'] = pd.Categorical(cohort['actualhospitalmortality'], categories=categories)
cohort.dtypes

Check the documentation for [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html).

In [None]:
# add the encoded value to a new column
cohort['actualhospitalmortality_enc'] = cohort['actualhospitalmortality'].cat.codes
cohort[['actualhospitalmortality', 'actualhospitalmortality_enc']].head()
cohort.dtypes

Let's encode the gender in the same way:

In [None]:
cohort['gender'] = pd.Categorical(cohort['gender'])
cohort['gender'] = cohort['gender'].cat.codes

## Partitioning

Typically we will want to split our data into a training set and "held-out" test set. The training set is used for building our model and our test set is used for evaluation. A split of ~70% training, 30% test is common. 

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/train_test.png)

To ensure reproducibility, we should set the random state of the splitting method. This means that Python's random number generator will produce the same "random" split in the future.

In [None]:
#!pip install scikit-learn

In [None]:
from sklearn.model_selection import train_test_split

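# drop both the outcome and its encoded copy so the target does not leak into the features
x = cohort.drop(['actualhospitalmortality', 'actualhospitalmortality_enc'], axis=1)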
y = cohort['actualhospitalmortality']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)
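
To see what setting the random state does, the sketch below splits a toy array several times. The array and variable names are assumptions made for illustration only; the call to `train_test_split` mirrors the one used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: ten observations numbered 0 to 9 (illustrative only)
values = np.arange(10)

# two splits with the same random state
train_a, test_a = train_test_split(values, train_size=0.7, random_state=42)
train_b, test_b = train_test_split(values, train_size=0.7, random_state=42)

# a third split with a different random state
train_c, test_c = train_test_split(values, train_size=0.7, random_state=7)

print('Same random state, same test set:', np.array_equal(test_a, test_b))
print('Test set with random_state=42:', test_a)
print('Test set with random_state=7: ', test_c)
```

Repeating a split with the same random state returns the same partition, which is what makes later results comparable between runs.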

## Missing data

Certain types of models - for example some decision trees - are able to implicitly handle missing data. For logistic regression, we need to impute values. We will take a simple approach of replacing with the median. 

With physiological data, imputing the median typically implies that the missing observation is not a cause for concern.

To avoid data leaking between our training and test sets, we take the median from the training set only. The training median is then used to impute missing values in the held-out test set. 

In [None]:
# compute medians on the training set only, then use them to impute both sets
train_median = x_train.median()
x_train = x_train.fillna(train_median)
x_test = x_test.fillna(train_median)

It is often the case that data is not missing at random. For example, the presence of blood sugar observations may indicate suspected diabetes. To use this information, we can choose to create missing data features comprising binary "is missing" flags. 
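
A minimal sketch of such a flag is shown below, using a small made-up table. The column names (`heartrate`, `glucose`, `glucose_missing`) and values are hypothetical and are not taken from the eICU cohort.

```python
import numpy as np
import pandas as pd

# toy data with missing glucose values (hypothetical columns and values)
toy = pd.DataFrame({
    'heartrate': [80.0, 95.0, 110.0, 72.0],
    'glucose': [np.nan, 140.0, np.nan, 98.0],
})

# binary flag: 1 where the original value was missing, 0 otherwise
toy['glucose_missing'] = toy['glucose'].isna().astype(int)

# impute after creating the flag, so the "was missing" information is kept
toy['glucose'] = toy['glucose'].fillna(toy['glucose'].median())

print(toy)
```

In a real project the median would be taken from the training set only, as described above; the toy table is imputed in one step purely to keep the example short.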

## Normalisation

Lastly, normalisation (scaling variables so that they span consistent ranges) can be important, particularly for models that rely on distance-based optimisation metrics. 

As with creating train and test splits, it is a common enough task that there are plenty of pre-built functions for us to choose from. We will choose the [Min-Max Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) from the sklearn package, which scales each feature between zero and one. 

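$$
  x_{std} = \frac{x - x_{min}}{x_{max} - x_{min}}
$$
$$
  x_{scaled} = x_{std} \cdot (max - min) + min
$$

Here $x_{min}$ and $x_{max}$ are the minimum and maximum of the feature in the data used to fit the scaler, and $(min, max)$ is the target `feature_range`, which defaults to $(0, 1)$.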

In [None]:
# define the scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

#fit the scaler on the training dataset
scaler.fit(x_train)

# scale the training set
x_train = scaler.transform(x_train)

# scale the test set 
x_test = scaler.transform(x_test)

* __MinMaxScaler.fit:__ compute the minimum and maximum to be used for later scaling.
* __MinMaxScaler.transform:__ scale features of dataset according to feature_range, where feature_range=(min, max).

Outliers in features can have a negative impact on the normalisation process - they can essentially squash non-outliers into a small space - so they may need special treatment (for example, a [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler)).
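
The effect of an outlier can be seen in the sketch below, which scales a toy feature containing one extreme value. The data are made up for illustration, and `RobustScaler` is used with its default settings (centring on the median and scaling by the interquartile range).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# toy feature with one extreme value (illustrative only)
feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(feature)
robust_scaled = RobustScaler().fit_transform(feature)

# min-max scaling by hand, following the formula for x_std
by_hand = (feature - feature.min()) / (feature.max() - feature.min())

print('Min-max:', minmax_scaled.ravel().round(3))
print('Robust: ', robust_scaled.ravel().round(3))
```

With min-max scaling, the four ordinary values are squeezed into roughly the bottom 3% of the range. With the robust scaler they stay spread between -1 and 0.5, and only the outlier lands far from the rest.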

## Key points

* Data pre-processing is arguably the most important task in machine learning.
* Data is typically partitioned into training and test sets.
* Setting random states helps to promote reproducibility. 

# Part 3 - Learning 

__Objectives:__

* How do machines learn?
* How can machine learning help us to make predictions?
* Why is it important to be able to quantify the error in our models?
* What is an example of a loss function?
* Understand the importance of quantifying error.
* Code a linear regression model that takes inputs, weights, and bias.
* Code a loss function that quantifies model error.

## How do machines learn?

How do humans learn? Typically we are given examples and we learn rules through trial and error. Machines aren't that different! In the context of machine learning, we talk about how a model "fits" to the data.

Our model has a number of tweakable parameters. We need to find the optimal values for those parameters such that our model outputs the "best" predictions for a set of input variables. 

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/ml_model.png)

## Loss functions

Finding the best model means defining "best". We need to have some way of quantifying the difference between a "good" model (capable of making useful predictions) vs a "bad" model (not capable of making useful predictions).

Loss functions are crucial for doing this. They allow us to quantify how closely our predictions fit to the known target values. You will hear "objective function", "error function", and "cost function" used in a similar way.

Mean squared error is a common example of a loss function, often used for linear regression. For each prediction, we measure the distance between the known target value ($y$) and our prediction ($\hat{y}$), and then we take the square.

In [None]:
import pandas as pd

# create sample labelled data
data = {
    'x': [1,2,3,4,5],
    'y': [-0.5,1,2,4,7]
}
df = pd.DataFrame(data)

# add predictions
df['y_hat'] = [0,2,4,6,8]

# plot the data
ax = df.plot(x='x', y='y', kind='scatter', xlim=[0,6], ylim=[-1,9])

# plot the predictions
ax.plot(df['x'], df['y_hat'], color='blue')

# plot error
ax.vlines(x=df['x'], ymin=df['y'], ymax=df['y_hat'], color='red', linestyle='dashed')
ax.text(x=3.1, y=3, s='Error')
ax.set_title('Prediction error')

The further away from the data points our line gets, the bigger the error. Our best model is the one with the smallest error. Mathematically, we can define the mean squared error as:

$$
  mse = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y_i})^2
$$

* $mse$ is the Mean Squared Error.
* $y_i$ is the actual value.
* $\hat{y_i}$ is the predicted value.
* $\sum$ means we are taking the sum of the squared differences.
* $n$ is the number of observations, so $\frac{1}{n}$ means we are taking the mean.

We could implement this in our code as follows:

In [None]:
import numpy as np 

def loss(y, y_hat):
    distances = y - y_hat
    squared_distances = np.square(distances)
    return np.mean(squared_distances)
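
As a quick check, the sketch below applies the same calculation to the sample targets and predictions plotted earlier. The function is repeated so that the cell runs on its own; the names `y_true` and `y_pred` are introduced here for illustration.

```python
import numpy as np

# same calculation as the loss function above, repeated so this cell is self-contained
def loss(y, y_hat):
    return np.mean(np.square(y - y_hat))

# sample targets and predictions from the plot above
y_true = np.array([-0.5, 1, 2, 4, 7])
y_pred = np.array([0, 2, 4, 6, 8])

# by hand: (0.25 + 1 + 4 + 4 + 1) / 5 = 2.05
print(loss(y_true, y_pred))
```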

## Minimising the error

Our goal is to find the "best" model. We have defined best as being the model with parameters that give us the smallest mean squared error. We can write this as:

$$
  \operatorname{argmin} \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2
$$

Let's stop and look at what this loss function means. We'll plot the squared error for a range of values to demonstrate how loss scales as the difference between $y$ and $\hat{y}$ increases.

In [None]:
import matplotlib.pyplot as plt

x = np.arange(-50, 50, 0.05)
y = np.square(x)

plt.plot(x, y)
plt.xlabel('Difference between y and y_hat')
plt.ylabel('Loss (squared error)')

As we can see, our loss rapidly increases as predictions ($\hat{y}$) move away from the true values ($y$). The result is that outliers have a strong influence on our model fit. 
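
To make the influence of outliers concrete, the sketch below compares the mean squared error with the mean absolute error before and after adding one large error. The error values are made up, and the mean absolute error is brought in only as a point of comparison; it is not otherwise used in this lesson.

```python
import numpy as np

# typical prediction errors (illustrative values)
errors = np.array([1.0, -1.0, 2.0, -2.0])

# the same errors with one outlier added
errors_outlier = np.append(errors, 20.0)

mse_before = np.mean(errors ** 2)
mse_after = np.mean(errors_outlier ** 2)
mae_before = np.mean(np.abs(errors))
mae_after = np.mean(np.abs(errors_outlier))

print(f'Mean squared error:  {mse_before} -> {mse_after}')
print(f'Mean absolute error: {mae_before} -> {mae_after}')
```

A single outlier raises the squared loss from 2.5 to 82, a far larger relative change than for the absolute loss (1.5 to 5.2), so a model fitted by minimising squared error is pulled strongly towards outlying points.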

## Optimisation

In machine learning, there is typically a training step where an algorithm is used to find the optimal set of model parameters (i.e. those parameters that give the minimum possible error). This is the essence of machine learning.

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/ml_model_loss.png)

There are many approaches to optimisation. [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) is a popular approach.
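
A minimal sketch of gradient descent is shown below, fitting a straight line $\hat{y} = wx + b$ to the sample data from earlier in this part by repeatedly stepping against the gradient of the mean squared error. The learning rate, number of steps, and starting values are arbitrary choices made for illustration, not recommended settings.

```python
import numpy as np

# sample data from earlier in this part
x_demo = np.array([1, 2, 3, 4, 5], dtype=float)
y_demo = np.array([-0.5, 1, 2, 4, 7])

w, b = 0.0, 0.0          # initial parameter values (arbitrary)
learning_rate = 0.01     # step size (arbitrary choice)
n_steps = 5000

for _ in range(n_steps):
    y_hat = w * x_demo + b
    error = y_hat - y_demo
    # gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x_demo)
    grad_b = 2 * np.mean(error)
    # move each parameter a small step in the direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f'w = {w:.3f}, b = {b:.3f}')
```

For these five points the least-squares solution is $w = 1.8$, $b = -2.7$, and the loop converges towards those values.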

__Exercise:__

* What does a loss function quantify?
* What is an example of a loss function?
* What is happening when a model is trained?

Now that we've touched on how machines learn, we'll tackle the problem of predicting the outcome of patients admitted to intensive care units in hospitals across the United States. 

## Key points

* Loss functions allow us to define a good model.
* $y$ is a known target. $\hat{y}$ (`y_hat` in code) is a prediction.
* Mean squared error is an example of a loss function.
* After defining a loss function, we search for the optimal solution in a process known as "training".
* Optimisation is at the heart of machine learning. 

# Part 4 - Modelling

__Objectives:__

* Broadly speaking, when talking about regression and classification, how does the prediction target differ?
* Would linear regression be most useful for a regression or classification task? How about logistic regression?
* Use a linear regression model for prediction.
* Use a logistic regression model for prediction.
* Set a decision boundary to predict an outcome from a probability.

## Regression vs Classification

Predicting one or more classes is typically referred to as __classification__. The task of predicting a continuous variable (for example, length of hospital stay), on the other hand, is typically referred to as __regression__.

Note that "regression models" can be used for both regression tasks and classification tasks. 

We will begin with a linear regression, a type of model borrowed from statistics that has all of the hallmarks of machine learning, which can be written as:

$$
  \hat{y} = wX + b
$$

Our predictions can be denoted by $\hat{y}$ (pronounced "y hat") and our explanatory variables (or "features") denoted by $X$. In our case, we will use a single feature: the APACHE-IV score, a measure of severity of illness. 

There are two parameters of the model that we would like to learn from the training data:

* $w$: weight.
* $b$: bias.
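
Written as code, the model is a single line. The sketch below assumes one input feature; the function name `linear_model` and the parameter values are chosen for illustration (they reproduce the sample predictions used in Part 3).

```python
import numpy as np

def linear_model(X, w, b):
    """Return predictions y_hat = w * X + b for a single feature."""
    return w * X + b

X_demo = np.array([1, 2, 3, 4, 5])

# with w=2 and b=-2 the predictions are 0, 2, 4, 6, 8
print(linear_model(X_demo, w=2, b=-2))
```

Training the model means finding the values of `w` and `b` that minimise the loss, rather than choosing them by hand as done here.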

Could we use a linear regression for our classification task? Let's try fitting a line to our outcome data.

In [None]:
# import the regression model 
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# use a single feature (apache score)
# note: remove the reshape if fitting to more than one input variable
X = cohort.apachescore.values.reshape(-1, 1)
y = cohort.actualhospitalmortality_enc.values

# fit the model to our data 
reg = reg.fit(X, y)

# generate some data to predict y values
buffer = 0.2 * max(X)
X_fit = np.linspace(min(X)-buffer, max(X)+buffer, num=50)
y_fit = reg.predict(X_fit)

# plot
plt.scatter(X, y, color='black', marker='x')
plt.plot(X_fit, y_fit, color='red', linewidth=2)
plt.show()

Linear regression places a line through a set of data points that minimises the error between the line and the points. It is difficult to see how a meaningful threshold could be set to predict the binary outcome in our task. The predicted values can exceed our range of outcomes. 

## Sigmoid function

The sigmoid function (also known as a logistic function) comes to our rescue. This function gives an "s" shaped curve that can take a number and map it into a value between 0 and 1:

$$
  f: \mathbb{R} \to (0,1)
$$

The sigmoid function can be written as:

$$
  f(x) = \frac{1}{1+e^{-x}}
$$

Let's take a look at a curve generated by this function:

In [None]:
def sigmoid(x, k=0.1):
    """
    Sigmoid function.
    Adjust k to set slope.
    """
    s = 1/(1+np.exp(-x/k))
    return s

# generate some x values 
x = np.linspace(-1, 1, 50)

plt.plot(x, sigmoid(x))
plt.show()

We can use this to map our linear regression to produce output values that fall between 0 and 1.

$$
  f(x) = \frac{1}{1 + e^{-(wX+b)}}
$$

As an added benefit, we can interpret the output values as a probability. The probability relates to the positive class (the outcome with value "1"), which in our case is in-hospital mortality ("EXPIRED").
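
The sketch below passes the output of a linear model through the sigmoid function. The weight and bias are hypothetical values picked so that the numbers are easy to follow; they are not fitted to the eICU data.

```python
import numpy as np

def predict_probability(X, w, b):
    """Apply the sigmoid function to the linear model w * X + b."""
    z = w * X + b
    return 1 / (1 + np.exp(-z))

# hypothetical parameter values, chosen only for illustration
w_demo, b_demo = 0.05, -4.0

# three example severity scores (made-up values)
scores_demo = np.array([20, 80, 140])
probs_demo = predict_probability(scores_demo, w_demo, b_demo)

print(probs_demo.round(3))
```

With these parameters the linear part evaluates to -3, 0 and 3, which the sigmoid maps to probabilities of about 0.05, 0.5 and 0.95.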

## Logistic regression

Logistic regressions are powerful models that often outperform more sophisticated machine learning models. In machine learning studies, it is typical to include performance of a logistic regression model as a baseline.

We need to find the parameters for the best-fitting logistic model given our data. As before, we do this with the help of a loss function that quantifies error. Our goal is to find the parameters of the model that minimise the error. With this model, we no longer use least squares due to the model's non-linear properties. Instead we will use log loss. 
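
For a binary outcome, log loss can be written as $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$, where $p_i$ is the predicted probability of the positive class. The sketch below implements this directly. The function name `binary_log_loss` and the example values are assumptions for illustration, and the function assumes every probability lies strictly between 0 and 1.

```python
import numpy as np

def binary_log_loss(y, p):
    """Mean log loss for binary targets y (0 or 1) and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = [0, 0, 1, 1]

# probabilities that lean towards the correct class
good_probs = [0.1, 0.2, 0.8, 0.9]

# probabilities that lean towards the wrong class
bad_probs = [0.9, 0.8, 0.2, 0.1]

print('Loss for good predictions:', round(binary_log_loss(y_true, good_probs), 3))
print('Loss for bad predictions: ', round(binary_log_loss(y_true, bad_probs), 3))
```

Confident predictions that turn out to be wrong are penalised heavily, which is the behaviour we want when the model outputs probabilities.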

## Training (or fitting) the model

As is typically the case when using machine learning packages, we don't need to code the loss function ourselves. The function is implemented as part of our machine learning package (in this case [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)). Let's try fitting a Logistic Regression to our data.

__Exercise:__ 

* Following the previous example for a linear regression, fit a logistic regression to your data and create a new plot. How do the predictions differ from before? Hint: `from sklearn.linear_model import LogisticRegression`.

In [None]:
# import the regression model 
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression

reg = LogisticRegression()

# use a single feature (apache score)
# note: remove the reshape if fitting to more than one input variable
X = cohort.apachescore.values.reshape(-1, 1)
y = cohort.actualhospitalmortality_enc.values

# fit the model to our data 
reg = reg.fit(X, y)

# generate some data to predict y values
buffer = 0.2 * max(X)
X_fit = np.linspace(min(X)-buffer, max(X)+buffer, num=50)
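# use the predicted probability of the positive class, so the plot shows the fitted curve
y_fit = reg.predict_proba(X_fit)[:, 1]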

# plot
plt.scatter(X, y, color='black', marker='x')
plt.plot(X_fit, y_fit, color='red', linewidth=2)
plt.show()

## Decision boundary

Now that our model is able to output the probability of our outcome, we can set a decision boundary for the classification task. For example, we could classify probabilities of <0.5 as "ALIVE" and >=0.5 as "EXPIRED". Using this approach, we can predict outcomes for a given input. 

In [None]:
x = [[90]]

outcome = reg.predict(x)
probs = reg.predict_proba(x)[0]
print(f'For x={x[0][0]}, we predict an outcome of "{outcome[0]}".\n'
      f'Class probabilities (0, 1): {round(probs[0], 2), round(probs[1], 2)}.')
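
The decision boundary does not have to be 0.5. The sketch below applies two different thresholds to a handful of made-up probabilities; the function name `classify` and the values are hypothetical.

```python
import numpy as np

# hypothetical predicted probabilities of the positive class ("EXPIRED")
probs_expired = np.array([0.05, 0.35, 0.50, 0.72])

def classify(probs, threshold=0.5):
    """Label probabilities at or above the threshold as EXPIRED, otherwise ALIVE."""
    return np.where(probs >= threshold, 'EXPIRED', 'ALIVE')

print(classify(probs_expired))
print(classify(probs_expired, threshold=0.3))
```

Lowering the threshold flags more patients as high risk, trading fewer missed cases for more false alarms. Which trade-off is appropriate depends on how the predictions will be used.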

## Key points

* Linear regression is a popular model for regression tasks.
* Logistic regression is a popular model for classification tasks.
* Probabilities can be mapped to a predicted class using a decision boundary. 

# Part 5 - Validation

__Objectives:__

* What is meant by model accuracy?
* What is the purpose of a validation set?
* What are two types of cross validation?
* What is overfitting?
* Train a model to predict patient outcomes on a held-out test set.
* Use cross validation as part of our model training process. 

## Accuracy

One measure of the performance of a classification model is accuracy. Accuracy is defined as the overall proportion of correct predictions. If, for example, we take 50 shots and 40 of them hit the target, then our accuracy is 0.8 (40/50). 

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/japan_ren_hayakawa.jpg)

Accuracy can therefore be defined by the formula below:

$$
  \text{Accuracy} = \frac{\text{Correct predictions}}{\text{All predictions}}
$$

What is the accuracy of our model at predicting in-hospital mortality?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 

# define features and outcome
features = ['apachescore']
outcome = ['actualhospitalmortality_enc']

# partition data into training and test sets
X = cohort[features]
y = cohort[outcome]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

# restructure data for input into model
# note: remove the reshape if fitting to >1 input variable
x_train = x_train.values.reshape(-1, 1)
y_train = y_train.values.ravel()
x_test = x_test.values.reshape(-1, 1)
y_test = y_test.values.ravel()

# train model
reg = LogisticRegression(random_state=0)
reg.fit(x_train, y_train)

# generate prediction
y_hat_train = reg.predict(x_train)
y_hat_test = reg.predict(x_test)

# accuracy on training set
acc_train = np.mean(y_hat_train == y_train)
print(f'Accuracy on training set: {acc_train: .2f}')

# accuracy on test set
acc_test = np.mean(y_hat_test == y_test)
print(f'Accuracy on test set: {acc_test: .2f}')

* Note: The `array.ravel()` method in NumPy is used to return a contiguous flattened array.

There was a slight drop in performance on our test set, but that is to be expected.

## Validation set

Machine learning is iterative by nature. We want to improve our model, tuning and evaluating as we go. This leads us to a problem. Using our test set to iteratively improve our model would be cheating. It is supposed to be "held out", not used for training. So what do we do?

The answer is that we typically partition off part of our training set to use for validation. The "validation set" can be used to iteratively improve our model, allowing us to save our test set for the __final__ evaluation.

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/training_val_set.png)
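
One simple way to create a validation set is to call `train_test_split` twice, as sketched below on toy data. The array contents, the variable names, and the 60/20/20 proportions are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 observations with a binary outcome (illustrative only)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100) % 2

# first split: hold out 20 observations as the test set
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=20, random_state=42)

# second split: carve a validation set of 20 out of the remaining data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=20, random_state=42)

print('Training set size:  ', len(X_train_demo))
print('Validation set size:', len(X_val_demo))
print('Test set size:      ', len(X_test_demo))
```

The model is then fitted on the training portion, tuned against the validation portion, and evaluated on the test portion once at the end.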

## Cross validation

Why stop at one validation set? With sampling, we can create many training sets and many validation sets, each slightly different. We can then average our findings over the partitions to give an estimate of the model's predictive performance. 

The family of resampling methods used for this is known as "__cross validation__". It turns out that one major benefit to cross validation is that it helps us to build more robust models.

If we train our model on a single set of data, the model may learn rules that are overly specific (e.g. "all patients aged 63 years survive"). These rules will not generalise well to unseen data. When this happens, we say our model is "__overfitted__".

If we train on multiple, subtly-different versions of the data, we can identify rules that are likely to generalise better outside our training set, helping to avoid overfitting. 

Two popular cross-validation methods:

* K-fold cross validation
* Leave-one-out cross validation

## K-fold cross validation

In K-fold cross validation, "K" indicates the number of times we split our data into training/validation sets. With 5-fold cross validation, for example, we create 5 separate training/validation sets. 

![](https://carpentries-incubator.github.io/machine-learning-novice-python/fig/k_fold_cross_val.png)

With K-fold cross validation, we select our model to evaluate and then:

1. Partition the training data into a training set and a validation set. An 80%, 20% split is common.
2. Fit the model to the training set and make a record of the optimal parameters.
3. Evaluate performance on the validation set.
4. Repeat the process K times (5 in this example), then average the parameter and performance values.
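
The steps above can be sketched as an explicit loop using scikit-learn's `KFold`. The synthetic data, the noise level, and the variable names are assumptions made for illustration; the loop records only the accuracy in each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# synthetic data (illustrative only): one feature, outcome more likely at high values
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(0, 15, size=200) > 50).astype(int)

# step 1: define five training/validation partitions (80% / 20% each)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []

for train_idx, val_idx in kfold.split(X_demo):
    # step 2: fit the model to the training portion of this fold
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # step 3: evaluate on the validation portion of this fold
    y_hat_val = model.predict(X_demo[val_idx])
    fold_scores.append(np.mean(y_hat_val == y_demo[val_idx]))

# step 4: average performance across the folds
print('Accuracy per fold:', np.round(fold_scores, 2))
print('Mean accuracy:', round(float(np.mean(fold_scores)), 2))
```

Each observation appears in exactly one validation fold, so every data point is used for both training and validation across the five iterations.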

When creating our training and test sets, we needed to be careful to avoid data leaks. The same applies when creating training and validation sets. We can use a `Pipeline` object to help manage this issue: the pipeline refits preprocessing steps such as scaling on the training portion of each split.

In [None]:
from numpy import mean, std
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# define dataset
X = x_train
y = y_train

# define the pipeline 
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance as percentages
print('Cross-validation accuracy, mean (std): %.2f%% (%.2f%%)' % (mean(scores)*100, std(scores)*100))

Leave-one-out cross validation is the same idea, except that we have many more folds. In fact, we have one fold for each data point. In each fold we leave out one data point for validation and use all of the other points for training.
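
A minimal sketch of leave-one-out cross validation, assuming a small synthetic dataset (40 points from `make_classification`) so that fitting one model per data point stays fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# small synthetic dataset (assumption): 40 points means 40 folds
X, y = make_classification(n_samples=40, n_features=4, random_state=0)

# each fold validates on a single held-out point
loo = LeaveOneOut()
loo_scores = cross_val_score(LogisticRegression(), X, y,
                             scoring='accuracy', cv=loo)

# each score is 0 or 1: the single held-out point is either right or wrong
print(f'Number of folds: {len(loo_scores)}')
print(f'Mean accuracy: {loo_scores.mean():.2f}')
```

Because one model is fitted per data point, this approach becomes expensive on large datasets.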

## Key points

* Validation sets are used during model development, allowing models to be tested prior to testing on a held-out set.
* Cross-validation is a resampling technique that creates multiple validation sets.
* Cross-validation can help to avoid overfitting. 

# Part 6 - Evaluation

__Objectives:__

* What kind of values go into a confusion matrix?
* What do the letters AUROC stand for?
* Does an AUROC of 0.5 indicate our predictions were good, bad, or average?
* In the context of evaluating performance of a classifier, what is TP?
* Create a confusion matrix for a predictive model.
* Use the confusion matrix to compute popular performance metrics.
* Plot a ROC curve.

## Evaluating a classification task

We trained a machine learning model to predict the outcome of patients admitted to intensive care units. As there are two outcomes, we refer to this as a "binary" classification task. We are now ready to evaluate the model on our held-out test set. 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define features and outcome
features = ['apachescore']
outcome = ['actualhospitalmortality_enc']

# partition data into training and test sets
X = cohort[features]
y = cohort[outcome]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

# restructure data for input into model
# note: remove the reshape if fitting to >1 input variable
x_train = x_train.values.reshape(-1, 1)
y_train = y_train.values.ravel()
x_test = x_test.values.reshape(-1, 1)
y_test = y_test.values.ravel()

# train model
reg = LogisticRegression(random_state=0)
reg.fit(x_train, y_train)

# generate predictions
y_hat_test = reg.predict(x_test)
y_hat_test_proba = reg.predict_proba(x_test)

Each prediction is assigned a probability of belonging to the positive class. For example, the first 10 probabilities are:

In [None]:
probs = y_hat_test_proba[:,1][:10]
rounded_probs = [round(x,2) for x in probs]
print(rounded_probs)

These probabilities correspond to the following predictions, either a "0" ("ALIVE") or a "1" ("EXPIRED"):

In [None]:
print(y_hat_test[:10])
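
The link between probabilities and predictions is a threshold. For a binary logistic regression, `predict` is equivalent to labelling a case "1" when its predicted probability exceeds 0.5. The sketch below uses hypothetical probabilities (not output from our model) to show how the labels change with the threshold.

```python
# hypothetical predicted probabilities of the positive class
probs = [0.05, 0.48, 0.51, 0.92]

# default behaviour: label as 1 when the probability exceeds 0.5
preds_default = [int(p > 0.5) for p in probs]
print(preds_default)  # [0, 0, 1, 1]

# a lower threshold flags more patients as high risk
preds_lower = [int(p > 0.4) for p in probs]
print(preds_lower)  # [0, 1, 1, 1]
```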

In comparison with the known outcomes, we can put each prediction into one of the following categories:

* __True positive__: we predict "1" ("EXPIRED") and the true outcome is "1".
* __True negative__: we predict "0" ("ALIVE") and the true outcome is "0".
* __False positive__: we predict "1" ("EXPIRED") and the true outcome is "0".
* __False negative__: we predict "0" ("ALIVE") and the true outcome is "1". 

In [None]:
print(y_test[:10])
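
As a sketch of how the four categories are tallied, the example below counts them for a short pair of hypothetical label lists (`y_true` and `y_pred` are illustrative, not taken from the cohort).

```python
# hypothetical outcomes and predictions (0 = "ALIVE", 1 = "EXPIRED")
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, truth 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, truth 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, truth 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, truth 1

print(f'TP={tp}, TN={tn}, FP={fp}, FN={fn}')  # TP=2, TN=4, FP=1, FN=1
```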

## Confusion matrices

It is common practice to arrange these outcome categories into a "__confusion matrix__", which is a grid that records our predictions against the ground truth. For a binary outcome, confusion matrices are organised as follows:

|| __Negative (predicted)__ | __Positive (predicted)__ |
|:------|:----:|:----:|
| Negative (actual) | __TN__ | FP |
| Positive (actual) | FN | __TP__ |

The sum of the cells is the total number of predictions. The diagonal from top left to bottom right indicates correct predictions. Let's visualise the results of the model in the form of a confusion matrix:

In [None]:
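# import the metrics module and the plotting library
import matplotlib.pyplot as plt
from sklearn import metrics

# raw counts: rows are actual classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_test, y_hat_test)
print(confusion)

# plot the confusion matrix with readable class labels
class_names = cohort['actualhospitalmortality'].cat.categories

disp = metrics.ConfusionMatrixDisplay.from_estimator(
    reg, x_test, y_test, display_labels=class_names,
    cmap=plt.cm.Blues
)
plt.show()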

We have two columns and rows because we have a binary outcome, but you can also extend the matrix to plot multi-class classification predictions. If we had more output classes, the number of columns and rows would match the number of classes. 
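
To pull the individual cells out of a binary confusion matrix, the array returned by `confusion_matrix` can be flattened. The sketch below assumes hypothetical labels and predictions rather than our model's output.

```python
from sklearn import metrics

# hypothetical outcomes and predictions (0 = "ALIVE", 1 = "EXPIRED")
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# for a binary outcome, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

The order matches the table layout: top row first (TN, FP), then the bottom row (FN, TP).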

## Accuracy

Accuracy is the overall proportion of correct predictions. Think of a dartboard. How many shots did we take? How many did we hit? Divide one by the other and that's the accuracy.

Accuracy can be written as:

$$
  Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
$$

__What was the accuracy of our model?__

In [None]:
acc = metrics.accuracy_score(y_test, y_hat_test)
print(f'Accuracy (model) = {acc:.2f}')

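For comparison, what accuracy would we achieve by predicting "0" for every patient?

In [None]:
import numpy as np

# baseline: predict "0" ("ALIVE") for every patient
zeros = np.zeros(len(y_test))
acc = metrics.accuracy_score(y_test, zeros)
print(f'Accuracy (zeros) = {acc:.2f}')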

The problem with accuracy as a metric is that it is heavily influenced by prevalence of the positive outcome: because the proportion of 1s is relatively low, classifying everything as 0 is a safe bet. 

We can see that high accuracy is possible despite totally missing our target. To evaluate an algorithm in a way that prevalence does not cloud our assessment, we often look at sensitivity and specificity.
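
A worked example makes the effect of prevalence concrete. The numbers below are hypothetical: 100 patients with 10% mortality, and a "model" that predicts survival for everyone.

```python
# hypothetical test set: 100 patients, 10 of whom die (10% prevalence)
y_true = [1] * 10 + [0] * 90

# a "model" that predicts survival ("0") for every patient
y_pred = [0] * 100

# accuracy = correct predictions / all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# number of deaths the model actually identified (true positives)
deaths_identified = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f'Accuracy = {accuracy:.2f}')  # Accuracy = 0.90
print(f'Deaths correctly identified = {deaths_identified} of {sum(y_true)}')
```

The accuracy is 90%, yet none of the 10 deaths is identified.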

## Sensitivity (Recall or True Positive Rate)

Sensitivity is the ability of an algorithm to predict a positive outcome when the actual outcome is positive. In our case, of the patients who die, what proportion did we correctly predict? This can be written as:

$$
  Sensitivity = Recall = \frac{TP}{TP+FN}
$$

Because a model that calls "1" for everything has perfect sensitivity, this measure is not enough on its own. Alongside sensitivity we often report on specificity. 
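
The formula can be sketched directly from label lists. The data and the helper name `sensitivity` are hypothetical; the second call shows why a model that predicts "1" for everyone scores perfectly on this measure.

```python
def sensitivity(y_true, y_pred):
    """Proportion of actual positives predicted positive: TP / (TP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# hypothetical outcomes: 3 of 10 patients die
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# a hypothetical model that finds 2 of the 3 deaths
y_model = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
sens_model = sensitivity(y_true, y_model)

# a "model" that predicts "1" for everyone never misses a death
y_all_ones = [1] * len(y_true)
sens_all_ones = sensitivity(y_true, y_all_ones)

print(f'Sensitivity (model) = {sens_model:.2f}')        # 0.67
print(f'Sensitivity (all ones) = {sens_all_ones:.2f}')  # 1.00
```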

## Specificity (True Negative Rate)

Specificity relates to the model's ability to correctly classify patients who survive their stay (i.e. class "0"). Specificity is the proportion of those who survive who are predicted to survive. The formula for specificity is:

$$
  Specificity = \frac{TN}{TN+FP}
$$
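
As a sketch of the formula, using hypothetical data and an illustrative helper named `specificity`: a model that predicts "1" for everyone has a specificity of zero, because it never identifies a survivor.

```python
def specificity(y_true, y_pred):
    """Proportion of actual negatives predicted negative: TN / (TN + FP)."""
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tn / (tn + fp)

# hypothetical outcomes: 7 of 10 patients survive
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# a hypothetical model that wrongly flags 1 of the 7 survivors
y_model = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
spec_model = specificity(y_true, y_model)

# a "model" that predicts "1" for everyone identifies no survivors
y_all_ones = [1] * len(y_true)
spec_all_ones = specificity(y_true, y_all_ones)

print(f'Specificity (model) = {spec_model:.2f}')        # 0.86
print(f'Specificity (all ones) = {spec_all_ones:.2f}')  # 0.00
```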

## Receiver Operating Characteristic

A Receiver Operating Characteristic (ROC) curve plots sensitivity against 1 - specificity at varying probability thresholds. The area under this curve is known as the AUROC (or sometimes just "Area Under the Curve", AUC) and it is a widely used measure of discrimination that was originally developed by radar operators in the 1940s.

In [None]:
metrics.RocCurveDisplay.from_estimator(reg, x_test, y_test)
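
To see the points that make up a ROC curve, `metrics.roc_curve` returns the false positive rate (1 - specificity) and true positive rate (sensitivity) at each threshold. The labels and scores below are hypothetical, chosen only to keep the output short.

```python
from sklearn import metrics

# hypothetical outcomes and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# one (fpr, tpr) point per threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f'threshold {thr:.2f}: 1 - specificity = {f:.2f}, sensitivity = {t:.2f}')

# area under the curve, computed with the trapezoidal rule
auroc = metrics.auc(fpr, tpr)
print(f'AUROC = {auroc:.2f}')
```

The curve always runs from (0, 0), where nothing is labelled positive, to (1, 1), where everything is.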

An AUROC of 0.5 is no better than guessing and an AUROC of 1.0 is perfect. An AUROC of 0.9 tells us that 90% of the time our model will assign a higher risk to a randomly selected patient with an event than to a randomly selected patient without an event.
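
This pairwise interpretation can be checked numerically. The sketch below uses hypothetical labels and scores: it compares every patient with an event against every patient without one, counts how often the former receives the higher score (ties count as half), and compares the result with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# hypothetical outcomes and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# separate the scores of patients with and without an event
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

# fraction of (event, no-event) pairs ranked correctly; ties count as 0.5
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
pairwise = wins / (len(pos) * len(neg))

auroc = roc_auc_score(y_true, y_score)

print(f'Pairwise estimate = {pairwise:.3f}')
print(f'roc_auc_score     = {auroc:.3f}')
```

Here 8 of the 9 pairs are ranked correctly, so both values are about 0.889.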

## Key Points

* Confusion matrices are the basis for many popular performance metrics.
* AUROC is the area under the receiver operating characteristic curve. An AUROC of 0.5 is no better than guessing.
* TP is True Positive, meaning that our prediction hit its target. 

# Part 7 - Bootstrapping 



# References

* Carpentries Incubator - [Introduction to Machine Learning in Python](https://carpentries-incubator.github.io/machine-learning-novice-python/)
* Wikipedia - [Machine learning](https://en.wikipedia.org/wiki/Machine_learning)