# Machine Learning introduction: practical

The basic machine learning project consists of the same steps:

1. Aquire data
2. Split data into Train/Test and (optionally) validation sets
3. Preprocess the data
4. Define the model
5. Fit the model to the training data
6. Measure the accuracy of the model on the training data
7. Measure the accuracy of the model on the test data

You might go through several iternations of 3-6 while you get your model working, and that is where validation sets are useful. 

Their are many different programming frameworks for doing machine learning. Many of them are written in C and then export "language bindings" to allow them to be used from python, R, matlab, java, Ruby, Scala....

Some examples are:

* TensorFlow
* Keras
* PyTorch

These are often optimised to allow very powerful "Deep Learning" neural network models, such as CNNs, ResNets, LSTMs and Transformers. 

These are trained on thounssands of GPUs.

We will be using `scikit-learn`

* Focuses more on "traditional" learning algorithms
* Extremely easy to use
* Runs well on a single machine

In [111]:
import sklearn

## Aquire the data

For this example we will be using an example dataset included with `scikit-learn`. 

This dataset contains clinical data from patients with breast cancer, and whether their tumour is benign or malignant. 

In [112]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

If you are interested, the data includes a description of the study

In [116]:
print(breast_cancer["DESCR"])

.. _breast_cancer_dataset:

Breast cancer Wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

Traditionally in machine learning we use the variable X to represent the features of the training data, and Y to represent the thing we are trying to predict.

In [117]:
X = breast_cancer.data
Y = breast_cancer.target

Lets have a look inside these...

In [118]:
print(X)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]


In [119]:
print(Y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

Lets investigate these datasets. 

X and Y are `array`s. Arrays are like lists, except: 

* They can have more than one dimension (e.g. rows AND columns).
* They can only have one data type in them.
* You can't `append` or `extend` them. You have to predefine their size.

(most things in `sklearn` can also be done with pandas dataframes). 

The `shape` attribute allows us to see how big they are. It returns an array, the first entry
* the first entry contains the number of rows
* the second entry contains the number of columns (if it has them)

In [120]:
n_x = X.shape[0] 
n_features = X.shape[1]
n_y = Y.shape[0]

print("The number of examples is %i" % n_x)
print("Each has %i features" % n_features)
print("They predict %i class labels " % n_y)

The number of examples is 569
Each 30 features
They predict 569 class labels 


You'll have seen that Y contains 1s and 0s. Remember that the thing we are trying to predict has to be numeric. 

But what does 0 and 1 mean? 

We can find out from the dataset.

In [123]:
class_names = breast_cancer["target_names"]
class_names

array(['malignant', 'benign'], dtype='<U9')

So:

* when y is 0, the tumour is malignant
* when y is 1, the tumour is benign

We can look which is which using this.

In [126]:
class_names[Y][1:50]

array(['malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'benign', 'benign',
       'benign', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'malignant', 'benign', 'malignant', 'malignant', 'malignant',
       'malignant', 'malignant', 'malignant', 'malignant', 'malignant',
       'benign', 'malignant', 'benign', 'benign'], dtype='<U9')

We can also have a look at what each of the features is:

In [127]:
breast_cancer["feature_names"]

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

## Split the data

Now that we have got our data, and understood it, the next step is to split into test and train data. 

This is important to ensure that we don't "overfit" on the training data. 

That is - learn how to distinguish these examples very well but only these examples. 

In [151]:
from sklearn.model_selection import train_test_split

# sklearn provides the `train_test_split function to do the splitting for you
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle = True, random_state=321)

# Check the split has worked. 
n_x_train = X_train.shape[0]
n_x_trest = X_train.shape[0]

print ("The number of examples in the full data is %i" % n_x)
print ("The number of training examples is %i" % X_train.shape[0])
print ("The number of testing examples is %i" % X_test.shape[0])


The number of examples in the full data is 569
The number of training examples is 398
The number of testing examples is 171


The amount of data taken for testing depends on the problem. 

You need enough to get a good estimate of how well you are doing.

But not so much you arn't left with enoguh for training. 

10%-30% is traditional

There are other ways to do the splitting

Noteably:

* Creating a validation set as well as a test set
* Cross-validation (see below).

But also if you need to make sure certain examples do or do not end up together. 

## Preprocess the data

We are going to come back to this

## Define the model

In this example we will use perhaps the most simple classification model:

Logistic Regression. 

In logistic regression we first define a score for each example by multiplying each feature by a weight and additing it together:

$$ Z = \beta_1 \times x_1 + \beta_2 \times x_2 + ... + \beta_{30} \times x_{30} $$

Here $x_n$ is a feature. So for example, $x_1$ is "mean radius". $\beta_1$ is some score we associate with the first feature. 

This is often written 
$$ Z = \sum_n \beta_n \times x_n $$

or 

$$ Z = \beta^T X $$

Z is then transformed using the "logistic function":

$$ Y = \frac{1}{1 + e^{-Z}} $$

The job of ML here is to find the values of $\beta$ that get Y closest to the real values. 

We can create a LogisticRegression model using `LogisticRegression` from `sklearn`.

There are many options, some of which you will use later.

For now we'll (mostly) use the defaults. 

In [133]:
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(penalty=None)

## Fitting the model

Fitting the model is the easiest bit!

We just provide `X_train` and `Y_train` to our model's `fit` method.

In [152]:
model_fit = LR_model.fit(X_train, Y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


There are two warnings.

One tells us that we are calling the function wrong.

The other that our model hasn't converged. This might be a problem, but we'll come back to it. 

# Measure the performance of the model

The next step is to measure the accuracy of our model. There are many metrics for measureing how good your model is. 

The simplest is "accuracy" - simply the fraction of guesses that are correct.

Lets define an accuracy function. 

In [135]:
def accuracy_score(y_true, y_pred):
    ''' 
    The accuracy is the number of correct predictions, divided by total number of predictions

    Parameters
    ----------

    y_true : numpy array-like, array of true classes for each example, encode is 0 for 
        False, 1 for true

    y_pred : numpy array-like, array of predicted classes for each example, encoded as 
        above

    Returns
    -------

    score : float - the accuracy.
    '''

    # number of examples
    total = len(y_true)
    
    # Calculate the difference between the predictions and the truth (remember benign is 0, malignant is 1)
    # first do it per example
    wrong = abs(y_true - y_pred)

    # then count the number of examples
    n_wrong=sum(wrong)

    # subtract from the total number to get the number right. 
    n_right = total - n_wrong

    # accuracy is now the ratio of these two
    accuracy = float(n_right/total)
    
    return(accuracy)

We can now use this to ask how well our trained model predicts the true class labels of our training examples from our training data.
First predict the values of Y:

In [158]:
Y_train_pred = LR_model.predict(X_train)

Then compare them to the correct values:

In [154]:
accuracy_score(Y_train, Y_train_pred)

0.9597989949748744

95% accuracy seems pretty good, but is this also true in the test data. 

Remember overfitting can lead to models working less well on data other than what they were trained on. 

In [155]:
Y_test_pred = LR_model.predict(X_test)
accuracy_score(Y_test, Y_test_pred)

0.9239766081871345

Accuracy works fairly well when you have an equal number of examples in each class - the same number of benign or malignant turmours. 

It works badly where this is not the case. 

Imagine - only 10 out of 300 cases are malignant. 

A model that just guess all examples were benign would have an accuracy of 290/300 ~ 97%. 

Alternatives to accuracy include 
* precision (the number of cases called malignant that are
* recall (the number of cases that are malignant that are called malignant)
* F1 score - an average of the above.
* AUC (Area under the reciever operator curve)

In [166]:
from sklearn.metrics import f1_score, roc_auc_score

print ("Training F1 score is %.2f" % f1_score(Y_train, Y_train_pred))
print ("Test F1 score is %.2f" % f1_score(Y_test, Y_test_pred))

Training F1 score is 0.97
Test F1 score is 0.94


In [165]:
print ("Training AUC score is %.2f" % roc_auc_score(Y_train, Y_train_pred))
print ("Test AUC score is %.2f" % roc_auc_score(Y_test, Y_test_pred))

Training AUC score is 0.96
Test AUC score is 0.92


* So on the test data its 92%, which is less good, but not massively so. 

* Still our current model is getting nearly 10% of guesses wrong.

* This would be no good if we were telling people if their cancer was benign or malignant. 

## Data preprocessing

There several things we can do to the data or the model to try to improve this performance, before we move to a more complex model. 

One common thing to do is to do one or more of a common set of transformations to the data. 

The most common are:

* Centering the data
* Scaling the data
* Taking the log of the data

Some times you have principled reasons to believe that something should be done. 

More often, you just try and see if it makes things better. 

Lets try centering and scaling the data. This meakes the mean and standard deviation of each of the features the same. 

To do this, we create a "StandardScaler" object, and "fit" it to our data. 

In [167]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)

We can now use the `transform` method of this object to scale our data:

In [168]:
scaled_X_train = scaler.transform(X_train)

We can now see that the means are all 0. 

Means before transformation:

In [171]:
X_train.mean(axis=0)

array([1.39741658e+01, 1.93426382e+01, 9.08990201e+01, 6.40355025e+02,
       9.55764824e-02, 1.01927035e-01, 8.52440653e-02, 4.64194673e-02,
       1.80827387e-01, 6.26657538e-02, 4.02647236e-01, 1.23964497e+00,
       2.86333266e+00, 4.02035854e+01, 7.11084925e-03, 2.50634221e-02,
       3.09895525e-02, 1.14432588e-02, 2.05774070e-02, 3.74401080e-03,
       1.60196332e+01, 2.56463568e+01, 1.05591834e+02, 8.52779899e+02,
       1.30876106e-01, 2.48253719e-01, 2.62752611e-01, 1.09040957e-01,
       2.87438693e-01, 8.36328141e-02])

Means after transformation:

In [172]:
scaled_X_train.mean(axis=0)

array([-4.77953802e-15, -6.38238764e-16,  1.81485201e-15,  1.06503304e-15,
       -3.39594349e-15, -1.48708516e-15, -3.01824450e-16,  5.83564464e-16,
       -3.39378163e-15,  2.89773788e-15, -5.94094846e-16, -7.91382593e-16,
        4.90952895e-17, -3.25814194e-16,  2.28181516e-16, -5.69059038e-16,
        3.48688136e-16, -1.67593466e-15, -7.52608472e-16,  1.54594372e-15,
       -2.12616078e-15,  3.71673658e-15,  1.37076280e-15,  7.92219445e-16,
        4.32436052e-15,  6.53302091e-16, -5.24426956e-17,  1.47843770e-16,
       -1.74623018e-16, -1.85306823e-15])

We can now refit our model and see if it does any better. 

Before we go any further. We have a problem. 

We said we didn't want to "train" on the "test" data so as to keep it seperate, and not overfit. 

But if we try lots of different modifications to the data, and test on the test data, then we might choose a set of things that are only good for that test data.

We will have "comtaminated" our test data - used it in developing our model.

This is something you should never do. 

This is where **validation** sets come in useful.

You effectively have two test sets - one you use in model development, and one you don't. 

However, if we keep breaking our data into smaller and smaller pieces, we'll have none left. 


A solution to this is Cross-Validation. To use cross validation:

1. Break your data into k pieces (say 10 pieces).
2. Train on 9, and use the 10th as a validation set.
3. Record the performance on your metric of choice.
4. Repeat 2-3, but using a different piece as validation.
5. Continue until you have used each pieces as the validation set once.
6. Take an average of the scores. 

First lets test our unscaled model under cross-validation for comparison:

In [221]:
from sklearn.model_selection import KFold

# Define how to breakdown the dataset
kfold = KFold(n_splits=10, shuffle=True, random_state=4564)

# Create a list to hold the scores. 
scores = list()
i = 0

# kfold splits out sets of indicies to use to select data subsets
for train_i, test_i in kfold.split(X_train):

    i += 1
    
    # Grab the train and test data for this split. 
    X_train_k, Y_train_k = X_train[train_i], Y_train[train_i]
    X_val_k, Y_val_k = X_train[test_i], Y_train[test_i]
    
    # Fit the model for the training split
    LR_model = LogisticRegression(penalty=None)
    LR_model = LR_model.fit(X_train_k, Y_train_k)

    # Predict the classes for the validation examples. 
    y_pred = LR_model.predict(X_val_k)

    # calculate the score 
    k_score = accuracy_score(Y_val_k, y_pred)

    print ("Fold %i. %i training examples (%i positive). %i validation examples (%i positive). Accuracy %.2f" % 
           (i, X_train_k.shape[0], np.sum(Y_train_k), X_test_k.shape[0], np.sum(Y_test_k), k_score))
    scores.append(k_score)
    

Fold 1. 358 training examples (240 positive). 45 validation examples (27 positive). Accuracy 0.97
Fold 2. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 3. 358 training examples (232 positive). 45 validation examples (27 positive). Accuracy 0.97
Fold 4. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.93
Fold 5. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.97
Fold 6. 358 training examples (234 positive). 45 validation examples (27 positive). Accuracy 1.00
Fold 7. 358 training examples (232 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 8. 358 training examples (235 positive). 45 validation examples (27 positive). Accuracy 0.85
Fold 9. 359 training examples (234 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 10. 359 training examples (234 positive). 45 validation examples (27 positive). Accuracy 0.92


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to sca

In [222]:
np.mean(scores)

np.float64(0.9471794871794872)

A mean accuracy of 95% is okay. But can we do better?

In [223]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=4564)
scores = list()
i = 0

for train_i, test_i in kfold.split(X_train):
    
    i +=1 

    # Grab the train and test data for this split.
    X_train_k, Y_train_k = X_train[train_i], Y_train[train_i]
    X_val_k, Y_val_k = X_train[test_i], Y_train[test_i]

    # transform the training data
    scaled_X_train = scaler.transform(X_train_k)
    
    # Fit the model for the training split
    LR_model = LogisticRegression(penalty=None)
    scaled_LR_model = LR_model.fit(scaled_X_train, Y_train_k)

    # transform the validation data
    scaled_X_val = scaler.transform(X_val_k)
    
    # Predict the classes for the validation examples.
    y_pred = scaled_LR_model.predict(scaled_X_val)

    # calculate the score 
    k_score = accuracy_score(Y_val_k, y_pred)

    print ("Fold %i. %i training examples (%i positive). %i validation examples (%i positive). Accuracy %.2f" % 
           (i, X_train_k.shape[0], np.sum(Y_train_k), X_test_k.shape[0], np.sum(Y_test_k), k_score))
    scores.append(k_score)

Fold 1. 358 training examples (240 positive). 45 validation examples (27 positive). Accuracy 1.00
Fold 2. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 3. 358 training examples (232 positive). 45 validation examples (27 positive). Accuracy 1.00
Fold 4. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 5. 358 training examples (233 positive). 45 validation examples (27 positive). Accuracy 0.97
Fold 6. 358 training examples (234 positive). 45 validation examples (27 positive). Accuracy 1.00
Fold 7. 358 training examples (232 positive). 45 validation examples (27 positive). Accuracy 0.97
Fold 8. 358 training examples (235 positive). 45 validation examples (27 positive). Accuracy 0.93
Fold 9. 359 training examples (234 positive). 45 validation examples (27 positive). Accuracy 0.95
Fold 10. 359 training examples (234 positive). 45 validation examples (27 positive). Accuracy 0.95




In [224]:
np.mean(scores)

np.float64(0.9672435897435898)

A slight improvement to 97%, so a 2% improvement. This might not seem like a lot, but when you are going from 92% -> 95% -> 97%, 2% is nearly half of the possible improvement. 

## Conclusion

You should now have what you need to get started. 

Go find your data. 

Load in your data (sklearn will happily take pandas dataframes). 

Convert the thing you want to predict into numbers. 

Divide into test-train. 

Use crossvalidation to test a simply logistic regression model on your data, with and without scaling and centring. 

Also try with and without log transformation. 