$\mathfrak{Gabriel \ Maldonado}$

# AI for Medicine Course III
## Week 1 Lab 1 
## Model Training/Tuning Basics with Sklearn

In this notebook we're going to be exploring the `sklearn` library, including an overview of its underlying data types and methods for tweaking a model's hyperparameters using some dummy data.

### Packages

We will be using the following packages:

- `pandas` -- to manipulate the data
- `numpy` -- for mathematical and scientific operations
- `sklearn` -- for machine learning and statistical modeling
- `itertools` -- to help with the hyperparameter (grid) search

In [19]:
# Import Packages

import numpy as np
import pandas as pd
import itertools

# Set the random seed for consistent output
np.random.seed(18)

# Read in the data
data = pd.read_csv("/content/dummy_data.csv", index_col=0)
data.head()


| | sex | age | obstruct | outcome | TRTMT |
|---|---|---|---|---|---|
| 1 | 0 | 57 | 0 | 1 | True |
| 2 | 1 | 68 | 0 | 0 | False |
| 3 | 0 | 72 | 0 | 0 | True |
| 4 | 0 | 66 | 1 | 1 | True |
| 5 | 1 | 69 | 0 | 1 | False |


### Train/Test Split

In this step we split the data into training and test sets, holding out 30% of the observations for testing.

In [4]:
# Import module to split data

from sklearn.model_selection import train_test_split

# Get the label
y = data.outcome

# Get the features
X = data.drop('outcome', axis=1)

# Get training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

print(f"Number of observations for training: {y_train.size}")
print(f"Number of observations for testing: {y_test.size}")

Number of observations for training: 35
Number of observations for testing: 15
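
The split sizes follow directly from `test_size`: with 50 observations and `test_size=0.30`, 15 go to the test set and the remaining 35 are used for training. The sketch below illustrates this and shows one way to make the split repeatable. It assumes synthetic data from `make_classification` as a stand-in for `dummy_data.csv`, and the `random_state` argument is an addition that the original cell does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption): 50 rows, 4 features
X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)

# Passing random_state makes the shuffle repeatable across runs
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=18)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=18)

X_tr, X_te, y_tr, y_te = split_a
print(f"train: {y_tr.size}, test: {y_te.size}")

# The same seed selects the same rows for the test set
print(np.array_equal(X_te, split_b[1]))
```

Without `random_state`, `train_test_split` draws from NumPy's global random state, which is why the notebook calls `np.random.seed(18)` up front.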


### Model Fit and Prediction

Let's fit a logistic regression to the training data. `Sklearn` allows you to provide arguments that override the defaults. 

The default solver is `lbfgs`.  
- L-BFGS stands for ['Limited Memory BFGS'](https://en.wikipedia.org/wiki/Limited-memory_BFGS), an efficient and popular optimization method for fitting models.
- The solver is set explicitly here for learning purposes; if you omit the `solver` parameter, the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class uses its default solver, which is also `lbfgs`.

In [5]:
from sklearn.linear_model import LogisticRegression

# Create an instance of a model
lr = LogisticRegression(solver = 'lbfgs')

# Fit the model
lr.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Because `fit()` returns the fitted estimator, the notebook displays it along with its hyperparameters.  
- Here, these are the default hyperparameters for `sklearn`'s logistic regression classifier.
- Another way to check these parameters is the classifier's `get_params()` method.

You should spend some time checking out the [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to get a deeper understanding of what's going on. One important thing to note is that each classifier has different hyperparameters. 
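
As a small illustration of `get_params()`, the sketch below reads a few hyperparameters from an unfitted classifier and checks that setting `solver='lbfgs'` explicitly gives the same configuration as the defaults. The variable names are illustrative only, and the values printed depend on the installed `sklearn` version.

```python
from sklearn.linear_model import LogisticRegression

# get_params() works on an unfitted estimator and returns a plain dictionary
explicit_params = LogisticRegression(solver="lbfgs").get_params()
default_params = LogisticRegression().get_params()

# Look up individual hyperparameters by name
print(explicit_params["solver"], explicit_params["C"], explicit_params["max_iter"])

# Setting the default solver explicitly does not change the configuration
print(explicit_params == default_params)
```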

### Prediction
To predict with the classifier, use the `predict()` method. 
- This returns a `numpy` array containing the predicted class for each observation in the test set, as you can see by running the next cell:


In [6]:
# Use the trained model to predict labels from the features of the test set
predictions = lr.predict(X_test)

# View the prediction type, shape, and print out a sample prediction
print(f"predictions is of type: {type(predictions)}")
print(f"predictions has shape: {predictions.shape}")
print(f"predicted class for 10th element in test set: {predictions[9]}")

predictions is of type: <class 'numpy.ndarray'>
predictions has shape: (15,)
predicted class for 10th element in test set: 0


### Prediction probability

When a model predicts that a label is 1 rather than 0, it may help you to know if the model was predicting 1 with a 51% probability or 90% probability; in other words, how confident is that prediction?

You can get the model's predicted probability for each class. 
- To do this, use the `predict_proba()` method. 
- The resulting array has one row per observation and one column per class of the target variable.

In [7]:
prediction_probs = lr.predict_proba(X_test)
print(f"prediction_probs is of type: {type(prediction_probs)}")
print(f"prediction_probs has shape: {prediction_probs.shape}")
print(f"probabilities for 10th element in test set: {prediction_probs[9]}")

prediction_probs is of type: <class 'numpy.ndarray'>
prediction_probs has shape: (15, 2)
probabilities for 10th element in test set: [0.52049488 0.47950512]


There are 15 patients in the test set.  Each patient's label could be either 0 or 1, so the prediction probability array has 15 rows and 2 columns.  To know which column refers to label 0 and which refers to label 1, you can check the `.classes_` attribute.

In [8]:
lr.classes_

array([0, 1])

Since the order of the `classes_` array is 0, then 1, column 0 of the prediction probabilities corresponds to label 0, and column 1 corresponds to label 1.



In [9]:
# Print the first 5 elements of the test set

for i in range(5):
    print(f"Element number: {i}")
    print(f"Predicted class: {predictions[i]}")
    print(f"Probability of predicting class 0: {prediction_probs[i][0]}")
    print(f"Probability of predicting class 1: {prediction_probs[i][1]}\n")

Element number: 0
Predicted class: 1
Probability of predicting class 0: 0.4163960798121926
Probability of predicting class 1: 0.5836039201878074

Element number: 1
Predicted class: 1
Probability of predicting class 0: 0.487859105386813
Probability of predicting class 1: 0.512140894613187

Element number: 2
Predicted class: 1
Probability of predicting class 0: 0.47448109788199044
Probability of predicting class 1: 0.5255189021180096

Element number: 3
Predicted class: 0
Probability of predicting class 0: 0.8518210817782923
Probability of predicting class 1: 0.14817891822170776

Element number: 4
Predicted class: 0
Probability of predicting class 0: 0.8287608087287452
Probability of predicting class 1: 0.17123919127125475



We can see here that the predicted class is always the class with the higher predicted probability. Since you're dealing with `numpy` arrays, you can slice them to get specific information, such as the probability of class 1 for all elements in the test set:

In [10]:
# Retrieve prediction probabilities for label 1, for all patients
prediction_probs[:, 1]

array([0.58360392, 0.51214089, 0.5255189 , 0.14817892, 0.17123919,
       0.29148012, 0.51712649, 0.33070589, 0.40522705, 0.47950512,
       0.17606497, 0.5081346 , 0.29847409, 0.19057603, 0.38912412])
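
The relationship between `predict()`, `predict_proba()`, and `classes_` can be checked programmatically. The sketch below is a minimal example on synthetic data (an assumption, since `dummy_data.csv` is not reproduced here): it confirms that each row of probabilities sums to 1 and that picking the most probable column and mapping it through `classes_` recovers the predicted labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption)
X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=18
)

clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)   # shape: (n_observations, n_classes)
preds = clf.predict(X_te)         # shape: (n_observations,)

# Each row is a probability distribution over the classes
print(probs.shape, np.allclose(probs.sum(axis=1), 1.0))

# Map the most probable column back to its label through classes_
labels_from_probs = clf.classes_[probs.argmax(axis=1)]
print(np.array_equal(labels_from_probs, preds))
```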

### Tuning the Model

A classifier's predictive power can often be improved by choosing a good set of hyperparameters. This process is known as model tuning. 

For this process, you'll need a classifier, an appropriate evaluation metric, and a set of parameters to test. Since this is a dummy example, you'll use the default metric for the logistic regression classifier: the **mean accuracy**.

### Mean Accuracy
Mean Accuracy is the number of correct predictions divided by total predictions. This can be computed with the `score()` method. 

Let's begin by checking the performance of the out-of-the-box logit classifier:

In [11]:
lr.score(X_test, y_test)

0.6666666666666666
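
To make the definition concrete: a score of 0.666... on 15 test observations corresponds to 10 correct predictions. The sketch below computes the mean accuracy by hand and compares it with `score()`. It runs on synthetic data (an assumption), so its accuracy value will differ from the one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption)
X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=18
)

clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)

# Mean accuracy by hand: correct predictions divided by total predictions
n_correct = int(np.sum(clf.predict(X_te) == y_te))
manual_accuracy = n_correct / y_te.size

# score() reports the same quantity
print(n_correct, y_te.size)
print(np.isclose(manual_accuracy, clf.score(X_te, y_te)))
```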

Let's say we want to tweak this model's default parameters. We can collect the values we want in a dictionary and pass it to the classifier when we instantiate it. Notice that the dictionary must be unpacked into keyword arguments, or `kwargs`, by using the `**` prefix:



In [14]:
# Choose hyperparameters and place them as key-value pairs in a dictionary
params = {
    'solver': 'liblinear',
    'fit_intercept': False,
    'penalty': 'l1',
    'max_iter': 500
}

# Pass in the dictionary as keyword arguments to the model
lr_tweaked = LogisticRegression(**params)

# Train the model
lr_tweaked.fit(X_train, y_train)

# View hyper-parameters
print(f"Tweaked hyperparameters: {lr_tweaked.get_params()}\n")

# Evaluate the model with the mean accuracy
print(f"Mean Accuracy: {lr_tweaked.score(X_test, y_test)}")

Tweaked hyperparameters: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': False, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 500, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l1', 'random_state': None, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Mean Accuracy: 0.6


The model with the tweaked parameters scores slightly lower than the original (0.6 versus about 0.67, which is one fewer correct prediction out of 15). However, there might still be some combination of parameters that can increase the predictive power of the logit classifier. 

### Trying Different Hyperparameters
Testing this by hand can be daunting considering all the possible parameter combinations, so let's generate the combinations programmatically.

To get started, we'll apply `itertools.product()` to create all the combinations of parameters. 
- Notice that `product()` takes each list of values as a separate positional argument, so the list of lists must be unpacked with the `*` prefix.

In [15]:
# Choose hyperparameters and place in a dictionary
hyperparams = {
    'solver': ["liblinear"],
    'fit_intercept': [True, False],
    'penalty': ["l1", "l2"],
    'class_weight': [None, "balanced"]
}
# Get the values of hyperparams and convert them to a list of lists
hp_values = list(hyperparams.values())
hp_values

[['liblinear'], [True, False], ['l1', 'l2'], [None, 'balanced']]

In [17]:
hp_keys = list(hyperparams.keys())
hp_keys

['solver', 'fit_intercept', 'penalty', 'class_weight']

In [20]:
# Get every combination of the hyperparameters
for hp in itertools.product(*hp_values):
    print(hp)

('liblinear', True, 'l1', None)
('liblinear', True, 'l1', 'balanced')
('liblinear', True, 'l2', None)
('liblinear', True, 'l2', 'balanced')
('liblinear', False, 'l1', None)
('liblinear', False, 'l1', 'balanced')
('liblinear', False, 'l2', None)
('liblinear', False, 'l2', 'balanced')
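
The number of combinations is the product of the list lengths: 1 × 2 × 2 × 2 = 8. The tuples above carry only values, so the sketch below pairs each one with `hp_keys` to get dictionaries that can later be unpacked with `**`. The name `combos` is introduced here for illustration only.

```python
import itertools

hyperparams = {
    'solver': ["liblinear"],
    'fit_intercept': [True, False],
    'penalty': ["l1", "l2"],
    'class_weight': [None, "balanced"]
}
hp_keys = list(hyperparams.keys())
hp_values = list(hyperparams.values())

# Pair every tuple of values with the hyperparameter names
combos = [dict(zip(hp_keys, hp)) for hp in itertools.product(*hp_values)]

print(len(combos))   # 8
print(combos[0])
```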


In [21]:
# Loop through the combinations of hyperparams
for hp in itertools.product(*hp_values):

    # Pair each hyperparameter name with its value and create the model
    estimator = LogisticRegression(**dict(zip(hp_keys, hp)))
    # Fit the model
    estimator.fit(X_train, y_train)
    print(f"Parameters used: {hp}")
    print(f"Mean accuracy of the model: {estimator.score(X_test, y_test)}\n")

Parameters used: ('liblinear', True, 'l1', None)
Mean accuracy of the model: 0.6

Parameters used: ('liblinear', True, 'l1', 'balanced')
Mean accuracy of the model: 0.5333333333333333

Parameters used: ('liblinear', True, 'l2', None)
Mean accuracy of the model: 0.4666666666666667

Parameters used: ('liblinear', True, 'l2', 'balanced')
Mean accuracy of the model: 0.5333333333333333

Parameters used: ('liblinear', False, 'l1', None)
Mean accuracy of the model: 0.6

Parameters used: ('liblinear', False, 'l1', 'balanced')
Mean accuracy of the model: 0.5333333333333333

Parameters used: ('liblinear', False, 'l2', None)
Mean accuracy of the model: 0.4

Parameters used: ('liblinear', False, 'l2', 'balanced')
Mean accuracy of the model: 0.5333333333333333



### That was Grid Search!
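
The loop above prints every score but does not record which combination did best. The sketch below is one possible way to keep track of that. It runs on synthetic data (an assumption), and the names `results`, `best_score`, and `best_params` are illustrative rather than part of the original lab.

```python
import itertools
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption)
X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=18
)

hyperparams = {
    'solver': ["liblinear"],
    'fit_intercept': [True, False],
    'penalty': ["l1", "l2"],
    'class_weight': [None, "balanced"]
}

# Fit one model per combination and record its mean accuracy
results = []
for hp in itertools.product(*hyperparams.values()):
    params = dict(zip(hyperparams.keys(), hp))
    model = LogisticRegression(**params).fit(X_tr, y_tr)
    results.append((model.score(X_te, y_te), params))

# Keep the combination with the highest mean accuracy
best_score, best_params = max(results, key=lambda r: r[0])
print(f"Best mean accuracy: {best_score}")
print(f"Best hyperparameters: {best_params}")
```

`sklearn` also provides `GridSearchCV` in `sklearn.model_selection`, which automates this kind of search and scores each combination with cross-validation on the training data instead of a single test split.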

