# AI in Medicine: Data Science - Machine Learning
## Python programming: Machine learning using *scikit-learn*

- **Tutor:** Roshan Prakash Rane, AG Ritter, Charité - Universitätsmedizin Berlin (roshan-prakash.rane@charite.de)
- **Target audience**: Medical students from Charité

> **Note:** If you are using this notebook outside of Google Colab, **please ensure that you're using Python version 3.6 or higher**. You can check that in the top right corner of your browser window. If you're using a different version, go to the "Kernel" tab --> "Change Kernel" and select "Python 3.6" or higher. <br>

**Executing the cell below should return version 3.6.0 or higher**

In [None]:
!python --version

## 1. Aims of this session

This session will serve as a programming tutorial following the previous theoretical session on machine learning. We will revisit the basic concepts of **machine learning** using a practical example. We will use the Python programming language and learn to use Python's machine learning package '*scikit-learn*' (along with *pandas*, *numpy* and *matplotlib* packages), to train different types of machine learning models and compare them.  

## 2. Learning goals

By the end of this session, you should be familiar with:

- How to read a dataset, explore the different data columns and clean it, if necessary, for training machine learning models.
- Cross validation: Why we should split our dataset into 'training' data and 'test' data and how to do it.
- How to identify whether the task at hand is a *classification task* or a *regression task*.
- How to train different machine learning models:
    - Support Vector Machine Classifiers (SVC)
    - Logistic Regression
- How to compare the performance of different machine learning models and determine which one is better.

## 3. References

Documentation for python libraries used in this notebook:

- https://scikit-learn.org/stable/
- https://pandas.pydata.org/pandas-docs/stable/
- https://numpy.org/doc/
- https://matplotlib.org/

Documentation for classifiers:

- https://en.wikipedia.org/wiki/Support_vector_machine
- https://en.wikipedia.org/wiki/Logistic_regression

Documentation for metrics used:

- https://en.wikipedia.org/wiki/Accuracy_and_precision
- https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
- https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Further learning material:

- StatQuest's Machine learning lecture series on youtube: ([click here](https://www.youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF))<br>(Episodes: Introduction, Cross-validation, Confusion Matrix, Sensitivity and Specificity, k-nearest neighbor, linear / logistic regression, support vector machines)


## 4. Theory
We will refresh the theory as we walk through the practical sections.

## 5. Practical
5.1 Data loading and exploration<br>
5.2 Data preprocessing<br>
5.3 Classification or regression?<br>
5.4 Splitting the data into 'training' set and a 'test' set<br>
5.5 Train machine learning models<br>
5.6 Evaluate model performance<br>

### 5.1 Data loading and exploration

* In this tutorial, we will look at the dataset provided by *Pima Indians Diabetes Database*. It is freely [available here](https://www.kaggle.com/uciml/pima-indians-diabetes-database). 
* Our objective is to build a machine learning model that can accurately predict (diagnose) whether or not a patient has diabetes, based on several diagnostic measurements provided in the dataset.
* We will be using a subset of the larger database for ease-of-use. In our subset we have selected only females who are at least 21 years old, and of Pima Indian heritage. You can find this data subset in our github repository: https://github.com/volkamerlab/ai_in_medicine/raw/master/data/ .

*In this part, you will get a chance to apply the lessons learned in the previous programming sessions too.*

The source publication of the data for reference:<br>
*Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.*

Let us first load the dataset and try to understand the different columns provided: 

In [None]:
# First, import the Pandas, NumPy and matplotlib libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
# load the csv file as a pandas dataframe
df = pd.read_csv("https://github.com/volkamerlab/ai_in_medicine/raw/master/data/diabetes.csv")

In [None]:
#see the loaded dataframe 
df

As we can observe,
- The dataset has 768 subjects and each subject has 9 variables:
    1. Pregnancies
    2. Glucose 
    3. BloodPressure 
    4. SkinThickness 
    5. Insulin 
    6. BMI 
    7. DiabetesPedigreeFunction 
    8. Age 
    9. Outcome
- The last column "Outcome" indicates the diabetes status. A value of '1' here denotes that the subject has diabetes.
- The rest of the 8 columns are medical predictor variables that are known to be associated with diabetes and they *might* help us to more accurately predict diabetes.

**Our task here is to use these 8 medical predictor variables and try and predict if the subject has diabetes or not** (i.e. to predict the 'outcome' variable):

<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/MLflowchart.png">

(*[link to the source diagram](https://github.com/volkamerlab/ai_in_medicine/raw/master/images/MLflowchart.png)*)

**Discussion:** 
1. What variables among these do you think are the most informative, medically, for diabetes diagnosis? 
2. Would a machine learning model that uses these variables and diagnoses diabetes accurately be beneficial for the medical field?

Let us try to visualize the 'distribution' of the diabetes column in our data. <br>
We can use the `value_counts()` method of a `pandas.Series` (a single dataframe column) to print the count of diabetic/non-diabetic subjects.

In [None]:
# count the subject with and without diabetes
label = df["Outcome"] # first, select the 'Outcome' column from the dataframe
label_counts = label.value_counts() # next, use the pandas 'value_counts()' method 
# print the result
label_counts

Let's make these numbers more intuitive by plotting a 'bar' graph. <br>
We can use the `plot()` method of the `pandas.Series` returned by `value_counts()` to do this.

In [None]:
ax = label_counts.plot(kind="bar")

Ok, we can definitely improve this graph a bit more and make it readable for anyone else who would look at it. 

In [None]:
ax = label_counts.plot(kind="bar", 
                       title="Count of diabetic vs healthy subjects",
                       ylabel="number of subjects", 
                       rot=0)
tick_names = ax.set_xticklabels(["healthy", "diabetic"])

Now, let's visualize the distributions of the different medical variables that will be provided as input to our machine learning model.

In [None]:
# first, create a canvas on which 1 x 4 graphs can be drawn
f, axes = plt.subplots(1, 4, sharey=True, figsize=(16,4))

df.plot(y="Age", kind="hist", ax=axes[0])
df.plot(y="Glucose", kind="hist", ax=axes[1])
df.plot(y="DiabetesPedigreeFunction", kind="hist", ax=axes[2])
df.plot(y="Pregnancies", kind="hist", ax=axes[3])

plt.tight_layout()
plt.show()

Side note: We can also use a 'density' plot to visualize the same data.

**Exercise:** Now it's your turn. Similar to what we did in the previous cell, plot the distributions of the remaining 4 variables in the dataset:
1. BMI
2. SkinThickness
3. BloodPressure
4. Insulin

In [None]:
# YOUR CODE GOES HERE
# hint: start by copying the code from the above cell. 
# Next, provide the column names you want to visualize.

<!-- f, axes = plt.subplots(1, 4, sharey=True, figsize=(16,4))

df.plot(y="BMI", kind="hist", ax=axes[0])
df.plot(y="SkinThickness", kind="hist", ax=axes[1])
df.plot(y="BloodPressure", kind="hist", ax=axes[2])
df.plot(y="Insulin", kind="hist", ax=axes[3])

plt.tight_layout()
plt.show() -->

### 5.2 Preprocessing the data

Generally, large datasets are messy: they contain missing information or incorrect and invalid values due to human errors or systemic issues in the data collection process. Therefore, in practice, you need to preprocess the data. Otherwise, it would not be reliable enough to train a machine learning model.

Some of the prominent preprocessing steps include:
* Dropping invalid/non-numeric/gibberish values
* Cleaning unneeded columns
* Converting measurement units
* Reducing noise using smoothing functions
* Standardizing and normalizing variable values: covered in the previous theory session
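
As a reminder of the last step in this list, here is a minimal sketch of standardization and min-max normalization written out with NumPy. The array `X_toy` and its values are made up for illustration and are not taken from the diabetes dataset.

```python
import numpy as np

# Hypothetical toy values: three subjects, two variables on very different scales
X_toy = np.array([[ 90.0, 0.2],
                  [120.0, 0.5],
                  [150.0, 0.8]])

# Standardization (z-score): subtract the column mean, divide by the column standard deviation
X_standardized = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Min-max normalization: rescale every column to the range [0, 1]
X_normalized = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

print(X_standardized)
print(X_normalized)
```

After standardization every column has mean 0 and standard deviation 1, so no variable dominates simply because it is measured in larger numbers.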

**Discussion:** Do you find any such discrepancies in our dataset so far?

Some subjects have 'Glucose' value as '0' which can't be the case. Maybe the data was not collected for these subjects.

In [None]:
# mark the subjects with Glucose == 0 (True means that the value is 0)
(df['Glucose'] == 0)

In [None]:
# count the subjects with Glucose == 0 (the number next to 'True')
(df['Glucose'] == 0).value_counts()

Let's drop these subjects from our data, as their measurements are probably not reliable.

In [None]:
# select only the subjects who have a 'Glucose' value 
df_clean = df[(df['Glucose'] != 0)]
df_clean

It is the same case with 'BMI' where some subjects have '0'. Let's remove them too.

In [None]:
df_clean = df_clean[(df_clean['BMI'] != 0)]

**Exercise:** Similarly, let's also remove subjects with BloodPressure=0, as those values are probably invalid too.

In [None]:
# YOUR CODE GOES HERE

<!--  df_clean = df_clean[(df_clean['BloodPressure'] != 0)] -->

In [None]:
df_clean
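
The three filtering steps above can also be written as one operation. The following is only a sketch on a made-up miniature table (`df_toy` and its values are hypothetical): it marks implausible zeros as missing values (`NaN`) and then drops the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table that mimics three columns of the dataset
df_toy = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
})

# Columns in which a value of 0 is physiologically implausible
cols_with_invalid_zeros = ["Glucose", "BloodPressure", "BMI"]

# Mark zeros as missing (NaN) in those columns only
df_marked = df_toy.copy()
df_marked[cols_with_invalid_zeros] = df_marked[cols_with_invalid_zeros].replace(0, np.nan)

# Count missing values per column, then drop every row that has at least one
print(df_marked.isna().sum())
df_toy_clean = df_marked.dropna(subset=cols_with_invalid_zeros)
print(df_toy_clean)
```

Marking values as `NaN` first has the advantage that you can inspect how many values are missing per column before deciding whether to drop the rows.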

### 5.3 Classification or regression?

Is our task a classification task or a regression task?

<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/classandregress.png" width="700" />

*([link to source image](http://tonyeiyalla.com/images/classandregress.png))*

Let's try to look at our data again. <br>
This time, let's use a 'scatter plot' to see how two input variables relate to each other.

In [None]:
f, axes = plt.subplots(1, 2, figsize=(10,4))

# plot BMI vs Glucose 
ax1 = df_clean.plot(x="BMI", y="Glucose", kind="scatter", ax=axes[0])

# plot DiabetesPedigreeFunction vs Glucose 
ax2 = df_clean.plot(x="DiabetesPedigreeFunction", y="Glucose", kind="scatter", ax=axes[1])

plt.show()

Now, let's see how the output variable (diabetes) is distributed in these scatter plots.

In [None]:
f, axes = plt.subplots(1, 2, figsize=(10,4))

# plot each group separately so that the legend entries match the colors:
# blue for healthy subjects (Outcome == 0) and red for diabetic subjects (Outcome == 1)
for outcome, color, name in [(0, "blue", "Healthy"), (1, "red", "Diabetic")]:
    group = df_clean[df_clean["Outcome"] == outcome]
    # plot BMI vs Glucose
    group.plot(x="BMI", y="Glucose", kind="scatter", ax=axes[0], c=color, label=name)
    # plot DiabetesPedigreeFunction vs Glucose
    group.plot(x="DiabetesPedigreeFunction", y="Glucose", kind="scatter", ax=axes[1], c=color, label=name)

plt.show()

**Discussion:** 
1. Do you see a correlation between the "BMI" and "Glucose" scores?
2. What about between "DiabetesPedigreeFunction" and "Glucose"?
3. If you had to draw a line to differentiate between diabetic and healthy subjects in the first plot, where would you put it?
4. Are we doing classification or regression?
5. If we were trying to predict 'Glucose' from 'BMI', would we be doing a classification or a regression?

Extract the features 'X' and the labels 'y' for our machine learning model as numpy arrays:

In [None]:
X = df_clean[["Pregnancies","Glucose","BloodPressure","SkinThickness",
              "Insulin","BMI","DiabetesPedigreeFunction","Age"]].values
y = df_clean[["Outcome"]].values

In [None]:
# print the shapes of our numpy arrays
X.shape, y.shape
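
A short aside on the shapes printed above: selecting a column with double brackets returns a 2D column vector, whereas single brackets return a flat 1D array. The sketch below uses a made-up three-row dataframe (`df_toy` is hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical miniature dataframe with one feature column and the label column
df_toy = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

# Double brackets select a dataframe, so .values is a 2D column vector
y_2d = df_toy[["Outcome"]].values
# Single brackets select a series, so .values is a flat 1D array
y_1d = df_toy["Outcome"].values

print(y_2d.shape)          # (3, 1)
print(y_1d.shape)          # (3,)
print(y_2d.ravel().shape)  # (3,) because ravel() flattens the column vector
```

scikit-learn expects labels as a 1D array, which is why a column vector is usually flattened with `ravel()` before training.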

### 5.4 Splitting the data into 'training' set and 'test' set

### Why do we need a test set?

Our goal is to learn a (machine learning) model that **generalizes** well. <br>
Over-fitting problem in a classification task vs a regression task:
<table>
    <tr>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/tuning.png" width="300" />
        </td>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/fitting_data.png" width="800" />
        </td>
    </tr>
</table>

*([link to source](https://github.com/volkamerlab/ai_in_medicine/raw/update-2021.02/images/tuning.png))*
*([link to source](https://github.com/volkamerlab/ai_in_medicine/raw/update-2021.02/images/fitting_data.png))*

Therefore, we split our dataset into 2 subsets: 
1. A larger subset on which we will train our model called the 'training' set. 
2. A smaller subset on which we will 'test' the model. It is important that we evaluate our model on a subset of the data it has never seen before to ensure that we are not overfitting the data and that our model can generalize well to unseen data.

Let us use 20% of our data as the test set. The remaining 80% can be used to train the classifier. These ratios may vary depending on the size of the dataset we are using, but a 20%/80% split is a good starting point.

In [None]:
# find out how many subjects make up 80% of our data
round(len(X)*80/100)

In [None]:
x_train = X[:579]
x_test = X[579:]

y_train = y[:579]
y_test = y[579:]

(x_train.shape), (x_test.shape), (y_train.shape), (y_test.shape)

Python has a machine learning library called '*sklearn*' that provides several convenient functions for machine learning:<br>
<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/Scikit_learn_logo.png" width="300" /> 
*([link to source](https://en.wikipedia.org/wiki/Scikit-learn#/media/File:Scikit_learn_logo_small.svg))* <br>

The *sklearn* library has a function called `train_test_split()` that can be used for splitting our dataset.

In [None]:
# import function for splitting our data 
from sklearn.model_selection import train_test_split

# Split the features and labels into training and test sets by setting the test_size variable to 20%
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

(x_train.shape), (x_test.shape), (y_train.shape), (y_test.shape)
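
The split above keeps the original row order (`shuffle=False`). A common variant, sketched here on made-up data, shuffles the rows and uses the `stratify` argument so that both subsets keep the same class ratio. The arrays `X_toy` and `y_toy` are hypothetical and deliberately sorted by label to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 subjects, 2 features, 30% positive labels, sorted by label
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# shuffle=False on sorted data: the test set is the last 20 rows, all of them positive
_, _, _, y_test_ordered = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

# Shuffled and stratified: each subset keeps the 30% positive rate,
# and random_state makes the split reproducible
_, _, y_train_strat, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_test_ordered.mean())  # fraction of positives in the ordered test set
print(y_test_strat.mean())    # fraction of positives in the stratified test set
```

Whether shuffling is appropriate depends on the data: it is a reasonable default when the rows are independent subjects, but not when their order carries meaning (for example, time series).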

### 5.5 Train machine learning models

There are many different machine learning models. Just to name a few popular ones:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Linear Support Vector Machine Classifier (LinearSVC)
5. Non-linear Support Vector Machine Classifier (SVC)
6. Deep learning methods: 
    1. Feed-forward networks
    2. Convolutional neural networks (CNN)
    3. Recurrent neural networks (RNN)
    
We will try 3 different machine learning models for our task:
1. Linear Support Vector Machine Classifier (LinearSVC)
2. Non-linear Support Vector Machine Classifier (SVC)
3. Logistic Regression

Why do we need so many different models?
- In machine learning, we use an algorithm to learn the relationship (the function mapping) between the features X and the target y.
- Every machine learning algorithm makes its own assumptions about the data, and it generalizes and makes predictions based on them. Its performance depends on how well these assumptions match the underlying patterns in the data.
- No free lunch theorem: There is "no free lunch", meaning that no single algorithm gives the best predictions on every dataset. It is the job of the data scientist to determine which model fits the data at hand best.
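
To illustrate how model assumptions matter, here is a small sketch on synthetic data (not the diabetes dataset). It uses scikit-learn's `make_moons` generator, whose two classes form interleaving half circles, and compares a linear model with an RBF-kernel SVC. The variable names are chosen for this example only.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data shaped like two interleaving half circles
X_moons, y_moons = make_moons(n_samples=400, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, test_size=0.25, random_state=0)

# A linear model assumes the classes can be separated by a straight line
acc_linear = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
# An RBF-kernel SVC can learn a curved decision boundary
acc_rbf = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print("Logistic regression accuracy: {:.2f}".format(acc_linear))
print("RBF SVC accuracy: {:.2f}".format(acc_rbf))
```

On data with a different structure the ranking of the two models can change, which is why it is worth trying more than one model.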

#### 5.5.1 Linear Support Vector Machine Classifier (LinearSVC)
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of both classes.

<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/hyperplane.png" width="400" /> 

*([link to source](https://github.com/volkamerlab/ai_in_medicine/raw/update-2021.02/images/hyperplane.png))* 
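
The idea of a maximum-margin hyperplane can be made concrete on a tiny, made-up 2D dataset. Note the assumptions in this sketch: it uses `SVC(kernel="linear")` with a large `C` instead of `LinearSVC`, because for that formulation the margin width is simply 2 divided by the length of the weight vector. `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable toy data: two clusters in a 2D plane
X_toy = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin: (almost) no point may lie inside the margin
clf = SVC(kernel="linear", C=1000).fit(X_toy, y_toy)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("margin width = {:.3f}".format(margin_width))
print("support vectors:\n", clf.support_vectors_)
```

The support vectors printed at the end are the data points closest to the hyperplane. They alone determine where the hyperplane is placed.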


In [None]:
# import the model class from sklearn
from sklearn.svm import LinearSVC

# Create a model (also called creating an 'instance' of a model class in programming lingo) 
linsvc = LinearSVC(max_iter=2000)

# fit the model to our training data using its fit() method
linsvc.fit(x_train, y_train)

In [None]:
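# flatten the labels from column vectors of shape (n, 1) to 1D arrays of shape (n,),
# which is the shape that scikit-learn expects for y
y_train = y_train.ravel()
y_test = y_test.ravel()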

Linearly separable task vs non-linearly separable task:
<table>
    <tr>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/linear_sep.png" width="300" />
        </td>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/non-linear_sep.png" width="300" />
        </td>
    </tr>
</table>

(*[link to source](https://github.com/volkamerlab/ai_in_medicine/raw/master/images/linear_sep.png)*)
(*[link to source](https://github.com/volkamerlab/ai_in_medicine/raw/master/images/non-linear_sep.png)*)

<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/sphx_glr_plot_iris_svc_001.png" width="700" />

(*[link to source](https://github.com/volkamerlab/ai_in_medicine/raw/master/images/sphx_glr_plot_iris_svc_001.png)*)

Reading the warning message from Python, it appears that our linear SVC did not converge. This may imply that our data is not linearly separable, although such a warning can also be caused by features on very different scales or by too few iterations. For this tutorial, let's set the linear SVC aside and try the other classifiers.
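
One commonly suggested remedy for such a convergence warning is to standardize the features before fitting. The sketch below shows the pattern with a scikit-learn pipeline on synthetic stand-in data. The data and the variable names are assumptions for illustration, and we have not verified here that this removes the warning on the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the diabetes features (8 columns), not the real data
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
# Blow up one column to mimic variables measured on very different scales
X_syn[:, 0] *= 1000

# Standardize every feature first, then fit the linear SVC on the scaled values
scaled_linsvc = make_pipeline(StandardScaler(), LinearSVC(max_iter=2000))
scaled_linsvc.fit(X_syn, y_syn)

print("Training accuracy: {:.2f}".format(scaled_linsvc.score(X_syn, y_syn)))
```

Wrapping the scaler and the model in one pipeline ensures that the scaling learned on the training data is applied in the same way to any data passed to `predict()` or `score()` later.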


#### 5.5.2 Non-linear Support Vector Machine Classifier (SVC) 
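A non-linear SVC uses a kernel (by default the RBF kernel in scikit-learn) to learn curved decision boundaries, which makes it suitable for data that is not linearly separable.<br>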

In [None]:
# import the model class from sklearn
from sklearn.svm import SVC

# Instantiate an object of the model class
svc = SVC(probability=True) 
# We set probability to True when instantiating our SVC model to get a probability estimate of the labels.

# fit the model to our training data using its fit() method
svc.fit(x_train, y_train)

#### 5.5.3 Logistic Regression Classifier

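Logistic regression is a linear classifier: it passes a weighted sum of the input features through the logistic (sigmoid) function to estimate the probability that a subject belongs to the positive class.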

<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/logistic_reg.png" width="400" /> *([link to source](https://github.com/volkamerlab/ai_in_medicine/raw/update-2021.02/images/logistic_reg.png))* 
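
The S-shaped curve in the figure is the logistic (sigmoid) function. As a sketch on synthetic stand-in data (not the diabetes dataset), the following code recomputes the class probabilities of a fitted binary `LogisticRegression` by hand from its learned weights and compares them with `predict_proba`. The default settings of `LogisticRegression` are assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_syn, y_syn)

# Linear part: weighted sum of the features plus an intercept
z = X_syn @ model.coef_.ravel() + model.intercept_[0]
# The sigmoid squashes z into a probability between 0 and 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's probability for class 1
p_sklearn = model.predict_proba(X_syn)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

A subject is assigned to class 1 when this probability exceeds 0.5, which corresponds to the linear part `z` being positive.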


In [None]:
# import model from sklearn
from sklearn.linear_model import LogisticRegression

# Instantiate an object of the model class
logreg = LogisticRegression(max_iter=200) 
# We set the maximum number of iterations of our logistic regression classifier to 200 because it did not converge with the default value of 100.

# fit the model to our training data using its fit() method
logreg.fit(x_train, y_train)

Now that we have two models that have been fit to our training data, we can use our test data to evaluate them.

### 5.6 Evaluate model performance


#### 5.6.1 Make predictions

Let's make predictions on the test set, which we will later compare with the respective true labels to evaluate our models.

In [None]:
# import useful functions from the metrics module to evaluate our model
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
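try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # removed in scikit-learn 1.2: fall back to its replacement, which accepts the same arguments used below
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator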
from sklearn.metrics import classification_report

In [None]:
# use the predict() method from our model classes to predict labels given the test set of features x_test.
y_pred_svc = svc.predict(x_test)
y_pred_log = logreg.predict(x_test)

#### 5.6.2 Accuracy

$\text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn}$, is a measure of how well our classifier can determine the true labels of inputs. <br><br>
Here we will compute the accuracies of both of our models.

In [None]:
# we are using the model class method score() to return the accuracy of predictions from each model.
acc_svc = svc.score(x_test, y_test)
acc_log = logreg.score(x_test, y_test)
print('Accuracy of Support Vector classifier on test set: {:.1f}%'.format(acc_svc*100))
print('Accuracy of logistic regression classifier on test set: {:.1f}%'.format(acc_log*100))

Although accuracy is a common metric, it often does not tell the whole story.

We need other ways to assess how well our classifier performs.
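
A made-up example shows why accuracy alone can mislead. Assume a hypothetical test set in which only 5 of 100 subjects are diabetic, and a "classifier" that labels everyone as healthy:

```python
import numpy as np

# Hypothetical imbalanced test set: 95 healthy (0) and 5 diabetic (1) subjects
y_true = np.array([0] * 95 + [1] * 5)
# A useless "classifier" that labels every subject as healthy
y_pred = np.zeros_like(y_true)

tp = np.sum((y_true == 1) & (y_pred == 1))  # diabetic, predicted diabetic
tn = np.sum((y_true == 0) & (y_pred == 0))  # healthy, predicted healthy
fp = np.sum((y_true == 0) & (y_pred == 1))  # healthy, predicted diabetic
fn = np.sum((y_true == 1) & (y_pred == 0))  # diabetic, predicted healthy

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)  # fraction of diabetic subjects that were detected

print("accuracy:", accuracy)  # 0.95
print("recall:", recall)      # 0.0
```

The accuracy looks excellent although not a single diabetic subject was detected.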

#### 5.6.3 Outcomes of a classifier

_Successful predictions_ are only one of the possible outcomes of a prediction from a classifier. These outcomes can be generalized using the following four classes:

- True positives
- False positives
- True negatives
- False negatives

<table>
    <tr>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/confusion_matrix.png" width="500" />
        </td>
        <td>
            <img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/confusion_matrix_pregnancy.png" width="500" />
        </td>
    </tr>
</table>

(*[link to source](https://dzone.com/articles/understanding-the-confusion-matrix)*)
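
The four outcomes can be counted with scikit-learn's `confusion_matrix` function. The labels and predictions in this sketch are made up for eight hypothetical subjects.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for eight subjects (1 = diabetic)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, scikit-learn orders the flattened matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.75 0.75
```

Sensitivity and specificity are two of the metrics that can be derived from these four counts. Which one matters more depends on the medical question.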

#### 5.6.4 Confusion Matrices

We can visualize how the predictions of a classifier are distributed over these four outcomes by plotting a confusion matrix. We now use the `plot_confusion_matrix` function we imported above to visualize the True Positives, False Positives, False Negatives and True Negatives for both of our classifiers.

In [None]:
# Check the documentation to know what variables to use!
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15, 5))

plot_confusion_matrix(svc, x_test, y_test, ax=ax1) # , normalize='all'
ax1.set_title('Confusion Matrix - SVC') # , fontsize=20

plot_confusion_matrix(logreg, x_test, y_test, ax=ax2) # , normalize='all'
ax2.set_title('Confusion Matrix - LR') # , fontsize=20

plt.show()

#### 5.6.5 Receiver Operating Characteristic (ROC) Curve

- The ROC curve is a graphical method to summarise all possible confusion matrices of a model. 
- It shows the performance of a classification model at all thresholds. 
- It is a common choice for assessing a binary classifier. 
- The ROC curve is a plot of the True Positive Rate (recall) vs the False Positive Rate. 
- The diagonal line (C) represents the performance of a random classifier that has a 50% chance of outputting either label. 
- The area under the curve (AUC) of the ROC curve is a measure that tells us how well our classifier can distinguish between the classes.


<img src="https://github.com/volkamerlab/ai_in_medicine/raw/master/images/rocs.png" width="600" /> 

(*[link to source](https://github.com/volkamerlab/ai_in_medicine/raw/master/images/rocs.png)*)
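
To see how the points of a ROC curve arise, here is a minimal sketch with four made-up subjects. `y_score` stands for the predicted probability of class 1 and is hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one confusion matrix, i.e. one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold {:.2f}: FPR = {:.2f}, TPR = {:.2f}".format(th, f, t))

# The AUC is the area under those points; both calls give the same value
print(auc(fpr, tpr), roc_auc_score(y_true, y_score))  # 0.75 0.75
```

An AUC of 0.75 means that a randomly chosen positive subject receives a higher score than a randomly chosen negative subject in 75% of the cases.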

In [None]:
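# compute the AUC from the predicted probabilities of class 1 (not from the hard 0/1 predictions),
# so that the scores match the ROC curves plotted in the next cell
svc_roc_auc = roc_auc_score(y_test, svc.predict_proba(x_test)[:, 1])
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(x_test)[:, 1])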
print('ROC AUC score for Support Vector classifier on test set: {:.1f}%'.format(svc_roc_auc*100))
print('ROC AUC score for logistic regression classifier on test set: {:.1f}%'.format(logit_roc_auc*100))

In [None]:
# create Figure, Axes objects and set figure dimensions
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15, 5))

# compute and plot the AUC and ROC values for the support vector classifier
fpr, tpr, thresholds = roc_curve(y_test, svc.predict_proba(x_test)[:,1])

ax1.plot(
    fpr, tpr, 
    label='SVC - RBF Kernel (area = {:.2f})'.format(svc_roc_auc)
)
ax1.plot([0, 1], [0, 1],'r--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.05])
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('Receiver operating characteristic')
ax1.legend(loc="lower right")
ax1.grid()


# compute and plot the AUC and ROC values for the logistic regression classifier
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test)[:,1])

ax2.plot(
    fpr, tpr, 
    label='Logistic Regression (area = {:.2f})'.format(logit_roc_auc)
)
ax2.plot([0, 1], [0, 1],'r--')
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('Receiver operating characteristic')
ax2.legend(loc="lower right")
ax2.grid()

plt.show()

**Discussion:** Which model do you think is better? 

In practice, data scientists also use n-fold cross-validation and permutation tests to better compare models, tune hyperparameters, and get better estimates of the metrics. <br>
This results in a clearer picture of how the models perform and which model to use for classifying novel data.
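
As a pointer for further exploration, n-fold cross-validation takes only a few lines with scikit-learn's `cross_val_score`. The sketch below uses synthetic stand-in data (not the diabetes dataset) and 5 folds. Both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=200), X_syn, y_syn, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores.round(2))
print("Mean accuracy: {:.2f} (std {:.2f})".format(scores.mean(), scores.std()))
```

The spread of the per-fold scores gives an impression of how much a single train/test split can vary by chance.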

## 6. Summary

Let's briefly revisit some topics covered in this module:

- Machine learning is the task of learning models that map input variables (features) to the desired outputs (labels) as accurately as possible
    - classification: outputs are discrete variables or classes
    - regression: outputs are continuous variables
- We can load our dataset in python from a csv file using pandas and explore it using different plotting types
    - normally we would have to preprocess our data
- Data is split up into training and test sets so that we can evaluate the model's generalization capacity on unseen data
- Models are trained on the training set and their performance is evaluated on the test set
- It is generally a good idea to use different metrics when evaluating a model 
    - this leads to a better understanding of how the model performs


## 7. Exercises
### Let us now use the different columns to predict if a subject's age is 35 years or older. 
Going through these exercises, you will develop a better understanding of how to train and test models given a dataset. We suggest that you use the above code as a reference but do **NOT** simply copy and paste. You will gain a deeper understanding if you type the code yourself, implement the functions, and use the docs to understand how the functions work and which parameters to pass in.

Let's create a dataframe for our task

In [None]:
# set a common random seed to avoid getting very different scores due to model stochasticity
np.random.seed(0)

In [None]:
# add diabetes as another input variable
df_age = df_clean.rename(columns={"Outcome":"Diabetes"})
# Make age the new 'Outcome' variable
df_age["Outcome"] = (df_age["Age"] >= 35).astype(int)
df_age = df_age.drop(columns=["Age"])
df_age

### 7.1 Plot the new label's distribution or counts

In [None]:
# YOUR CODE GOES HERE
# Note: Play around with the plot methods and parameters to see how they change the 
#       display of your plots.

<!-- 
ax = df_age["Outcome"].value_counts().plot(kind='bar',
                       title="Age<35 vs Age>=35",
                       ylabel="number of subjects", 
                       rot=0)
tick_names = ax.set_xticklabels(["Age<35", "Age>=35"])
plt.show()-->

### 7.2 Create the new X (features) and y (labels) variables

In [None]:
# YOUR CODE GOES HERE

<!-- X = df_age[["Pregnancies","Glucose","BloodPressure","SkinThickness",
              "Insulin","BMI","DiabetesPedigreeFunction","Diabetes"]].values
y = df_age[["Outcome"]].values -->

In [None]:
# verification: this cell should return ((724, 8), (724, 1))
X.shape, y.shape 

### 7.3 Split your data into training and test sets

You should use the `train_test_split` function imported from sklearn. We want you to use a test set size of 25%. 

Explicitly pass `shuffle=False` in `train_test_split` to make sure we all have the same train and test splits.

In [None]:
# YOUR CODE GOES HERE

<!-- x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False) 
y_train = y_train.ravel()
y_test = y_test.ravel() -->

In [None]:
# verification: this cell should return numpy shapes ((543, 8), (181, 8), (543, 1), (181, 1));
# the label shapes read (543,) and (181,) instead if you already flattened y with ravel()
(x_train.shape), (x_test.shape), (y_train.shape), (y_test.shape)

### 7.4 Train your Models

Implement the non-linear SVC and Logistic Regression models and fit them to the training data.

In [None]:
# YOUR CODE GOES HERE
# Hint: Remember to compute the probabilities for SVC and make sure that the logistic regression 
#       model converges (by increasing the max_iter argument if necessary).

### 7.5 Evaluate your Models

#### 7.5.1 Predict the labels of the test set using both classifiers

In [None]:
# Hint: Use the predict method from the classifier class
# YOUR CODE GOES HERE

<!-- y_pred_svc = svc.predict(x_test)
y_pred_log = logreg.predict(x_test) -->

#### 7.5.2 Compute and print the Accuracy for both models

In [None]:
# Hint: Use the score method from each model class.
# YOUR CODE GOES HERE

<!-- acc_svc = svc.score(x_test, y_test)
acc_log = logreg.score(x_test, y_test)
print('Accuracy of Support Vector classifier on test set: {:.1f}%'.format(acc_svc*100))
print('Accuracy of logistic regression classifier on test set: {:.1f}%'.format(acc_log*100)) -->

#### 7.5.3 Plot the confusion matrices from each classifier

In [None]:
# Hint: Use the plot_confusion_matrix function imported from sklearn
# YOUR CODE GOES HERE

<!-- fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15, 5))

plot_confusion_matrix(svc, x_test, y_test, ax=ax1) # , normalize='all'
ax1.set_title('Confusion Matrix - SVC') # , fontsize=20

plot_confusion_matrix(logreg, x_test, y_test, ax=ax2) # , normalize='all'
ax2.set_title('Confusion Matrix - LR') # , fontsize=20 

plt.show() -->

#### 7.5.4 Plot the ROC curves and compute the AUC for both classifiers

In [None]:
# Hint: Use the roc_auc_score and roc_curve functions
# YOUR CODE GOES HERE

<!-- fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15, 5))


# compute and plot the AUC and ROC values for the support vector classifier
svc_roc_auc = roc_auc_score(y_test, svc.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, svc.predict_proba(x_test)[:,1])


ax1.plot(
    fpr, tpr, 
    label='SVC - RBF Kernel (area = {:0.2f})'.format(svc_roc_auc)
)
ax1.plot([0, 1], [0, 1],'r--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.05])
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('Receiver operating characteristic')
ax1.legend(loc="lower right")
ax1.grid()


# compute and plot the AUC and ROC values for the logistic 
# regression classifier
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test)[:,1])


ax2.plot(
    fpr, tpr, 
    label='Logistic Regression (area = {:0.2f})'.format(logit_roc_auc)
)
ax2.plot([0, 1], [0, 1],'r--')
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('Receiver operating characteristic')
ax2.legend(loc="lower right")
ax2.grid()

plt.show() -->