## Lab 8.01 Neural Network Binary Classification
_By: Jeff Hale_

This lab uses a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. The goal is to predict whether a patient has heart disease (the *target* column).

We suggest running this lab in [Google Colab](https://colab.research.google.com/). 

Go to *File* -> *Upload notebook* to upload this notebook to Colab.

If you are working in Colab and want to see if you can get a speed boost, set your runtime to *TPU* by going to *Runtime* -> *Change runtime type* in the menu above. 

The processor type won't make a big difference with this small dataset and small networks, but it's good to know how to change the processor.

### Read data and load using pandas

#### Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import recall_score, accuracy_score, f1_score, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
import seaborn as sns
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

Download the CSV file containing the heart dataset from Google Cloud Storage. `tf.keras.utils.get_file` caches the file locally and returns its path.

In [10]:
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')

#### Load the csv file using pandas

In [11]:
heart_df = pd.read_csv('./data/heart.csv')

#### Inspect

In [12]:
heart_df.dropna(inplace=True)

In [13]:
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        297 non-null    int64  
 1   Sex        297 non-null    int64  
 2   ChestPain  297 non-null    object 
 3   RestBP     297 non-null    int64  
 4   Chol       297 non-null    int64  
 5   Fbs        297 non-null    int64  
 6   RestECG    297 non-null    int64  
 7   MaxHR      297 non-null    int64  
 8   ExAng      297 non-null    int64  
 9   Oldpeak    297 non-null    float64
 10  Slope      297 non-null    int64  
 11  Ca         297 non-null    float64
 12  Thal       297 non-null    object 
 13  AHD        297 non-null    object 
dtypes: float64(2), int64(9), object(3)
memory usage: 34.8+ KB


In [14]:
heart_df.head(3)

   Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  \
0   63    1       typical     145   233    1        2    150      0      2.3   
1   67    1  asymptomatic     160   286    0        2    108      1      1.5   
2   67    1  asymptomatic     120   229    0        2    129      1      2.6   

   Slope   Ca        Thal  AHD  
0      3  0.0       fixed   No  
1      2  3.0      normal  Yes  
2      2  2.0  reversable  Yes  

#### Check the value counts of the columns that are objects
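One way to do this check is sketched below. It runs on a tiny, hypothetical stand-in DataFrame (`toy_df`) that borrows the lab's column names, so the rows are made up for illustration; in the lab you would run the same loop on `heart_df`.

```python
import pandas as pd

# Hypothetical stand-in for heart_df: made-up rows, same column names as the lab data
toy_df = pd.DataFrame({
    "Age": [63, 67, 67, 37],
    "ChestPain": ["typical", "asymptomatic", "asymptomatic", "nonanginal"],
    "Thal": ["fixed", "normal", "reversable", "normal"],
    "AHD": ["No", "Yes", "Yes", "No"],
})

# Select every non-numeric column, then count the distinct values in each
object_cols = toy_df.select_dtypes(exclude="number").columns
counts = {col: toy_df[col].value_counts() for col in object_cols}

for col, value_counts in counts.items():
    print(value_counts, "\n")
```

The counts show how many categories each column has and whether the target classes are balanced.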

You could load the data into a `tf.data.Dataset`, but these are unwieldy: it's a serious pain to create a validation dataset from a TF dataset. You used to have to convert your data to NumPy arrays for TensorFlow. Now you can just keep it as a pandas DataFrame!

#### Set up X and y
Convert `ChestPain`, `Thal`, and `AHD` (the target) columns to numeric.
To convert the features to numeric, feel free to give the friendlier one-hot encoder (OHE) from `category_encoders` a try:
```
!pip install category_encoders 
import category_encoders as ce
```

In [1]:
!pip install category_encoders 
import category_encoders as ce

Collecting category_encoders
  Downloading category_encoders-2.2.2-py2.py3-none-any.whl (80 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2
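As a rough sketch of the conversion, the block below maps `AHD` to 0/1 and one-hot encodes `ChestPain` and `Thal`. It swaps in `pd.get_dummies` rather than `category_encoders` so that it runs with pandas alone, and it uses a made-up `toy_df` in place of `heart_df`.

```python
import pandas as pd

# Hypothetical stand-in for heart_df: made-up rows, same column names as the lab data
toy_df = pd.DataFrame({
    "Age": [63, 67, 67, 37],
    "ChestPain": ["typical", "asymptomatic", "asymptomatic", "nonanginal"],
    "Thal": ["fixed", "normal", "reversable", "normal"],
    "AHD": ["No", "Yes", "Yes", "No"],
})

# Target: 1 if the patient has heart disease, else 0
y = (toy_df["AHD"] == "Yes").astype(int)

# Features: drop the target, then one-hot encode the two text columns
X = pd.get_dummies(
    toy_df.drop(columns="AHD"),
    columns=["ChestPain", "Thal"],
    dtype=float,
)

print(X.columns.tolist())
print(y.tolist())
```

Passing `dtype=float` keeps every feature numeric, which is what the Keras layers expect.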


#### Train-test split
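A minimal sketch of a stratified split follows. The arrays `X_demo` and `y_demo` are synthetic placeholders, and the `test_size` and `random_state` values are arbitrary choices, not settings prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 5 features, balanced binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([0, 1] * 50)

# stratify keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_train.shape, X_test.shape)
```

Stratifying matters more on a small dataset like this one, where a purely random split can leave the test set with a skewed class ratio.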

#### Set the TensorFlow random seed
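A short sketch of seeding, assuming eager execution (the TensorFlow 2 default). The seed value 42 is arbitrary.

```python
import tensorflow as tf

# Set the global seed, then draw some random numbers
tf.random.set_seed(42)
first_draw = tf.random.uniform([3]).numpy()

# Resetting the same seed restarts the random sequence
tf.random.set_seed(42)
second_draw = tf.random.uniform([3]).numpy()

print(first_draw)
print(second_draw)
```

Setting the seed before building the model makes the initial weights repeatable between runs, although some hardware-level operations can still introduce small differences.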

## Create and train a model

#### Create model structure

#### Visualize the structure

#### Compile

#### Make an EarlyStopping callback

#### Fit the model
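The steps above (structure, compile, callback, fit) can be sketched end to end as follows. This is one possible starting point, not the lab's reference solution: the layer sizes, dropout rate, patience, and epoch count are arbitrary assumptions, and `X_demo`/`y_demo` are synthetic placeholders for the training data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential

tf.random.set_seed(42)

# Synthetic placeholder data: 200 rows, 8 features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 8)).astype("float32")
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype("float32")

# Structure: one hidden layer, dropout, and a sigmoid output for binary classification
model = Sequential([
    Input(shape=(X_demo.shape[1],)),
    Dense(16, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])

# Compile: binary cross-entropy loss, tracking accuracy, recall, and precision
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        "accuracy",
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.Precision(name="precision"),
    ],
)

# Callback: stop when validation loss stops improving and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

# Fit: hold out 20% of the training rows for validation
history = model.fit(
    X_demo,
    y_demo,
    validation_split=0.2,
    epochs=20,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0,
)

print(sorted(history.history.keys()))
```

`history.history` holds one list per metric, per epoch, which is what the plotting step below needs.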

#### Plot model performance on the validation data (accuracy, recall, and precision)

#### Save model

#### Load model

#### Evaluate on test set 

#### X_test predictions

#### What do those numbers look like? Let's get rid of that exponential notation and just round the predictions.
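A small sketch of tidying up sigmoid outputs. The `raw_preds` values are made up to mimic what `model.predict` returns, and the 0.5 cutoff is an assumption you may want to tune.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like model.predict returns
raw_preds = np.array([[9.8e-01], [3.2e-04], [5.0e-01], [7.1e-01]])

# Print plain decimals instead of exponential notation
np.set_printoptions(suppress=True)
rounded_probs = np.round(raw_preds, 3)
print(rounded_probs)

# Turn probabilities into class labels with an explicit threshold
labels = (raw_preds > 0.5).astype(int).ravel()
print(labels)
```

An explicit threshold is used for the labels because `np.round` rounds exact halves to the nearest even number, so 0.5 would become 0 without that being obvious from the code.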

#### How is the model performing? 
Let's look at the confusion matrix using TF.

Or just use sklearn's confusion matrix. 
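A minimal sketch with scikit-learn, using made-up labels (`y_true`, `y_pred`) in place of `y_test` and the thresholded model predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```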

#### Evaluate with other sklearn metrics
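The metrics imported at the top of the notebook can be applied as sketched here, again on made-up labels rather than the lab's actual predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```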

### Make a null model
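One way to build a null model is to always predict the majority class. The sketch below uses scikit-learn's `DummyClassifier` on a made-up target; the features are ignored by this strategy, so a column of zeros stands in for them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up target with six negatives and four positives
y_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
X_placeholder = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_placeholder, y_demo)

baseline_accuracy = null_model.score(X_placeholder, y_demo)
print(baseline_accuracy)
```

Any real model should beat this baseline accuracy; note that the null model's recall on the positive class is zero here, which is one reason accuracy alone can mislead.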

## Can you make a better model?

Change the model architecture and see if you can make a better model. Add nodes, dense layers, and dropout layers.

#### Save your best model. 


#### Compare with other algorithms
Compare your best neural net model to a scikit-learn LogisticRegression model. Also try a GradientBoostingClassifier.

#### Scale/Standardize for Logistic Regression


#### Logistic Regression
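Scaling and logistic regression can be combined in one pipeline, as sketched below. The data comes from `make_classification` as a placeholder for the heart features, and `max_iter=1000` is an arbitrary choice to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data with 13 features, like the heart dataset's feature count
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only, which avoids leakage
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

logreg_preds = logreg.predict(X_test)
logreg_accuracy = logreg.score(X_test, y_test)
print(logreg_accuracy)
```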

#### Evaluate performance on metrics other than accuracy

### GradientBoosting
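A comparable sketch for gradient boosting, again on synthetic placeholder data and with default hyperparameters apart from the seed. Tree-based models do not need scaled inputs, so the scaler is left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with 13 features
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
gb_preds = gb.predict(X_test)

# Look beyond accuracy: recall and precision on the positive class
gb_recall = recall_score(y_test, gb_preds)
gb_precision = precision_score(y_test, gb_preds)
print(f"recall={gb_recall:.3f} precision={gb_precision:.3f}")
```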

#### Evaluate performance on metrics other than accuracy

#### Which evaluation metric(s) are best to use in this problem?

#### Which model would you recommend for use? Why?


#### If you used Google Colab, download your notebook and put it in your Submissions repo. 🎉