# Heart Disease Prediction Hackathon

This notebook challenges you to put your knowledge to work by creating a classifier that predicts whether a patient has heart disease from the clinical data provided. We have provided the code needed to load the data and get started.


### Problem Definition

Simply put, the problem is to train a model that predicts whether a patient has heart disease based on clinical data about that patient.

### Submission Instructions

See the [Submission](#Submission) section. Make sure to update the last cell of the notebook to reflect your name, email, and your model's score, as shown in that cell.


### Goal Setting

Your goal is to improve different sections of this notebook (data wrangling, modeling, etc.) to create a more accurate predictor and submit your notebook. Your model's accuracy will determine your ranking in the challenge - the more accurate, the better. 

**Overfitting:** Beware of overfitting. The data we will use for testing your model is not included in this notebook, so it will be previously unseen by your model. Therefore, make sure your model is not overfitting!
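
One common way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the heart disease data, and the names `X_demo`, `y_demo`, and `results` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the UCI heart disease data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Compare an unrestricted tree with a depth-limited one
results = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print("max_depth={}: train={:.2f}, held-out={:.2f}".format(depth, *results[depth]))
```

A large gap between the training score and the held-out score suggests that the model is memorizing the training data rather than learning patterns that generalize.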

## Preparing the notebook

In [None]:
# Importing the basic data science libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


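# Configuring Seaborn plots: notebook context, darkgrid style, 6x4 figures
# (a single call, since a later sns.set(style=...) overrides an earlier sns.set_style(...))
sns.set(context='notebook', style='darkgrid', rc={'figure.figsize': (6, 4)})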

## Data Collection

The first step is to collect the data. Here we load the data directly from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/heart+disease). You should visit their website and read more about the dataset.

### Notes:
1. We directly set the column names based on their description on the website above.
2. We are importing the pre-processed version of the data. If you want to challenge yourself, you can load the raw data from the following address and experiment with it.
    > Raw Data: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
    
### Acknowledgement:

From the [Dataset description](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names):

>   The authors of the databases have requested:
>
>      ...that any publications resulting from the use of the data include the 
>      names of the principal investigator responsible for the data collection
>      at each institution.  They would be:
>
>       1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
>       2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
>       3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
>       4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
>	  Robert Detrano, M.D., Ph.D.
>
>   Thanks in advance for abiding by this request.
>
>   David Aha
>   July 22, 1988

In [None]:
# Load the Dataset
# Specify column names
# Specify that ? denotes missing data, aka NA (or NaN: Not-a-Number)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data',
                names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "issick"],
                na_values = "?")

In [None]:
# Combine all sick values (1,2,3,4) into a single label (1)
df.issick = df.issick.astype(bool).astype(int)

## Data Exploration

Here you will see some basic exploratory plots to gain insights into the data. You should explore the data more to test your hypotheses.

In [None]:
# Checking target column counts
df.issick.value_counts()

In [None]:
# Plotting target column counts
sns.countplot(x="issick", data=df)

In [None]:
# Plotting gender distribution counts
# HINT: (Advanced) Note that the data is skewed; this will teach our model bias (bad). You can improve your model by removing the skew from your data. An online search is a good start.
sns.countplot(x='sex', data=df)
plt.title("Sex Count (0 = female, 1 = male)")
plt.show()

In [None]:
# Based on the following plot, age seems to have some correlation with heart disease. Let's use that in our model.
pd.crosstab(df.age,df.issick).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Plot sickness vs. sex
grid = sns.FacetGrid(df, col='sex')
grid.map(sns.countplot, 'issick')

# Add Legend
grid.add_legend()
grid.set_axis_labels("is sick", "Frequency")

# Add Title
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('Sex vs. sickness (0 = female, 1 = male)')

In [None]:
# Plot sickness across maximum heart rate and age

ax = sns.scatterplot(data=df, x='age', y='thalach', hue='issick')

plt.xlabel("Age")
plt.ylabel("Maximum heart rate")
plt.title("Sickness across heart rate and age")

### Your turn:

You can use the following space to explore the data and test your hypotheses. For instance, you might find some skew in the data; that is, you may have significantly more data for one class than the other (e.g., more male samples than female). In that case, you can look into different methods to remove the skew from your data to improve your model later.
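
As a starting point, here is a minimal sketch of how the skew could be measured and reduced by upsampling the smaller group with `sklearn.utils.resample`. The `toy` frame holds made-up values standing in for `df`, and upsampling is only one of several possible techniques, not necessarily the best one for this data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the notebook's `df` (made-up values, not the UCI data)
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 1, 1, 0, 0],
    "issick": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Quantify the skew as proportions rather than raw counts
print(toy["sex"].value_counts(normalize=True))

# Upsample the smaller group (sampling with replacement) to the larger group's size
majority = toy[toy["sex"] == 1]
minority = toy[toy["sex"] == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(balanced["sex"].value_counts())
```

If you resample, consider doing it only on the training split, so that copies of the same row do not end up in both the training and the test data.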

In [None]:
# Your Turn




## Data Cleaning

In this section, we will clean our data. Since UCI has already pre-processed this dataset, there is only a little work to be done.

In [None]:
# Let's see which rows have NA values
df[df.isna().any(axis=1)]

In [None]:
# Now let's fill in the NA values with 0
## Hint: You may want to drop the rows with NA values (listed by the cell above), fill them with df.mean() instead of 0, or fill them in with a more realistic method.

df = df.fillna(0)

In [None]:
# Let's verify that there are no rows with NA anymore
df[df.isna().any(axis=1)]

### Your Turn:

You can use the space below to explore other methods of filling in missing data instead of filling in 0. You'd need to rerun the cells from the top until the point where we fill `NaN` with 0 (two cells above) and then apply your own filling technique instead. Alternatively, you can try removing those rows to see if that helps.
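
The two alternatives mentioned above could look roughly like the sketch below. The `toy` frame uses made-up values, and the assumption that the gaps sit in discrete columns such as `ca` and `thal` should be checked against the output of the NA cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df` (made-up values, not the UCI data)
toy = pd.DataFrame({
    "ca":   [0.0, 1.0, np.nan, 3.0, 0.0],
    "thal": [3.0, 7.0, 7.0, np.nan, 6.0],
    "chol": [233.0, 286.0, 229.0, 250.0, 204.0],
})

# Option A: drop every row that has at least one missing value
dropped = toy.dropna()

# Option B: fill discrete columns with their most frequent value (the mode)
filled = toy.copy()
for col in ["ca", "thal"]:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(len(toy), len(dropped))
print(filled)
```

For continuous columns, `median()` or `mean()` is usually a more natural fill value than the mode.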

In [None]:
# Your turn



## Data Wrangling (Data Engineering)

Time to clean up our data. Reading the documentation for this dataset, we notice that `cp`, `thal`, and `slope` are categorical data. We should treat them as such instead of linear numerical values to improve our model. 
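
One common way to keep a model from reading category codes as ordered numbers is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up `toy` frame; it is an optional alternative for you to try, not a step the rest of this notebook relies on.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (made-up values, not the UCI data)
toy = pd.DataFrame({
    "age":   [63, 67, 41],
    "cp":    [1.0, 4.0, 2.0],
    "slope": [3.0, 2.0, 1.0],
})

# Replace each categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["cp", "slope"], dtype=int)
print(encoded.columns.tolist())
```

Each original column is replaced by one indicator column per distinct value, so exactly one indicator in each group is set for every row.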

In [None]:
# Convert cp, thal, and slope to the pandas categorical dtype
df["cp"] = df["cp"].astype('category')
df["thal"] = df["thal"].astype('category')
df["slope"] = df["slope"].astype('category')

In [None]:
# Let's make sure Pandas is treating cp, thal, and slope as categorical data

df.dtypes

### Your turn:

You can use your domain knowledge about what might indicate heart disease to create new columns that capture it. For instance, you might want to multiply two columns and save the result as a new column, or you might want to apply a nonlinear operation (think logarithm or exponentiation) to some columns. If you think you can find something that more directly indicates heart disease, try adding it and see how your model's performance changes.
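
A few derived columns of this kind are sketched below on a made-up `toy` frame. The feature names (`thalach_ratio`, `log_chol`, `age_x_oldpeak`) are hypothetical, and the `220 - age` estimate of maximum heart rate is a widely quoted rule of thumb used here as an assumption, not something taken from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df` (made-up values, not the UCI data)
toy = pd.DataFrame({
    "age":     [63, 67, 41],
    "thalach": [150, 108, 172],
    "chol":    [233, 286, 204],
    "oldpeak": [2.3, 1.5, 1.4],
})

# Ratio: achieved heart rate relative to a rule-of-thumb maximum (assumption)
toy["thalach_ratio"] = toy["thalach"] / (220 - toy["age"])

# Nonlinear transform: the logarithm compresses large cholesterol values
toy["log_chol"] = np.log(toy["chol"])

# Interaction: product of two existing columns
toy["age_x_oldpeak"] = toy["age"] * toy["oldpeak"]

print(toy.round(3))
```

Whether any of these helps is an empirical question: add one feature at a time and compare your model's score before and after.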

In [None]:
## Your turn



## Data Analysis (Model Training)

In [None]:
# We now separate our input data and prediction label into X and y

y = df.issick.values
X = df.drop(['issick'], axis = 1)

In [None]:
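# Let's now normalize our data: scale each row (sample) to unit L2 norm
# NOTE: preprocessing.normalize works row-wise; it does not scale each column to the 0-1 range. For per-column scaling, see sklearn.preprocessing.MinMaxScaler or StandardScaler.
# HINT: Scaling is useful for models such as sklearn.linear_model.LogisticRegression, which performs well on this data. You can try that model later in this section.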

from sklearn import preprocessing
X = preprocessing.normalize(X, norm='l2')

In [None]:
# Let's separate our data for training and testing (here test and validation are the same.)
# HINT: You may want to keep a smaller part of the data for testing. test_size = 0.85 holds out 85% of the data for testing, which leaves only 15% for training.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.85,random_state=0)

In [None]:
# Time to train our models and test them. 
# Let's start with a Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

acc = dtc.score(X_test, y_test)*100
print("Decision Tree Test Accuracy {:.2f}%".format(acc))

In [None]:
# Let's also try a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(X_train, y_train)

acc = rf.score(X_test,y_test)*100
print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(acc))

### Your Turn:

You may want to try the following models as well:
- SVC: [Support Vector Machine Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- LR: [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
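
Both models are sensitive to feature scale, so one reasonable setup is to wrap each in a pipeline with a scaler and compare them using cross-validation. The sketch below is only an illustration: it runs on synthetic data from `make_classification` rather than the notebook's `X` and `y`, and the names `X_demo`, `y_demo`, `models`, and `scores` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the UCI heart disease data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

# Each pipeline standardizes the columns before fitting the classifier
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

# 5-fold cross-validation gives a more stable estimate than a single split
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print("{} mean CV accuracy: {:.2f}%".format(name, scores[name] * 100))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling step.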

In [None]:
# Your Turn



## Going Further:

We explored some of the essential data cleaning and modeling techniques above. To go further, look at the hints that we embedded throughout this notebook. 

Additionally, since this is a well-studied dataset, you can look online for other public work that might inspire you. This [Kaggle forum](https://www.kaggle.com/ronitf/heart-disease-uci/code) is a good place to get started. For instance, [this notebook](https://www.kaggle.com/faressayah/predicting-heart-disease-using-machine-learning) might be a good read.

In [None]:
## Your turn



## Submission

Include your final model's score in the cell below and submit your notebook as follows:

1. Update the cell below: Include `name`, `email`, and final model score
1. Save Notebook: `File` > `Save`
1. Download Notebook: `File` > `Download As` > `Notebook`
1. Upload your notebook here: https://ibm.biz/kp-hack-submission

In [None]:
# Submission

# Name: 
# Email Address: 


## REPLACE THIS SECTION ########################## 
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

acc = dtc.score(X_test, y_test)*100
## REPLACE ABOVE ##################################

print("Decision Tree Test Accuracy {:.2f}%".format(acc))