## Heart disease prediction

In this problem you are given a tabular dataset with 13 attributes that are thought to be good indicators of heart disease. The dataset is stored in CSV format and can be found in `data/heart.csv`. It originally comes from the UCI Machine Learning Repository and is available on [Kaggle](https://www.kaggle.com/ronitf/heart-disease-uci).

### Exploring the data

In [None]:
# First, download the data to the Colab environment
# (-p keeps the cell from failing if the directory already exists)
!mkdir -p data
!wget -O data/heart.csv https://raw.githubusercontent.com/MJafarMashhadi/MachineLearningWorkshop/master/data/heart.csv

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

source_file = 'data/heart.csv'
data = pd.read_csv(source_file)
data

In [None]:
data.describe()

In [None]:
features, labels = data.iloc[:,:-1], data.target

print(features.shape, labels.shape)  # Expected: (303, 13) (303,)

### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [None]:
random_seeds = range(5000,5051)
k = 10

scores = {'LR': [], 'KNN': []}

for seed in random_seeds:
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=seed)
    
    LogisReg = LogisticRegression(solver='liblinear').fit(X_train, y_train)
    KNN = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    
    scores['LR'].append(LogisReg.score(X_test, y_test))
    scores['KNN'].append(KNN.score(X_test, y_test))

scores = pd.DataFrame(scores)
axis = scores.boxplot()
axis.set_ylim([0,1]);

### Exploring the answers

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=5011)
LogisReg = LogisticRegression(solver='liblinear').fit(X_train, y_train)  # Train once

In [None]:
idx = [5, 10, 15, 20, 22, 25]

inspectX, inspectY = X_test.iloc[idx], y_test.iloc[idx]
predictedY = LogisReg.predict(inspectX)
predictedY_df = pd.DataFrame({'prediction': predictedY}, index=inspectY.index)

pd.concat([inspectX, inspectY, predictedY_df], axis=1)

### What can be improved?

- Hyperparameter tuning (for example, the number of neighbors `k`)
- Data normalization
- One-hot encoding of categorical data
- Regularization (L1 for example)
- More data?
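
Several of the ideas above can be combined in a single scikit-learn pipeline. The following is a minimal sketch, not a tuned solution: it standardizes the numeric columns, one-hot encodes the categorical ones, fits an L1-regularized logistic regression, and searches over the regularization strength `C` with cross-validation. The helper name `build_search`, the grid of `C` values, and the split of columns into `NUMERIC` and `CATEGORICAL` are assumptions made for illustration; the column names are the ones expected in the Kaggle version of `heart.csv`, so check them against `data.columns` first.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column roles; verify against data.columns before use.
NUMERIC = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
CATEGORICAL = ['cp', 'restecg', 'slope', 'ca', 'thal']
# Remaining columns (assumed binary: sex, fbs, exang) are passed through unchanged.


def build_search(numeric=NUMERIC, categorical=CATEGORICAL):
    """Return a grid search over a preprocessing + logistic regression pipeline."""
    preprocess = ColumnTransformer(
        transformers=[
            # Normalization: zero mean, unit variance for numeric features
            ('num', StandardScaler(), numeric),
            # One-hot encoding; categories unseen during fit are encoded as all zeros
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
        ],
        remainder='passthrough',
    )
    model = Pipeline([
        ('preprocess', preprocess),
        # L1 regularization; liblinear is one of the solvers that supports it
        ('clf', LogisticRegression(penalty='l1', solver='liblinear')),
    ])
    # Hyperparameter search: smaller C means stronger regularization
    param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
    return GridSearchCV(model, param_grid, cv=5)
```

With the split from the previous section, `search = build_search().fit(X_train, y_train)` followed by `search.score(X_test, y_test)` gives an accuracy that can be compared against the untuned models. Because the scaler and encoder are fitted inside the pipeline, each cross-validation fold only sees statistics computed from its own training part.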