# Explaining machine learning
This notebook contains the exploratory data analysis of the heart disease dataset, together with the simple machine learning models used to predict the disease.

## Plan
1. Introduction
  1. What are we going to explain here
    1. Explain AI/ML
    2. Not a tutorial: a concrete example, for a general audience, of what it can be used for
    3. Try to explain a data scientist's job to the people we know and to anyone interested
  2. Why do we need to talk about AI in general
    1. Future
    2. Generic term needs to be addressed
    3. See the good side
2. What is a Data Scientist
  1. Work with data
  2. Data analysis
  3. Create models
  4. Communicate
3. Example dataset
  1. Explaining the dataset and the goal
  2. A few statistics/plots
  3. Predicting the heart disease
  4. Explaining results
4. Communicating results to business/boss
  1. Presenting the example's results
  2. Another level of abstraction?
5. (Optional) Cleaning the codebase / explaining why we would need another language
6. Conclusions, pointing to the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sklearn.metrics as metrics
import torch
%config InlineBackend.figure_format = 'retina'
pd.set_option('display.max_columns', 500)

In [None]:
from torch import FloatTensor
from torch.utils.data import Dataset
from torch.utils.data import SequentialSampler, DataLoader, WeightedRandomSampler, BatchSampler
from sklearn.preprocessing import normalize
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

## Dataset fields description
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
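
The integer codes above are compact but hard to read in tables and plots. As a minimal sketch, assuming only the codes listed above, the lookup tables below turn one record's codes into readable labels. The names `CHEST_PAIN_LABELS`, `SLOPE_LABELS`, `THAL_LABELS`, `FIELD_LABELS` and `decode_record` are illustrative and not part of the notebook.

```python
# Hypothetical lookup tables built from the field descriptions above.
CHEST_PAIN_LABELS = {1: "typical angina", 2: "atypical angina",
                     3: "non-anginal pain", 4: "asymptomatic"}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}

FIELD_LABELS = {"cp": CHEST_PAIN_LABELS, "slope": SLOPE_LABELS, "thal": THAL_LABELS}


def decode_record(record):
    """Return a copy of `record` with the coded fields replaced by their labels."""
    decoded = dict(record)
    for field, labels in FIELD_LABELS.items():
        if field not in decoded:
            continue
        try:
            # Codes may be stored as floats (e.g. 3.0), so cast before the lookup.
            code = int(decoded[field])
        except (TypeError, ValueError):
            # Missing values such as "?" or NaN are left untouched.
            continue
        decoded[field] = labels.get(code, decoded[field])
    return decoded


example = decode_record({"age": 63, "cp": 1.0, "slope": 3.0, "thal": 6.0})
print(example)
```

Fields without a lookup table, such as `age`, pass through unchanged.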

In [None]:
heart_df = pd.read_csv("../data/heart-disease/processed.cleveland.data", delimiter=",",
            names=["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"])
heart_df = heart_df.rename(columns={"cp":"chest_pain",
                         "thalach":"max_heart_rate",
                         "oldpeak":"st_dep_induced",
                         "ca":"num_maj_ves"})

heart_df['num_bin'] = heart_df['num'].apply(lambda x: 1 if x > 0 else 0)

heart_df = heart_df.replace('?', np.nan)
heart_df = heart_df.dropna(axis=0, how="any")

heart_df_noncat = heart_df.copy()
categorical_cols = ["sex", "chest_pain", "fbs", "restecg", "exang",
                    "slope", "num_maj_ves", "thal", "num", "num_bin"]
heart_df[categorical_cols] = heart_df[categorical_cols].apply(lambda x: x.astype('category'))

heart_df_noncat[['num_maj_ves', 'thal']] = heart_df_noncat[['num_maj_ves', 'thal']].astype('float')
heart_df_noncat = heart_df_noncat.reset_index(drop=True)

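# Features: every column except the two target columns
X = heart_df_noncat.drop(['num', 'num_bin'], axis=1)
# Scale each feature column to unit L2 norm (computed over all rows, before the train/test split)
X = pd.DataFrame(normalize(X, norm="l2", axis=0), columns=list(X))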
y = heart_df_noncat['num_bin']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Training set : {}".format(X_train.shape[0]))
print("Testing set : {}\n".format(X_test.shape[0]))

heart_df_noncat = pd.concat([X, y], axis=1, join='inner')

heart_df_noncat.head()
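
The `normalize(X, norm="l2", axis=0)` call in the cell above rescales each feature column so that its Euclidean norm over all rows equals 1. The following is a minimal pure-Python sketch of that operation on made-up values; the helper `l2_normalize_columns` is hypothetical and not a scikit-learn function.

```python
import math


def l2_normalize_columns(rows):
    """Divide every column of a row-major table by that column's L2 norm."""
    n_cols = len(rows[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n_cols)]
    # A column made only of zeros is left unchanged to avoid dividing by zero.
    norms = [norm if norm > 0 else 1.0 for norm in norms]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in rows]


# Two made-up samples with two features each
table = [[3.0, 10.0],
         [4.0, 0.0]]
normalized = l2_normalize_columns(table)
print(normalized)
```

After this step every column has unit length, so a feature measured on a large scale (such as `chol`) no longer dominates one measured on a small scale.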

## Statistical exploratory analysis

In [None]:
heart_df.describe()

### Plotting the dataset

In [None]:
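numeric_columns = heart_df.select_dtypes(exclude="category").columns
category_columns = heart_df.select_dtypes(include="category").columns

fig, ax = plt.subplots(nrows=8, ncols=2, figsize=(10, 25))
ax = ax.reshape(-1)
plt.subplots_adjust(wspace=0.4, hspace=0.5)
# Density plots for the continuous features
for i, col in enumerate(numeric_columns):
    sns.kdeplot(heart_df[col], ax=ax[i])
    ax[i].set_xlabel(col)
# Count plots for the categorical features, placed right after the density plots
for i, col in enumerate(category_columns, start=len(numeric_columns)):
    sns.countplot(x=col, data=heart_df, ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel("count")
# Hide the axes left unused by the 8x2 grid
for unused_ax in ax[len(numeric_columns) + len(category_columns):]:
    unused_ax.set_visible(False)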

In [None]:
pd.plotting.scatter_matrix(heart_df_noncat, alpha = 0.3, figsize = (15,15), diagonal = 'kde');

In [None]:
correlation = heart_df_noncat.corr(min_periods=10)
plt.figure(figsize=(15, 13))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")

## Random Forest Classifier

In [None]:
# FINAL CELL FOR RFC START

# The goal is to predict whether a person has heart disease or not,
# so we try a random forest classifier.
# Since this is a kind of diagnosis, we optimize the recall:
# missing a heart disease is more dangerous than over-diagnosing.

# random_state is fixed so that the search gives the same result on every run
clf = RandomForestClassifier(random_state=0)
parameters = {'n_estimators': [5, 10, 20, 30], 'max_features':[3,4,5,6, None], 'max_depth': [4,5,6,7, None]}
scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=5)
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
best_predictions = best_clf.predict(X_test)
print("Best parameters : ", grid_fit.best_params_)
# accuracy = (tp + tn) / total; misleading when the labels are imbalanced
# (a classifier that always outputs the same label can still score well)
print("\nFinal accuracy score on the testing data: {:.4f}".format(metrics.accuracy_score(y_test, best_predictions)))
# F-beta of the positive class (a micro average would just equal the accuracy here);
# beta = 0.5 weights precision more than recall, beta > 1 would favor recall
print("Final F-score on the testing data: {:.4f}".format(metrics.fbeta_score(y_test, best_predictions, beta=0.5)))
# recall = tp / (tp + fn), the share of truly sick patients in the test set that are found => very important here,
# because we want to find as many sick patients as possible (missing one is worse than over-diagnosing)
print("recall :", metrics.recall_score(y_test, best_predictions))
# precision = tp / (tp + fp), the share of patients flagged as sick who really are sick => less important here
print("precision :", metrics.precision_score(y_test, best_predictions))
# f1 score = 2 * (precision * recall) / (precision + recall)
print("f1 score :", metrics.f1_score(y_test, best_predictions))
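
The comments in the cell above define accuracy, recall, precision and the F-score in terms of true/false positives and negatives. As a small worked sketch on made-up counts (not results from this dataset), the hypothetical helper `classification_metrics` below applies those formulas directly; it assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute accuracy, recall, precision and F-beta from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)      # share of truly sick patients that are found
    precision = tp / (tp + fp)   # share of flagged patients that are truly sick
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_beta": f_beta}


# Made-up counts: 10 sick patients (8 found, 2 missed)
# and 10 healthy patients (4 flagged by mistake, 6 correctly cleared)
scores = classification_metrics(tp=8, fp=4, fn=2, tn=6)
recall_weighted = classification_metrics(tp=8, fp=4, fn=2, tn=6, beta=2.0)
print(scores)
print(recall_weighted["f_beta"])
```

With `beta=1.0` the F-beta value is the usual f1 score. Since recall is higher than precision in this example, raising `beta` to 2 pulls the score towards the recall.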

### Three other draft cells, to be discussed

In [None]:
# DRAFT
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=10)

clf.fit(X_train, y_train)

# Importance of each feature in the fitted forest
for name, importance in zip(list(X), clf.feature_importances_):
    print("{} : {}".format(name, importance))

print("\naccuracy :", clf.score(X_test, y_test))

predictions_test = clf.predict(X_test)
# accuracy = (tp + tn) / total; misleading when the labels are imbalanced
# (a classifier that always outputs the same label can still score well)
print("\naccuracy :", metrics.accuracy_score(y_test, predictions_test))
# recall = tp / (tp + fn), the share of truly sick patients in the test set that are found => very important here,
# because we want to find as many sick patients as possible (missing one is worse than over-diagnosing)
print("recall :", metrics.recall_score(y_test, predictions_test))
# precision = tp / (tp + fp), the share of patients flagged as sick who really are sick => less important here
print("precision :", metrics.precision_score(y_test, predictions_test))
# f1 score = 2 * (precision * recall) / (precision + recall)
print("f1 score :", metrics.f1_score(y_test, predictions_test))
# Confusion matrix: 0 is negative, 1 is positive => [[tn, fp], [fn, tp]]
print("\nconfusion matrix :\n", metrics.confusion_matrix(y_test, predictions_test))
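
The `[[tn, fp], [fn, tp]]` layout mentioned in the last comment follows from indexing rows by the true label and columns by the predicted label. A minimal sketch on made-up labels, using a hypothetical helper `confusion_counts`, makes the layout explicit:

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] for binary labels, where 1 is the positive class."""
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(y_true, y_pred):
        # Row = true label, column = predicted label
        matrix[truth][prediction] += 1
    return matrix


# Made-up labels for six patients
y_true_example = [0, 0, 1, 1, 1, 0]
y_pred_example = [0, 1, 1, 0, 1, 0]
matrix = confusion_counts(y_true_example, y_pred_example)
(tn, fp), (fn, tp) = matrix
print(matrix)
```

With scikit-learn, `metrics.confusion_matrix(y_test, predictions_test).ravel()` returns the same four counts in the order `tn, fp, fn, tp` for a binary problem.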

In [None]:
# DRAFT
clf = RandomForestClassifier()

parameters = {'n_estimators': [5, 10, 20, 30], 'max_features':[3,4,5,6, None], 'max_depth': [4,5,6,7, None]}

scorer = metrics.make_scorer(metrics.recall_score)

# Perform a grid search on the classifier, using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=5)

# Fit the grid search object to the training data to find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and the optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
# (F-beta is computed for the positive class; a micro average would just equal the accuracy here)
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(metrics.accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(metrics.fbeta_score(y_test, predictions, beta=0.5)))
print("recall :", metrics.recall_score(y_test, predictions))
print("precision :", metrics.precision_score(y_test, predictions))
print("f1 score :", metrics.f1_score(y_test, predictions))
print("\nOptimized Model\n------")
print(grid_fit.best_params_)
print("\nFinal accuracy score on the testing data: {:.4f}".format(metrics.accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(metrics.fbeta_score(y_test, best_predictions, beta=0.5)))
print("recall :", metrics.recall_score(y_test, best_predictions))
print("precision :", metrics.precision_score(y_test, best_predictions))
print("f1 score :", metrics.f1_score(y_test, best_predictions))
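
To get a feel for the cost of the grid search above, the sketch below enumerates the candidate settings of the same parameter grid and counts the model fits needed with 5-fold cross-validation. It only illustrates the combinatorics and is not how `GridSearchCV` is implemented; the names `cv_folds`, `candidates`, `n_candidates` and `n_cv_fits` are illustrative.

```python
from itertools import product

parameters = {
    "n_estimators": [5, 10, 20, 30],
    "max_features": [3, 4, 5, 6, None],
    "max_depth": [4, 5, 6, 7, None],
}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate model
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_candidates = len(candidates)        # 4 * 5 * 5 combinations
n_cv_fits = n_candidates * cv_folds   # each candidate is fitted once per fold
print(n_candidates, n_cv_fits)
```

By default (`refit=True`), `GridSearchCV` then fits the best candidate once more on the whole training set, which is the model exposed as `best_estimator_`.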

In [None]:
# DRAFT
# Model (a single decision tree could also be used)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=10)

# Train
model.fit(X_train, y_train)
# Extract a single tree from the forest
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
# class_names must be given in ascending order of the target classes (num_bin: 0, then 1)
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = list(X_train),
                class_names = ['no disease', 'disease'],
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

## Deep Learning

In [None]:
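class HeartDataset(Dataset):
    """Wraps a DataFrame whose last column is the binary label and whose other columns are features."""

    def __init__(self, df):
        self.heart_df = df
        self.number_samples = len(df)

    def __len__(self):
        return self.number_samples

    def __getitem__(self, idx):
        # The sampler may yield a plain int or a 0-d tensor depending on the PyTorch version
        idx = int(idx)
        # heart_df_noncat holds the 13 features followed by num_bin: keep every column but the last
        features = self.heart_df.iloc[idx, :-1]
        binary_class = self.heart_df.iloc[idx, -1]
        return FloatTensor(features.values), binary_class
        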

In [None]:
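def collate(samples):
    # Explicit collation: stack the per-sample feature vectors into a (batch, features)
    # tensor and gather the labels into a tensor of class indices
    features, labels = list(zip(*samples))
    features = torch.stack(features, 0)
    labels = torch.tensor([int(label) for label in labels], dtype=torch.long)
    return features, labels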
    

In [None]:
dataset = HeartDataset(heart_df_noncat)
# Weight each sample by the frequency of the opposite class,
# so that both classes are drawn about equally often
weight_one = 1 - heart_df_noncat.iloc[:, -1].mean()
weight = [weight_one if c == 1 else 1 - weight_one for c in heart_df_noncat.iloc[:, -1]]
print(len(dataset))
sampler = WeightedRandomSampler(weight, len(dataset))
b_sampler = BatchSampler(sampler, batch_size=2, drop_last=True)
dataset_loader = DataLoader(dataset=dataset, num_workers=0, batch_sampler=b_sampler)
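
The weights in the cell above give each sample the frequency of the opposite class. Because a weighted sampler draws indices with probability proportional to their weights, the two classes then carry the same total weight. A small worked sketch on made-up labels (the helper `balancing_weights` is hypothetical) checks this:

```python
def balancing_weights(labels):
    """Weight each sample by the frequency of the opposite class."""
    positive_share = sum(labels) / len(labels)
    weight_one = 1 - positive_share
    return [weight_one if label == 1 else 1 - weight_one for label in labels]


# Made-up labels: 3 positives and 7 negatives
labels = [1] * 3 + [0] * 7
weights = balancing_weights(labels)

# Probability that a single weighted draw returns a positive sample
positive_mass = sum(w for w, label in zip(weights, labels) if label == 1)
p_positive = positive_mass / sum(weights)
print(round(p_positive, 3))
```

Here the positives carry 3 × 0.7 = 2.1 and the negatives 7 × 0.3 = 2.1, so a draw is positive with probability 2.1 / 4.2 = 0.5 even though only 30% of the samples are positive.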

In [None]:
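# Average share of positive labels over 100 passes through the weighted loader
batch_size = dataset_loader.batch_sampler.batch_size
temp = []
for i in range(100):
    a = [c.sum().item() for feat, c in dataset_loader]
    temp.append(sum(a) / (len(a) * batch_size))
print("fraction of True in the training set:", sum(temp) / len(temp))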

#a = [sum(i[1]) for i in validation_set_loader]

#print("percent of True in the validation set:", sum(a)/sum([len(i[1]) for i in validation_set_loader]))

In [None]:
for feat, c in dataset_loader:
    print(feat.shape, c)

In [None]:
model = torch.nn.Sequential(torch.nn.Linear(13,100), torch.nn.SELU(), torch.nn.Linear(100,50), torch.nn.SELU(), torch.nn.Linear(50,2))

In [None]:
def train_model(model, dataset_loader, nb_epochs, eta):
    model.train()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr = eta)
    
    for e in range(nb_epochs):
        sum_loss = 0
        for feat, c in dataset_loader:
            output = model(feat)
            loss = criterion(output, c)
            model.zero_grad()
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()
        print("epoch {} : {}".format(e,sum_loss))

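def compute_nb_errors(model, dataset_loader):
    """Return the accuracy (in percent) of the model over one pass of the loader."""
    model.eval()
    nb_data_errors = 0
    nb_samples = 0

    # No gradients are needed for evaluation
    with torch.no_grad():
        for feat, c in dataset_loader:
            output = model(feat)
            _, predicted_classes = torch.max(output, 1)
            nb_data_errors += (predicted_classes != c).sum().item()
            # Count the samples actually seen instead of assuming a batch size
            nb_samples += c.size(0)
    print(nb_data_errors)
    print(nb_samples)

    return (1 - nb_data_errors / nb_samples) * 100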
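
The network outputs two raw scores (logits) per sample. `torch.nn.CrossEntropyLoss` turns them into probabilities with a softmax and penalizes the negative log-probability of the true class, while `torch.max(output, 1)` picks the class with the highest score. As a minimal pure-Python sketch for a single sample with made-up logits (the helper `cross_entropy` is hypothetical and returns the probabilities as well for illustration):

```python
import math


def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    # Subtract the maximum before exponentiating, for numerical stability
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    return -math.log(probabilities[target]), probabilities


# Made-up logits for one sample: index 0 = no disease, index 1 = disease
logits = [2.0, 0.5]
loss, probabilities = cross_entropy(logits, target=0)
wrong_loss, _ = cross_entropy(logits, target=1)
predicted_class = probabilities.index(max(probabilities))
print(loss, wrong_loss, probabilities, predicted_class)
```

The loss is small when the true class already has the highest score and grows when it does not. By default, `CrossEntropyLoss` averages this quantity over the samples of a batch.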


In [None]:
train_model(model, dataset_loader, 20, 0.01)

In [None]:
compute_nb_errors(model, dataset_loader)