# Explaining machine learning
This notebook contains the exploratory data analysis of the heart disease dataset, together with the simple machine learning algorithms used to predict the disease.

## Plan
1. Introduction
  1. What are we going to explain here
    1. Explain AI/ML
    2. Not a tutorial: a concrete example, accessible to everyone, of what it could be used for
    3. Try to explain a data scientist's job to the people we know and to anyone interested in general
  2. Why do we need to talk about AI in general
    1. Future
    2. Generic term needs to be addressed
    3. See the good side
2. What is a Data Scientist
  1. Work with data
  2. Data Analysis
  3. Create models
  4. Communicate
3. Example dataset
  1. Explaining the dataset and the goal
  2. A few statistics/plots
  3. Predicting the heart disease
  4. Explaining results
4. Communicating results to business/boss
  1. Presenting the example's results
  2. Another level of abstraction?
5. (OPT) Cleaning codebase/Explaining why we would need another language
6. Conclusions, pointing to the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import copy
from torch.utils.data import Dataset
from torch.utils.data import SequentialSampler, DataLoader, WeightedRandomSampler, BatchSampler
from sklearn import metrics
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
%config InlineBackend.figure_format = 'retina'
pd.set_option('display.max_columns', 500)

## Dataset fields description
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
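
To make the coding scheme above easier to read, here is a minimal sketch that turns the numeric codes of one record into labels. The lookup tables, the `describe_patient` helper and the example record are hypothetical and only cover a few fields; the "heart disease" flag follows the rule used later in this notebook (`num > 0`).

```python
# Hypothetical lookup tables built from the field descriptions above
CHEST_PAIN = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
THAL = {3: "normal", 6: "fixed defect", 7: "reversible defect"}


def describe_patient(record):
    """Turn the numeric codes of one record into readable labels."""
    return {
        "age": record["age"],
        "sex": "male" if record["sex"] == 1 else "female",
        "chest pain": CHEST_PAIN[record["cp"]],
        "thal": THAL[record["thal"]],
        # Same binarization as the notebook: any value above 0 counts as disease
        "heart disease": record["num"] > 0,
    }


# Made-up record, for illustration only
example = {"age": 63, "sex": 1, "cp": 1, "thal": 6, "num": 0}
readable = describe_patient(example)
print(readable)
```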

In [None]:
heart_df = pd.read_csv("../data/heart-disease/processed.cleveland.data", delimiter=",",
                       names=["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                              "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"])
heart_df = heart_df.rename(columns={"cp":"chest_pain",
                                    "thalach":"max_heart_rate",
                                    "oldpeak":"st_dep_induced",
                                    "ca":"num_maj_ves"})

heart_df['num_bin'] = heart_df['num'].apply(lambda x: 1 if x > 0 else 0)

heart_df = heart_df.replace('?', np.nan)
heart_df = heart_df.dropna(axis=0, how="any")

heart_df[['num_maj_ves', 'thal']] = heart_df[['num_maj_ves', 'thal']].astype('float')
heart_df= heart_df.reset_index(drop=True)

heart_df.drop("num", inplace=True, axis=1)
heart_df.head()

## Statistical exploratory analysis

In [None]:
heart_df.describe()

### Plotting the dataset

In [None]:
heart_df_plot = heart_df.copy()
heart_df_plot[["sex", "chest_pain", "fbs",
               "restecg", "exang", "slope",
               "num_maj_ves", "thal", "num_bin"]] = heart_df_plot[["sex", "chest_pain", "fbs", "restecg",
                                                                   "exang", "slope", "num_maj_ves", "thal", 
                                                                   "num_bin"]
                                                                 ].apply(lambda x: x.astype('category'))

fig, ax = plt.subplots(nrows=7, ncols=2, figsize=(10, 25))
ax = ax.reshape(-1)
plt.subplots_adjust(wspace=0.4, hspace=0.5)

for i, col in enumerate(heart_df_plot.columns):
    if heart_df_plot[col].dtype.name == "category":
        sns.countplot(x=col, data=heart_df_plot, ax=ax[i])
        ax[i].set_ylabel("count")
    else:
        sns.kdeplot(heart_df_plot[col], ax=ax[i])
    ax[i].set_xlabel(col)

In [None]:
# Standardizing the dataset
standardizing_heart_df = heart_df.copy()
scaler = preprocessing.StandardScaler()
standardizing_heart_df.iloc[:, :-1] = scaler.fit_transform(heart_df.drop(['num_bin'], axis=1))
# Checking out the scatter matrix
pd.plotting.scatter_matrix(standardizing_heart_df, alpha=0.3, figsize=(15,15), diagonal='kde');

# Checking out the heatmap of correlations
plt.figure(figsize=(15, 15))
correlation = standardizing_heart_df.corr(min_periods=10)
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")
plt.show()

### Training and testing splits
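
The idea behind the split is to keep a comparable share of each class in the training and testing sets. As a point of comparison, here is a minimal sketch of a plain stratified split in pure Python. It is not the weighted sampler used in this notebook; `stratified_split` and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, train_size=0.9, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    # Group the sample positions by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(train_size * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 healthy (0) and 40 sick (1) patients
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training samples,", len(test_idx), "testing samples")
```

Each class is cut at the same ratio, so both parts keep roughly the original class balance.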

In [None]:
class HeartDataset(Dataset):
    def __init__(self, df, train=True):
        self.heart_df = df
        self.number_samples = len(df)
        self.train = train
    
    def __len__(self):
        return self.number_samples
    
    def __getitem__(self, idx):
        # Depending on the PyTorch version, samplers yield either Python ints
        # or 0-dim tensors; int() handles both cases
        idx = int(idx)
        features = self.heart_df.iloc[idx, :-1]
        binary_class = self.heart_df.iloc[idx, -1]
        return torch.FloatTensor(features.values), binary_class

In [None]:
train_size = 0.9
train_sample_number = int(train_size*len(heart_df))

weight_one = 1 - sum(heart_df.iloc[:,-1]) / len(heart_df)
weight = [weight_one if c == 1 else 1 - weight_one for c in heart_df.iloc[:,-1] ]

total_dataset = HeartDataset(heart_df)
sampler = WeightedRandomSampler(weight, len(total_dataset), replacement=False)
batch_sampler = BatchSampler(sampler, train_sample_number, drop_last=False)

# int() works whether the sampler yields Python ints or 0-dim tensors
train_set, test_set = ([int(i) for i in cl] for cl in batch_sampler)
train_df = heart_df.iloc[train_set]
test_df = heart_df.iloc[test_set]
print("training set: {} samples, {:.2f}% of which belong to the first class".format(len(train_df), train_df.num_bin.value_counts().values[0] / len(train_df) * 100))
print("testing set: {} samples, {:.2f}% of which belong to the first class".format(len(test_df), test_df.num_bin.value_counts().values[0] / len(test_df) * 100))

## Random Forest Classifier
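Note: do not standardize the features for the random forest classifier. It is of no use for tree-based models and makes the results harder to analyze.

The model is trained with 5-fold cross-validation to find the best hyperparameters; accuracy, precision, recall and F1 score are then computed on the testing set.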
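
To make these metrics concrete, here is a small worked example computing them from the four cells of a confusion matrix. The `classification_metrics` helper and the counts are made up for illustration; in the notebook itself the values come from `sklearn.metrics`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Made-up counts: 100 patients, 10 of them sick, and a model that finds only 2
accuracy, precision, recall, f1 = classification_metrics(tp=2, fp=0, fn=8, tn=90)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With these counts the accuracy is 92% although only 20% of the sick patients are detected, which is why recall is the metric to watch for a diagnosis.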

In [None]:
# FINAL CELL FOR RFC START

# The goal is to predict whether a person has heart disease or not,
# so we try a random forest classifier.
# Since this is a kind of diagnosis, we want to optimize the recall,
# because missing a heart disease is more dangerous than over-diagnosing.
train_features, train_targets = train_df.iloc[:,:-1], train_df.iloc[:,-1]
test_features, test_targets = test_df.iloc[:,:-1], test_df.iloc[:,-1]

clf = RandomForestClassifier()
parameters = {'n_estimators': [5, 10, 20, 30], 'max_features':[3,4,5,6, None], 'max_depth': [4,5,6,7, None]}
scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=5)
grid_fit = grid_obj.fit(train_features, train_targets)
best_clf = grid_fit.best_estimator_
best_predictions = best_clf.predict(test_features)
print("Best parameters : ", grid_fit.best_params_)
# accuracy = (tp + tn) / total; misleading when the labels are imbalanced
# (the classifier can always predict the same label)
print("\nFinal accuracy score on the testing data: {:.4f}".format(metrics.accuracy_score(test_targets, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(metrics.fbeta_score(test_targets, best_predictions, beta=0.5,  average="micro")))
# recall = tp / (tp + fn) (all truly sick patients in the test set) => very important here,
# because we want to find as many sick patients as possible (missing one is worse than over-diagnosing)
print("recall :", metrics.recall_score(test_targets, best_predictions))
# precision = tp / (tp + fp) (all patients the classifier flagged as sick) => less important here,
# it tells how many of the patients diagnosed as sick really are sick
print("precision :", metrics.precision_score(test_targets, best_predictions))
# f1 score = 2 * (precision * recall) / (precision + recall)
print("f1 score :", metrics.f1_score(test_targets, best_predictions))

### Three other draft cells, to discuss

In [None]:
# DRAFT
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=10)

clf.fit(train_features, train_targets)

for i, feat in enumerate(clf.feature_importances_):
    print("{} : {}".format(list(heart_df)[i], feat))
    
print("\naccuracy :", clf.score(test_features, test_targets))

predictions_test = clf.predict(test_features)
# accuracy = (tp + tn) / total; misleading when the labels are imbalanced
# (the classifier can always predict the same label)
print("\naccuracy :", metrics.accuracy_score(test_targets, predictions_test))
# recall = tp / (tp + fn) (all truly sick patients in the test set) => very important here,
# because we want to find as many sick patients as possible (missing one is worse than over-diagnosing)
print("recall :", metrics.recall_score(test_targets, predictions_test))
# precision = tp / (tp + fp) (all patients the classifier flagged as sick) => less important here,
# it tells how many of the patients diagnosed as sick really are sick
print("precision :", metrics.precision_score(test_targets, predictions_test))
# f1 score = 2 * (precision * recall) / (precision + recall)
print("f1 score :", metrics.f1_score(test_targets, predictions_test))
# Confusion matrix: 0 negative, 1 positive => [[tn, fp], [fn, tp]]
print("\nconfusion matrix :\n", metrics.confusion_matrix(test_targets, predictions_test))

In [None]:
"""
# DOES NOT WORK, I DID NOT REPLACE EVERYTHING
# DRAFT
clf = RandomForestClassifier()

parameters = {'n_estimators': [5, 10, 20, 30], 'max_features':[3,4,5,6, None], 'max_depth': [4,5,6,7, None]}

scorer = metrics.make_scorer(metrics.recall_score)

# TODO: Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=5)

# TODO: Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(metrics.accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(metrics.fbeta_score(y_test, predictions, beta = 0.5, average="micro")))
print("recall :", metrics.recall_score(y_test, predictions))
print("precision :", metrics.precision_score(y_test, predictions))
print("f1 score :", metrics.f1_score(y_test, predictions))
print("\nOptimized Model\n------")
print(grid_fit.best_params_)
print("\nFinal accuracy score on the testing data: {:.4f}".format(metrics.accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(metrics.fbeta_score(y_test, best_predictions, beta = 0.5,  average="micro")))
print("recall :", metrics.recall_score(y_test, best_predictions))
print("precision :", metrics.precision_score(y_test, best_predictions))
print("f1 score :", metrics.f1_score(y_test, best_predictions))
"""

In [None]:
# Visualize one decision tree of the forest

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=10)

# Train
model.fit(train_features, train_targets)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
# class_names must follow the sorted labels of num_bin (0, then 1)
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = list(train_df.iloc[:, :-1]),
                class_names = ["no disease", "disease"],
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

## Deep Learning
Note: standardize the training set and the testing set the same way (with the same scaler) to be sure both end up on the same scale.
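
As an illustration of this rule, here is a minimal pure-Python sketch: the mean and standard deviation are computed on the training values only and then reused for the testing values. The helper names and the cholesterol values are made up; the notebook itself relies on `preprocessing.StandardScaler`.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and (population) standard deviation of a training column."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize a column with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


# Made-up cholesterol values, for illustration only
train_chol = [233.0, 286.0, 229.0, 250.0, 204.0]
test_chol = [236.0, 268.0]

# The statistics come from the training set only...
mu, sigma = fit_scaler(train_chol)
train_scaled = transform(train_chol, mu, sigma)
# ...and are reused as-is on the testing set
test_scaled = transform(test_chol, mu, sigma)
print(train_scaled)
print(test_scaled)
```

Fitting a second scaler on the testing set would map the same raw value to different inputs for the network, depending on which set it comes from.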

In [None]:
# Creating a validation set
train_nn_size = 0.9
fold_size = 0.2
train_nn_sample_number = int(train_nn_size*len(train_df))

print(train_nn_sample_number)

weight_one = 1 - sum(train_df.iloc[:,-1]) / len(train_df)
weight = [weight_one if c == 1 else 1 - weight_one for c in train_df.iloc[:,-1] ]

# Standardizing training set
standardized_nn_train_df = train_df.copy()
scaler = preprocessing.StandardScaler()
scaler.fit(train_df.drop(['num_bin'], axis=1))
standardized_nn_train_df.iloc[:, :-1] = scaler.transform(train_df.drop(['num_bin'], axis=1))

train_nn_set = HeartDataset(standardized_nn_train_df)
sampler = WeightedRandomSampler(weight, len(train_nn_set), replacement=False)
batch_sampler = BatchSampler(sampler, train_nn_sample_number, drop_last=False)

# int() works whether the sampler yields Python ints or 0-dim tensors
train_set, valid_set = ([int(i) for i in cl] for cl in batch_sampler)
train_nn_df = train_df.iloc[train_set]
valid_df = train_df.iloc[valid_set]
print("training set: {} samples, {:.2f}% of which belong to the first class".format(len(train_nn_df), train_nn_df.num_bin.value_counts().values[0] / len(train_nn_df) * 100))
print("validation set: {} samples, {:.2f}% of which belong to the first class".format(len(valid_df), valid_df.num_bin.value_counts().values[0] / len(valid_df) * 100))
print("testing set: {} samples, {:.2f}% of which belong to the first class".format(len(test_df), test_df.num_bin.value_counts().values[0] / len(test_df) * 100))

In [None]:
def built_loader(train_df, valid_df, test_df, batch_size=8):
    # Fit the scaler on the training split only, then apply the same
    # preprocessing to the validation and testing sets
    fold_scaler = preprocessing.StandardScaler()
    fold_scaler.fit(train_df.drop(['num_bin'], axis=1))

    def standardize(df):
        standardized_df = df.copy()
        standardized_df.iloc[:, :-1] = fold_scaler.transform(df.drop(['num_bin'], axis=1))
        return standardized_df

    # Use the frames passed as arguments, not the notebook-level ones
    train_dataset = HeartDataset(standardize(train_df))
    valid_dataset = HeartDataset(standardize(valid_df), train=False)
    test_dataset = HeartDataset(standardize(test_df), train=False)

    # Weight each sample by the share of the other class,
    # so that both classes are drawn equally often
    weight_one = 1 - sum(train_df.iloc[:,-1]) / len(train_df)
    weight = [weight_one if c == 1 else 1 - weight_one for c in train_df.iloc[:,-1] ]

    sampler = WeightedRandomSampler(weight, len(train_dataset))
    b_sampler = BatchSampler(sampler, batch_size, True)

    train_dataset_loader = DataLoader(dataset=train_dataset,  num_workers=0, batch_sampler=b_sampler )
    valid_dataset_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size)
    test_dataset_loader = DataLoader(dataset=test_dataset, batch_size=batch_size)
    
    return train_dataset_loader, valid_dataset_loader, test_dataset_loader

In [None]:
def train_model(model, train_dataset_loader, valid_dataset_loader, nb_epochs, regul=0):
    """
    Will train the model and retain the training and validation losses at each epoch
    """
    training_losses, validation_losses = [], []
    
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), weight_decay=regul)
    
    for e in range(nb_epochs):
        train_sum_loss = 0
        valid_sum_loss = 0
        
        model.train()
        for feat, c in train_dataset_loader:
            output = model(feat)
            loss = criterion(output, c)
            model.zero_grad()
            loss.backward()
            optimizer.step()
            train_sum_loss += loss.item()
            
        model.eval()
        # No gradients are needed to compute the validation loss
        with torch.no_grad():
            for feat, c in valid_dataset_loader:
                output = model(feat)
                loss = criterion(output, c)
                valid_sum_loss += loss.item()

        # Average the losses before logging, so that the printed values
        # belong to the current epoch and not to the previous one
        normalized_train_loss = train_sum_loss / len(train_dataset_loader)
        normalized_valid_loss = valid_sum_loss / len(valid_dataset_loader)
        training_losses.append(normalized_train_loss)
        validation_losses.append(normalized_valid_loss)

        if (e+1) % 100 == 0:
            print("epoch: {:4d}/{:4d} | training loss: {:3.4f} | validation loss: {:3.4f}".format(
                e+1, nb_epochs, normalized_train_loss, normalized_valid_loss))
        
    plt.figure(figsize=(12, 8))
    plt.title("Training and validation losses")
    plt.plot(training_losses, 'red', label='training loss')
    plt.plot(validation_losses, 'green', label='validation loss')
    plt.legend(loc='best')
    plt.ylim(0, 1)
    plt.show()

def evaluate_model(model, dataset_loader, batch_size=8):
    # Accuracy, precision and recall over a whole loader
    # (batch_size is kept in the signature but is no longer needed)
    model.eval()
    nb_samples = 0
    nb_accuracy_errors = 0
    nb_true_pos = 0
    nb_pos = 0
    nb_false_pos = 0
    
    with torch.no_grad():
        for feat, c in dataset_loader:
            output = model(feat)
            _, predicted_classes = torch.max(output, 1)
            # Count the samples actually seen: the last batch may be smaller
            nb_samples += len(c)
            nb_accuracy_errors += (predicted_classes != c).sum().item()
            # True positives: prediction + target equals 2
            nb_true_pos += ((predicted_classes + c) == 2).sum().item()
            # Recall denominator: every sample labelled 1
            nb_pos += c.sum().item()
            # False positives: predicted 1 but labelled 0, accumulated over batches
            nb_false_pos += ((predicted_classes == 1) & (c == 0)).sum().item()

    accuracy = 1 - nb_accuracy_errors / nb_samples
    precision = nb_true_pos / (nb_true_pos + nb_false_pos)
    recall = nb_true_pos / nb_pos
    return accuracy, precision, recall

In [None]:

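def cv(train_df, model, l2_penalty, epochs):
    """
    5-fold cross-validation: trains a fresh copy of the model on each fold
    and prints the mean accuracy, precision and recall
    """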
    total_accuracy_v = 0
    total_precision_v = 0
    total_recall_v = 0

    total_accuracy_t = 0
    total_precision_t = 0
    total_recall_t = 0
    
    

    for i in range(5):
        
        current_model = copy.deepcopy(model)
        fold_size = 0.2
        train_nn_index = int(fold_size*len(train_df))
        # Select the i-th fold by position: train_df keeps the shuffled index
        # labels of heart_df, so matching on labels would pick the wrong rows
        valid_index = np.zeros(len(train_df), dtype=bool)
        valid_index[train_nn_index * i:train_nn_index * (i+1)] = True
        valid_df = train_df[valid_index]
        train_nn_df = train_df[~valid_index]
        print("training set: {} samples, {:.2f}% of which belong to the first class".format(len(train_nn_df), train_nn_df.num_bin.value_counts().values[0] / len(train_nn_df) * 100))
        print("validation set: {} samples, {:.2f}% of which belong to the first class".format(len(valid_df), valid_df.num_bin.value_counts().values[0] / len(valid_df) * 100))

        train_dataset_loader, valid_dataset_loader, test_dataset_loader = built_loader(train_nn_df, valid_df, test_df)
        
        train_model(current_model, train_dataset_loader, valid_dataset_loader, epochs, l2_penalty)
        accuracy_valid, precision_valid, recall_valid = evaluate_model(current_model, valid_dataset_loader)
        accuracy_train, precision_train, recall_train = evaluate_model(current_model, train_dataset_loader)

        total_accuracy_v += accuracy_valid
        total_accuracy_t += accuracy_train

        total_precision_v += precision_valid
        total_precision_t += precision_train

        total_recall_v += recall_valid
        total_recall_t += recall_train

    mean_accuracy_v = total_accuracy_v / 5
    mean_accuracy_t = total_accuracy_t / 5

    mean_precision_v = total_precision_v / 5
    mean_precision_t = total_precision_t / 5

    mean_recall_v = total_recall_v / 5
    mean_recall_t = total_recall_t / 5


    print("Training set:\n\t- accuracy: {:2.2%}\n\t- precision: {:2.2%}\n\t- recall: {:2.2%}\n"
          .format(mean_accuracy_t, mean_precision_t, mean_recall_t))
    print("Validation set:\n\t- accuracy: {:2.2%}\n\t- precision: {:2.2%}\n\t- recall: {:2.2%}\n"
          .format(mean_accuracy_v, mean_precision_v, mean_recall_v))

In [None]:
drop = 0.2
model = torch.nn.Sequential(torch.nn.Linear(13,50),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50,100),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(100,50),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50,2))
cv(train_df, model, 0.03, 1000)

In [None]:


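# l2_penalty and the three loaders are not defined at notebook level.
# As an assumption, reuse the penalty passed to cv() above and build the
# loaders from the train/validation/test frames created earlier
l2_penalty = 0.03
train_dataset_loader, valid_dataset_loader, test_dataset_loader = built_loader(train_nn_df, valid_df, test_df)
epochs = 1000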
model = torch.nn.Sequential(torch.nn.Linear(13,50),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50,100),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(100,50),
                            torch.nn.Dropout(drop),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50,2))
train_model(model, train_dataset_loader, valid_dataset_loader, epochs, l2_penalty)
print("For regularizer:", l2_penalty, "dropout:", drop, "epochs:", epochs)

# Training set evaluation
accuracy, precision, recall = evaluate_model(model, train_dataset_loader)
f1_score = 2 * precision * recall / (precision + recall)
print("Training set:\n\t- accuracy: {:2.2%}\n\t- precision: {:2.2%}\n\t- recall: {:2.2%}\n\t- f1-score: {:2.2%}\n"
      .format(accuracy, precision, recall, f1_score))
# Validation set evaluation
accuracy, precision, recall = evaluate_model(model, valid_dataset_loader)
f1_score = 2 * precision * recall / (precision + recall)
print("Validation set:\n\t- accuracy: {:2.2%}\n\t- precision: {:2.2%}\n\t- recall: {:2.2%}\n\t- f1-score: {:2.2%}\n"
      .format(accuracy, precision, recall, f1_score))
# Testing set evaluation
accuracy, precision, recall = evaluate_model(model, test_dataset_loader)
f1_score = 2 * precision * recall / (precision + recall)
print("Testing set:\n\t- accuracy: {:2.2%}\n\t- precision: {:2.2%}\n\t- recall: {:2.2%}\n\t- f1-score: {:2.2%}\n"
      .format(accuracy, precision, recall, f1_score))