# Explaining machine learning
This notebook contains the exploratory data analysis of the heart disease dataset, together with the simple machine learning algorithm used to predict the disease.

## Plan
1. Introduction
  1. What are we going to explain here
    1. Explain AI/ML
    2. Not a tutorial, but a concrete example for a general audience of what it could be used for
    3. Try to explain a data scientist's job to the people we know and to anyone interested in general
  2. Why do we need to talk about AI in general
    1. Future
    2. Generic term needs to be addressed
    3. See the good side
2. What is a Data Scientist
  1. Work with data
  2. Data Analysis
  3. Create models
  4. Communicate
3. Example dataset
  1. Explaining the dataset and the goal
  2. A few statistics/plots
  3. Predicting the heart disease
  4. Explaining results
4. Communicating results to business/boss
  1. Presenting the example's results
  2. Another level of abstraction?
5. (OPT) Cleaning codebase/Explaining why we would need another language
6. Conclusions, pointing to the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%config InlineBackend.figure_format = 'retina'
pd.set_option('display.max_columns', 500)

## Dataset fields description
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
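
The integer codes listed above are hard to read in tables and plots, so it can help to translate them back into labels. The sketch below copies the `cp` and `slope` tables from the description and applies them with `Series.map`. The `toy` frame holds made-up rows for illustration only, and the names `CHEST_PAIN_LABELS`, `SLOPE_LABELS` and `decoded` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Code-to-label tables copied from the field description.
CHEST_PAIN_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

# Made-up rows for illustration only, not real patients.
toy = pd.DataFrame({"cp": [1, 4, 3], "slope": [2, 1, 3]})

# Add readable label columns next to the original integer codes.
decoded = toy.assign(
    cp_label=toy["cp"].map(CHEST_PAIN_LABELS),
    slope_label=toy["slope"].map(SLOPE_LABELS),
)
print(decoded)
```

Keeping the original code columns next to the labels means the numeric values stay available for modelling, while the labels can be used for plot legends.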

In [None]:
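# Missing values are written as "?" in this file; parse them as NaN so every column stays numeric
heart_df = pd.read_csv("../data/heart-disease/processed.cleveland.data", delimiter=",",
            na_values="?",
            names=["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"])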
heart_df = heart_df.rename(columns={"cp":"chest_pain",
                         "thalach":"max_heart_rate",
                         "oldpeak":"st_dep_induced",
                         "ca":"num_maj_ves"})

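# Binary target: 1 if any heart disease is diagnosed (num > 0), 0 otherwise
heart_df['num_bin'] = (heart_df['num'] > 0).astype(int)

# Keep an all-numeric copy for the scatter matrix and the correlation heatmap
heart_df_noncat = heart_df.copy()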

# Mark the coded variables as categorical so they are treated as discrete values
cat_cols = ["sex", "chest_pain", "fbs", "restecg", "exang",
            "slope", "num_maj_ves", "thal", "num", "num_bin"]
heart_df[cat_cols] = heart_df[cat_cols].astype('category')

heart_df_noncat.head()
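
One thing worth checking right after loading is how missing values are handled. The sketch below assumes the file marks missing entries with `?`, as the UCI distribution of this dataset does, and shows how that marker changes the parsed column types. The rows are made up and use only four of the fourteen columns; `raw`, `naive` and `parsed` are hypothetical names.

```python
import io

import pandas as pd

cols = ["age", "ca", "thal", "num"]

# Made-up rows for illustration: "?" stands for a missing value.
raw = "50.0,0.0,3.0,0\n60.0,?,7.0,2\n55.0,1.0,?,1\n"

# Without na_values, a column containing "?" is read as text.
naive = pd.read_csv(io.StringIO(raw), names=cols)

# With na_values="?", the marker becomes NaN and the column stays numeric.
parsed = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")

print(naive.dtypes)
print(parsed.dtypes)

# Count the missing entries per column.
missing = parsed.isna().sum()
print(missing)
```

Text columns are silently left out of numeric summaries or make them fail, depending on the pandas version, so converting the marker at load time keeps the later statistics and plots consistent.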

## Statistical exploratory analysis

In [None]:
heart_df.describe()
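
On a frame that mixes numeric and categorical columns, `describe()` summarizes only the numeric ones by default. A minimal sketch of how the categorical columns could be summarized as well, using the `include` parameter of `DataFrame.describe`; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration: one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "sex": pd.Categorical([1, 1, 0, 1]),
})

# Default behaviour: only the numeric column is summarized.
num_summary = toy.describe()

# Categorical columns get count, number of unique values, most frequent value and its frequency.
cat_summary = toy.describe(include="category")

print(num_summary)
print(cat_summary)
```

For the categorical variables, the most frequent value and its frequency give a quick first look at how balanced each variable is.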

### Plotting the dataset

In [None]:
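numeric_cols = heart_df.select_dtypes(exclude="category").columns
categorical_cols = heart_df.select_dtypes(include="category").columns

fig, ax = plt.subplots(nrows=8, ncols=2, figsize=(10, 25))
ax = ax.reshape(-1)
plt.subplots_adjust(wspace=0.4, hspace=0.5)
# Density plots for the continuous variables
for i, col in enumerate(numeric_cols):
    sns.kdeplot(heart_df[col], ax=ax[i])
    ax[i].set_xlabel(col)
# Count plots for the categorical variables, placed right after the density plots
for i, col in enumerate(categorical_cols, start=len(numeric_cols)):
    sns.countplot(x=col, data=heart_df, ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel("count")
# Hide the axes that are left unused in the grid
for unused_ax in ax[len(numeric_cols) + len(categorical_cols):]:
    unused_ax.set_visible(False)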

In [None]:
pd.plotting.scatter_matrix(heart_df_noncat, alpha=0.3, figsize=(15, 15), diagonal='kde');

In [None]:
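# Pairwise Pearson correlations between the numeric columns
correlation = heart_df_noncat.select_dtypes(include="number").corr(min_periods=10)
plt.figure(figsize=(15, 13))
# Fix the color scale to [-1, 1] so that the neutral color corresponds to zero correlation
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, vmax=1, cmap="RdBu_r")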
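
A full heatmap can be hard to scan when the main question is which variables move with the target. One possible follow-up, sketched here on made-up data, is to pull out the target's column of the correlation matrix and sort it by absolute value. The `toy` frame and the names `target_corr` and `ranking` are hypothetical, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only, reusing column names from the notebook.
toy = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65],
    "max_heart_rate": [180, 170, 175, 150, 140, 145],
    "chol": [200, 260, 210, 250, 205, 255],
    "num_bin": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the binary target.
target_corr = toy.corr()["num_bin"].drop("num_bin")

# Order the features from strongest to weakest link, keeping the sign.
order = target_corr.abs().sort_values(ascending=False).index
ranking = target_corr.reindex(order)
print(ranking)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain sort would push them to the bottom. Correlation only captures linear association, so this ranking is a starting point for exploration rather than a measure of predictive power.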