# Explaining machine learning
This notebook contains the exploratory data analysis carried out on the heart disease dataset, along with the simple machine learning algorithm used to predict the disease.

## Plan
1. Introduction
  1. What are we going to explain here
  2. Why do we need to talk about AI in general
  3. What we will try to do here
2. What is a Data Scientist
3. Example dataset
  1. Explaining the dataset and the goal
  2. A few statistics/plots
  3. Predicting the heart disease
  4. Explaining results
4. Communicating results to business/boss
5. Cleaning codebase/Explaining why we would need another language
6. Conclusions, pointing to the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%config InlineBackend.figure_format = 'retina'
pd.set_option('display.max_columns', 500)

## Dataset fields description
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
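
The coded values above can be turned into readable labels before plotting or reporting. The following is a minimal sketch, assuming the codes are stored as integers; the `CHEST_PAIN_LABELS` and `SLOPE_LABELS` dictionaries and the toy `sample` frame are illustrative names, not part of the notebook.

```python
import pandas as pd

# Code-to-label lookups copied from the field descriptions above.
CHEST_PAIN_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

# Hypothetical toy rows standing in for the real data.
sample = pd.DataFrame({"cp": [1, 4, 3], "slope": [2, 1, 3]})

# Replace each code with its label, leaving the original frame untouched.
labelled = sample.assign(
    cp=sample["cp"].map(CHEST_PAIN_LABELS),
    slope=sample["slope"].map(SLOPE_LABELS),
)
print(labelled)
```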

In [None]:
# The file has no header row, so the column names are supplied here.
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv("../data/heart-disease/processed.cleveland.data",
                       delimiter=",", names=column_names)

# Give the terser columns more readable names.
heart_df = heart_df.rename(columns={"cp": "chest_pain",
                                    "thalach": "max_heart_rate",
                                    "oldpeak": "st_dep_induced",
                                    "ca": "num_maj_ves"})

# Treat the coded fields as categories rather than as numbers.
categorical_cols = ["sex", "chest_pain", "fbs", "restecg", "exang",
                    "slope", "num_maj_ves", "thal", "num"]
heart_df[categorical_cols] = heart_df[categorical_cols].astype("category")

heart_df.head()

## Statistical exploratory analysis

In [None]:
heart_df.describe()
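
`describe()` summarises only the numeric columns by default, so columns stored as `category` are left out of that table. A minimal sketch of how they could be summarised as well, using a toy frame (`toy_df` and its values are a hypothetical stand-in for `heart_df`):

```python
import pandas as pd

# Hypothetical miniature of the heart disease table.
toy_df = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "sex": pd.Series([1, 1, 0, 1], dtype="category"),
    "chest_pain": pd.Series([1, 4, 2, 2], dtype="category"),
})

# Default call: numeric columns only (count, mean, std, quartiles, ...).
numeric_summary = toy_df.describe()

# Categorical columns: count, number of distinct values, most frequent value and its frequency.
categorical_summary = toy_df.describe(include=["category"])

print(numeric_summary.columns.tolist())
print(categorical_summary)
```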

### Plotting the dataset

In [None]:
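numeric_cols = heart_df.select_dtypes(exclude="category").columns
category_cols = heart_df.select_dtypes(include="category").columns

# One panel per column: 14 columns on a 7 x 2 grid.
fig, ax = plt.subplots(nrows=7, ncols=2, figsize=(10, 25))
ax = ax.reshape(-1)
plt.subplots_adjust(wspace=0.4, hspace=0.5)

# Density estimate for each continuous column.
for i, col in enumerate(numeric_cols):
    sns.kdeplot(heart_df[col], ax=ax[i])
    ax[i].set_xlabel(col)

# Count plot for each categorical column, placed after the numeric panels.
for i, col in enumerate(category_cols, start=len(numeric_cols)):
    sns.countplot(x=col, data=heart_df, ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel("count")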
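
The 7 x 2 grid above matches the 14 columns of this particular dataset. A small sketch of how the grid size could be derived from the column count instead; `grid_shape` is a hypothetical helper, not something the notebook defines:

```python
import math


def grid_shape(n_panels, ncols=2):
    """Return (nrows, ncols) with enough cells to hold n_panels plots."""
    return math.ceil(n_panels / ncols), ncols


# 14 panels, as in the heart disease table.
nrows, ncols = grid_shape(14)
print(nrows, ncols)
```

The result could then be passed on as `plt.subplots(nrows=nrows, ncols=ncols)`.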

## Parkinsons dataset
### Attribute Information

##### Training sample information
Each subject has 26 voice samples including sustained vowels, numbers, words and short sentences. The voice samples in the training data file are given in the following order:

1: sustained vowel (aaa……)

2: sustained vowel (ooo……)

3: sustained vowel (uuu……)

4-13: numbers from 1 to 10

14-17: short sentences

18-26: words

##### Testing sample information
28 PD patients were asked to say only the sustained vowels 'a' and 'o' three times each, which makes a total of 168 recordings (6 voice samples per subject). The voice samples in the test data file are given in the following order:

1-3: sustained vowel (aaa……)

4-6: sustained vowel (ooo……)

##### Training Data File

column 1: Subject id

column 2-27: features

features 1-5: Jitter (local), Jitter (local, absolute), Jitter (rap), Jitter (ppq5), Jitter (ddp)

features 6-11: Shimmer (local), Shimmer (local, dB), Shimmer (apq3), Shimmer (apq5), Shimmer (apq11), Shimmer (dda)

features 12-14: AC, NTH, HTN

features 15-19: Median pitch, Mean pitch, Standard deviation, Minimum pitch, Maximum pitch

features 20-23: Number of pulses, Number of periods, Mean period, Standard deviation of period

features 24-26: Fraction of locally unvoiced frames, Number of voice breaks, Degree of voice breaks

column 28: UPDRS (Unified Parkinson's Disease Rating Scale)

column 29: class information
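
The layout above can be written down as a list of column names. This is only a sketch: the snake_case names are my own shorthand for the features listed, and `std_pitch` assumes that "Standard deviation" in the pitch group refers to pitch.

```python
# Feature names in the order given in the description above (hypothetical spellings).
FEATURE_NAMES = [
    # features 1-5: jitter
    "jitter_local", "jitter_local_abs", "jitter_rap", "jitter_ppq5", "jitter_ddp",
    # features 6-11: shimmer
    "shimmer_local", "shimmer_local_db", "shimmer_apq3",
    "shimmer_apq5", "shimmer_apq11", "shimmer_dda",
    # features 12-14
    "ac", "nth", "htn",
    # features 15-19: pitch
    "median_pitch", "mean_pitch", "std_pitch", "min_pitch", "max_pitch",
    # features 20-23: pulses and periods
    "n_pulses", "n_periods", "mean_period", "std_period",
    # features 24-26: voicing
    "frac_unvoiced_frames", "n_voice_breaks", "degree_voice_breaks",
]

# Column 1 is the subject id, columns 28 and 29 are UPDRS and class.
TRAIN_COLUMNS = ["subject_id"] + FEATURE_NAMES + ["UPDRS", "class"]
print(len(FEATURE_NAMES), len(TRAIN_COLUMNS))
```

A list like this could be passed as `names=TRAIN_COLUMNS` to `pd.read_csv` so that the feature columns are named rather than numbered.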

##### Test Data File

column 1: Subject id

column 2-27: same features

column 28: class information 

In [None]:
park_df = pd.read_csv('../data/parkinson/train_data.txt',
                      delimiter=',',
                      header=None)

park_df = park_df.rename(columns={0:"subject_id",
                         28:"class",
                         27:"UPDRS"})

park_df.set_index("subject_id", inplace=True)
park_df
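
Since each training subject is described as having 26 voice samples, a quick sanity check is to count the rows per subject. A minimal sketch on a made-up table (`toy_park` and its values are hypothetical, not read from the data file):

```python
import pandas as pd

# Hypothetical miniature of the training table: 2 subjects x 26 samples each.
toy_park = pd.DataFrame({
    "subject_id": [1] * 26 + [2] * 26,
    "UPDRS": [23] * 26 + [8] * 26,
})

# Number of rows (voice samples) recorded for each subject.
samples_per_subject = toy_park.groupby("subject_id").size()
print(samples_per_subject)

# True only if every subject contributes exactly 26 rows.
all_complete = bool((samples_per_subject == 26).all())
print(all_complete)
```

When `subject_id` is the index, as in the cell above, `groupby(level="subject_id")` would play the same role.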

## Census-Income (KDD) Data Set
#### Data description

1.   91 distinct values for attribute #0 (age) continuous
2.    9 distinct values for attribute #1 (class of worker) nominal
3.   52 distinct values for attribute #2 (detailed industry recode) nominal
4.   47 distinct values for attribute #3 (detailed occupation recode) nominal
5.   17 distinct values for attribute #4 (education) nominal
6. 1240 distinct values for attribute #5 (wage per hour) continuous
7.    3 distinct values for attribute #6 (enroll in edu inst last wk) nominal
8.    7 distinct values for attribute #7 (marital stat) nominal
9.   24 distinct values for attribute #8 (major industry code) nominal
10.   15 distinct values for attribute #9 (major occupation code) nominal
11.    5 distinct values for attribute #10 (race) nominal
12.   10 distinct values for attribute #11 (hispanic origin) nominal
13.    2 distinct values for attribute #12 (sex) nominal
14.    3 distinct values for attribute #13 (member of a labor union) nominal
15.    6 distinct values for attribute #14 (reason for unemployment) nominal
16.    8 distinct values for attribute #15 (full or part time employment stat) nominal
17.  132 distinct values for attribute #16 (capital gains) continuous
18.  113 distinct values for attribute #17 (capital losses) continuous
19. 1478 distinct values for attribute #18 (dividends from stocks) continuous
20.    6 distinct values for attribute #19 (tax filer stat) nominal
21.    6 distinct values for attribute #20 (region of previous residence) nominal
22.   51 distinct values for attribute #21 (state of previous residence) nominal
23.   38 distinct values for attribute #22 (detailed household and family stat) nominal
24.    8 distinct values for attribute #23 (detailed household summary in household) nominal
25.   10 distinct values for attribute #24 (migration code-change in msa) nominal
26.    9 distinct values for attribute #25 (migration code-change in reg) nominal
27.   10 distinct values for attribute #26 (migration code-move within reg) nominal
28.    3 distinct values for attribute #27 (live in this house 1 year ago) nominal
29.    4 distinct values for attribute #28 (migration prev res in sunbelt) nominal
30.    7 distinct values for attribute #29 (num persons worked for employer) continuous
31.    5 distinct values for attribute #30 (family members under 18) nominal
32.   43 distinct values for attribute #31 (country of birth father) nominal
33.   43 distinct values for attribute #32 (country of birth mother) nominal
34.   43 distinct values for attribute #33 (country of birth self) nominal
35.    5 distinct values for attribute #34 (citizenship) nominal
36.    3 distinct values for attribute #35 (own business or self employed) nominal
37.    3 distinct values for attribute #36 (fill inc questionnaire for veteran's admin) nominal
38.    3 distinct values for attribute #37 (veterans benefits) nominal
39.   53 distinct values for attribute #38 (weeks worked in year) continuous
40.    2 distinct values for attribute #39 (year) nominal
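
The counts of distinct values listed above can be checked against the loaded data. A minimal sketch with a made-up three-column frame (`toy_census` and its values are illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical rows; the real table has 40 attributes.
toy_census = pd.DataFrame({
    "age": [73, 58, 18, 9],
    "sex": ["Female", "Male", "Female", "Female"],
    "year": [95, 94, 95, 94],
})

# Number of distinct values per column, comparable to the list above.
distinct_counts = toy_census.nunique()
print(distinct_counts)
```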

In [None]:
cols =  ["age","class of worker","detailed industry recode","detailed occupation recode",
         "education","wage per hour","enroll in edu inst last wk","marital stat",
         "major industry code","major occupation code","race","hispanic origin","sex",
         "member of a labor union","reason for unemployment","full or part time employment stat",
         "capital gains","capital losses","dividends from stocks","tax filer stat",
         "region of previous residence","state of previous residence",
         "detailed household and family stat","detailed household summary in household", "instance weight",
         "migration code-change in msa","migration code-change in reg",
         "migration code-move within reg","live in this house 1 year ago",
         "migration prev res in sunbelt","num persons worked for employer",
         "family members under 18","country of birth father","country of birth mother",
         "country of birth self","citizenship","own business or self employed",
         "fill inc questionnaire for veteran's admin","veterans benefits","weeks worked in year",
         "year","binary class"]

In [None]:
census_df = pd.read_csv("../data/census/census-income.data", delimiter=",", header=None)
census_df.columns = cols
# "instance weight" is not one of the 40 attributes described above, so it is dropped.
census_df = census_df.drop(columns="instance weight")
census_df.head()
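
If the raw file separates fields with a comma followed by a space and marks missing values with `?` (an assumption about the file layout worth checking on the actual data), string values keep a leading space and `?` is read as ordinary text. A minimal sketch of the difference that `skipinitialspace` and `na_values` make, on made-up rows:

```python
import io

import pandas as pd

# Made-up rows in a comma-plus-space layout; not copied from the real file.
raw = io.StringIO("73, Not in universe, ?\n58, Self-employed, Arkansas\n")

# Default parsing keeps the space after each comma.
plain = pd.read_csv(raw, header=None)

# Rewind the buffer and parse again, stripping the spaces and treating "?" as missing.
raw.seek(0)
cleaned = pd.read_csv(raw, header=None, skipinitialspace=True, na_values="?")

print(repr(plain.iloc[0, 1]))
print(repr(cleaned.iloc[0, 1]))
print(cleaned.isna().sum().tolist())
```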