# Assessment - Machine Learning, Data Coordinator with eHealth Africa

© 2024

---

## Section B

eHealth Africa has investigated various factors that can cause heart disease. Data on patients with heart disease were collected in the southern and northern parts of Nigeria; Table 1 describes the data.

### Table 1: Heart Disease Data Dictionary

| Variable Name | Description                                           | Role    | Type        | Units |
|---------------|-------------------------------------------------------|---------|-------------|-------|
| `age`         | Age of the patient                                    | Feature | Integer     | years |
| `sex`         | Gender of the patient                                 | Feature | Categorical | -     |
| `cp`          | Chest pain type                                       | Feature | Categorical | -     |
| `trestbps`    | Resting blood pressure (on admission to the hospital) | Feature | Integer     | mmHg  |
| `chol`        | Serum cholesterol                                     | Feature | Integer     | mg/dl |
| `fbs`         | Fasting blood sugar > 120 mg/dl                       | Feature | Categorical | -     |
| `restecg`     | Resting electrocardiographic results                  | Feature | Categorical | -     |
| `thalach`     | Maximum heart rate achieved                           | Feature | Integer     | bpm   |
| `exang`       | Exercise-induced angina                               | Feature | Categorical | -     |
| `oldpeak`     | ST depression induced by exercise relative to rest    | Feature | Float       | -     |
| `slope`       | Slope of the peak exercise ST segment                 | Feature | Categorical | -     |
| `ca`          | Number of major vessels (0-3) colored by fluoroscopy  | Feature | Integer     | -     |
| `thal`        | Thallium stress test                                  | Feature | Categorical | -     |
| `status`      | Diagnosis of heart disease                            | Target  | Categorical | -     |

- Goal: Predict the presence of heart disease in patients using machine learning models, in support of predictive modelling and risk analysis in the healthcare sector.
- Dataset: Heart disease data from southern and northern Nigeria.


<a id="cont"></a>

## Table of Contents

<a href=#packages>i. Importing Packages</a>

<a href=#loading>ii. Data Loading</a>

<a href=#eda>iii. Initial Exploration</a>

<a href=#one>1. Data Consolidation</a>

<a href=#two>2. Data Cleaning</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="packages"></a>
## i. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |

---

In [301]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report


<a id="loading"></a>
## ii. Data Loading
<!-- <a class="anchor" id="1.1"></a> -->
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, the northern and southern heart disease CSV files are loaded into two DataFrames, `df_north` and `df_south`. |

---

In [302]:
# File paths for the northern and southern datasets

DATA_PATH_NORTH = "northern heart diease data.csv"
DATA_PATH_SOUTH = "southern heart disease data.csv"

# load the data
df_north = pd.read_csv(DATA_PATH_NORTH)
df_south = pd.read_csv(DATA_PATH_SOUTH)

<a id="eda"></a>
## iii. Initial Exploration
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Initial Exploration ⚡ |
| :--------------------------- |
| In this section, I performed an initial exploration of all the variables in both DataFrames. |

---


In [303]:
# Get data info
df_north.info()
df_north.head()



# Get summary statistics for numerical features
#df_north.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       301 non-null    float64
 1   sex       303 non-null    object 
 2   cp        303 non-null    object 
 3   trestbps  302 non-null    float64
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    bool   
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    object 
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    object 
 11  ca        299 non-null    float64
 12  thal      301 non-null    object 
 13  status    303 non-null    object 
dtypes: bool(1), float64(4), int64(3), object(6)
memory usage: 31.2+ KB


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,63.0,male,typical angina,145.0,233,True,2,150,no,2.3,downsloping,0.0,fixed defect,absent
1,67.0,male,asymptomatic,160.0,286,False,2,108,yes,1.5,flat,3.0,normal,present
2,67.0,male,asymptomatic,120.0,229,False,2,129,yes,2.6,flat,2.0,reversable defect,present
3,37.0,male,non-anginal pain,130.0,250,False,0,187,no,3.5,downsloping,0.0,normal,absent
4,41.0,female,atypical angina,130.0,204,False,2,172,no,1.4,upsloping,0.0,normal,absent


In [304]:
# Get data info
df_south.info()
df_south.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          268 non-null    float64
 1   sex          270 non-null    object 
 2   cp           270 non-null    object 
 3   trestbps     270 non-null    int64  
 4   chol         270 non-null    int64  
 5   fbs          270 non-null    object 
 6   restecg      270 non-null    int64  
 7   thalach      267 non-null    float64
 8   exang        270 non-null    bool   
 9   oldpeak      270 non-null    float64
 10  slope        270 non-null    object 
 11  ca           270 non-null    int64  
 12  thal         270 non-null    object 
 13  status       270 non-null    object 
 14  Unnamed: 14  0 non-null      float64
 15  Unnamed: 15  0 non-null      float64
dtypes: bool(1), float64(5), int64(4), object(6)
memory usage: 32.0+ KB


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status,Unnamed: 14,Unnamed: 15
0,70.0,male,asymptomatic,130,322,no,2,109.0,False,2.4,flat,3,normal,present,,
1,67.0,female,non-anginal pain,115,564,no,2,160.0,False,1.6,flat,0,reversable defect,absent,,
2,57.0,male,atypical angina,124,261,no,0,141.0,False,0.3,upsloping,0,reversable defect,present,,
3,64.0,male,asymptomatic,128,263,no,0,105.0,True,0.2,flat,1,reversable defect,absent,,
4,74.0,female,atypical angina,120,269,no,2,121.0,True,0.2,upsloping,1,normal,absent,,


# Check unique values in the categorical columns of both regional DataFrames
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
for name, df in [('north', df_north), ('south', df_south)]:
    for feature in categorical_features:
        print(f'{name} {feature}: {df[feature].unique()}')

### Dataset Overview

The `df_south` DataFrame contains **270 entries** and **16 columns**. Key details:

- **Data Types**:
  - **Float64**: 5 columns (`age`, `thalach`, `oldpeak`, and the two empty `Unnamed` columns)
  - **Int64**: 4 columns (`trestbps`, `chol`, `restecg`, `ca`)
  - **Object**: 6 columns (`sex`, `cp`, `fbs`, `slope`, `thal`, `status`)
  - **Bool**: 1 column (`exang`)

- **Non-Null Counts**:
  - The columns `age` and `thalach` have missing values (268 and 267 non-null entries, respectively); all other named columns are fully populated.
  - The two unnamed columns (`Unnamed: 14` and `Unnamed: 15`) contain no data.

Two things therefore need attention before the datasets can be combined: the missing values in `age` and `thalach`, and the two empty columns.


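As the outputs above show, the north dataset has `303` rows and `14` columns, while the south dataset has `270` rows and `16` columns. The `fbs` and `exang` columns are encoded inconsistently across the two datasets: `fbs` is boolean in the north but `yes`/`no` text in the south, and `exang` is `yes`/`no` text in the north but boolean in the south.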
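Mismatches like these can also be found programmatically rather than by reading the two `info()` outputs side by side. The following is a minimal sketch, assuming two small stand-in frames (`north` and `south` are hypothetical and only mimic three of the real columns); it lists the columns whose dtypes differ between the two datasets.

```python
import pandas as pd

# Hypothetical stand-ins for df_north and df_south (not the real data).
north = pd.DataFrame({"fbs": [True, False], "exang": ["no", "yes"], "chol": [233, 286]})
south = pd.DataFrame({"fbs": ["no", "no"], "exang": [False, True], "chol": [322, 564]})

# Put the dtypes side by side, then keep only the columns where they disagree.
dtype_table = pd.DataFrame({
    "north": north.dtypes.astype(str),
    "south": south.dtypes.astype(str),
})
mismatched = dtype_table[dtype_table["north"] != dtype_table["south"]]
print(mismatched)
```

On the real DataFrames the same comparison would also flag columns that differ only because one side contains missing values (an integer column with `NaN` is stored as float), so each flagged column still needs a manual look.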

 <a id="one"></a>
## 1. Data Consolidation
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data Consolidation ⚡ |
| :--------------------------- |
| o Are any transformations required for specific variables to ensure they are directly comparable between the two datasets? How will these transformations affect the analysis? |
| o After consolidating the datasets, what steps will be taken to validate the integrity and accuracy of the combined data? |

---

![Alt text](image.png)

In [305]:
# Drop the two empty columns from the southern dataset
df_south = df_south.drop(columns=['Unnamed: 14', 'Unnamed: 15'])
df_south.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,70.0,male,asymptomatic,130,322,no,2,109.0,False,2.4,flat,3,normal,present
1,67.0,female,non-anginal pain,115,564,no,2,160.0,False,1.6,flat,0,reversable defect,absent
2,57.0,male,atypical angina,124,261,no,0,141.0,False,0.3,upsloping,0,reversable defect,present
3,64.0,male,asymptomatic,128,263,no,0,105.0,True,0.2,flat,1,reversable defect,absent
4,74.0,female,atypical angina,120,269,no,2,121.0,True,0.2,upsloping,1,normal,absent


In [306]:
# Recode the boolean `exang` as 'yes'/'no' strings to match the northern dataset
df_south['exang'] = df_south['exang'].replace({False: 'no', True: 'yes'}).astype('str')
df_south.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,70.0,male,asymptomatic,130,322,no,2,109.0,no,2.4,flat,3,normal,present
1,67.0,female,non-anginal pain,115,564,no,2,160.0,no,1.6,flat,0,reversable defect,absent
2,57.0,male,atypical angina,124,261,no,0,141.0,no,0.3,upsloping,0,reversable defect,present
3,64.0,male,asymptomatic,128,263,no,0,105.0,yes,0.2,flat,1,reversable defect,absent
4,74.0,female,atypical angina,120,269,no,2,121.0,yes,0.2,upsloping,1,normal,absent


In [307]:
# Recode the boolean `fbs` as 'yes'/'no' strings to match the southern dataset
df_north['fbs'] = df_north['fbs'].replace({False: 'no', True: 'yes'}).astype('str')
df_north.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,63.0,male,typical angina,145.0,233,yes,2,150,no,2.3,downsloping,0.0,fixed defect,absent
1,67.0,male,asymptomatic,160.0,286,no,2,108,yes,1.5,flat,3.0,normal,present
2,67.0,male,asymptomatic,120.0,229,no,2,129,yes,2.6,flat,2.0,reversable defect,present
3,37.0,male,non-anginal pain,130.0,250,no,0,187,no,3.5,downsloping,0.0,normal,absent
4,41.0,female,atypical angina,130.0,204,no,2,172,no,1.4,upsloping,0.0,normal,absent


In [308]:
# Concatenate the two regional datasets row-wise
df_combined = pd.concat([df_north, df_south], axis=0)

# Reset index for the combined DataFrame
df_combined.reset_index(drop=True, inplace=True)

# Check the structure of the combined dataset, then display the first few rows
df_combined.info()
df_combined.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573 entries, 0 to 572
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       569 non-null    float64
 1   sex       573 non-null    object 
 2   cp        573 non-null    object 
 3   trestbps  572 non-null    float64
 4   chol      573 non-null    int64  
 5   fbs       573 non-null    object 
 6   restecg   573 non-null    int64  
 7   thalach   570 non-null    float64
 8   exang     573 non-null    object 
 9   oldpeak   573 non-null    float64
 10  slope     573 non-null    object 
 11  ca        569 non-null    float64
 12  thal      571 non-null    object 
 13  status    573 non-null    object 
dtypes: float64(5), int64(2), object(7)
memory usage: 62.8+ KB


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,63.0,male,typical angina,145.0,233,yes,2,150.0,no,2.3,downsloping,0.0,fixed defect,absent
1,67.0,male,asymptomatic,160.0,286,no,2,108.0,yes,1.5,flat,3.0,normal,present
2,67.0,male,asymptomatic,120.0,229,no,2,129.0,yes,2.6,flat,2.0,reversable defect,present
3,37.0,male,non-anginal pain,130.0,250,no,0,187.0,no,3.5,downsloping,0.0,normal,absent
4,41.0,female,atypical angina,130.0,204,no,2,172.0,no,1.4,upsloping,0.0,normal,absent
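The section prompt also asks how the combined data would be validated. The notebook does not show those checks, so the following is only a sketch of what they might look like, using two hypothetical miniature frames (`north`, `south`) in place of the real ones. It checks that no rows were gained or lost, that the columns line up, that missing-value counts add up, and that the harmonised categorical columns contain only the expected labels (the `allowed` sets are assumptions based on the previews above).

```python
import pandas as pd

# Hypothetical miniature versions of the harmonised regional frames.
north = pd.DataFrame({
    "age": [63.0, 67.0],
    "fbs": ["yes", "no"],
    "exang": ["no", "yes"],
    "status": ["absent", "present"],
})
south = pd.DataFrame({
    "age": [70.0, None],
    "fbs": ["no", "no"],
    "exang": ["no", "yes"],
    "status": ["present", "absent"],
})

combined = pd.concat([north, south], ignore_index=True)

# 1. No rows gained or lost by the concatenation.
assert len(combined) == len(north) + len(south)

# 2. Both sources contribute the same columns in the same order.
assert list(combined.columns) == list(north.columns) == list(south.columns)

# 3. Missing values per column equal the sum of the two sources.
expected_nulls = north.isnull().sum() + south.isnull().sum()
assert (combined.isnull().sum() == expected_nulls).all()

# 4. Harmonised categorical columns contain only the expected labels.
allowed = {"fbs": {"yes", "no"}, "exang": {"yes", "no"}, "status": {"present", "absent"}}
unexpected = {}
for col, labels in allowed.items():
    extra = set(combined[col].dropna().unique()) - labels
    if extra:
        unexpected[col] = extra
print(unexpected)  # an empty dict means every label was expected
```

The third check is consistent with the outputs above: `age` has 301 and 268 non-null values in the two sources and 569 in the combined frame.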


<a id="two"></a>
## 2. Data Cleaning
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data Cleaning ⚡ |
| :--------------------------- |
| o Are there any missing values in the dataset? If so, how do you propose to handle them? |
| o How would you deal with any outliers in the dataset? Justify your approach. |

---

In [310]:
df_combined.isnull().sum()

age         4
sex         0
cp          0
trestbps    1
chol        0
fbs         0
restecg     0
thalach     3
exang       0
oldpeak     0
slope       0
ca          4
thal        2
status      0
dtype: int64

### Imputation Strategy for Missing Values

### 1. Age
- **Imputer**: Median Imputation
- **Reason**: 
  - **Robustness to Outliers**: The median is less affected by extreme values, providing a stable measure of central tendency.
  - **Typical Distribution of Age**: Age data is often skewed; the median better reflects the "typical" patient age.
  - **Preservation of Data Integrity**: Retaining rows with imputed values helps maintain dataset size and integrity.
  - **Simplicity and Effectiveness**: Median imputation is straightforward to implement and computationally efficient.

### 2. Trestbps (Resting Blood Pressure)
- **Imputer**: Median Imputation
- **Reason**: 
  - **Robustness to Outliers**: Similar to age, blood pressure readings can also be influenced by outliers. The median provides a more reliable central value.
  - **Health Context**: Blood pressure values often fall within a certain range, making the median a suitable representation.

### 3. Thalach (Maximum Heart Rate Achieved)
- **Imputer**: Median Imputation
- **Reason**: 
  - **Robustness to Outliers**: Heart rate can vary significantly between individuals. The median ensures that extreme values do not skew the imputed results.
  - **Relevance to Health Outcomes**: Maintaining a representative central value is crucial for health-related analyses.

### 4. CA (Number of Major Vessels Colored by Fluoroscopy)
- **Imputer**: Mode Imputation
- **Reason**: 
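  - **Nature of the Variable**: The `ca` variable is a discrete count (0-3) with only a few possible values, so it behaves like a categorical variable. The mode (most frequent value) is more appropriate here than the mean, which would generally not be a whole number of vessels.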
  - **Interpretability**: Imputing with the mode helps retain the most common case, which is relevant for interpreting the health status of patients.

### 5. Thal (Thallium Stress Test)
- **Imputer**: Mode Imputation
- **Reason**: 
  - **Nature of the Variable**: The `thal` variable is categorical (for example `normal`, `fixed defect`, `reversable defect`), so a mean or median is not defined for it. The mode fills the gaps with the most common category.
  - **Simplifies Analysis**: This method helps maintain the categorical nature of the variable, ensuring that analyses remain interpretable.


In [311]:
# Impute missing values for each column.
# Results are assigned back to the column: calling fillna(..., inplace=True)
# on a column selection is deprecated in recent pandas versions.

# Age, Trestbps, Thalach (median)
for col in ['age', 'trestbps', 'thalach']:
    df_combined[col] = df_combined[col].fillna(df_combined[col].median())

# CA, Thal (mode)
for col in ['ca', 'thal']:
    df_combined[col] = df_combined[col].fillna(df_combined[col].mode()[0])


In [312]:
# Convert columns to appropriate data types

# Convert 'age' to integer (it was float only because of the missing values)
df_combined['age'] = df_combined['age'].astype('int')

# 'sex' is already stored as strings; conversion left disabled
#df_combined['sex'] = df_combined['sex'].astype('str')

# Optional: convert 'cp' (chest pain type) to categorical
#df_combined['cp'] = df_combined['cp'].astype('category')

# Convert 'trestbps' (resting blood pressure) to integer
df_combined['trestbps'] = df_combined['trestbps'].astype('int')

# Convert 'restecg' (resting electrocardiographic results) to string, since its codes are categories
df_combined['restecg'] = df_combined['restecg'].astype(str)

# Convert 'thalach' (maximum heart rate achieved) to integer
df_combined['thalach'] = df_combined['thalach'].astype('int')

# Optional: convert 'slope' (slope of the peak exercise ST segment) to categorical
#df_combined['slope'] = df_combined['slope'].astype('category')

# Optional: convert 'thal' (Thallium stress test) to categorical
#df_combined['thal'] = df_combined['thal'].astype('category')

# Convert 'ca' (number of major vessels) to integer (since it represents counts)
df_combined['ca'] = df_combined['ca'].astype('int')

# 'status' (target) is already stored as strings; conversion left disabled
#df_combined['status'] = df_combined['status'].astype('str')

# Verify the changes
df_combined.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573 entries, 0 to 572
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       573 non-null    int32  
 1   sex       573 non-null    object 
 2   cp        573 non-null    object 
 3   trestbps  573 non-null    int32  
 4   chol      573 non-null    int64  
 5   fbs       573 non-null    object 
 6   restecg   573 non-null    object 
 7   thalach   573 non-null    int32  
 8   exang     573 non-null    object 
 9   oldpeak   573 non-null    float64
 10  slope     573 non-null    object 
 11  ca        573 non-null    int32  
 12  thal      573 non-null    object 
 13  status    573 non-null    object 
dtypes: float64(1), int32(4), int64(1), object(8)
memory usage: 53.8+ KB
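The section prompt also asks how outliers would be handled, which the cells above do not address. One common option is the 1.5 × IQR rule; this is an assumption on my part, not a method the notebook uses. The sketch below applies it to a hypothetical `sample` of cholesterol values (the ten values are taken from the `chol` column in the previews above) and flags outliers rather than deleting them, since an extreme reading can be clinically genuine. The helper `iqr_bounds` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample built from the chol values visible in the previews.
sample = pd.DataFrame({"chol": [233, 286, 229, 250, 204, 322, 564, 261, 263, 269]})

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(sample["chol"])

# Flag values outside the fences instead of dropping them.
sample["chol_outlier"] = ~sample["chol"].between(low, high)
print(sample[sample["chol_outlier"]])
```

Flagged rows could then be reviewed, capped at the fences, or left as they are when a tree-based model such as the imported `RandomForestClassifier` is used, since such models are less sensitive to extreme values.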


In [313]:
df_combined.duplicated().sum()

162
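The count above shows 162 fully duplicated rows in the combined data, but the notebook does not say what is done with them. Whether they are repeated records or distinct patients with identical values cannot be determined from the outputs shown, so removing them is a judgment call. The sketch below, on a hypothetical `records` frame, shows how the duplicate groups could be inspected first and then dropped.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 are identical.
records = pd.DataFrame({
    "age": [63, 67, 63, 41],
    "sex": ["male", "male", "male", "female"],
    "chol": [233, 286, 233, 204],
    "status": ["absent", "present", "absent", "absent"],
})

# keep=False marks every member of a duplicate group, which makes inspection easier.
duplicate_groups = records[records.duplicated(keep=False)]
print(duplicate_groups)

# Keep the first occurrence of each duplicated row and rebuild the index.
deduplicated = records.drop_duplicates().reset_index(drop=True)
print(len(records), "->", len(deduplicated))
```

If duplicates are kept, identical rows can end up in both the training and the test split, which tends to inflate measured performance.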

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that accurately predict the presence of heart disease (`status`). |

---
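This section has no cells yet. As a starting point, the sketch below shows one way the imported tools could fit together: one-hot encoding, a stratified train/test split, scaling fitted on the training set only, and a logistic regression baseline. It is an assumption about how the modelling might proceed, not the notebook's method, and it runs on a synthetic, hypothetical `toy` frame (including a made-up `status` rule) rather than on `df_combined`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data shaped like a few of the real columns.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "chol": rng.integers(150, 400, n),
    "exang": rng.choice(["yes", "no"], n),
})
# Made-up target loosely tied to age so the model has a signal to learn.
toy["status"] = np.where(toy["age"] + rng.normal(0, 10, n) > 55, "present", "absent")

# One-hot encode categorical features; encode the target as 1 = present, 0 = absent.
X = pd.get_dummies(toy.drop(columns="status"), drop_first=True, dtype=int)
y = (toy["status"] == "present").astype(int)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only to avoid leaking test information.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

predictions = model.predict(scaler.transform(X_test))
```

The same split and scaling could be reused for `RandomForestClassifier` and `KNeighborsClassifier`, so that all models are compared on the same holdout rows.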

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---