# Assessment - Machine Learning, Data Coordinator with eHealth Africa

© 2024

---

## Section B

eHealth Africa has investigated various factors that can cause heart disease. Data on patients with heart disease were collected in the southern and northern parts of Nigeria, and Table 1 describes the variables.

### Table 1: Heart Disease Data Dictionary

| Variable Name | Description                                           | Role    | Type        | Units |
|---------------|-------------------------------------------------------|---------|-------------|-------|
| `age`         | Age of the patient                                    | Feature | Integer     | years |
| `sex`         | Gender of the patient                                 | Feature | Categorical | -     |
| `cp`          | Chest pain type                                       | Feature | Categorical | -     |
| `trestbps`    | Resting blood pressure (on admission to the hospital) | Feature | Integer     | mmHg  |
| `chol`        | Serum cholesterol                                     | Feature | Integer     | mg/dl |
| `fbs`         | Fasting blood sugar > 120 mg/dl                       | Feature | Categorical | -     |
| `restecg`     | Resting electrocardiographic results                  | Feature | Categorical | -     |
| `thalach`     | Maximum heart rate achieved                           | Feature | Integer     | -     |
| `exang`       | Exercise-induced angina                               | Feature | Categorical | -     |
| `oldpeak`     | ST depression induced by exercise relative to rest    | Feature | Float       | -     |
| `slope`       | Slope of the peak exercise ST segment                 | Feature | Categorical | -     |
| `ca`          | Number of major vessels (0-3) colored by fluoroscopy  | Feature | Integer     | -     |
| `thal`        | Thallium stress test                                  | Feature | Categorical | -     |
| `status`      | Diagnosis of heart disease                            | Target  | Categorical | -     |

- Goal: Predict the presence of heart disease in patients using machine learning models, in support of predictive modelling and risk analysis in the healthcare sector.
- Dataset: Heart disease data from southern and northern Nigeria.



# Library Imports

In [1]:
# Standard Libraries
import os
import json

# Data Handling Libraries
import numpy as np
import pandas as pd

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


# Data Loading and Initial Exploration

In [2]:
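# Load datasets
# Each path reads its own environment variable; sharing a single "DATA_PATH"
# would make both frames load the same file whenever that variable is set.
DATA_PATH_North = os.getenv("DATA_PATH_NORTH", default="northern heart diease data.csv")
DATA_PATH_South = os.getenv("DATA_PATH_SOUTH", default="southern heart disease data.csv")
df_north = pd.read_csv(DATA_PATH_North)
df_south = pd.read_csv(DATA_PATH_South)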

In [3]:
display(df_north.head())
display(df_south.head())


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,63.0,male,typical angina,145.0,233,True,2,150,no,2.3,downsloping,0.0,fixed defect,absent
1,67.0,male,asymptomatic,160.0,286,False,2,108,yes,1.5,flat,3.0,normal,present
2,67.0,male,asymptomatic,120.0,229,False,2,129,yes,2.6,flat,2.0,reversable defect,present
3,37.0,male,non-anginal pain,130.0,250,False,0,187,no,3.5,downsloping,0.0,normal,absent
4,41.0,female,atypical angina,130.0,204,False,2,172,no,1.4,upsloping,0.0,normal,absent


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status,Unnamed: 14,Unnamed: 15
0,70.0,male,asymptomatic,130,322,no,2,109.0,False,2.4,flat,3,normal,present,,
1,67.0,female,non-anginal pain,115,564,no,2,160.0,False,1.6,flat,0,reversable defect,absent,,
2,57.0,male,atypical angina,124,261,no,0,141.0,False,0.3,upsloping,0,reversable defect,present,,
3,64.0,male,asymptomatic,128,263,no,0,105.0,True,0.2,flat,1,reversable defect,absent,,
4,74.0,female,atypical angina,120,269,no,2,121.0,True,0.2,upsloping,1,normal,absent,,


### Data Information Summary for `df_north` and `df_south`

Based on the `.info()` output, `df_north` and `df_south` describe the same heart disease features for the two regions, but they differ in row count, column structure, and missing values.

1. **`df_north`**:
   - Contains **303 rows** with **14 columns**.
   - Column types include `float64`, `int64`, `bool`, and `object`.
   - Key columns such as `age`, `trestbps`, and `ca` have some missing values.
   

2. **`df_south`**:
   - Contains **270 rows** with **16 columns** (two additional columns, `Unnamed: 14` and `Unnamed: 15`, which are completely empty).
   - Column types are similar to those in `df_north`; the two empty columns are read as `float64`.
   - Some columns, such as `age` and `thalach`, also have missing values.

Overall, both datasets contain the categorical and numerical features needed for predictive analysis, but the missing values and the empty columns in `df_south` need cleaning. The column structures must also be aligned so that both datasets can be modeled consistently.
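The comparison described above can also be done programmatically. The following is a minimal sketch, assuming two small toy frames in place of `df_north` and `df_south`; the helper name `structure_report` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for df_north and df_south (illustrative values, not the real data)
north = pd.DataFrame({"age": [63.0, None], "chol": [233, 286]})
south = pd.DataFrame(
    {"age": [70.0, 67.0], "chol": [322, 564], "Unnamed: 14": [None, None]}
)


def structure_report(left, right):
    """Summarize column mismatches and per-column missing counts for two frames."""
    return {
        "only_in_left": sorted(set(left.columns) - set(right.columns)),
        "only_in_right": sorted(set(right.columns) - set(left.columns)),
        "missing_left": left.isna().sum().to_dict(),
        "missing_right": right.isna().sum().to_dict(),
    }


report = structure_report(north, south)
print(report)
```

Calling the same helper on the real frames would list the columns present in only one of them together with the number of missing entries per column.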


# Data Consolidation
- Concatenate datasets
- Check for data type inconsistencies
- Check for duplicates

## Concatenate Datasets

In [4]:
combined_df = pd.concat([df_north, df_south], axis=0)
display(combined_df.head())


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status,Unnamed: 14,Unnamed: 15
0,63.0,male,typical angina,145.0,233,True,2,150.0,no,2.3,downsloping,0.0,fixed defect,absent,,
1,67.0,male,asymptomatic,160.0,286,False,2,108.0,yes,1.5,flat,3.0,normal,present,,
2,67.0,male,asymptomatic,120.0,229,False,2,129.0,yes,2.6,flat,2.0,reversable defect,present,,
3,37.0,male,non-anginal pain,130.0,250,False,0,187.0,no,3.5,downsloping,0.0,normal,absent,,
4,41.0,female,atypical angina,130.0,204,False,2,172.0,no,1.4,upsloping,0.0,normal,absent,,
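By default, `pd.concat` keeps each frame's original row labels, so labels repeat after stacking two frames that both start at 0. The sketch below shows one possible variant, assuming toy frames and a hypothetical `region` column that is not part of the original data: it resets the index and records where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two regional frames (illustrative values only)
north = pd.DataFrame({"age": [63.0, 67.0], "status": ["absent", "present"]})
south = pd.DataFrame({"age": [70.0], "status": ["present"]})

# "region" is a hypothetical helper column used to keep track of the source
combined = pd.concat(
    [north.assign(region="north"), south.assign(region="south")],
    axis=0,
    ignore_index=True,  # build a fresh 0..n-1 index instead of repeating labels
)
print(combined)
```

A unique index avoids ambiguous label-based lookups later on, and the source column makes it possible to compare the two regions after the merge.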


## Check for Data Type Inconsistencies

In [5]:
combined_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 573 entries, 0 to 269
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          569 non-null    float64
 1   sex          573 non-null    object 
 2   cp           573 non-null    object 
 3   trestbps     572 non-null    float64
 4   chol         573 non-null    int64  
 5   fbs          573 non-null    object 
 6   restecg      573 non-null    int64  
 7   thalach      570 non-null    float64
 8   exang        573 non-null    object 
 9   oldpeak      573 non-null    float64
 10  slope        573 non-null    object 
 11  ca           569 non-null    float64
 12  thal         571 non-null    object 
 13  status       573 non-null    object 
 14  Unnamed: 14  0 non-null      float64
 15  Unnamed: 15  0 non-null      float64
dtypes: float64(7), int64(2), object(7)
memory usage: 76.1+ KB


The `Unnamed: 14` and `Unnamed: 15` columns contain no values (0 non-null entries), so they will be dropped in the data cleaning stage.
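Beyond the empty columns, the previews above show `fbs` and `exang` encoded differently in the two sources (`True`/`False` in one, `yes`/`no` in the other), which is why both appear as `object` in the summary. A minimal sketch of one way to detect and harmonize such a column follows. It assumes that `yes`/`no` carry the same meaning as `True`/`False`; the mapping `BOOL_MAP` and the helper `to_bool` are hypothetical names.

```python
import pandas as pd

# Toy column mixing the two encodings seen in the previews (illustrative values)
fbs = pd.Series([True, False, "no", "yes"], dtype="object")

# Hypothetical mapping; assumes "yes"/"no" mean the same as True/False
BOOL_MAP = {"yes": True, "no": False, "true": True, "false": False}


def to_bool(series):
    """Map yes/no strings and booleans to one nullable boolean column."""
    lowered = series.astype(str).str.strip().str.lower()
    # Values outside BOOL_MAP become missing rather than being guessed
    return lowered.map(BOOL_MAP).astype("boolean")


# Count the Python types present to confirm the column is mixed
print(fbs.map(lambda v: type(v).__name__).value_counts())

clean = to_bool(fbs)
print(clean)
```

Mapping unknown values to missing rather than to `False` keeps unexpected encodings visible for the later missing-data step.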


## Check for Duplicate Values


In [6]:
combined_df.duplicated().sum()

0

There are no duplicate rows in the combined dataset.
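For reference, `duplicated()` compares entire rows and flags every repeat after the first occurrence. The sketch below, on a toy frame with illustrative values, shows how the count, the affected rows, and a de-duplicated copy could be obtained if duplicates were present.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only)
toy = pd.DataFrame({"age": [63.0, 63.0, 41.0], "chol": [233, 233, 204]})

# Number of rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# keep=False marks every copy, which helps when inspecting the duplicates
all_copies = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row and rebuild the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(all_copies)
print(deduped)
```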

# Data Cleaning
- Drop unwanted features
- Address data type inconsistencies
- Address missing data
- Detect outliers

## Drop Unwanted Features


In [7]:
# `columns=` already targets columns, so the redundant `axis=1` is omitted
combined_df.drop(columns=['Unnamed: 14', 'Unnamed: 15'], inplace=True)
combined_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,status
0,63.0,male,typical angina,145.0,233,True,2,150.0,no,2.3,downsloping,0.0,fixed defect,absent
1,67.0,male,asymptomatic,160.0,286,False,2,108.0,yes,1.5,flat,3.0,normal,present
2,67.0,male,asymptomatic,120.0,229,False,2,129.0,yes,2.6,flat,2.0,reversable defect,present
3,37.0,male,non-anginal pain,130.0,250,False,0,187.0,no,3.5,downsloping,0.0,normal,absent
4,41.0,female,atypical angina,130.0,204,False,2,172.0,no,1.4,upsloping,0.0,normal,absent
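The columns above were dropped by name. As an alternative sketch, assuming the goal is simply to remove every column that holds no values at all, the empty columns can be found and dropped without listing them. The toy frame below is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one entirely empty column (illustrative values only)
toy = pd.DataFrame(
    {
        "age": [63.0, np.nan],
        "chol": [233, 286],
        "Unnamed: 14": [np.nan, np.nan],
    }
)

# Columns in which every entry is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop only the columns that are missing in every row
cleaned = toy.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

Because `how="all"` only removes columns with no values at all, partially filled columns such as `age` are kept for the missing-data step.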
