# Assessment - Machine Learning, Data Coordinator with eHealth Africa

© 2024

---

## Section B

eHealth Africa has investigated various factors that can cause heart disease. Data on patients with heart disease were collected in the southern and northern parts of Nigeria, and the variables are described in Table 1.

### Table 1: Heart Disease Data Dictionary

| Variable Name | Description                                            | Role    | Type        | Units |
|---------------|--------------------------------------------------------|---------|-------------|-------|
| `age`         | Age of the patient                                     | Feature | Integer     | years |
| `sex`         | Gender of the patient                                  | Feature | Categorical | -     |
| `cp`          | Chest pain type                                        | Feature | Categorical | -     |
| `trestbps`    | Resting blood pressure (on admission to the hospital)  | Feature | Integer     | mm Hg |
| `chol`        | Serum cholesterol                                      | Feature | Integer     | mg/dl |
| `fbs`         | Fasting blood sugar > 120 mg/dl                        | Feature | Categorical | -     |
| `restecg`     | Resting electrocardiographic results                   | Feature | Categorical | -     |
| `thalach`     | Maximum heart rate achieved                            | Feature | Integer     | bpm   |
| `exang`       | Exercise-induced angina                                | Feature | Categorical | -     |
| `oldpeak`     | ST depression induced by exercise relative to rest     | Feature | Float       | -     |
| `slope`       | Slope of the peak exercise ST segment                  | Feature | Categorical | -     |
| `ca`          | Number of major vessels (0-3) colored by fluoroscopy   | Feature | Integer     | -     |
| `thal`        | Thallium stress test                                   | Feature | Categorical | -     |
| `status`      | Diagnosis of heart disease                             | Target  | Categorical | -     |

The data is intended for analyzing the presence of heart disease in patients. You will apply machine learning and data science techniques to predictive modelling and risk analysis in the healthcare sector.

## Duration:

You have 24 hours from the time you receive this test to complete these exercises. Good luck!


<a id="cont"></a>

## Table of Contents

<a href=#packages>i. Importing Packages</a>

<a href=#loading>ii. Data Loading</a>

<a href=#eda>iii. Exploratory Data Analysis (EDA)</a>

<a href=#one>1. Data Consolidation</a>

<a href=#two>2. Data Cleaning</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="packages"></a>
## i. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |

---
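
As a starting point, the imports below cover the steps listed in the table of contents. The selection is an assumption (pandas and NumPy for data handling, scikit-learn for modelling); the assessment itself does not prescribe particular libraries.

```python
# Core data handling
import numpy as np
import pandas as pd

# Preprocessing, modelling and evaluation (scikit-learn is an assumed choice)
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
```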

<a id="loading"></a>
## ii. Data Loading
<!-- <a class="anchor" id="1.1"></a> -->
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the southern and northern datasets into DataFrames (for example, `df_south`). |

---
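
A minimal loading sketch, assuming the data arrives as CSV. The file contents, the category labels and the helper name `load_dataset` are hypothetical; an in-memory sample stands in for the real file so the block runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for a real file; rows and category labels are made up.
SAMPLE_CSV = io.StringIO(
    "Age,Sex,cp,trestbps,chol,status\n"
    "63,male,typical angina,145,233,present\n"
    "41,female,non-anginal pain,130,204,absent\n"
)


def load_dataset(source) -> pd.DataFrame:
    """Read one regional dataset and normalise its column names."""
    df = pd.read_csv(source)
    # Lower-case, stripped names make the two regional files easier to align.
    df.columns = df.columns.str.strip().str.lower()
    return df


df_south = load_dataset(SAMPLE_CSV)
print(df_south.shape)
```

`pd.read_csv` accepts a file path as well as a file-like object, so the same helper could be pointed at the actual file once its name is known.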

<a id="eda"></a>
## iii. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I performed an in-depth analysis of all the variables in the DataFrame. |

---


### Dataset Overview

The `df_south` DataFrame contains **270 entries** and **16 columns**. The following are the key details regarding the columns:

- **Data Types**:
  - **Float64**: 5 columns (e.g., `age`, `thalach`, `oldpeak`)
  - **Int64**: 4 columns (e.g., `trestbps`, `chol`, `ca`)
  - **Object**: 6 columns (e.g., `sex`, `cp`, `fbs`, `slope`, `thal`, `status`)
  - **Bool**: 1 column (`exang`)

- **Non-Null Counts**:
  - The columns `age` and `thalach` have missing values (268 and 267 non-null entries, respectively), while all other columns are fully populated.
  - There are two unnamed columns that contain no data.

This summary provides a foundational understanding of the data structure, types, and potential areas needing attention, such as missing values in specific columns.
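
The kind of overview described above can be produced with a few pandas calls. The sketch below uses a small made-up frame (not the real `df_south`) that mimics the reported issues: a few missing values and two unnamed columns with no data. The column labels `Unnamed: 14` and `Unnamed: 15` are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame only: all values are made up.
df = pd.DataFrame({
    "age": [63.0, np.nan, 41.0],
    "thalach": [150.0, 172.0, np.nan],
    "chol": [233, 204, 250],
    "status": ["present", "absent", "present"],
    "Unnamed: 14": [np.nan] * 3,
    "Unnamed: 15": [np.nan] * 3,
})

# Missing values per column
missing = df.isna().sum()

# Columns that contain no data at all
empty_cols = missing[missing == len(df)].index.tolist()
df = df.drop(columns=empty_cols)

print(df.dtypes)
print(missing)
```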


 <a id="one"></a>
## 1. Data Consolidation
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data consolidation ⚡ |
| :--------------------------- |
| Are any transformations required for specific variables to ensure they are directly comparable between the two datasets? How will these transformations affect the analysis? |
| After consolidating the datasets, what steps will be taken to validate the integrity and accuracy of the combined data? |

---

![Alt text](image.png)
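
One way to approach both questions is sketched below. It assumes, purely for illustration, that the two regional datasets encode `sex` and `exang` differently; the rows, the mapping tables, the `region` column and the helper `harmonise` are all hypothetical.

```python
import pandas as pd

# Made-up rows: the encodings shown here are assumptions, not the real data.
df_south = pd.DataFrame({"age": [63, 41], "sex": ["Male", "female"], "exang": [True, False]})
df_north = pd.DataFrame({"age": [55, 70], "sex": ["M", "F"], "exang": ["yes", "no"]})

SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}
EXANG_MAP = {"yes": True, "no": False, "true": True, "false": False}


def harmonise(df: pd.DataFrame, region: str) -> pd.DataFrame:
    """Bring one regional dataset onto a shared encoding and tag its origin."""
    out = df.copy()
    out["sex"] = out["sex"].astype(str).str.strip().str.lower().map(SEX_MAP)
    out["exang"] = out["exang"].astype(str).str.strip().str.lower().map(EXANG_MAP)
    out["region"] = region
    return out


combined = pd.concat(
    [harmonise(df_south, "south"), harmonise(df_north, "north")],
    ignore_index=True,
)

# Integrity checks after consolidation
assert len(combined) == len(df_south) + len(df_north)   # no rows lost or added
assert combined[["sex", "exang"]].notna().all().all()   # every value was mapped
assert not combined.duplicated().any()                  # no duplicate records
print(combined)
```

Values that do not appear in a mapping table become missing after `map`, so the `notna` check doubles as a test that the mappings are complete.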

<a id="two"></a>
## 2. Data Cleaning
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data cleaning ⚡ |
| :--------------------------- |
| Are there any missing values in the dataset? If so, how do you propose to handle them? |
| How would you deal with any outliers in the dataset? Justify your approach. |

---
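
The sketch below shows one possible answer to each question, not a prescribed method: median imputation for missing numeric values and capping at the 1.5 × IQR fences for outliers. The data and the helper `iqr_bounds` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: one missing age and one extreme cholesterol reading.
df = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0, 57.0],
    "chol": [233, 212, 250, 236, 564],
})

# Missing values: impute with the median, which is robust to skew.
df["age"] = df["age"].fillna(df["age"].median())


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the lower and upper fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Outliers: cap values at the fences instead of dropping rows.
low, high = iqr_bounds(df["chol"])
df["chol_capped"] = df["chol"].clip(lower=low, upper=high)
print(df)
```

Capping keeps every patient record in the dataset, which matters when the sample is small; whether an extreme reading is an error or a genuine clinical value still needs domain judgement.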

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---
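
A few typical EDA summaries are sketched below on made-up data: numeric summary statistics, the balance of the target classes, and how the target relates to a categorical and a numeric feature. The category labels (`present`, `absent`, `male`, `female`) are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [63, 41, 56, 57, 67, 44],
    "chol": [233, 204, 236, 354, 286, 263],
    "sex": ["male", "female", "male", "female", "male", "male"],
    "status": ["present", "absent", "present", "absent", "present", "absent"],
})

# Summary statistics for the numeric columns
numeric_summary = df.describe()

# Share of each target class
class_balance = df["status"].value_counts(normalize=True)

# Share of each target class within each sex (rows sum to 1)
by_sex = pd.crosstab(df["sex"], df["status"], normalize="index")

# Mean of numeric features per target class
mean_by_status = df.groupby("status")[["age", "chol"]].mean()

print(numeric_summary, class_balance, by_sex, mean_by_status, sep="\n\n")
```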


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to clean the dataset and, where the EDA phase suggests it, create new features. |

---
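
Two common feature-engineering steps are sketched below: binning a numeric variable and one-hot encoding categorical variables. The age bands and the category labels are hypothetical choices, not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [29, 45, 61, 77],
    "cp": ["typical angina", "asymptomatic", "non-anginal pain", "asymptomatic"],
    "chol": [204, 233, 250, 286],
})

# New feature: age bands (bin edges are an assumption; intervals are right-inclusive).
df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 40, 55, 70, 120],
    labels=["<=40", "41-55", "56-70", ">70"],
)

# One-hot encode the categorical columns so models can consume them.
features = pd.get_dummies(df, columns=["cp", "age_band"])
print(features.columns.tolist())
```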

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that are able to accurately predict the presence of heart disease (`status`). |

---
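
Because `status` is categorical, a classifier is the natural model family. The sketch below is a baseline under stated assumptions: logistic regression is an illustrative choice, the data is synthetic (not the eHealth Africa dataset), and the target is assumed to be binary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data, generated only so the example runs.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "chol": rng.integers(150, 350, n),
    "sex": rng.choice(["male", "female"], n),
})
# Binary target (1 = disease present), loosely tied to age for illustration.
y = ((X["age"] + rng.normal(0, 10, n)) > 55).astype(int)

# Hold out 25% of the rows, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale numeric columns and one-hot encode categorical ones inside the pipeline,
# so preprocessing is fitted on the training data only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "chol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Holdout accuracy: {accuracy:.2f}")
```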

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the trained ML models on a holdout dataset and comment on which model performs best and why. |

---
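
A holdout comparison could look like the sketch below. The two candidate models and the three metrics are illustrative assumptions, and the data comes from scikit-learn's `make_classification` rather than the heart disease dataset. Recall is included because missing a true case is usually the costlier error in a screening setting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only so the example runs.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)                 # hard class labels
    proba = clf.predict_proba(X_test)[:, 1]    # probability of the positive class
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    })

# One row per model, best ROC AUC first.
results = pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)
print(results.round(3))
```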

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to explain in simple terms how the best-performing model works, so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---
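
If the best-performing model turns out to be a logistic regression (an assumption made here only for illustration), its coefficients offer a simple explanation: each one can be converted into an odds ratio. The sketch below uses synthetic data; the feature names are borrowed from Table 1 purely as labels and the numbers say nothing about real patients.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; the names below are labels only, not real measurements.
feature_names = ["age", "trestbps", "chol", "thalach"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Standardising puts all coefficients on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

explanation = pd.DataFrame(
    {
        "coefficient": clf.coef_[0],
        "odds_ratio": np.exp(clf.coef_[0]),
    },
    index=feature_names,
).sort_values("coefficient", key=lambda s: s.abs(), ascending=False)
print(explanation.round(3))
```

An odds ratio above 1 means that a one standard deviation increase in the feature raises the predicted odds of the positive class, holding the other features fixed; a value below 1 means it lowers them.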