# Insert Title of the Case Study

Submitted by:   
**Lorraine Renee B. Cortel**  
**Dennis H. Lu**  
**Danielle Kirsten T. Sison**

**CSMODEL – S14**

Submitted to:   
**Mr. Thomas James Z. Tiam-Lee**

Date of Submission:  
**September 22, 2020**

***

`Required Parts`:
1. Data Description
2. Exploratory Data Analysis
3. Research Questions
4. Data Modelling
5. Insights and Conclusions

For reference on how to use markdown, check link: https://daringfireball.net/projects/markdown/syntax  
Link of dataset: http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records  
Link of research that used the dataset: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5 *

*Note: PDF files of the research papers can be found on GitHub under /resources.

## 1. Data Description

The dataset used in this case study is based on the medical records of heart failure patients collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) from April 2015 to December 2015 [1, 2]. The researchers of the original study ensured that all participating patients were informed about the study and gave their consent [2]. Additionally, the original study was approved by the Institutional Review Board of Government College University, Faisalabad, Pakistan, and followed the principles of the Helsinki Declaration [2].

The dataset consists of 299 observations, where each observation is the medical record of one heart failure patient. In total, the dataset includes 105 female patients and 194 male patients. All of the patients are at least 40 years old and have left ventricular systolic dysfunction (LVSD), a condition in which the left ventricle of the heart loses its ability to contract normally. The left ventricle is responsible for pumping oxygen-rich blood into circulation, so LVSD results in the heart having a weaker pumping force. All patients in this dataset fall under Class III or IV of the New York Heart Association (NYHA) Functional Classification, which describes how limited a patient is during physical activity. Patients under Class III and IV fatigue easily and experience discomfort when performing physical activity.

Based on the case study by Ahmad et al., the following features were considered as potential variables to explain mortality caused by cardiovascular diseases like heart failure: age, gender, blood pressure (BP), ejection fraction (EF), creatinine phosphokinase (CPK), platelet count, serum sodium, serum creatinine, and whether the patient has anemia, has diabetes, or is a smoker [2].

Some information, such as anemia, high blood pressure, diabetes, sex, and smoking, was represented as binary data, with 1 indicating that the feature is true for that particular patient and 0 indicating that it is false. For example, a patient who has 1 for anemia but 0 for high blood pressure has anemia but does not have high blood pressure. Anemia classifications were determined by the hospital physician using the patient's haematocrit level. If a patient's haematocrit level was less than 36%, the minimum for a normal haematocrit level, then the patient was considered anemic. For the classification of high blood pressure and smoking status, the original researchers used physician reports.
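
The anemia rule described above can be written as a short function. This is only an illustrative sketch: the function name `is_anaemic` and the constant name are assumptions made here, and the haematocrit level itself is not a column in the dataset, which stores only the resulting binary `anaemia` flag.

```python
# Threshold stated in the text: a haematocrit level below 36% is treated as anemic.
ANAEMIA_HAEMATOCRIT_THRESHOLD = 36.0  # percent

def is_anaemic(haematocrit_percent: float) -> int:
    """Return 1 if the haematocrit level is below the threshold, otherwise 0.

    Hypothetical helper for illustration; the dataset only contains the
    final 0/1 flag, not the haematocrit measurement.
    """
    return int(haematocrit_percent < ANAEMIA_HAEMATOCRIT_THRESHOLD)

print(is_anaemic(33.5))  # below the threshold
print(is_anaemic(40.0))  # at or above the threshold
```

The comparison is strict, so a level of exactly 36% is not classified as anemic under this reading of the rule.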

Numerical risk factors were generated from the patients' blood reports and medical records. EF pertains to the amount of blood pumped out by the left ventricle for each contraction and is represented as a percentage. CPK describes the level of the CPK enzyme in the blood and is represented in micrograms per litre (mcg/L). CPK was used as a risk factor because high levels of CPK in the blood could indicate heart failure or injury: damaged muscle tissue releases more CPK into the blood. A risk factor related to CPK is serum creatinine. Creatinine is a waste product generated from creatine during the normal wear and tear of the body's muscles. Like CPK, high levels of serum creatinine could indicate heart failure. In the dataset, serum creatinine is represented in milligrams per decilitre (mg/dL). Serum sodium measures the sodium level in a patient's blood and is represented in milliequivalents per litre (mEq/L). Sodium is important for ensuring that muscles and nerves function correctly, and abnormally low levels of sodium in the blood might be caused by heart failure. Platelet count was recorded as kiloplatelets per millilitre (kiloplatelets/mL).

Other features of the dataset include a follow-up period measured in days and a variable called "death event" that records whether a patient died. Like the rest of the binary data in the study, 1 indicates that the patient has died, while 0 indicates that the patient is still alive. Follow-up refers to monitoring a patient's health after treatment, including when a patient participates in a clinical study or trial for a period of time. In the study, follow-up time varies per patient and is the basis for the death event feature: patients who died within the follow-up period have their death event recorded as 1.

Regardless of the field of study, accurate data collection is important to maintain the integrity of the research. Appropriate selection of data collection instruments and clearly outlined instructions for the actual data gathering procedure reduce the likelihood of errors occurring. The dataset used in the original case study and in this case study should be no exception. Failure to comply with these methods could make it difficult to answer the proposed research questions. Furthermore, even if the data analysis were carried out, there would be no guarantee that the results are accurate and clearly reflect the initial intention of the whole data gathering process.

## 2. Exploratory Data Analysis

### 2.1. Dataset Description and Variables

In this case study, a dataset called `heart_failure_clinical_records_dataset` is used for data analysis and data modelling. Data modelling is based on the formulated research questions made by the authors of this case study. The aim of the research questions is to aid the authors in creating a comprehensive analysis of the dataset to extract insights, answer the formulated questions, and draw conclusions about the contents of the dataset.

The dataset is available as a **.csv** file under the folder `dataset`.

As mentioned before, the dataset contains medical records of patients who have heart failure. The dataset consists of **299 observations**, meaning that a total of 299 patients participated in the data gathering, and these observations are represented as rows in the CSV file. The columns in the CSV file denote the variables of the dataset. In total, there are **13 variables**.

The description of each variable in the dataset is provided below.
* **`age`:** Refers to the age of the patient.
* **`anaemia`:** Indicates whether a patient has anemia (1) or not (0).
* **`creatinine_phosphokinase`:** Refers to the amount of the CPK enzyme in the blood – measured in mcg/L.
* **`diabetes`:** Indicates whether a patient has diabetes (1) or not (0).
* **`ejection_fraction`:** Refers to the percentage of blood that leaves the heart at each contraction.
* **`high_blood_pressure`:** Indicates whether a patient has high blood pressure (1) or not (0).
* **`platelets`:** Refers to the platelet count in blood – measured in kiloplatelets/mL.
* **`serum_creatinine`:** Refers to the amount of creatinine in the blood – measured in mg/dL.
* **`serum_sodium`:** Refers to the amount of sodium in the blood – measured in mEq/L.
* **`sex`:** Indicates whether a patient is male (1) or female (0).
* **`smoking`:** Indicates whether a patient is a smoker (1) or not (0).
* **`time`:** Refers to the number of days for the follow-up period of each patient.
* **`DEATH_EVENT`:** Indicates whether the patient died during the follow-up period (1) or was still alive (0).
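
The variable list above separates naturally into binary (0/1) indicators and numerical measurements. The sketch below groups the column names accordingly and checks that the binary columns hold only 0 or 1. The constant names and the helper `invalid_binary_columns` are assumptions introduced here for illustration, and the small `sample` frame contains made-up values, not patient data.

```python
import pandas as pd

# Column groups taken from the variable descriptions above.
BINARY_COLUMNS = [
    "anaemia", "diabetes", "high_blood_pressure",
    "sex", "smoking", "DEATH_EVENT",
]
NUMERICAL_COLUMNS = [
    "age", "creatinine_phosphokinase", "ejection_fraction",
    "platelets", "serum_creatinine", "serum_sodium", "time",
]

def invalid_binary_columns(df: pd.DataFrame) -> list:
    """Return the binary columns present in df that hold values other than 0 or 1."""
    return [
        col for col in BINARY_COLUMNS
        if col in df.columns and not df[col].isin([0, 1]).all()
    ]

# Made-up values for illustration only; "smoking" deliberately contains a 2.
sample = pd.DataFrame({"anaemia": [0, 1, 1], "smoking": [0, 2, 1]})
print(invalid_binary_columns(sample))
```

Once the dataset has been loaded, the same helper could be applied to the full DataFrame; an empty list would mean every binary column follows the 0/1 encoding described above.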

### 2.2. Importing Libraries

To properly visualize the dataset, libraries such as `numpy` and `pandas`, and function collections like `matplotlib.pyplot`, have to be imported.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 2.3. Loading the Dataset

Displaying the dataset requires the use of `pandas`. By calling the `read_csv` function, the dataset can be loaded as a pandas DataFrame.

In [2]:
heartfailure_df = pd.read_csv("dataset/heart_failure_clinical_records_dataset.csv")
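
If the notebook is run from a different working directory, the relative path above will not resolve. A minimal sketch of a more defensive loader is shown below; the function name `load_heart_failure_data` is an assumption made for illustration and is not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

def load_heart_failure_data(csv_path) -> pd.DataFrame:
    """Load the dataset, failing early with a clear message if the file is missing.

    Hypothetical helper; it wraps pd.read_csv with a path check.
    """
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    return pd.read_csv(csv_path)
```

It would be called with the same path as the cell above, for example `load_heart_failure_data("dataset/heart_failure_clinical_records_dataset.csv")`.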

To check whether the variables described earlier coincide with the actual content of the loaded dataset file, the `info` method is called. It displays general information about the dataset, including the index range, the number and names of the columns, the non-null count per column, and the data type of each column.

In [3]:
heartfailure_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
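
The `info` output above reports 299 non-null entries in each of the 13 columns. The same facts can be checked programmatically, together with a duplicate-row count that `info` does not report. The helper name `basic_checks` is an assumption for illustration, and the `sample` frame holds made-up values rather than patient data.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame) -> dict:
    """Summarize row count, column count, missing values, and duplicate rows."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }

# Made-up values: one missing age and one repeated row, to exercise each check.
sample = pd.DataFrame({"age": [60.0, 60.0, None], "sex": [1, 1, 0]})
print(basic_checks(sample))
```

Applied to `heartfailure_df`, `n_rows`, `n_columns`, and `n_missing` should come out as 299, 13, and 0 if they agree with the `info` output; the duplicate count is not shown by `info` and would have to be read from the result.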


### 2.4. Data Cleaning

## 3. Research Questions

## 4. Data Modelling

## 5. Insights and Conclusions

## 6. References
<pre>
[1] D. Chicco and G. Jurman, "Machine learning can predict survival of patients with heart failure from serum creatinine 
     and ejection fraction alone," BMC Med Inform Decis Mak, vol. 20, no. 1, Feb. 2020, doi: 10.1186/s12911-020-1023-5.  
[2] T. Ahmad, A. Munir, S. H. Bhatti, M. Aftab, and M. A. Raza, "Survival analysis of heart failure patients: A case 
     study," PLoS ONE, vol. 12, no. 7, Jul. 2017, doi: 10.1371/journal.pone.0181001.
</pre>