<div style='background-color:#4e9bf8; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Heart Failure Risk Analysis – Data Analytics Project</h1>
</div>

<h2 style='color:#4e9bf8;'>Objectives</h2>

- Load and preprocess the heart failure risk dataset.
- Perform exploratory data analysis (EDA) to understand data distribution and relationships.
- Identify key risk factors associated with heart failure.
- Store cleaned and processed data for further analysis and model training.
- Develop data visualizations to support insights.
- Build a linear regression model to predict future heart failure.



<h2 style='color:#4e9bf8;'>Inputs</h2>

- **Dataset:** `heart_dataset.csv` (https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)
- **Required Libraries:** Pandas, NumPy, Matplotlib, Seaborn, SciPy, Scikit-learn, Plotly
- **Columns of Interest:**
  - **Demographics:** Age, Sex
  - **Medical Indicators:** ChestPainType, RestingBP, FastingBS, Cholesterol, RestingECG, MaxHR, ExerciseAngina
- **Wireframe:** `Wireframe.png`

<h2 style='color:#4e9bf8;'>Outputs</h2>

- **Cleaned dataset:** Processed dataset stored as a CSV file for analysis (`Cleaned_Heartdataset.csv`).
- **Exploratory Data Analysis (EDA) Visuals:**
  - Distribution of heart failure risk across demographics.
  - Histograms: For features like Age, RestingBP, and Cholesterol to observe their distribution.
  - Box plots: To identify outliers in numerical data.
  - Bar charts: To compare categorical variables like Sex or ExerciseAngina with the target variable.
- **Feature-engineered dataset:** Enhanced dataset with new derived features.
- **Insights & Summary Reports:** Key findings documented for further decision-making.
- **PowerBI Dashboard:** `Heart Failure Risk Dashboard.pbix`

<h2 style='color:#4e9bf8;'>Additional Comments</h2>

- Ensure proper handling of missing, duplicated and outlier values to maintain data integrity.
- Perform bias detection to identify imbalances in demographic representation.
- Use visualization techniques to communicate insights effectively to both technical and non-technical stakeholders.




---

<div style='background-color:#4e9bf8; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Section 1: Data Extraction, Transformation, and Loading (ETL)</h1>
</div>

<h2 style='color:#4e9bf8;'>Changing work directory</h2>

To run the notebook in the editor, the working directory must be changed from the notebook's folder to its parent folder. We first get the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

's:\\Documents\\Code Institute\\vscode-projects\\Heart-Failure-Capstone\\Heart-Failure-Risk-Analysis\\jupyter_notebooks'

Then we make the parent of the current directory the new current directory by using:
  * `os.path.dirname()` to get the parent directory
  * `os.chdir()` to set the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory.")

You set a new current directory.


Confirming the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

's:\\Documents\\Code Institute\\vscode-projects\\Heart-Failure-Capstone\\Heart-Failure-Risk-Analysis'
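The same two steps can also be written with `pathlib`. The sketch below is a minimal alternative rather than the notebook's own code; the helper name `change_to_parent_dir` is hypothetical and introduced only for illustration.

```python
import os
from pathlib import Path


def change_to_parent_dir() -> Path:
    """Switch the working directory to the parent of the current one.

    Hypothetical helper for illustration; returns the new working directory.
    """
    parent = Path.cwd().parent  # equivalent to os.path.dirname(os.getcwd())
    os.chdir(parent)            # make the parent the new working directory
    return parent
```

Each call moves one level up, so it should be run only once per kernel session; running the cell again would move above the project root.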

<h2 style='color:#4e9bf8;'>Importing Libraries and Packages</h2>

Loading the Python packages used in this project: NumPy for numerical operations and array handling, Pandas for data manipulation and analysis, Matplotlib, Seaborn and Plotly for data visualisations, and SciPy's `chi2_contingency` for the chi-square tests.

In [20]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import plotly.express as px
from scipy.stats import chi2_contingency

Loading the CSV dataset into a DataFrame with the `pd.read_csv()` function

In [5]:
df = pd.read_csv("Inputs/heart_dataset.csv")  # forward slash avoids the invalid "\h" escape sequence
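A slightly more defensive way to load the file is sketched below. It assumes the dataset sits at `Inputs/heart_dataset.csv` relative to the working directory, as in the cell above; the helper name `load_dataset` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path: str) -> pd.DataFrame:
    """Read a CSV into a DataFrame, failing early with a clear message if it is missing."""
    path = Path(csv_path)  # forward slashes work on Windows, macOS and Linux
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path.resolve()}")
    return pd.read_csv(path)


# Example call (path assumed from the cell above):
# df = load_dataset("Inputs/heart_dataset.csv")
```

The explicit check reports the resolved path, which makes a wrong working directory easier to spot.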




<h2 style='color:#4e9bf8;'>Data Analysis</h2>

First, exploratory data analysis (EDA) is carried out to gain an initial understanding of the dataset. We start with the `.info()` method, which reports the column names, column data types, number of entries and memory usage.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


Getting a general overview of the dataset with .head() method

In [7]:
df.head()

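|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |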


Getting list of Column names in dataset

In [8]:
df.columns.tolist()

['Age',
 'Sex',
 'ChestPainType',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'RestingECG',
 'MaxHR',
 'ExerciseAngina',
 'Oldpeak',
 'ST_Slope',
 'HeartDisease']

Checking for missing values

In [9]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

Checking for any duplicate values

In [10]:
duplicate_check = df.duplicated().any()
print('There are duplicates:', duplicate_check)

There are duplicates: False


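Dropping any columns that are entirely empty (all NaN). The result is not assigned back to `df`, so this cell only previews the outcome; all 12 columns remain.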

In [25]:
df.dropna(axis=1, how='all')

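|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |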
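The separate checks for missing values, duplicates and empty columns could be gathered into one report. This is a minimal sketch under that assumption; `quality_report` is a hypothetical helper, and the small `demo` frame is made up for illustration and is not the heart dataset.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicate rows and fully empty columns."""
    return {
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "all_nan_columns": df.columns[df.isnull().all()].tolist(),
    }


# Tiny made-up frame, purely for illustration (not the heart dataset)
demo = pd.DataFrame({
    "Age": [40, 49, 40],
    "Sex": ["M", "F", "M"],
    "Empty": [None, None, None],
})
report = quality_report(demo)
print(report)
```

Unlike `df.dropna(axis=1, how='all')`, this only reports on the data and never returns a modified copy.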


Checking for unique values

In [11]:
unique_counts = df.nunique()
unique_table = pd.DataFrame({'Column': unique_counts.index, 'Unique Values': unique_counts.values})
unique_table

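|   | Column | Unique Values |
|---|--------|---------------|
| 0 | Age | 50 |
| 1 | Sex | 2 |
| 2 | ChestPainType | 4 |
| 3 | RestingBP | 67 |
| 4 | Cholesterol | 222 |
| 5 | FastingBS | 2 |
| 6 | RestingECG | 3 |
| 7 | MaxHR | 119 |
| 8 | ExerciseAngina | 2 |
| 9 | Oldpeak | 53 |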


Checking each Column's datatype

In [12]:
df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

Generating summary statistics for the numerical columns: count, mean, standard deviation (std), minimum, quartiles (the 50% column is the median) and maximum

In [13]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| Age | 918.0 | 53.510893 | 9.432617 | 28.0 | 47.0 | 54.0 | 60.0 | 77.0 |
| RestingBP | 918.0 | 132.396514 | 18.514154 | 0.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| Cholesterol | 918.0 | 198.799564 | 109.384145 | 0.0 | 173.25 | 223.0 | 267.0 | 603.0 |
| FastingBS | 918.0 | 0.233115 | 0.423046 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| MaxHR | 918.0 | 136.809368 | 25.460334 | 60.0 | 120.0 | 138.0 | 156.0 | 202.0 |
| Oldpeak | 918.0 | 0.887364 | 1.06657 | -2.6 | 0.0 | 0.6 | 1.5 | 6.2 |
| HeartDisease | 918.0 | 0.553377 | 0.497414 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


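The heart_dataset.csv file is relatively clean, with no missing, empty or duplicate values. However, the summary statistics show a minimum of 0 for RestingBP and Cholesterol, which is not physiologically plausible and suggests that some missing measurements were recorded as zero.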
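One way to follow up on the zero minimums reported for RestingBP and Cholesterol in the summary statistics is to count them and, optionally, mark them as missing. Treating zeros as missing is an assumption, not a step taken in the original notebook. The helper names `count_zeros` and `zeros_to_nan` are hypothetical, and the `demo` values are made up.

```python
import numpy as np
import pandas as pd


def count_zeros(df: pd.DataFrame, columns: list) -> pd.Series:
    """Count how many rows hold exactly 0 in each of the given columns."""
    return (df[columns] == 0).sum()


def zeros_to_nan(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy in which zeros in the given columns are marked as missing."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].replace(0, np.nan)
    return cleaned


# Made-up values for illustration only (not the heart dataset)
demo = pd.DataFrame({"RestingBP": [140, 0, 130], "Cholesterol": [289, 0, 0]})
zero_counts = count_zeros(demo, ["RestingBP", "Cholesterol"])
cleaned_demo = zeros_to_nan(demo, ["RestingBP", "Cholesterol"])
print(zero_counts)
print(cleaned_demo.isnull().sum())
```

Working on a copy keeps the original frame intact, so the effect of the replacement can be compared before deciding how to impute or drop the affected rows.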

Carrying out frequency counts to understand the distribution of five categorical variables (Sex, ChestPainType, RestingECG, ExerciseAngina and HeartDisease)

In [14]:
categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'HeartDisease']
for col in categorical_features:
    print(f"\nFrequency counts for {col}:")
    print(df[col].value_counts())


Frequency counts for Sex:
Sex
M    725
F    193
Name: count, dtype: int64

Frequency counts for ChestPainType:
ChestPainType
ASY    496
NAP    203
ATA    173
TA      46
Name: count, dtype: int64

Frequency counts for RestingECG:
RestingECG
Normal    552
LVH       188
ST        178
Name: count, dtype: int64

Frequency counts for ExerciseAngina:
ExerciseAngina
N    547
Y    371
Name: count, dtype: int64

Frequency counts for HeartDisease:
HeartDisease
1    508
0    410
Name: count, dtype: int64
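Raw counts are easier to compare when shown next to percentages, which also helps with spotting imbalances in demographic representation. The sketch below is one possible way to do this; `frequency_table` is a hypothetical helper, and the example series is rebuilt from the Sex counts printed above (725 M, 193 F).

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Return counts and percentage shares for each category of a series."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True).mul(100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})


# Series rebuilt from the Sex counts shown above
sex = pd.Series(["M"] * 725 + ["F"] * 193, name="Sex")
sex_table = frequency_table(sex)
print(sex_table)
```

For Sex this gives roughly 79% male and 21% female records, an imbalance worth keeping in mind when comparing heart disease rates between the two groups.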


Performing bivariate analysis to explore how each categorical feature relates to heart disease, by calculating the proportions of individuals with and without heart disease within each category

In [16]:
# Defining the target variable and categorical variables
target = 'HeartDisease'  
categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina']  

# -------------------------
# Calculating Proportions
# -------------------------
print("Proportions of Heart Disease vs No Heart Disease for Each Categorical Variable:")
for col in categorical_features:
    proportion = df.groupby(col)[target].value_counts(normalize=True).unstack()
    print(f"\nProportions for '{col}':\n")
    print(proportion)

Proportions of Heart Disease vs No Heart Disease for Each Categorical Variable:

Proportions for 'Sex':

HeartDisease         0         1
Sex                             
F             0.740933  0.259067
M             0.368276  0.631724

Proportions for 'ChestPainType':

HeartDisease          0         1
ChestPainType                    
ASY            0.209677  0.790323
ATA            0.861272  0.138728
NAP            0.645320  0.354680
TA             0.565217  0.434783

Proportions for 'RestingECG':

HeartDisease         0         1
RestingECG                      
LVH           0.436170  0.563830
Normal        0.483696  0.516304
ST            0.342697  0.657303

Proportions for 'ExerciseAngina':

HeartDisease           0         1
ExerciseAngina                    
N               0.648995  0.351005
Y               0.148248  0.851752
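The same row-wise proportions can be obtained with `pd.crosstab` and `normalize="index"`. This is an alternative sketch, not the notebook's method; `disease_rate_by` is a hypothetical helper and the `demo` frame is made up.

```python
import pandas as pd


def disease_rate_by(df: pd.DataFrame, feature: str, target: str = "HeartDisease") -> pd.DataFrame:
    """Share of each target class within every category of `feature` (rows sum to 1)."""
    return pd.crosstab(df[feature], df[target], normalize="index")


# Made-up records for illustration only (not the heart dataset)
demo = pd.DataFrame({
    "Sex": ["M", "M", "M", "F"],
    "HeartDisease": [1, 1, 0, 0],
})
rates = disease_rate_by(demo, "Sex")
print(rates)
```

Because each row sums to 1, the column for class 1 can be read directly as the heart disease rate within that category.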


Performing a chi-square test of independence to check whether each categorical variable has a statistically significant association with heart disease (significance level 0.05)

In [24]:
print("Chi-Square Test Results:")
for col in categorical_features:
    # Create a contingency table
    contingency_table = pd.crosstab(df[col], df[target])

    # Perform chi-square test
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Print the results
    print(f"\nVariable: {col}")
    print(f"Chi-Square Statistic: {chi2}")
    print(f"p-value: {p}")
    print(f"Degrees of Freedom: {dof}")
    print("Expected Frequencies:\n", expected)

    # Interpretation
    if p < 0.05:
        print(f"\n'{col}' has a statistically significant association with '{target}'.")
    else:
        print(f"\n'{col}' does not have a statistically significant association with '{target}'.")

Chi-Square Test Results:

Variable: Sex
Chi-Square Statistic: 84.14510134633775
p-value: 4.597617450809164e-20
Degrees of Freedom: 1
Expected Frequencies:
 [[ 86.19825708 106.80174292]
 [323.80174292 401.19825708]]

'Sex' has a statistically significant association with 'HeartDisease'.

Variable: ChestPainType
Chi-Square Statistic: 268.06723902181767
p-value: 8.08372842808765e-58
Degrees of Freedom: 3
Expected Frequencies:
 [[221.52505447 274.47494553]
 [ 77.26579521  95.73420479]
 [ 90.66448802 112.33551198]
 [ 20.54466231  25.45533769]]

'ChestPainType' has a statistically significant association with 'HeartDisease'.

Variable: RestingECG
Chi-Square Statistic: 10.931469339140978
p-value: 0.0042292328167544925
Degrees of Freedom: 2
Expected Frequencies:
 [[ 83.96514161 104.03485839]
 [246.53594771 305.46405229]
 [ 79.49891068  98.50108932]]

'RestingECG' has a statistically significant association with 'HeartDisease'.

Variable: ExerciseAngina
Chi-Square Statistic: 222.25938271530583
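A p-value shows whether an association exists but not how strong it is. One common complement is Cramér's V, computed as sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below is an addition and not part of the original notebook; `cramers_v` is a hypothetical helper, and the two example series are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V effect size (0 = no association, 1 = perfect association).

    Assumes both variables have at least two distinct categories.
    """
    table = pd.crosstab(x, y)
    # correction=False skips Yates' continuity correction, so V can reach 1 for 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up data: the first pair is perfectly associated, the second is independent
perfect = cramers_v(pd.Series(["x"] * 10 + ["y"] * 10), pd.Series([0] * 10 + [1] * 10))
independent = cramers_v(pd.Series(["x", "x", "y", "y"]), pd.Series([0, 1, 0, 1]))
print(perfect, independent)
```

Applied to the notebook's data, a call such as `cramers_v(df["Sex"], df["HeartDisease"])` would put the four categorical variables on a common 0 to 1 scale.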


<h2 style='color:#4e9bf8;'>Data transformation and loading</h2>

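Converting the categorical variables to numerical columns using one-hot encoding, so that they can be used in the correlation analysis and visualisations. With `drop_first=True`, the first category of each variable is dropped and acts as the reference category.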

In [None]:
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df.head()
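To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-row frame. The values only mimic the dataset's columns; the categorical columns are listed explicitly here, and `dtype=int` is an assumption used to get 0/1 integers.

```python
import pandas as pd

# Made-up rows for illustration only (not the heart dataset)
demo = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
})

categorical_cols = ["Sex", "ChestPainType"]
encoded = pd.get_dummies(demo, columns=categorical_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Sex_M', 'ChestPainType_ATA', 'ChestPainType_NAP']
```

The alphabetically first category of each variable (`F` and `ASY`) has no column of its own; a row with zeros in all of a variable's dummy columns belongs to that reference category.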

Checking that the categorical columns have been converted to numerical columns

In [None]:
info_table = pd.DataFrame({
                          "Column": df.columns,                      # Column names
                          "Non-Null Count": df.notnull().sum(),      # Non-null counts
                          "Data Type": df.dtypes                     # Data types of each column
                          }).reset_index(drop=True)
info_table

Creating a correlation matrix to assess relationships between numerical variables and the target variable (HeartDisease)

In [None]:
correlation_matrix = df.corr()
correlation_matrix
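Reading a full matrix is easier when the target's column is pulled out and ranked by absolute value. The sketch below is a minimal illustration, assuming every column is numeric or boolean (as after one-hot encoding); `correlations_with_target` is a hypothetical helper and the `demo` frame is made up.

```python
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str = "HeartDisease") -> pd.Series:
    """Pearson correlation of every column with `target`, strongest first."""
    corr = df.astype(float).corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index  # rank by strength, keep the sign
    return corr.loc[order]


# Made-up numbers for illustration only (not the heart dataset)
demo = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "z": [4, 3, 2, 1],
    "w": [1, 0, 0, 1],
    "y": [1, 2, 3, 4],
})
ranked = correlations_with_target(demo, target="y")
print(ranked)
```

In the notebook, `correlations_with_target(df)` would list the features from strongest to weakest correlation with HeartDisease, which makes it easier to check any verbal summary against the actual coefficients.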

Summary of key insights from the correlation matrix above (approximate Pearson correlations with HeartDisease):

1. **Moderate Positive Correlation (around 0.4)**:
    - **Oldpeak**: A higher Oldpeak value is moderately associated with a higher likelihood of heart disease.

2. **Weak Positive Correlation (roughly 0.1 to 0.3)**:
    - **Age**: Older age shows a weak positive correlation with heart disease.
    - **FastingBS**: Elevated fasting blood sugar is weakly associated with heart disease.
    - **RestingBP**: Higher resting blood pressure shows only a very weak positive correlation with heart disease.

3. **Moderate Negative Correlation (around -0.4)**:
    - **MaxHR**: A higher maximum heart rate achieved during exercise is moderately associated with a lower likelihood of heart disease.

4. **Weak Negative Correlation (roughly -0.2)**:
    - **Cholesterol**: Cholesterol shows a weak negative correlation with heart disease, which is probably distorted by the zero values recorded in this column.

None of the numerical features reaches the strong range (absolute correlation of 0.8 or more). These insights help in identifying which features are more important for predicting heart disease and can guide further analysis and model development.

---