<div style='background-color:#4e9bf8; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Heart Failure Risk Analysis – Data Analytics Project</h1>
</div>

<h2 style='color:#4e9bf8;'>Objectives</h2>

- Load and preprocess the heart failure risk dataset.
- Perform exploratory data analysis (EDA) to understand data distribution and relationships.
- Identify key risk factors associated with heart failure.
- Store cleaned and processed data for further analysis and model training.
- Develop data visualizations to support insights.
- Build a linear regression model to predict heart failure.



<h2 style='color:#4e9bf8;'>Inputs</h2>

- **Dataset:** `heart_dataset.csv` (https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)
- **Required Libraries:** Pandas, NumPy, Matplotlib, Seaborn, SciPy, Scikit-learn, Plotly
- **Columns of Interest:**
  - **Demographics:** Age, Sex
  - **Medical Indicators:** ChestPainType, RestingBP, FastingBS, Cholesterol, RestingECG, MaxHR, ExerciseAngina

<h2 style='color:#4e9bf8;'>Outputs</h2>

- **Cleaned dataset:** Processed dataset stored as a CSV file for analysis (`cleaned_heartdataset.csv`).
- **Exploratory Data Analysis (EDA):**
  - Distribution of heart failure risk across demographics.
  - Distribution of features such as Age, RestingBP, and Cholesterol.
  - Outliers identified in numerical data.
  - Comparison of categorical variables with the target variable.
- **Feature-engineered dataset:** Enhanced dataset with new derived features.
- **Insights & Summary Reports:** Key findings documented for further decision-making.

<h2 style='color:#4e9bf8;'>Additional Comments</h2>

- Ensure proper handling of missing, duplicated and outlier values to maintain data integrity.
- Perform bias detection to identify imbalances in demographic representation.




---

<div style='background-color:#4e9bf8; padding: 15px; border-radius: 5px;'>
<h1 style='color:#FEFEFE; text-align:center;'>Section 1 :  Data Extraction, Transformation, and Loading (ETL)</h1>
</div>

<h2 style='color:#4e9bf8;'>Changing work directory</h2>

To run the notebook in the editor, the working directory needs to be changed from the notebook's folder to its parent folder. We first get the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

's:\\Documents\\Code Institute\\vscode-projects\\Heart-Failure-Capstone\\Heart-Failure-Risk-Analysis\\jupyter_notebooks'

Then we make the parent of the current directory the new current directory by using:
  * `os.path.dirname()` to get the parent directory
  * `os.chdir()` to set the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory.")

You set a new current directory.


Confirming the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

's:\\Documents\\Code Institute\\vscode-projects\\Heart-Failure-Capstone\\Heart-Failure-Risk-Analysis'
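The same step can also be written with `pathlib`. This is only a minimal sketch of an equivalent approach, not the notebook's own code; the names `notebook_dir` and `project_root` are illustrative.

```python
import os
from pathlib import Path

# Directory the notebook is currently running from
notebook_dir = Path.cwd()

# Its parent folder, equivalent to os.path.dirname(os.getcwd())
project_root = notebook_dir.parent

# Switch to the parent folder and confirm the change
os.chdir(project_root)
print(Path.cwd())
```

Running the cell more than once keeps moving one level up, so it is worth confirming the printed path before loading any files.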


<h2 style='color:#4e9bf8;'>Importing Libraries and Packages</h2>

Loading the Python packages used throughout this project: NumPy for numerical operations and array handling, Pandas for data manipulation and analysis, Matplotlib, Seaborn and Plotly for data visualisations, SciPy for statistical tests, and Scikit-learn for scaling and PCA.

In [4]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import plotly.express as px
from scipy.stats import chi2_contingency, kurtosis, skew
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Loading the CSV dataset and reading it into a DataFrame with the `pd.read_csv()` function

In [5]:
df = pd.read_csv("Inputs/heart_dataset.csv")
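Backslashes inside an ordinary Python string literal can be read as escape sequences, so Windows-style paths are easy to get wrong. The sketch below shows two common ways around this; `csv_path` and `raw_path` are illustrative names, and the file itself is not opened here.

```python
from pathlib import Path, PureWindowsPath

# Build the path from its parts; pathlib inserts the right separator for the OS
csv_path = Path("Inputs") / "heart_dataset.csv"
print(csv_path.as_posix())

# A raw string keeps a Windows-style backslash as a literal character
raw_path = r"Inputs\heart_dataset.csv"
print(PureWindowsPath(raw_path).parts)
```

`pd.read_csv()` accepts a `Path` object as well as a string, so `csv_path` could be passed to it directly.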


<h2 style='color:#4e9bf8;'>Data Analysis</h2>

<h3 style='color:#4e9bf8;'>Exploratory Data Analysis (EDA)</h3>

First, exploratory data analysis (EDA) is carried out to gain an initial understanding of the dataset. We start by checking general information about the data, such as column names, column datatypes, number of entries and memory usage, with the `.info()` method.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


Getting a general overview of the dataset with .head() method

In [7]:
df.head()

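| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |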


Getting list of Column names in dataset

In [8]:
df.columns.tolist()

['Age',
 'Sex',
 'ChestPainType',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'RestingECG',
 'MaxHR',
 'ExerciseAngina',
 'Oldpeak',
 'ST_Slope',
 'HeartDisease']

Checking for missing values

In [9]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

Checking for any duplicate rows

In [10]:
duplicate_check = df.duplicated().any()
print('There are duplicates:', duplicate_check)

There are duplicates: False


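Checking for columns that contain only NaN or empty values. `dropna(axis=1, how='all')` returns a copy of the DataFrame without such columns; all 12 columns remain, so none are empty.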

In [11]:
df.dropna(axis=1, how='all')

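| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |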
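To make the behaviour of `dropna(axis=1, how='all')` concrete, here is a small sketch on a toy DataFrame (made-up values, not the heart dataset). It lists the columns that are entirely missing and shows that `dropna` returns a new frame rather than changing the original.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is entirely NaN, column "c" only partly
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

# Names of the columns whose values are all missing
all_nan_cols = toy.columns[toy.isna().all()].tolist()
print(all_nan_cols)

# dropna returns a new frame; toy itself is unchanged
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Column "c" is kept because `how="all"` only removes columns with no valid values at all.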


Checking for unique values

In [12]:
unique_counts = df.nunique()
unique_table = pd.DataFrame({'Column': unique_counts.index, 'Unique Values': unique_counts.values})
unique_table

| | Column | Unique Values |
|---|---|---|
| 0 | Age | 50 |
| 1 | Sex | 2 |
| 2 | ChestPainType | 4 |
| 3 | RestingBP | 67 |
| 4 | Cholesterol | 222 |
| 5 | FastingBS | 2 |
| 6 | RestingECG | 3 |
| 7 | MaxHR | 119 |
| 8 | ExerciseAngina | 2 |
| 9 | Oldpeak | 53 |


Checking each Column's datatype

In [13]:
df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

Generating summary statistics for the numerical columns: total count of entries, mean, standard deviation (std), minimum, quartiles (the 50% column is the median) and maximum

In [14]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 918.0 | 53.510893 | 9.432617 | 28.0 | 47.0 | 54.0 | 60.0 | 77.0 |
| RestingBP | 918.0 | 132.396514 | 18.514154 | 0.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| Cholesterol | 918.0 | 198.799564 | 109.384145 | 0.0 | 173.25 | 223.0 | 267.0 | 603.0 |
| FastingBS | 918.0 | 0.233115 | 0.423046 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| MaxHR | 918.0 | 136.809368 | 25.460334 | 60.0 | 120.0 | 138.0 | 156.0 | 202.0 |
| Oldpeak | 918.0 | 0.887364 | 1.06657 | -2.6 | 0.0 | 0.6 | 1.5 | 6.2 |
| HeartDisease | 918.0 | 0.553377 | 0.497414 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
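The summary shows a minimum of 0 for both RestingBP and Cholesterol, which is worth inspecting before any outlier handling. The sketch below counts zero readings per column on a toy DataFrame (hypothetical values, not the heart dataset). Treating 0 as a placeholder for a missing measurement is an assumption here, not something the dataset documents.

```python
import pandas as pd

# Toy rows with hypothetical values
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

# Count zero readings per column
zero_counts = (toy[["RestingBP", "Cholesterol"]] == 0).sum()
print(zero_counts)

# Median of the non-zero readings, one possible replacement value
nonzero_median = toy.loc[toy["Cholesterol"] > 0, "Cholesterol"].median()
print(nonzero_median)
```

The same two lines could be applied to `df` to see how many rows are affected before deciding whether to replace, drop, or keep them.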


Checking the distribution of the categorical variables (Sex, ChestPainType, RestingECG, ExerciseAngina) and of the target variable (HeartDisease)

In [15]:
categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'HeartDisease']
for col in categorical_features:
    print(f"\nFrequency counts for {col}:")
    print(df[col].value_counts())


Frequency counts for Sex:
Sex
M    725
F    193
Name: count, dtype: int64

Frequency counts for ChestPainType:
ChestPainType
ASY    496
NAP    203
ATA    173
TA      46
Name: count, dtype: int64

Frequency counts for RestingECG:
RestingECG
Normal    552
LVH       188
ST        178
Name: count, dtype: int64

Frequency counts for ExerciseAngina:
ExerciseAngina
N    547
Y    371
Name: count, dtype: int64

Frequency counts for HeartDisease:
HeartDisease
1    508
0    410
Name: count, dtype: int64


Checking the skewness and tailedness of the numerical columns using `skew()` and `kurtosis()` from `scipy.stats`. These measures describe the shape of a distribution: how asymmetric it is and how heavy its tails are compared with a normal distribution.

In [16]:
# Select numerical columns
numerical_features = df.select_dtypes(include=[float, int]).columns

# Calculate skewness and kurtosis
print("\nSkewness and Kurtosis for Numerical Features")
results = []
for col in numerical_features:
    col_skewness = skew(df[col].dropna())  # Drop NaN values for calculation
    col_kurtosis = kurtosis(df[col].dropna())  # Drop NaN values for calculation
    results.append({'Feature': col, 'Skewness': col_skewness, 'Kurtosis': col_kurtosis})

# Create a DataFrame to display the results neatly
result_df = pd.DataFrame(results)
result_df


Skewness and Kurtosis for Numerical Features


| | Feature | Skewness | Kurtosis |
|---|---|---|---|
| 0 | Age | -0.195613 | -0.390568 |
| 1 | RestingBP | 0.179545 | 3.246932 |
| 2 | Cholesterol | -0.609089 | 0.111037 |
| 3 | FastingBS | 1.262417 | -0.406303 |
| 4 | MaxHR | -0.144123 | -0.452339 |
| 5 | Oldpeak | 1.0212 | 1.189992 |
| 6 | HeartDisease | -0.214735 | -1.953889 |


**Summary of Skewness and Kurtosis for Numerical Features**

**Skewness (Symmetry of Data Distribution)**
- Negative Skewness (< 0): Data is left-skewed (longer left tail).
- Positive Skewness (> 0): Data is right-skewed (longer right tail).

  ***Observations:***

     - Age, RestingBP, MaxHR and HeartDisease are nearly symmetrical (absolute skewness below 0.25).
     - FastingBS (1.26) and Oldpeak (1.02) are strongly right-skewed. For Oldpeak this indicates a tail of high values. FastingBS is a binary 0/1 flag, so its skewness reflects that only about 23% of entries are 1.
     - Cholesterol has a moderate left skew (-0.61), meaning there is a tail of low values. This is consistent with the minimum of 0 shown by `describe()`.

**Kurtosis (Tailedness of Data Distribution)**

`scipy.stats.kurtosis()` returns excess (Fisher) kurtosis by default, so a normal distribution scores 0 rather than 3.
- Kurtosis < 0 (Platykurtic): Lighter tails than a normal distribution, fewer extreme values.
- Kurtosis ≈ 0 (Mesokurtic): Normal-like tails.
- Kurtosis > 0 (Leptokurtic): Heavier tails than a normal distribution, more extreme values.

  ***Observations:***

     - Age, FastingBS and MaxHR have slightly negative kurtosis and Cholesterol (0.11) is close to 0, so these features do not have many extreme values.
     - RestingBP (3.25) is clearly leptokurtic and Oldpeak (1.19) moderately so, indicating the presence of extreme values.
     - HeartDisease has the lowest kurtosis (-1.95). This is expected for a binary variable with a near-even split and does not describe a continuous distribution.

**Key Findings**
- Most continuous features are roughly symmetric, but Oldpeak is right-skewed and Cholesterol is left-skewed by a tail of low values.
- RestingBP has heavy tails, meaning some extreme values exist (its minimum is 0).
- HeartDisease is a binary target with a fairly balanced split (55.3% positive cases).
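When reading kurtosis values it matters which definition is in use. The sketch below compares the two definitions that `scipy.stats.kurtosis()` supports, using a toy balanced binary sample rather than the heart dataset; `excess` and `pearson` are illustrative names.

```python
import numpy as np
from scipy.stats import kurtosis

# Balanced binary sample: half zeros, half ones
binary = np.array([0, 1] * 50)

excess = kurtosis(binary)                 # Fisher definition (default), normal = 0
pearson = kurtosis(binary, fisher=False)  # Pearson definition, normal = 3

print(excess, pearson)
```

The two values always differ by exactly 3, so a threshold of 3 applies only when `fisher=False` is passed. A balanced binary variable sits at the low end of the scale under either definition.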

<h3 style='color:#4e9bf8;'>Advanced Analyses</h3>

<h4 style='color:#4e9bf8;'>Univariate Analysis</h4>

Performing univariate analysis to explore individual features and their relationship to heart disease by calculating the proportions of individuals with and without heart failure for each categorical variable

In [17]:
# Defining the target variable and categorical variables
target = 'HeartDisease'  
categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina']  

# Calculating Proportions

print("Proportions of Heart Disease vs No Heart Disease for Each Categorical Variable:")
for col in categorical_features:
    proportion = df.groupby(col)[target].value_counts(normalize=True).unstack()
    print(f"\nProportions for '{col}':\n")
    print(proportion)

Proportions of Heart Disease vs No Heart Disease for Each Categorical Variable:

Proportions for 'Sex':

HeartDisease         0         1
Sex                             
F             0.740933  0.259067
M             0.368276  0.631724

Proportions for 'ChestPainType':

HeartDisease          0         1
ChestPainType                    
ASY            0.209677  0.790323
ATA            0.861272  0.138728
NAP            0.645320  0.354680
TA             0.565217  0.434783

Proportions for 'RestingECG':

HeartDisease         0         1
RestingECG                      
LVH           0.436170  0.563830
Normal        0.483696  0.516304
ST            0.342697  0.657303

Proportions for 'ExerciseAngina':

HeartDisease           0         1
ExerciseAngina                    
N               0.648995  0.351005
Y               0.148248  0.851752


**Summary of Univariate Analysis for Categorical Features:**

**1. Sex and Heart Disease**
- Females (F): 74.1% do not have heart disease, while 25.9% have it.
- Males (M): 63.2% have heart disease, while 36.8% do not.

***Observation:*** Males have a significantly higher proportion of heart disease cases than females.

**2. Chest Pain Type and Heart Disease**
- Atypical Angina (ATA): 86.1% do not have heart disease, while 13.9% do.
- Non-Anginal Pain (NAP): 64.5% do not have heart disease, while 35.5% do.
- Typical Angina (TA): 56.5% do not have heart disease, while 43.5% do.
- Asymptomatic (ASY): 79.0% have heart disease, while only 21.0% do not.
   
***Observation:*** The ASY (Asymptomatic) chest pain type is highly associated with heart disease (79.0% cases), while ATA (Atypical Angina) is the least likely to be associated with heart disease.

**3. Resting ECG and Heart Disease**
- LVH (Left Ventricular Hypertrophy): 56.4% have heart disease.
- Normal ECG: 51.6% have heart disease.
- ST-T wave abnormality (ST): 65.7% have heart disease.

***Observation:*** ST abnormalities are most associated with heart disease (65.7%), while normal ECG results have the lowest proportion of heart disease cases (51.6%).

**4. Exercise-Induced Angina and Heart Disease**
- No Exercise Angina (N): 64.9% do not have heart disease, while 35.1% do.
- Exercise Angina Present (Y): 85.2% have heart disease, while only 14.8% do not.
   
***Observation:*** People who experience exercise-induced angina (Y) are at a much higher risk of heart disease (85.2%).


**Key Takeaways:**

- Males have a higher prevalence of heart disease
- ASY (Asymptomatic) chest pain type is a strong indicator of heart disease
- ST-T wave abnormality in ECG is associated with heart disease
- Exercise-induced angina is a significant risk factor (85.2% with heart disease)
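The row-wise proportions used in this section can also be produced with `pd.crosstab(..., normalize="index")`. The sketch below runs both approaches on a toy DataFrame (hypothetical values, not the heart dataset) to show that they give the same table.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "ExerciseAngina": ["N", "N", "N", "N", "Y", "Y"],
    "HeartDisease":   [0,   0,   0,   1,   1,   0],
})

# Approach used in the notebook: groupby + value_counts + unstack
via_groupby = (
    toy.groupby("ExerciseAngina")["HeartDisease"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

# Alternative: crosstab with each row normalised to sum to 1
via_crosstab = pd.crosstab(
    toy["ExerciseAngina"], toy["HeartDisease"], normalize="index"
)
print(via_crosstab)
```

Each row sums to 1, so the values read as "share of this category with or without heart disease", which is how the percentages above are interpreted.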




<h4 style='color:#4e9bf8;'>Chi-Square Tests</h4>

Checking for demographic bias based on gender and age using the chi-square test of independence, a statistical test of whether there is a significant association between two categorical variables. It compares the observed counts in each combination of categories with the counts expected if the variables were independent.

A new AgeGroup column is also added to the DataFrame to divide age into age bands, for clearer visualisations.

In [18]:
# Checking for bias in the dataset regarding gender and representation

# 1. Gender distribution analysis
gender_dist = df['Sex'].value_counts(normalize=True)

# 2. Age distribution analysis
age_bins = [0, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_labels = ['0-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)
age_dist = df['AgeGroup'].value_counts(normalize=True)

# 3. Diagnosis distribution by gender and age
diagnosis_dist = pd.crosstab([df['Sex'], df['AgeGroup']], df['HeartDisease'], normalize='index')

# 4. Statistical significance test
contingency_table = pd.crosstab([df['Sex'], df['AgeGroup']], df['HeartDisease'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Displaying the results
print("Gender Distribution:\n", gender_dist)
print("\nAge Distribution:\n", age_dist)
print("\nDiagnosis Proportions by Gender and Age:\n", diagnosis_dist)
print(f"\nChi-squared test p-value: {p:.4f}")

# Interpretation
if p < 0.05:
    print("\nThere is a statistically significant association between gender, age, and heart disease.")
else:
    print("\nThere is no statistically significant association between gender, age, and heart disease.")

Gender Distribution:
 Sex
M    0.78976
F    0.21024
Name: proportion, dtype: float64

Age Distribution:
 AgeGroup
50-59    0.407407
60-69    0.241830
40-49    0.229847
30-39    0.082789
70-79    0.033769
20-29    0.004357
0-19     0.000000
80-89    0.000000
90-99    0.000000
Name: proportion, dtype: float64

Diagnosis Proportions by Gender and Age:
 HeartDisease         0         1
Sex AgeGroup                    
F   30-39     0.842105  0.157895
    40-49     0.884615  0.115385
    50-59     0.722222  0.277778
    60-69     0.545455  0.454545
    70-79     0.833333  0.166667
M   30-39     0.596491  0.403509
    40-49     0.503145  0.496855
    50-59     0.364238  0.635762
    60-69     0.196629  0.803371
    70-79     0.160000  0.840000
    20-29     1.000000  0.000000

Chi-squared test p-value: 0.0000

There is a statistically significant association between gender, age, and heart disease.


**Summary of Chi-Square Test Findings**

**Gender Distribution:**
- Males (79%) dominate the dataset, while females make up only 21%
- Heart disease is more prevalent in males (63.2%) compared to females (25.9%)
- Females are more likely to be free from heart disease (74.1%) than males (36.8%)

**Age Distribution:**
- The largest age group is 50-59 years (40.7%), followed by 60-69 years (24.2%) and 40-49 years (22.9%)
- Very few individuals are under 30 years (0.4%) or aged 70 and over (3.4%)

**Heart Disease by Gender & Age:**

- Women **under 60** have a lower likelihood of heart disease (below 28%)
- Women aged **60-69** show a higher risk (45.5%), but it drops again for 70-79 (16.7%)

- Men have a much higher risk of heart disease, especially as they age:
  - Men in their **20s** show no cases of heart disease
  - Ages **30-39:** 40.4% have heart disease
  - Ages **40-49:** Nearly half (49.7%) are affected
  - Ages **50-59:** Risk increases to 63.6%
  - Ages **60-69:** A striking 80.3% have heart disease
  - Ages **70-79:** The highest risk, with 84% affected
  

**Chi-Square Test Results:** The p-value is below 0.0001 (it prints as 0.0000 at four decimal places). This indicates a statistically significant relationship between age, gender and heart disease.

**Key findings:**
- Heart disease risk increases with age, especially for men.
- Women are at lower risk overall but see an increase in their 60s.
- The association between gender, age, and heart disease is highly significant, so it is very unlikely to be due to random chance.
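The age bands rely on `pd.cut(..., right=False)`, which decides which band a boundary age such as 30 or 40 falls into. The sketch below uses a few illustrative ages and a shortened set of bins to show the difference from the default setting.

```python
import pandas as pd

ages = pd.Series([29, 30, 39, 40])
bins = [20, 30, 40, 50]
labels = ["20-29", "30-39", "40-49"]

# right=False closes each bin on the left: [20, 30), [30, 40), [40, 50)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
print(left_closed.tolist())

# The default right=True closes each bin on the right: (20, 30], (30, 40], (40, 50]
right_closed = pd.cut(ages, bins=bins, labels=labels)
print(right_closed.tolist())
```

With `right=False` an age of 30 lands in "30-39", which matches the labels. With the default it would be placed in "20-29".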

Performing chi-square test to check the statistical significance of categorical variables

In [19]:
print("Chi-Square Test Results:")
for col in categorical_features:
    # Create a contingency table
    contingency_table = pd.crosstab(df[col], df[target])

    # Perform chi-square test
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Print the results
    print(f"\nVariable: {col}")
    print(f"Chi-Square Statistic: {chi2}")
    print(f"p-value: {p}")
    print(f"Degrees of Freedom: {dof}")
    print("Expected Frequencies:\n", expected)

    # Interpretation
    if p < 0.05:
        print(f"\n'{col}' has a statistically significant association with '{target}'.")
    else:
        print(f"\n'{col}' does not have a statistically significant association with '{target}'.")

Chi-Square Test Results:

Variable: Sex
Chi-Square Statistic: 84.14510134633775
p-value: 4.597617450809164e-20
Degrees of Freedom: 1
Expected Frequencies:
 [[ 86.19825708 106.80174292]
 [323.80174292 401.19825708]]

'Sex' has a statistically significant association with 'HeartDisease'.

Variable: ChestPainType
Chi-Square Statistic: 268.06723902181767
p-value: 8.08372842808765e-58
Degrees of Freedom: 3
Expected Frequencies:
 [[221.52505447 274.47494553]
 [ 77.26579521  95.73420479]
 [ 90.66448802 112.33551198]
 [ 20.54466231  25.45533769]]

'ChestPainType' has a statistically significant association with 'HeartDisease'.

Variable: RestingECG
Chi-Square Statistic: 10.931469339140978
p-value: 0.0042292328167544925
Degrees of Freedom: 2
Expected Frequencies:
 [[ 83.96514161 104.03485839]
 [246.53594771 305.46405229]
 [ 79.49891068  98.50108932]]

'RestingECG' has a statistically significant association with 'HeartDisease'.

Variable: ExerciseAngina
Chi-Square Statistic: 222.25938271530583
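To show where the expected frequencies come from, the sketch below rebuilds the Sex by HeartDisease table and computes them by hand. The observed counts are reconstructed from the group sizes (193 F, 725 M) and the proportions reported earlier, and `expected_manual` is an illustrative name.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: F, M; columns: HeartDisease 0, 1),
# reconstructed from the group sizes and proportions reported earlier
observed = np.array([[143, 50],
                     [267, 458]])

row_totals = observed.sum(axis=1, keepdims=True)  # 193, 725
col_totals = observed.sum(axis=0, keepdims=True)  # 410, 508
n = observed.sum()                                # 918

# Expected count under independence: row total * column total / n
expected_manual = row_totals * col_totals / n

chi2, p, dof, expected = chi2_contingency(observed)
print(expected_manual)
print(chi2, dof)
```

For a 2x2 table `chi2_contingency` applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected chi-square value.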


<h2 style='color:#4e9bf8;'>Data Transformation and Loading</h2>

Converting categorical variables to numerical using one-hot encoding for visualisations 

In [20]:
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df.head()

| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | AgeGroup | Sex_M | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ExerciseAngina_Y | ST_Slope_Flat | ST_Slope_Up |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 | 40-49 | True | True | False | False | True | False | False | False | True |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 | 40-49 | False | False | True | False | True | False | False | True | False |
| 2 | 37 | 130 | 283 | 0 | 98 | 0.0 | 0 | 30-39 | True | True | False | False | False | True | False | False | True |
| 3 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 | 40-49 | False | False | False | False | True | False | True | True | False |
| 4 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 | 50-59 | True | False | True | False | True | False | False | False | True |
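With `drop_first=True`, the first category of each variable (in sorted order) gets no column of its own and becomes the baseline. The sketch below shows this on a toy column containing the three RestingECG categories; the row values are made up.

```python
import pandas as pd

# Toy column with the three RestingECG categories
toy = pd.DataFrame({"RestingECG": ["Normal", "ST", "LVH", "Normal"]})

encoded = pd.get_dummies(toy, columns=["RestingECG"], drop_first=True)
print(encoded.columns.tolist())

# Rows where every indicator is False belong to the dropped baseline category
is_baseline = ~encoded.any(axis=1)
print(toy.loc[is_baseline, "RestingECG"].tolist())
```

This is why the encoded dataset has `RestingECG_Normal` and `RestingECG_ST` but no `RestingECG_LVH` column: an LVH row is the one where both indicators are False.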


Checking that the categorical columns have been converted into boolean indicator columns

In [21]:
info_table = pd.DataFrame({
                          "Column": df.columns,                      # Column names
                          "Non-Null Count": df.notnull().sum(),      # Non-null counts
                          "Data Type": df.dtypes                     # Data types of each column
                          }).reset_index(drop=True)
info_table

| | Column | Non-Null Count | Data Type |
|---|---|---|---|
| 0 | Age | 918 | int64 |
| 1 | RestingBP | 918 | int64 |
| 2 | Cholesterol | 918 | int64 |
| 3 | FastingBS | 918 | int64 |
| 4 | MaxHR | 918 | int64 |
| 5 | Oldpeak | 918 | float64 |
| 6 | HeartDisease | 918 | int64 |
| 7 | AgeGroup | 918 | category |
| 8 | Sex_M | 918 | bool |
| 9 | ChestPainType_ATA | 918 | bool |


Next we calculate a risk score for heart disease and add it to the DataFrame as a new column. Because there are no predefined feature weights, we use Principal Component Analysis (PCA): the absolute loadings of the first principal component, normalised to sum to 1, are used as the weight of each feature.

In [22]:
# Standardizing the numeric columns (zero mean, unit variance).
# Note: np.number also selects the HeartDisease target, so it is one of the inputs to the score.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df.select_dtypes(include=[np.number]))

# Applying PCA
pca = PCA(n_components=1)  # Reduces to 1 principal component
pca.fit(scaled_data)

# Extracting PCA loadings and normalising their absolute values so the weights sum to 1
pca_weights = abs(pca.components_[0]) / sum(abs(pca.components_[0]))

# RiskScore is the weighted sum of the standardized features
df['RiskScore'] = np.dot(scaled_data, pca_weights)

df[['Age', 'AgeGroup', 'Cholesterol', 'RestingBP', 'FastingBS', 'RiskScore']]


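| | Age | AgeGroup | Cholesterol | RestingBP | FastingBS | RiskScore |
|---|---|---|---|---|---|---|
| 0 | 40 | 40-49 | 289 | 140 | 0 | -0.292009 |
| 1 | 49 | 40-49 | 180 | 160 | 0 | 0.288617 |
| 2 | 37 | 30-39 | 283 | 130 | 0 | -0.918545 |
| 3 | 48 | 40-49 | 214 | 138 | 0 | -0.074396 |
| 4 | 54 | 50-59 | 195 | 150 | 0 | -0.417702 |
| ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | 40-49 | 264 | 110 | 0 | -0.089314 |
| 914 | 68 | 60-69 | 193 | 144 | 1 | 1.077270 |
| 915 | 57 | 50-59 | 131 | 130 | 0 | -0.009320 |
| 916 | 57 | 50-59 | 236 | 130 | 0 | 0.342005 |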
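Selecting columns with `np.number` picks up every numeric column, including the HeartDisease target. If the score is meant to be built from predictors only, one option is to drop the target before scaling. The sketch below is an illustration on synthetic random data (not the heart dataset); it keeps the same weighting scheme, and `feature_cols` and `weights` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (random values, not the heart dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.normal(54, 9, 200),
    "RestingBP": rng.normal(132, 18, 200),
    "Oldpeak": rng.normal(0.9, 1.0, 200),
    "HeartDisease": rng.integers(0, 2, 200),
})

# Keep the target out of the inputs so the score uses predictors only
feature_cols = toy.select_dtypes(include=[np.number]).columns.drop("HeartDisease")
scaled = StandardScaler().fit_transform(toy[feature_cols])

# Same weighting scheme: normalised absolute loadings of the first component
pca = PCA(n_components=1).fit(scaled)
weights = np.abs(pca.components_[0]) / np.abs(pca.components_[0]).sum()

toy["RiskScore"] = scaled @ weights
print(dict(zip(feature_cols, weights.round(3))))
```

Taking absolute loadings discards their sign, so every feature pushes the score upward. Whether that suits a feature such as MaxHR, which is negatively related to heart disease in this dataset, is a separate design decision.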


Creating a correlation matrix to assess relationships between numerical variables and the target variable (HeartDisease)

In [24]:
numeric_df = df.select_dtypes(include=[float, int])  # Only numeric columns
correlation_matrix = numeric_df.corr()
correlation_matrix

| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | RiskScore |
|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.254399 | -0.095282 | 0.198039 | -0.382045 | 0.258612 | 0.282039 | 0.571677 |
| RestingBP | 0.254399 | 1.0 | 0.100893 | 0.070193 | -0.112135 | 0.164803 | 0.107589 | 0.42302 |
| Cholesterol | -0.095282 | 0.100893 | 1.0 | -0.260974 | 0.235792 | 0.050148 | -0.232741 | 0.148859 |
| FastingBS | 0.198039 | 0.070193 | -0.260974 | 1.0 | -0.131438 | 0.052698 | 0.267291 | 0.421629 |
| MaxHR | -0.382045 | -0.112135 | 0.235792 | -0.131438 | 1.0 | -0.160691 | -0.400421 | 0.015753 |
| Oldpeak | 0.258612 | 0.164803 | 0.050148 | 0.052698 | -0.160691 | 1.0 | 0.403951 | 0.630763 |
| HeartDisease | 0.282039 | 0.107589 | -0.232741 | 0.267291 | -0.400421 | 0.403951 | 1.0 | 0.606602 |
| RiskScore | 0.571677 | 0.42302 | 0.148859 | 0.421629 | 0.015753 | 0.630763 | 0.606602 | 1.0 |


**Summary of key insights from correlation matrix:**

The correlations below are with the target variable (HeartDisease). No original feature reaches a strong correlation (absolute value of 0.8 or more).

1. **Moderate Positive Correlation (around 0.4)**:
    - **Oldpeak** (0.40): A higher Oldpeak value is moderately associated with a higher likelihood of heart disease.

2. **Weak Positive Correlation (between 0.1 and 0.3)**:
    - **Age** (0.28): Age shows a weak positive correlation with heart disease.
    - **FastingBS** (0.27): Higher fasting blood sugar is weakly associated with heart disease.
    - **RestingBP** (0.11): Resting blood pressure shows only a very weak positive correlation with heart disease.

3. **Moderate Negative Correlation (around -0.4)**:
    - **MaxHR** (-0.40): A higher maximum heart rate achieved during exercise is moderately associated with a lower likelihood of heart disease.

4. **Weak Negative Correlation (between -0.3 and -0.1)**:
    - **Cholesterol** (-0.23): Cholesterol shows a weak negative correlation with heart disease. This may be affected by the 0 values recorded for Cholesterol.

RiskScore has a correlation of 0.61 with HeartDisease, but HeartDisease is one of the inputs used to compute the score, so this value overstates its predictive strength.

These insights help in identifying which features are more important for predicting heart disease and can guide further analysis and model development.
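Reading the target column of a correlation matrix is easier when the features are ranked by the strength of their correlation. The sketch below does this on a toy DataFrame (hypothetical values, not the heart dataset); `target_corr` and `ranked` are illustrative names.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "MaxHR":        [170, 160, 150, 120, 110, 100],
    "Oldpeak":      [0.0, 0.2, 0.1, 1.5, 2.0, 1.8],
    "HeartDisease": [0,   0,   0,   1,   1,   1],
})

# Correlation of each feature with the target, excluding the target itself
target_corr = toy.corr()["HeartDisease"].drop("HeartDisease")

# Order by absolute value so strong negative correlations are not overlooked
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps the sign visible while still placing a strongly negative feature near the top of the list.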

<h2 style='color:#4e9bf8;'>Output</h2>

The data has been cleaned, processed and analysed. New transformed features, AgeGroup and RiskScore, have been added to the dataset, which we save in a new CSV file called `cleaned_heartdataset.csv` for use in further data visualisations.

In [25]:
df.to_csv("Outputs/cleaned_heartdataset.csv", index=False)


<div style='background-color:#4e9bf8; padding: 15px; border-radius: 5px;'>
<h2 style='color:#FEFEFE; text-align:center;'>Conclusion & Next Step</h2>
</div>

- The heart_dataset.csv is relatively clean, with no missing, empty or duplicate values.
- There are some outliers, which will be handled as appropriate during further data visualisation.
- New transformed features, AgeGroup and RiskScore, have been added to the dataset and saved in a new CSV file called `cleaned_heartdataset.csv`.
- The insights from each analysis are summarised below that analysis and will be displayed in the `Data visualisation` Jupyter notebook.



---