

*   Remove attributes with an absolute correlation of 0.75 or higher

The purpose of this analysis is to identify and remove attributes that are highly correlated with one another in a health dataset. Removing them reduces redundancy and improves model efficiency. Highly correlated features carry overlapping information and add complexity (multicollinearity), which can make the estimates of statistical models unstable and degrade their performance.
In this analysis, we computed the correlation matrix for the numerical variables and applied a threshold of 0.75 to the absolute correlation coefficient to flag pairs of highly correlated attributes.
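
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper name `correlated_pairs` and the `toy` DataFrame are hypothetical, and the data is synthetic rather than taken from the health dataset.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.75):
    """Return (column_a, column_b, r) for each pair with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    names = list(corr.columns)
    values = corr.to_numpy()
    # k=1 selects the strict upper triangle, so each pair appears once
    # and the diagonal (a column's correlation with itself) is skipped.
    rows, cols = np.triu_indices(len(names), k=1)
    return [
        (names[i], names[j], float(values[i, j]))
        for i, j in zip(rows, cols)
        if abs(values[i, j]) >= threshold
    ]


# Synthetic illustration only; these columns are not from the health dataset.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [2.0, 1.0, 2.0, 1.0, 2.0],   # uncorrelated with "a"
})
pairs = correlated_pairs(toy)
print(pairs)
```

Only the pair (`a`, `b`) is flagged here, because `b` is an exact multiple of `a`. Scanning the upper triangle avoids reporting each pair twice and ignores the diagonal.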
  
**Removal of Attributes:** For every pair of attributes with an absolute correlation coefficient of 0.75 or higher, one attribute of the pair is removed, so that each highly correlated group is represented by a single feature.

- **Original DataFrame:** The original dataset consists of 4000 rows and 13 columns: Patient ID, Age, Sex, Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, Heart Rate, Diabetes, Family History, Smoking, Diet, Continent, and Heart Attack Risk.
  
- **DataFrame After Removing Highly Correlated Attributes:** In this specific analysis, no attributes were removed because the highest absolute correlation coefficient observed between any two attributes was below 0.75. The filtered DataFrame is therefore identical to the original.
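
One way to confirm that nothing reaches the threshold is to look at the largest absolute correlation between two different columns. The sketch below assumes a hypothetical helper `max_abs_offdiag_corr` and uses a small synthetic DataFrame, not the health dataset.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(df):
    """Return the largest |r| between two different numeric columns."""
    corr = df.corr(numeric_only=True).abs().to_numpy().copy()
    # Ignore each column's correlation with itself.
    np.fill_diagonal(corr, np.nan)
    return float(np.nanmax(corr))


# Synthetic illustration only; these columns are not from the health dataset.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 2.0, 1.0, 2.0],
    "z": [1.0, 2.0, 2.0, 2.0, 1.0],
})
threshold = 0.75
strongest = max_abs_offdiag_corr(toy)
nothing_to_remove = strongest < threshold
print(strongest, nothing_to_remove)
```

If `strongest` is below the threshold, no pair qualifies and the filtered DataFrame matches the original, which is the situation reported for this dataset.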

In [8]:
import pandas as pd
import numpy as np

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('heart.attack.csv')

# Pairwise correlations between the numeric columns
correlation_matrix = df.corr(numeric_only=True)
correlation_threshold = 0.75
# Take the names from the matrix itself so positions and names always line up
numeric_attributes = list(correlation_matrix.columns)
highly_correlated_pairs = np.where(np.abs(correlation_matrix.to_numpy()) >= correlation_threshold)

attributes_to_remove = set()

for i, j in zip(*highly_correlated_pairs):
    # i < j visits each pair once and skips the diagonal (self-correlation)
    if i < j and numeric_attributes[i] not in attributes_to_remove and numeric_attributes[j] not in attributes_to_remove:
        # Keep the first attribute of the pair and drop the second
        attributes_to_remove.add(numeric_attributes[j])

df_filtered = df.drop(columns=sorted(attributes_to_remove))

print("\nOriginal DataFrame:")
print(df)

print("\nDataFrame after removing highly correlated attributes:")
print(df_filtered)


Original DataFrame:
     Patient ID   Age     Sex   Cholesterol  Systolic BP  Diastolic BP  \
0       BMW7812    67    Male           208          158            88   
1       CZE1114    21    Male           389          165            93   
2       BNI9906    21  Female           324          174            99   
3       JLN3497    84    Male           383          163           100   
4       GFO8847    66    Male           318           91            88   
...         ...   ...     ...           ...          ...           ...   
3995    UII9280    66    Male           201          172            91   
3996    SZU8764    42  Female           129          109            63   
3997    CQJ6551    81    Male           127          153           110   
3998    DZQ4343    81    Male           244          109           103   
3999    WER4678    44  Female           150          100            97   

       Heart Rate  Diabetes  Family History  Smoking       Diet  \
0              72      