# <b>1 <span style='color:#0386f7de'>|</span> Introduction</b>

## What topic does the dataset cover?
According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least one of three key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Identifying the factors that have the greatest impact on heart disease is therefore important for detection and prevention in healthcare. Advances in computing, in turn, make it possible to apply machine learning methods that detect patterns in the data and predict a patient's condition.


## Explanation of the variables of the dataset
1. HeartDisease : Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).
2. BMI : Body Mass Index (BMI).
3. Smoking : Have you smoked at least 100 cigarettes in your entire life? ( The answer Yes or No ).
4. AlcoholDrinking : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week).
5. Stroke : (Ever told) (you had) a stroke?
6. PhysicalHealth : Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days).
7. MentalHealth : Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days).
8. DiffWalking : Do you have serious difficulty walking or climbing stairs?
9. Sex : Are you male or female?
10. AgeCategory: Fourteen-level age category.
11. Race : Imputed race/ethnicity value.
12. Diabetic : (Ever told) (you had) diabetes?
13. PhysicalActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.
14. GenHealth : Would you say that in general your health is...
15. SleepTime : On average, how many hours of sleep do you get in a 24-hour period?
16. Asthma : (Ever told) (you had) asthma?
17. KidneyDisease : Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
18. SkinCancer : (Ever told) (you had) skin cancer?
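
The AlcoholDrinking definition in the list above is a sex-specific threshold on weekly drinks. A minimal sketch of that rule follows, assuming a hypothetical helper `is_heavy_drinker` and a hypothetical `drinks_per_week` input. The dataset itself only stores the resulting Yes/No flag, so this is an illustration of the definition rather than something the notebook computes.

```python
def is_heavy_drinker(sex: str, drinks_per_week: float) -> bool:
    """Apply the heavy-drinker threshold quoted in the variable list.

    Hypothetical helper for illustration only; the dataset stores the
    final Yes/No answer, not the number of drinks.
    """
    # Thresholds in drinks per week, taken from the AlcoholDrinking definition
    thresholds = {"Male": 14, "Female": 7}
    # "More than" is a strict inequality, so exactly 14 (or 7) does not qualify
    return drinks_per_week > thresholds[sex]


print(is_heavy_drinker("Male", 14))
print(is_heavy_drinker("Female", 8))
```

Note the strict inequality: a respondent exactly at the threshold is not counted as a heavy drinker under this wording.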











# <b>2 <span style='color:#0386f7de'>|</span> Importing libraries</b>

In [None]:
import pandas as pd
import numpy as np


%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

# <b>3 <span style='color:#0386f7de'>|</span> Reading the dataset</b>

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/FrancescaFati/SmartHospitals2024/main/heart_2020_cleaned.csv')
df.head()

# <b>4 <span style='color:#0386f7de'>|</span> Data Visualization </b>

In [None]:
n_rows = 3
n_cols = int(np.ceil(len(df.columns) / n_rows))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 5 * n_rows))

for i, column in enumerate(df.columns):
    row = i // n_cols
    col = i % n_cols
    ax = axes[row, col]
    df[column].hist(ax=ax)
    ax.set_title(f'{column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()

In [None]:
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

n_rows = 3
n_cols = int(np.ceil(len(categorical_cols) / n_rows))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 5 * n_rows), squeeze=False)

for i, column in enumerate(categorical_cols):
    row = i // n_cols
    col = i % n_cols
    ax = axes[row, col]
    sns.countplot(data=df, x=column, palette='autumn', hue='HeartDisease', ax=ax)
    ax.set_title(f'{column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Count')

plt.tight_layout()
plt.show()
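
Because raw counts are dominated by the larger class, the count plots above can be hard to compare across categories. One possible complement is the share of heart disease cases within each category. The sketch below uses a toy frame with made-up rows (`toy` is illustrative, not the real dataset) and assumes `HeartDisease` is still encoded as `Yes`/`No` at this point.

```python
import pandas as pd

# Toy data with made-up values, only to illustrate the computation
toy = pd.DataFrame({
    "HeartDisease": ["Yes", "No", "No", "No", "Yes", "No"],
    "Smoking":      ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Share of respondents with heart disease within each Smoking group:
# the boolean mask averages to a proportion between 0 and 1
rate = toy["HeartDisease"].eq("Yes").groupby(toy["Smoking"]).mean()
print(rate)
```

The same pattern could be applied to any of the categorical columns by swapping the grouping column.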



In [None]:
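# Encode binary answers as integers; the two in-between Diabetic answers
# are folded into No (0) and Yes (1)
df = df.replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0,
                 'No, borderline diabetes': 0, 'Yes (during pregnancy)': 1})
df['Diabetic'] = df['Diabetic'].astype(int)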

fig, ax = plt.subplots(figsize = (13,6))

ax.hist(df[df["HeartDisease"]==1]["Smoking"], bins=15, alpha=0.5, color="red", label="HeartDisease")
ax.hist(df[df["HeartDisease"]==0]["Smoking"], bins=15, alpha=0.5, color="#fccc79", label="Normal")

ax.set_xlabel("Smoking")
ax.set_ylabel("Frequency")

fig.suptitle("Distribution of heart disease cases by smoking status")

ax.legend();
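
The frame-wide `replace` in the cell above applies one dictionary to every column. A per-column alternative is sketched below with `Series.map`, which makes it explicit which labels belong to which column and turns any unmapped label into `NaN` instead of leaving it untouched. The toy rows and the names `binary_map`, `sex_map`, and `diabetic_map` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with made-up rows that mimic three of the dataset's columns
toy = pd.DataFrame({
    "Smoking":  ["Yes", "No", "Yes"],
    "Sex":      ["Male", "Female", "Female"],
    "Diabetic": ["No, borderline diabetes", "Yes (during pregnancy)", "Yes"],
})

binary_map = {"Yes": 1, "No": 0}
sex_map = {"Male": 1, "Female": 0}
# Same folding as in the notebook: borderline -> 0, during pregnancy -> 1
diabetic_map = {**binary_map, "No, borderline diabetes": 0, "Yes (during pregnancy)": 1}

encoded = toy.assign(
    Smoking=toy["Smoking"].map(binary_map),
    Sex=toy["Sex"].map(sex_map),
    Diabetic=toy["Diabetic"].map(diabetic_map),
)
print(encoded)
```

Whether folding the two in-between Diabetic answers into 0 and 1 is appropriate depends on the analysis; keeping them as separate categories is an equally valid choice.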

In [None]:
# Setup the matplotlib figure with 2 rows and 2 columns
fig, axes = plt.subplots(2, 2, figsize=(13, 10))

ax = axes.ravel()

# Plot 1: Distribution of BMI
sns.kdeplot(df[df["HeartDisease"]==1]["BMI"], alpha=0.5, fill=True, color="red", label="HeartDisease", ax=ax[0])
sns.kdeplot(df[df["HeartDisease"]==0]["BMI"], alpha=0.5, fill=True, color="#fccc79", label="Normal", ax=ax[0])
ax[0].set_title('Distribution of Body Mass Index', fontsize=14)
ax[0].set_xlabel("BMI")
ax[0].set_ylabel("Density")
ax[0].legend()

# Plot 2: Distribution of SleepTime
sns.kdeplot(df[df["HeartDisease"]==1]["SleepTime"], alpha=0.5, fill=True, color="red", label="HeartDisease", ax=ax[1])
sns.kdeplot(df[df["HeartDisease"]==0]["SleepTime"], alpha=0.5, fill=True, color="#fccc79", label="Normal", ax=ax[1])
ax[1].set_title('Distribution of SleepTime values', fontsize=14)
ax[1].set_xlabel("SleepTime")
ax[1].set_ylabel("Density")
ax[1].legend()

# Plot 3: Distribution of PhysicalHealth
sns.kdeplot(df[df["HeartDisease"]==1]["PhysicalHealth"], alpha=0.5, fill=True, color="red", label="HeartDisease", ax=ax[2])
sns.kdeplot(df[df["HeartDisease"]==0]["PhysicalHealth"], alpha=0.5, fill=True, color="#fccc79", label="Normal", ax=ax[2])
ax[2].set_title('Distribution of PhysicalHealth state for the last 30 days', fontsize=14)
ax[2].set_xlabel("PhysicalHealth")
ax[2].set_ylabel("Density")
ax[2].legend()

# Plot 4: Distribution of MentalHealth
sns.kdeplot(df[df["HeartDisease"]==1]["MentalHealth"], alpha=0.5, fill=True, color="red", label="HeartDisease", ax=ax[3])
sns.kdeplot(df[df["HeartDisease"]==0]["MentalHealth"], alpha=0.5, fill=True, color="#fccc79", label="Normal", ax=ax[3])
ax[3].set_title('Distribution of MentalHealth state for the last 30 days', fontsize=14)
ax[3].set_xlabel("MentalHealth")
ax[3].set_ylabel("Density")
ax[3].legend()

plt.tight_layout()
plt.show()


In [None]:
fig, ax = plt.subplots(figsize=(10, 6))  # You can adjust the size as needed

# Create the scatter plot
sns.scatterplot(data=df, x='BMI', y='PhysicalHealth', hue='HeartDisease', ax=ax)

ax.set_title('BMI vs PhysicalHealth')
ax.set_xlabel('BMI')
ax.set_ylabel('PhysicalHealth')
ax.legend(title='Heart Disease')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))  # You can adjust the size as needed

# Create the scatter plot
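# Sort by AgeCategory so the categorical x-axis runs from youngest to oldest
sns.scatterplot(data=df.sort_values('AgeCategory'), x='AgeCategory', y='PhysicalHealth', hue='HeartDisease', ax=ax)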

ax.set_title('AgeCategory vs PhysicalHealth')
ax.set_xlabel('AgeCategory')
ax.set_ylabel('PhysicalHealth')
ax.legend(title='Heart Disease')

plt.tight_layout()
plt.show()