# <b>1 <span style='color:#0386f7de'>|</span> Introduction</b>

## What topic does the dataset cover?
According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Detecting and preventing the factors that have the greatest impact on heart disease is therefore an important goal in healthcare. Computational developments, in turn, allow machine learning methods to detect "patterns" in the data that can predict a patient's condition.


## Explanation of the variables of the dataset
1. HeartDisease : Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).
2. BMI : Body Mass Index (BMI).
3. Smoking : Have you smoked at least 100 cigarettes in your entire life? ( The answer Yes or No ).
4. AlcoholDrinking : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week).
5. Stroke : (Ever told) (you had) a stroke?
6. PhysicalHealth : Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days).
7. MentalHealth : Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days).
8. DiffWalking : Do you have serious difficulty walking or climbing stairs?
9. Sex : Are you male or female?
10. AgeCategory: Fourteen-level age category.
11. Race : Imputed race/ethnicity value.
12. Diabetic : (Ever told) (you had) diabetes?
13. PhysicalActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.
14. GenHealth : Would you say that in general your health is...
15. SleepTime : On average, how many hours of sleep do you get in a 24-hour period?
16. Asthma : (Ever told) (you had) asthma?
17. KidneyDisease : Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
18. SkinCancer : (Ever told) (you had) skin cancer?











# <b>2 <span style='color:#0386f7de'>|</span> Importing libraries</b>

In [None]:
import pandas as pd
import numpy as np


%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt


# <b>3 <span style='color:#0386f7de'>|</span> Reading the dataset</b>

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/FrancescaFati/SmartHospitals2024/main/heart_2020_cleaned.csv')
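# Artificially introduce missing values: each cell becomes NaN with probability 0.1
nan_percentage = 0.1
# Boolean mask with the same shape as df; True marks the cells to blank out
nan_mask = np.random.rand(*df.shape) < nan_percentage
df[nan_mask] = np.nan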

df.head(15)

# <b>4 <span style='color:#0386f7de'>|</span> Missing Data </b>

In [None]:
df.shape

In [None]:
total_nans = df.isna().sum().sum()
print("Total number of NaNs in the DataFrame:", total_nans)

In [None]:
nans_per_column = df.isna().sum()
print("Number of NaNs per column:\n", nans_per_column)

In [None]:
nan_percentage_per_column = (df.isna().sum() / len(df)) * 100
print("Percentage of NaNs per column:\n", round(nan_percentage_per_column,2))

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis')
plt.title('Visualization of NaN Values in DataFrame')
plt.show()
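
The missingness checks above can be tried on a small synthetic frame before running them on the full dataset. The sketch below is only an illustration: the `toy` frame, its values, and the fixed seed are assumptions and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the seed is fixed only to make the sketch reproducible
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, size=1000),
    "SleepTime": rng.integers(4, 10, size=1000).astype(float),
    "Smoking": rng.choice(["Yes", "No"], size=1000),
})

# Blank out each cell with probability 0.1, mirroring the mask used on df
toy_mask = rng.random(toy.shape) < 0.1
toy_missing = toy.mask(toy_mask)

# One row per column: absolute count and percentage of missing values
summary = pd.DataFrame({
    "n_missing": toy_missing.isna().sum(),
    "pct_missing": (toy_missing.isna().mean() * 100).round(2),
})
print(summary)
```

Each column should show roughly 10% missing values, since the mask is drawn independently for every cell.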


# <b>6.1 <span style='color:#0386f7de'>|</span> Deletion Methods </b>

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
# Drop All Rows with Any NaN Values
df_dropped_rows_any = df.dropna()
df_dropped_rows_any.shape

In [None]:
df_dropped_rows_any.head(10)

In [None]:
# Drop Rows Where All Values are NaN
df_dropped_rows_all = df.dropna(how='all')
df_dropped_rows_all.shape

In [None]:
df_dropped_rows_all.head(10)

In [None]:
# Drop Rows Where the Target Column 'HeartDisease' is NaN
df = df.dropna(subset=['HeartDisease'])
df.shape

In [None]:
df.head(10)

In [None]:
# Drop Columns with Any NaN Values
df_dropped_columns_any = df.dropna(axis=1)
df_dropped_columns_any.shape


In [None]:
df_dropped_columns_any.head(10)

In [None]:
#  Drop Columns Where All Values are NaN
df_dropped_columns_all = df.dropna(axis=1, how='all')
df_dropped_columns_all.shape

In [None]:
# Drop Rows or Columns with NaN Values Above a Certain Threshold

# Keep only rows with at least `thresh` non-NaN values (here: 9)
df_dropped_thresh_rows = df.dropna(thresh=9)
df_dropped_thresh_rows.shape


In [None]:
# Keep only columns with at least `thresh` non-NaN values (here: 150000)
df_dropped_thresh_columns = df.dropna(axis=1, thresh=150000)
df_dropped_thresh_columns.shape

In [None]:
# Drop specific columns by name:
df_dropped_specific_columns = df.drop(columns=['KidneyDisease', 'Smoking'])
df_dropped_specific_columns.shape
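
Because the full dataset is large, the effect of each `dropna` variant is easier to see on a tiny frame. The following is a minimal sketch; the four-row `toy` frame is hypothetical and chosen so that every variant keeps a different number of rows or columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame: row 3 is entirely NaN, rows 1 and 2 are partly NaN
toy = pd.DataFrame({
    "HeartDisease": ["No", "Yes", np.nan, np.nan],
    "BMI": [22.5, np.nan, 31.0, np.nan],
    "Smoking": ["Yes", "No", "No", np.nan],
})

rows_any = toy.dropna()                            # keep rows with no NaN at all
rows_all = toy.dropna(how="all")                   # drop only rows that are entirely NaN
rows_target = toy.dropna(subset=["HeartDisease"])  # drop rows missing the target
rows_thresh = toy.dropna(thresh=2)                 # keep rows with at least 2 non-NaN values
cols_any = toy.dropna(axis=1)                      # drop every column containing a NaN

for name, result in [("any", rows_any), ("all", rows_all),
                     ("target", rows_target), ("thresh=2", rows_thresh),
                     ("columns", cols_any)]:
    print(name, result.shape)
```

Note that `thresh` counts the non-NaN values required to keep a row or column, not the number of NaNs tolerated.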

# <b>6.2 <span style='color:#0386f7de'>|</span> Imputation Methods </b>


# <b>6.2.1 <span style='color:#0386f7de'>|</span> Replacing Methods </b>

In [None]:
df.head()

In [None]:
# Forward Fill: propagate the last valid value downwards
# (fillna(method='ffill') is deprecated in recent pandas versions)
df_filled_ffill = df.ffill()
df_filled_ffill.head(10)

In [None]:
# Backward Fill: propagate the next valid value upwards
# (fillna(method='bfill') is deprecated in recent pandas versions)
df_filled_bfill = df.bfill()
df_filled_bfill.head(10)

In [None]:
# Fill with Mean
# Work on a copy: plain assignment would alias df and overwrite its NaNs
df_filled_mean = df.copy()
# Select only numeric columns from df
numeric_cols = df.select_dtypes(include=[np.number])

# Calculate the mean of these numeric columns
means = numeric_cols.mean()
print(means)

# Fill NaN values in the numeric columns with their respective means
df_filled_mean[numeric_cols.columns] = numeric_cols.fillna(means)
df_filled_mean.head(10)
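
Mean imputation only applies to numeric columns, so the categorical columns keep their NaNs. One common counterpart, which this notebook does not use, is to fill categorical columns with their most frequent value (the mode). The sketch below shows both on a hypothetical `toy` frame; treat it as an illustration under that assumption rather than the method applied here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "BMI": [20.0, np.nan, 30.0, 40.0],
    "Smoking": ["Yes", "No", np.nan, "Yes"],
})

filled = toy.copy()  # copy so that `toy` keeps its NaNs

num_cols = filled.select_dtypes(include=[np.number]).columns
cat_cols = filled.columns.difference(num_cols)

# Numeric columns: replace NaN with the column mean
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())

# Categorical columns: replace NaN with the most frequent value
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Both strategies shrink the variance of the imputed column, so they are best treated as quick baselines.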

# <b>6.2.2 <span style='color:#0386f7de'>|</span> Regression Methods </b>

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/FrancescaFati/SmartHospitals2024/main/heart_2020_cleaned.csv')
nan_percentage = 0.1
nan_mask = np.random.rand(*df.shape) < nan_percentage
df[nan_mask] = np.nan

df.head(20)

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

df = df.replace({
    'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0,
    'No, borderline diabetes': 0, 'Yes (during pregnancy)': 1
})

# Simple linear regression: predict BMI from HeartDisease
model = LinearRegression()
# Fit only on rows where both 'HeartDisease' and 'BMI' are observed
not_null_data = df.dropna(subset=['HeartDisease', 'BMI'])
# Pass the target as a 1-D Series so that predict() returns a 1-D array
model.fit(not_null_data[['HeartDisease']].astype(float), not_null_data['BMI'])

# Select rows where BMI is missing but the predictor HeartDisease is available
mask = df['BMI'].isnull() & df['HeartDisease'].notnull()
# Impute BMI only for those rows
predicted_values = model.predict(df.loc[mask, ['HeartDisease']].astype(float))
df.loc[mask, 'BMI'] = predicted_values


In [None]:
df.head(20)
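
With a single binary predictor such as `HeartDisease`, an ordinary least squares fit with an intercept predicts the mean BMI of each group, so this regression imputation amounts to a group-wise mean fill. The sketch below checks this on a hypothetical `toy` frame (the values are invented for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the last two rows have a missing BMI
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 0, 1, 1, 1, 0, 1],
    "BMI": [22.0, 24.0, 26.0, 30.0, 32.0, 34.0, np.nan, np.nan],
})

# Fit on the rows where BMI is observed
known = toy.dropna(subset=["BMI"])
model = LinearRegression().fit(known[["HeartDisease"]], known["BMI"])

# Predict BMI only where it is missing
to_fill = toy["BMI"].isna()
imputed = toy.copy()
imputed.loc[to_fill, "BMI"] = model.predict(toy.loc[to_fill, ["HeartDisease"]])

# Group means of the observed BMI values, for comparison
group_means = known.groupby("HeartDisease")["BMI"].mean()
print(imputed.tail(2))
print(group_means)
```

Adding more predictors (for example age category or physical activity, suitably encoded) would let the regression produce more than two distinct imputed values.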