In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

First, we import the libraries the notebook depends on.

In [None]:
# Load the data
df = pd.read_csv('heart_2022_with_nans.csv')

# Strip whitespace from string columns
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# Drop rows where target variable is missing
df = df.dropna(subset=['HadHeartAttack'])

Read in the CSV file, strip stray whitespace from the string columns, and drop rows where the target variable is missing.
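As a minimal sketch of what this cleaning step does, the toy frame below stands in for the real CSV (its rows are made up for illustration). It uses `select_dtypes(exclude="number")` to find the text columns, which is a variant of the dtype check in the cell above rather than the notebook's exact code, and it prints the per-column missing counts so the effect of later `dropna()` calls can be anticipated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real CSV
toy = pd.DataFrame({
    "HadHeartAttack": [" Yes", "No ", np.nan, "No"],
    "Sex": ["Male ", " Female", "Male", None],
    "BMI": [27.1, 31.4, 22.0, np.nan],
})

# Strip whitespace from the non-numeric (text) columns only
cleaned = toy.copy()
text_cols = cleaned.select_dtypes(exclude="number").columns
cleaned[text_cols] = cleaned[text_cols].apply(lambda col: col.str.strip())

# Drop only the rows whose target is missing; rows with missing features stay for now
cleaned = cleaned.dropna(subset=["HadHeartAttack"])

print(cleaned)
print(cleaned.isna().sum())  # remaining missing values per column
```

Checking `isna().sum()` on the real frame before selecting features shows how many rows a later blanket `dropna()` would discard.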

In [None]:
# Feature selection
df_model = df[['HadHeartAttack', 'Sex', 'AgeCategory', 'BMI', 'SmokerStatus', 'SleepHours']]

# Drop rows with missing feature values
df_model = df_model.dropna()

# Separate X and y
y = df_model['HadHeartAttack']
X = df_model.drop('HadHeartAttack', axis=1)

# One-hot encode categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

This cell selects specific features, drops rows with missing values, separates the inputs from the target, and one-hot encodes the categorical variables for modeling.
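To make the encoding step concrete, here is a small sketch on a toy frame. The category values are assumptions chosen for illustration and may not match the labels in the real dataset. With `drop_first=True`, the alphabetically first level of each categorical column is dropped and becomes the implicit reference category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical category values, for illustration only
toy_X = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "SmokerStatus": ["Never smoked", "Former smoker", "Never smoked"],
    "BMI": [27.1, 31.4, 22.0],
})

# Each categorical column with k levels becomes k - 1 indicator columns
encoded = pd.get_dummies(toy_X, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Here `Sex_Female` and `SmokerStatus_Former smoker` are the dropped reference levels, so a row with zeros in the remaining indicator columns belongs to those categories.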

In [None]:
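# Scaling (note: the scaler is fit on all rows before the split,
# so test-set minima and maxima influence the scaling)
scaler = MinMaxScaler()
# Keep the original index so the rows of X_scaled stay labelled like y
X_scaled = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns, index=X_encoded.index)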

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and train the LogisticRegression model
model = LogisticRegression(max_iter=1000)  # raised from the default of 100 to help the solver converge
model.fit(X_train, y_train)


This cell scales the features to the [0, 1] range, splits the data into training and testing sets, and trains a logistic regression model on the training data.
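One alternative worth considering is to split first and fit the scaler on the training rows only, so that no information from the test set reaches preprocessing. The sketch below is not the notebook's approach. It uses synthetic data (the column names echo the notebook, the values are random) and scikit-learn's `make_pipeline` to chain the scaler and the model, and `stratify` keeps the class ratio similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; values are random, not from the real dataset
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "BMI": rng.normal(28, 5, n),
    "SleepHours": rng.integers(3, 11, n).astype(float),
})
y_demo = pd.Series(np.where(rng.random(n) < 0.1, "Yes", "No"))  # imbalanced target

# Split first, keeping the Yes/No ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those training minima and maxima to transform the test rows
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(f"Test accuracy: {pipe.score(X_te, y_te):.4f}")
```

With min-max scaling the difference is usually small, but the pipeline form also makes it harder to apply mismatched preprocessing by accident when predicting on new data.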

In [None]:
# Predict
y_pred = model.predict(X_test)

# Evaluate using classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label="Yes")
recall = recall_score(y_test, y_pred, pos_label="Yes")
f1 = f1_score(y_test, y_pred, pos_label="Yes")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Generate predictions on the test set and print the model's accuracy, precision, recall, and F1 score.
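When reading these numbers, it can help to compare them against a trivial baseline, because accuracy alone can look strong when one class dominates. The sketch below assumes a hypothetical 95/5 split between "No" and "Yes" (the real class ratio is not stated in this notebook) and uses scikit-learn's `DummyClassifier` to always predict the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 "No" and 5 "Yes" (assumed ratio, for illustration)
y_true = np.array(["No"] * 95 + ["Yes"] * 5)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_rec = recall_score(y_true, y_base, pos_label="Yes", zero_division=0)

print(f"Baseline accuracy: {base_acc:.4f}")
print(f"Baseline recall:   {base_rec:.4f}")
```

If the logistic regression's accuracy is close to such a baseline while its recall for "Yes" is near zero, the model is mostly predicting "No", which matters a great deal for a heart-attack classifier.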

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=["No", "Yes"])
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["No", "Yes"], yticklabels=["No", "Yes"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Plot the model's confusion matrix as a heatmap.
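The confusion matrix and the metrics printed earlier are two views of the same counts. As a small worked sketch on made-up labels, the four cells can be unpacked with `ravel()` (with `labels=["No", "Yes"]` the order is TN, FP, FN, TP) and precision and recall recomputed by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 6 actual "No", 4 actual "Yes"
y_true = np.array(["No"] * 6 + ["Yes"] * 4)
y_hat = np.array(["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"])

cm_demo = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm_demo.ravel()  # rows are actual, columns are predicted

precision_by_hand = tp / (tp + fp)  # of predicted "Yes", how many were right
recall_by_hand = tp / (tp + fn)     # of actual "Yes", how many were found

print(cm_demo)
print(precision_by_hand, precision_score(y_true, y_hat, pos_label="Yes"))
print(recall_by_hand, recall_score(y_true, y_hat, pos_label="Yes"))
```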

What features seemed most important in your model? Why?
    The features selected for the model (Sex, AgeCategory, BMI, SmokerStatus, SleepHours) are health and lifestyle factors that plausibly influence heart disease. BMI matters because obesity is a known risk factor for heart disease. SmokerStatus matters because smoking is a well-established cardiovascular risk factor. SleepHours is included because poor sleep is associated with health risks in general, AgeCategory because older people are at higher risk of heart problems, and Sex because biological differences may influence heart disease.
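One way to check which features the fitted model actually leans on is to inspect the logistic regression coefficients. The sketch below uses synthetic data with two hypothetical columns, `feature_a` (which drives the target) and `feature_b` (noise), so it shows the technique rather than any result from this notebook. Because both columns lie in [0, 1], as the notebook's min-max scaled features do, the coefficient magnitudes are roughly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: feature_a drives the target, feature_b is pure noise
rng = np.random.default_rng(0)
n = 2000
X_demo = pd.DataFrame({
    "feature_a": rng.random(n),
    "feature_b": rng.random(n),
})
y_demo = np.where(X_demo["feature_a"] + 0.1 * rng.normal(size=n) > 0.5, "Yes", "No")

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# For a binary problem, coef_[0] holds the weights toward classes_[1] ("Yes" here)
coefs = pd.Series(clf.coef_[0], index=X_demo.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)

print(coefs)
```

For the notebook's model, the analogous call would be `pd.Series(model.coef_[0], index=X_scaled.columns)`. Coefficients for one-hot columns are relative to the dropped reference level, so they should be read as contrasts rather than absolute effects.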

Did one model perform better than the others? What trade-offs did you see?
    I did not try multiple models, so this is not applicable.

What would you recommend to someone using this model to make real decisions?
    Take it with a grain of salt. I am not a medical professional and am not well placed to predict heart disease risk. The model was trained on a dataset whose authenticity I cannot confirm. Ideally, this data would be gathered by medical professionals across all regions of the world, giving a diverse dataset that spans all types of lifestyles and locations.

What are the risks or limitations of this model in the real world?
    There is a high risk of genuine inaccuracy. The model may look accurate within the bubble of data it was trained on, but that does not guarantee accuracy elsewhere. I would recommend training it on data gathered by professionals whose authenticity can be verified. Even if all of the data could be verified, the limitations would come down to the age groups covered and to location, since only a few US states, along with the Virgin Islands and Puerto Rico, were taken into consideration.