## Learning Outcomes
- Exploratory data analysis & preparing the data for model building. 
- Machine Learning - Supervised Learning Classification
  - Logistic Regression
  - Naive Bayes Classifier
  - KNN Classifier
  - Decision Tree Classifier
  - Random Forest Classifier
  - Ensemble methods
- Training and making predictions using different classification models.
- Model evaluation

## Objective: 
- The classification goal is to predict “heart disease” in a person based on the factors given.

## Context:
- Heart disease is one of the leading causes of death for people of most races in the US. About half of all Americans have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking.
- Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Machine learning methods can detect patterns in the data and predict whether a patient has heart disease.

## Dataset Information

#### Source: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?datasetId=1936563&sortBy=voteCount
Originally, the dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.

This dataset consists of eighteen columns:
- HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
- BMI: Body Mass Index (BMI)
- Smoking: Have you smoked at least 100 cigarettes in your entire life?
- AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
- Stroke: Ever had a stroke?
- PhysicalHealth: Thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was it not good?
- MentalHealth: For how many days during the past 30 days was your mental health not good?
- DiffWalking: Do you have serious difficulty walking or climbing stairs?
- Sex: male or female?
- AgeCategory: Fourteen-level age category
- Race: Imputed race/ethnicity value
- Diabetic: Ever told you had diabetes?
- PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
- GenHealth: Would you say that in general your health is good, fine or excellent?
- SleepTime: On average, how many hours of sleep do you get in a 24-hour period?
- Asthma: Ever told you had asthma?
- KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
- SkinCancer: Ever had skin cancer?

### 1. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Load the dataset and display a sample of five rows of the data frame.

In [7]:
data = pd.read_csv("heart_2020_cleaned.csv")
print(data.head())


### 3. Check the shape of the data (number of rows and columns). Check the general information about the dataframe using the .info() method.

In [None]:
print(data.shape)
print(data.info())

### 4. Check the statistical summary of the dataset and write your inferences.

In [None]:
descriptive_stats = data.describe(include='all')
print(descriptive_stats)
print("\nInferences:")
print("- Numerical features have minimum, maximum, mean, standard deviation, etc.")
print("- Categorical features have counts and frequencies for each category.")

### 5. Check the percentage of missing values in each column of the data frame. Drop the missing values if there are any.

In [None]:
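# isnull().mean() gives the fraction of missing values per column; multiply by 100 for a percentage
print(data.isnull().mean() * 100)
# Assign the result back so that later steps work on the cleaned data
data = data.dropna()
print(f"\nShape of data after dropping rows with missing values: {data.shape}")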
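
Because this step asks for percentages rather than raw counts, a small sketch on a toy frame may make the calculation easier to follow. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "BMI": [22.5, np.nan, 30.1, 27.0],
    "Smoking": ["Yes", "No", None, "No"],
    "SleepTime": [7.0, 8.0, 6.0, 5.0],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Drop every row that has at least one missing value
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Rows with a gap in any column are removed, so checking the shape before and after shows how much data the drop costs.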

### 6. Check if there are any duplicate rows. If any drop them and check the shape of the dataframe after dropping duplicates.

In [None]:
duplicates = data.duplicated()
if duplicates.any():
  print(f"Duplicate rows found: {duplicates.sum()}")
  # Assign the result back so that later steps work on the de-duplicated data
  data = data.drop_duplicates()
  print(f"\nShape of data after dropping duplicates: {data.shape}")
else:
  print("No duplicate rows found.")

### 7. Check the distribution of the target variable (i.e. 'HeartDisease') and write your observations.

In [None]:
heart_disease_counts = data["HeartDisease"].value_counts()
print(heart_disease_counts)
print("\nObservations:")
# At this point the target still holds the labels 'Yes' and 'No', so index by label rather than by position
print("- Number of people with heart disease:", heart_disease_counts["Yes"])
print("- Number of people without heart disease:", heart_disease_counts["No"])
print("- The data might be imbalanced if the difference is significant.")
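
Raw counts can hide how skewed the target is, so class shares are often easier to read. The sketch below uses a made-up target column (`target` is hypothetical, not the real data) to show one way to quantify imbalance.

```python
import pandas as pd

# Toy target column (hypothetical): 9 "No" and 1 "Yes"
target = pd.Series(["No"] * 9 + ["Yes"], name="HeartDisease")

# normalize=True returns class shares instead of raw counts
shares = target.value_counts(normalize=True)
print(shares)

# Ratio of majority to minority class as a quick imbalance indicator
imbalance_ratio = shares.max() / shares.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 suggests that accuracy alone may be a misleading metric for this target.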

### 8. Visualize the distribution of the target column 'HeartDisease' with respect to various categorical features and write your observations.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
categorical_features = ["Smoking", "Sex", "PhysicalActivity"]
for feature in categorical_features:
  # Create the axes first so that pandas draws on them instead of opening a second, empty figure
  fig, ax = plt.subplots()
  data.groupby(feature)["HeartDisease"].value_counts().unstack().plot(kind="bar", stacked=False, ax=ax)
  ax.set_xlabel(feature)
  ax.set_ylabel("Count")
  ax.set_title(f"Distribution of Heart Disease by {feature}")
  ax.tick_params(axis="x", rotation=0)
  fig.tight_layout()
  plt.show()
print("Observations:")
print("- Visualizations help identify potential relationships between categorical features and heart disease.")
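
Count plots are dominated by the larger group, so a table of within-group rates can complement them. The following is a minimal sketch on made-up data (the frame `toy` is hypothetical) using `pd.crosstab` with row-wise normalization.

```python
import pandas as pd

# Toy data (hypothetical): four smokers and four non-smokers
toy = pd.DataFrame({
    "Smoking": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "HeartDisease": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# normalize="index" makes each row sum to 1, giving the heart disease rate within each group
rates = pd.crosstab(toy["Smoking"], toy["HeartDisease"], normalize="index")
print(rates)
```

Comparing the "Yes" column across rows shows how the rate differs between groups, independent of group size.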

### 9. Check the unique categories in the column 'Diabetic'. Replace 'Yes (during pregnancy)' as 'Yes' and 'No, borderline diabetes' as 'No'.

In [None]:
print(data['Diabetic'].unique())
replace_map = {'Yes (during pregnancy)': 'Yes', 'No, borderline diabetes': 'No'}
# replace() leaves values that are not in the map untouched, so no presence check is needed
data['Diabetic'] = data['Diabetic'].replace(replace_map)
print(data['Diabetic'].unique())

### 10. For the target column 'HeartDisease', replace 'No' with 0 and 'Yes' with 1.

In [None]:
# map() returns an integer column directly; run this cell only once, since unmapped values become NaN
data['HeartDisease'] = data['HeartDisease'].map({'No': 0, 'Yes': 1})
print(data.head())

### 11. Label Encode the columns "AgeCategory", "Race", and "GenHealth". Encode the rest of the columns using dummy encoding approach.

In [None]:
from sklearn.preprocessing import LabelEncoder
# Label encode the three multi-level columns
label_cols = ['AgeCategory', 'Race', 'GenHealth']
for col in label_cols:
  data[col] = LabelEncoder().fit_transform(data[col])
# Dummy encode the remaining non-numeric columns; numeric columns and the target are left untouched
dummy_cols = [col for col in data.select_dtypes(exclude='number').columns if col != 'HeartDisease']
data = pd.get_dummies(data, columns=dummy_cols, drop_first=True, dtype=int)
print(data.head())
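
One caveat with label encoding is that `LabelEncoder` assigns codes in alphabetical order, which need not match the natural ranking of an ordinal column. The sketch below contrasts it with an explicit mapping; the five `GenHealth` levels and their order are an assumption about the dataset, and `gen_health` is a made-up sample.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: GenHealth takes these five levels, ordered from worst to best
health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
gen_health = pd.Series(["Good", "Poor", "Excellent", "Fair", "Very good"])

# LabelEncoder assigns codes in alphabetical order of the labels
alphabetical = LabelEncoder().fit_transform(gen_health)

# An explicit mapping keeps the natural ranking of the levels
ordinal_map = {level: rank for rank, level in enumerate(health_order)}
ordinal = gen_health.map(ordinal_map)

print(list(alphabetical))
print(list(ordinal))
```

With the alphabetical codes, "Excellent" receives the lowest value and "Poor" sits above "Good", which may matter for distance-based or linear models.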

### 12. Store the target column (i.e.'HeartDisease') in the y variable and the rest of the columns in the X variable.

In [None]:
y = data['HeartDisease']
X = data.drop('HeartDisease', axis=1) 
print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

### 13. Split the dataset into two parts (i.e. 70% train and 30% test) and print the shape of the train and test data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 
print(f"Shape of X_train (training features): {X_train.shape}")
print(f"Shape of X_test (testing features): {X_test.shape}")
print(f"Shape of y_train (training target): {y_train.shape}")
print(f"Shape of y_test (testing target): {y_test.shape}")
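
If the target turns out to be imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. This is a sketch on made-up data (`X_toy` and `y_toy` are hypothetical), not a change to the split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 10% positives
X_toy = pd.DataFrame({"x": np.arange(100)})
y_toy = pd.Series([1] * 10 + [0] * 90)

# stratify=y_toy preserves the 10% positive share in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```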

### 14. Standardize the numerical columns using the StandardScaler approach for both train and test data.

In [None]:
from sklearn.preprocessing import StandardScaler
# Assumption: these are the continuous columns from the dataset description; adjust the list if yours differ
numerical_features = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
scaler = StandardScaler()
# Work on copies so that the encoded columns are kept alongside the scaled ones
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
# Fit on the training data only, then apply the same transformation to the test data
X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test_scaled[numerical_features] = scaler.transform(X_test[numerical_features])
print(X_train_scaled.head())
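
An alternative way to scale only some columns is to bundle the scaler and a model into one pipeline, so the scaler is always fitted on training data only. The sketch below uses `ColumnTransformer` on made-up data; the frame `X_toy`, its column names, and the random labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two continuous columns and one dummy-encoded column
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "BMI": rng.normal(28, 5, size=200),
    "SleepTime": rng.normal(7, 1.5, size=200),
    "Smoking_Yes": rng.integers(0, 2, size=200),
})
y_toy = rng.integers(0, 2, size=200)

# Scale only the continuous columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["BMI", "SleepTime"])],
    remainder="passthrough",
)
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_toy, y_toy)

# The transformed output lists the scaled columns first, then the passthrough columns
scaled = pipe.named_steps["prep"].transform(X_toy)
print(scaled[:, :2].mean(axis=0).round(6))
```

The design choice here is convenience: calling `pipe.fit` and `pipe.predict` applies the same preprocessing consistently, which also works inside cross-validation.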

### 15. Write a function.
- i) Which can take the model and data as inputs.
- ii) Fits the model with the train data.
- iii) Makes predictions on the test set.
- iv) Returns the Accuracy Score.

In [None]:
from sklearn.metrics import accuracy_score

def fit_predict_evaluate(model, X_train, y_train, X_test, y_test):
  """
  Fits a model, makes predictions on a testing set, and returns the accuracy score.

  Args:
      model: The machine learning model to be trained and evaluated.
      X_train: The training features.
      y_train: The training target variable.
      X_test: The testing features.
      y_test: The testing target variable.

  Returns:
      The accuracy score of the model on the testing set.
  """
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  return accuracy
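
Accuracy is what the exercise asks for, but on an imbalanced target it can look high even when few positive cases are found. As a sketch, a hypothetical variant of the function above, here called `fit_predict_report`, could return precision, recall, and F1 as well. The data comes from `make_classification` and is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def fit_predict_report(model, X_train, y_train, X_test, y_test):
    """Hypothetical variant: fit, predict, and return several metrics as a dict."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # zero_division=0 avoids warnings when a model predicts no positives
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic, imbalanced data (about 10% positives) for illustration only
X_syn, y_syn = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)
report = fit_predict_report(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te)
print(report)
```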


### 16. Use the function and train a Logistic regression, KNN, Naive Bayes, Decision tree, Random Forest, Adaboost, GradientBoost, and Stacked Classifier models and make predictions on test data and evaluate the models, compare and write your conclusions and steps to be taken in future in order to improve the accuracy of the model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
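# fit_predict_evaluate is defined in step 15 and reused here
models = [
  LogisticRegression(max_iter=1000),
  KNeighborsClassifier(n_neighbors=5),
  GaussianNB(),
  DecisionTreeClassifier(),
  RandomForestClassifier(n_estimators=100),
  AdaBoostClassifier(n_estimators=100),
  GradientBoostingClassifier(n_estimators=100),
  # Stacked classifier; the base estimators below are an illustrative choice
  StackingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('dt', DecisionTreeClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
  ),
]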
for model in models:
  model_name = model.__class__.__name__ 
  accuracy = fit_predict_evaluate(model, X_train_scaled, y_train, X_test_scaled, y_test)
  print(f"{model_name} Accuracy: {accuracy:.4f}")
print("\nConclusions:")
print("\nSteps to improve accuracy:")
from sklearn.model_selection import GridSearchCV
param_grid = {
  'n_estimators': [50, 100, 200],
  'max_depth': [3, 5, 8]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
best_forest = grid_search.best_estimator_
best_forest_accuracy = fit_predict_evaluate(best_forest, X_train_scaled, y_train, X_test_scaled, y_test)
print(f"\nBest Random Forest (after tuning) Accuracy: {best_forest_accuracy:.4f}")
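
When comparing the accuracy scores, a majority-class baseline gives a useful reference point, especially if the target is imbalanced. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (`X_toy` and `y_toy` are hypothetical) to show how a model that never predicts heart disease can still reach high accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (hypothetical): 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)
baseline_recall = recall_score(y_toy, y_pred, zero_division=0)
print(baseline_accuracy, baseline_recall)
```

Any real model should beat this baseline's accuracy, and its recall on the positive class is worth reporting next to it.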

### Conclusion

The provided code offers a framework for training, evaluating, and comparing various machine learning models for heart disease prediction. By analyzing the accuracy scores of different models, you can gain insights into their effectiveness.

Key Takeaways:

- The code trains and evaluates several classification models, including Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest, AdaBoost, and GradientBoost, on a heart disease dataset.
- The `fit_predict_evaluate` function provides a reusable approach to assess model performance.
- The conclusions section prompts you to analyze the accuracy scores and identify the best performing model. Factors like data quality, feature selection, and hyperparameter tuning can influence performance.

----
## Happy Learning:)
----