## Learning Outcomes
- Exploratory data analysis & preparing the data for model building. 
- Machine Learning - Supervised Learning Classification
  - Logistic Regression
  - Naive Bayes Classifier
  - KNN Classifier
  - Decision Tree Classifier
  - Random Forest Classifier
  - Ensemble methods
- Training and making predictions using different classification models.
- Model evaluation

## Objective: 
- The classification goal is to predict "heart disease" in a person based on the given factors.

## Context:
- Heart disease is one of the leading causes of death for people of most races in the US. Three key risk factors for heart disease are high blood pressure, high cholesterol, and smoking.
- Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Machine learning methods can detect patterns in the data and predict whether a patient has heart disease.

## Dataset Information

#### Source: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?datasetId=1936563&sortBy=voteCount
Originally, the dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.

This dataset consists of eighteen columns:
- HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
- BMI: Body Mass Index (BMI)
- Smoking: Have you smoked at least 100 cigarettes in your entire life?
- AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
- Stroke: Ever had a stroke?
- PhysicalHealth: For how many days during the past 30 days was your physical health, which includes physical illness and injury, not good?
- MentalHealth: For how many days during the past 30 days was your mental health not good?
- DiffWalking: Do you have serious difficulty walking or climbing stairs?
- Sex: male or female?
- AgeCategory: Fourteen-level age category
- Race: Imputed race/ethnicity value
- Diabetic: Ever had diabetes?
- PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
- GenHealth: Would you say that in general your health is good, fine or excellent?
- SleepTime: On average, how many hours of sleep do you get in a 24-hour period?
- Asthma: Ever had asthma?
- KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
- SkinCancer: Ever had skin cancer?

### 1. Importing Libraries

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


### 2. Load the dataset and display a sample of five rows of the data frame.

In [None]:
# Load the dataset
file_path = "Downloads"  # Replace with the actual path to your dataset
df = pd.read_csv(file_path)

# Display a sample of five rows
sample_data = df.sample(5)
print(sample_data)


### 3. Check the shape of the data (number of rows and columns). Check the general information about the dataframe using the .info() method.

In [None]:
# Check the shape of the data (number of rows and columns)
print("Shape of the DataFrame:", df.shape)

# Check general information about the DataFrame
df.info()


### 4. Check the statistical summary of the dataset and write your inferences.

In [None]:
# Check the statistical summary of the dataset
summary = df.describe()

# Display the summary
print(summary)
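
By default, `describe()` summarizes only the numeric columns, so the many Yes/No and category columns in this dataset are left out. A minimal sketch of how the non-numeric columns could be summarized as well, using a small toy frame (`toy` and its values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; the values are illustrative only
toy = pd.DataFrame({
    "BMI": [22.5, 31.0, 27.3],
    "Smoking": ["Yes", "No", "No"],
})

# Excluding numeric dtypes leaves the categorical columns, which are
# summarized by count, number of unique values, most frequent value, and its frequency
cat_summary = toy.describe(exclude=[np.number])
print(cat_summary)
```

On the real data, the same call on `df` would show the most frequent category and its count for each categorical feature.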


### 5. Check the percentage of missing values in each column of the data frame. Drop the missing values if there are any.

In [None]:
# Check the percentage of missing values in each column
missing_percentage = df.isnull().mean() * 100

# Display the missing percentage for each column
print("Percentage of missing values in each column:")
print(missing_percentage)

# Drop rows with missing values (reassign so later steps use the cleaned data)
df = df.dropna()

# Display the shape of the cleaned DataFrame
print("Shape of the cleaned DataFrame:", df.shape)

### 6. Check if there are any duplicate rows. If there are, drop them and check the shape of the dataframe after dropping duplicates.

In [None]:
# Check for duplicate rows
duplicates_exist = df.duplicated().any()

if duplicates_exist:
    # Drop duplicate rows (reassign so later steps use the de-duplicated data)
    df = df.drop_duplicates()

    # Display the shape of the DataFrame after dropping duplicates
    print("Shape of the DataFrame after dropping duplicates:", df.shape)
else:
    print("No duplicate rows found.")


### 7. Check the distribution of the target variable (i.e. 'HeartDisease') and write your observations.

In [None]:

# Check the distribution of the target variable
heart_disease_distribution = df['HeartDisease'].value_counts()

# Display the distribution
print("Distribution of HeartDisease:")
print(heart_disease_distribution)

# Visualize the distribution
sns.countplot(x='HeartDisease', data=df)
plt.title('Distribution of HeartDisease')
plt.show()
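
Raw counts can make it hard to judge how imbalanced the target is. As a small illustration, class shares can be read off with `value_counts(normalize=True)`. The series below is a made-up stand-in for `df['HeartDisease']`, so the percentages are not those of the real dataset:

```python
import pandas as pd

# Toy stand-in for df['HeartDisease']; the 9:1 split is illustrative only
toy_target = pd.Series(["No"] * 9 + ["Yes"] * 1, name="HeartDisease")

# normalize=True returns class proportions instead of raw counts
class_share = toy_target.value_counts(normalize=True) * 100
print(class_share.round(1))
```

If one class dominates on the real data, accuracy alone will look high even for a model that rarely predicts the minority class, which is worth noting in the observations.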


### 8. Visualize the distribution of the target column 'HeartDisease' with respect to various categorical features and write your observations.

In [None]:
categorical_features = ['Smoking', 'AlcoholDrinking', 'Stroke', 'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer']

plt.figure(figsize=(15, 10))

for i, feature in enumerate(categorical_features, 1):
    plt.subplot(3, 4, i)
    sns.countplot(x=feature, hue='HeartDisease', data=df)
    plt.title(f'Distribution of HeartDisease by {feature}')

plt.tight_layout()
plt.show()

### 9. Check the unique categories in the column 'Diabetic'. Replace 'Yes (during pregnancy)' as 'Yes' and 'No, borderline diabetes' as 'No'.

In [None]:
# Check unique categories in the 'Diabetic' column
unique_categories = df['Diabetic'].unique()
print("Unique categories in 'Diabetic':", unique_categories)

# Replace categories
df['Diabetic'] = df['Diabetic'].replace({'Yes (during pregnancy)': 'Yes', 'No, borderline diabetes': 'No'})

# Verify the changes
print("Updated unique categories in 'Diabetic':", df['Diabetic'].unique())

### 10. For the target column 'HeartDisease', replace 'No' with 0 and 'Yes' with 1.

In [None]:
# Replace 'No' with 0 and 'Yes' with 1 in the 'HeartDisease' column
# (map avoids the downcasting FutureWarning that replace raises in recent pandas versions)
df['HeartDisease'] = df['HeartDisease'].map({'No': 0, 'Yes': 1})

# Verify the changes
print("Updated unique categories in 'HeartDisease':", df['HeartDisease'].unique())


### 11. Label Encode the columns "AgeCategory", "Race", and "GenHealth". Encode the rest of the columns using dummy encoding approach.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Columns to label encode
label_encode_columns = ['AgeCategory', 'Race', 'GenHealth']

# Label encode the specified columns
label_encoder = LabelEncoder()
for column in label_encode_columns:
    df[column] = label_encoder.fit_transform(df[column])

# Columns to dummy encode ('DiffWalking' is included so no string column reaches the models)
dummy_encode_columns = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'Diabetic', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']

# Dummy encode the specified columns
df = pd.get_dummies(df, columns=dummy_encode_columns, drop_first=True)

# Verify the changes
print("Updated DataFrame after encoding:")
print(df.head())
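
One thing to keep in mind with `LabelEncoder` is that it assigns integer codes in sorted (alphabetical) order of the labels, which need not match the natural order of an ordinal feature such as `GenHealth`. A hedged sketch of an explicit ordinal mapping follows. The category labels in `gen_health_order` are an assumption about how the column is coded, so check `df['GenHealth'].unique()` before reusing it:

```python
import pandas as pd

# Assumed category labels and their order; verify against df['GenHealth'].unique()
gen_health_order = {"Poor": 0, "Fair": 1, "Good": 2, "Very good": 3, "Excellent": 4}

# Toy stand-in for the real column
toy = pd.DataFrame({"GenHealth": ["Good", "Poor", "Excellent", "Fair", "Very good"]})

# map() applies the explicit order; unmapped labels would become NaN
toy["GenHealth_ordinal"] = toy["GenHealth"].map(gen_health_order)
print(toy)
```

With this mapping, a larger code always means better self-reported health, which distance-based and linear models can use more meaningfully than alphabetical codes.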


### 12. Store the target column (i.e.'HeartDisease') in the y variable and the rest of the columns in the X variable.

In [None]:
# Store the target column in y
y = df['HeartDisease']

# Store the rest of the columns in X
X = df.drop('HeartDisease', axis=1)

# Verify the shapes of X and y
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


### 13. Split the dataset into two parts (i.e. 70% train and 30% test) and print the shape of the train and test data

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the train and test data
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


### 14. Standardize the numerical columns using the StandardScaler approach for both train and test data.

In [None]:
from sklearn.preprocessing import StandardScaler

# Identify numerical columns
numerical_columns = X.select_dtypes(include=['float64', 'int64']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the scaler on the training data
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])

# Transform the test data using the same scaler
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])


### 15. Write a function.
- i) Which can take the model and data as inputs.
- ii) Fits the model with the train data.
- iii) Makes predictions on the test set.
- iv) Returns the Accuracy Score.

In [None]:
from sklearn.metrics import accuracy_score

def train_and_evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training data and return its accuracy on the test data."""
    # Fit the model with the training data
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate the accuracy score
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)

# Use the function to train and evaluate the model
accuracy = train_and_evaluate_model(rf_model, X_train, y_train, X_test, y_test)

# Print the accuracy score
print("Accuracy Score:", accuracy)
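
The notebook imports `classification_report` and `confusion_matrix` but reports only accuracy. A minimal sketch of how they could be used is shown here on synthetic, imbalanced data from `make_classification`. The names `X_demo`, `y_demo`, and `demo_model` are illustrative and not part of the assignment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% negatives, standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall, and F1; zero_division=0 avoids warnings
# when a class is never predicted
print(classification_report(y_te, y_hat, zero_division=0))
```

On the real data, the same two calls with `y_test` and the model's predictions show how many heart disease cases are actually caught, which accuracy alone does not reveal.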

### 16. Use the function to train Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and Stacked Classifier models. Make predictions on the test data, evaluate and compare the models, and write your conclusions along with future steps to improve model accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),  # higher max_iter to reduce the risk of convergence warnings
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Adaboost': AdaBoostClassifier(random_state=42),
    'Gradient Boost': GradientBoostingClassifier(random_state=42)
}

# Train and evaluate models
for model_name, model in models.items():
    accuracy = train_and_evaluate_model(model, X_train, y_train, X_test, y_test)
    print(f"{model_name} Accuracy Score: {accuracy}")

# Stacked Classifier
stacked_model = StackingClassifier(estimators=list(models.items()), final_estimator=RandomForestClassifier(random_state=42))
stacked_accuracy = train_and_evaluate_model(stacked_model, X_train, y_train, X_test, y_test)
print(f"Stacked Classifier Accuracy Score: {stacked_accuracy}")
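
When comparing the scores, it can help to have a reference point and a side-by-side table. The sketch below, on synthetic imbalanced data, adds a majority-class baseline via `DummyClassifier` and collects accuracy and F1 for the positive class into a DataFrame. The data, the `demo_models` dictionary, and the `results` table are illustrative assumptions, not results from the heart disease dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% negatives, standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

demo_models = {
    # Always predicts the most frequent class seen during training
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "f1_positive": f1_score(y_te, pred, zero_division=0),
    })

# One row per model, best accuracy first
results = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(results)
```

A model is only worth keeping if it clearly beats the baseline row. On imbalanced data the baseline's accuracy is already high while its F1 for the positive class is zero.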


### Conclusion

----
## Happy Learning:)
----