<a href="https://colab.research.google.com/github/CristianRzf/APO23/blob/main/CardioIA_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/content/heart_disease_dataset.csv')
display(df.head())

Unnamed: 0,Age,Gender,Cholesterol,Blood Pressure,Heart Rate,Smoking,Alcohol Intake,Exercise Hours,Family History,Diabetes,Obesity,Stress Level,Blood Sugar,Exercise Induced Angina,Chest Pain Type,Heart Disease
0,75,Female,228,119,66,Current,Heavy,1,No,No,Yes,8,119,Yes,Atypical Angina,1
1,48,Male,204,165,62,Current,,5,No,No,No,9,70,Yes,Typical Angina,0
2,53,Male,234,91,67,Never,Heavy,3,Yes,No,Yes,5,196,Yes,Atypical Angina,1
3,69,Female,192,90,72,Current,,4,No,Yes,No,7,107,Yes,Non-anginal Pain,0
4,62,Female,172,163,93,Never,,6,No,Yes,No,2,183,Yes,Asymptomatic,0


# Task
Perform data preprocessing on the "heart_disease_dataset.csv" dataset, including handling missing values, encoding categorical features, splitting the data into training and testing sets, and scaling numerical features.

## Identify missing values

### Subtask:
Check for missing values in the dataset.


**Reasoning**:
Check for missing values in each column of the dataframe.



In [None]:
missing_values = df.isnull().sum()
display(missing_values)

Unnamed: 0,0
Age,0
Gender,0
Cholesterol,0
Blood Pressure,0
Heart Rate,0
Smoking,0
Alcohol Intake,340
Exercise Hours,0
Family History,0
Diabetes,0
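
All 340 missing entries are in 'Alcohol Intake', and the preview shows blanks only in that column. It may be worth checking (an assumption, not something this notebook confirms) whether the raw file stores a literal `None` category there, since pandas 2.0 and later include the string `None` among `read_csv`'s default missing-value markers. The sketch below uses a small made-up CSV, not the real dataset, to show how `keep_default_na=False` with an explicit `na_values` list keeps such a label as a category.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file's contents are not shown in this notebook.
csv_text = "Age,Alcohol Intake\n75,Heavy\n48,None\n53,Moderate\n"

# Default parsing: depending on the pandas version, "None" may be read as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Alcohol Intake"].isnull().sum())

# Only treat empty fields as missing, so the label "None" survives as a category.
explicit_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
print(explicit_df["Alcohol Intake"].tolist())
```

If the real column does contain such a label, mode imputation would silently relabel those rows, so it is worth confirming before filling.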


## Handle missing values

### Subtask:
Address missing values, potentially by imputation or removal.


**Reasoning**:
Fill the missing values in the 'Alcohol Intake' column with the mode and then verify that there are no remaining missing values.



In [None]:
df['Alcohol Intake'] = df['Alcohol Intake'].fillna(df['Alcohol Intake'].mode()[0])
missing_values_after_imputation = df.isnull().sum()
display(missing_values_after_imputation)

Unnamed: 0,0
Age,0
Gender,0
Cholesterol,0
Blood Pressure,0
Heart Rate,0
Smoking,0
Alcohol Intake,0
Exercise Hours,0
Family History,0
Diabetes,0


## Identify categorical features

### Subtask:
Determine which columns contain categorical data.


**Reasoning**:
Iterate through the columns to identify categorical features based on data type and unique value counts.



In [None]:
categorical_cols = []
print("Unique value counts for each column:")
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"Column '{col}': {unique_count} unique values, dtype: {df[col].dtype}")
    # Consider object type or columns with limited unique values as categorical
    if df[col].dtype == 'object' or unique_count < 20: # Using 20 as a threshold, can be adjusted
        categorical_cols.append(col)

print("\nLikely categorical columns:")
print(categorical_cols)

Unique value counts for each column:
Column 'Age': 55 unique values, dtype: int64
Column 'Gender': 2 unique values, dtype: object
Column 'Cholesterol': 200 unique values, dtype: int64
Column 'Blood Pressure': 90 unique values, dtype: int64
Column 'Heart Rate': 40 unique values, dtype: int64
Column 'Smoking': 3 unique values, dtype: object
Column 'Alcohol Intake': 2 unique values, dtype: object
Column 'Exercise Hours': 10 unique values, dtype: int64
Column 'Family History': 2 unique values, dtype: object
Column 'Diabetes': 2 unique values, dtype: object
Column 'Obesity': 2 unique values, dtype: object
Column 'Stress Level': 10 unique values, dtype: int64
Column 'Blood Sugar': 130 unique values, dtype: int64
Column 'Exercise Induced Angina': 2 unique values, dtype: object
Column 'Chest Pain Type': 4 unique values, dtype: object
Column 'Heart Disease': 2 unique values, dtype: int64

Likely categorical columns:
['Gender', 'Smoking', 'Alcohol Intake', 'Exercise Hours', 'Family History', 'Di
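
The threshold rule flags any column with fewer than 20 distinct values, which also catches integer scales such as 'Exercise Hours' and 'Stress Level' as well as the target itself. A minimal sketch of a stricter split, on a made-up frame that reuses a few of the column names, separates non-numeric columns from low-cardinality numeric ones so each group can be reviewed on its own. The helper name `split_candidate_columns` and the toy values are illustrative assumptions.

```python
import pandas as pd

def split_candidate_columns(frame, target, max_unique=20):
    """Return (non_numeric_cols, low_cardinality_numeric_cols), excluding the target."""
    features = frame.drop(columns=[target])
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    numeric = features.select_dtypes(include="number")
    low_card = [c for c in numeric.columns if numeric[c].nunique() < max_unique]
    return non_numeric, low_card

# Toy data: 30 rows with one text column, one integer scale and a binary target.
toy = pd.DataFrame({
    "Age": range(30, 60),
    "Gender": ["Male", "Female"] * 15,
    "Stress Level": [1, 2, 3] * 10,
    "Heart Disease": [0, 1] * 15,
})
non_numeric, low_card = split_candidate_columns(toy, target="Heart Disease")
print(non_numeric)
print(low_card)
```

Columns in the second group may be ordinal, in which case keeping them as numbers is an alternative to one-hot encoding.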

## Encode categorical features

### Subtask:
Convert categorical features into a numerical format suitable for machine learning models.


**Reasoning**:
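One-hot encode the categorical features (excluding the target variable 'Heart Disease') with `pd.get_dummies`, which replaces each original categorical column with its indicator columns in a single call. Setting `drop_first=True` drops one level per feature so that the indicator columns are not redundant.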



In [None]:
categorical_cols = ['Gender', 'Smoking', 'Alcohol Intake', 'Exercise Hours', 'Family History', 'Diabetes', 'Obesity', 'Stress Level', 'Exercise Induced Angina', 'Chest Pain Type']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
display(df.head())

Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Heart Disease,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
0,75,228,119,66,119,1,False,False,False,False,...,False,False,False,True,False,False,True,True,False,False
1,48,204,165,62,70,0,True,False,False,False,...,False,False,False,False,True,False,True,False,False,True
2,53,234,91,67,196,1,True,False,True,False,...,True,False,False,False,False,False,True,True,False,False
3,69,192,90,72,107,0,False,False,False,False,...,False,False,True,False,False,False,True,False,True,False
4,62,172,163,93,183,0,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False
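
`pd.get_dummies` only creates columns for the category levels present in the frame it is given, so new data encoded on its own can end up with a different column set than the training data. One common workaround, sketched here on made-up rows, is to reindex the newly encoded frame to the training columns. The variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"Smoking": ["Current", "Former", "Never"], "Age": [75, 48, 53]})
new = pd.DataFrame({"Smoking": ["Never", "Never"], "Age": [62, 69]})

train_encoded = pd.get_dummies(train, columns=["Smoking"], drop_first=True)

# Encode without drop_first, then keep exactly the training columns, in the same order.
# Levels absent from the new data are filled with False; unseen levels would be dropped.
new_encoded = pd.get_dummies(new, columns=["Smoking"]).reindex(
    columns=train_encoded.columns, fill_value=False
)
print(train_encoded.columns.tolist())
print(new_encoded)
```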


## Split data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Separate the target variable from the features and then split the data into training and testing sets using train_test_split.



In [None]:
X = df.drop('Heart Disease', axis=1)
y = df['Heart Disease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (800, 34)
Shape of X_test: (200, 34)
Shape of y_train: (800,)
Shape of y_test: (200,)
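
The split above is purely random. If preserving the class balance of 'Heart Disease' in both subsets matters, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic labels (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 40% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# With stratification, both subsets keep the 40% positive rate.
print(y_tr.mean(), y_te.mean())
```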


## Scale numerical features

### Subtask:
Apply scaling to numerical features to ensure they have a similar range.


**Reasoning**:
Instantiate a StandardScaler and fit it on the numerical columns of the training data only, so that no test-set statistics leak into preprocessing. Then transform the numerical columns of both the training and testing sets and replace the original columns with the scaled values.



In [None]:
scaler = StandardScaler()
numerical_cols = ['Age', 'Cholesterol', 'Blood Pressure', 'Heart Rate', 'Blood Sugar']
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
display(X_train.head())
display(X_test.head())

Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,Exercise Hours_1,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
29,1.70694,1.190459,-0.770091,-1.416285,-1.101925,True,False,True,True,False,...,False,False,False,False,True,False,False,False,False,True
535,-0.91494,1.553029,1.533196,1.661639,0.531615,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
695,0.619819,-0.639655,0.211638,1.221936,0.613292,True,False,False,True,False,...,False,True,False,False,False,False,True,False,True,False
557,-1.170734,1.518499,-1.185438,0.342529,-0.829668,True,False,True,False,False,...,False,False,True,False,False,False,False,False,True,False
836,0.300077,-1.123082,-0.354744,1.309877,-0.802442,False,False,True,True,True,...,True,False,False,False,False,False,False,False,False,True


Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,Exercise Hours_1,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
521,1.579044,-0.035372,-1.487508,-0.624819,-1.047473,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
737,-0.659147,0.741563,0.287156,0.078707,-0.230703,True,False,True,False,False,...,False,False,False,False,False,True,False,False,True,False
740,-0.659147,-0.760512,-0.316985,1.573699,0.74942,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
660,-0.787044,0.655237,-1.298714,0.43047,0.477164,False,False,True,False,False,...,False,False,True,False,False,False,True,False,True,False
411,1.642992,-0.639655,-0.619056,1.74958,1.13058,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
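
`StandardScaler` standardizes each column as z = (x - mean) / std, using the mean and population standard deviation learned in `fit`. The following sketch, on made-up numbers, reproduces the transform by hand to show that test rows are scaled with the training statistics rather than their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_vals = np.array([[50.0], [60.0], [70.0]])   # hypothetical training ages
test_vals = np.array([[80.0]])                    # hypothetical test age

demo_scaler = StandardScaler().fit(train_vals)

# Manual z-score with the training mean and population std (ddof=0).
mu = train_vals.mean(axis=0)
sigma = train_vals.std(axis=0)
manual_test = (test_vals - mu) / sigma

print(demo_scaler.transform(test_vals), manual_test)
```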


## Summary:

### Data Analysis Key Findings

*   The dataset initially contained 340 missing values in the 'Alcohol Intake' column.
*   Missing values in the 'Alcohol Intake' column were successfully imputed using the mode of that column.
*   The dataset contains 11 columns identified as likely categorical: 'Gender', 'Smoking', 'Alcohol Intake', 'Exercise Hours', 'Family History', 'Diabetes', 'Obesity', 'Stress Level', 'Exercise Induced Angina', 'Chest Pain Type', and 'Heart Disease'.
*   The specified categorical columns (excluding 'Heart Disease') were one-hot encoded with `drop_first=True`, expanding the DataFrame from 16 to 35 columns (34 features plus the target).
*   The dataset was split into training (80%) and testing (20%) sets, with `X_train` having a shape of (800, 34), `X_test` having a shape of (200, 34), `y_train` having a shape of (800,), and `y_test` having a shape of (200,).
*   The numerical features ('Age', 'Cholesterol', 'Blood Pressure', 'Heart Rate', 'Blood Sugar') in both the training and testing sets were successfully scaled using `StandardScaler`.

### Insights or Next Steps

*   The preprocessed data is now ready for training various machine learning models to predict heart disease.
*   Further exploration of feature selection techniques could be beneficial to potentially improve model performance and reduce dimensionality.


In [None]:
# Make predictions on the imputed simulated data using the Logistic Regression model.
# Requires `model` and `simulated_new_data_imputed`, which are created in cells further down.
predictions_lr_on_imputed_data = model.predict(simulated_new_data_imputed)

print("Predictions with Logistic Regression on Imputed Simulated Data (first 10):")
print(predictions_lr_on_imputed_data[:10])

print("\nDistribution of Predictions with Logistic Regression on Imputed Simulated Data:")
display(pd.Series(predictions_lr_on_imputed_data).value_counts())

Predictions with Logistic Regression on Imputed Simulated Data (first 10):
[1 1 0 1 1 0 0 0 0 0]

Distribution of Predictions with Logistic Regression on Imputed Simulated Data:


Unnamed: 0,count
0,6
1,4


In [None]:
# Make predictions on the imputed simulated data using the Random Forest model
predictions_on_imputed_data = rf_model.predict(simulated_new_data_imputed)

print("Predictions on Imputed Simulated Data (first 10):")
print(predictions_on_imputed_data[:10])

print("\nDistribution of Predictions on Imputed Simulated Data:")
display(pd.Series(predictions_on_imputed_data).value_counts())

Predictions on Imputed Simulated Data (first 10):
[1 1 0 1 1 0 0 0 1 0]

Distribution of Predictions on Imputed Simulated Data:


Unnamed: 0,count
1,5
0,5


## Implement Simple Imputation

### Subtask:
Calculate imputation values from training data and apply to a simulated dataset with missing values.

**Reasoning**:
Calculate the mean of numerical columns and the mode of categorical columns from the training data (`X_train`) to use for imputation, then create a sample dataset with missing values and apply these imputation values. Because `X_train` has already been standardized, the numerical means are approximately zero.

In [None]:
# Calculate imputation values from the training data
# For numerical columns, use the mean
imputation_means = X_train[numerical_cols].mean()

# For categorical columns, use the mode.
# Need to identify the categorical columns in X_train after one-hot encoding
# These are the columns that were created from the original categorical_cols, excluding numerical_cols
categorical_cols_after_encoding = [col for col in X_train.columns if col not in numerical_cols]
imputation_modes = X_train[categorical_cols_after_encoding].mode().iloc[0] # .iloc[0] because mode can return multiple values if there's a tie

print("Imputation Means from X_train:")
display(imputation_means)

print("\nImputation Modes from X_train:")
display(imputation_modes)

# --- Demonstrate imputation on a simulated dataset with missing values ---

# Create a sample simulated dataset with some missing values
# Let's take a few rows from X_test and introduce some NaNs for demonstration
simulated_new_data_with_missing = X_test.head(10).copy()
# Introduce some missing values - intentionally picking some potentially influential columns
# Use iloc for positional indexing to ensure NaNs are placed in the desired rows of the head(10) subset
simulated_new_data_with_missing.iloc[[1, 3], simulated_new_data_with_missing.columns.get_loc('Age')] = np.nan
simulated_new_data_with_missing.iloc[[2, 4], simulated_new_data_with_missing.columns.get_loc('Cholesterol')] = np.nan
simulated_new_data_with_missing.iloc[[5, 6], simulated_new_data_with_missing.columns.get_loc('Blood Pressure')] = np.nan
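# Example for an encoded categorical column: cast to object first so the boolean column can hold NaN
simulated_new_data_with_missing['Stress Level_8'] = simulated_new_data_with_missing['Stress Level_8'].astype(object)
simulated_new_data_with_missing.iloc[[7, 8], simulated_new_data_with_missing.columns.get_loc('Stress Level_8')] = np.nan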


print("\nSimulated New Data with Missing Values (first 10 rows):")
display(simulated_new_data_with_missing)

# Impute missing values using the calculated means and modes
# Impute numerical columns first
simulated_new_data_imputed = simulated_new_data_with_missing.copy()
simulated_new_data_imputed[numerical_cols] = simulated_new_data_imputed[numerical_cols].fillna(imputation_means)

# Impute the one-hot encoded categorical columns with their training-set mode.
# get_dummies produces boolean columns, but a column that received NaN above is object dtype,
# so cast back to bool after filling. The mode is False for most one-hot encoded columns.
for col in categorical_cols_after_encoding:
    simulated_new_data_imputed[col] = simulated_new_data_imputed[col].fillna(imputation_modes[col]).astype(bool)


print("\nSimulated New Data After Imputation:")
display(simulated_new_data_imputed)

# Now you can use rf_model.predict(simulated_new_data_imputed) to make predictions

Imputation Means from X_train:


Unnamed: 0,0
Age,1.665335e-16
Cholesterol,-2.176037e-16
Blood Pressure,-3.907985e-16
Heart Rate,-3.441691e-16
Blood Sugar,1.287859e-16



Imputation Modes from X_train:


Unnamed: 0,0
Gender_Male,False
Smoking_Former,False
Smoking_Never,False
Alcohol Intake_Moderate,False
Exercise Hours_1,False
Exercise Hours_2,False
Exercise Hours_3,False
Exercise Hours_4,False
Exercise Hours_5,False
Exercise Hours_6,False



Simulated New Data with Missing Values (first 10 rows):


Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,Exercise Hours_1,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
521,1.579044,-0.035372,-1.487508,-0.624819,-1.047473,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
737,,0.741563,0.287156,0.078707,-0.230703,True,False,True,False,False,...,False,False,False,False,False,True,False,False,True,False
740,-0.659147,,-0.316985,1.573699,0.74942,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
660,,0.655237,-1.298714,0.43047,0.477164,False,False,True,False,False,...,False,False,True,False,False,False,True,False,True,False
411,1.642992,,-0.619056,1.74958,1.13058,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
678,-0.787044,-1.071286,,-1.504225,1.430062,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
626,-0.211509,-0.62239,,-1.680107,-1.31973,False,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
513,-0.595199,0.603441,0.702503,0.078707,-0.557411,True,False,True,False,False,...,True,False,False,,False,False,True,False,True,False
859,0.619819,-0.363412,0.740261,1.573699,-0.938571,True,True,False,False,False,...,True,False,False,,False,False,True,False,False,True
136,0.747716,-0.950429,-1.147679,-0.273056,0.177681,True,False,False,False,False,...,True,False,False,False,False,False,True,False,False,True



Simulated New Data After Imputation:


Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,Exercise Hours_1,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
521,1.579044,-0.03537214,-1.487508,-0.624819,-1.047473,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
737,1.665335e-16,0.7415633,0.2871557,0.078707,-0.230703,True,False,True,False,False,...,False,False,False,False,False,True,False,False,True,False
740,-0.6591472,-2.176037e-16,-0.3169851,1.573699,0.74942,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
660,1.665335e-16,0.6552371,-1.298714,0.43047,0.477164,False,False,True,False,False,...,False,False,True,False,False,False,True,False,True,False
411,1.642992,-2.176037e-16,-0.6190556,1.74958,1.13058,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
678,-0.7870439,-1.071286,-3.907985e-16,-1.504225,1.430062,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
626,-0.211509,-0.62239,-3.907985e-16,-1.680107,-1.31973,False,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
513,-0.5951989,0.6034414,0.7025025,0.078707,-0.557411,True,False,True,False,False,...,True,False,False,False,False,False,True,False,True,False
859,0.619819,-0.3634116,0.7402613,1.573699,-0.938571,True,True,False,False,False,...,True,False,False,False,False,False,True,False,False,True
136,0.7477156,-0.9504294,-1.147679,-0.273056,0.177681,True,False,False,False,False,...,True,False,False,False,False,False,True,False,False,True
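
The same fit-on-train, apply-to-new-data pattern is available through scikit-learn's `SimpleImputer`. The sketch below is an alternative to the manual `fillna` approach, not what this notebook runs. It uses made-up values and encodes the indicator column as 0.0/1.0 so that both strategies operate on numeric arrays.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and new data (already scaled / encoded as floats).
train_demo = pd.DataFrame({"Age": [-1.0, 0.0, 1.0, 2.0], "Stress Level_8": [0.0, 0.0, 1.0, 0.0]})
new_demo = pd.DataFrame({"Age": [np.nan, 0.3], "Stress Level_8": [1.0, np.nan]})

# Learn the fill values from the training data only.
mean_imputer = SimpleImputer(strategy="mean").fit(train_demo[["Age"]])
mode_imputer = SimpleImputer(strategy="most_frequent").fit(train_demo[["Stress Level_8"]])

# Apply them to the new data.
imputed = new_demo.copy()
imputed[["Age"]] = mean_imputer.transform(new_demo[["Age"]])
imputed[["Stress Level_8"]] = mode_imputer.transform(new_demo[["Stress Level_8"]])
print(imputed)
```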


## Evaluate Random Forest on Simulated Data

### Subtask:
Generate a simulated dataset and inspect the trained Random Forest model's predictions on it.

**Reasoning**:
Create a simulated dataset with the same columns as the training data and make predictions with the trained Random Forest model. Because the simulated rows have no true labels, only the distribution of predictions can be examined, not accuracy.

In [None]:
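# Generate simulated data (example: a random dataset with the same columns as X_test).
# np.random.rand draws every value uniformly from [0, 1), so the numerical features are far below
# their real ranges and the one-hot columns hold fractions rather than True/False.
# In a real scenario, you would load or generate meaningful new data.
simulated_data = pd.DataFrame(np.random.rand(100, X_test.shape[1]), columns=X_test.columns)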

# Scale the numerical features in the simulated data using the same scaler fitted on the training data
simulated_data[numerical_cols] = scaler.transform(simulated_data[numerical_cols])


# Make predictions on the simulated data
simulated_predictions = rf_model.predict(simulated_data)

# Since we don't have true labels for simulated data, we can't calculate standard evaluation metrics like accuracy,
# classification report, or confusion matrix in the same way as with the test set.
# However, we can display the simulated data and the predictions.

print("Simulated Data (first 5 rows):")
display(simulated_data.head())

print("\nSimulated Predictions (first 10):")
print(simulated_predictions[:10])

# You can further analyze the distribution of predictions on the simulated data
print("\nSimulated Predictions Distribution:")
display(pd.Series(simulated_predictions).value_counts())

Simulated Data (first 5 rows):


Unnamed: 0,Age,Cholesterol,Blood Pressure,Heart Rate,Blood Sugar,Gender_Male,Smoking_Former,Smoking_Never,Alcohol Intake_Moderate,Exercise Hours_1,...,Stress Level_5,Stress Level_6,Stress Level_7,Stress Level_8,Stress Level_9,Stress Level_10,Exercise Induced Angina_Yes,Chest Pain Type_Atypical Angina,Chest Pain Type_Non-anginal Pain,Chest Pain Type_Typical Angina
0,-3.314818,-4.302011,-5.097596,-6.945705,-3.684727,0.917559,0.580055,0.787936,0.312808,0.035106,...,0.66209,0.068771,0.035612,0.193073,0.468663,0.016639,0.321828,0.961418,0.284096,0.484178
1,-3.281202,-4.311626,-5.098451,-6.899687,-3.670027,0.799205,0.188477,0.018638,0.102411,0.840271,...,0.012125,0.104718,0.860916,0.66915,0.31781,0.378055,0.215376,0.654664,0.405828,0.384143
2,-3.325247,-4.316261,-5.096885,-6.948334,-3.679123,0.943967,0.965732,0.410632,0.621908,0.776222,...,0.112082,0.694139,0.145641,0.381478,0.88613,0.807475,0.854286,0.043892,0.135446,0.06082
3,-3.316957,-4.307599,-5.09159,-6.896674,-3.662032,0.015895,0.905546,0.7167,0.383839,0.379352,...,0.023771,0.13427,0.796654,0.892583,0.673484,0.505794,0.594476,0.119806,0.664515,0.864606
4,-3.297137,-4.303182,-5.103445,-6.878938,-3.665048,0.27975,0.981651,0.823064,0.489077,0.374366,...,0.217205,0.883107,0.290454,0.544933,0.935163,0.97572,0.139669,0.597418,0.246778,0.084462



Simulated Predictions (first 10):
[0 0 0 0 0 0 0 0 0 0]

Simulated Predictions Distribution:


Unnamed: 0,count
0,100


## Perform Cross-Validation

### Subtask:
Perform cross-validation for both Logistic Regression and Random Forest models.

**Reasoning**:
Use cross_val_score to perform k-fold cross-validation (e.g., 5 folds) on both the Logistic Regression and Random Forest models and display the mean accuracy and standard deviation.

In [None]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation for Logistic Regression
# Note: X is the full, unscaled feature matrix, which is why the solver emits convergence warnings
lr_cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Logistic Regression Cross-Validation Accuracy: {lr_cv_scores.mean():.4f} (+/- {lr_cv_scores.std():.4f})")

# Perform cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X, y, cv=5) # Using the whole dataset X, y for cross-validation
print(f"Random Forest Cross-Validation Accuracy: {rf_cv_scores.mean():.4f} (+/- {rf_cv_scores.std():.4f})")

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression Cross-Validation Accuracy: 0.8380 (+/- 0.0406)
Random Forest Cross-Validation Accuracy: 0.9990 (+/- 0.0020)
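
The convergence warnings come from fitting Logistic Regression on unscaled features. A common pattern, shown here as a sketch on synthetic data from `make_classification` rather than the heart disease data, is to wrap the scaler and the model in a pipeline so that scaling is re-fitted inside each fold. The value `max_iter=1000` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is fitted on the training folds only, then applied to the held-out fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
```

For this notebook's mix of numerical and one-hot columns, a `ColumnTransformer` could restrict the scaling step to `numerical_cols`.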


## Train and Evaluate Random Forest Model

### Subtask:
Train a Random Forest Classifier and evaluate its performance.

**Reasoning**:
Instantiate a Random Forest Classifier, train it on the training data, make predictions on the test data, and then evaluate its performance using accuracy, classification report, and confusion matrix.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test data
rf_y_pred = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_y_pred)
rf_classification_rep = classification_report(y_test, rf_y_pred)
rf_conf_matrix = confusion_matrix(y_test, rf_y_pred)

print(f"Random Forest Accuracy: {rf_accuracy}")
print("\nRandom Forest Classification Report:")
print(rf_classification_rep)
print("\nRandom Forest Confusion Matrix:")
display(rf_conf_matrix)

Random Forest Accuracy: 1.0

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       118
           1       1.00      1.00      1.00        82

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200


Random Forest Confusion Matrix:


array([[118,   0],
       [  0,  82]])
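
A perfect score on the test set is worth a closer look. One quick check, sketched on synthetic data (not the notebook's), is to rank `feature_importances_` from a fitted forest and see whether a handful of features account for nearly all of the predictive signal.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=2, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; larger values mean more influence on the splits.
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```

Applied to this notebook, the same ranking would use `rf_model.feature_importances_` with `X_train.columns` as the index.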

In [None]:
# Get the coefficients of the trained Logistic Regression model (`model`, fitted in the cell below)
coefficients = model.coef_[0]

# Get the feature names
feature_names = X_train.columns

# Create a DataFrame to display coefficients and feature names
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the coefficients for better interpretation
coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

print("Model Coefficients:")
display(coef_df)

Model Coefficients:


Unnamed: 0,Feature,Coefficient
0,Age,3.117945
1,Cholesterol,2.011313
12,Exercise Hours_4,0.714695
27,Stress Level_8,0.618794
21,Stress Level_2,0.443836
10,Exercise Hours_2,0.437443
14,Exercise Hours_6,0.42629
18,Family History_Yes,0.407193
22,Stress Level_3,0.357271
26,Stress Level_7,0.344392


In [None]:
# Instantiate the Logistic Regression model
model = LogisticRegression(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

## Evaluate the model

### Subtask:
Evaluate the performance of the Logistic Regression model.

**Reasoning**:
Calculate and display the accuracy score, classification report, and confusion matrix to assess the model's performance.

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("\nClassification Report:")
print(classification_rep)
print("\nConfusion Matrix:")
display(conf_matrix)

Accuracy: 0.87

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       118
           1       0.85      0.83      0.84        82

    accuracy                           0.87       200
   macro avg       0.87      0.86      0.87       200
weighted avg       0.87      0.87      0.87       200


Confusion Matrix:


array([[106,  12],
       [ 14,  68]])
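
The figures in the classification report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. As a worked check for class 1:

```python
import numpy as np

cm = np.array([[106, 12],
               [14, 68]])  # rows: true 0/1, columns: predicted 0/1
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (68 + 106) / 200
precision_1 = tp / (tp + fp)      # 68 / 80
recall_1 = tp / (tp + fn)         # 68 / 82
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The 14 false negatives (patients with heart disease predicted as healthy) are what keep recall for class 1 below its precision.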