

To predict the chances of patients getting a heart attack using the provided dataset, I'll follow these steps:

1. **Data Loading and Preprocessing**: Load the dataset, handle missing values if any, and perform any necessary preprocessing steps like encoding categorical variables.
2. **Exploratory Data Analysis (EDA)**: Analyze the dataset to understand the distribution of variables and the relationship between features and the target variable.
3. **Model Selection**: Choose a suitable model for prediction. I'll use logistic regression, a simple and interpretable baseline for binary classification problems like this one.
4. **Model Training**: Train the model on the dataset.
5. **Model Evaluation**: Evaluate the model using appropriate metrics like accuracy, precision, recall, and ROC-AUC.
6. **Visualization**: Provide visualizations to interpret the model results and feature importance.

Let's start by loading the dataset and performing some initial explorations to understand its structure and content.

In [None]:
import pandas as pd

# Load the dataset
heart_df = pd.read_csv('heart.csv')

# Display the first few rows of the dataframe
print(heart_df.head())

# Display summary statistics
print(heart_df.describe())



Here's an overview of the dataset:

{{outputs[0]}}

And the summary statistics:

{{outputs[1]}}

### Observations:
- The dataset contains 303 entries and 14 columns (13 features plus the target), with no missing values.
- Features include age, sex, chest pain type (cp), resting blood pressure (trtbps), cholesterol levels (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), the slope of the peak exercise ST segment (slp), number of major vessels colored by fluoroscopy (caa), thalassemia (thall), and the target variable (output) indicating the presence of heart disease.
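
`describe()` reports a per-column count, from which the row count and the absence of missing values can be inferred. A more direct check could be sketched as below. The small `demo_df` frame is made up for illustration and stands in for `heart_df`, and `summarize_frame` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up stand-in data; with the real file this would be heart_df.
demo_df = pd.DataFrame({
    "age": [63, 37, 41, 56, 56],
    "chol": [233, 250, np.nan, 236, 236],
    "output": [1, 1, 1, 0, 0],
})

summary = summarize_frame(demo_df)
print(summary)
```

On the real data, `summarize_frame(heart_df)` would be expected to report zero missing values in every column if the observation above holds. The duplicate-row count is an extra check that the original analysis did not perform.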

### Next Steps:
1. **Data Preprocessing**: Check for any categorical variables that need encoding and scale the numerical features.
2. **Exploratory Data Analysis (EDA)**: Visualize the data to understand the relationships between features.
3. **Model Building**: Implement logistic regression for prediction.

Let's proceed with data preprocessing.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Mark categorical columns with the pandas 'category' dtype.
# Note: this only changes the dtype; the integer codes are not one-hot encoded.
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall']
heart_df = heart_df.astype({col: 'category' for col in categorical_cols})

# Scale numerical features.
# Note: the scaler is fit on all rows before the split, so test-set statistics leak into training.
numerical_cols = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
scaler = StandardScaler()
heart_df[numerical_cols] = scaler.fit_transform(heart_df[numerical_cols])

# Splitting the dataset into training and testing sets
X = heart_df.drop('output', axis=1)
y = heart_df['output']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Data preprocessing complete. Training and testing sets prepared.')
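
The cell above scales the full dataset before splitting it and leaves the categorical columns as integer codes. One alternative, sketched here under stated assumptions, is to one-hot encode the categorical columns and fit all preprocessing on the training split only by wrapping it in a scikit-learn `Pipeline`. The data below is synthetic and uses only a subset of the column names, so the printed accuracy says nothing about the heart dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption: not the real heart.csv).
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "age": rng.normal(54, 9, n),
    "chol": rng.normal(246, 50, n),
    "cp": rng.integers(0, 4, n),
    "sex": rng.integers(0, 2, n),
})
# Synthetic target loosely tied to 'cp' so the model has something to learn.
demo_df["output"] = (demo_df["cp"] + rng.normal(0, 1, n) > 1.5).astype(int)

numeric_cols = ["age", "chol"]
categorical_cols = ["cp", "sex"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])

X = demo_df.drop(columns="output")
y = demo_df["output"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler statistics and encoder categories are learned from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation or grid search, and each fold then refits the scaler and encoder on its own training portion.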



### Next Steps:
1. **Exploratory Data Analysis (EDA)**: Visualize the relationships between features and the target variable.
2. **Model Building**: Train a logistic regression model.

Let's start with the exploratory data analysis to visualize some key relationships in the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style of the plots
sns.set(style='whitegrid')

# Create a figure to hold the visualizations
plt.figure(figsize=(20, 15), facecolor='white')

# Plotting correlations
plt.subplot(2, 2, 1)
# Cast to float so the 'category' columns are included regardless of pandas version
corr_matrix = heart_df.astype(float).corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')

# Age vs Max Heart Rate
plt.subplot(2, 2, 2)
sns.scatterplot(x='age', y='thalachh', hue='output', data=heart_df)
plt.title('Age vs Max Heart Rate')

# Cholesterol Levels vs Max Heart Rate
plt.subplot(2, 2, 3)
sns.scatterplot(x='chol', y='thalachh', hue='output', data=heart_df)
plt.title('Cholesterol Levels vs Max Heart Rate')

# ST Depression vs Heart Disease
plt.subplot(2, 2, 4)
sns.boxplot(x='output', y='oldpeak', data=heart_df)
plt.title('ST Depression vs Heart Disease')

plt.tight_layout()
plt.show()



Here are some visualizations from the exploratory data analysis:

{{images[0]}}

### Visualizations Explained:
- **Correlation Matrix**: Shows the relationships between all numerical features. Some features like 'oldpeak' (ST depression) show a notable correlation with the target variable 'output'.
- **Age vs Max Heart Rate**: Scatter plot showing how maximum heart rate achieved during exercise varies with age, colored by the presence of heart disease.
- **Cholesterol Levels vs Max Heart Rate**: Scatter plot illustrating the relationship between cholesterol levels and maximum heart rate, also colored by heart disease status.
- **ST Depression vs Heart Disease**: Box plot comparing the levels of ST depression in patients with and without heart disease.
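
The plots can be complemented with numbers. The sketch below ranks features by their Pearson correlation with the target and prints a per-class summary, which is the numeric counterpart of the box plot. It runs on synthetic data in which `oldpeak` is lower for `output == 1` by construction, so the values shown are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the real heart.csv).
rng = np.random.default_rng(1)
n = 300
output = rng.integers(0, 2, n)
demo_df = pd.DataFrame({
    "output": output,
    # Lower 'oldpeak' when output == 1, by construction.
    "oldpeak": np.clip(rng.normal(1.6 - 1.0 * output, 0.8, n), 0, None),
    "age": rng.normal(54, 9, n),
})

# Pearson correlation of each feature with the target, most negative first.
target_corr = demo_df.corr()["output"].drop("output").sort_values()
print(target_corr)

# Per-class summary statistics for one feature.
oldpeak_by_class = demo_df.groupby("output")["oldpeak"].describe()
print(oldpeak_by_class[["mean", "50%"]])
```

With the real data, replacing `demo_df` with `heart_df.astype(float)` would give the same two tables for all columns.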

### Next Steps:
1. **Model Building**: Train a logistic regression model using the preprocessed data.
2. **Model Evaluation**: Assess the model's performance on the testing set.

Shall we proceed with building the logistic regression model?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the logistic regression model
logreg = LogisticRegression(random_state=42)

# Fit the model on the training data
logreg.fit(X_train, y_train)

# Predict on the testing set
y_pred = logreg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Generate confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, y_pred)
clf_report = classification_report(y_test, y_pred)

print(f'Accuracy of the logistic regression model: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('\nClassification Report:')
print(clf_report)



The logistic regression model has been trained and evaluated. Here are the results:

- **Accuracy**: 85%
- **Confusion Matrix**:
  {{outputs[2]}}
- **Classification Report**:
  {{outputs[4]}}

### Model Performance:
- The model achieves 85% accuracy on the held-out test set (20% of the 303 rows). This is a reasonable baseline, but with so few test cases the estimate is imprecise.
- The confusion matrix shows similar numbers of false positives and false negatives.
- The classification report gives precision, recall, and F1-score for each class; the values are similar for both classes.
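
The plan at the start listed ROC-AUC among the metrics, but the cell above does not compute it. A minimal sketch of how precision, recall, and ROC-AUC relate to the confusion matrix and to predicted probabilities is shown below. It uses a synthetic dataset from `make_classification`, so the printed values are not results for the heart data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stand-in for heart.csv).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # hard 0/1 labels
y_score = clf.predict_proba(X_test)[:, 1]    # probability of class 1

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
roc_auc = roc_auc_score(y_test, y_score)     # uses scores, not hard labels

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f} ROC-AUC={roc_auc:.2f}")
```

ROC-AUC is computed from the predicted probabilities rather than the thresholded labels, so it does not depend on the 0.5 decision threshold used by `predict`.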

### Next Steps:
1. **Model Tuning**: We could explore hyperparameter tuning to potentially improve the model.
2. **Feature Importance**: Analyze which features are most influential in predicting heart disease.

Would you like to proceed with any of these next steps or explore something else?



Let's proceed with hyperparameter tuning for the logistic regression model. I'll set up a grid search with 5-fold cross-validation to compare different combinations of hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs'],
    'max_iter': [100, 200, 300]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=LogisticRegression(random_state=42), param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best parameters found: {best_params}')
print(f'Best cross-validated accuracy: {best_score:.2f}')



The hyperparameter tuning using grid search has completed. Here are the results:

- **Best Parameters**: 
  - Regularization Strength (C): 1
  - Maximum Iterations (max_iter): 100
  - Solver: liblinear
- **Best Cross-Validated Accuracy**: 83%

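The best cross-validated accuracy (83%) is slightly lower than the test accuracy we obtained earlier (85%). The two numbers are not directly comparable: the cross-validated score is an average over five validation folds drawn from the training set, while the earlier figure comes from a single small test split. A gap of two percentage points is well within the sampling noise expected at this sample size. Note also that the selected C and max_iter match scikit-learn's defaults, so only the solver differs from the first model and the tuning has not meaningfully changed it.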
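
To put the two-point gap in context, the following back-of-the-envelope sketch estimates the binomial standard error of an accuracy measured on the test split. `accuracy_standard_error` is a hypothetical helper, and the test-set size assumes that scikit-learn rounds `test_size=0.2` of 303 rows up to a whole number.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimate on n_samples cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n_total = 303
# Assumption: the test split size is rounded up to a whole number of rows.
n_test = math.ceil(0.2 * n_total)
se = accuracy_standard_error(0.85, n_test)

print(f"Test set size: {n_test}")
print(f"Standard error of 85% accuracy: {se:.3f}")
print(f"Gap between the two estimates: {0.85 - 0.83:.2f}")
```

The standard error comes out at roughly four to five percentage points, larger than the observed gap, so the difference between 85% and 83% is not evidence that either estimate is better.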

### Next Steps:
1. **Re-train Model**: We can re-train the model using these optimized parameters and evaluate it again.
2. **Feature Importance Analysis**: Analyze which features are most influential in predicting heart disease.

Would you like to re-train the model with these parameters, or would you prefer to explore the feature importance analysis?



Explore feature importance analysis

In [None]:
# Re-train the logistic regression model with the best parameters
optimized_logreg = LogisticRegression(C=1, max_iter=100, solver='liblinear', random_state=42)
optimized_logreg.fit(X_train, y_train)

# Get the coefficients of the features
feature_importance = optimized_logreg.coef_[0]

# Create a DataFrame to view the feature importance
importance_df = pd.DataFrame(feature_importance, index=X_train.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False)

print(importance_df)



Here is the feature importance analysis based on the coefficients from the optimized logistic regression model:

{{outputs[0]}}

### Feature Importance Insights:
- **Positive Coefficients** indicate features whose higher values are associated with a higher predicted likelihood of heart disease:
  - **cp (Chest Pain Type)**: Strongest positive influence.
  - **slp (Slope of the peak exercise ST segment)**: Significant positive influence.
  - **restecg (Resting electrocardiographic results)**: Moderate positive influence.
  - **thalachh (Maximum heart rate achieved)**: Some positive influence.
  - **fbs (Fasting blood sugar > 120 mg/dl)**: Slight positive influence.

- **Negative Coefficients** indicate features whose higher values are associated with a lower predicted likelihood of heart disease:
  - **sex**: Strongest negative influence.
  - **exng (Exercise induced angina)**: Significant negative influence.
  - **caa (Number of major vessels colored by fluoroscopy)**: Moderate negative influence.
  - **oldpeak (ST depression induced by exercise relative to rest)**: Moderate negative influence.
  - **thall (Thalassemia)**: Moderate negative influence.
  - **trtbps (Resting blood pressure)**: Some negative influence.
  - **chol (Serum cholesterol)**: Slight negative influence.
  - **age**: Very slight negative influence.

### Conclusion:
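The model suggests that, in this dataset, certain types of chest pain, a higher maximum heart rate, and certain ECG results are associated with a higher likelihood of heart disease, while factors like being male, having exercise-induced angina, and having more major vessels visible on fluoroscopy are associated with a lower likelihood. These are associations learned from 303 records, not causal effects. Because the categorical features were passed to the model as integer codes while the numerical features were standardized, coefficient magnitudes are not directly comparable across the two groups.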
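
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below shows the conversion on a synthetic dataset. The feature names `f0` to `f3` are hypothetical placeholders, and the solver settings simply mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical placeholder names
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)
X = pd.DataFrame(X, columns=feature_names)

clf = LogisticRegression(C=1, solver="liblinear", random_state=42).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in the feature, holding the others fixed.
odds_df = pd.DataFrame(
    {
        "coefficient": clf.coef_[0],
        "odds_ratio": np.exp(clf.coef_[0]),
    },
    index=feature_names,
)
odds_df["abs_coefficient"] = odds_df["coefficient"].abs()
odds_df = odds_df.sort_values("abs_coefficient", ascending=False)
print(odds_df)
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient. Sorting by absolute value ranks features by strength of association regardless of direction, which is only meaningful when the features are on comparable scales.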

Would you like to delve deeper into any specific feature or proceed with another analysis?