### Importing Health Data

In [None]:
import pandas as pd
df = pd.read_csv("./heart_cleveland_upload.csv")

df

### Identifying Null Values

In [None]:
sumofnull = df.isnull().sum()
sumofnull

### Examining Data Types

In [None]:
datatype = df.dtypes
datatype

### Identifying Numerical and Categorical Features
We classify the features in our health data into two groups, numerical and categorical, because the two groups are explored and preprocessed differently in the steps below. Note that 'condition' is the target variable used later for modeling; it is listed with the numeric features here, so it is included in the correlation heatmap.

In [None]:
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'condition']
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

print("Numeric Features:")
print(numeric_features)

print("\nCategorical Features:")
print(cat_features)

### Converting Features to Categorical Data Types
We convert the features listed in 'cat_features' ('sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', and 'thal') to the pandas object dtype so that they are treated as categorical rather than numeric variables.

In [None]:
lst = cat_features
df[lst] = df[lst].astype(object)
dtype = df.dtypes

dtype
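As an alternative to `object`, pandas also offers a dedicated `category` dtype. The sketch below uses a small made-up frame (`toy`, not the heart dataset) to show the difference between the two; whether `category` works unchanged with the later cells of this notebook is not verified here.

```python
import pandas as pd

# Hypothetical toy frame standing in for two of the coded columns.
toy = pd.DataFrame({"sex": [1, 0, 1, 1], "cp": [0, 3, 2, 3]})

as_object = toy.astype(object)        # what the cell above does
as_category = toy.astype("category")  # pandas' dedicated categorical dtype

print(as_object.dtypes)
print(as_category.dtypes)

# A category column records its distinct levels explicitly.
print(list(as_category["cp"].cat.categories))
```

The `category` dtype keeps the set of levels with the column, which can be convenient when checking which codes actually occur in the data.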

### Exploring Feature Correlations
A heatmap drawn with 'sns.heatmap' visualizes the pairwise correlations between the numerical features. Values near +1 or -1 indicate a strong linear relationship between two features, while values near 0 indicate little linear relationship.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

selected_columns = df[numeric_features]

# Calculating correlation
corr_data = selected_columns.corr()

# Creating heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_data, annot=True, cmap='RdBu', linewidths=0.1)
plt.title('Correlation Between Numeric Features')
plt.show()
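The heatmap can be complemented by ranking features by their correlation with the target column. The following is a minimal sketch on synthetic data; the names `toy`, `a`, `b`, and `target` are made up for illustration and are not columns of the heart dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: 'target' is built to depend on 'a' and not on 'b'.
a = rng.normal(size=n)
b = rng.normal(size=n)
target = (a + 0.5 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({"a": a, "b": b, "target": target})

corr = toy.corr()  # Pearson correlation by default

# Rank the features by absolute correlation with the target column.
ranking = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

In the notebook, the same pattern would apply to `corr_data["condition"]`, assuming 'condition' stays in the numeric feature list.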

### Visualizing Health Conditions
We visualize the distribution of the 'condition' column with a countplot, which shows how many records fall into each condition class.

In [None]:
import matplotlib.pyplot as plt

condition_ax = sns.countplot(x=df["condition"], palette='bwr')
plt.show()

### Analyzing Health Conditions by Gender
We generate a countplot of 'sex' split by 'condition' to see how the condition classes are distributed among males and females.

In [None]:
sex_ax = sns.countplot(x=df["sex"], hue=df['condition'],  palette='bwr')
plt.show()

### Examining Chest Pain Types and Health Conditions
We use a countplot to examine the relationship between chest pain type ('cp') and the health condition. The plot shows how often each condition class occurs for each chest pain type.

In [None]:
cp_ax = sns.countplot(x=df["cp"], hue=df['condition'], palette='bwr')
plt.show()

### Investigating Fasting Blood Sugar Levels and Health Conditions
We use a countplot to investigate the relationship between fasting blood sugar ('fbs') and the health condition. The plot shows how often each condition class occurs for each 'fbs' value.

In [None]:
fbs_ax = sns.countplot(x=df["fbs"], hue=df['condition'], palette='bwr')
plt.show()

### Analyzing Resting Electrocardiographic Results and Health Conditions
We use a countplot to analyze the connection between resting electrocardiographic results ('restecg') and the health condition. The plot shows how often each condition class occurs for each 'restecg' outcome.

In [None]:
restecg_ax = sns.countplot(x=df["restecg"], hue=df['condition'], palette='bwr')
plt.show()

### Examining Exercise-Induced Angina and Health Conditions
We use a countplot to examine the relationship between exercise-induced angina ('exang') and the health condition. The plot shows how the presence or absence of exercise-induced angina is associated with each condition class.

In [None]:
exang_ax = sns.countplot(x=df["exang"], hue=df['condition'], palette='bwr')
plt.show()

### Investigating the Slope of the ST Segment and Health Conditions
We use a countplot to investigate the relationship between the slope of the ST segment ('slope') and the health condition. The plot shows how often each condition class occurs for each slope category.

In [None]:
slope_ax = sns.countplot(x=df["slope"], hue=df['condition'], palette='bwr')
plt.show()

### Analyzing the Number of Major Vessels Colored by Fluoroscopy and Health Conditions
We use a countplot to analyze the relationship between the number of major vessels colored by fluoroscopy ('ca') and the health condition. The plot shows how often each condition class occurs for each vessel count.

In [None]:
ca_ax = sns.countplot(x=df["ca"], hue=df['condition'], palette='bwr')
plt.show()

### Examining Thalassemia and Health Conditions
We use a countplot to examine the relationship between the 'thal' feature (labelled thalassemia here) and the health condition. The plot shows how often each condition class occurs for each 'thal' category.

In [None]:
thal_ax = sns.countplot(x=df["thal"], hue=df['condition'], palette='bwr')
plt.show()
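The counts behind countplots like the ones above can also be tabulated directly, which makes group proportions easier to compare when the groups differ in size. This is a minimal sketch using `pd.crosstab` on a hypothetical toy frame (`toy`), not the heart dataset.

```python
import pandas as pd

# Hypothetical toy data, not the heart dataset.
toy = pd.DataFrame({
    "sex":       [1, 1, 1, 0, 0, 1, 0, 1],
    "condition": [1, 0, 1, 0, 0, 1, 1, 1],
})

# Raw counts: the same numbers a countplot with hue='condition' draws as bars.
counts = pd.crosstab(toy["sex"], toy["condition"])

# Row-normalised shares: the fraction of each 'sex' group in each condition class.
shares = pd.crosstab(toy["sex"], toy["condition"], normalize="index")

print(counts)
print(shares.round(2))
```

In the notebook, replacing `toy` with `df` and "sex" with any name in `cat_features` would give the corresponding table for that feature.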

### Visualizing Age Distribution
We create a histogram of the 'age' column with 20 bins to visualize the age distribution in our dataset. The plot gives an overview of the age demographics of the individuals in the data.

In [None]:
age_col = df['age']

plt.figure(figsize=(10,6))
plt.hist(age_col, bins=20, color='skyblue', alpha=0.7, ec='blue')

plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')

plt.show()

### Visualizing Resting Blood Pressure Distribution
We create a histogram of the 'trestbps' column with 20 bins to visualize the distribution of resting blood pressure in our dataset.

In [None]:
trestbps_col = df['trestbps']

plt.figure(figsize=(10, 6))
plt.hist(trestbps_col, bins=20, color='lightcoral', alpha=0.7, ec='red')

plt.xlabel('trestbps')
plt.ylabel('Frequency')
plt.title('trestbps Distribution')

plt.show()

### Visualizing Cholesterol Distribution
We create a histogram of the 'chol' column with 20 bins to visualize the distribution of cholesterol levels in our dataset.

In [None]:
chol_col = df['chol']

plt.figure(figsize=(10, 6))
plt.hist(chol_col, bins=20, color='lightgreen', alpha=0.7, ec='green')

# Label x-axis and y-axis, and set title
plt.xlabel('Cholesterol (chol)')
plt.ylabel('Frequency')
plt.title('Cholesterol Distribution')

plt.show()

### Visualizing Maximum Heart Rate Distribution
We create a histogram of the 'thalach' column with 20 bins to visualize the distribution of maximum heart rate in our dataset.

In [None]:
thalach_col = df['thalach']

plt.figure(figsize=(10, 6))
plt.hist(thalach_col, bins=20, color='cyan', alpha=0.7, ec="darkblue")

# Label x-axis and y-axis, and set title
plt.xlabel('Maximum Heart Rate (thalach)')
plt.ylabel('Frequency')
plt.title('Maximum Heart Rate Distribution')

plt.show()

### Visualizing ST Depression Distribution
We create a histogram of the 'oldpeak' column with 20 bins to visualize the distribution of ST depression in our dataset.

In [None]:
oldpeak_col = df['oldpeak']

plt.figure(figsize=(10, 6))
plt.hist(oldpeak_col, bins=20, color='orange', alpha=0.7, ec="darkred")

# Label x-axis and y-axis, and set title
plt.xlabel('ST depression')
plt.ylabel('Frequency')
plt.title('Distribution of ST depression')

plt.show()
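The binning used by these histograms can be inspected numerically as well. The sketch below applies `np.histogram` to synthetic values (`ages` is generated here for illustration, not read from the dataset) and uses the same equal-width binning with 20 bins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic ages, used only to illustrate the binning.
ages = rng.integers(29, 78, size=300)

# np.histogram returns the per-bin counts and the bin edges (one more edge than bins).
counts, edges = np.histogram(ages, bins=20)
bin_width = edges[1] - edges[0]

print(counts)
print(round(bin_width, 2))
```

Looking at `counts` and `edges` directly can help when choosing a bin count, since too many bins for a small dataset leaves some bins nearly empty.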

### Analyzing Fasting Blood Sugar Levels and Health Conditions
We use 'sns.catplot' with kind='count' to revisit the relationship between fasting blood sugar ('fbs') and the health condition. This is the figure-level counterpart of the countplot drawn earlier for the same feature.

In [None]:
countplt = sns.catplot(x='fbs', hue='condition', kind='count', alpha=0.85, data=df, palette='bwr')
plt.show()

### Visualizing Chest Pain Types, Age, and Health Conditions
We use a violin plot to visualize the relationship between chest pain type ('cp'), age, and the health condition. The plot shows the age distribution for each chest pain type, split by condition class.

In [None]:
violinplt = sns.catplot(x='cp', y='age', hue='condition', kind='violin', palette='winter', data=df)
plt.show()

### Encoding Categorical Features
We'll encode the categorical features 'cp', 'thal', and 'slope' with one-hot encoding, which turns each category into its own 0/1 column so that the features can be used by the models below.

In [None]:
categorical_cols = ['cp', 'thal', 'slope']

# Note: casting 'oldpeak' to int truncates any fractional part
df['oldpeak'] = df['oldpeak'].astype(int)

# Cast the categorical columns back to integers before encoding
for col in categorical_cols:
    df[col] = df[col].astype(int)

df_encoded = pd.get_dummies(df, columns=categorical_cols, prefix_sep='_', dtype=int)
df_encoded.dtypes
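To make the effect of one-hot encoding concrete, here is a minimal sketch on a three-row toy frame (`toy` is hypothetical, not the heart dataset). It uses the same `pd.get_dummies` arguments as the cell above.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one coded categorical column.
toy = pd.DataFrame({"age": [63, 41, 57], "cp": [0, 2, 3]})

# Each distinct value of 'cp' becomes its own 0/1 column named cp_<value>.
encoded = pd.get_dummies(toy, columns=["cp"], prefix_sep="_", dtype=int)
print(encoded)
```

Only levels that occur in the data get a column: the toy frame has no row with `cp == 1`, so there is no `cp_1` column. The encoded columns are appended after the untouched ones.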

### Preparing Features and Target Variable
We prepare the features and the target variable for modeling. The feature set 'x' is the encoded DataFrame without the 'condition' column, and the target 'y' is the 'condition' column itself. This separation sets the stage for training models that predict the health condition from the features.

In [None]:
# Create DataFrame for predictor variables (features)
x = df_encoded.drop('condition', axis=1)

# Set 'y' as the target variable
y = df_encoded['condition']


print("Features (x):")
print(x.head())

print("\nTarget Variable (y):")
print(y.head())

### Scaling Features
We use the MinMaxScaler from the sklearn library to scale the feature set 'x'. Scaling maps every feature to the range 0 to 1, which prevents a feature from dominating the analysis merely because of its magnitude. Note that 'fit_transform' returns a NumPy array, so 'x' is no longer a DataFrame after this cell.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the 'x' DataFrame
x = scaler.fit_transform(x)

#--- Inspect data ---
x
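Min-max scaling computes `(value - min) / (max - min)` for each column. The sketch below checks this formula by hand against `MinMaxScaler` on three made-up blood pressure readings (`values` is illustrative data, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three illustrative readings in a single column.
values = np.array([[120.0], [140.0], [200.0]])

# Manual min-max scaling, column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(scaled.ravel())
```

Both print `[0.   0.25 1.  ]`: the smallest value maps to 0, the largest to 1, and 140 lies a quarter of the way between 120 and 200.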

### Splitting the Data into Training and Testing Sets
We'll split our dataset into training and testing sets using the train_test_split function from the sklearn library. The training set ('X_train' and 'Y_train') is used to train our predictive models, while the testing set ('X_test' and 'Y_test') is reserved for evaluating their performance. With train_size=0.8 and test_size=0.2, 80% of the rows go to training and 20% to testing, so the models are trained and tested on different subsets of the data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=4)


print("Training set - Predictor variables (X_train):", X_train.shape)
print("Testing set - Predictor variables (X_test):", X_test.shape)
print("Training set - Target variable (Y_train):", Y_train.shape)
print("Testing set - Target variable (Y_test):", Y_test.shape)
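Two common variations on this split are worth knowing: `stratify` keeps the class proportions equal in both subsets, and fitting the scaler on the training rows only keeps test-set minima and maxima out of the training features. The sketch below shows both on synthetic data; `X_raw`, `y_toy`, and the other names are made up for illustration, and this is not the procedure used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
X_raw = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic features
y_toy = np.array([0] * 60 + [1] * 40)                # synthetic labels, 40% positive

# stratify=y_toy preserves the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

# Learn min/max from the training rows only, then apply to both subsets.
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # values may fall slightly outside [0, 1]

print(X_tr.shape, X_te.shape)
print(y_te.mean())
```

With this ordering the test set plays no part in choosing the scaling parameters, at the cost of test values occasionally landing just outside the 0 to 1 range.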

### Building and Evaluating Logistic Regression Model
We build a logistic regression model using the sklearn library. We create 'lr_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'lr_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create logistic regression model
lr_model = LogisticRegression()

# Fit the model to the training data
lr_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
lr_cv_results = cross_val_score(lr_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
lr_mean_score = round(lr_cv_results.mean(), 4)

print("Mean accuracy from cross-validation:", lr_mean_score)

### Building and Evaluating Linear Discriminant Analysis Model
We construct a Linear Discriminant Analysis (LDA) model using the sklearn library. We create 'ldr_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'ldr_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Create LDA model
ldr_model = LinearDiscriminantAnalysis()

# Fit the model to the training data
ldr_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
ldr_cv_results = cross_val_score(ldr_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
ldr_mean_score = round(ldr_cv_results.mean(), 4)

print("Mean accuracy from cross-validation:", ldr_mean_score)

### Building and Evaluating K-Nearest Neighbors (KNN) Model
We construct a K-Nearest Neighbors (KNN) model using the sklearn library. We create 'knn_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'knn_mean_score' is the mean accuracy across the folds and 'knn_std_score' is the standard deviation of the fold scores.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create KNN model
knn_model = KNeighborsClassifier()

# Fit the model to the training data
knn_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
knn_cv_results = cross_val_score(knn_model, X_train, Y_train, cv=10)

# Calculate mean score and standard deviation from cross-validation
knn_mean_score = round(knn_cv_results.mean(), 4)
knn_std_score = round(knn_cv_results.std(), 4)

print("Mean accuracy from cross-validation:", knn_mean_score)
print("Standard deviation of scores:", knn_std_score)


### Building and Evaluating Decision Tree Classifier Model
We build a Decision Tree Classifier model using the sklearn library. We create 'dt_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'dt_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree Classifier model
dt_model = DecisionTreeClassifier()

# Fit the model to the training data
dt_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
dt_cv_results = cross_val_score(dt_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
dt_mean_score = round(dt_cv_results.mean(), 4)

print("Mean accuracy from cross-validation:", dt_mean_score)

### Building and Evaluating Gaussian Naive Bayes Model
We construct a Gaussian Naive Bayes model using the sklearn library. We create 'gnb_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'gnb_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.naive_bayes import GaussianNB

# Create Gaussian Naive Bayes model
gnb_model = GaussianNB()

# Fit the model to the training data
gnb_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
gnb_cv_results = cross_val_score(gnb_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
gnb_mean_score = round(gnb_cv_results.mean(), 4)

print("Mean accuracy from cross-validation:", gnb_mean_score)


### Building and Evaluating Random Forest Classifier Model
We construct a Random Forest Classifier model using the sklearn library. We create 'rf_model' with 100 trees and max_features='sqrt', so each split considers roughly the square root of the number of features. The model is trained on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'rf_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the number of trees and maximum number of features
num_trees = 100
max_features = 'sqrt'

# Create Random Forest Classifier model
rf_model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

# Fit the model to the training data
rf_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
rf_cv_results = cross_val_score(rf_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
rf_mean_score = round(rf_cv_results.mean(), 4)

print("Mean accuracy from cross-validation:", rf_mean_score)


### Building and Evaluating Support Vector Classifier (SVC) Model
We construct a Support Vector Classifier (SVC) model using the sklearn library. We create 'sv_model' and train it on the training data, 'X_train' and 'Y_train'. We then estimate its accuracy with 10-fold cross-validation on the training set; 'sv_mean_score' is the mean accuracy across the folds.

In [None]:
from sklearn.svm import SVC

# Create Support Vector Classifier (SVC) model
sv_model = SVC()

# Fit the model to the training data
sv_model.fit(X_train, Y_train)

# Perform cross-validation with 10 folds
sv_cv_results = cross_val_score(sv_model, X_train, Y_train, cv=10)

# Calculate mean score from cross-validation
sv_mean_score = round(sv_cv_results.mean(), 4)


print("Mean accuracy from cross-validation:", sv_mean_score)
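Since the seven model sections above follow the same pattern, the comparison can be condensed into one loop. The following is a sketch on synthetic data from `make_classification`; `X_demo`, `y_demo`, `models`, and `scores` are illustrative names, and the fixed `random_state` values are an added assumption to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0),
    "SVC": SVC(),
}

scores = {}
for name, model in models.items():
    # cross_val_score clones and refits the estimator on each fold.
    cv = cross_val_score(model, X_demo, y_demo, cv=10)
    scores[name] = (round(cv.mean(), 4), round(cv.std(), 4))

# Print the models from highest to lowest mean accuracy.
for name, (mean, std) in sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {mean} +/- {std}")
```

Because `cross_val_score` fits its own copies of each estimator, a separate `fit` call is not needed for the cross-validation scores; it matters only when the fitted model is used afterwards, as in the test-set evaluation.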

### Evaluating Model Performance

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Use the selected model to predict on the test data
y_pred = lr_model.predict(X_test)  

# Calculate accuracy score
accuracy = accuracy_score(Y_test, y_pred)

# Generate confusion matrix
cm = confusion_matrix(Y_test, y_pred)

# Create classification report
cr = classification_report(Y_test, y_pred)

print("Accuracy:", accuracy)
print("\nConfusion Matrix:")
print(cm)
print("\nClassification Report:")
print(cr)
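The numbers in the classification report can be derived from the four cells of a binary confusion matrix. The sketch below works this through on ten made-up labels (`y_true` and `y_hat` are illustrative, not model output from this notebook).

```python
from sklearn.metrics import confusion_matrix

# Ten made-up true labels and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # recall for class 1
specificity = tn / (tn + fp)  # recall for class 0
precision = tp / (tp + fp)

print(tn, fp, fn, tp)
print(accuracy, round(sensitivity, 4), specificity, precision)
```

Here the matrix cells are tn=3, fp=1, fn=2, tp=4, giving an accuracy of 0.7, a sensitivity of about 0.6667, a specificity of 0.75, and a precision of 0.8. In a screening setting the false negatives (missed cases) are often the number to watch, which accuracy alone does not show.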


### Making Predictions with Gaussian Naive Bayes Model
We use the trained Gaussian Naive Bayes model to make a prediction on new data. The 'data' variable holds one sample whose values are already scaled and one-hot encoded in the same column order as the training features, and 'gnb_model' predicts the corresponding health condition. This shows how the model classifies a new instance based on the provided features.

In [None]:
# One sample in the column order of the encoded, scaled feature matrix
data = [[0.254, 1, 0.487, 0.362,  # age_scaled, sex, trestbps_scaled, chol_scaled
         1, 0.5, 0.641, 1,        # fbs, restecg_scaled, thalach_scaled, exang
         0.672, 0.863, 0, 0,      # oldpeak_scaled, ca_scaled, cp_0, cp_1
         0, 1, 0, 0,              # cp_2, cp_3, thal_0, thal_1
         0, 1, 0, 1]]             # thal_2, slope_0, slope_1, slope_2

# Predict the condition for the sample with the trained Gaussian Naive Bayes model
prediction = gnb_model.predict(data)


print("Prediction:", prediction)
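When a new record arrives in raw units rather than pre-scaled values, it has to go through the same fitted scaler before prediction. The sketch below illustrates this on synthetic two-column data; `X_raw`, `demo_scaler`, `demo_model`, and `new_raw` are hypothetical names, and the data is generated here rather than taken from the heart dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic raw features: an age-like and a cholesterol-like column.
X_raw = np.column_stack(
    [rng.integers(30, 80, 100), rng.integers(150, 350, 100)]
).astype(float)
y_toy = (X_raw[:, 0] > 55).astype(int)  # synthetic label driven by the first column

# Fit the scaler and the model on the same (scaled) data.
demo_scaler = MinMaxScaler().fit(X_raw)
demo_model = GaussianNB().fit(demo_scaler.transform(X_raw), y_toy)

# A new record in raw units must pass through the already-fitted scaler.
new_raw = np.array([[70.0, 240.0]])
new_scaled = demo_scaler.transform(new_raw)

pred = demo_model.predict(new_scaled)
proba = demo_model.predict_proba(new_scaled)  # class probabilities, one row per sample

print(pred, proba.round(3))
```

Calling `transform` (not `fit_transform`) on the new record is the important detail: refitting the scaler on a single row would discard the minima and maxima learned from the training data.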