# Prepare Dataset for training:
#### Step-by-step guide:

1. Load the dataset.

2. Inspect the data to understand its structure and features.

3. Check for missing values and handle them if necessary.

4. Encode categorical features if any.

5. Normalize or scale numerical features (optional but recommended for logistic regression).

6. Split the data into training and testing sets.

7. The data is then ready for the Logistic Regression model.

## Let me start with loading and inspecting the data.

In [1]:
import pandas as pd
frame = pd.read_csv('heart_disease_uci.csv')
frame

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
915,916,54,Female,VA Long Beach,asymptomatic,127.0,333.0,True,st-t abnormality,154.0,False,0.0,,,,1
916,917,62,Male,VA Long Beach,typical angina,,139.0,False,st-t abnormality,,,,,,,0
917,918,55,Male,VA Long Beach,asymptomatic,122.0,223.0,True,st-t abnormality,100.0,False,0.0,,,fixed defect,2
918,919,58,Male,VA Long Beach,asymptomatic,,385.0,True,lv hypertrophy,,,,,,,0


The dataset has 920 entries and 16 columns. Here's a breakdown of the columns:

id: Identifier for each patient.

age: Age of the patient.

sex: Gender of the patient (Male/Female).

dataset: Source of the dataset (this might not be necessary for prediction).

cp: Chest pain type (categorical).

trestbps: Resting blood pressure (some missing values).

chol: Cholesterol level (some missing values).

fbs: Fasting blood sugar (categorical, some missing values).

restecg: Resting electrocardiographic results (categorical).

thalch: Maximum heart rate achieved (some missing values).

exang: Exercise induced angina (categorical, some missing values).

oldpeak: ST depression induced by exercise (some missing values).

slope: Slope of the peak exercise ST segment (categorical, many missing values).

ca: Number of major vessels (0-3) colored by fluoroscopy (many missing values).

thal: Thalassemia (categorical, many missing values).

num: Diagnosis of heart disease (target variable: 0 = no disease, 1+ = disease).
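
The breakdown above can be checked programmatically. The sketch below is only an illustration: `sample` is a hypothetical three-row stand-in built from rows shown in the table above, not the full file, and the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from three rows of the table above
sample = pd.DataFrame({
    "age": [63, 67, 62],
    "sex": ["Male", "Male", "Male"],
    "trestbps": [145.0, 160.0, None],
    "thal": ["fixed defect", "normal", None],
    "num": [0, 2, 0],
})

# Column dtypes: numeric columns vs. text (categorical) columns
print(sample.dtypes)

# Split column names by kind to plan encoding and scaling
numeric = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric)
print("categorical:", categorical)

# Distribution of the raw target values
print(sample["num"].value_counts())
```

Running the same three calls on `frame` gives the real picture for all 920 rows.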

# Next Steps:

1. Handle missing values.

2. Convert categorical variables into numeric formats.

3. Drop irrelevant columns like id and possibly dataset.

In [2]:
# Drop irrelevant columns (id and dataset)
df = frame.drop(columns=['id', 'dataset'])

# Check the number of missing values in each column
missing_values = df.isnull().sum()

missing_values


age           0
sex           0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64
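
To decide between dropping and imputing, it helps to see the counts as a share of all rows. The sketch below reuses the counts printed above and the 920-row total stated earlier; the 30% cut-off is an arbitrary illustration, not a rule from this notebook.

```python
import pandas as pd

# Missing-value counts copied from the output above; 920 is the row count stated earlier
n_rows = 920
missing = pd.Series({
    "trestbps": 59, "chol": 30, "fbs": 90, "restecg": 2, "thalch": 55,
    "exang": 55, "oldpeak": 62, "slope": 309, "ca": 611, "thal": 486,
})

# Share of rows missing per column, in percent, largest first
missing_pct = (missing / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)

# Columns missing in more than 30% of rows (illustrative threshold)
heavy = missing_pct[missing_pct > 30].index.tolist()
print(heavy)
```

Columns flagged this way lose most of the dataset if rows with gaps are dropped, which is why the row count shrinks so sharply later on.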

Handle missing values:

You can fill or drop missing values depending on your strategy:

Drop rows with a missing value in any of the listed columns:

In [3]:
df_cleaned = df.copy()  # independent copy, used by the median-fill alternative further down

In [4]:
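# .copy() makes the result an independent frame, so later column assignments do not raise SettingWithCopyWarning
df = df.dropna(subset=['trestbps', 'chol', 'thalch', 'oldpeak', 'ca','thal','fbs','restecg','exang','slope']).copy()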


Fill missing values with mean/median:

In [None]:
# Option A: fill numeric columns with the mean (Model Accuracy: 0.-----), or use another strategy as needed.
# Note: the dropna cell above already removed every row with a gap in these columns,
# so run this cell instead of that one if you want to impute rather than drop.
# Fill missing values using .loc to avoid view issues
df.loc[:, 'trestbps'] = df['trestbps'].fillna(df['trestbps'].mean())
df.loc[:, 'chol'] = df['chol'].fillna(df['chol'].mean())
df.loc[:, 'thalch'] = df['thalch'].fillna(df['thalch'].mean())
df.loc[:, 'oldpeak'] = df['oldpeak'].fillna(df['oldpeak'].mean())
df.loc[:, 'ca'] = df['ca'].fillna(df['ca'].mean())

# Option B (commented out): fill numeric columns with the median (Model Accuracy: 0.------)
# df_cleaned['trestbps'] = df_cleaned['trestbps'].fillna(df_cleaned['trestbps'].median())
# df_cleaned['chol'] = df_cleaned['chol'].fillna(df_cleaned['chol'].median())
# df_cleaned['thalch'] = df_cleaned['thalch'].fillna(df_cleaned['thalch'].median())
# df_cleaned['oldpeak'] = df_cleaned['oldpeak'].fillna(df_cleaned['oldpeak'].median())
# df_cleaned['ca'] = df_cleaned['ca'].fillna(df_cleaned['ca'].median())

In [None]:
# Fill missing values using .loc
df.loc[:, 'slope'] = df['slope'].fillna(df['slope'].mode()[0])
df.loc[:, 'thal'] = df['thal'].fillna(df['thal'].mode()[0])
df.loc[:, 'restecg'] = df['restecg'].fillna(df['restecg'].mode()[0])
df.loc[:, 'exang'] = df['exang'].fillna(df['exang'].mode()[0])
df.loc[:, 'fbs'] = df['fbs'].fillna(df['fbs'].mode()[0])
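
The fill strategies above compute the mean, median, or mode over the whole frame. A possible alternative, sketched here under the assumption that the data has already been split, is scikit-learn's `SimpleImputer`, which learns the fill value from the training rows and reuses it on the test rows. The `train` and `test` frames are hypothetical toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps, not the real dataset
train = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0],
    "slope": pd.Series(["flat", "flat", np.nan], dtype=object),
})
test = pd.DataFrame({
    "chol": [np.nan],
    "slope": pd.Series([np.nan], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")          # numeric columns
cat_imputer = SimpleImputer(strategy="most_frequent")   # categorical columns

# Fit on the training rows only, then reuse the learned values on the test rows
train_chol = num_imputer.fit_transform(train[["chol"]])
test_chol = num_imputer.transform(test[["chol"]])
train_slope = cat_imputer.fit_transform(train[["slope"]])
test_slope = cat_imputer.transform(test[["slope"]])

print(test_chol, test_slope)
```

Fitting on the training rows keeps information from the test set out of the preprocessing step.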


# Using LabelEncoder to encode categorical columns as numeric

In [5]:
from sklearn.preprocessing import LabelEncoder

# Assuming 'df' is your DataFrame and 'cp' is your column with categories
label_encoder = LabelEncoder()
df['cp'] = label_encoder.fit_transform(df['cp'])


Note: pandas raises SettingWithCopyWarning on this assignment when df is the direct result of a filtering call such as dropna. Calling .copy() on the filtered frame before assigning to its columns avoids the warning.


Convert categorical variables:

Label encoding converts each category into a unique integer. LabelEncoder assigns the integers in sorted (alphabetical) order of the category names.
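
The integer assigned to each category matters later, because any application that calls the model has to send the same codes. The sketch below shows how to read the mapping back from a fitted encoder; `thal_values` is a small hypothetical list using the three thal categories from the table above.

```python
from sklearn.preprocessing import LabelEncoder

# The three 'thal' categories that appear in the table above
thal_values = ["fixed defect", "normal", "reversable defect", "normal"]

encoder = LabelEncoder()
codes = encoder.fit_transform(thal_values)

# classes_ holds the categories in sorted order; the position of each is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Keeping one fitted encoder per column (rather than refitting a single shared encoder) preserves these mappings for later use.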

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assuming 'df' is your DataFrame
categorical_cols = df.select_dtypes(include=['object']).columns

# Keep one fitted encoder per column so each category-to-integer mapping can be looked up later
encoders = {}

# Apply Label Encoding to each categorical column
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])




Filter out outliers

In [8]:
# Select only numeric columns
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate IQR for numeric columns only
Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1

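# Filter out rows where any numeric column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Note: numeric_columns can also include label-encoded categorical columns and the target,
# so rare categories may be dropped as "outliers"; restrict the list to continuous columns if that is not intended.
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
is_outlier = (df[numeric_columns] < lower_bound) | (df[numeric_columns] > upper_bound)
df = df[~is_outlier.any(axis=1)]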
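
To make the IQR rule concrete, here is a minimal sketch on a single column. The `chol` values are hypothetical (564 is included only as an illustrative extreme reading), and the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical cholesterol readings; 564 is an illustrative extreme value
chol = pd.Series([204.0, 229.0, 233.0, 250.0, 286.0, 564.0])

q1 = chol.quantile(0.25)
q3 = chol.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles count as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = chol[(chol >= lower) & (chol <= upper)]
print(f"bounds: [{lower}, {upper}]")
print(kept.tolist())
```

The same bounds are computed per column in the cell above, and a row is removed if any one of its numeric values falls outside its column's bounds.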



In [9]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64

The num column takes values from 0 to 4. For binary classification, collapse it into two classes:

0 for no heart disease (num == 0)

1 for any heart disease (num > 0, i.e. the original values 1, 2, 3, or 4)

This cell ran before the outlier filter above (In [6] precedes In [8]), so the counts it prints cover the 299 rows left after dropping missing values:

In [6]:
# Convert 'num' to a binary variable: 0 stays 0, any value above 0 becomes 1
df.loc[:, 'num'] = (df['num'] > 0).astype(int)


# Check the distribution of the binary target variable
print(df['num'].value_counts())

num
0    160
1    139
Name: count, dtype: int64


## Split the data into features and target:

In [10]:
X = df.drop(columns=['num'])
y = df['num']


## Split into train and test sets:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Scale numerical features (optional for Logistic Regression):

In [12]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training split only, so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
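
With only a few hundred rows, a random split can shift the class balance between the two sets. One option, sketched below, is the `stratify` argument of `train_test_split`. The labels reuse the 160/139 class counts printed above; `X_demo` is a placeholder feature column, not real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts printed above: 160 zeros and 139 ones
y_demo = np.array([0] * 160 + [1] * 139)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the share of each class (almost) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positive cases in each split
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```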


## Ready for Logistic Regression Model

In [None]:
# Step 1: Train Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model on the training set
model.fit(X_train, y_train)

# Step 2: Make predictions and evaluate the model
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Step 3: Save the trained model to a file
model_filename = 'logistic_regression_heart_disease.pkl'
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")
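
A single train/test split gives one accuracy number, which can vary noticeably on a dataset this small. A possible complement, sketched here on synthetic data, is cross-validation with the scaler and the model wrapped in one pipeline. `X_demo` and `y_demo` are generated stand-ins for the 13 features and the binary target, and `max_iter=1000` is an arbitrary choice for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 features and binary target used above
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

# The pipeline refits the scaler inside every fold, so the held-out rows
# never contribute to the scaling statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

The spread of the fold accuracies gives a rough sense of how much a single split's result can be trusted.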


# Phase two:

# Deep learning model

In [14]:

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# Split first, then standardize (DNNs typically benefit from standardized inputs);
# the scaler is fitted on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



# Example with L2 regularization and Dropout
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),  # explicit Input layer instead of input_shape on the first Dense layer
    tf.keras.layers.Dense(1024, activation='relu',
                          kernel_regularizer=regularizers.l2(0.001)),  # L2 regularization
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),  # Dropout to prevent overfitting
    tf.keras.layers.Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(4, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(2, activation='relu'),
    
    tf.keras.layers.Dense(1, activation='sigmoid')
])
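# AdamW optimizer with an explicit learning rate
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001)
# Compile the model; pass the optimizer object, since the string 'AdamW' would ignore the settings above
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])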



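# Note: the test set is also used as the validation set here, so early stopping is tuned on it
# and the reported test accuracy is optimistic; a separate validation split would avoid that.
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# With the same patience as EarlyStopping, the learning-rate reduction rarely gets a chance to act;
# a smaller patience (for example 2) would let it trigger first.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-4)
# Train the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=8,
                    callbacks=[early_stopping, reduce_lr])

#print(model.summary())
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")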


Epoch 1/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 5s 20ms/step - accuracy: 0.4761 - loss: 1.9603 - val_accuracy: 0.6981 - val_loss: 1.7629 - learning_rate: 0.0010
Epoch 2/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5124 - loss: 1.7133 - val_accuracy: 0.7736 - val_loss: 1.6162 - learning_rate: 0.0010
Epoch 3/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5857 - loss: 1.5391 - val_accuracy: 0.7736 - val_loss: 1.4894 - learning_rate: 0.0010
Epoch 4/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6348 - loss: 1.4318 - val_accuracy: 0.8491 - val_loss: 1.3959 - learning_rate: 0.0010
Epoch 5/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6092 - loss: 1.3412 - val_accuracy: 0.8113 - val_loss: 1.3052 - learning_rate: 0.0010
Epoch 6/50
27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accu

# Save the trained model

In [15]:
model.save("Test_Accuracy_84.52%.keras")

# Save the Scaler 

In [16]:
import joblib
# Save scaler
joblib.dump(scaler, 'scaler.pkl')

# To load the scaler later:
# scaler = joblib.load('scaler.pkl')

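
A file written with `joblib.dump` should be read back with `joblib.load`. The round trip can be checked with a small sketch like the one below, where `X_fit` is a hypothetical two-feature matrix and the file is written to a temporary directory rather than to `scaler.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training matrix with two features
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_demo = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler_demo.pkl")
    joblib.dump(scaler_demo, path)   # write with joblib
    restored = joblib.load(path)     # read back with joblib as well

# The restored scaler should transform new rows exactly like the original
row = np.array([[2.0, 20.0]])
same = np.allclose(scaler_demo.transform(row), restored.transform(row))
print("round trip OK:", same)
```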
# Gradient Boosting Classifier (a classical ML algorithm)

In [None]:
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Train a Gradient Boosting Classifier (XGBoost)
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Test Accuracy: {accuracy * 100:.2f}%")

# XGBClassifier: hyperparameter tuning with grid search

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

model = XGBClassifier()
param_grid = {
    'n_estimators': [100, 500],
    'learning_rate': [0.001, 0.1],
    'max_depth': [8, 5, 4],
}
grid_search = GridSearchCV(model, param_grid, cv=8, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")
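
Beyond the single best score, `GridSearchCV` keeps the result of every parameter combination in `cv_results_`. The sketch below shows how to inspect it. As an assumption made only to keep the example free of extra dependencies, it swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost and uses synthetic data and a much smaller grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the heart disease dataset
X_demo, y_demo = make_classification(n_samples=120, n_features=13, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [20, 40], "max_depth": [2, 3]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(grid.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
print(grid.best_params_)
```

Comparing the top few rows shows whether the best combination is clearly ahead or only marginally better than its neighbours.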


# Create Flask Server with Model API

In [17]:
import joblib
import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the saved Keras model (the file name must match the one passed to model.save)
model = tf.keras.models.load_model('Test_Accuracy_84.52%.keras')

# Load the scaler with joblib, the library it was saved with (ensure it's in the same directory)
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Get JSON data from the POST request
    data = request.json
    features = np.array([[
        data['age'], data['sex'], data['cp'], data['trestbps'], data['chol'], 
        data['fbs'], data['restecg'], data['thalach'], data['exang'], 
        data['oldpeak'], data['slope'], data['ca'], data['thal']
    ]])

    # Scale the input data
    scaled_features = scaler.transform(features)

    # Make a prediction: the sigmoid output is the probability of class 1
    probability = float(model.predict(scaled_features)[0][0])
    # Threshold at 0.5; int() on the raw probability would truncate almost every value to 0
    predicted_class = int(probability >= 0.5)

    # Return the result as JSON
    return jsonify({
        'predicted_class': predicted_class,
        'message': 'You are unlikely to have heart disease.' if predicted_class == 0 else 'You are likely to have heart disease.'
    })

if __name__ == '__main__':
    # The debug reloader restarts the process, which fails inside a notebook
    # (the original run ended with "SystemExit: 1"), so it is disabled here.
    app.run(debug=True, use_reloader=False)


 * Serving Flask app '__main__'
 * Debug mode: on
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
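
The handler above reads each field directly from the request, so a missing or non-numeric field surfaces as a server error. One way to guard against that is sketched below in plain Python. `build_feature_row` and `classify` are hypothetical helper names, and the 0.5 threshold is an assumption.

```python
# Feature order expected by the scaler and model (same order as in the Flask handler above)
FEATURE_ORDER = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]


def build_feature_row(payload):
    """Return the features as a list of floats, or raise ValueError naming the problem."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    try:
        return [float(payload[name]) for name in FEATURE_ORDER]
    except (TypeError, ValueError) as exc:
        raise ValueError("all fields must be numeric") from exc


def classify(probability, threshold=0.5):
    """Turn a sigmoid probability into a 0/1 class (threshold is an assumption)."""
    return int(probability >= threshold)


example = {name: 1 for name in FEATURE_ORDER}
print(build_feature_row(example))
print(classify(0.83), classify(0.12))
```

Inside the route, a caught `ValueError` could be turned into a 400 response with the error text, so the client learns which field was wrong.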


# Streamlit Frontend Interacting with the Flask API

In [None]:
import streamlit as st
import requests
import json

# Function to gather user input
def get_user_input():
    age = st.number_input('Age', min_value=0, max_value=120, value=30)
    sex = st.selectbox('Sex (1 = Male, 0 = Female)', [0, 1])
    cp = st.selectbox('Chest Pain Type (0-3)', [0, 1, 2, 3])
    trestbps = st.number_input('Resting Blood Pressure', min_value=80, max_value=200, value=120)
    chol = st.number_input('Cholesterol Level', min_value=100, max_value=600, value=200)
    fbs = st.selectbox('Fasting Blood Sugar > 120 mg/dl (1 = True, 0 = False)', [0, 1])
    restecg = st.selectbox('Resting Electrocardiographic Results (0-2)', [0, 1, 2])
    thalach = st.number_input('Maximum Heart Rate Achieved', min_value=60, max_value=220, value=150)
    exang = st.selectbox('Exercise Induced Angina (1 = Yes, 0 = No)', [0, 1])
    oldpeak = st.number_input('ST Depression Induced by Exercise', min_value=0.0, max_value=6.0, value=1.0, step=0.1)
    slope = st.selectbox('Slope of Peak Exercise ST Segment (0-2)', [0, 1, 2])
    ca = st.number_input('Number of Major Vessels (0-3)', min_value=0, max_value=3, value=0)
    # LabelEncoder assigns codes alphabetically: fixed defect = 0, normal = 1, reversable defect = 2
    thal = st.selectbox('Thalassemia (0 = Fixed Defect, 1 = Normal, 2 = Reversible Defect)', [0, 1, 2])

    # Create a dictionary with the user input
    user_data = {
        'age': age, 'sex': sex, 'cp': cp, 'trestbps': trestbps, 'chol': chol,
        'fbs': fbs, 'restecg': restecg, 'thalach': thalach, 'exang': exang,
        'oldpeak': oldpeak, 'slope': slope, 'ca': ca, 'thal': thal
    }

    return user_data

# Main function for Streamlit app
def main():
    st.title("Heart Disease Diagnostic Chat")

    # Gather input data from user
    user_input = get_user_input()

    # When user clicks 'Predict' button, send data to Flask API
    if st.button("Predict"):
        # Send POST request to Flask API
        url = 'http://127.0.0.1:5000/predict'  # Flask API URL
        try:
            # json= serializes the dict and sets the Content-Type header
            response = requests.post(url, json=user_input, timeout=10)
        except requests.exceptions.RequestException:
            st.write("Error: Could not connect to Flask API.")
            return

        if response.status_code == 200:
            prediction = response.json()
            st.write(f"Prediction: {prediction['message']}")
        else:
            st.write(f"Error: Flask API returned status {response.status_code}.")

if __name__ == "__main__":
    main()
