# Heartbeat Classification using Logistic Regression

This notebook demonstrates a heartbeat classification task using a dataset from Kaggle. The dataset consists of ECG data with different heartbeat types. We will use Logistic Regression for classification, and the workflow includes:

1. Data loading and preprocessing
2. Feature scaling using StandardScaler
3. Logistic Regression with class balancing
4. Model evaluation through accuracy, classification report, and confusion matrix

The goal is to classify heartbeats into different categories and assess model performance.
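
To make step 3 of the workflow concrete, the sketch below shows how scikit-learn's `class_weight='balanced'` option derives per-class weights, using a small made-up label vector (not the ECG data). The weights follow `n_samples / (n_classes * count_per_class)`, so rare classes receive larger weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy, made-up labels: class 0 dominates, as normal beats tend to in ECG data.
y_toy = np.array([0] * 8 + [1] * 1 + [2] * 1)
classes = np.unique(y_toy)

# Weights as computed by scikit-learn for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same quantity computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.3f}")
```

Misclassifying a rare beat therefore costs the model more than misclassifying a common one, which tends to trade some overall accuracy for better minority-class recall.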


In [62]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_recall_curve, confusion_matrix, accuracy_score

In [63]:
# Download the dataset from Kaggle (this will work in a local environment with proper Kaggle API setup)
# Note: In Jupyter, ensure the Kaggle API is configured correctly for this command to work.
!kaggle datasets download -d shayanfazeli/heartbeat

Dataset URL: https://www.kaggle.com/datasets/shayanfazeli/heartbeat
License(s): unknown
heartbeat.zip: Skipping, found more recently modified local copy (use --force to force download)


In [64]:
# Unzip the downloaded dataset
import zipfile

# Extract the dataset files
with zipfile.ZipFile('heartbeat.zip', 'r') as zip_ref:
    zip_ref.extractall('heartbeat_data')


In [65]:
import pandas as pd

# Load the training and testing datasets
data_train = pd.read_csv('/content/heartbeat_data/mitbih_train.csv', header=None)
data_test = pd.read_csv('/content/heartbeat_data/mitbih_test.csv', header=None)
# Combine train and test datasets for processing
data = pd.concat([data_train, data_test], ignore_index=True, sort=False)

In [66]:
# Display the dataset
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,178,179,180,181,182,183,184,185,186,187
0,0.977941,0.926471,0.681373,0.245098,0.154412,0.191176,0.151961,0.085784,0.058824,0.049020,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.960114,0.863248,0.461538,0.196581,0.094017,0.125356,0.099715,0.088319,0.074074,0.082621,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.000000,0.659459,0.186486,0.070270,0.070270,0.059459,0.056757,0.043243,0.054054,0.045946,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.925414,0.665746,0.541436,0.276243,0.196133,0.077348,0.071823,0.060773,0.066298,0.058011,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.967136,1.000000,0.830986,0.586854,0.356808,0.248826,0.145540,0.089202,0.117371,0.150235,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109441,0.928736,0.871264,0.804598,0.742529,0.650575,0.535632,0.394253,0.250575,0.140230,0.102299,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
109442,0.802691,0.692078,0.587444,0.446936,0.318386,0.189836,0.118087,0.077728,0.112108,0.152466,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
109443,1.000000,0.967359,0.620178,0.347181,0.139466,0.089021,0.103858,0.100890,0.106825,0.100890,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
109444,0.984127,0.567460,0.607143,0.583333,0.607143,0.575397,0.575397,0.488095,0.392857,0.238095,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0


In [67]:
# Step 3: Data Preprocessing
# Separate features (X) and target (y)
# Assuming the last column (index 187) is the target label
X = data.drop(columns=[187])  # Features (all columns except the last one)
y = data[187]  # Target labels (the last column)

In [68]:
# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [69]:
# Standardize the features for better model performance (especially for Logistic Regression)
scaler = StandardScaler()

# Fit and transform the training data, and transform the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [70]:
# Initialize Logistic Regression model with balanced class weights to handle class imbalance
logistic = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)

# Train the model on the training data
logistic.fit(X_train, y_train)

In [None]:
# Make predictions on the test data
y_pred = logistic.predict(X_test)

# Evaluate the model
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Logistic Regression Accuracy: {accuracy}')

# Display classification report (precision, recall, f1-score, etc.)
print('Logistic Regression Report:\n', classification_report(y_test, y_pred))

# Display confusion matrix
print('Logistic Regression Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

# Improving Heartbeat Classification: From Logistic Regression to Binary & Multi-Class Models

In our initial attempt to classify heartbeats using Logistic Regression, we achieved an accuracy of **66.87%**. The performance was particularly poor for some of the abnormal classes, as summarized below:

- **Class 0 (Normal Heartbeat)** had high precision but relatively low recall, leading to an F1-score of 0.77.
- **Classes 1, 2, and 3 (Various Heartbeat Abnormalities)** showed low precision and recall, indicating that the Logistic Regression model struggled to detect these conditions accurately.
- **Class 4** performed relatively well, but the overall performance across the classes was suboptimal.

**Logistic Regression Accuracy: 66.87%**
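
For readers who want to see how the per-class figures relate to the confusion matrix, here is a minimal sketch. The matrix is made up for illustration and is not the notebook's actual result; rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
import numpy as np

def per_class_metrics(cm):
    """Return (precision, recall, f1) arrays from a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # how often each class was predicted
    actual = cm.sum(axis=1)      # how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Made-up 2-class matrix: 90 majority beats and 10 minority beats.
toy_cm = [[60, 30],
          [2, 8]]
precision, recall, f1 = per_class_metrics(toy_cm)
accuracy = np.trace(np.asarray(toy_cm)) / np.sum(toy_cm)

print("accuracy :", round(float(accuracy), 3))
print("precision:", np.round(precision, 3))
print("recall   :", np.round(recall, 3))
print("f1       :", np.round(f1, 3))
```

In this toy case the majority class shows the same pattern described above (high precision, lower recall), while the minority class has low precision, which is why accuracy alone is a weak summary on imbalanced data.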


# Moving Forward: A New Approach
To improve the classification results, we split the task into two stages:

1. **Binary classification**: a first model detects whether a heartbeat is normal or abnormal. This simplifies the task and lets the model focus on separating healthy heartbeats from abnormal ones.
2. **Multi-class classification**: a second model, trained only on abnormal heartbeats, assigns each abnormal beat to a specific category.

# 1. Binary Classification using XGBoost
First, we simplified the problem by converting it into a binary classification task, where the goal was to detect whether a heartbeat was normal (0) or abnormal (1). Using XGBoost with SMOTE to handle class imbalance, the model’s performance improved significantly.
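
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration written for this notebook, not imbalanced-learn's implementation: the function name `smote_like_oversample` and its parameters are hypothetical, and the data is made up. Each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random base samples
    picks = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base
    lam = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[picks] - X_min[base])

# Made-up minority samples in two dimensions.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Because the synthetic rows are derived from existing samples, oversampling is applied to the training split only, as the cells below do; resampling before the split would leak information into the test set.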

In [None]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier

In [None]:
# Step 2: Load the datasets (train and test)
# Note: Ensure the dataset files are correctly loaded from the specified path
data_test = pd.read_csv('/content/heartbeat_data/mitbih_test.csv', header=None)
data_train = pd.read_csv('/content/heartbeat_data/mitbih_train.csv', header=None)

# Step 3: Combine both datasets for unified processing
df = pd.concat([data_test, data_train], ignore_index=True, sort=False)

In [None]:
# Step 4: Make a copy of the data for manipulation
data = df.copy()

In [None]:
# Step 5: Binary classification task (simplify target labels)
# Converting the target label to binary (0 for normal, 1 for other abnormalities)
data.iloc[:, -1] = data.iloc[:, -1].apply(lambda x: 0 if x == 0 else 1)

# Step 6: Rename the target label column for clarity
data = data.rename({187: 'Label'}, axis=1)

In [76]:
# Step 7: Prepare the features (X) and target (y)
X = data.drop('Label', axis=1).copy()
y = data['Label'].copy()

# Step 8: Split data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

# Step 9: Address class imbalance using SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(random_state=42)  # Fixed seed so the resampled training set is reproducible
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
# Display the training set after SMOTE application
X_train

In [78]:
y_train

Unnamed: 0,Label
0,0.0
1,1.0
2,0.0
3,0.0
4,0.0
...,...
126621,1.0
126622,1.0
126623,1.0
126624,1.0


In [None]:
# Step 10: Initialize and train an XGBoost classifier
binary_xgb_model = XGBClassifier(random_state=42)
binary_xgb_model.fit(X_train, y_train)

In [None]:
# Step 11: Make predictions and evaluate the model
y_pred = binary_xgb_model.predict(X_test)

# Step 12: Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f'XGBoost Model Accuracy: {accuracy}')
print('XGBoost Classification Report:\n', classification_report(y_test, y_pred))
print('XGBoost Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

# 2. Multi-Class Classification for Abnormalities using XGBoost
Next, we applied XGBoost for multi-class classification, focusing only on abnormal heartbeats (after removing normal heartbeats from the dataset). The remaining abnormal heartbeats were classified into different categories, corresponding to various abnormal conditions.

We used SMOTE to handle the class imbalance, which helped improve the model's performance on these minority classes.
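
The label handling for this stage can be sketched as follows, using made-up labels. After normal beats are removed, the remaining classes 1 to 4 are shifted to 0 to 3 (recent XGBoost versions expect class labels to be consecutive integers starting at 0), and the shift is undone at prediction time so that results use the original scheme.

```python
import numpy as np

# Made-up labels in the original 0-4 scheme (0 = normal beat).
labels = np.array([0, 1, 4, 0, 2, 3, 0, 4])

# Keep only abnormal beats and shift their labels down by one.
abnormal_mask = labels != 0
stage2_targets = labels[abnormal_mask] - 1   # classes 1-4 become 0-3

# At prediction time, add 1 to map stage-two outputs back to the original scheme.
restored = stage2_targets + 1

print("stage-two targets:", stage2_targets)
print("restored labels  :", restored)
```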

In [None]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier

In [None]:
# Step 2: Load the datasets (train and test)
# Note: Ensure the dataset files are correctly loaded from the specified path
data_test = pd.read_csv('/content/heartbeat_data/mitbih_test.csv', header=None)
data_train = pd.read_csv('/content/heartbeat_data/mitbih_train.csv', header=None)

# Step 3: Combine both datasets
df = pd.concat([data_test, data_train], ignore_index=True, sort=False)

In [None]:
# Step 4: Remove normal heartbeats (class 0) to focus on abnormal heartbeats
df = df[df[187] != 0].copy()  # Copy so the label update below modifies an independent frame

# Step 5: Reclassify the remaining labels by subtracting 1 (to make them start from 0)
df[187] = df[187] - 1

# Step 6: Prepare the features (X) and target (y)
X = df.drop(187, axis=1).copy()
y = df[187].copy()

In [None]:
# Step 7: Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# Step 8: Handle class imbalance using SMOTE
smote = SMOTE(random_state=42)  # Fixed seed so the resampled training set is reproducible
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
# Display the training labels after SMOTE application
y_train

In [None]:
# Step 9: Initialize and train an XGBoost classifier for multi-class classification
multiclass_xgb_model = XGBClassifier(random_state=42)
multiclass_xgb_model.fit(X_train, y_train)

# Step 10: Make predictions and evaluate the model
y_pred = multiclass_xgb_model.predict(X_test)

# Step 11: Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f'XGBoost Model Accuracy: {accuracy}')
print('XGBoost Classification Report:\n', classification_report(y_test, y_pred))
print('XGBoost Confusion Matrix:\n', confusion_matrix(y_test, y_pred))


# Final Evaluation and Conclusion
Logistic Regression provided a baseline accuracy of **66.87%**, but its performance on abnormal heartbeats was suboptimal, particularly in detecting and distinguishing between abnormal conditions.

After switching to **XGBoost** and handling class imbalance with SMOTE, we observed significant improvements in both binary classification (normal vs. abnormal) and multi-class classification (specific abnormal conditions). This multi-step approach enabled the model to better focus on detecting abnormalities and further categorize them into specific conditions.

This workflow demonstrates that using a two-step classification approach, first focusing on binary classification to detect abnormal heartbeats and then using multi-class classification to identify specific conditions, can significantly improve the performance of heartbeat classification tasks.
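
The way the two stages combine at prediction time can be sketched as below. The helper `two_stage_predict` and the two toy predictor functions are hypothetical stand-ins for the trained models, and the data is made up; the sketch only illustrates the control flow.

```python
import numpy as np

def two_stage_predict(X, predict_binary, predict_abnormal_type):
    """Combine a normal/abnormal detector with an abnormal-type classifier.

    predict_binary(X)        -> array of 0 (normal) / 1 (abnormal)
    predict_abnormal_type(X) -> array of 0-based abnormal class ids
    Returns labels in the original scheme: 0 = normal, 1..K = abnormal types.
    """
    X = np.asarray(X)
    final = np.zeros(len(X), dtype=int)          # default every beat to normal
    abnormal_idx = np.flatnonzero(np.asarray(predict_binary(X)) == 1)
    if abnormal_idx.size:
        # Only beats flagged as abnormal reach the second model.
        final[abnormal_idx] = np.asarray(predict_abnormal_type(X[abnormal_idx])) + 1
    return final

# Hypothetical stand-ins for the two trained models, for illustration only.
def toy_binary(X):
    return (X[:, 0] > 0.5).astype(int)

def toy_abnormal_type(X):
    return (X[:, 1] > 0.5).astype(int)

X_demo = np.array([[0.1, 0.9], [0.9, 0.1], [0.8, 0.7], [0.2, 0.2]])
result = two_stage_predict(X_demo, toy_binary, toy_abnormal_type)
print(result)
```

One consequence of this design is that errors propagate: a beat the first stage labels as normal never reaches the second stage, so the binary model's recall on abnormal beats bounds what the overall system can detect.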

# Deploying the Heartbeat Classification System Using Streamlit

In this section, we will deploy our heartbeat classification system using **Streamlit**, a powerful and easy-to-use tool for building web applications in Python. The goal is to provide an interactive interface where users can upload their ECG data in `.csv` format and classify heartbeats as normal or abnormal using either a **Logistic Regression** model or a two-step **Binary + Multi-Class XGBoost** approach.

### Deployment Details:
1. **Streamlit**: We will use Streamlit to build the web interface where users can upload their heartbeat data files, choose the classification method, and see the predicted results.
2. **Joblib**: This is used to save and load the pre-trained models.
3. **Ngrok**: We will use Ngrok to make the Streamlit app accessible over the internet by creating a secure tunnel.
4. **Models**:
   - **Logistic Regression**: For multi-class classification.
   - **XGBoost Binary Model**: For detecting normal vs abnormal heartbeats.
   - **XGBoost Multi-Class Model**: To classify specific types of abnormal heartbeats.
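
As a small illustration of the Joblib round trip, the sketch below saves and reloads a scikit-learn pipeline fitted on made-up data. Bundling the scaler and the classifier in one pipeline is one possible design, shown here as an assumption rather than what this notebook does; the file name `toy_pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two well-separated clusters.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y_toy = np.array([0] * 50 + [1] * 50)

# Keeping the scaler inside the pipeline stores the training-time statistics with the model.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(pipeline.predict(X_toy), restored.predict(X_toy))
print("Round trip preserved predictions:", same_predictions)
```

Whichever layout is used, new data should be transformed with statistics learned during training rather than with a scaler refitted on the uploaded file.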

Let’s walk through the code, explaining how each part works.


In [None]:
!pip install streamlit
!pip install pyngrok

In [None]:
import joblib  # Used for saving and loading models

# Save the pre-trained models to disk using Joblib
joblib.dump(logistic, 'logistic_regression_model.pkl')  # Save Logistic Regression model
joblib.dump(scaler, 'scaler.pkl')                       # Save the StandardScaler fitted on the training data
joblib.dump(binary_xgb_model, 'binary_model.pkl')       # Save Binary XGBoost model
joblib.dump(multiclass_xgb_model, 'multi_class_model.pkl')  # Save Multi-Class XGBoost model

In [None]:
%%writefile app.py
# Streamlit app code, written to 'app.py' (the %%writefile magic must be the first line of the cell)
import streamlit as st
import pandas as pd
import joblib
import numpy as np

# Function to preprocess the uploaded data based on whether it's for binary or multi-class tasks
def preprocess_data(df, binary=False):
    data = df.copy()
    X = data.iloc[:, :-1]  # Drop the last column (target)
    if not binary:  # Only the Logistic Regression model was trained on scaled features
        scaler = joblib.load('scaler.pkl')  # Reuse the scaler fitted on the training data
        X = scaler.transform(X)  # Apply the training-time mean and variance, not statistics of the upload
    return X

# Logistic Regression Model Prediction
def logistic_regression(X):
    model = joblib.load('logistic_regression_model.pkl')  # Load the pre-trained Logistic Regression model
    y_pred = model.predict(X)  # Predict the heartbeat types
    print(y_pred)  # Output the predictions
    return y_pred

# Binary Classification followed by Multi-Class Classification for abnormal heartbeats
def binary_then_multiclass(df):
    # Step 1: Binary Classification (Normal vs Abnormal)
    X_bin = preprocess_data(df, binary=True)  # Preprocess the data for binary classification
    binary_model = joblib.load('binary_model.pkl')  # Load the pre-trained binary XGBoost model
    y_pred_bin = binary_model.predict(X_bin)  # Predict normal (0) or abnormal (1)
    print(y_pred_bin)

    # Initialize final predictions with binary results (0 for normal cases)
    y_pred_full = y_pred_bin.copy()

    # Step 2: Multi-Class Classification for abnormal heartbeats
    abnormal_index = np.where(y_pred_bin == 1)[0]  # Find indices of abnormal cases (1)
    if len(abnormal_index) > 0:  # If there are abnormal cases
        X_abnormal = X_bin.iloc[abnormal_index]  # Extract features for the abnormal cases

        # Load the pre-trained multi-class XGBoost model
        multi_model = joblib.load('multi_class_model.pkl')
        y_pred_multi = multi_model.predict(X_abnormal) + 1  # Predict the abnormal class and increment by 1

        # Update the final predictions with multi-class results for abnormal cases
        y_pred_full[abnormal_index] = y_pred_multi
        print(y_pred_full)

    return y_pred_full

# Streamlit App Interface
st.title("Heartbeat Classification System")  # Title of the web app

# File upload option
uploaded_file = st.file_uploader("Upload your heartbeat data file (.csv)", type="csv")  # Accept only .csv files

# Label mapping for the final output (prediction labels)
label_mapping = {
    0: 'N - Normal Beat',
    1: 'S - Supraventricular premature or ectopic beat',
    2: 'V - Premature ventricular contraction',
    3: 'F - Fusion of ventricular and normal beat',
    4: 'Q - Unclassified beat'
}

if uploaded_file:  # Check if a file has been uploaded
    df = pd.read_csv(uploaded_file, header=None)  # Read the uploaded CSV file
    st.write("Data Preview:")
    st.dataframe(df.head())  # Display the first few rows of the uploaded file

    # User selects classification method
    option = st.selectbox(
        "Select Classification Method:",
        ("Logistic Regression", "Binary + Multi-Class")  # Two options for classification methods
    )

    # Predict button
    if st.button("Predict Output"):
        if option == "Logistic Regression":
            st.write("Running Logistic Regression...")
            X = preprocess_data(df)  # Preprocess data for Logistic Regression
            y_pred = logistic_regression(X)  # Predict using Logistic Regression
            y_pred_labels = pd.Series(y_pred).map(label_mapping)  # Map predicted labels to human-readable form
            st.write("Predicted Labels:")
            st.write(y_pred_labels)  # Display the predicted labels

        elif option == "Binary + Multi-Class":
            st.write("Running Binary + Multi-Class Classification...")
            y_pred_full = binary_then_multiclass(df)  # Predict using Binary + Multi-Class models
            y_pred_labels = pd.Series(y_pred_full).map(label_mapping)  # Map predicted labels
            st.write("Predicted Labels (Normal and Abnormal):")
            st.write(y_pred_labels)  # Display the predicted labels

# Generating a Random Test Sample for Model Testing

This code is designed to create a small test dataset from the larger ECG heartbeat dataset. The test dataset will contain a random selection of rows for each class of heartbeats, which can then be used to evaluate the deployed model. The resulting sample is saved as a CSV file, which can be loaded into the model for predictions.

### Key Steps in the Code:
1. **Loading the Dataset**: The ECG heartbeat data (`mitbih_test.csv`) is loaded using pandas.
2. **Random Sampling**: The code randomly samples 10 rows for each class (grouped by the last column, which contains the labels).
3. **Label Encoding (if applicable)**: Any non-numeric (categorical) data is encoded into numeric form using `LabelEncoder`. This step ensures that the data is compatible with the model.
4. **Saving the Sample**: The sampled data is saved to a new CSV file (`random_test_sample.csv`), which can be used for testing the model predictions.

This process ensures that you have a smaller, representative dataset to test and validate the performance of your deployed model.
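
The per-class sampling step can be tried on a small made-up frame first. The sketch below assumes, as in the ECG files, that the label sits in the last column; the column layout and class sizes are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: 3 feature columns plus a label column in the last position.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((60, 3)))
toy[3] = np.repeat([0.0, 1.0, 2.0], [40, 15, 5])   # imbalanced class sizes

# Draw the same number of rows from every class.
label_col = toy.columns[-1]
sample = toy.groupby(label_col).sample(n=5, random_state=42)

print(sample[label_col].value_counts().sort_index())
```

Note that `n` cannot exceed the size of the smallest class unless sampling with replacement is enabled, so it is worth checking class counts before choosing it.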


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('/content/heartbeat_data/mitbih_test.csv', header=None)  # The file has no header row


random_rows = df.groupby(df.columns[-1]).sample(n=10, random_state=42) # Group by the last column using df.columns[-1]
label_encoder = LabelEncoder()

for column in random_rows.columns:
    if random_rows[column].dtype == 'object':
        random_rows[column] = label_encoder.fit_transform(random_rows[column].astype(str))
        random_rows[column] = random_rows[column].astype(int)

random_rows.to_csv('random_test_sample.csv',header=False ,index=False)

print("Saved random_test_sample.csv")


# Exposing the Streamlit App Online with Ngrok

To expose your local Streamlit app online using Ngrok, you'll need to create a free Ngrok account and obtain your own authentication token. Follow these steps:

1. Go to the [Ngrok website](https://ngrok.com/) and sign up for a free account.
2. Once signed up, navigate to the **Dashboard** to get your **authentication token**.
3. Replace the `your_auth_token_here` in the code with your own token.
4. You’re now ready to use Ngrok to securely expose your Streamlit app to the web!
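
If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The helper below is a hypothetical sketch using only the standard library; the function name `get_ngrok_token` is an assumption, and the returned value would then be passed to `ngrok.set_auth_token`.

```python
import os

def get_ngrok_token(env_var="NGROK_AUTH_TOKEN"):
    """Return the Ngrok token from the environment, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    # Reject a missing value and the untouched placeholder alike.
    if not token or token == "your_auth_token_here":
        raise RuntimeError(f"Set the {env_var} environment variable to your Ngrok token.")
    return token
```

This keeps the secret out of the saved notebook, which matters if the notebook is shared or committed to version control.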



In [None]:
import os
from pyngrok import ngrok

# Set up the Ngrok authentication token (replace with your own token)
os.environ["NGROK_AUTH_TOKEN"] = "your_auth_token_here"
ngrok.set_auth_token("your_auth_token_here")

# Start a single secure tunnel to the local Streamlit app (Streamlit listens on port 8501 by default)
public_url = ngrok.connect(addr='8501')
print("Public URL:", public_url)

# Run the Streamlit app in the background
!streamlit run app.py &  # Launch Streamlit app in the background
