In [28]:
import pandas as pd
# Reading the heart disease dataset
data = pd.read_csv("heart.csv")

# Displaying the first five rows of the dataset
print(data.head())


   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0       140   203    1        0      155      1      3.1      0   
2   70    1   0       145   174    0        1      125      1      2.6      0   
3   61    1   0       148   203    0        1      161      0      0.0      2   
4   62    0   0       138   294    1        1      106      0      1.9      1   

   ca  thal  target  
0   2     3       0  
1   0     3       0  
2   0     3       0  
3   1     3       0  
4   3     2       0  


Code Explanation

import pandas as pd: This line imports the Pandas library and allows us to use it with the alias pd. Pandas is a powerful library for data manipulation and analysis, particularly with structured data like CSV files.

data = pd.read_csv("heart.csv"): This line reads the CSV file named heart.csv and stores it in the variable data. The dataset contains information related to heart disease, with various attributes for each patient.

print(data.head()): This line prints the first five rows of the dataset, providing a quick overview of the data structure and the values contained within it. The output includes the following columns:

age: The age of the patient.

sex: The gender of the patient (1 = male, 0 = female).

cp: Chest pain type (0-3).

trestbps: Resting blood pressure (in mm Hg).

chol: Serum cholesterol in mg/dl.

fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false).

restecg: Resting electrocardiographic results (0-2).

thalach: Maximum heart rate achieved.

exang: Exercise induced angina (1 = yes, 0 = no).

oldpeak: ST depression induced by exercise relative to rest.

slope: Slope of the peak exercise ST segment (0-2).

ca: Number of major vessels (0-3) colored by fluoroscopy.

thal: Thalassemia (1-3).

target: Diagnosis of heart disease (1 = presence, 0 = absence).
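As a quick sanity check, the coded ranges listed above can be compared against the data. The following is a minimal sketch that rebuilds only the five rows printed above instead of reading heart.csv; the names sample, documented_ranges, and out_of_range are hypothetical and introduced purely for illustration.

```python
import pandas as pd

# The five rows shown in the output above (coded columns only).
sample = pd.DataFrame({
    "sex":     [1, 1, 1, 1, 0],
    "cp":      [0, 0, 0, 0, 0],
    "fbs":     [0, 1, 0, 0, 1],
    "restecg": [1, 0, 1, 1, 1],
    "exang":   [0, 1, 1, 0, 0],
    "slope":   [2, 0, 0, 2, 1],
    "ca":      [2, 0, 0, 1, 3],
    "thal":    [3, 3, 3, 3, 2],
    "target":  [0, 0, 0, 0, 0],
})

# Inclusive (low, high) ranges taken from the column descriptions above.
documented_ranges = {
    "sex": (0, 1), "cp": (0, 3), "fbs": (0, 1), "restecg": (0, 2),
    "exang": (0, 1), "slope": (0, 2), "ca": (0, 3), "thal": (1, 3),
    "target": (0, 1),
}

def out_of_range(df, ranges):
    """Return the columns that contain values outside their documented range."""
    return [col for col, (low, high) in ranges.items()
            if not df[col].between(low, high).all()]

print(out_of_range(sample, documented_ranges))
```

Running the same helper on the full data frame would flag any column whose codes fall outside the documented range.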

In [29]:
# Displaying the shape of the dataset (number of rows and columns)
print("Shape of the dataset: ", data.shape)
# Displaying a summary of the dataset, including data types and non-null counts
print(data.info())
# Displaying the first five rows of the dataset
print(data.head())
# Providing descriptive statistics for the dataset, including count, mean, std, min, 25%, 50%, 75%, and max values
print(data.describe())


Shape of the dataset:  (1025, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
None
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0

Code Explanation

print("Shape of the dataset: ", data.shape): This line prints the shape of the dataset, which indicates that there are 1025 rows and 14 columns. This gives a quick overview of how many records (patients) and features (attributes) are present in the dataset.

print(data.info()): This line outputs a summary of the dataset. It shows:

The total number of entries (1025) and the range of the index (0 to 1024).

A breakdown of each column, including the count of non-null entries, the data type of each column (e.g., int64, float64), and the total number of columns (14).

This information helps identify data types and check for missing values; here every column has 1025 non-null entries, so nothing is missing. Because data.info() prints its summary directly and returns None, wrapping it in print() also prints the None seen at the end of the summary.

print(data.head()): This line prints the first five rows of the dataset again, showing a sample of the data. The displayed columns are age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and target. This output shows the actual values of each feature for the first five patients.

print(data.describe()): This line provides descriptive statistics for the dataset, including:

Count: The number of non-null entries for each column.

Mean: The average value of each column.

Standard deviation (std): A measure of how widely the values in each column are spread around the mean.

Min: The minimum value.

25%, 50%, and 75%: The quartiles of the data.

Max: The maximum value.
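To make these statistics concrete, here is a small sketch that applies describe() to the five chol values printed by data.head() and recomputes a few of them by hand. It uses only those five values, so the numbers are not the statistics of the full 1025-row dataset.

```python
import pandas as pd

# chol values of the five rows printed by data.head()
chol = pd.Series([212, 203, 174, 203, 294], name="chol")

# count, mean, std, min, quartiles and max in one call
summary = chol.describe()
print(summary)

# The same figures computed individually (std uses the sample formula, ddof=1)
print(chol.count(), chol.mean(), chol.std(), chol.quantile(0.5))
```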

In [30]:
from sklearn.model_selection import train_test_split

# Splitting features (X) and target variable (y)
X = data.drop("target", axis=1)  # Dropping the target variable
y = data["target"]                 # Target variable

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Printing the size of the training set (number of samples and features)
print("Size of training set:", X_train.shape)
# Printing the size of the test set (number of samples and features)
print("Size of test set:", X_test.shape)


Size of training set: (820, 13)
Size of test set: (205, 13)


Code Explanation

from sklearn.model_selection import train_test_split: This line imports the train_test_split function from the sklearn.model_selection module. This function is used to split the dataset into training and testing subsets.

X = data.drop("target", axis=1): This line creates a new DataFrame X by dropping the target column from the original dataset data. The target column represents the labels we want to predict, while X contains the features used for prediction.

y = data["target"]: This line assigns the target column from the dataset to the variable y. This variable will hold the labels corresponding to the features in X.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This line splits the features and target variable into training and testing sets using the train_test_split function. The parameters are:

X: Features DataFrame.

y: Target variable.

test_size=0.2: This specifies that 20% of the data should be allocated for testing, while 80% will be used for training.

random_state=42: This ensures reproducibility by setting a seed for the random number generator.

print("Size of training set:", X_train.shape): This line prints the shape of the training set, which indicates it contains 820 samples and 13 features (the number of input variables).

print("Size of test set:", X_test.shape): This line prints the shape of the test set, which shows it contains 205 samples and 13 features.
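The 820/205 split follows directly from test_size=0.2 applied to 1025 rows. The sketch below reproduces those sizes on placeholder arrays and additionally passes stratify, an option the notebook does not use, to show how class proportions can be kept equal in both subsets. The names X_demo and y_demo and their contents are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the heart dataset features (values are arbitrary).
X_demo = np.zeros((1025, 13))
y_demo = np.array([0, 1] * 512 + [0])  # 1025 placeholder labels

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```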

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Creating the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Fitting the model to the training data
model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model.predict(X_test)

# Printing the confusion matrix
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# Printing the classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Confusion matrix:
 [[73 29]
 [13 90]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.72      0.78       102
           1       0.76      0.87      0.81       103

    accuracy                           0.80       205
   macro avg       0.80      0.79      0.79       205
weighted avg       0.80      0.80      0.79       205



Code Explanation

from sklearn.linear_model import LogisticRegression: This line imports the LogisticRegression class from the sklearn.linear_model module. Logistic Regression is a popular classification algorithm used for binary classification tasks.

from sklearn.metrics import classification_report, confusion_matrix: This line imports two metrics:

confusion_matrix: Counts, for each pair of true and predicted class, how many test samples fall into it.

classification_report: Generates a report that includes precision, recall, F1-score, and support for each class.

model = LogisticRegression(max_iter=1000): This line creates an instance of the LogisticRegression model with a maximum iteration limit of 1000. Raising the limit above the default of 100 gives the solver more room to converge on unscaled features, although it does not guarantee convergence.

model.fit(X_train, y_train): This line trains the Logistic Regression model using the training data X_train and the target labels y_train. The model learns the relationship between the features and the target variable.

y_pred = model.predict(X_test): This line uses the trained model to make predictions on the test set X_test. The predicted labels are stored in the variable y_pred.

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred)): This line calculates and prints the confusion matrix comparing the true labels y_test and the predicted labels y_pred. Rows correspond to the true class and columns to the predicted class, so the matrix shows how many predictions were correct and how many were incorrect:

True Negatives (TN): 73

False Positives (FP): 29

False Negatives (FN): 13

True Positives (TP): 90

print("\nClassification Report:\n", classification_report(y_test, y_pred)): This line prints a classification report summarizing the precision, recall, F1-score, and support for each class:

Precision: The ratio of correctly predicted positive observations to the total predicted positives.

Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.

F1-score: The harmonic mean of Precision and Recall. It is high only when both metrics are high.

Support: The number of actual occurrences of the class in the specified dataset.
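The definitions above can be checked against the confusion matrix by hand. The following sketch recomputes precision, recall, and F1 (taken here as the harmonic mean of precision and recall) for both classes from the four counts; the helper name precision_recall_f1 is hypothetical.

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 73, 29, 13, 90

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class from its counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 is "positive": TP=90, FP=29, FN=13
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
# For class 0 the roles swap: its hits are TN, its false alarms are FN, its misses are FP
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values agree with the classification report printed by scikit-learn.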

The output shows:

Class 0 (Negative class):

Precision: 0.85

Recall: 0.72

F1-score: 0.78

Support: 102

Class 1 (Positive class):

Precision: 0.76

Recall: 0.87

F1-score: 0.81

Support: 103

Overall accuracy: 0.80

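Macro average is the unweighted mean of each metric across the two classes, while weighted average weights each class by its support. Because the classes are almost balanced here (102 vs. 103), the two are nearly identical.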
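As an illustration of the two averages, this sketch recomputes the macro and weighted recall from the confusion matrix counts. It is a hand calculation under the assumption that rows of the matrix are true classes, not a replacement for classification_report.

```python
# Counts from the confusion matrix above: rows are true classes, columns are predictions
tn, fp, fn, tp = 73, 29, 13, 90

# Per-class recall and support (class 0 has 102 true samples, class 1 has 103)
recall = {0: tn / (tn + fp), 1: tp / (tp + fn)}
support = {0: tn + fp, 1: tp + fn}

# Macro: plain mean over classes; weighted: mean weighted by class support
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / sum(support.values())

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted recall equals the overall accuracy, since weighting each class's recall by its support simply counts all correct predictions.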

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Splitting features (X) and target variable (y)
X = data.drop("target", axis=1)  # Dropping the target variable
y = data["target"]                 # Target variable

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
scaler = StandardScaler()           # Creating an instance of StandardScaler
X_train_scaled = scaler.fit_transform(X_train)  # Fitting and transforming the training data
X_test_scaled = scaler.transform(X_test)        # Transforming the test data

# Define and train the model
model = LogisticRegression(max_iter=2000)  # Creating a Logistic Regression model with 2000 iterations
model.fit(X_train_scaled, y_train)          # Fitting the model to the scaled training data

# Making predictions
y_pred = model.predict(X_test_scaled)      # Predicting the target variable for the test set

# Printing results
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Confusion matrix:
 [[73 29]
 [13 90]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.72      0.78       102
           1       0.76      0.87      0.81       103

    accuracy                           0.80       205
   macro avg       0.80      0.79      0.79       205
weighted avg       0.80      0.80      0.79       205



Code Explanation

from sklearn.linear_model import LogisticRegression: This line imports the LogisticRegression class from the sklearn.linear_model module, allowing you to use the Logistic Regression algorithm for classification tasks.

from sklearn.preprocessing import StandardScaler: This line imports the StandardScaler class, which is used to standardize the features by removing the mean and scaling to unit variance.

from sklearn.model_selection import train_test_split: This line imports the train_test_split function, which is used to split the dataset into training and testing sets.

X = data.drop("target", axis=1): This line creates the feature set X by dropping the target variable from the dataset data.

y = data["target"]: This line assigns the target variable y from the dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This line splits the data into training and testing sets, with 20% of the data allocated for testing. The random_state parameter ensures that the split is reproducible.

scaler = StandardScaler(): This line creates an instance of the StandardScaler, which will be used to standardize the features.

X_train_scaled = scaler.fit_transform(X_train): This line fits the StandardScaler to the training data and transforms it, resulting in scaled training data.

X_test_scaled = scaler.transform(X_test): This line transforms the test data using the same scaler, ensuring that the test data is scaled based on the training data's statistics.
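To show what fitting on the training data only means in practice, here is a small sketch using a handful of trestbps values from the rows printed earlier. Which values are treated as training or test data is arbitrary, and the demo_ names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# trestbps values from the rows printed earlier; the train/test assignment is arbitrary here
demo_train = np.array([[125.0], [140.0], [145.0], [148.0]])
demo_test = np.array([[138.0]])

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean and std from train only
demo_test_scaled = demo_scaler.transform(demo_test)        # reuses the training mean and std

print("training mean:", demo_scaler.mean_[0])
print("training std: ", demo_scaler.scale_[0])
print("scaled test value:", demo_test_scaled[0, 0])

# Same result by hand: (x - training mean) / training std
manual = (demo_test[0, 0] - demo_train.mean()) / demo_train.std()
print("manual:", manual)
```

Calling fit_transform on the test set instead would use the test set's own mean and standard deviation, which leaks information about the test data into preprocessing.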

model = LogisticRegression(max_iter=2000): This line creates a Logistic Regression model instance with a maximum of 2000 iterations, giving the solver ample room to converge during training.

model.fit(X_train_scaled, y_train): This line fits the Logistic Regression model to the scaled training data.

y_pred = model.predict(X_test_scaled): This line makes predictions on the scaled test data using the trained model, storing the predicted values in y_pred.

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred)): This line prints the confusion matrix, which shows the performance of the classification model. The counts are identical to those of the unscaled model in In [31]:

True Negatives (TN): 73

False Positives (FP): 29

False Negatives (FN): 13

True Positives (TP): 90

print("\nClassification Report:\n", classification_report(y_test, y_pred)): This line prints a detailed classification report, which includes:

Precision: The ratio of correctly predicted positive observations to the total predicted positives.

Recall: The ratio of correctly predicted positive observations to all actual positives.

F1-score: The harmonic mean of Precision and Recall.

Support: The number of actual occurrences of the class in the specified dataset.

The output shows:

Class 0 (Negative class):

Precision: 0.85

Recall: 0.72

F1-score: 0.78

Support: 102

Class 1 (Positive class):

Precision: 0.76

Recall: 0.87

F1-score: 0.81

Support: 103

Overall accuracy: 0.80

As before, macro average is the unweighted mean of each metric across classes, and weighted average weights each class by its support.

In [33]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Creating a synthetic dataset (you can use your own dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a pipeline for data preprocessing and model training
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Hyperparameter grid
param_grid = {
    'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['liblinear', 'saga']  # Trying different solvers
}

# Defining GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameters and accuracy
print("Best hyperparameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)




Best hyperparameters: {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l1', 'logisticregression__solver': 'liblinear'}
Best accuracy: 0.87375




Code Explanation

from sklearn.datasets import make_classification: This line imports the make_classification function, which is used to generate a synthetic dataset for classification.

from sklearn.model_selection import train_test_split, GridSearchCV: This line imports the train_test_split function to split the dataset into training and testing sets, and GridSearchCV to perform hyperparameter tuning.

from sklearn.linear_model import LogisticRegression: This imports the LogisticRegression class, which will be used to create a logistic regression model.

from sklearn.preprocessing import StandardScaler: This imports the StandardScaler class, which standardizes features by removing the mean and scaling to unit variance.

from sklearn.pipeline import make_pipeline: This imports the make_pipeline function to create a pipeline that combines multiple steps into one.

X, y = make_classification(n_samples=1000, n_features=20, random_state=42): This line generates a synthetic dataset with 1000 samples and 20 features, assigning the features to X and the target labels to y.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This splits the data into training and testing sets, with 20% of the data reserved for testing. The random_state parameter ensures that the split is reproducible.

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)): This creates a pipeline that first scales the data using StandardScaler and then applies LogisticRegression with a maximum of 500 iterations for convergence.

param_grid: This dictionary defines the hyperparameter grid for tuning. It includes:

'logisticregression__C': The inverse of the regularization strength (smaller values mean stronger regularization), searched over values from 0.001 to 100.

'logisticregression__penalty': The type of regularization, which can be either 'l1' (Lasso) or 'l2' (Ridge).

'logisticregression__solver': The algorithm to use for optimization, allowing the use of 'liblinear' or 'saga'.
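The double-underscore keys follow from how make_pipeline names its steps. The sketch below, which only builds the objects and does not fit anything, shows where the logisticregression prefix comes from and how many candidates this grid produces; the demo_ names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# make_pipeline names each step after its lowercased class name
print(list(demo_pipeline.named_steps))

demo_grid = {
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__solver": ["liblinear", "saga"],
}

# Every grid key should be a parameter name the pipeline exposes
valid_names = demo_pipeline.get_params()
print(all(key in valid_names for key in demo_grid))

# Number of hyperparameter combinations: 6 values of C * 2 penalties * 2 solvers
n_candidates = len(ParameterGrid(demo_grid))
print("candidates:", n_candidates)
print("fits with cv=5:", n_candidates * 5)
```

With cv=5, each of the 24 candidates is fitted five times, so the search runs 120 fits before refitting the best candidate on the full training set.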

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy'): This initializes GridSearchCV, which will search for the best hyperparameters by evaluating the model with 5-fold cross-validation and using accuracy as the scoring metric.

grid_search.fit(X_train, y_train): This fits the GridSearchCV instance to the training data, running the grid search for hyperparameter optimization.

print("Best hyperparameters:", grid_search.best_params_): This prints the best hyperparameters found during the grid search.

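print("Best accuracy:", grid_search.best_score_): This prints the mean cross-validated accuracy of the best hyperparameter combination, averaged over the 5 folds of the training data. It is not the accuracy on the held-out test set.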
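The cell above splits off X_test and y_test but never evaluates on them. As a minimal sketch of that final step, assuming a smaller synthetic problem and a reduced grid over C only to keep the run short, the cross-validated score can be compared with the held-out score like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem, as in the cell above (sizes reduced to keep the run short)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    {"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Mean accuracy over the 5 validation folds for the best candidate
print("cross-validated accuracy:", search.best_score_)
# Accuracy of the refitted best pipeline on data the search never saw
print("held-out test accuracy:  ", search.score(X_te, y_te))
```

In the notebook, the equivalent call would be grid_search.score(X_test, y_test).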

In [53]:
import numpy as np
import pandas as pd

# Get the best model's coefficients from grid search
best_model = grid_search.best_estimator_

# Retrieve the coefficients
coefficients = best_model.named_steps['logisticregression'].coef_[0]

# Check the shape of the coefficients
print("Coefficient shape:", coefficients.shape)

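# Manually specify feature names.
# Caution: grid_search was fitted on synthetic data from make_classification,
# so these names are placeholder labels, not the columns of heart.csv.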
feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 
                 'fbs', 'restecg', 'thalach', 'exang', 
                 'oldpeak', 'slope', 'ca', 'thal', 
                 'sodium_level', 'magnesium_level', 'potassium_level', 
                 'feature_16', 'feature_17', 'feature_18', 'feature_19']

# Check the length of the feature names list
print("Number of feature names:", len(feature_names))

# Check if the lengths of feature names and coefficients match
if len(feature_names) == len(coefficients):
    # Create a DataFrame with features and their coefficients
    coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

    # Calculate the percentage impact based on the absolute value of coefficients
    coef_df['Percentage Impact'] = (np.abs(coef_df['Coefficient']) / np.sum(np.abs(coefficients))) * 100

    # Sort the DataFrame by the coefficients in descending order
    coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

    # Print the features with their coefficients and percentage impact
    print("Most impactful features with their coefficients and percentage impact:\n", coef_df)
else:
    print("Error: The lengths of feature names and coefficients are not the same.")


Coefficient shape: (20,)
Number of feature names: 20
Most impactful features with their coefficients and percentage impact:
             Feature  Coefficient  Percentage Impact
5               fbs     2.671585          84.793357
11               ca     0.266887           8.470714
2                cp     0.107302           3.405645
10            slope     0.065227           2.070233
1               sex     0.000000           0.000000
18       feature_18     0.000000           0.000000
16       feature_16     0.000000           0.000000
15  potassium_level     0.000000           0.000000
13     sodium_level     0.000000           0.000000
12             thal     0.000000           0.000000
0               age     0.000000           0.000000
9           oldpeak     0.000000           0.000000
8             exang     0.000000           0.000000
7           thalach     0.000000           0.000000
6           restecg     0.000000           0.000000
4              chol     0.000000           

Coefficients and Percentage Impact

Coefficient: The coefficient for each feature represents its influence on the model's prediction. Note that the grid search in In [33] was fitted on a synthetic dataset generated by make_classification, not on heart.csv, so the heart disease names in the table are labels attached by hand to synthetic columns. The readings below describe how to interpret such a table; they are not clinical findings.

Positive Coefficient: A positive coefficient indicates that an increase in the feature raises the log-odds, and therefore the predicted probability, of the positive class. For example, the coefficient for the column labeled fbs (fasting blood sugar above 120 mg/dl) is 2.671585, so higher values of that column push the prediction toward class 1.

Negative Coefficient: A negative coefficient implies that an increase in the feature lowers the predicted probability of the positive class. For instance, the coefficient for the column labeled magnesium_level is -0.032376, so higher values push the prediction slightly toward class 0.

Zero Coefficient: A coefficient of zero means that the fitted model does not use the feature. Many features in these results have a coefficient of zero because the selected l1 penalty with C = 0.1 shrinks weak coefficients exactly to zero; this shows the model ignores them, not that they are unrelated to the target.
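One common way to read a logistic regression coefficient is as an odds ratio, obtained by exponentiating it. The sketch below applies this to the coefficients quoted above; the helper name odds_ratio is hypothetical. Because the pipeline standardizes the features, "one unit" here means one standard deviation of the feature.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of class 1 per one-unit increase in the feature."""
    return math.exp(coefficient)

# Coefficients quoted in the text above
for label, coef in [("fbs", 2.671585), ("magnesium_level", -0.032376), ("age", 0.0)]:
    print(f"{label}: coefficient={coef:+.6f} odds ratio={odds_ratio(coef):.3f}")
```

An odds ratio above 1 raises the odds of class 1, a value below 1 lowers them, and a zero coefficient gives exactly 1, meaning no change.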

Percentage Impact: This shows each feature's share of the sum of absolute coefficients. Because the pipeline standardizes the features before fitting, the coefficients are on a common scale and their magnitudes can be compared. For example:

The feature fbs accounts for 84.79% of the total, making it by far the largest weight in this model.

The feature ca has a percentage impact of 8.47%, the second largest.

Many other features (e.g., age, thal, oldpeak) have zero coefficients, so they contribute nothing to the model's predictions.
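The percentage impact calculation can be sketched on a small made-up coefficient vector. The values below are hypothetical and chosen so the arithmetic is easy to follow; they are not the coefficients from the table. Unlike the cell above, this sketch ranks by absolute value, a design choice that keeps strong negative weights from sinking to the bottom of the list.

```python
import numpy as np

# Hypothetical coefficients for illustration; not the values from the table above
demo_coefficients = np.array([2.0, -1.0, 0.5, 0.0, -0.5])

# Each feature's share of the total absolute weight, in percent
abs_coefficients = np.abs(demo_coefficients)
percentage_impact = abs_coefficients / abs_coefficients.sum() * 100

# Rank by magnitude so that strong negative weights are not pushed to the bottom
order = np.argsort(-abs_coefficients)
for i in order:
    print(f"feature_{i}: coefficient={demo_coefficients[i]:+.1f} impact={percentage_impact[i]:.1f}%")
```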

Interpretation of Results

Most Impactful Features:

fbs (Fasting Blood Sugar): This column has the highest positive coefficient, so higher values move the prediction most strongly toward the positive class.

ca (number of major vessels colored by fluoroscopy): This column has the second-largest positive coefficient, so it also moves the prediction toward the positive class, though far less than fbs.

Negative Impact Features:

magnesium_level: The small negative coefficient means that higher values of this column move the prediction slightly toward the negative class.

Features with No Effect:

Several features show a coefficient of zero, meaning the l1-regularized model does not use them for prediction. They may still carry signal under a weaker penalty (a larger C) or in a different model.

Conclusion

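In summary, the fitted model puts most of its weight on the columns labeled fbs and ca, gives the column labeled magnesium_level a small negative weight, and zeroes out most other features through l1 regularization. Because the grid search was fitted on synthetic data from make_classification and not on heart.csv, these weights do not support clinical conclusions. Rerunning the search on the heart disease features is needed before the coefficients can inform feature selection or be read as risk factors.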
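One way to keep feature names and coefficients aligned is to take the names from the DataFrame the model was trained on instead of typing them by hand. The following is a minimal sketch under that assumption; it uses a small synthetic DataFrame with placeholder column names as a stand-in for data.drop("target", axis=1), and the demo_ names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for data.drop("target", axis=1); the column names here are placeholders
X_values, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_values, columns=[f"col_{i}" for i in range(5)])

demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
demo_model.fit(X_demo, y_demo)

# Index the coefficients by the columns the model was actually trained on
coef_by_feature = pd.Series(
    demo_model.named_steps["logisticregression"].coef_[0],
    index=X_demo.columns,
    name="Coefficient",
)

# Largest absolute weights first
print(coef_by_feature.sort_values(key=abs, ascending=False))
```

Applied to the notebook, fitting grid_search on the heart.csv split from In [32] and indexing the coefficients by X_train.columns would give a table whose names match the data.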