# 1. Logistic Regression

In this notebook, we build a logistic regression model to predict the presence or absence of heart disease (a binary classification problem).

First, we import the libraries we will use:

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Then we read the dataset CSV file and list its columns.
The file 'UCI Heart Disease Data - Column Descriptions.txt' describes each column.

In [21]:
df = pd.read_csv('heart_disease_uci.csv')
dataset_columns = df.columns.tolist()
dataset_columns

['id',
 'age',
 'sex',
 'dataset',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalch',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal',
 'num']

Check for missing values, encode the categorical variables, and scale the numerical features:

In [23]:
# Count missing values per column (this only reports them; nothing is imputed here)
print(df.isnull().sum())

# Encode categorical variables as integer codes
label_encoders = {}
categorical_columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'dataset']
for column in categorical_columns:
    le = LabelEncoder()
    # Cast to string to avoid conversion errors on mixed types; note that this
    # may also turn missing values into the literal string 'nan', which then gets its own code
    df[column] = le.fit_transform(df[column].astype(str))
    label_encoders[column] = le  # keep the fitted encoder so codes can be mapped back to labels

# Scale numerical variables to zero mean and unit variance
scaler = StandardScaler()
numerical_columns = ['age', 'trestbps', 'chol', 'thalch', 'oldpeak']
# StandardScaler ignores NaN when fitting and leaves it in place, so missing values survive this step
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])




id           0
age          0
sex          0
dataset      0
cp           0
trestbps    59
chol        30
fbs          0
restecg      0
thalch      55
exang        0
oldpeak     62
slope        0
ca           0
thal         0
num          0
dtype: int64
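
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`toy`, with invented values, not the real dataset) to show the percentage missing per column and how many rows would survive if we dropped every incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "trestbps": [130.0, np.nan, 120.0, 140.0, 150.0],
    "chol": [250.0, 200.0, np.nan, 300.0, 180.0],
    "age": [63, 67, 41, 56, 62],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Number of rows left if every row with at least one NaN were dropped.
rows_complete = len(toy.dropna())
print("Complete rows:", rows_complete, "of", len(toy))
```

On the real `df`, the same two expressions (`df.isnull().mean() * 100` and `len(df.dropna())`) would help decide between dropping rows and imputing.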


Check the transformed data types and preview the data:

In [24]:
print(df.dtypes)
print(df.head())

id            int64
age         float64
sex           int64
dataset       int64
cp            int64
trestbps    float64
chol        float64
fbs           int64
restecg       int64
thalch      float64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
num           int64
dtype: object
   id       age  sex  dataset  cp  trestbps      chol  fbs  restecg    thalch  \
0   1  1.007386    1        0   3  0.675287  0.305908    1        0  0.480653   
1   2  1.432034    1        0   0  1.462483  0.784599    0        0 -1.140262   
2   3  1.432034    1        0   0 -0.636705  0.269780    0        0 -0.329805   
3   4 -1.752828    1        0   2 -0.111908  0.459450    0        2  1.908602   
4   5 -1.328180    0        0   1 -0.111908  0.043982    0        0  1.329704   

   exang   oldpeak  slope  ca  thal  num  
0      0  1.303159      0   0     0    0  
1      1  0.569611      1   3     2    2  
2      1  1.578239      1   2     3    1  
3      

Divide the dataset into training and testing sets:

In [25]:
# Split the dataset into training and testing sets
X = df.drop(['id', 'num'], axis=1)  # assuming 'id' is irrelevant and 'num' is the target
y = (df['num'] > 0).astype(int)  # binary target, assuming num > 0 means disease is present
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% testing
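
A side note on ordering: the scaler above was fitted on the whole dataset before the split, so the test rows influenced the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates this on synthetic data; `X_toy`, `y_toy`, and `toy_scaler` are made-up names, and `stratify` is an optional extra that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and binary labels, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

# Split first; stratify keeps the share of each class similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows.
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to zero by construction
print(X_te_scaled.mean(axis=0))  # generally not exactly zero
```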


Train the logistic regression model:

In [26]:
# Creating the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
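
The error comes from the NaN values still present in the numerical columns. One way to act on the error message's suggestion is to put an imputer in front of the model inside a pipeline. The following is a minimal sketch on synthetic data, not the notebook's actual fix: `X_missing`, `y_toy`, and `pipe` are made-up names, and median imputation is just one reasonable choice of strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to mimic missing measurements.
X_missing = X_toy.copy()
X_missing[rng.random(X_toy.shape) < 0.1] = np.nan

# Impute with the column median, scale, then fit the classifier.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_missing, y_toy)  # no ValueError: NaN is filled before the model sees it

preds = pipe.predict(X_missing)
print("Training accuracy:", pipe.score(X_missing, y_toy))
```

Applied to this notebook, the same pipeline could be fitted with `pipe.fit(X_train, y_train)`, which also keeps the imputation and scaling statistics confined to the training rows.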

Evaluate the model's performance using accuracy, the confusion matrix, and a classification report (this requires the fit above to succeed):

In [None]:
# Predicting test data
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
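
To make these metrics concrete, here is a small worked example on hand-written labels (`y_true_toy` and `y_pred_toy` are invented, not model output). It shows how accuracy, precision, and recall follow from the four cells of a binary confusion matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: 4 negatives and 4 positives, with one error of each kind.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)

# The manual accuracy matches scikit-learn's helper.
print(accuracy_score(y_true_toy, y_pred_toy))
```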

Visualize the model results:

In [None]:
# Visualizing the confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()