# **Data Programming In Python - Project Part 2**

## **Prediction of Heart Disease**: An Analysis of Heart Disease and Factors that Contribute to Diagnosis, Specifically Examining Cholesterol in Different Demographics









### **Introduction**

This project explores how specific demographics, health factors and clinical indicators correlate with heart disease, and how significant each is in heart disease prognosis.
The aim of this study is to provide insight into heart disease predictors, showcase a range of analytical techniques and provide evidence of my ability to handle data.

The heart disease dataset was obtained from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/45/heart+disease

### Methodology

The analysis began by importing the data and presenting it in a clear and understandable format, then checking for outliers and null data points and correcting or removing them. The data came split into predictors and target, so I combined them before cleaning; otherwise any rows removed from one dataframe would also have to be removed from the other to keep them aligned.
I then computed z-scores to identify outliers, and fitted a regression model to measure its ability to predict new data.


### Libraries

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder



# install the ucimlrepo package, which gives access to the UCI repository containing the heart disease dataset.
!pip install ucimlrepo



### Data Pre-processing

Import the dataset from the UCI repository.

The dataset contains a total of 76 attributes (i.e., columns); however, the analysis is conducted on a subset of the original dataset, consisting of 14 attributes: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and num (the target variable).

I began by importing the dataset as instructed on the machine learning repository, then printed the metadata and variable information. The metadata displays the column names present and the variable information explains the abbreviation of the column names.

In [100]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
heart_disease = fetch_ucirepo(id=45)

# data (as pandas dataframes)
X = heart_disease.data.features
y = heart_disease.data.targets

Fill in the variable descriptions that are null in the repository's variable information

In [101]:
heart_disease.variables.loc[2, 'description'] = 'chest pain type: Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic'
heart_disease.variables.loc[6, 'description'] = 'restecg: resting electrocardiographic results, Value 0: normal, Value 1: having ST-T wave abnormality, Value 2: showing probable or definite left ventricular hypertrophy by Estes criteria'
heart_disease.variables.loc[10, 'description'] = 'slope: the slope of the peak exercise ST segment: Value 1: upsloping, Value 2: flat, Value 3: downsloping'
heart_disease.variables.loc[12, 'description'] = 'thalassemia (blood condition): 3 = normal; 6 = fixed defect; 7 = reversable defect'
heart_disease.variables.loc[13, 'description'] = 'presence of heart disease (values 1,2,3,4) from absence (value 0)'
# metadata
print(heart_disease.metadata)

# variable information
print(heart_disease.variables)

{'uci_id': 45, 'name': 'Heart Disease', 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease', 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv', 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 303, 'num_features': 13, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['num'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1989, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C52P4X', 'creators': ['Andras Janosi', 'William Steinbrunn', 'Matthias Pfisterer', 'Robert Detrano'], 'intro_paper': {'ID': 231, 'type': 'NATIVE', 'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.', 'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M

Display the dataframes

In [102]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0


In [103]:
y.head()

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0


The data given to me already had the target values separate from the predictors. While helpful, I wanted to first get insight into the data as a whole. After concatenating the target onto the larger dataset, I cleaned the data by removing the records with null values, and will later examine the data for outliers.

In [105]:
# concatenate the y column onto the dataframe
X_with_target = pd.concat([X, y], axis=1)

# display the dataframe
X_with_target

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1


In [106]:
# select the rows with null values present
missing_rows = X_with_target[X_with_target[['ca', 'thal']].isnull().any(axis=1)]

# display the dataframe
missing_rows

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
87,53,0,3,128,216,0,2,115,0,0.0,1,0.0,,0
166,52,1,3,138,223,0,0,169,0,0.0,1,,3.0,0
192,43,1,4,132,247,1,2,143,1,0.1,2,,7.0,1
266,52,1,4,128,204,1,0,156,1,1.0,2,0.0,,2
287,58,1,2,125,220,0,0,144,0,0.4,2,,7.0,0
302,38,1,3,138,175,0,0,173,0,0.0,1,,3.0,0


In [107]:
# remove the rows containing null values from the dataframe
heart_df = X_with_target.drop(missing_rows.index)

# display the dataframe
heart_df

# summary statistics for the cleaned data
heart_df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0
mean,54.542088,0.676768,3.158249,131.693603,247.350168,0.144781,0.996633,149.599327,0.326599,1.055556,1.602694,0.676768,4.73064,0.946128
std,9.049736,0.4685,0.964859,17.762806,51.997583,0.352474,0.994914,22.941562,0.469761,1.166123,0.618187,0.938965,1.938629,1.234551
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0,0.0
50%,56.0,1.0,3.0,130.0,243.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,3.0,0.0
75%,61.0,1.0,4.0,140.0,276.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,2.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,4.0


I then calculated column-wise z-scores to find out if there were any outliers in the data, flagging any value more than 2.5 standard deviations from its column mean.

In [108]:
z_score = heart_df.apply(stats.zscore, axis = 0 )
outliers = (np.abs(z_score) > 2.5)
show = heart_df[outliers.any(axis=1)]
show

# list the (row, column) positions whose |z-score| exceeds 2.5, shown as 'True' values
true_locations = outliers.stack()[outliers.stack()]
true_locations

Unnamed: 0,Unnamed: 1,0
48,chol,True
83,trestbps,True
91,oldpeak,True
121,chol,True
121,oldpeak,True
123,oldpeak,True
126,trestbps,True
126,oldpeak,True
132,age,True
152,chol,True


In [109]:
X = heart_df.iloc[:, :-1]
y = heart_df.iloc[:, -1:]

X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57,0,4,140,241,0,0,123,1,0.2,2,0.0,7.0
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0


Before expunging those outliers I calculated the correlation coefficient between each predictor and the target, since there is little point removing outliers from a column that is unrelated to the target.

While doing so I also calculated the p-values to check the statistical significance of each column.

In [110]:
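print(X.columns.tolist())
results = {}
# note: X.columns[:-1] leaves out the last predictor ('thal'), so it is absent from the output below
for col in X.columns[:-1]:
    corr, p_value = pearsonr(X[col], y['num'])
    results[col] = {'correlation': corr, 'p_value': p_value}

# use a loop name other than `stats` so the scipy.stats import is not overwritten
for var, res in results.items():
    print(f"Variable: {var}")
    print(f"  Correlation: {res['correlation']:.3f}")
    print(f"  P-value: {res['p_value']:.3f}")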

['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Variable: age
  Correlation: 0.222
  P-value: 0.000
Variable: sex
  Correlation: 0.227
  P-value: 0.000
Variable: cp
  Correlation: 0.404
  P-value: 0.000
Variable: trestbps
  Correlation: 0.160
  P-value: 0.006
Variable: chol
  Correlation: 0.066
  P-value: 0.254
Variable: fbs
  Correlation: 0.049
  P-value: 0.400
Variable: restecg
  Correlation: 0.184
  P-value: 0.001
Variable: thalach
  Correlation: -0.421
  P-value: 0.000
Variable: exang
  Correlation: 0.392
  P-value: 0.000
Variable: oldpeak
  Correlation: 0.501
  P-value: 0.000
Variable: slope
  Correlation: 0.375
  P-value: 0.000
Variable: ca
  Correlation: 0.521
  P-value: 0.000


FBS (fasting blood sugar) had the highest p-value, 0.400, so I removed it and ran the analysis again. Chol (serum cholesterol in mg/dl) had the next largest, 0.254, and was subsequently removed.

In [111]:
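remaining_vars = X.columns.tolist()
results = {}
significance_level = 0.05

# repeatedly drop the predictor with the largest p-value until every remaining one is significant
while True:
    temp_results = {}

    # correlation and p-value of each remaining predictor against the target
    for col in remaining_vars:
        corr, p_value = pearsonr(X[col], y['num'])
        temp_results[col] = {'correlation': corr, 'p_value': p_value}

    # the least significant predictor in this pass
    var_to_remove = max(temp_results, key=lambda x: temp_results[x]['p_value'])
    max_p_value = temp_results[var_to_remove]['p_value']

    # stop once even the largest p-value is below the significance level
    if max_p_value < significance_level:
        results.update(temp_results)
        break

    print(f"Removing variable: {var_to_remove} (p-value: {max_p_value:.3f})")
    remaining_vars.remove(var_to_remove)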

Removing variable: fbs (p-value: 0.400)
Removing variable: chol (p-value: 0.254)
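
One observation about the loop above: each Pearson p-value is computed for a single column against `num` on its own, so removing one column does not change the p-values of the others. Under that assumption, the stepwise removal should select exactly the same columns as a one-off filter at the 0.05 level. The sketch below checks this using the rounded p-values printed earlier (`thal` is left out because its p-value is not shown); the function names are hypothetical. This equivalence would not hold for p-values taken from a multivariable model, where dropping a term changes the remaining estimates.

```python
def backward_eliminate(p_values, alpha=0.05):
    """Repeatedly drop the variable with the largest p-value until all are below alpha."""
    remaining = dict(p_values)
    removed = []
    while remaining:
        worst = max(remaining, key=remaining.get)
        if remaining[worst] < alpha:
            break
        removed.append(worst)
        del remaining[worst]
    return list(remaining), removed


def single_pass_filter(p_values, alpha=0.05):
    """Keep every variable whose p-value is below alpha, in one pass."""
    return [name for name, p in p_values.items() if p < alpha]


# Rounded p-values as printed in the correlation output above.
p_values = {
    'age': 0.000, 'sex': 0.000, 'cp': 0.000, 'trestbps': 0.006,
    'chol': 0.254, 'fbs': 0.400, 'restecg': 0.001, 'thalach': 0.000,
    'exang': 0.000, 'oldpeak': 0.000, 'slope': 0.000, 'ca': 0.000,
}

kept_loop, removed = backward_eliminate(p_values)
kept_filter = single_pass_filter(p_values)
print("removed:", removed)
print("same selection:", kept_loop == kept_filter)
```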


In [112]:
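# note: X still holds every predictor; pass X[remaining_vars] instead to fit on the selected columns only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# y is a one-column DataFrame, so coef_ has shape (1, n_features); flatten it before pairing with column names
coefficients = list(zip(X.columns, model.coef_.ravel()))
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Coefficients:")

print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R-squared:", r2)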

Coefficients:
Intercept: [-0.38252144]
Mean Squared Error: 0.7201762224585851
R-squared: 0.4655464026281372


The coefficients show how each independent variable (predictor) relates to the dependent variable (the target, `num`). The intercept is the predicted value of y when all the independent variables are 0.
The mean squared error is the average squared difference between the predicted and actual values; the closer it is to zero, relative to the range of the target, the better the regression model is at predicting. The R-squared value shows the proportion of variance in the dependent variable that can be explained by the independent variables, i.e. how well the model fits the data.

The regression model shows that about 47% of the variance in the target can be explained by the predictors. This is a fairly low fit, which means that the model's predictions are only moderately accurate and not truly reliable.
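
To show where the two metrics come from, the following sketch computes the mean squared error and R-squared by hand from their definitions, using only the standard library. The function names and the six actual/predicted pairs are made up for illustration and are not the model's test results.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted severity scores.
y_true = [0, 2, 1, 0, 3, 1]
y_pred = [0.3, 1.4, 1.1, 0.5, 2.2, 0.9]

mse = mean_squared_error_manual(y_true, y_pred)
r2 = r2_manual(y_true, y_pred)
print(f"MSE = {mse:.3f}, R-squared = {r2:.3f}")

# A model that always predicts the mean of y explains none of the variance.
baseline = [sum(y_true) / len(y_true)] * len(y_true)
print(f"baseline R-squared = {r2_manual(y_true, baseline):.3f}")
```

The baseline comparison is a useful reference point: an R-squared of 0 means the model does no better than always predicting the average, and 1 means the predictions match the actual values exactly.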