## SAHeart Dataset


Cardiovascular diseases are among the most frequent causes of death. Clinically, there are
widely accepted indicators of the risk of developing a cardiovascular disease. Hence,
knowing the risk factors that lead to cardiovascular disease can aid decision-making
for pretreatment and for lifestyle changes that avoid or reduce future complications.

The dataset SAHeart.csv is about coronary heart disease (CHD) in South Africa. The goal is
to use a set of indicators to identify whether a patient is at risk of developing coronary
heart disease. The predictor variables are:

1. x1 = sbp: systolic blood pressure
2. x2 = tobacco: cumulative tobacco (kg)
3. x3 = ldl: low-density lipoprotein cholesterol level
4. x4 = adiposity: severe overweight
5. x5 = famhist: family history of heart disease (Present, Absent)
6. x6 = typea: type-A behavior
7. x7 = obesity: excessive fat accumulation
8. x8 = alcohol: current alcohol consumption
9. x9 = age: age at onset
10. y = chd: response, coronary heart disease

In [6]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet,LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,mean_absolute_error,mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

### Loading the dataset and replacing non-numeric data with a reasonable numerical representation

In [8]:
# libraries specific to this question
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score

# loading the dataset
saheart = pd.read_csv('SAheart.data.csv')

In [9]:
# encode famhist as integers: Absent -> 0, Present -> 1 (LabelEncoder sorts the class labels)
lb_make = LabelEncoder()
saheart['famhist'] = lb_make.fit_transform(saheart["famhist"])
saheart

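| Unnamed: 0 | row.names | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 160 | 12.00 | 5.73 | 23.11 | 1 | 49 | 25.30 | 97.20 | 52 | 1 |
| 1 | 2 | 144 | 0.01 | 4.41 | 28.61 | 0 | 55 | 28.87 | 2.06 | 63 | 1 |
| 2 | 3 | 118 | 0.08 | 3.48 | 32.28 | 1 | 52 | 29.14 | 3.81 | 46 | 0 |
| 3 | 4 | 170 | 7.50 | 6.41 | 38.03 | 1 | 51 | 31.99 | 24.26 | 58 | 1 |
| 4 | 5 | 134 | 13.60 | 3.50 | 27.78 | 1 | 60 | 25.99 | 57.34 | 49 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 457 | 459 | 214 | 0.40 | 5.98 | 31.72 | 0 | 64 | 28.45 | 0.00 | 58 | 0 |
| 458 | 460 | 182 | 4.20 | 4.41 | 32.10 | 0 | 52 | 28.61 | 18.72 | 52 | 1 |
| 459 | 461 | 108 | 3.00 | 1.59 | 15.23 | 0 | 40 | 20.09 | 26.64 | 55 | 0 |
| 460 | 462 | 118 | 5.40 | 11.61 | 30.79 | 0 | 64 | 27.35 | 23.97 | 40 | 0 |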
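`LabelEncoder` assigns integer codes to the class labels in sorted order, which is why `Absent` becomes 0 and `Present` becomes 1 in the `famhist` column. The plain-Python sketch below mimics that rule. It is an illustration only, not scikit-learn's implementation, and `sample_labels` is a made-up list rather than data from the file.

```python
# Hypothetical labels, for illustration only (not read from the dataset).
sample_labels = ["Present", "Absent", "Present", "Absent", "Absent"]

# The sorted unique labels define the codes: position in the sorted list = integer code.
classes = sorted(set(sample_labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in sample_labels]

print(mapping)  # {'Absent': 0, 'Present': 1}
print(encoded)  # [1, 0, 1, 0, 0]
```

Writing the mapping out explicitly, for example with `saheart['famhist'].map(mapping)`, would make the encoding visible in the notebook instead of leaving it implicit in the encoder.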


### Training a logistic regressor to tell whether a patient has a high risk of CHD

In [10]:
# feature matrix (the 9 predictors) and 1-D target vector
x = saheart[['sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea', 'obesity','alcohol', 'age']].values
y = saheart['chd'].values

# hold out 25% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# fitting Logistic Regression to the training set
classifier = LogisticRegression(random_state = 0, solver='lbfgs', multi_class='auto')
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)


In [11]:
# checking confusion matrix (rows = true class, columns = predicted class) and accuracy
con_mat = confusion_matrix(y_test, y_pred)
print(con_mat)
print(accuracy_score(y_test, y_pred))

[[68  9]
 [23 16]]
0.7241379310344828
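The accuracy printed above can be recovered by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes (scikit-learn's convention). The short sketch below does that arithmetic and also derives precision and recall for the CHD class, which are additions not computed in the original cell.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
tn, fp = 68, 9    # true class 0: correctly predicted 0 / wrongly predicted 1
fn, tp = 23, 16   # true class 1: wrongly predicted 0 / correctly predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # share of all test patients classified correctly
precision = tp / (tp + fp)      # share of predicted CHD cases that are real
recall = tp / (tp + fn)         # share of real CHD cases that were found

print(f"accuracy  = {accuracy:.4f}")   # 0.7241
print(f"precision = {precision:.4f}")  # 0.6400
print(f"recall    = {recall:.4f}")     # 0.4103
```

The recall shows that the model finds only 16 of the 39 CHD cases in the test set, which the accuracy figure alone does not reveal.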


### Identifying whether a patient with the following data is at high risk: x = [133, 3.3, 4.6, 34.5, present, 52, 30, 32, 20, 44]

In [12]:
# 'present' is encoded as 1; the question lists 10 values but the model takes 9 features, so the last value (44) is dropped
x1 = [[133, 3.3, 4.6, 34.5, 1 , 52, 30, 32, 20]]
y_pred1 = classifier.predict(x1)
y_pred1

array([0], dtype=int64)

### Checking the most determinant factors for heart disease


In [13]:
# print the fitted coefficients (in the same order as the feature columns) and the intercept
print(classifier.coef_, classifier.intercept_)

[[-0.01245675  0.09876644  0.09806837  0.06695784  0.73997731  0.0243292
  -0.16390287 -0.00247959  0.0327128 ]] [-0.26479137]
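To see how these numbers produce a prediction, the decision for the patient in the previous question can be reproduced by hand. Logistic regression computes the score z = intercept + Σ coefᵢ · xᵢ and predicts class 1 when the probability sigmoid(z) exceeds 0.5. The sketch below uses the rounded coefficients printed above, so the result is approximate.

```python
import math

# Coefficients and intercept copied (rounded) from the printed output above.
coef = [-0.01245675, 0.09876644, 0.09806837, 0.06695784, 0.73997731,
        0.0243292, -0.16390287, -0.00247959, 0.0327128]
intercept = -0.26479137

# Patient vector from the previous question (famhist 'present' encoded as 1).
patient = [133, 3.3, 4.6, 34.5, 1, 52, 30, 32, 20]

z = intercept + sum(c * v for c, v in zip(coef, patient))  # linear score
p = 1.0 / (1.0 + math.exp(-z))                             # sigmoid -> P(chd = 1)
predicted = int(p > 0.5)

print(f"z = {z:.3f}, P(chd=1) = {p:.3f}, class = {predicted}")
# z ≈ -1.172, P(chd=1) ≈ 0.237, class = 0
```

The negative score gives a probability below 0.5, which agrees with the `array([0])` returned by `classifier.predict`.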


###### From the coefficients of the model, the feature with the highest coefficient is famhist (0.74), which suggests that family history is the most important feature in the dataset. Because the features are measured on different scales, a ranking by raw coefficient is only indicative.
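A common way to check a coefficient-based ranking is to refit on standardized features, so that each coefficient reflects a one-standard-deviation change and the magnitudes become comparable. The sketch below is an illustration under stated assumptions, not part of the original analysis: `standardized_coefficients` is a hypothetical helper, and the demo runs on synthetic data rather than on SAheart.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(X, y, feature_names):
    """Fit logistic regression on z-scored features and rank them by |coefficient|."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    coefs = model.named_steps["logisticregression"].coef_[0]
    return sorted(zip(feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)


# Synthetic demo (assumption): "wide" has a large numeric range and a strong effect,
# "binary" is a 0/1 flag with a weak effect.
rng = np.random.default_rng(0)
n = 500
wide = rng.normal(100.0, 50.0, n)
binary = rng.integers(0, 2, n).astype(float)
logit = 0.05 * (wide - 100.0) + 0.5 * (binary - 0.5)
y_demo = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X_demo = np.column_stack([wide, binary])

ranking = standardized_coefficients(X_demo, y_demo, ["wide", "binary"])
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

In the synthetic data the per-unit effect of `wide` (0.05) is much smaller than that of `binary` (0.5), yet `wide` ranks first once both are standardized. On the notebook's data the helper could be called as `standardized_coefficients(x_train, y_train.ravel(), ['sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea', 'obesity', 'alcohol', 'age'])`; whether famhist stays on top is something that run would have to show.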