# Heart Disease Framingham

## Introduction




### Problem

The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in the United States and other developed countries are due to cardiovascular diseases. Early prognosis of cardiovascular disease can help high-risk patients decide on lifestyle changes and in turn reduce complications. This project compares the performance of different models and tries to improve their predictive performance.

### Source 

The dataset is publicly available on the Kaggle website and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information: it includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral and medical risk factors.

## Logistic Regression Model

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/soltaniehha/Intro-to-Data-Analytics/master/data/AnalyticsEdge-Datasets/Framingham.csv')
df.head(3)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0


### Data Description

* Demographic risk factors
  * male: sex of patient
  * age: age in years at first examination
  * education: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)
* Behavioral risk factors
  * currentSmoker, cigsPerDay: Smoking behavior
* Medical history risk factors
  * BPMeds: On blood pressure medication at time of first examination
  * prevalentStroke: Previously had a stroke
  * prevalentHyp: Currently hypertensive
  * diabetes: Currently has diabetes
* Risk factors from first examination
  * totChol: Total cholesterol (mg/dL)
  * sysBP: Systolic blood pressure
  * diaBP: Diastolic blood pressure
  * BMI: Body Mass Index, weight (kg)/height (m)²
  * heartRate: Heart rate (beats/minute)
  * glucose: Blood glucose level (mg/dL)



### Preprocessing - no predictive power/handling missing values

In [None]:
# Overview of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  TenYearCHD       4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB


In [None]:
# Check missing values
df.isnull().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

First, we will drop the `education` variable. The reasoning is that people at every level of education are equal in the face of disease, so `education` is not used here as a variable affecting disease. In addition, in a medical setting, subjective variables such as education level should not influence the output of a machine learning model, since they may bias doctors' judgment of patients.

In [None]:
# Drop education column
df.drop(['education'], axis=1, inplace=True)

Among the remaining variables, `cigsPerDay`, `BPMeds`, `totChol`, `BMI`, `heartRate` and `glucose` still have null values. Rows with at least one null value account for about 12% of the data (4,240 - 3,751 = 489 rows). Because these variables are medically important, the affected rows are dropped rather than filled in with guessed values, and given their small proportion this does not lose too much data.

In [None]:
# Calculate missing entries percentage
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_percentage = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)*100
missing_results = pd.concat([missing_values, missing_percentage], axis=1, keys=["Total", "Percentage"])
missing_data = missing_results[missing_results['Total']>0]
missing_data

Unnamed: 0,Total,Percentage
glucose,388,9.150943
BPMeds,53,1.25
totChol,50,1.179245
cigsPerDay,29,0.683962
BMI,19,0.448113
heartRate,1,0.023585
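
The percentages above are per column. The share of rows that `dropna(axis=0)` removes is a row-level quantity, and it can be smaller than the sum of the column percentages when one row has several missing values. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the Framingham data):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with overlapping missing entries
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, np.nan, 85.0],
    "totChol": [195.0, np.nan, 245.0, 225.0, 285.0],
    "BMI":     [26.97, 28.73, 25.34, 28.58, 23.10],
})

# Per-column share of missing values, in percent
col_pct = toy.isnull().mean() * 100

# Share of rows with at least one missing value, in percent
row_pct = toy.isnull().any(axis=1).mean() * 100

# Number of rows that survive dropna(axis=0)
remaining = len(toy.dropna(axis=0))

print(col_pct)
print(row_pct, remaining)
```

In the notebook itself, the row counts before and after dropping (4,240 and 3,751) imply that 489 rows, roughly 11.5%, contain at least one missing value.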


In [None]:
# Drop missing entries
df.dropna(axis=0, inplace=True)
df.isnull().sum()

male               0
age                0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64

Let's check how much data remains: we still have 3,751 rows for further analysis.

In [None]:
# Check remaining entries
df.shape

(3751, 15)

### Preprocessing - categorical variables
Before modeling, we need to make sure that all the variables are numerical. As the sample below shows, they already are, so the data is clean and no further processing is needed.

In [None]:
# Show dataset sample
df.head(3)

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0


### Create feature matrix and target vector

In [None]:
# Create a feature matrix without `TenYearCHD` and call it `X`:
X = df.drop('TenYearCHD', axis=1)
X.shape

(3751, 14)

In [None]:
# Create a target vector with `TenYearCHD` and call it `y`:
y = df['TenYearCHD']
y.shape

(3751,)

### Split the data randomly into train and test

Next, the data will be split into training and test sets. The test set accounts for 30% of the dataset and the training set for the remaining 70%.

In [None]:
# Split dataset randomly by using sklearn package
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=780)

In [None]:
Xtrain.shape

(2625, 14)

In [None]:
ytrain.shape

(2625,)

In [None]:
Xtest.shape

(1126, 14)

In [None]:
ytest.shape

(1126,)

### Instantiate and fit a logistic regression model

In [None]:
# Set "liblinear" as solver for logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')

In [None]:
# Fit model to the training data
model.fit(Xtrain, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

### Prediction

In [None]:
# predict on test data and store the results as y_model
y_model = model.predict(Xtest)

In [None]:
# Add predicted results to the test dataset
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()

Unnamed: 0,index,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD,predicted
0,1290,0,40,1,15.0,0.0,0,0,0,244.0,110.0,73.0,21.84,88.0,67.0,0,0
1,2025,0,57,0,0.0,0.0,0,1,0,207.0,175.0,80.0,20.86,83.0,75.0,1,0
2,1477,0,49,0,0.0,0.0,0,0,0,290.0,137.5,92.0,24.46,80.0,74.0,0,0
3,603,0,61,1,20.0,0.0,0,1,0,245.0,140.0,73.0,30.74,90.0,91.0,1,0
4,2380,1,59,1,30.0,0.0,0,0,0,235.0,136.0,96.0,28.61,54.0,85.0,0,0


#### Accuracy

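Our basic model gives an accuracy of 84.6%. This looks good at first glance, but the classes are imbalanced (the sensitivity and specificity below show that the model labels almost every patient as negative), so accuracy alone overstates how useful the predictions are.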

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.8463587921847247
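
Accuracy is easier to judge next to a trivial baseline that always predicts the majority class. The class counts used below (952 negatives and 174 positives in the test set) are not printed by the notebook; they are inferred from the sensitivity and specificity outputs reported further down, so treat them as an assumption.

```python
import numpy as np

# Test-set class counts inferred from the reported outputs (assumption)
n_negative, n_positive = 952, 174
y_true = np.array([0] * n_negative + [1] * n_positive)

# A baseline that predicts "no CHD" for every patient
y_baseline = np.zeros_like(y_true)

baseline_accuracy = (y_true == y_baseline).mean()
print(round(baseline_accuracy, 4))
```

Under these assumed counts the baseline reaches about 84.5%, so the model's 84.6% is only marginally better than always predicting the negative class.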

#### Sensitivity
Sensitivity = $TPR=\frac{TP}{P}$, where $TP$ is the number of true positives and $P$ is the count of all positives.

In [None]:
P = sum(ytest == 1)
TP = sum((ytest == 1) & (y_model == 1))
TP/P

0.022988505747126436

#### Specificity
Specificity = $TNR=\frac{TN}{N}$, where $TN$ is the number of true negatives and $N$ is the count of all negatives.

In [None]:
N = sum(ytest == 0)
TN = sum((ytest == 0) & (y_model == 0))
TN/N

0.9968487394957983
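
Both rates can be read off the four confusion-matrix counts. The helper below is a small sketch (the name `sensitivity_specificity` is illustrative and not part of the notebook) that computes them with NumPy for 0/1 labels; the toy labels are made up.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels (hypothetical): 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

In the notebook this could be called as `sensitivity_specificity(ytest, y_model)`.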

In [None]:
# Calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, y_model)

0.15364120781527532
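
For 0/1 labels, the absolute error of a prediction is 0 when it is right and 1 when it is wrong, so the mean absolute error is simply the misclassification rate, that is, 1 - accuracy (here 1 - 0.8464 ≈ 0.1536). A quick check on toy labels (hypothetical values):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 1])

accuracy = (y_true == y_pred).mean()
mae = np.abs(y_true - y_pred).mean()

# For binary 0/1 labels the two quantities always sum to 1
print(accuracy, mae, accuracy + mae)
```

Because of this, the mean absolute error carries no information beyond accuracy in this setting.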

## Performance Improvement

### Feature Engineering

First, have a look at the correlation matrix between the predictor variables and the target variable.

In [None]:
df.corr()

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
male,1.0,-0.024142,0.203215,0.325886,-0.052355,-0.002513,0.003588,0.011834,-0.067595,-0.044515,0.05389,0.074868,-0.115444,0.00313,0.096
age,-0.024142,1.0,-0.211338,-0.188804,0.13169,0.050018,0.305292,0.109321,0.261443,0.38828,0.205191,0.136428,-0.005893,0.118529,0.231584
currentSmoker,0.203215,-0.211338,1.0,0.773259,-0.051816,-0.037573,-0.105258,-0.045308,-0.049945,-0.133154,-0.114118,-0.165404,0.054924,-0.054078,0.021709
cigsPerDay,0.325886,-0.188804,0.773259,1.0,-0.046625,-0.035713,-0.06747,-0.039436,-0.0306,-0.09231,-0.056202,-0.090525,0.067194,-0.05509,0.05592
BPMeds,-0.052355,0.13169,-0.051816,-0.046625,1.0,0.111601,0.262955,0.056337,0.089625,0.269479,0.199282,0.105128,0.010228,0.052464,0.08474
prevalentStroke,-0.002513,0.050018,-0.037573,-0.035713,0.111601,1.0,0.065169,0.009423,0.012297,0.060421,0.055189,0.035568,-0.016673,0.015789,0.047684
prevalentHyp,0.003588,0.305292,-0.105258,-0.06747,0.262955,0.065169,1.0,0.08203,0.164645,0.697849,0.616753,0.303411,0.142013,0.085776,0.178615
diabetes,0.011834,0.109321,-0.045308,-0.039436,0.056337,0.009423,0.08203,1.0,0.047453,0.104393,0.051761,0.093098,0.06337,0.616087,0.093222
totChol,-0.067595,0.261443,-0.049945,-0.0306,0.089625,0.012297,0.164645,0.047453,1.0,0.216375,0.169811,0.119651,0.094795,0.046902,0.089613
sysBP,-0.044515,0.38828,-0.133154,-0.09231,0.269479,0.060421,0.697849,0.104393,0.216375,1.0,0.785853,0.330484,0.181381,0.132878,0.220108


According to the correlation matrix, some variables have a low correlation with the target `TenYearCHD`; they are listed below. In the following steps, new variables are derived from some of the measurements to try to strengthen the relationship with the target and thus improve the performance of the model.
* **male**: sex of patient
* **currentSmoker**: Smoking behavior
* **cigsPerDay**: Smoking behavior
* **BPMeds**: On blood pressure medication at time of first examination
* **prevalentStroke**: Previously had a stroke
* **diabetes**: Currently has diabetes
* **totChol**: Total cholesterol (mg/dL)
* **BMI**: Body Mass Index, weight (kg)/height (m)²
* **heartRate**: Heart rate (beats/minute)
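
Since the interest is in each predictor's relationship with the target, it can help to look only at the `TenYearCHD` column of the correlation matrix, sorted by absolute value. A minimal sketch on a toy frame (the column names `strong` and `weak` and all values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "strong": [1, 2, 3, 4],      # rises with the target
    "weak":   [1, 0, 0, 1],      # unrelated to the target
    "TenYearCHD": [0, 0, 1, 1],
})

# Correlation of every predictor with the target only
corr_with_target = toy.corr()["TenYearCHD"].drop("TenYearCHD")

# Rank predictors by the strength of their linear relationship
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook the same idea would be `df.corr()['TenYearCHD']`. Pearson correlation only captures linear association, so a low value does not by itself mean that a variable is uninformative.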

In [None]:
df_eng = df.copy()

#### Create BMI Level variable
Source: [American Heart Association - BMI level](https://www.heart.org/en/healthy-living/healthy-eating/losing-weight/bmi-in-adults)

According to the American Heart Association's classification of BMI, I will bin the `BMI` values and generate a new variable: BMI_level. The cut-offs below are the ones used in the code.
* Underweight: BMI <= 18.5 kg/m²
* Healthy: 18.5 kg/m² < BMI < 25 kg/m²
* Overweight: 25 kg/m² <= BMI < 30 kg/m²
* Obesity: 30 kg/m² <= BMI < 40 kg/m²
* Extreme obesity: BMI >= 40 kg/m²

In [None]:
import numpy as np

# create a list of our conditions
conditions = [
    (df_eng['BMI'] <= 18.5),
    (df_eng['BMI'] > 18.5) & (df_eng['BMI'] < 25),
    (df_eng['BMI'] >= 25) & (df_eng['BMI'] < 30),
    (df_eng['BMI'] >= 30) & (df_eng['BMI'] < 40),
    (df_eng['BMI'] >= 40)
    ]

# create a list of the values we want to assign for each condition
values = ['Underweight', 'Healthy', 'Overweight', 'Obesity', 'Extreme obesity']

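# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default avoids mixing the str labels with np.select's integer default of 0)
df_eng['BMI_level'] = np.select(conditions, values, default='Unknown')

# display the original DataFrame for reference (the new column is stored in df_eng)
df.head()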

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
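
An alternative to listing the conditions by hand is `pd.cut`, which bins a numeric column by a list of edges. The sketch below uses left-closed bins (`right=False`), so it matches the conditions in the cell above except for a BMI of exactly 18.5, which here falls into 'Healthy' rather than 'Underweight'. The sample BMI values are illustrative.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 23.1, 26.97, 30.0, 45.0])

edges = [-np.inf, 18.5, 25, 30, 40, np.inf]
labels = ['Underweight', 'Healthy', 'Overweight', 'Obesity', 'Extreme obesity']

# right=False makes each bin include its left edge: [18.5, 25), [25, 30), ...
bmi_level = pd.cut(bmi, bins=edges, labels=labels, right=False)
print(bmi_level.tolist())
```

Keeping the edges in one list makes it harder to leave gaps or overlaps between neighbouring categories.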


#### Create totChol Level variable
Source: [Medical News Today - Cholesterol range](https://www.medicalnewstoday.com/articles/315900)

According to the classification of total cholesterol by Medical News Today, I will bin the `totChol` values and generate a new variable: totChol_level.
* Desirable: totChol <= 200 mg/dL
* Borderline High: 200 mg/dL < totChol < 240 mg/dL
* High: totChol >= 240 mg/dL

In [None]:
import numpy as np

# create a list of our conditions
conditions = [
    (df_eng['totChol'] <= 200),
    (df_eng['totChol'] > 200) & (df_eng['totChol'] < 240),
    (df_eng['totChol'] >= 240)
    ]

# create a list of the values we want to assign for each condition
values = ['Desirable', 'Borderline High', 'High']

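# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default avoids mixing the str labels with np.select's integer default of 0)
df_eng['totChol_level'] = np.select(conditions, values, default='Unknown')

# display the original DataFrame for reference (the new column is stored in df_eng)
df.head()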

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


#### Create BP (Blood Pressure) Level variable
Source: [American Heart Association - Blood Pressure range](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings)

According to the classification of Blood Pressure by American Heart Association, I will classify these variables and generate a new variable: BP_level.
* Normal: sysBP <= 120 mmHg && diaBP <= 80 mmHg
* Elevated: 120 mmHg < sysBP < 130 mmHg && diaBP <= 80 mmHg
* High blood pressure 1: 130 mmHg <= sysBP < 140 mmHg || 80 mmHg < diaBP < 90 mmHg
* High blood pressure 2: sysBP >= 140 mmHg || diaBP >= 90 mmHg
* Hypertensive Crisis: sysBP >= 180 mmHg && diaBP >= 120 mmHg

In [None]:
import numpy as np

# create a list of our conditions
conditions = [
    (df_eng['sysBP'] <= 120) & (df_eng['diaBP'] <= 80),
    ((df_eng['sysBP'] > 120) & (df_eng['sysBP'] < 130)) & (df_eng['diaBP'] <= 80),
    ((df_eng['sysBP'] >= 130) & (df_eng['sysBP'] < 140)) | ((df_eng['diaBP'] > 80) & (df_eng['diaBP'] <= 90)),
    (df_eng['sysBP'] >= 140) & (df_eng['diaBP'] >= 90),
    (df_eng['sysBP'] >= 180) & (df_eng['diaBP'] >= 120)
    ]

# create a list of the values we want to assign for each condition
values = ['Normal', 'Elevated', 'High blood pressure 1', 'High blood pressure 2', 'Hypertensive Crisis']

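# create a new column and use np.select to assign values to it using our lists as arguments
# rows matching no condition get the string label '0' (the BP_level_0 column after encoding)
df_eng['BP_level'] = np.select(conditions, values, default='0')

# display the original DataFrame for reference (the new column is stored in df_eng)
df.head()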

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
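
Two details of the cell above differ from the bullet list: 'High blood pressure 2' is coded with `&` although the list uses 'or', and 'Hypertensive Crisis' is tested last, so it can never be selected because `np.select` takes the first matching condition. As a result, rows such as sysBP = 175, diaBP = 80 match no condition and receive the default label `0` (the `BP_level_0` column in the outputs below). The results reported in this notebook were produced with the cell as written. The sketch below is one possible way to follow the list more closely; the function name `bp_level`, the `'Unclassified'` default and the sample readings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def bp_level(sys_bp, dia_bp):
    """Label blood pressure readings; the most severe category is tested first."""
    conditions = [
        (sys_bp >= 180) & (dia_bp >= 120),
        (sys_bp >= 140) | (dia_bp >= 90),
        ((sys_bp >= 130) & (sys_bp < 140)) | ((dia_bp > 80) & (dia_bp < 90)),
        (sys_bp > 120) & (sys_bp < 130) & (dia_bp <= 80),
        (sys_bp <= 120) & (dia_bp <= 80),
    ]
    values = ['Hypertensive Crisis', 'High blood pressure 2',
              'High blood pressure 1', 'Elevated', 'Normal']
    # A string default keeps the result dtype consistent with the labels
    return np.select(conditions, values, default='Unclassified')

# Illustrative readings (sysBP, diaBP)
readings = pd.DataFrame({
    'sysBP': [106.0, 127.5, 130.0, 175.0, 190.0],
    'diaBP': [70.0, 80.0, 84.0, 80.0, 125.0],
})
readings['BP_level'] = bp_level(readings['sysBP'], readings['diaBP'])
print(readings)
```

With this ordering every reading without missing values falls into exactly one category, so the default label is never used. Switching to it would change the engineered features and therefore the metrics reported later.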


#### Preprocessing - categorical variables

After adding new variables, there are some categorical variables, which we will handle next.

In [None]:
df_eng = pd.get_dummies(df_eng, columns=['BMI_level', 'totChol_level', 'BP_level'])
df_eng.head()

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD,BMI_level_Extreme obesity,BMI_level_Healthy,BMI_level_Obesity,BMI_level_Overweight,BMI_level_Underweight,totChol_level_Borderline High,totChol_level_Desirable,totChol_level_High,BP_level_0,BP_level_Elevated,BP_level_High blood pressure 1,BP_level_High blood pressure 2,BP_level_Normal
0,1,39,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0,0,0,0,1,0,0,1,0,0,0,0,0,1
1,0,46,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0,0,0,0,1,0,0,0,1,0,0,1,0,0
2,1,48,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0,0,0,0,1,0,0,0,1,0,1,0,0,0
3,0,61,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1,0,0,0,1,0,1,0,0,0,0,0,1,0
4,0,46,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0,0,1,0,0,0,0,0,1,0,0,1,0,0
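
`pd.get_dummies` replaces each listed column with one indicator column per category, named `<column>_<category>`. A small sketch on toy labels (hypothetical values); recent pandas versions (2.0 and later) return boolean indicators by default, so `dtype=int` is passed here to obtain 0/1 values like those shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 46, 61],
    'BMI_level': ['Overweight', 'Healthy', 'Overweight'],
})

# One indicator column per category; dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=['BMI_level'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The indicator columns of one variable always sum to 1 per row, so they are perfectly collinear; passing `drop_first=True` is a common way to drop one of them if that matters for the model.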


Now let's see if the correlation matrix is improved after adding new variables.

In [None]:
df_eng.corr()

Unnamed: 0,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD,BMI_level_Extreme obesity,BMI_level_Healthy,BMI_level_Obesity,BMI_level_Overweight,BMI_level_Underweight,totChol_level_Borderline High,totChol_level_Desirable,totChol_level_High,BP_level_0,BP_level_Elevated,BP_level_High blood pressure 1,BP_level_High blood pressure 2,BP_level_Normal
male,1.0,-0.024142,0.203215,0.325886,-0.052355,-0.002513,0.003588,0.011834,-0.067595,-0.044515,0.05389,0.074868,-0.115444,0.00313,0.096,-0.047733,-0.14134,-0.023459,0.177342,-0.052671,0.076178,-0.003725,-0.069445,-0.011863,-0.014136,0.055701,0.002367,-0.048519
age,-0.024142,1.0,-0.211338,-0.188804,0.13169,0.050018,0.305292,0.109321,0.261443,0.38828,0.205191,0.136428,-0.005893,0.118529,0.231584,0.036418,-0.129391,0.061335,0.090769,-0.028667,-0.042708,-0.233309,0.231664,0.075732,-0.00837,0.041839,0.208924,-0.25169
currentSmoker,0.203215,-0.211338,1.0,0.773259,-0.051816,-0.037573,-0.105258,-0.045308,-0.049945,-0.133154,-0.114118,-0.165404,0.054924,-0.054078,0.021709,-0.033141,0.141454,-0.109368,-0.074334,0.039901,-0.010968,0.052551,-0.032588,-0.048494,0.021995,-0.040702,-0.070125,0.110724
cigsPerDay,0.325886,-0.188804,0.773259,1.0,-0.046625,-0.035713,-0.06747,-0.039436,-0.0306,-0.09231,-0.056202,-0.090525,0.067194,-0.05509,0.05592,-0.029629,0.068485,-0.059399,-0.029519,0.018828,0.007042,0.012859,-0.01723,-0.039389,0.013294,-0.008518,-0.045166,0.056012
BPMeds,-0.052355,0.13169,-0.051816,-0.046625,1.0,0.111601,0.262955,0.056337,0.089625,0.269479,0.199282,0.105128,0.010228,0.052464,0.08474,0.027078,-0.069744,0.072811,0.01959,-0.007036,-0.021939,-0.053178,0.064417,-0.018843,-0.053171,-0.015869,0.195074,-0.106779
prevalentStroke,-0.002513,0.050018,-0.037573,-0.035713,0.111601,1.0,0.065169,0.009423,0.012297,0.060421,0.055189,0.035568,-0.016673,0.015789,0.047684,0.041029,-0.001911,0.016128,-0.013041,-0.008721,0.006588,-0.029924,0.01823,0.003838,-0.012736,-0.025315,0.076081,-0.030795
prevalentHyp,0.003588,0.305292,-0.105258,-0.06747,0.262955,0.065169,1.0,0.08203,0.164645,0.697849,0.616753,0.303411,0.142013,0.085776,0.178615,0.091473,-0.221618,0.205073,0.085917,-0.053168,-0.048699,-0.10989,0.136316,0.152896,-0.198846,-0.023742,0.583192,-0.407692
diabetes,0.011834,0.109321,-0.045308,-0.039436,0.056337,0.009423,0.08203,1.0,0.047453,0.104393,0.051761,0.093098,0.06337,0.616087,0.093222,0.073022,-0.062735,0.058954,0.010807,0.009153,-0.005595,-0.02184,0.023207,0.001191,-0.02759,0.009455,0.061882,-0.045656
totChol,-0.067595,0.261443,-0.049945,-0.0306,0.089625,0.012297,0.164645,0.047453,1.0,0.216375,0.169811,0.119651,0.094795,0.046902,0.089613,0.001141,-0.109294,0.043539,0.093927,-0.055263,-0.261227,-0.646792,0.778156,0.054247,-0.035072,0.066269,0.119165,-0.17454
sysBP,-0.044515,0.38828,-0.133154,-0.09231,0.269479,0.060421,0.697849,0.104393,0.216375,1.0,0.785853,0.330484,0.181381,0.132878,0.220108,0.13556,-0.244246,0.184968,0.118544,-0.067853,-0.052221,-0.159823,0.18055,0.095921,-0.117098,0.069432,0.674626,-0.617992


In [None]:
# Create a feature matrix without `TenYearCHD` and call it `X`:
X_eng = df_eng.drop('TenYearCHD', axis=1)
# Create a target vector with `TenYearCHD` and call it `y_eng`:
y_eng = df_eng['TenYearCHD']
# Split dataset randomly by using sklearn package
from sklearn.model_selection import train_test_split
Xtrain_eng, Xtest_eng, ytrain_eng, ytest_eng = train_test_split(X_eng, y_eng, test_size=0.3, random_state=780)
# Set "liblinear" as solver for logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
# Fit model to the training data
model.fit(Xtrain_eng, ytrain_eng)
# predict on test data and store the results as y_model
y_model_eng = model.predict(Xtest_eng)
# Add predicted results to the test dataset
test_eng = Xtest_eng.join(ytest_eng).reset_index()
test_eng.join(pd.Series(y_model_eng, name='predicted')).head()

Unnamed: 0,index,male,age,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,BMI_level_Extreme obesity,BMI_level_Healthy,BMI_level_Obesity,BMI_level_Overweight,BMI_level_Underweight,totChol_level_Borderline High,totChol_level_Desirable,totChol_level_High,BP_level_0,BP_level_Elevated,BP_level_High blood pressure 1,BP_level_High blood pressure 2,BP_level_Normal,TenYearCHD,predicted
0,1290,0,40,1,15.0,0.0,0,0,0,244.0,110.0,73.0,21.84,88.0,67.0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0
1,2025,0,57,0,0.0,0.0,0,1,0,207.0,175.0,80.0,20.86,83.0,75.0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0
2,1477,0,49,0,0.0,0.0,0,0,0,290.0,137.5,92.0,24.46,80.0,74.0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0
3,603,0,61,1,20.0,0.0,0,1,0,245.0,140.0,73.0,30.74,90.0,91.0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0
4,2380,1,59,1,30.0,0.0,0,0,0,235.0,136.0,96.0,28.61,54.0,85.0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0


#### Improved model's accuracy, sensitivity and specificity


By adding the new variables, accuracy rose from 84.6% to 85.2%, sensitivity from 2.3% to 5.2% and specificity from 99.7% to 99.8%, while the mean absolute error fell from 0.154 to 0.148. These are modest gains, and sensitivity remains very low.

In [None]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest_eng, y_model_eng)

0.8516873889875666

In [None]:
# Calculate sensitivity
P = sum(ytest_eng == 1)
TP = sum((ytest_eng == 1) & (y_model_eng == 1))
TP/P

0.05172413793103448

In [None]:
# Calculate specificity
N = sum(ytest_eng == 0)
TN = sum((ytest_eng == 0) & (y_model_eng == 0))
TN/N

0.9978991596638656

In [None]:
# Calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest_eng, y_model_eng)

0.1483126110124334


I also tried deleting `BMI`, `totChol`, `sysBP` and `diaBP`, because the variables we created are derived from them. However, I found that with these variables deleted the gain was more limited, with accuracy and sensitivity increasing by 0.5% and 2%, respectively.

### Try different model: Random Forest

#### Original dataset: df

First, the random forest model was used on the dataset with no new variables added.

For reference, the logistic regression performance on this dataset was:
* Accuracy: 0.8463587921847247
* Sensitivity: 0.022988505747126436
* Specificity: 0.9968487394957983
* Mean absolute error: 0.15364120781527532


In [None]:
# Run Random Forest model
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(Xtrain, ytrain)
y_model = forest.predict(Xtest)

In [None]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.8481349911190054

In [None]:
# Calculate sensitivity
P = sum(ytest == 1)
TP = sum((ytest == 1) & (y_model == 1))
TP/P

0.07471264367816093

In [None]:
# Calculate specificity
N = sum(ytest == 0)
TN = sum((ytest == 0) & (y_model == 0))
TN/N

0.9894957983193278

In [None]:
# Calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, y_model)

0.15186500888099466

Compared with the previous logistic regression results, accuracy increased by about 0.2 percentage points and sensitivity by about 5 points (from 2.3% to 7.5%), while specificity decreased by about 0.7 points. Mean absolute error decreased by about 0.2 points, mirroring the gain in accuracy.

Overall this is an improvement, mainly because sensitivity roughly tripled, although it remains low in absolute terms.

#### Improved dataset: df_eng

Second, the random forest model was used on the dataset with the new variables added.

For reference, the logistic regression performance on this dataset was:
* Accuracy: 0.8516873889875666
* Sensitivity: 0.05172413793103448
* Specificity: 0.9978991596638656
* Mean absolute error: 0.1483126110124334


In [None]:
# Run Random Forest model
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(Xtrain_eng, ytrain_eng)
y_model_eng = forest.predict(Xtest_eng)

In [None]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytest_eng, y_model_eng)

0.8374777975133215

In [None]:
# Calculate sensitivity
P = sum(ytest_eng == 1)
TP = sum((ytest_eng == 1) & (y_model_eng == 1))
TP/P

0.05747126436781609

In [None]:
# Calculate specificity
N = sum(ytest_eng == 0)
TN = sum((ytest_eng == 0) & (y_model_eng == 0))
TN/N

0.9800420168067226

In [None]:
# Calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest_eng, y_model_eng)

0.1625222024866785

It seems that random forest is not a suitable model for the new data. Compared with logistic regression on the same dataset, accuracy and specificity decreased and the mean absolute error increased; only sensitivity rose marginally (from 5.2% to 5.7%).

## Conclusion

By comparing the models, I found that the benefit of a given dataset varies from model to model.

For example, in logistic regression, adding new variables improved performance. However, the random forest model performed worse on the dataset with the added variables than on the dataset without them.

Therefore, adding or removing variables does not necessarily improve model performance; the effect is relative to the model used. Each dataset has models that suit it better than others. When selecting models and datasets, performance is one of the factors to consider, and the number of variables should also be considered to avoid overfitting.
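
The four experiments can be placed side by side to make the comparison concrete. The numbers below are the outputs reported earlier in this notebook, rounded to four decimals; the table layout and row labels are my own.

```python
import pandas as pd

# Metrics as reported above (rounded to four decimals)
results = pd.DataFrame(
    {
        'accuracy':    [0.8464, 0.8517, 0.8481, 0.8375],
        'sensitivity': [0.0230, 0.0517, 0.0747, 0.0575],
        'specificity': [0.9968, 0.9979, 0.9895, 0.9800],
    },
    index=['LogReg / df', 'LogReg / df_eng',
           'RandomForest / df', 'RandomForest / df_eng'],
)

# Change relative to the baseline logistic regression, in percentage points
delta_pp = ((results - results.loc['LogReg / df']) * 100).round(2)
print(delta_pp)
```

Read this way, the largest sensitivity gain comes from the random forest on the original dataset, while the random forest on the engineered dataset has the lowest accuracy of the four.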