## CSE5ML Lab 2 Part 2: Machine Learning with scikit-learn for Classification

In Part 1, we learned how to use several ML models from the scikit-learn package on a regression task, together with some data preprocessing procedures. This week, we review those preprocessing procedures and apply logistic regression and a support vector machine (SVM) to a classification task.

Task

This dataset was collected from the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation. It contains information on 303 patients, with 14 attributes (13 input variables and 1 target variable).

We are using this dataset to build a machine learning model that predicts whether a patient has heart disease. The detailed information for each variable is as follows:
1. age: age in years
2. sex (male and female)
3. chest pain type
4. resting blood pressure (in mm Hg on admission to the hospital)
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (true and false)
7. resting electrocardiographic results
<br>   -- Value 0: normal
<br>   -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
<br>   -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
<br>   -- Value 1: upsloping
<br>   -- Value 2: flat
<br>   -- Value 3: downsloping
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
<br>   -- Value 0: absence
<br>   -- Value 1: presence

More information about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

### Load the dataset
Use pandas to load the csv file "heart_disease.csv" provided on LMS, then check the dataset length and print the first 5 rows of the dataset.

In [1]:
import pandas as pd

dataset = pd.read_csv("heart_disease.csv")
print("dataset length:", len(dataset))
dataset.head()

dataset length: 303


| | Age | Sex | Chest Pain Type | Resting Blood Pressure | Serum Cholestoral | Fasting Blood Sugar | Resting electrocardiographic results | Maximum heart rate achieved | Exercise induced angina | ST depression | the slope | Number of major vessels | thal | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | True | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | male | asymptomatic | 160 | 286 | False | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 1 |
| 2 | 67 | male | asymptomatic | 120 | 229 | False | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | male | non-anginal pain | 130 | 250 | False | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | female | atypical angina | 130 | 204 | False | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |


### Preprocess the dataset
##### Check if there is any missing value in the dataset

In [2]:
dataset.isna().sum()

Age                                     0
Sex                                     0
Chest Pain Type                         0
Resting Blood Pressure                  0
Serum Cholestoral                       0
Fasting Blood Sugar                     0
Resting electrocardiographic results    0
Maximum heart rate achieved             0
Exercise induced angina                 0
ST depression                           0
the slope                               0
Number of major vessels                 4
thal                                    2
Diagnosis                               0
dtype: int64

##### Drop the rows that have missing values

In [3]:
dataset = dataset.dropna()
print("dataset length:", len(dataset))

dataset length: 297
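
To see what `isna().sum()` and `dropna()` do on a small scale, here is a minimal sketch on a made-up three-row frame. The column names `vessels` and `thal` and all values are illustrative assumptions, not rows taken from `heart_disease.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two rows contain a missing entry
toy = pd.DataFrame({
    "vessels": [0.0, np.nan, 2.0],
    "thal": [3.0, 7.0, np.nan],
})

missing_per_column = toy.isna().sum()                   # NaN count per column
rows_with_missing = int(toy.isna().any(axis=1).sum())   # rows with at least one NaN
toy_clean = toy.dropna()                                # keep only complete rows

print(missing_per_column)
print("rows dropped:", rows_with_missing, "rows kept:", len(toy_clean))
```

In the lab data the 4 + 2 missing values account for exactly the 6 rows removed (303 to 297), so no row is missing both values.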


##### Check variable data types

In [4]:
dataset.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                 float64
thal                                    float64
Diagnosis                                 int64
dtype: object

We found that Number of major vessels and thal should be integers but are stored as floats, so we convert them to an integer type.

In [5]:
cols = ['Number of major vessels', 'thal']
dataset[cols] = dataset[cols].astype(int)

In [6]:
# check again
dataset.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                   int32
thal                                      int32
Diagnosis                                 int64
dtype: object

We can see that these two variables are now properly converted.
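
The float dtype is most likely a side effect of the missing values: NumPy integers cannot hold `NaN`, so pandas stores such a column as `float64`. A small sketch with a made-up column (the name `vessels` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# A column of whole numbers with one missing entry is stored as float64
col = pd.Series([0, 3, np.nan, 2], name="vessels")
dtype_before = str(col.dtype)

# After dropping the NaN every remaining value is whole, so the cast is lossless
col_int = col.dropna().astype(int)
dtype_after = str(col_int.dtype)   # int64 or int32, depending on the platform

print(dtype_before, "->", dtype_after)
```

The exact integer width (`int32` in the output above) depends on the platform and the NumPy version.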

##### Check whether there are any duplicated rows in the dataset

In [7]:
dataset.duplicated().any()

False

##### Check value counts for the categorical variables

In [8]:
print(dataset["Sex"].value_counts(),"\n")
print(dataset["Chest Pain Type"].value_counts(),"\n")
print(dataset["Fasting Blood Sugar"].value_counts(), "\n")

male      201
female     96
Name: Sex, dtype: int64 

asymptomatic        142
non-anginal pain     83
atypical angina      49
typical angina       23
Name: Chest Pain Type, dtype: int64 

False    254
True      43
Name: Fasting Blood Sugar, dtype: int64 



##### Deal with categorical variables

Since both Sex and Fasting Blood Sugar are binary variables, we can use 0 and 1 to replace their values.

For example, for variable Sex:
<br> 1 = male; 0 = female

For variable Fasting Blood Sugar:
<br> 1 = True; 0 = False

In addition, based on a domain expert's advice, we can use the following rule to transform the categorical variable Chest Pain Type:
<br>-- Value 1: typical angina
<br>-- Value 2: atypical angina
<br>-- Value 3: non-anginal pain
<br>-- Value 4: asymptomatic

In [9]:
dataset['Sex'] = dataset['Sex'].replace({'male': 1, 'female': 0})
# note Fasting Blood Sugar is a boolean variable with True and False
dataset['Fasting Blood Sugar'] = dataset['Fasting Blood Sugar'].replace({True: 1, False: 0})
dataset['Chest Pain Type'] = dataset['Chest Pain Type'].replace({'typical angina': 1, 'atypical angina': 2, 'non-anginal pain': 3, 'asymptomatic': 4})
dataset.head()

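| | Age | Sex | Chest Pain Type | Resting Blood Pressure | Serum Cholestoral | Fasting Blood Sugar | Resting electrocardiographic results | Maximum heart rate achieved | Exercise induced angina | ST depression | the slope | Number of major vessels | thal | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |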
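
As an alternative sketch, the same encoding can be written with `Series.map`, which turns any category missing from the dictionary into `NaN` and so makes misspelled labels easy to spot (`replace` would leave them unchanged). The toy column below is made up, including its deliberately misspelled last entry; only the mapping comes from the rule above.

```python
import pandas as pd

chest_pain_codes = {
    "typical angina": 1,
    "atypical angina": 2,
    "non-anginal pain": 3,
    "asymptomatic": 4,
}

# Toy column; the last entry is a deliberately misspelled label
toy = pd.Series(["asymptomatic", "typical angina", "non-anginal pain", "asymptomatik"])

encoded = toy.map(chest_pain_codes)       # unmapped labels become NaN
unmapped = toy[encoded.isna()].tolist()   # labels the dictionary does not cover

print(encoded.tolist())
print("unmapped labels:", unmapped)
```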


##### Check dataset shape

In [10]:
dataset.shape

(297, 14)

##### Define the input variables and the target variable
The target variable is the last column, Diagnosis, and the input variables are the remaining columns.

In [11]:
array = dataset.values
X = array[:,0:-1]
y = array[:,-1]
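
Slicing by position relies on `Diagnosis` being the last column. A sketch of an equivalent selection by column name, shown on a made-up frame with two input columns (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [63, 67, 41],
    "Sex": [1, 1, 0],
    "Diagnosis": [0, 1, 0],
})

# Select the target by name and treat every other column as an input
X_toy = toy.drop(columns=["Diagnosis"]).values
y_toy = toy["Diagnosis"].values

print(X_toy.shape, y_toy.shape)
```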

### Split the dataset and normalize data

##### Split the training and testing dataset
Use 10% of the dataset for testing, with a random state of 1.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
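
A sketch of what this split produces in terms of sizes, using stand-in arrays with the same number of rows as the cleaned dataset. The feature values and the class balance are made up, and the `stratify` argument is an optional extra that this lab does not use; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 297 rows; values and label balance are made up
X_demo = np.arange(297 * 2).reshape(297, 2)
y_demo = np.array([0] * 150 + [1] * 147)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1, stratify=y_demo
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("share of class 1 in train/test:", y_tr.mean(), y_te.mean())
```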

##### Apply normalization to both the training and testing datasets

In [13]:
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)
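
`MinMaxScaler` with its default range rescales each feature as `(x - min) / (max - min)`, where `min` and `max` come from the data it was fitted on. Because the scaler is fitted on the training data only, test values can fall slightly outside `[0, 1]`. A minimal sketch with one made-up feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature: training values span 100..200
train = np.array([[100.0], [150.0], [200.0]])
test = np.array([[90.0], [175.0]])

scaler = MinMaxScaler().fit(train)   # learns min=100 and max=200 from train only
scaled_test = scaler.transform(test)

# The same computation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled_test.ravel())   # the first value is below 0 because 90 < training min
```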

### Train logistic regression and SVM classifiers on the entire training dataset, then evaluate them on the testing dataset
Be aware that the default evaluation metric returned by `score` depends on the kind of model: for regression models it is R squared, while for classification models it is accuracy.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# logistic regression model, parameters can be changed
model = LogisticRegression(solver="liblinear")
model.fit(X_train_norm, y_train)
test_score = model.score(X_test_norm, y_test)
print("Testing Accuracy of LR:", test_score)

# Support Vector Machine for classification, parameters can be changed
model = SVC()
model.fit(X_train_norm, y_train)
test_score = model.score(X_test_norm, y_test)
print("Testing Accuracy of SVC:", test_score)

Testing Accuracy of LR: 0.9666666666666667
Testing Accuracy of SVC: 0.8666666666666667
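
For classifiers, `score` returns the fraction of test samples that are labelled correctly. With 297 rows and `test_size=0.1` the test set holds 30 samples (scikit-learn rounds a fractional test size up), so the accuracies above can be read as counts. A small arithmetic sketch:

```python
import math

n_samples = 297
n_test = math.ceil(0.1 * n_samples)   # a fractional test size is rounded up

reported = {"LR": 0.9666666666666667, "SVC": 0.8666666666666667}

# Number of correct predictions implied by each reported accuracy
correct_counts = {name: round(acc * n_test) for name, acc in reported.items()}

for name, correct in correct_counts.items():
    print(f"{name}: {correct} of {n_test} test samples correct")
```

With only 30 test samples, a single prediction shifts the accuracy by about 3.3 percentage points, so small gaps between models should be read with caution.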


### Train a model with 5-fold cross validation

##### Define a 5-fold cross validation with data shuffling and set the random state to 2

In [15]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=2)  # 5-fold cross validation; shuffle the data first with random seed 2

##### Run the 5-fold cross validation for both models and print the average accuracy score based on the cross validation results

In [16]:
from sklearn.model_selection import cross_val_score

model = LogisticRegression(solver="liblinear")
results = cross_val_score(model, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of LR:",results.mean())

model = SVC()
results = cross_val_score(model, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of SVM:",results.mean())

Average Accuracy of LR: 0.8464011180992314
Average Accuracy of SVM: 0.8089447938504544


In [17]:
# per-fold accuracies of the model evaluated last (SVC)
results

array([0.75925926, 0.87037037, 0.88679245, 0.69811321, 0.83018868])
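
These five numbers are the SVC's per-fold accuracies, since `results` was last assigned in the SVC run. With 267 training rows split into 5 folds, each score corresponds to a whole number of correct predictions on a validation fold of 53 or 54 samples. A small arithmetic sketch:

```python
n_train = 267   # 297 rows minus the 30 held out for testing
n_splits = 5

# KFold gives the first (n_train % n_splits) folds one extra sample
fold_sizes = [n_train // n_splits + (1 if i < n_train % n_splits else 0)
              for i in range(n_splits)]

fold_scores = [0.75925926, 0.87037037, 0.88679245, 0.69811321, 0.83018868]
correct = [round(s * n) for s, n in zip(fold_scores, fold_sizes)]
mean_score = sum(fold_scores) / n_splits

print(fold_sizes)    # [54, 54, 53, 53, 53]
print(correct)       # [41, 47, 47, 37, 44]
print(mean_score)    # matches the reported average for SVM up to rounding
```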

### Optimize the Logistic Regression model with cross validation
The parameters that can be tuned are listed here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. You can add more parameters and values to `grid_params_lr`.

In [18]:
# fine tune parameters for lr model
from sklearn.model_selection import GridSearchCV

grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter=150)
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold).fit(X_train_norm, y_train)
print(gs_lr_result.best_score_)

0.8464011180992314
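
To get a feel for how much work the grid search does, the candidate settings can be enumerated with `ParameterGrid`. This is a sketch for inspection only and is not needed to run the search:

```python
from sklearn.model_selection import ParameterGrid

grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear'],
}

candidates = list(ParameterGrid(grid_params_lr))   # every combination of the values
n_fits = len(candidates) * 5   # 5 folds per candidate, before the final refit

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

The best score printed above equals the earlier cross validation average for logistic regression, which suggests that none of the other candidates beat the setting already used there on these folds.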


### Evaluate the trained Logistic Regression model using the testing dataset

In [19]:
test_accuracy = gs_lr_result.best_estimator_.score(X_test_norm, y_test)
print("Accuracy in testing:", test_accuracy)

Accuracy in testing: 0.9666666666666667


Check the parameter setting for the best selected model

In [20]:
gs_lr_result.best_params_

{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}

In [21]:
# predict with the first 5 data points
y_predict = gs_lr_result.best_estimator_.predict(X_test_norm[:5]) 
print(y_predict)

[0. 1. 0. 0. 1.]


In [22]:
# predict the probability of class 0 and class 1 for the first 5 data points
y_predict_probability = gs_lr_result.best_estimator_.predict_proba(X_test_norm[:5]) 
print(y_predict_probability)

[[0.84249239 0.15750761]
 [0.12678442 0.87321558]
 [0.53024764 0.46975236]
 [0.93227243 0.06772757]
 [0.49511772 0.50488228]]
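
Each row of `predict_proba` holds one probability per class, with columns ordered as in the estimator's `classes_` attribute, and `predict` returns the class with the larger probability. The sketch below reuses the probabilities printed above to recover the predicted labels:

```python
import numpy as np

# Probabilities printed above: column 0 is P(class 0), column 1 is P(class 1)
proba = np.array([
    [0.84249239, 0.15750761],
    [0.12678442, 0.87321558],
    [0.53024764, 0.46975236],
    [0.93227243, 0.06772757],
    [0.49511772, 0.50488228],
])

labels = proba.argmax(axis=1)   # pick the more probable class in each row
row_sums = proba.sum(axis=1)    # each row sums to 1

print(labels)   # [0 1 0 0 1], the same labels that predict returned
```

The third and fifth samples sit close to 0.5, so their predicted labels are far less certain than the others.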


### Optimize the SVM model with the same steps
Parameters for SVM model can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [23]:
from sklearn.model_selection import GridSearchCV

grid_params_svc = {
    'kernel': ['linear', 'poly'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto','scale']
}

svc = SVC()
gs_svc_result = GridSearchCV(svc, grid_params_svc, cv=kfold).fit(X_train_norm, y_train)
print(gs_svc_result.best_score_)

0.835150244584207
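
In `SVC`, `degree` is only used by the `poly` kernel and `gamma` is not used by the `linear` kernel, so the grid above fits some candidates that differ only in ignored parameters. One possible refinement, shown as a hypothetical alternative and not as part of the lab, is a list of sub-grids with one entry per kernel:

```python
from sklearn.model_selection import ParameterGrid

# Grid used above: every combination, even where a parameter has no effect
full_grid = {
    'kernel': ['linear', 'poly'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto', 'scale'],
}

# Hypothetical alternative: one sub-grid per kernel, listing only the
# parameters that the kernel actually uses
split_grid = [
    {'kernel': ['linear'], 'C': [1, 10]},
    {'kernel': ['poly'], 'C': [1, 10], 'degree': [3, 8], 'gamma': ['auto', 'scale']},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, "candidates vs", n_split)
```

`GridSearchCV` accepts such a list of dictionaries as its parameter grid, which here would reduce 16 candidates to 10 distinct ones.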


Evaluate the trained SVM model using the testing dataset

In [24]:
test_accuracy = gs_svc_result.best_estimator_.score(X_test_norm, y_test)
print("Accuracy in testing:", test_accuracy)

Accuracy in testing: 0.9333333333333333


Check the parameter setting for the best selected model

In [25]:
# query the SVM search result (gs_svc_result), not the logistic regression one
gs_svc_result.best_params_