## CSE5ML Lab 2 Part 2 : Machine Learning with Scikit Learn for Classification

In Part 1, we learned how to use some ML models from the scikit-learn package on a regression task, together with some data preprocessing procedures. This week, we are going to review those preprocessing procedures and apply logistic regression and a support vector machine (SVM) to a classification task.

Task

This database was collected from the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation. It contains information from 303 patients, with 14 attributes (13 input variables and 1 target variable).

We are using this dataset to build a machine learning model that predicts whether a patient presents heart disease. The detailed information for each variable is as follows:
1. age: age in years
2. sex (male and female)
3. chest pain type
4. resting blood pressure (in mm Hg on admission to the hospital)
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (true and false)
7. resting electrocardiographic results
<br>   -- Value 0: normal
<br>   -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
<br>   -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
<br>   -- Value 1: upsloping
<br>   -- Value 2: flat
<br>   -- Value 3: downsloping
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
<br>   -- Value 0: absence
<br>   -- Value 1: presence

More information about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

### Load the dataset
Use pandas to load the CSV file "heart_disease.csv" provided on LMS, then check the dataset length and print the first 5 rows of the dataset.
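
The loading step can be sketched as below. This is only an illustration: the LMS file is not reproduced here, so the example reads a tiny inline stand-in with made-up rows. The name `csv_source` is hypothetical; in the lab it would be the path "heart_disease.csv".

```python
import io
import pandas as pd

# Stand-in for the LMS file (assumption): three made-up rows, three columns.
# In the lab, pass the path "heart_disease.csv" to read_csv instead.
csv_source = io.StringIO(
    "Age,Sex,Diagnosis\n"
    "63,male,0\n"
    "67,male,1\n"
    "41,female,0\n"
)

dataset = pd.read_csv(csv_source)

print(len(dataset))    # dataset length (number of rows)
print(dataset.head())  # head() shows the first 5 rows by default
```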

### Preprocess the dataset
##### Check if there is any missing value in the dataset

##### Drop the rows that have missing values
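One possible way to do the check and the drop is sketched here on a small made-up frame (the values are illustrative, not taken from the lab file). `isnull().sum()` counts missing entries per column, and `dropna()` removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing entries (assumption, not the lab data).
dataset = pd.DataFrame({
    "Age": [63, 67, 41, 56],
    "Number of major vessels": [0.0, 3.0, np.nan, 1.0],
    "thal": [6.0, np.nan, 3.0, 3.0],
})

# Count missing values in each column.
missing_per_column = dataset.isnull().sum()
print(missing_per_column)

# Drop rows with any missing value and rebuild a clean 0..n-1 index.
dataset = dataset.dropna().reset_index(drop=True)
print(len(dataset))
```

Resetting the index is optional; it just avoids gaps in the row labels after the drop.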

##### Check variable data types

In [4]:
dataset.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                 float64
thal                                    float64
Diagnosis                                 int64
dtype: object

We found that Number of major vessels and thal should be integers but are stored as floats, so we convert them to an integer type.

In [5]:
cols = ['Number of major vessels', 'thal']
dataset[cols] = dataset[cols].astype(int)

In [6]:
# check again
dataset.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                   int32
thal                                      int32
Diagnosis                                 int64
dtype: object

We can see that these two variables are properly transformed now

##### Check if there are any duplicated rows in the dataset
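A minimal sketch of the duplicate check, using a made-up frame in which the third row repeats the first. `duplicated()` marks every repeat of an earlier row, so summing it gives the number of duplicated rows.

```python
import pandas as pd

# Made-up sample (assumption): row 2 is an exact copy of row 0.
dataset = pd.DataFrame({
    "Age": [63, 67, 63],
    "Sex": ["male", "female", "male"],
    "Diagnosis": [0, 1, 0],
})

# True for each row that repeats an earlier row; the sum is the duplicate count.
n_duplicates = dataset.duplicated().sum()
print(n_duplicates)
```

If duplicates are found, `dataset.drop_duplicates()` is one way to remove them.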

##### Check value counts for the categorical variables
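As a sketch, `value_counts()` can be called on each categorical column. The column names come from the dtypes output above; the string labels in the sample are assumptions, since the raw values in the lab file are not shown here.

```python
import pandas as pd

# Made-up sample; the label strings are assumed, not read from the lab file.
dataset = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Chest Pain Type": ["asymptomatic", "typical angina",
                        "asymptomatic", "non-anginal pain"],
})

categorical_cols = ["Sex", "Chest Pain Type"]
for col in categorical_cols:
    # Number of rows per distinct value, most frequent first.
    print(dataset[col].value_counts())
```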

##### Deal with categorical variables

Since both Sex and Fasting Blood Sugar are binary variables, we can use 0 and 1 to replace their values.

For example, for variable Sex:
<br> 1 = male; 0 = female

For variable Fasting Blood Sugar:
<br> 1 = True; 0 = False

In addition, based on a domain expert's advice, we can use the following rule to transform the categorical variable Chest Pain Type:
<br>-- Value 1: typical angina
<br>-- Value 2: atypical angina
<br>-- Value 3: non-anginal pain
<br>-- Value 4: asymptomatic
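
The rules above could be applied roughly as follows. The exact strings stored in the Sex and Chest Pain Type columns are assumptions here (they are not shown in this handout), and `sex_map` and `chest_pain_map` are hypothetical names; adjust the dictionary keys to match the output of the value-count check.

```python
import pandas as pd

# Made-up sample; the raw label strings are assumed.
dataset = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Fasting Blood Sugar": [True, False, False],
    "Chest Pain Type": ["typical angina", "asymptomatic", "non-anginal pain"],
})

# Hypothetical mappings following the rules stated above.
sex_map = {"male": 1, "female": 0}
chest_pain_map = {
    "typical angina": 1,
    "atypical angina": 2,
    "non-anginal pain": 3,
    "asymptomatic": 4,
}

dataset["Sex"] = dataset["Sex"].map(sex_map)
# A bool column converts directly: True -> 1, False -> 0.
dataset["Fasting Blood Sugar"] = dataset["Fasting Blood Sugar"].astype(int)
dataset["Chest Pain Type"] = dataset["Chest Pain Type"].map(chest_pain_map)

print(dataset)
```

Note that `map` produces NaN for any value missing from the dictionary, so it is worth re-checking for missing values after the transformation.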

##### Check dataset shape

##### Define the input variables and the target variable
The target variable is the last column, Diagnosis, and the input variables are the rest of the columns.
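
One way to separate inputs and target is sketched here on a made-up frame with only three columns. The names `X` and `y` are a common convention and an assumption of this example.

```python
import pandas as pd

# Made-up sample with two input columns and the target column.
dataset = pd.DataFrame({
    "Age": [63, 67, 41],
    "thal": [6, 3, 3],
    "Diagnosis": [0, 1, 0],
})

X = dataset.drop(columns=["Diagnosis"])  # every column except the target
y = dataset["Diagnosis"]                 # the target column

print(X.shape, y.shape)
```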

### Split the dataset and normalize data

##### Split the training and testing dataset
Use 10% of the dataset for testing, with a random state of 1.
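
A sketch of the split with `train_test_split`, using random stand-in arrays in place of the lab data (100 rows and 13 inputs are arbitrary choices for the illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data (assumption): 100 rows, 13 inputs, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
y = rng.integers(0, 2, size=100)

# Hold out 10% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

print(X_train.shape, X_test.shape)
```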

##### Apply normalization to both the training and testing datasets
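The handout does not say which scaler to use, so the sketch below assumes `MinMaxScaler` (`StandardScaler` would follow the same pattern). The scaler is fitted on the training data only and then reused on the testing data, so no information from the test set leaks into the preprocessing. `X_test_norm` is a hypothetical name chosen to match `X_train_norm`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up arrays so the scaled values are easy to verify by hand.
X_train = np.array([[30.0, 120.0], [50.0, 140.0], [70.0, 200.0]])
X_test = np.array([[40.0, 160.0]])

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)  # learn min/max from training data
X_test_norm = scaler.transform(X_test)        # reuse the training min/max

print(X_train_norm)
print(X_test_norm)
```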

### Now we are learning how to train a model with logistic regression and SVM for classification, based on entire training dataset and then evaluate the model based on testing dataset
Be aware that, for regression models, the default evaluation metric returned by `score` is R squared. For classification models, the default evaluation metric is accuracy.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# logistic regression model, parameters can be changed
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Testing Accuracy of LR:", test_score)

# Support Vector Machine for classification, parameters can be changed
model = SVC()
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Testing Accuracy of SVC:", test_score)

Testing Accuracy of LR: 0.9333333333333333
Testing Accuracy of SVC: 0.6333333333333333


### Train a model with 5-fold cross-validation

##### Define a 5-fold cross-validation with data shuffling and set the random state to 2

##### Run the 5-fold cross-validation and print the average accuracy score based on the cross-validation results, and evaluate both models on the testing dataset
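A possible sketch of the cross-validation steps, run on synthetic data from `make_classification` as a stand-in for the normalized training set. The dictionary `cv_means` is a hypothetical name used only to collect the results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the normalized training data (assumption).
X_train_norm, y_train = make_classification(
    n_samples=200, n_features=13, random_state=0
)

# 5 folds, shuffled before splitting, with a fixed random state of 2.
kfold = KFold(n_splits=5, shuffle=True, random_state=2)

models = {"LR": LogisticRegression(solver="liblinear"), "SVC": SVC()}
cv_means = {}
for name, model in models.items():
    # One accuracy score per fold; report their mean.
    scores = cross_val_score(model, X_train_norm, y_train, cv=kfold)
    cv_means[name] = scores.mean()
    print("Average CV accuracy of", name, ":", cv_means[name])
```

`cross_val_score` fits clones of the model, so each model still needs its own `fit` on the training data before it can be scored on the testing dataset.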

### Optimize the Logistic Regression models with cross validation
The parameters that can be used in grid_params_lr can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html You can add more parameters and values to grid_params_lr.

In [17]:
# fine tune parameters for lr model
from sklearn.model_selection import GridSearchCV

grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter=150)
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold).fit(X_train_norm, y_train)
print(gs_lr_result.best_score_)

0.8315164220824599


### Evaluate the trained Logistic Regression model using testing dataset

Check the parameter settings of the best selected model
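
Both steps can be sketched as follows. The example is self-contained, so it rebuilds a search on synthetic data with a reduced grid (an assumption made to keep it short); in the lab, `gs_lr_result` from the cell above would be used directly, and `X_test_norm` is a hypothetical name for the normalized testing inputs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data (assumption), split as in the lab instructions.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_norm, X_test_norm, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

kfold = KFold(n_splits=5, shuffle=True, random_state=2)
grid_params_lr = {"C": [1, 10], "solver": ["saga", "liblinear"]}  # reduced grid
gs_lr_result = GridSearchCV(
    LogisticRegression(max_iter=1000), grid_params_lr, cv=kfold
).fit(X_train_norm, y_train)

# With the default refit=True, the best parameter combination is refitted on
# the whole training set, so the search object can score the test set directly.
test_score = gs_lr_result.score(X_test_norm, y_test)
print("Testing accuracy of best LR:", test_score)

# Parameter setting of the best selected model.
print("Best parameters:", gs_lr_result.best_params_)
```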

### Optimize the SVM models with the same steps
Parameters for SVM model can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Evaluate the trained SVM model using the testing dataset

Check the parameter settings of the best selected model
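
The same workflow for SVM might look like the sketch below. The grid values and the names `grid_params_svc`, `gs_svc_result`, and `X_test_norm` are assumptions chosen for illustration; `C`, `kernel`, and `gamma` are documented `SVC` parameters, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split as in the lab instructions.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_norm, X_test_norm, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

kfold = KFold(n_splits=5, shuffle=True, random_state=2)

# Illustrative grid; more values and parameters can be added.
grid_params_svc = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

gs_svc_result = GridSearchCV(SVC(), grid_params_svc, cv=kfold).fit(
    X_train_norm, y_train
)
print("Best CV accuracy:", gs_svc_result.best_score_)

# Evaluate the refitted best model on the testing data, then inspect its settings.
test_score = gs_svc_result.score(X_test_norm, y_test)
print("Testing accuracy of best SVC:", test_score)
print("Best parameters:", gs_svc_result.best_params_)
```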