In [1]:
import mlp
import matplotlib.pyplot as plt
from cross_validation import cross_validation

In [None]:
# # For Reproducibility
# import random
# random.seed(6630)

### Data Preprocessing

This tutorial uses the developer's data as an example. <br/>
The `data_opener` component works only with the developer's data. <br/>

In [2]:
import data_opener
data = data_opener.get_data()
X_train, X_test, X_val, y_train, y_test, y_val = data_opener.train_test_val_split(data)

To use your own data, you need to prepare it yourself <br/>
(i.e., read it from a file and format it into the following shape): <br/>

In [7]:
# Show the first row of X_train
print(X_train.iloc[0])

age              0.921562
height           0.708000
weight           0.450000
ap_hi            0.009988
ap_lo            0.009091
smoke            0.000000
alco             0.000000
active           1.000000
cholesterol_2    0.000000
cholesterol_3    0.000000
gluc_2           0.000000
gluc_3           0.000000
is_man           1.000000
Name: 9289, dtype: float64


In [8]:
# Show the first row of y_train
print(y_train.iloc[0])

1.0


Please preprocess the data before training the model.  <br/>
Specifically, note that:  <br/>

1. All features must be numeric (binary is allowed).  <br/>
2. For each categorical variable with n classes, recode it as (n-1) binary variables.
3. Remove highly linearly dependent columns and useless columns (such as an index).
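
The three rules above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the `raw` frame and its column names are made up, and the min-max scaling step is an assumption based on the sample row shown earlier (all values lie in [0, 1]); the tutorial does not state how the developer's data was scaled.

```python
import pandas as pd

# Hypothetical raw data; column names and values are illustrative only.
raw = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [50, 62, 41, 58],
    "cholesterol": [1, 3, 2, 1],   # categorical variable with 3 classes
    "smoke": [0, 1, 0, 0],
    "cardio": [0, 1, 0, 1],        # target label
})

# Rule 3: drop columns that carry no predictive information, such as an index.
features = raw.drop(columns=["id", "cardio"])
target = raw["cardio"].astype(float)

# Rule 2: recode an n-class categorical variable as (n - 1) binary columns.
features = pd.get_dummies(features, columns=["cholesterol"],
                          drop_first=True, dtype=float)

# Rule 1, plus the assumed scaling step: everything numeric, rescaled to [0, 1].
span = (features.max() - features.min()).replace(0, 1)  # avoid division by zero
features = (features - features.min()) / span

print(features.columns.tolist())
```

With `drop_first=True`, the first class of `cholesterol` becomes the implicit baseline, which gives columns named like `cholesterol_2` and `cholesterol_3` in the sample row.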

### Train model
For demonstration purposes, we use only the first 200 samples here.

In [9]:
n_samples = 200

X_train = X_train.head(n_samples)
y_train = y_train.head(n_samples)

1. Decide which hyper-parameters you want to use.<br/>
    You can skip this step, but you will need to specify them later.

In [11]:
n_features = 13
lr = .00001
n_epochs = 1000
batch_size = 25

2. Call the `MLP` class to initialize an MLP classifier. <br/>
    You need to specify the `number of features` and the `sizes of the hidden layers` at this point. <br/>
    You can also specify the `learning rate`, `batch size`, <br/>
    number of `epochs`, and whether to `include bias` in the model, <br/>
    which default to `25`, `1`, `1` and `False`.

In [12]:
mlp_clf = mlp.MLP(n_features = n_features, 
                    hidden_sizes = [8], 
                    n_epochs = n_epochs,
                    batch_size = batch_size,
                    include_bias = True)

3. Call the `train` method. <br/>
You can pass a different `learning rate` each time you train a new model. (This is also optional.)

In [13]:
# Train on the training set
mlp_clf.train(X_train, y_train, lr = lr)

4. Use the model to predict whether a patient is at high risk of having cardiovascular disease. <br/>
The output of the `pred` method is a floating-point value between 0 and 1, <br/>
where y >= 0.5 indicates that the patient is at high risk of having cardiovascular disease now or in the future.

In [27]:
# Predict on Training set
y_pred = mlp_clf.pred(X_train)
y_pred_train = list(map(lambda x: 1 if x >= 0.5 else 0, y_pred))

5. Calculate statistical metrics on the training-set predictions, e.g., accuracy and recall.

In [28]:
# print statistics and confusion matrix for prediction on training set
calc_stats = cross_validation()
stats_train = calc_stats.print_stat(y_train, y_pred_train)
print("Training set: ===============")
print(stats_train[0])
print(stats_train[1])

Accuracy:	0.46
ErrorRate:	0.54
Precision:	0.48295454545454547
Recall:	0.8333333333333334

Confusion matrix:
			Positive	Negative	
pred_pos	85			91	
pred_neg	17			7	
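
As a cross-check, the four metrics can be recomputed by hand from the training-set confusion matrix. The sketch below is not the implementation of `print_stat`; `confusion_stats` is a hypothetical helper that applies the standard definitions to the four counts printed in the matrix.

```python
def confusion_stats(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp),   # of predicted positives, how many are right
        "recall": tp / (tp + fn),      # of actual positives, how many are found
    }

# Counts read from the training-set matrix:
# pred_pos row -> tp=85, fp=91; pred_neg row -> fn=17, tn=7
stats = confusion_stats(tp=85, fp=91, fn=17, tn=7)
print(stats)
```

The results agree with the printed values: accuracy 92/200 = 0.46, precision 85/176 ≈ 0.483, and recall 85/102 ≈ 0.833.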



### Predict on test set
For model selection, use the model to predict your test data.

In [25]:
# Predict on Test set
y_pred = mlp_clf.pred(X_test)
y_pred_test = list(map(lambda x: 1 if x >= 0.5 else 0, y_pred))

# print statistics and confusion matrix for prediction on test set
# calc_stats = cross_validation()
stats_test = calc_stats.print_stat(y_test, y_pred_test)
print("Test set: ===============")
print(stats_test[0])
print(stats_test[1])

Accuracy:	0.4635714285714286
ErrorRate:	0.5364285714285715
Precision:	0.4824033627297453
Recall:	0.8263447691656078

Confusion matrix:
			Positive	Negative	
pred_pos	5853			6280	
pred_neg	1230			637	



### Predict on validation set
Assume that we have found the optimal model using the previous hyper-parameters, i.e., the current model. <br/>
Now, predict on the validation set to get the model performance:

In [30]:
# Predict on Validation set
y_pred = mlp_clf.pred(X_val)
y_pred_val = list(map(lambda x: 1 if x >= 0.5 else 0, y_pred))

# print statistics and confusion matrix for prediction on Validation set
# calc_stats = cross_validation()
stats_val = calc_stats.print_stat(y_val, y_pred_val)
print("Validation: ===============")
print(stats_val[0])
print(stats_val[1])

Accuracy:	0.46014285714285713
ErrorRate:	0.5398571428571428
Precision:	0.4762996316004912
Recall:	0.8336437885083823

Confusion matrix:
			Positive	Negative	
pred_pos	5818			6397	
pred_neg	1161			624	



You can now report a model accuracy of about 46%. This is in fact poor performance, so please look for better hyper-parameters.
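
One simple way to look for better hyper-parameters is an exhaustive grid search. The sketch below is generic and makes several assumptions: `grid_search`, `dummy_score`, and the grid values are hypothetical, and `dummy_score` is only a stand-in so the example runs on its own. In this notebook, the scoring function would instead build an `mlp.MLP`, call `train`, and return the accuracy on the test set.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)   # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in scorer (an assumption, not a real model): peaks at lr=0.01, size=16.
def dummy_score(lr, hidden_size):
    return 1.0 - abs(lr - 0.01) - abs(hidden_size - 16) / 100

grid = {"lr": [0.00001, 0.001, 0.01], "hidden_size": [8, 16, 32]}
best_params, best_score = grid_search(dummy_score, grid)
print(best_params, best_score)
```

Because every combination is trained once, the cost grows with the product of the list lengths, so it helps to keep each list short.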

### Cross-Validation
To reduce the risk of overfitting, you may want to use K-fold `cross validation`. <br/>
For a `tutorial` on how to run cross validation, please refer to "`cv_main.py`".
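
For orientation, the index bookkeeping behind K-fold cross validation can be sketched with NumPy. This is a generic illustration, not the code in `cv_main.py`; `k_fold_indices` and its parameters are hypothetical names.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)    # shuffle once up front
    folds = np.array_split(indices, k)      # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# With 200 samples and 5 folds, each round trains on 160 and validates on 40.
for train_idx, val_idx in k_fold_indices(n_samples=200, k=5):
    print(len(train_idx), len(val_idx))
```

Each sample is held out exactly once, so averaging the per-fold metric gives a more stable estimate than a single split. With pandas data, the index arrays can be applied with `X_train.iloc[train_idx]`.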