# Using an RBM to learn Parkinson features

In this notebook I let an RBM learn features, which I then use to train an MLP to solve the Parkinson detection problem. Please note that this notebook contains only one of the three proposed solutions; for a complete view, check the other notebook and the documentation.

## Feature selection

The main goal is to let the RBM learn the features, but in previous attempts, when I gave it all the data and let it train, it didn't manage to learn anything. So I will reuse most of the feature selection techniques used previously. If you have any doubts, please check the other notebook.

In [26]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import convolve
from sklearn import metrics
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import featureExtractionRBM as fe
import pandas as pd
import glob


## Approach to feature selection
I will follow the same procedure as in the other notebook, except that I will not use the variance or the test_id for the RBM. One of the main points I want to stress is that I make splits of exactly the same size for both the control and the Parkinson data; this is crucial if we are to keep the integrity of the testing process. The number of splits has also gone down in order to make the process faster.

In [4]:
control_path = "Data/control"
parkinson_path = "Data/parkinson"


def getFiles(path):
    all_files = glob.glob(path + "/*.txt")
    data = []

    for file in all_files:
        df = pd.read_csv(file,sep=";")
        data.append(df)
    
    return data
         

def extractFeatures(df):
    # For each of the first five columns, the feature is the row offset
    # between the position of the maximum and the position of the minimum.
    values = df.values
    features = []
    for col in range(5):
        features.append(np.argmax(values[:, col]) - np.argmin(values[:, col]))
    return features
    

def getControlFeatures():
    control_data = []
    control = getFiles(control_path)
    for i in control:
        new_df = splitByTest(i)
        for k in range(new_df.shape[0]):
            last_df = np.array_split(i,40)
            for j in last_df:
                control_data.append(extractFeatures(j))
    return pd.DataFrame(control_data)


def getParkinsonFeatures():
    parkinson_data = []
    parkinson = getFiles(parkinson_path)
    for i in parkinson:
        new_df = splitByTest(i)
        for k in range(new_df.shape[0]):
            last_df = np.array_split(i,40)
            for j in last_df:
                parkinson_data.append(extractFeatures(j))
    return pd.DataFrame(parkinson_data)


def getClasses(control, parkinson):
    # Label control samples as 0 and Parkinson samples as 1.
    return np.concatenate((np.zeros(control.shape[0], dtype=int),
                           np.ones(parkinson.shape[0], dtype=int)))

def splitByTest(df):
    # Group consecutive rows that share the same value in column 6
    # (presumably the test_id).
    data = []
    test = df.values[0][6]
    split = []
    for i in range(df.shape[0]):
        if df.values[i][6] == test:
            split.append(df.values[i])
        else:
            aux = pd.DataFrame(split)
            data.append(aux)
            split = []
            test = df.values[i][6]
            split.append(df.values[i])
    data.append(split)
    return pd.DataFrame(data)
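
To make the extraction step concrete, here is a minimal sketch on synthetic data. The random array, the helper name argmax_minus_argmin and the chunk count of 4 are assumptions for illustration only; the notebook itself uses 40 chunks on the real recordings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one recording: 20 rows, 5 signal columns.
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5))

def argmax_minus_argmin(chunk):
    # Row offset between the maximum and the minimum of each column.
    return [int(np.argmax(chunk[:, c]) - np.argmin(chunk[:, c]))
            for c in range(chunk.shape[1])]

# np.array_split also accepts lengths that are not a multiple of the chunk count.
chunks = np.array_split(toy, 4)
toy_features = pd.DataFrame([argmax_minus_argmin(c) for c in chunks])
print(toy_features.shape)  # (4, 5)
```

Each chunk yields one row of five integers, so a recording split into 40 chunks would contribute 40 rows to the feature table.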



## RBM and MLP pipeline

I decided to use the RBM that is available in scikit-learn, the Bernoulli Restricted Boltzmann Machine (BernoulliRBM). I will create a pipeline with this RBM and scikit-learn's MLPClassifier.

## Let's set up the RBM's parameters

In [5]:
rbm = BernoulliRBM( verbose=True)

rbm.learning_rate = 0.0002
rbm.n_iter = 300

rbm.n_components = 200

## Setting up the MLP and the pipeline

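I set alpha to 1e-08 instead of keeping the default of 0.0001. The network has two hidden layers of sizes 70 and 10. Adam worked best as the solver, and the activation function is ReLU. The learning rate I settled on is 0.0005, and training runs for at most 5,000 iterations. A momentum of 0.7 is also passed, although scikit-learn only uses this parameter with the sgd solver, so it has no effect with Adam.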


In [8]:
mlp = MLPClassifier(alpha=1e-08, hidden_layer_sizes=(70,10),solver='adam',
                    activation='relu',  learning_rate_init = .0005,
                    max_iter=5000, momentum = 0.7)

rbm_features_classifier = Pipeline(steps=[('rbm', rbm), ('MLP', mlp)])
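
As a rough illustration of what the pipeline does when it is fitted, the sketch below runs the same two steps on synthetic data. The data, the toy_ names and the small hyperparameters are assumptions chosen only to keep the example fast; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM, MLPClassifier
from sklearn.pipeline import Pipeline

# Synthetic data in [0, 1]; sizes and hyperparameters are illustrative only.
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = (X[:, 0] > 0.5).astype(int)

toy_rbm = BernoulliRBM(n_components=8, n_iter=5, random_state=0)
toy_mlp = MLPClassifier(hidden_layer_sizes=(4,), max_iter=50, random_state=0)
toy_pipeline = Pipeline(steps=[('rbm', toy_rbm), ('MLP', toy_mlp)])

# fit() trains the RBM without using the labels, transforms X into
# hidden-unit activations, and then trains the MLP on those activations.
toy_pipeline.fit(X, y)

hidden = toy_pipeline.named_steps['rbm'].transform(X)
pred = toy_pipeline.predict(X)
print(hidden.shape, pred.shape)  # (60, 8) (60,)
```

The MLP never sees the raw features: its input width equals the RBM's n_components, which is 200 in the configuration above.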

Let's gather our data and train the pipeline. These procedures are similar to the ones shown in the previous notebook.

In [21]:
data = getControlFeatures()
park = getParkinsonFeatures()

all_data = np.concatenate((data,park),axis= 0)
target = fe.getClasses(data,park)

train_data, test_data, train_target, test_target = train_test_split(all_data,target, test_size= 0.3, random_state=30 )

rbm_features_classifier.fit(train_data, train_target)

[BernoulliRBM] Iteration 1, pseudo-likelihood = -107231.14, time = 0.09s
[BernoulliRBM] Iteration 2, pseudo-likelihood = -219247.59, time = 0.23s
[BernoulliRBM] Iteration 3, pseudo-likelihood = -327501.41, time = 0.23s
[BernoulliRBM] Iteration 4, pseudo-likelihood = -450910.63, time = 0.22s
[BernoulliRBM] Iteration 5, pseudo-likelihood = -542645.88, time = 0.21s
[BernoulliRBM] Iteration 6, pseudo-likelihood = -657398.11, time = 0.21s
[BernoulliRBM] Iteration 7, pseudo-likelihood = -757120.09, time = 0.22s
[BernoulliRBM] Iteration 8, pseudo-likelihood = -945923.95, time = 0.24s
[BernoulliRBM] Iteration 9, pseudo-likelihood = -972718.21, time = 0.21s
[BernoulliRBM] Iteration 10, pseudo-likelihood = -1105794.55, time = 0.22s
[BernoulliRBM] Iteration 11, pseudo-likelihood = -1238721.10, time = 0.21s
[BernoulliRBM] Iteration 12, pseudo-likelihood = -1385335.46, time = 0.22s
[BernoulliRBM] Iteration 13, pseudo-likelihood = -1490246.37, time = 0.23s
[BernoulliRBM] Iteration 14, pseudo-likelih

[BernoulliRBM] Iteration 111, pseudo-likelihood = -12733285.23, time = 0.23s
[BernoulliRBM] Iteration 112, pseudo-likelihood = -13012887.70, time = 0.22s
[BernoulliRBM] Iteration 113, pseudo-likelihood = -13145318.41, time = 0.23s
[BernoulliRBM] Iteration 114, pseudo-likelihood = -12986219.10, time = 0.22s
[BernoulliRBM] Iteration 115, pseudo-likelihood = -13344143.03, time = 0.24s
[BernoulliRBM] Iteration 116, pseudo-likelihood = -12934595.68, time = 0.22s
[BernoulliRBM] Iteration 117, pseudo-likelihood = -13216578.89, time = 0.22s
[BernoulliRBM] Iteration 118, pseudo-likelihood = -12684799.22, time = 0.23s
[BernoulliRBM] Iteration 119, pseudo-likelihood = -14185872.65, time = 0.22s
[BernoulliRBM] Iteration 120, pseudo-likelihood = -13523404.08, time = 0.22s
[BernoulliRBM] Iteration 121, pseudo-likelihood = -14398319.30, time = 0.22s
[BernoulliRBM] Iteration 122, pseudo-likelihood = -14465580.40, time = 0.23s
[BernoulliRBM] Iteration 123, pseudo-likelihood = -14167494.84, time = 0.21s

[BernoulliRBM] Iteration 218, pseudo-likelihood = -26652949.03, time = 0.22s
[BernoulliRBM] Iteration 219, pseudo-likelihood = -25283236.71, time = 0.21s
[BernoulliRBM] Iteration 220, pseudo-likelihood = -24410736.66, time = 0.22s
[BernoulliRBM] Iteration 221, pseudo-likelihood = -25544630.27, time = 0.22s
[BernoulliRBM] Iteration 222, pseudo-likelihood = -26229721.81, time = 0.22s
[BernoulliRBM] Iteration 223, pseudo-likelihood = -24976564.05, time = 0.22s
[BernoulliRBM] Iteration 224, pseudo-likelihood = -26062841.53, time = 0.23s
[BernoulliRBM] Iteration 225, pseudo-likelihood = -26004291.97, time = 0.19s
[BernoulliRBM] Iteration 226, pseudo-likelihood = -27524362.52, time = 0.21s
[BernoulliRBM] Iteration 227, pseudo-likelihood = -25688510.71, time = 0.23s
[BernoulliRBM] Iteration 228, pseudo-likelihood = -27816103.55, time = 0.22s
[BernoulliRBM] Iteration 229, pseudo-likelihood = -26443549.74, time = 0.22s
[BernoulliRBM] Iteration 230, pseudo-likelihood = -25004219.55, time = 0.24s

Pipeline(steps=[('rbm',
                 BernoulliRBM(learning_rate=0.0002, n_components=200,
                              n_iter=300, verbose=True)),
                ('MLP',
                 MLPClassifier(alpha=1e-08, hidden_layer_sizes=(70, 10),
                               learning_rate_init=0.0005, max_iter=5000,
                               momentum=0.7))])
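
The pseudo-likelihood in the log above grows more negative with almost every iteration. One possible cause, not verified on this data, is the input range: scikit-learn documents BernoulliRBM as expecting binary inputs or values between 0 and 1, whereas the argmax minus argmin features are unbounded integers. A minimal sketch of placing a MinMaxScaler step in front of the RBM, on synthetic data with illustrative hyperparameters (the scaled_pipeline name and all sizes are assumptions):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM, MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic integer features that can be negative, like index differences.
rng = np.random.default_rng(0)
X = rng.integers(-50, 50, size=(80, 5)).astype(float)
y = (X[:, 0] > 0).astype(int)

scaled_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),  # maps each training feature to [0, 1]
    ('rbm', BernoulliRBM(n_components=8, n_iter=5, random_state=0)),
    ('MLP', MLPClassifier(hidden_layer_sizes=(4,), max_iter=50, random_state=0)),
])
scaled_pipeline.fit(X, y)

# Inspect the range of the values that actually reach the RBM.
X_scaled = scaled_pipeline.named_steps['scale'].transform(X)
print(X_scaled.min(), X_scaled.max())
```

Because the scaler is part of the pipeline, it is fitted on the training split only, so the testing data is still left untouched.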

## Evaluating the results
The next step is to use our testing data and evaluate the results.

In [25]:
Y_pred = rbm_features_classifier.predict(test_data)

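# Arguments are passed as (Y_pred, test_target), so rows are predicted
# labels and columns are true labels (the transpose of scikit-learn's
# usual confusion_matrix(y_true, y_pred) layout).
print(confusion_matrix(Y_pred, test_target))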
print(accuracy_score(Y_pred, test_target))

[[  10    3]
 [ 425 1986]]
0.8234323432343235


In [23]:
print("MLP using RBM features:\n%s\n" % (
    metrics.classification_report(test_target, Y_pred)))
    

MLP using RBM features:
              precision    recall  f1-score   support

           0       0.77      0.02      0.04       435
           1       0.82      1.00      0.90      1989

    accuracy                           0.82      2424
   macro avg       0.80      0.51      0.47      2424
weighted avg       0.81      0.82      0.75      2424
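
The figures in the report can be re-derived by hand from the four counts of the test run. The sketch below does that arithmetic and adds two reference values that the report does not print directly: the accuracy of always predicting the majority class and the balanced accuracy. The variable names are my own.

```python
# Counts from the test results above, read as (true label, predicted label).
true0_pred0, true0_pred1 = 10, 425
true1_pred0, true1_pred1 = 3, 1986

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1
accuracy = (true0_pred0 + true1_pred1) / total
# Accuracy of always predicting the majority class (label 1).
majority_baseline = (true1_pred0 + true1_pred1) / total
recall_0 = true0_pred0 / (true0_pred0 + true0_pred1)
recall_1 = true1_pred1 / (true1_pred0 + true1_pred1)
balanced_accuracy = (recall_0 + recall_1) / 2

print(round(accuracy, 4), round(majority_baseline, 4), round(balanced_accuracy, 4))
# 0.8234 0.8205 0.5107
```

The balanced accuracy matches the macro-averaged recall of 0.51 in the report.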




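As we can see, this approach reaches an accuracy of 0.82, comparable to the results obtained with the other 2 approaches, and once again without tampering with the testing data. Note, however, that the recall for the control class (0) is only 0.02: the classifier labels almost every sample as Parkinson, and since 1989 of the 2424 test samples belong to that class, always predicting Parkinson would already give an accuracy of about 0.82.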