# Random forest 

1. Load the dataset (standardized_complete_dataset.csv).

2. Split it into train and test: This step is crucial for evaluating the model's performance. We'll use a 70-30 split, stratified on VVR_group. The resampled X_train_res and y_train_res (created in step 3) are used for training, and X_test and y_test for testing.

3. Apply SMOTE on the train set: We'll apply SMOTE only to the training set (X_train, y_train) to address class imbalance, resulting in X_train_res and y_train_res.

4. Create the pipeline with RFE and the model: We'll create a pipeline that includes Recursive Feature Elimination (RFE) and the chosen model. This pipeline will be trained on the resampled training set (X_train_res, y_train_res).

5. Hyperparameter tuning using grid search: We'll perform hyperparameter tuning using grid search within an outer loop of cross-validation. Since grid search involves training multiple models with different hyperparameter combinations, it should be performed on the resampled training set (X_train_res, y_train_res).

6. Metrics to evaluate: Finally, we'll evaluate the model's performance using various metrics on the test set (X_test, y_test). Common metrics include accuracy, precision, recall, F1-score, and ROC AUC score.


Import libraries. 

In [1]:
# Import necessary libraries (duplicates merged, grouped by package)
import json
import logging
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (accuracy_score, auc, classification_report, confusion_matrix,
                             f1_score, precision_recall_curve, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler

standardized_complete_dataset is the dataframe I created with all the features extracted with tsfresh. 

In [18]:
complete_dataset = pd.read_csv('/Users/dionnespaltman/Desktop/V4/standardized_complete_dataset.csv', sep=',')
complete_dataset.drop('Unnamed: 0', axis=1, inplace=True)
display(complete_dataset)

Unnamed: 0,ID,Sum_12,Sum_4567,VVR_1,VVR_2,Sum_456,VVR_group,Condition,Date,Gender,...,AU26_r__standard_deviation,AU26_r__maximum,AU26_r__mean,AU26_r__root_mean_square,AU45_r__sum_values,AU45_r__variance,AU45_r__standard_deviation,AU45_r__maximum,AU45_r__mean,AU45_r__root_mean_square
0,23,24.0,37.0,13.0,11.0,27.0,0,2,2020-08-01,2,...,-0.289518,0.458331,-0.574983,-0.554028,-0.333560,0.221977,0.345618,0.548923,0.603537,0.429001
1,24,23.0,37.0,12.0,11.0,28.0,0,2,2020-01-22,2,...,1.793425,0.458331,2.294469,2.317495,0.203487,-0.260024,-0.030426,0.619766,-0.659902,-0.281599
2,25,28.0,44.0,16.0,12.0,33.0,1,2,2020-05-02,2,...,0.557817,0.458331,0.196854,0.400873,-0.376228,0.033873,0.204639,-0.135892,0.093018,0.147314
3,26,30.0,37.0,15.0,15.0,29.0,0,1,2020-06-02,1,...,-0.394321,0.458331,-0.847716,-0.761866,-0.868821,-0.323805,-0.084215,0.541052,-1.092069,-0.441394
4,27,22.0,39.0,11.0,11.0,31.0,0,2,2020-06-02,1,...,1.682971,1.863336,-2.230621,0.556978,2.456784,1.071168,0.914462,0.651252,3.692845,1.996765
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,140,16.0,32.0,8.0,8.0,24.0,0,3,2021-05-26,2,...,-0.033310,0.458331,-0.446575,-0.318795,0.861706,0.199480,0.329100,0.210451,0.318307,0.327148
100,142,20.0,34.0,11.0,9.0,26.0,0,3,2021-05-31,1,...,-0.853016,0.458331,-1.199405,-1.246095,-0.067533,-0.585183,-0.317237,-0.537335,-0.522513,-0.509273
101,144,24.0,35.0,12.0,12.0,27.0,0,3,2021-01-06,1,...,-0.565965,0.458331,-0.851589,-0.878346,-0.720364,-0.806836,-0.534298,-0.340549,-0.961821,-0.835709
102,145,20.0,37.0,11.0,9.0,28.0,0,1,2021-02-06,2,...,-0.485857,0.366881,-0.548002,-0.665071,1.190821,0.027513,0.199753,0.076637,-0.006872,0.113717


In [19]:
print(list(complete_dataset.columns))

['ID', 'Sum_12', 'Sum_4567', 'VVR_1', 'VVR_2', 'Sum_456', 'VVR_group', 'Condition', 'Date', 'Gender', 'AU01_r__sum_values', 'AU01_r__variance', 'AU01_r__standard_deviation', 'AU01_r__maximum', 'AU01_r__mean', 'AU01_r__root_mean_square', 'AU02_r__sum_values', 'AU02_r__variance', 'AU02_r__standard_deviation', 'AU02_r__maximum', 'AU02_r__mean', 'AU02_r__root_mean_square', 'AU04_r__sum_values', 'AU04_r__variance', 'AU04_r__standard_deviation', 'AU04_r__maximum', 'AU04_r__mean', 'AU04_r__root_mean_square', 'AU05_r__sum_values', 'AU05_r__variance', 'AU05_r__standard_deviation', 'AU05_r__maximum', 'AU05_r__mean', 'AU05_r__root_mean_square', 'AU06_r__sum_values', 'AU06_r__variance', 'AU06_r__standard_deviation', 'AU06_r__maximum', 'AU06_r__mean', 'AU06_r__root_mean_square', 'AU07_r__sum_values', 'AU07_r__variance', 'AU07_r__standard_deviation', 'AU07_r__maximum', 'AU07_r__mean', 'AU07_r__root_mean_square', 'AU09_r__sum_values', 'AU09_r__variance', 'AU09_r__standard_deviation', 'AU09_r__maximum

I want to know how many people are in each group. 

In [20]:
# Count the number of instances of people in VVR_group = 1 and VVR_group = 0
count_vvr_group = complete_dataset['VVR_group'].value_counts()

# Print the counts
print("Number of instances in VVR_group = 1:", count_vvr_group[1])
print("Number of instances in VVR_group = 0:", count_vvr_group[0])

Number of instances in VVR_group = 1: 23
Number of instances in VVR_group = 0: 81


In [21]:
columns_to_drop = ['ID', 'Sum_12', 'Sum_4567', 'Sum_456', 'VVR_group', 'Condition', 'Gender', 'Date'] 

In [22]:
# X = complete_dataset.drop(columns_to_drop, axis=1)
# y = complete_dataset['VVR_group']

These are the columns I use to predict, so all my features. I need them as a list to set up my featurizer. 
I have 102 features from tsfresh, and I added the two VVR measurements from stages 1 and 2. 

In [23]:
columns_ts_fresh = ['AU01_r__sum_values', 'AU01_r__variance', 'AU01_r__standard_deviation', 'AU01_r__maximum', 
                    'AU01_r__mean', 'AU01_r__root_mean_square', 'AU02_r__sum_values', 'AU02_r__variance', 
                    'AU02_r__standard_deviation', 'AU02_r__maximum', 'AU02_r__mean', 'AU02_r__root_mean_square', 
                    'AU04_r__sum_values', 'AU04_r__variance', 'AU04_r__standard_deviation', 'AU04_r__maximum', 
                    'AU04_r__mean', 'AU04_r__root_mean_square', 'AU05_r__sum_values', 'AU05_r__variance',
                    'AU05_r__standard_deviation', 'AU05_r__maximum', 'AU05_r__mean', 'AU05_r__root_mean_square', 
                    'AU06_r__sum_values', 'AU06_r__variance', 'AU06_r__standard_deviation', 'AU06_r__maximum', 
                    'AU06_r__mean', 'AU06_r__root_mean_square', 'AU07_r__sum_values', 'AU07_r__variance', 
                    'AU07_r__standard_deviation', 'AU07_r__maximum', 'AU07_r__mean', 'AU07_r__root_mean_square', 
                    'AU09_r__sum_values', 'AU09_r__variance', 'AU09_r__standard_deviation', 'AU09_r__maximum',
                    'AU09_r__mean', 'AU09_r__root_mean_square', 'AU10_r__sum_values', 'AU10_r__variance', 
                    'AU10_r__standard_deviation', 'AU10_r__maximum', 'AU10_r__mean', 'AU10_r__root_mean_square',
                    'AU12_r__sum_values', 'AU12_r__variance', 'AU12_r__standard_deviation', 'AU12_r__maximum', 
                    'AU12_r__mean', 'AU12_r__root_mean_square', 'AU14_r__sum_values', 'AU14_r__variance', 
                    'AU14_r__standard_deviation', 'AU14_r__maximum', 'AU14_r__mean', 'AU14_r__root_mean_square', 
                    'AU15_r__sum_values', 'AU15_r__variance', 'AU15_r__standard_deviation', 'AU15_r__maximum', 
                    'AU15_r__mean', 'AU15_r__root_mean_square', 'AU17_r__sum_values', 'AU17_r__variance', 
                    'AU17_r__standard_deviation', 'AU17_r__maximum', 'AU17_r__mean', 'AU17_r__root_mean_square', 
                    'AU20_r__sum_values', 'AU20_r__variance', 'AU20_r__standard_deviation', 'AU20_r__maximum', 
                    'AU20_r__mean', 'AU20_r__root_mean_square', 'AU23_r__sum_values', 'AU23_r__variance', 
                    'AU23_r__standard_deviation', 'AU23_r__maximum', 'AU23_r__mean', 'AU23_r__root_mean_square', 
                    'AU25_r__sum_values', 'AU25_r__variance', 'AU25_r__standard_deviation', 'AU25_r__maximum', 
                    'AU25_r__mean', 'AU25_r__root_mean_square', 'AU26_r__sum_values', 'AU26_r__variance', 
                    'AU26_r__standard_deviation', 'AU26_r__maximum', 'AU26_r__mean', 'AU26_r__root_mean_square', 
                    'AU45_r__sum_values', 'AU45_r__variance', 'AU45_r__standard_deviation', 'AU45_r__maximum', 
                    'AU45_r__mean', 'AU45_r__root_mean_square']

print(len(columns_ts_fresh))

102


In [24]:
columns_features =  ['VVR_1', 'VVR_2', 'AU01_r__sum_values', 'AU01_r__variance', 'AU01_r__standard_deviation', 'AU01_r__maximum', 
                    'AU01_r__mean', 'AU01_r__root_mean_square', 'AU02_r__sum_values', 'AU02_r__variance', 
                    'AU02_r__standard_deviation', 'AU02_r__maximum', 'AU02_r__mean', 'AU02_r__root_mean_square', 
                    'AU04_r__sum_values', 'AU04_r__variance', 'AU04_r__standard_deviation', 'AU04_r__maximum', 
                    'AU04_r__mean', 'AU04_r__root_mean_square', 'AU05_r__sum_values', 'AU05_r__variance',
                    'AU05_r__standard_deviation', 'AU05_r__maximum', 'AU05_r__mean', 'AU05_r__root_mean_square', 
                    'AU06_r__sum_values', 'AU06_r__variance', 'AU06_r__standard_deviation', 'AU06_r__maximum', 
                    'AU06_r__mean', 'AU06_r__root_mean_square', 'AU07_r__sum_values', 'AU07_r__variance', 
                    'AU07_r__standard_deviation', 'AU07_r__maximum', 'AU07_r__mean', 'AU07_r__root_mean_square', 
                    'AU09_r__sum_values', 'AU09_r__variance', 'AU09_r__standard_deviation', 'AU09_r__maximum',
                    'AU09_r__mean', 'AU09_r__root_mean_square', 'AU10_r__sum_values', 'AU10_r__variance', 
                    'AU10_r__standard_deviation', 'AU10_r__maximum', 'AU10_r__mean', 'AU10_r__root_mean_square',
                    'AU12_r__sum_values', 'AU12_r__variance', 'AU12_r__standard_deviation', 'AU12_r__maximum', 
                    'AU12_r__mean', 'AU12_r__root_mean_square', 'AU14_r__sum_values', 'AU14_r__variance', 
                    'AU14_r__standard_deviation', 'AU14_r__maximum', 'AU14_r__mean', 'AU14_r__root_mean_square', 
                    'AU15_r__sum_values', 'AU15_r__variance', 'AU15_r__standard_deviation', 'AU15_r__maximum', 
                    'AU15_r__mean', 'AU15_r__root_mean_square', 'AU17_r__sum_values', 'AU17_r__variance', 
                    'AU17_r__standard_deviation', 'AU17_r__maximum', 'AU17_r__mean', 'AU17_r__root_mean_square', 
                    'AU20_r__sum_values', 'AU20_r__variance', 'AU20_r__standard_deviation', 'AU20_r__maximum', 
                    'AU20_r__mean', 'AU20_r__root_mean_square', 'AU23_r__sum_values', 'AU23_r__variance', 
                    'AU23_r__standard_deviation', 'AU23_r__maximum', 'AU23_r__mean', 'AU23_r__root_mean_square', 
                    'AU25_r__sum_values', 'AU25_r__variance', 'AU25_r__standard_deviation', 'AU25_r__maximum', 
                    'AU25_r__mean', 'AU25_r__root_mean_square', 'AU26_r__sum_values', 'AU26_r__variance', 
                    'AU26_r__standard_deviation', 'AU26_r__maximum', 'AU26_r__mean', 'AU26_r__root_mean_square', 
                    'AU45_r__sum_values', 'AU45_r__variance', 'AU45_r__standard_deviation', 'AU45_r__maximum', 
                    'AU45_r__mean', 'AU45_r__root_mean_square']

print(len(columns_features))

104
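The two hand-written lists above follow a regular pattern (17 action units times 6 statistics), so they could also be generated. A minimal sketch, assuming the AU identifiers and statistic names exactly as they appear in the lists; the `_generated` variable names are hypothetical and only used for illustration:

```python
# Action units and statistics as they appear in the column names above
au_ids = ["01", "02", "04", "05", "06", "07", "09", "10", "12",
          "14", "15", "17", "20", "23", "25", "26", "45"]
stats = ["sum_values", "variance", "standard_deviation",
         "maximum", "mean", "root_mean_square"]

# Same ordering as the manual list: all statistics of AU01, then AU02, ...
columns_ts_fresh_generated = [f"AU{au}_r__{stat}" for au in au_ids for stat in stats]

# The two VVR measurements come first, as in columns_features
columns_features_generated = ["VVR_1", "VVR_2"] + columns_ts_fresh_generated

print(len(columns_ts_fresh_generated), len(columns_features_generated))
```

Generating the names avoids keeping two long lists in sync by hand.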


First we'll split the data into a train and test set. 

In total, we have 104 participants.

I started with a test size of 20%. Then there are 83 people in the train set and 21 in the test set. 
With a test size of 30%, there are 72 people in the train set and 32 in the test set. 
Naturally, we stratify on VVR_group. 

In [30]:
train, test = train_test_split(complete_dataset, test_size=0.3, random_state=123, stratify=complete_dataset['VVR_group'])

print(train.shape)
print(test.shape)

(72, 112)
(32, 112)
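To illustrate what stratifying does on a dataset this small, here is a sketch on synthetic labels with the same class counts as VVR_group (81 and 23). The placeholder feature and the `_demo`, `_strat` and `_plain` names are assumptions made for the example, not part of the notebook:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as VVR_group (81 low, 23 high)
y_demo = np.array([0] * 81 + [1] * 23)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo)

# Unstratified split: the number of minority cases in the test set is left to chance
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

print("stratified  :", Counter(y_tr_strat.tolist()), Counter(y_te_strat.tolist()))
print("unstratified:", Counter(y_tr_plain.tolist()), Counter(y_te_plain.tolist()))
```

With stratification the minority count in each part is fixed by the class proportions rather than by the random seed, which matters when only 23 positive cases exist in total.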


Unfortunately, the test set is very small, with only 7 people in the high VVR group. 

In [31]:
columns_to_drop = [ 'ID', 'Sum_12', 'Sum_4567', 'Sum_456', 'VVR_group', 'Condition', 'Date', 'Gender'] 

X_test = test.drop(columns_to_drop, axis=1)
y_test = test['VVR_group']

# Print original class distribution
print('Original dataset shape %s' % Counter(y_test))

Original dataset shape Counter({0: 25, 1: 7})


# Applying SMOTE

In [32]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

We apply SMOTE on the train data. The strategy is to make the minority as big as the majority class. 

In [42]:
columns_to_drop = [ 'ID', 'Sum_12', 'Sum_4567', 'Sum_456', 'VVR_group', 'Condition', 'Date', 'Gender'] 

X_train = train.drop(columns_to_drop, axis=1)
y_train = train['VVR_group']

# Print original class distribution
print('Original dataset shape %s' % Counter(y_train))

# Apply SMOTE to the training data; for over-samplers, 'not majority' is equivalent to the default sampling_strategy='auto'
sm = SMOTE(sampling_strategy='not majority', random_state=42, k_neighbors=5)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Print resampled class distribution
print('Resampled dataset shape %s' % Counter(y_train_res))

Original dataset shape Counter({0: 56, 1: 16})
Resampled dataset shape Counter({1: 56, 0: 56})


In [37]:
display(X_train_res)

Unnamed: 0,VVR_1,VVR_2,AU01_r__sum_values,AU01_r__variance,AU01_r__standard_deviation,AU01_r__maximum,AU01_r__mean,AU01_r__root_mean_square,AU02_r__sum_values,AU02_r__variance,...,AU26_r__standard_deviation,AU26_r__maximum,AU26_r__mean,AU26_r__root_mean_square,AU45_r__sum_values,AU45_r__variance,AU45_r__standard_deviation,AU45_r__maximum,AU45_r__mean,AU45_r__root_mean_square
0,12.000000,13.000000,-0.070220,-0.403569,-0.161650,0.428359,-0.673859,-0.387663,0.994371,0.345623,...,-0.537730,0.225549,-1.036133,-0.951754,-1.025247,-1.526773,-1.479742,-1.237893,-2.161830,-2.049390
1,10.000000,10.000000,2.226447,1.288304,1.124217,0.514376,0.729448,1.091318,3.433959,2.208680,...,0.774622,0.458331,0.368352,0.632011,1.209277,-0.225126,-0.001447,0.619766,-0.202961,-0.127515
2,8.000000,8.000000,-0.328541,-0.286339,-0.049149,0.514376,-0.191466,-0.152619,-0.351110,-0.286124,...,0.539267,0.408449,0.316323,0.450683,-0.834081,-0.909004,-0.642000,0.297037,-1.125698,-0.981679
3,10.000000,8.000000,-0.179816,1.024130,0.956659,0.514376,0.292720,0.824388,-0.176376,0.226388,...,0.891064,0.458331,0.480862,0.766552,-0.208294,0.810246,0.749643,0.257680,0.422548,0.743912
4,8.000000,8.000000,-0.687261,-0.791730,-0.584920,0.514376,-0.763173,-0.767495,-0.530851,-0.455929,...,-0.558621,0.366881,-0.858611,-0.877156,-0.419249,-0.490562,-0.230352,-0.340549,-0.573360,-0.444615
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,15.208788,9.791212,-0.487217,-0.225515,0.005625,0.181101,-0.528458,-0.206212,-0.503146,-0.506388,...,-0.226106,0.412286,-0.469718,-0.456650,-0.012510,0.388434,0.334707,0.252315,0.284033,0.322442
108,16.925449,10.850899,-0.629858,-0.610840,-0.380021,0.392113,-0.678192,-0.572561,-0.503282,-0.437112,...,-0.168894,0.458331,-0.072297,-0.205383,-0.447768,-0.289515,-0.066093,0.619766,-0.590136,-0.295949
109,10.986887,11.973774,1.755451,0.208069,0.373092,0.514376,0.195169,0.312097,0.745968,-0.354988,...,0.398382,0.458331,0.303690,0.353698,3.025754,1.353332,1.082597,0.619766,1.220385,1.279031
110,15.861224,10.544490,-0.583734,-0.648807,-0.418202,0.436879,-0.696419,-0.609791,-0.462160,-0.415041,...,0.646099,0.458331,0.929210,0.852713,-0.537762,-0.516713,-0.266038,0.470967,-0.855955,-0.556206
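The fractional values in the last rows (for example VVR_1 = 15.208788 in row 107) are synthetic samples. The idea behind SMOTE is to interpolate between a minority-class row and one of its nearest minority-class neighbours. The following is a simplified NumPy sketch of that idea, not the imblearn implementation; the function name `smote_like_sample` and the toy data are hypothetical:

```python
import numpy as np

def smote_like_sample(X_min, k=5, rng=None):
    """Create one synthetic row by interpolating between a random minority
    row and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_min))
    # Euclidean distance from row i to every minority row
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    # Position 0 of the sorted order is row i itself (distance 0), so skip it
    neighbours = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

# Toy minority-class rows with two features (think VVR_1, VVR_2)
X_min = np.array([[15.0, 9.0], [16.0, 10.0], [17.0, 11.0],
                  [12.0, 13.0], [14.0, 12.0], [18.0, 10.0]])

synthetic = np.array([smote_like_sample(X_min, k=5, rng=seed) for seed in range(4)])
print(synthetic)
```

Because every synthetic row lies on a line segment between two real minority rows, integer-valued features such as VVR_1 end up with non-integer values after resampling.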


# Adding class weights

I will add class weights to my models because of my imbalanced dataset. 
https://medium.com/@ravi.abhinav4/improving-class-imbalance-with-class-weights-in-machine-learning-af072fdd4aa4 

Results: 
{0: 0.6428571428571429, 1: 2.25}

In [35]:
import numpy as np

def calculate_class_weights(y):
    unique_classes, class_counts = np.unique(y, return_counts=True)
    total_samples = len(y)
    class_weights = {}

    for class_label, class_count in zip(unique_classes, class_counts):
        class_weight = total_samples / (2.0 * class_count)
        class_weights[class_label] = class_weight

    return class_weights

# Assuming 'y' contains the class labels (0s and 1s) for the binary classification problem
class_weights = calculate_class_weights(y_train)
print("Class weights:", class_weights)

# For the random forest used here, the weights would be passed to the constructor:
# RandomForestClassifier(class_weight=class_weights)

Class weights: {0: 0.6428571428571429, 1: 2.25}
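The formula in `calculate_class_weights` (total samples divided by number of classes times class count) is the same one scikit-learn uses for its "balanced" heuristic. A short sketch, assuming the training label counts printed above (56 and 16); the `_demo` names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the same counts as y_train (56 low, 16 high)
y_demo = np.array([0] * 56 + [1] * 16)
classes = np.unique(y_demo)

# "balanced" computes n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights_demo = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights_demo)

# The dictionary can be handed to the classifier when it is created
rf_weighted = RandomForestClassifier(
    n_estimators=200, class_weight=class_weights_demo, random_state=0)
```

Passing `class_weight="balanced"` directly should give the same weights when the model is fitted on the same labels. Note that weights computed from `y_train` describe the original imbalance; after SMOTE the classes in `y_train_res` are already equal in size.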


# Featurizer 

The featurizer is no longer necessary, since I already standardized the values. 

In [16]:
# featurizer = ColumnTransformer(transformers=[("numeric", StandardScaler(), columns_au_12)], remainder='drop')

# Model (not complete but an intermediate step)

Strangely enough, the recall is much higher (0.27 instead of 0.18) when I don't use class weights. 

The model shown here uses neither Recursive Feature Elimination nor K-fold cross-validation; it is fitted once on the resampled training set and scored once on the test set. 

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_recall_curve, auc
from collections import Counter

# Train the RandomForestClassifier directly on the resampled training data
rf_classifier = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf_classifier.fit(X_train_res, y_train_res)

columns_to_drop = [ 'ID', 'Sum_12', 'Sum_4567', 'Sum_456', 'VVR_group', 'Condition', 'Date', 'Gender'] 

# Predict on the test data
pred = rf_classifier.predict(test.drop(columns=columns_to_drop, axis=1))

# Calculate evaluation metrics
accuracy = accuracy_score(test['VVR_group'].values, pred)
report = classification_report(test['VVR_group'].values, pred)
cm = confusion_matrix(test['VVR_group'].values, pred)
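# Note: `pred` holds hard 0/1 labels, so this curve has a single operating point;
# passing predict_proba(...)[:, 1] instead would trace the full precision-recall curve
precision, recall, _ = precision_recall_curve(test['VVR_group'].values, pred)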
auc_pr = auc(recall, precision)

# Print results
print(f"Accuracy on Validation Data: {accuracy}")
print(f"AUC-PR on Validation Data: {auc_pr}")
print("Classification Report:")
print(report)


Accuracy on Validation Data: 0.78125
AUC-PR on Validation Data: 0.4709821428571429
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.92      0.87        25
           1       0.50      0.29      0.36         7

    accuracy                           0.78        32
   macro avg       0.66      0.60      0.62        32
weighted avg       0.75      0.78      0.76        32
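The AUC-PR above is computed from hard class predictions. A precision-recall curve is usually built from scores, so that every threshold contributes a point. A hedged sketch on synthetic data (the dataset from `make_classification` and all `_demo` and `_te` names are stand-ins, not the notebook's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 104 rows with roughly the same imbalance as VVR_group
X_demo, y_demo = make_classification(
    n_samples=104, n_features=20, weights=[0.78, 0.22], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=123)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class; column order follows clf.classes_
scores_pos = clf.predict_proba(X_te)[:, list(clf.classes_).index(1)]

# One curve point per distinct score threshold
precision_demo, recall_demo, _ = precision_recall_curve(y_te, scores_pos)
auc_pr_demo = auc(recall_demo, precision_demo)

# Average precision is a related, step-wise summary of the same curve
ap_demo = average_precision_score(y_te, scores_pos)

print(f"AUC-PR from probabilities: {auc_pr_demo:.3f}")
print(f"Average precision:         {ap_demo:.3f}")
```

In the notebook this would correspond to using `rf_classifier.predict_proba(...)` on the test features instead of `pred`.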



# Recursive feature elimination to optimize the scores - not using SMOTE

I based my code partially on the following article: 
https://machinelearningmastery.com/rfe-feature-selection-in-python/

Here I did apply cross validation, but not yet with hyperparameter tuning, so it's not the final model yet. 

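For this step we don't use SMOTE: cross_val_score creates the train and test folds internally, so with this setup I can't resample only the training part of each fold. So here I make an X and y from the complete dataset and use those. 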
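One way around this restriction is to loop over the folds manually and resample only the training part of each fold. The sketch below uses plain random oversampling as a stand-in for SMOTE so that it depends only on scikit-learn; the helper `oversample_minority` and the synthetic data are hypothetical. With imbalanced-learn installed, placing a SMOTE step inside `imblearn.pipeline.Pipeline` achieves the same effect within `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic stand-in for X and y (104 rows, roughly 78/22 class split)
X_demo, y_demo = make_classification(
    n_samples=104, n_features=20, weights=[0.78, 0.22], random_state=0)

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate class-1 rows until both classes are equal in size.
    Stand-in for SMOTE; assumes class 1 is the minority."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = resample(minority, replace=True,
                     n_samples=len(majority) - len(minority),
                     random_state=random_state)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

fold_recalls = []
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Resample the training fold only; the validation fold stays untouched
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    fold_recalls.append(recall_score(y_demo[val_idx], clf.predict(X_demo[val_idx])))

print(f"Recall: {np.mean(fold_recalls):.3f} ({np.std(fold_recalls):.3f})")
```

Keeping the validation fold free of resampled rows means the score reflects performance on real, imbalanced data.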

In [44]:
complete_dataset = pd.read_csv('/Users/dionnespaltman/Desktop/V4/standardized_complete_dataset.csv', sep=',')
complete_dataset.drop('Unnamed: 0', axis=1, inplace=True)
display(complete_dataset)

Unnamed: 0,ID,Sum_12,Sum_4567,VVR_1,VVR_2,Sum_456,VVR_group,Condition,Date,Gender,...,AU26_r__standard_deviation,AU26_r__maximum,AU26_r__mean,AU26_r__root_mean_square,AU45_r__sum_values,AU45_r__variance,AU45_r__standard_deviation,AU45_r__maximum,AU45_r__mean,AU45_r__root_mean_square
0,23,24.0,37.0,13.0,11.0,27.0,0,2,2020-08-01,2,...,-0.289518,0.458331,-0.574983,-0.554028,-0.333560,0.221977,0.345618,0.548923,0.603537,0.429001
1,24,23.0,37.0,12.0,11.0,28.0,0,2,2020-01-22,2,...,1.793425,0.458331,2.294469,2.317495,0.203487,-0.260024,-0.030426,0.619766,-0.659902,-0.281599
2,25,28.0,44.0,16.0,12.0,33.0,1,2,2020-05-02,2,...,0.557817,0.458331,0.196854,0.400873,-0.376228,0.033873,0.204639,-0.135892,0.093018,0.147314
3,26,30.0,37.0,15.0,15.0,29.0,0,1,2020-06-02,1,...,-0.394321,0.458331,-0.847716,-0.761866,-0.868821,-0.323805,-0.084215,0.541052,-1.092069,-0.441394
4,27,22.0,39.0,11.0,11.0,31.0,0,2,2020-06-02,1,...,1.682971,1.863336,-2.230621,0.556978,2.456784,1.071168,0.914462,0.651252,3.692845,1.996765
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,140,16.0,32.0,8.0,8.0,24.0,0,3,2021-05-26,2,...,-0.033310,0.458331,-0.446575,-0.318795,0.861706,0.199480,0.329100,0.210451,0.318307,0.327148
100,142,20.0,34.0,11.0,9.0,26.0,0,3,2021-05-31,1,...,-0.853016,0.458331,-1.199405,-1.246095,-0.067533,-0.585183,-0.317237,-0.537335,-0.522513,-0.509273
101,144,24.0,35.0,12.0,12.0,27.0,0,3,2021-01-06,1,...,-0.565965,0.458331,-0.851589,-0.878346,-0.720364,-0.806836,-0.534298,-0.340549,-0.961821,-0.835709
102,145,20.0,37.0,11.0,9.0,28.0,0,1,2021-02-06,2,...,-0.485857,0.366881,-0.548002,-0.665071,1.190821,0.027513,0.199753,0.076637,-0.006872,0.113717


In [46]:
columns_to_drop = [ 'ID', 'Sum_12', 'Sum_4567', 'Sum_456', 'VVR_group', 'Condition', 'Date', 'Gender'] 

X = complete_dataset.drop(columns_to_drop, axis=1)
y = complete_dataset['VVR_group']

print(X.shape)

(104, 104)


Recall. 

In [61]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='recall', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Recall: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# Fit the pipeline on the entire dataset
pipeline.fit(X, y)

# Get the selected features
selected_features = X.columns[rfe.support_]
print("Selected Features:")
print(selected_features)


Recall: 0.478 (0.370)
Selected Features:
Index(['VVR_1', 'AU04_r__maximum', 'AU15_r__variance', 'AU20_r__sum_values',
       'AU26_r__sum_values'],
      dtype='object')


Precision. 

In [60]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='precision', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Precision: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# Fit the pipeline on the entire dataset
pipeline.fit(X, y)

# Get the selected features
selected_features = X.columns[rfe.support_]
print("Selected Features:")
print(selected_features)


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Precision: 0.492 (0.363)
Selected Features:
Index(['VVR_1', 'AU05_r__mean', 'AU17_r__variance', 'AU20_r__sum_values',
       'AU26_r__sum_values'],
      dtype='object')


Accuracy. 

In [59]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# Fit the pipeline on the entire dataset
pipeline.fit(X, y)

# Get the selected features
selected_features = X.columns[rfe.support_]
print("Selected Features:")
print(selected_features)


Accuracy: 0.809 (0.085)
Selected Features:
Index(['VVR_1', 'AU05_r__sum_values', 'AU15_r__standard_deviation',
       'AU20_r__sum_values', 'AU26_r__sum_values'],
      dtype='object')
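The three cells above differ only in the `scoring` argument. `cross_validate` accepts several metrics at once, so the pipeline is evaluated on identical folds for all of them. A sketch on synthetic data, with smaller fold and tree counts than the notebook to keep it quick; the data and the `_demo` names are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(
    n_samples=104, n_features=20, weights=[0.78, 0.22], random_state=0)

# Fixed random_state values make the selected features repeatable between runs
pipe_demo = Pipeline(steps=[
    ("s", RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)),
    ("m", RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
metric_names = ["recall", "precision", "accuracy"]
scores_demo = cross_validate(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring=metric_names)

for name in metric_names:
    vals = scores_demo[f"test_{name}"]
    print(f"{name}: {vals.mean():.3f} ({vals.std():.3f})")
```

The three runs above each report a different set of selected features. A likely reason is that `DecisionTreeClassifier()` and `RandomForestClassifier()` were created without a `random_state`, so each run can break ties differently.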


# Model including RFE and hyperparameter tuning in Grid Search (not yet with inner and outer split!)

In [None]:
print(list(X_train_res.columns))
print()
print(list(X_test.columns))

In [None]:
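# 'Date' and 'Gender' are already in columns_to_drop above; errors='ignore' avoids a KeyError if they are gone
X_test.drop(columns=['Date', 'Gender'], inplace=True, errors='ignore')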
print(list(X_test.columns))

In [86]:
print(y_test.values)

[0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0]


In [88]:
print(y_test.shape)

(32,)


In [93]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Create the pipeline with RFE and the model
rfe_new = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model_new = RandomForestClassifier()
pipeline_new = Pipeline(steps=[('s', rfe_new), ('m', model_new)])

# Hyperparameter tuning using grid search
param_grid = {
    'm__n_estimators': [100, 200, 300],  # Number of trees in the forest
    # 'm__max_depth': [None, 10, 20],  # Maximum depth of the tree
    # 'm__min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    # 'm__min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
    # 's__n_features_to_select': [5, 10, 15]  # Number of features to select with RFE
}
grid_search = GridSearchCV(pipeline_new, param_grid, cv=5, scoring='recall', n_jobs=-1)
grid_search.fit(X_train_res, y_train_res)

# Define the cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(grid_search.best_estimator_, X_train_res, y_train_res, cv=cv, scoring='recall')

# Print cross-validation scores
print("Cross-validation scores: ", cv_scores)
print("Mean CV recall: ", cv_scores.mean())
print("Standard deviation of CV recall: ", cv_scores.std())

# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)

# Metrics to evaluate
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

# from sklearn.metrics import roc_auc_score

# # Compute AUC-ROC per class
# y_pred_proba = np.array(best_model.predict_proba(X_test)) # Get predicted probabilities

# print(y_pred_proba)
# print(y_pred_proba.shape)
# print(y_test.shape)

# auc_per_class = roc_auc_score(y_test.values, y_pred_proba, average=None)

# # Print AUC-ROC per class
# print("AUC-ROC per class:")
# for i, auc in enumerate(auc_per_class):
#     print(f"Class {i}: {auc}")


Cross-validation scores:  [0.75       0.90909091 0.72727273 0.90909091 1.        ]
Mean CV recall:  0.8590909090909091
Standard deviation of CV recall:  0.10405021038417814
Best parameters found:
{'m__n_estimators': 100}
              precision    recall  f1-score   support

           0       0.92      0.88      0.90        25
           1       0.62      0.71      0.67         7

    accuracy                           0.84        32
   macro avg       0.77      0.80      0.78        32
weighted avg       0.85      0.84      0.85        32
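The commented-out AUC-ROC attempt in the cell above passes the full two-column probability array to `roc_auc_score`. For a binary target the function expects a single score per sample, typically the probability of the positive class. A minimal sketch on synthetic data; the dataset and `_demo` names are stand-ins for the notebook's `best_model`, `X_test` and `y_test`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/test data
X_demo, y_demo = make_classification(
    n_samples=104, n_features=20, weights=[0.78, 0.22], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=123)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# predict_proba returns one column per class, ordered as in clf_demo.classes_
proba_demo = clf_demo.predict_proba(X_te)
pos_col = list(clf_demo.classes_).index(1)

# Binary case: pass only the positive-class column
roc_auc_demo = roc_auc_score(y_te, proba_demo[:, pos_col])
print(f"ROC AUC: {roc_auc_demo:.3f}")
```

In the binary case there is one AUC value rather than one per class, so the per-class loop is not needed.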



# Model with inner and outer split

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html 

In [96]:
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Declare the inner and outer cross-validation strategies
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Create the pipeline with RFE and the model
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = RandomForestClassifier()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])

param_grid = {
    'm__n_estimators': [100, 200, 300],  # Number of trees in the forest
    # 'm__max_depth': [None, 10, 20],  # Maximum depth of the tree
    # 'm__min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    # 'm__min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
    # 's__n_features_to_select': [10, 20, 30, 40]  # Number of features to select with RFE
}

# Inner cross-validation for parameter search
# (no `scoring` is passed, so candidates are ranked by accuracy, the classifier default)
model = GridSearchCV(
    estimator=pipeline, param_grid=param_grid, cv=inner_cv, n_jobs=2
)

# Outer cross-validation to compute the testing score
test_score = cross_val_score(model, X_train_res, y_train_res, cv=outer_cv, n_jobs=2)
print(
    "The mean score using nested cross-validation is: "
    f"{test_score.mean():.3f} ± {test_score.std():.3f}"
)

# Fit model to training data to get best parameters
model.fit(X_train_res, y_train_res)

# Print best parameters
print(model.best_params_)

# Evaluate on the test set
best_model = model.best_estimator_
y_pred = best_model.predict(X_test)
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred))


The mean score using nested cross-validation is: 0.875 ± 0.047
{'m__n_estimators': 100}

Classification Report on Test Set:
              precision    recall  f1-score   support

           0       0.92      0.92      0.92        25
           1       0.71      0.71      0.71         7

    accuracy                           0.88        32
   macro avg       0.82      0.82      0.82        32
weighted avg       0.88      0.88      0.88        32
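In nested cross-validation each outer fold runs its own grid search, so the chosen hyperparameters may differ per fold. `cross_validate` with `return_estimator=True` keeps the fitted searches so that this can be inspected. A sketch on synthetic data with recall as the metric and stratified folds; the RFE step is left out for brevity, and the data, grid values and `_demo` names are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic, balanced stand-in for the resampled training set
X_demo, y_demo = make_classification(n_samples=112, n_features=20, random_state=0)

inner_cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Inner loop: pick n_estimators by recall
search_demo = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50]},
    cv=inner_cv_demo,
    scoring="recall",
)

# Outer loop: score each tuned model on data its search never saw
nested_demo = cross_validate(
    search_demo, X_demo, y_demo, cv=outer_cv_demo,
    scoring="recall", return_estimator=True)

for fold, (est, score) in enumerate(zip(nested_demo["estimator"], nested_demo["test_score"])):
    print(f"outer fold {fold}: best params {est.best_params_}, recall {score:.3f}")
```

If the best parameters vary a lot between outer folds, that is a hint that the grid search result is not stable on a dataset of this size.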

