# Cognitive, behavioral and social data
**DATASET**: PCL5  
**Author**: Mattia Brocco

MERGE OF DATASETS FOR **R_NEO_PI**
```python
import pandas as pd  # data_dir is defined in the setup cell (In [1]) below


def flatten_header(df):
    """Collapse the two-row Excel header into single-level column names."""
    cols = pd.Series(df.columns)
    # Group labels appear once per group; pandas names the gaps "Unnamed: n",
    # so blank those out and forward-fill the last real label
    top = cols.mask(cols.str.contains("Unnamed")).ffill()
    df.columns = [" ".join([t, s]) for t, s in zip(top, df.iloc[0])]
    # Row 0 held the second header level, so drop it from the data
    return df.drop(0).reset_index(drop=True)


a = flatten_header(pd.read_excel(data_dir + "\\R_NEO_PI_Faked.xlsx"))
b = flatten_header(pd.read_excel(data_dir + "\\R_NEO_PI_Honest.xlsx"))

# Label each sheet with its experimental condition before stacking them
a["CONDITION"] = "FAKE"
b["CONDITION"] = "HONEST"

pd.concat([a, b], ignore_index=True).to_excel(data_dir + "\\R_NEO_PI.xlsx", index=False)
```

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


import support
from engine import Classification

%load_ext autoreload
%autoreload 2

data_dir = ".\\data"

pd.options.display.max_columns = 500

### Datasets at hand

In [20]:
data_descr = pd.DataFrame(dict(zip([f for f in os.listdir(data_dir) if "feather" in f],
                                   [pd.read_feather(f"{data_dir}\\{f}").shape for f in os.listdir(data_dir)
                                    if "feather" in f]))).T.reset_index()
data_descr = data_descr.rename(columns = {"index": "Data", 0: "Sample size",
                                          1: "Features"})
data_descr

| | Data | Sample size | Features |
|---|---|---|---|
| 0 | BF_df_CTU.feather | 442 | 11 |
| 1 | BF_df_OU.feather | 460 | 11 |
| 2 | BF_df_V.feather | 486 | 11 |
| 3 | DT_df_CC.feather | 482 | 28 |
| 4 | DT_df_JI.feather | 864 | 28 |
| 5 | IADQ_df.feather | 450 | 10 |
| 6 | IESR_df.feather | 358 | 23 |
| 7 | NAQ_R_df.feather | 712 | 23 |
| 8 | PCL5_df.feather | 402 | 21 |
| 9 | PHQ9_GAD7_df.feather | 1118 | 17 |


---
## Design a pipeline
***

The goal is to find, for each dataset, a stable subset of features on which different classifiers perform roughly the same. Accordingly, we define a "good" feature selection procedure as one that does not depend on a specific model, but that lets all models reach similar performance on every dataset within the scope.

##### DESCRIPTION
1. A given dataset is split into training and test sets. For every feature, the mean and the standard deviation are computed on the training set in order to scale that feature: $Z=\frac{X-\mu}{\sigma}$. The test set is scaled using the same values computed on the training data.
2. The selection of a subset of features then proceeds through the following steps:  
    * Train a Decision Tree and apply minimal cost-complexity pruning. <sup>[1]</sup>  
    * Train a Random Forest that exploits gradient boosting <sup>[2]</sup>, in which each tree retains the cost-complexity parameter obtained in the previous step.  
    * Compute permutation importance on this Random Forest <sup>[3]</sup>.  
    * Perform a one-sample test on the mean of the importance distribution obtained for each feature (at the 99.999% confidence level). This way only features whose importance is significantly greater than zero are retained, and all the others are discarded. We call this subset of features $A^*$.  
    * Perform a Wilks test <sup>[4]</sup> comparing two logistic regressions, one fitted with the full set of features, the other with $A^*$. Failing to reject the null hypothesis of the test (at the 95% confidence level) supports the assertion "the ratio between the likelihoods of the two (nested) models is one", i.e. the discarded features add no significant explanatory power.  
    * Train an arbitrary number of different models in order to assess the quality of the feature selection procedure. If all models show very close accuracy, then the procedure has produced a model-indifferent subset $A^*$ of features.
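
The scaling and selection steps above can be sketched roughly as follows. This is a minimal illustration and not the code behind `Classification.variable_selection`, whose internals are not shown in this notebook. It assumes scikit-learn and SciPy, runs on synthetic data, picks the pruning parameter by cross-validation, and uses a plain `RandomForestClassifier` rather than the gradient-boosting variant mentioned above. Computing the permutation importance on the training data is also an assumption.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one dataset (assumption: not the real questionnaire data)
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 1: standardize with training statistics only, then reuse them on the test set
scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

# Minimal cost-complexity pruning: candidate alphas come from the pruning path,
# and the one with the best cross-validated accuracy is kept
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Z_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5).fit(Z_train, y_train)
best_alpha = search.best_params_["ccp_alpha"]

# Random forest whose trees all reuse the selected pruning parameter
forest = RandomForestClassifier(n_estimators=200, ccp_alpha=best_alpha,
                                random_state=0).fit(Z_train, y_train)

# Permutation importance: one row of repeated importance draws per feature
perm = permutation_importance(forest, Z_train, y_train, n_repeats=30, random_state=0)

# One-sided, one-sample t-test per feature: is the mean importance greater than zero?
_, p_values = stats.ttest_1samp(perm.importances, popmean=0.0, axis=1,
                                alternative="greater")
# Features with constant importance give NaN p-values and are therefore discarded
selected = np.flatnonzero(p_values < 1e-5)  # 99.999% confidence level
print("Selected feature indices:", selected.tolist())
```

Here `selected` plays the role of $A^*$. The number of permutation repeats and the forest size are arbitrary choices for the sketch.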


##### SOURCES
* [How can I get statistics to compare nested models in a logistic regression in SPSS?](https://www.ibm.com/support/pages/how-can-i-get-statistics-compare-nested-models-logistic-regression-spss)
* [Likelihood-ratio test](https://en.wikipedia.org/wiki/Likelihood-ratio_test)
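
The likelihood-ratio (Wilks) test between the full logistic regression and the one restricted to $A^*$ can be sketched as follows. This is one possible way to compute it, assuming scikit-learn and SciPy; the helper names `log_likelihood` and `wilks_test` are hypothetical, the data are synthetic, and a very large `C` is used to approximate an unpenalized maximum-likelihood fit. The statistic $2(\ell_{full} - \ell_{A^*})$ is compared with a $\chi^2$ distribution whose degrees of freedom equal the number of discarded features.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def log_likelihood(X, y):
    """Log-likelihood of a (nearly) unpenalized logistic regression fitted on X."""
    # A very large C makes the default L2 penalty negligible (assumption of this sketch)
    model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    return -log_loss(y, model.predict_proba(X), normalize=False)


def wilks_test(X, y, subset):
    """Likelihood-ratio test of the model restricted to `subset` against the full model."""
    ll_full = log_likelihood(X, y)
    ll_reduced = log_likelihood(X[:, subset], y)
    # Clip at zero to absorb small optimizer inaccuracies
    statistic = max(2.0 * (ll_full - ll_reduced), 0.0)
    dof = X.shape[1] - len(subset)  # number of discarded features
    return statistic, chi2.sf(statistic, dof)


# Synthetic example: only the first two of six features drive the outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

stat_good, p_good = wilks_test(X, y, [0, 1])  # keeps the informative features
stat_bad, p_bad = wilks_test(X, y, [2, 3])    # keeps only noise features
print(f"informative subset: LR = {stat_good:.3f}, p = {p_good:.4f}")
print(f"noise subset:       LR = {stat_bad:.3f}, p = {p_bad:.4g}")
```

A large p-value means the restricted model is not significantly worse than the full one, which is the outcome the pipeline looks for.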

##### SUPPORTING PAPERS
[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. *Classification and Regression Trees*. Wadsworth, Belmont, CA, 1984.  
[2] J. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. *The Annals of Statistics*, 29(5), 2001.  
[3] Leo Breiman. Random forests. *Machine learning*, 45(1):5-32, 2001.  
[4] Li, Bing; Babu, G. Jogesh (2019). *A Graduate Course on Statistical Inference*. Springer. p. 331
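
The final benchmarking step of the pipeline (training several models on $A^*$ and comparing their accuracies) might look like the sketch below. It is not the code of `Classification.benchmark_models`, which is not shown in this notebook. The model list mirrors the names printed in the outputs, but the hyperparameters, the synthetic data and the `subset` indices are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; with shuffle=False the informative features are columns 0-2
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
subset = [0, 1, 2]  # hypothetical A*, standing in for the selected features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

# Baseline: logistic regression on the full feature set
scores = {"Full Logit": LogisticRegression(max_iter=1000)
          .fit(Z_train, y_train).score(Z_test, y_test)}

# Candidate models, all trained on the selected subset only
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(Z_train[:, subset], y_train)
    scores[name] = model.score(Z_test[:, subset], y_test)

scores = pd.Series(scores)
# A small spread across models (full logit excluded) suggests a model-indifferent subset
print(scores)
print("Accuracy std on selected features:", scores.iloc[1:].std())
```

The standard deviation across the four models is the same quantity reported in the summary table as "Accuracy std on selected features".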

In [18]:
### PIPELINE
# Organize datasets
data_collection = {}
for dataset in [f for f in os.listdir(data_dir) if "feather" in f]:
    print(dataset.split(".")[0])
    a, b, c, d = Classification().prepare_data(f"{data_dir}\\{dataset}", "CONDITION")
    e = Classification().variable_selection(a, b, c, d)
    f = Classification().benchmark_models(a, b, c, d, e)
    data_collection[dataset.split(".")[0]] = [a, b, c, d, e, f]
    print(e)
    print(f)
    print("-" * 50)
    print()

BF_df_CTU
Train size: 309
Selected 4 features out of 10
{'Features': [2, 4, 6, 7], 'Wilks test p-value': 0.9893064, 'High correlation': False}
Full Logit             0.774436
Logistic Regression    0.812030
SVC                    0.842105
Random Forest          0.827068
Neural Network         0.812030
dtype: float64
--------------------------------------------------

BF_df_OU
Train size: 322
Selected 4 features out of 10
{'Features': [0, 4, 6, 7], 'Wilks test p-value': 1.0, 'High correlation': False}
Full Logit             0.833333
Logistic Regression    0.797101
SVC                    0.847826
Random Forest          0.840580
Neural Network         0.833333
dtype: float64
--------------------------------------------------

BF_df_V
Train size: 340
Selected 2 features out of 10
{'Features': [4, 7], 'Wilks test p-value': 0.9997179, 'High correlation': False}
Full Logit             0.760274
Logistic Regression    0.719178
SVC                    0.719178
Random Forest          0.726027
Neur

## Summarize results
***

In [56]:
summary = []
for k, v in data_collection.items():
    
    # Dataset name
    # Sample size
    # Training size
    # Initial number of features
    # Selected features
    # ACCURACY: Full Logit
    # ACCURACY: Logistic Regression
    # ACCURACY: SVC
    # ACCURACY: Random Forest
    # ACCURACY: Neural Network
    # Average accuracy (full logit excluded)
    # Accuracy Standard deviation (full logit excluded)
    
    summary += [[k, len(v[2]) + len(v[3]), len(v[2]),
                 v[0].shape[1], len(v[4]["Features"]),
                 *v[5].tolist(), v[5][1:].mean(), v[5][1:].std()]]
    
summary = pd.DataFrame(summary,
                       columns = ["Dataset name", "Sample size", "Training size", "Number of Features",
                                  "Selected Features", "ACCURACY - Logit with all features",
                                  "ACCURACY - Logistic Regression", "ACCURACY - SVM", "ACCURACY - Random Forest",
                                  "ACCURACY - Neural Network", "Average Accuracy on selected features",
                                  "Accuracy std on selected features"])

summary

| | Dataset name | Sample size | Training size | Number of Features | Selected Features | ACCURACY - Logit with all features | ACCURACY - Logistic Regression | ACCURACY - SVM | ACCURACY - Random Forest | ACCURACY - Neural Network | Average Accuracy on selected features | Accuracy std on selected features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BF_df_CTU | 442 | 309 | 10 | 4 | 0.774436 | 0.81203 | 0.842105 | 0.827068 | 0.81203 | 0.823308 | 0.014397 |
| 1 | BF_df_OU | 460 | 322 | 10 | 4 | 0.833333 | 0.797101 | 0.847826 | 0.84058 | 0.833333 | 0.82971 | 0.02253 |
| 2 | BF_df_V | 486 | 340 | 10 | 2 | 0.760274 | 0.719178 | 0.719178 | 0.726027 | 0.719178 | 0.72089 | 0.003425 |
| 3 | DT_df_CC | 482 | 337 | 27 | 5 | 0.682759 | 0.717241 | 0.717241 | 0.731034 | 0.703448 | 0.717241 | 0.011262 |
| 4 | DT_df_JI | 864 | 604 | 27 | 4 | 0.661538 | 0.646154 | 0.584615 | 0.592308 | 0.638462 | 0.615385 | 0.031404 |
| 5 | IADQ_df | 450 | 315 | 9 | 3 | 0.851852 | 0.851852 | 0.837037 | 0.837037 | 0.844444 | 0.842593 | 0.007092 |
| 6 | IESR_df | 358 | 250 | 22 | 4 | 0.935185 | 0.888889 | 0.925926 | 0.907407 | 0.925926 | 0.912037 | 0.01773 |
| 7 | NAQ_R_df | 712 | 498 | 22 | 6 | 0.953271 | 0.96729 | 0.976636 | 0.971963 | 0.976636 | 0.973131 | 0.004474 |
| 8 | PCL5_df | 402 | 281 | 20 | 2 | 0.809917 | 0.826446 | 0.826446 | 0.818182 | 0.826446 | 0.82438 | 0.004132 |
| 9 | PHQ9_GAD7_df | 1118 | 782 | 16 | 3 | 0.991071 | 0.979167 | 0.982143 | 0.97619 | 0.979167 | 0.979167 | 0.00243 |
