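Example dataset. We create a synthetic classification dataset and split it into a training set (data, labels) and a test set (test_data, test_labels). Because shuffle=False is passed to make_classification, the first 20 columns are the informative features and the next 10 are the redundant ones.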

In [1]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

data = make_classification(n_samples=250, n_features=100, n_informative=20, n_redundant=10, random_state=0, shuffle=False)

my_data = pd.DataFrame(data[0])
my_target = data[1]
my_feat_names = ['f{0}'.format(x+1) for x in range(len(my_data.columns))]

data, test_data, labels, test_labels = train_test_split(my_data, my_target, test_size=0.3, random_state=0)

The next cell selects the best parameter combination for the elastic net models in RENT. This step is NOT REQUIRED for running RENT; parameters can also be defined by the user in a later step of this document. If you are using RENT for the first time, you can simply skip this cell.
Note that the parameter "C", which controls the regularization strength, is defined inversely in scikit-learn's LogisticRegression: smaller values of C mean stronger regularization. Keep this in mind when interpreting the results.
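
The inverse role of C can be checked with a minimal sketch that uses only scikit-learn, not RENT itself. It fits an elastic net logistic regression for a few values of C and counts the coefficients that are exactly zero. The names X_demo, y_demo, zero_counts and the chosen C values are assumptions made for this illustration.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data only; the sizes mirror the example dataset of this notebook.
X_demo, y_demo = make_classification(n_samples=250, n_features=100,
                                     n_informative=20, n_redundant=10,
                                     random_state=0, shuffle=False)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide possible convergence warnings
    for C_demo in [0.01, 0.1, 1, 10]:
        model_demo = LogisticRegression(penalty="elasticnet", solver="saga",
                                        C=C_demo, l1_ratio=0.9,
                                        max_iter=5000, random_state=0)
        model_demo.fit(X_demo, y_demo)
        # Count coefficients that the L1 part of the penalty set exactly to zero.
        zero_counts[C_demo] = int(np.sum(model_demo.coef_ == 0))

for C_demo, n_zero in zero_counts.items():
    print("C =", C_demo, "-> zero coefficients:", n_zero)
```

Smaller values of C should leave more coefficients at zero, which is the opposite direction from a penalty weight that grows with the regularization strength.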

In [None]:
import parameter_selection as ps
import warnings

# Suppress all warnings, including the convergence warnings.
warnings.filterwarnings("ignore")

my_reg_params = [0.1,1,10]
my_l1_params = [0,0.1,0.25, 0.5, 0.75, 0.9, 1]
testsize_range = (0.25, 0.25)

best_C, best_l1 = ps.parameter_selection(data=data, labels=labels, 
                     my_reg_params=my_reg_params, 
                     my_l1_params=my_l1_params,
                     n_splits=5, 
                     testsize_range = testsize_range)

In [None]:
print("best C: ", best_C)
print("best l1: ", best_l1)

The values above are the best parameter combination found by this procedure. As stated before, this step is not necessary: you can also pass several parameter values you would like to try directly to RENT, and RENT then reports the best parameter combination found. The procedure above is useful for speeding up the workflow, because RENT is then run with only one parameter combination. For the basic application of RENT, see the next cell.

RENT offers different settings which are described in the RENT_parallel file. The setting here is the standard setting used in our paper with fewer tt_splits (faster computation). 

In [2]:
import RENT_parallel as fs
# C parameters you would like to try
my_reg_params = [0.1, 1]
# l1-strengths you would like to try
my_l1_params = [0.5, 0.9]

analysis = fs.RENT(data=data,
                   target=labels,
                   feat_names=data.columns,
                   reg_params=my_reg_params,
                   poly='OFF',
                   scoring='f1',
                   clf='logreg',
                   testsize_range=(0.25, 0.25),
                   num_tt=20,
                   num_w_init=1,
                   l1_params=my_l1_params,
                   verbose=0)


analysis.run_analysis()

  if(self.clf is not "RM" and self.clf is not "linSVC"):


Dim data: (175, 100)
Dim target (175,)
reg param C: [0.1, 1]
l1_params: [0.5, 0.9]
num TT splits: 20
num weight inits: 1
data type: <class 'pandas.core.frame.DataFrame'>
verbose: 0


We can now take a closer look at some of the quantities that RENT computes.

In [None]:
scores = analysis.get_scores_summary_by_regParam()
zeroes = analysis.get_average_zero_features()


print(scores)
print(zeroes)

In [None]:
import numpy as np

normed_scores = (scores-np.nanmin(scores.values))/(np.nanmax(scores.values)-np.nanmin(scores.values))
normed_zeroes = (zeroes-np.nanmin(zeroes.values))/(np.nanmax(zeroes.values)-np.nanmin(zeroes.values))
normed_zeroes = normed_zeroes.astype("float")

combi = (normed_scores ** -1 + normed_zeroes ** -1) ** -1
print(combi)

From the combination matrix we see that the combination C = 0.1, l1 = 0.9 has the highest value. We will now use it as the "best" combination.
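
The combination step can be followed on a small worked example. The sketch below uses made-up numbers (demo_scores and demo_zeroes are hypothetical, not RENT output), applies the same min-max normalization and the same combination expression as the cell above, and then looks up the position of the largest entry. Rows are taken to be l1 values and columns to be C values, as in the lookup code of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration; rows are l1 values, columns are C values.
demo_scores = pd.DataFrame([[0.80, 0.90], [0.85, 0.88]],
                           index=[0.5, 0.9], columns=[0.1, 1.0])
demo_zeroes = pd.DataFrame([[0.2, 0.1], [0.6, 0.3]],
                           index=[0.5, 0.9], columns=[0.1, 1.0])


def min_max(df):
    """Scale all entries of df to [0, 1] using the global minimum and maximum."""
    lo, hi = np.nanmin(df.values), np.nanmax(df.values)
    return (df - lo) / (hi - lo)


a = min_max(demo_scores)
b = min_max(demo_zeroes)

# Same expression as in the notebook: 1 / (1/a + 1/b) = a*b / (a + b),
# which is half the harmonic mean. An entry that is 0 in either matrix gives 0.
demo_combi = (a ** -1 + b ** -1) ** -1

# Position of the largest entry, mapped back to the l1 (row) and C (column) labels.
row, col = np.unravel_index(np.nanargmax(demo_combi.values), demo_combi.shape)
best_l1_demo, best_C_demo = demo_combi.index[row], demo_combi.columns[col]
print(demo_combi)
print("best l1:", best_l1_demo, "best C:", best_C_demo)
```

Because the expression behaves like a harmonic mean, a combination only scores well if both the normalized prediction score and the normalized number of zero features are high; a combination that is worst in either matrix drops to 0.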

In [None]:
best_combi_row, best_combi_col = np.where(combi == np.nanmax(combi.values))
l1 = combi.index[np.nanmax(best_combi_row)]
C = combi.columns[np.nanmin(best_combi_col)]
print("C: ", C, "l1: ", l1)

summary_spec_weights, sel_feat_df, variables = analysis.get_spec_weights_summary(reg_param=C, l1_param=l1, cutoff_perc=0.9, cutoff_means_ratio=0.9, cutoff_mean_std_ratio=0.975,  sel_approach = "new")

The plot above shows the counts for each feature.

In [None]:
# summary_spec_weights contains the summary statistics for each feature
summary_spec_weights

In [None]:
# sel_feat_df contains the selected features
sel_feat_df

We can also perform a feasibility study (see the paper).

In [None]:
analysis.feasibility_study(test_data=test_data, test_labels=test_labels, feature_size= len(sel_feat_df.columns), 
                          features=sel_feat_df.columns)

In [None]:
# Predict the test set with a model trained on the selected features only.

from sklearn.metrics import f1_score, precision_score, recall_score, matthews_corrcoef, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression as LR

# Fit the scaler on the training data only, then apply it to the test data.
sc = StandardScaler()
train_data_1 = sc.fit_transform(data.loc[:, sel_feat_df.columns])
test_data_1 = sc.transform(test_data.loc[:, sel_feat_df.columns])
# Unpenalized logistic regression. Newer scikit-learn releases use
# penalty=None in place of the string 'none'.
model = LR(penalty='none', max_iter=8000, solver="saga", random_state=0).\
        fit(train_data_1, labels)
test_pred = model.predict(test_data_1)
print("Selected features f1 1: ", f1_score(test_labels, test_pred))
# F1 score with class 0 treated as the positive class.
print("Selected features f1 0: ", f1_score(1 - test_labels, 1 - test_pred))
print("Selected features acc: ", accuracy_score(test_labels, test_pred))
print("Selected features matthews: ", matthews_corrcoef(test_labels, test_pred))




Besides feature selection, RENT can summarize the predictive behavior of individual samples. Before we can generate plots for them, we need to check how often each sample was classified incorrectly.
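
The idea of counting how often a sample is classified incorrectly can be sketched with plain scikit-learn. This is an illustration under stated assumptions, not RENT's implementation: the data, the plain LogisticRegression model, the ShuffleSplit settings and all variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

# Small illustrative dataset (hypothetical, unrelated to the notebook data).
X_demo, y_demo = make_classification(n_samples=120, n_features=10,
                                     n_informative=4, random_state=0)

n_test = np.zeros(len(y_demo), dtype=int)   # times a sample was in a test split
n_wrong = np.zeros(len(y_demo), dtype=int)  # times it was misclassified there

splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X_demo):
    clf_demo = LogisticRegression(max_iter=1000)
    clf_demo.fit(X_demo[train_idx], y_demo[train_idx])
    pred = clf_demo.predict(X_demo[test_idx])
    # Indices within one split are unique, so in-place addition is safe here.
    n_test[test_idx] += 1
    n_wrong[test_idx] += (pred != y_demo[test_idx]).astype(int)

# Percentage of incorrect predictions per sample; NaN if never in a test split.
with np.errstate(divide="ignore", invalid="ignore"):
    pct_wrong = np.where(n_test > 0, 100.0 * n_wrong / n_test, np.nan)
print(pct_wrong[:10])
```

Samples with a high percentage are misclassified in most of the splits in which they are held out, which is the kind of per-sample information the following cells inspect.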

In [None]:
inc = analysis.get_spec_incorr_lables(0.1, 0.9)

In [None]:
inc[0]

In [None]:
analysis.confusion_variance_plot()

To get an overview of the logistic regression predictions for each sample, we first need to store them.

In [None]:
analysis.pred_proba()
analysis.pred_proba_plot(0.1, 0.9, [1,2])