# RENT applied to a binary classification problem

This Jupyter notebook illustrates how to apply RENT to your data for feature selection in a *binary classification* problem. It also complements the manuscript published at arXiv.org.

[RENT -- Repeated Elastic Net Technique for Feature Selection](https://arxiv.org/abs/2009.12780)

For an example of how to use RENT for feature selection on a **regression problem**, please have a look at this [Jupyter Notebook](https://github.com/NMBU-Data-Science/RENT/blob/master/src/RENT/Regression_example.ipynb).

## Content

1. [Load Wisconsin Breast Cancer dataset](#Load-Wisconsin-Breast-Cancer-dataset)
2. [Define RENT ensemble for binary classification](#Define-RENT-ensemble-for-binary-classification)

---

First, import the required modules and apply some settings to the Jupyter notebook for better visualisation of the results.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 2000)
import RENT

import warnings
warnings.filterwarnings("ignore")

### Load Wisconsin Breast Cancer dataset

Now load the data from scikit-learn, store it in a pandas DataFrame and split it into a training set and a test set.

In [2]:
from sklearn.datasets import load_breast_cancer
wisconsin = load_breast_cancer()
data = pd.DataFrame(wisconsin.data)
target = wisconsin.target

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(data, target, random_state=0, shuffle=True)

In [None]:
train_data.head()

### Define RENT ensemble for binary classification

**The main idea**

Using the RENT approach we will train an **ensemble of unique models** based on **unique subsets** of the training data. Since each model is trained on a unique subset of the training data, all models will be slightly different from each other and elastic net regularisation **may select different features for each model**. 

We investigate **how consistently elastic net selects features** across all models by analysing the distribution of each feature's weights over the ensemble. Using specific criteria $\tau_1$, $\tau_2$ and $\tau_3$ applied to those weight distributions, we can regulate how aggressively RENT selects features from the full set of features.
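The idea above can be sketched with plain scikit-learn. This is a rough illustration under stated assumptions, not RENT's implementation: the toy data, the ensemble size `K = 20`, the two summary statistics and the 0.9 cutoffs are all chosen here for demonstration and are not taken from the RENT package or its defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

K = 20  # number of models in the toy ensemble (illustrative value)
weights = np.zeros((K, X.shape[1]))

for k in range(K):
    # Each model is trained on a different random 75 % subset of the data.
    X_sub, _, y_sub, _ = train_test_split(X, y, test_size=0.25,
                                          random_state=k, stratify=y)
    X_sub = StandardScaler().fit_transform(X_sub)
    model = LogisticRegression(penalty="elasticnet", solver="saga", C=0.1,
                               l1_ratio=0.5, max_iter=5000, random_state=0)
    model.fit(X_sub, y_sub)
    weights[k] = model.coef_.ravel()

# Fraction of models in which each feature received a non-zero weight.
selection_frequency = (weights != 0).mean(axis=0)

# Absolute mean of the weight signs: 1.0 means every model assigned
# a non-zero weight with the same sign to that feature.
sign_consistency = np.abs(np.sign(weights).mean(axis=0))

# Illustrative cutoffs (assumed here, not RENT's defaults).
selected = np.where((selection_frequency >= 0.9) & (sign_consistency >= 0.9))[0]

print("selection frequency:", np.round(selection_frequency, 2))
print("sign consistency:   ", np.round(sign_consistency, 2))
print("selected feature indices:", selected)
```

Raising the cutoffs keeps only features that are selected in nearly every model, while lowering them admits features that elastic net picks less consistently. RENT wraps this kind of analysis, together with parameter selection, behind the `RENT_Classification` class used in the next cell.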

In [None]:
# Define a range of regularisation parameters C for elastic net. At least one value is required.
my_C_params = [0.1, 1, 10]

# Define a range of l1-ratios for elastic net. At least one value is required.
my_l1_ratios = [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1]

analysis = RENT.RENT_Classification(data=train_data, 
                                    target=train_labels, 
                                    feat_names=train_data.columns, 
                                    C=my_C_params, 
                                    l1_ratios=my_l1_ratios,
                                    parameter_selection=True,
                                    poly='OFF',
                                    testsize_range=(0.25,0.25),
                                    scoring='mcc',
                                    method='logreg',
                                    K=100,
                                    verbose=0)

In [None]:
analysis.train()

In [None]:
analysis.summary_criteria()

In [None]:
# The elastic net parameter setting can be inspected and changed after training
analysis.get_enet_params()

In [None]:
analysis.set_enet_params(1,1)

In [None]:
analysis.get_enet_params()

In [None]:
analysis.set_enet_params(0.1, 1)

In [None]:
# The matrices contain only one element, as parameter selection was done beforehand
analysis.get_enetParam_matrices()

In [None]:
analysis.get_object_probabilities()

In [None]:
analysis.plot_object_probabilities(object_id=[293,332])

In [None]:
analysis.get_weight_distributions()

In [None]:
analysis.summary_objects()

In [None]:
fs_vars = analysis.selectFeatures(tau_1=0.9, tau_2=0.9, tau_3=0.975)

In [None]:
analysis.plot_selection_frequency()

In [None]:
analysis.get_runtime()

In [None]:
analysis.plot_object_PCA(group=0)

In [None]:
analysis.plot_object_PCA(group=1)

In [None]:
analysis.plot_object_PCA(group='both')

In [None]:
# Predict test data

# Import what is needed for prediction and evaluation of predictions from test set
from sklearn.metrics import f1_score, precision_score, recall_score, matthews_corrcoef, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression as LR

# Scale the data accordingly
sc = StandardScaler()
train_data_1 = sc.fit_transform(train_data.iloc[:, fs_vars])
test_data_1 = sc.transform(test_data.iloc[:, fs_vars])

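# Train an unregularised logistic regression model on the selected features
# (newer scikit-learn releases expect penalty=None instead of penalty='none')
model = LR(penalty='none', max_iter=8000, solver="saga", random_state=0).\
        fit(train_data_1, train_labels)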

# Predict once and print results
test_pred = model.predict(test_data_1)
print("f1 1: ", f1_score(test_labels, test_pred))
print("f1 0: ", f1_score(1 - test_labels, 1 - test_pred))
print("Accuracy: ", accuracy_score(test_labels, test_pred))
print("Matthews correlation coefficient: ", matthews_corrcoef(test_labels, test_pred))
