## Feature selection graph generator
This notebook generates the JSON input for the feature selection solver.
Fill in the first cell with these parameters:
* input_filename: Relative or absolute path of your dataset file
* output_filename: Relative or absolute path of your JSON output file (this will be the JSON input for the quantum solver)
* difference_field: Name of the classification (target) column in your dataset, as used by your ML algorithm
* cols_excluded: Columns to exclude from the analysis, typically text columns. If you want to include non-numerical columns, you first need to transform them into numerical columns.
* (optional) lower_bound and/or upper_bound: These two optional parameters set the range of solution lengths (number of selected features) the solver will try. By default, the solver uses the range (1..N), where N is the number of features in your dataset. The solver runs once for each value from lower_bound to upper_bound, so the number of solver executions is $\text{upper\_bound} - \text{lower\_bound} + 1$
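
As a minimal sketch of how the two bounds translate into solver runs: the helper name `solver_runs` and its range check are assumptions made for illustration and are not part of the solver itself.

```python
def solver_runs(num_features, lower_bound=None, upper_bound=None):
    """Return the solution sizes a solver would iterate over (hypothetical helper)."""
    # Defaults follow the description above: the range 1..N
    lower = 1 if lower_bound is None else lower_bound
    upper = num_features if upper_bound is None else upper_bound
    # Assumed sanity check; the real solver may validate differently
    if not 1 <= lower <= upper <= num_features:
        raise ValueError("expected 1 <= lower_bound <= upper_bound <= num_features")
    return list(range(lower, upper + 1))


sizes = solver_runs(num_features=12, lower_bound=8, upper_bound=8)
print(sizes, len(sizes))
```

With lower_bound and upper_bound both set to 8, as in the configuration cell of this notebook, the solver would run exactly once.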

For testing purposes, you can use this dataset: https://www.kaggle.com/code/ryanholbrook/binary-classification/data?select=songs.csv

The quantum solver output will have this format:
$[\,\{\, i : [a_{1}, \ldots, a_{i}] \,\}\,], \quad i = \text{lower\_bound}, \ldots, \text{upper\_bound}$

In [16]:
#Configuration variables

input_filename="..//Test//songs.csv" #Name of your file
output_filename='..//Feature_selection_quantum//input.json' #Name of the JSON output file
cols_excluded=[] #Columns to be discarded
difference_field='year' #Classification column
extra_arguments={'lower_bound':8,
                 'upper_bound':8} #The feature selection algorithm will run for every number of selected features between lower_bound and upper_bound.

In [9]:
import pandas as pd
import itertools
import json
import multiprocessing, logging
from joblib import Parallel, delayed
import sys
from tqdm import tqdm
import numpy as np

### Auxiliary functions
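These functions implement standard quantities from information theory (https://en.wikipedia.org/wiki/Information_theory): entropy, conditional entropy, mutual information, and conditional mutual information.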
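
A small self-contained sanity check of the underlying definitions may help before reading the cell. The helpers `entropy_bits` and `mutual_information_bits` are illustrative names, not the notebook's functions, and they only cover the two-variable case using the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which is the same identity the notebook's `mutual_information` reduces to.

```python
import numpy as np


def entropy_bits(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mutual_information_bits(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    h_x = entropy_bits(joint.sum(axis=1))  # marginal over Y
    h_y = entropy_bits(joint.sum(axis=0))  # marginal over X
    return h_x + h_y - entropy_bits(joint)


independent = np.full((2, 2), 0.25)           # two independent fair coins
copied = np.array([[0.5, 0.0], [0.0, 0.5]])   # Y is an exact copy of X

print(entropy_bits([0.5, 0.5]))
print(mutual_information_bits(independent))
print(mutual_information_bits(copied))
```

A fair coin should carry one bit of entropy, independent variables should share no information, and a perfect copy should share the full bit.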

In [10]:
def prob(dataset, max_bins=10):
    """Joint probability distribution P(X) for the given data."""

    # One bin per distinct value of each feature, capped at max_bins
    num_rows, num_columns = dataset.shape
    bins = [min(len(np.unique(dataset[:, ci])), max_bins) for ci in range(num_columns)]

    freq, _ = np.histogramdd(dataset, bins)
    p = freq / np.sum(freq)
    return p

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum(P(x) * log2(P(x))) of the probability distribution P(X), in bits."""
    p = p.flatten()
    return -sum(pi*np.log2(pi) for pi in p if pi)

def conditional_shannon_entropy(p, *conditional_indices):
    """Shannon entropy of P(X) conditional on the variables at conditional_indices."""

    axis = tuple(i for i in np.arange(len(p.shape)) if i not in conditional_indices)

    return shannon_entropy(p) - shannon_entropy(np.sum(p, axis=axis))

def mutual_information(p, j):
    """Mutual information between all variables and variable j"""
    return shannon_entropy(np.sum(p, axis=j)) - conditional_shannon_entropy(p, j)

def conditional_mutual_information(p, j, *conditional_indices):
    """Mutual information between variables X and variable Y conditional on variable Z."""

    marginal_conditional_indices = [i-1 if i > j else i for i in conditional_indices]

    return (conditional_shannon_entropy(np.sum(p, axis=j), *marginal_conditional_indices)
            - conditional_shannon_entropy(p, j, *conditional_indices))
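
def parallel_function(f0, f1, difference_field, data):
    """Return (f0, f1, I(target; f0 | f1), I(target; f1 | f0)) for one feature pair."""
    cmi_f0_given_f1 = conditional_mutual_information(
        prob(data[[difference_field, f0, f1]].values), 1, 2)
    cmi_f1_given_f0 = conditional_mutual_information(
        prob(data[[difference_field, f1, f0]].values), 1, 2)
    return (f0, f1, cmi_f0_given_f1, cmi_f1_given_f0)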

### Pandas dataset read
This block reads your file into a pandas DataFrame and drops the columns listed in the cols_excluded parameter.
Use the pandas reader that matches your file format (CSV, Pickle, Parquet...) and make sure no NaN values remain in your dataset; the example cell replaces them with 0.
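
If replacing NaN with 0 is not appropriate for your data, or if you want to keep a text column, a sketch along the following lines may be a starting point. The toy DataFrame and its column names are hypothetical; dropping rows and encoding with `pd.factorize` are just one possible choice, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
raw = pd.DataFrame({
    "genre": ["rock", "pop", None, "jazz"],
    "tempo": [120.0, np.nan, 98.0, 110.0],
    "year": [1999, 2005, 2011, 1999],
})

# Drop rows with missing values instead of filling them with 0
clean = raw.dropna().copy()

# Encode the text column as integer codes so it can be binned numerically
clean["genre"], genre_labels = pd.factorize(clean["genre"])

print(clean)
print(list(genre_labels))
```

Note that integer codes impose an arbitrary order on the categories; since the histogram in `prob` only counts occurrences per bin, this matters less here than it would for a distance-based method, as long as the number of categories does not exceed max_bins.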

In [11]:
#Get a Pandas Dataframe from your file. In this example it is a CSV
data=pd.read_csv(input_filename,sep=',').drop(cols_excluded,axis=1)
data=data.fillna(0)

### Graph generator
This block generates the graph needed by the algorithm. It uses the tqdm module to show progress while the vertices and edges are generated.
Edge generation is parallelized with Joblib.
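
The layout of the resulting graph can be sketched on its own, without computing any information-theoretic values. The feature names and the 0.0 weights below are placeholders; in the notebook, vertex weights hold the mutual information of each feature with the classification column, and edge weights hold the conditional mutual information for each ordered pair of features.

```python
import itertools
import math

features = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Each unordered pair is evaluated once and yields both edge directions
pairs = list(itertools.combinations(features, 2))

# Placeholder weights only; the real cell fills in MI and CMI values
graph = {
    "vertex": {f: 0.0 for f in features},
    "edges": {f0: {f1: 0.0 for f1 in features if f1 != f0} for f0 in features},
}

print(len(pairs), math.comb(len(features), 2))
print(math.comb(90, 2))
```

The number of pairs grows quadratically with the number of features: 90 features give 4005 pairs, which matches the progress bar counts shown in the cell output.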



In [12]:
mi = {}
features = [col for col in data.columns if col != difference_field]  # keeps the column order of the dataset
for feature in tqdm(features):
    mi[feature] = mutual_information(prob(data[[difference_field, feature]].values), 1)
# Conditional mutual information for every ordered pair of distinct features
cmi = {f0: {f1: 0 for f1 in features if f1 != f0} for f0 in features}
pairs = list(itertools.combinations(features, 2))
res = Parallel(n_jobs=-1, require='sharedmem')(
    delayed(parallel_function)(f0, f1, difference_field, data) for f0, f1 in tqdm(pairs))
for f0, f1, cmi_f0_given_f1, cmi_f1_given_f0 in res:
    cmi[f0][f1] = cmi_f0_given_f1
    cmi[f1][f0] = cmi_f1_given_f0
graph={}
graph['vertex']=mi
graph['edges']=cmi

100%|██████████████████████████████████████████████████████████████████████████████████| 90/90 [00:12<00:00,  7.37it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 4005/4005 [07:02<00:00,  9.49it/s]


### Writing block
This block builds the JSON representation of your graph, adds the extra parameters, writes the result to output_filename, and also prints it to the console.
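
Before handing the file to the solver, it can be worth reading it back and checking its shape. The following is a sketch under assumptions: the miniature payload and its values are made up, and the consistency check (every edge endpoint is a known vertex, no self-loops) is a suggestion rather than a requirement stated by the solver.

```python
import json
import os
import tempfile

# Hypothetical miniature payload with the same top-level layout as the notebook's JSON
payload = {
    "data": {
        "vertex": {"a": 0.1, "b": 0.2},
        "edges": {"a": {"b": 0.05}, "b": {"a": 0.07}},
    },
    "extra_arguments": {"lower_bound": 1, "upper_bound": 2},
}

# Write to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "input.json")
    with open(path, "wt") as f:
        json.dump(payload, f, indent=4)
    with open(path, "rt") as f:
        loaded = json.load(f)

# Every edge must connect two distinct, known vertices
vertices = set(loaded["data"]["vertex"])
edges_ok = all(
    src in vertices and set(targets) <= vertices - {src}
    for src, targets in loaded["data"]["edges"].items()
)
print(loaded == payload, edges_ok)
```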

In [None]:
final_json={}
final_json['data']=graph
final_json['extra_arguments']=extra_arguments
json_str=json.dumps(final_json,indent=4)
with open(output_filename, "wt") as f:
    f.write(json_str)
print(json_str)