<img src="../images/CD_image.png" width="600">

<div style="background-color:white; text-align:center; padding:10px; color:black; margin-left:0px; border-radius: 10px; font-family:Trebuchet MS; font-size:45px">
<strong>Example of Feature Selection</strong>
</div>

# Environment

## Spark session

In [1]:
!python3 --version

Python 3.9.2


In [2]:
sc

In [3]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## Libraries

In [4]:
from fns_four import cp_four  # project module that provides the feature-selection helper
%config InlineBackend.figure_format = "retina"

# Load dataset

In [5]:
Ruta = "../data/"
baseDf = sqlContext.read.format("parquet").load(Ruta + "tmp_evol4")

In [6]:
print("Rows: {0:,}\tColumns: {1:,}".format(baseDf.count(), len(baseDf.columns)))

Rows: 691,782	Columns: 11


In [7]:
baseDf.show(5)

+--------+---------+----+------------+---------+---------+---------+----------------+------------+-----+-----------+
|incmpl_d|cohorte_d|segm|       num_1|    cat_1|    cat_2|    cat_3|           num_2|       num_3|num_4|      num_5|
+--------+---------+----+------------+---------+---------+---------+----------------+------------+-----+-----------+
|       0|   200804|   1|-9.9999992E7|-99999995|-99999995|-99999995|1.10148620475244|0.2370735282| -0.3|6.745534424|
|       0|   200805|   1|-9.9999992E7|-99999995|-99999995|-99999995|1.10148620475244|0.2370735282| -0.3|6.745534424|
|       0|   200806|   1|-9.9999992E7|-99999995|-99999995|-99999995|1.10148620475244|0.2370735282| -0.3|6.745534424|
|       0|   200807|   1|-9.9999992E7|-99999995|-99999995|-99999995|1.10148620475244|0.2370735282| -0.3|6.745534424|
|       0|   200808|   1|-9.9999992E7|-99999995|-99999995|-99999995|1.10148620475244|0.2370735282| -0.3|6.745534424|
+--------+---------+----+------------+---------+---------+---------+----------------+------------+-----+-----------+

# Feature selection

In [8]:
four = cp_four(sqlContext)

In [9]:
baseDf.columns

['incmpl_d',
 'cohorte_d',
 'segm',
 'num_1',
 'cat_1',
 'cat_2',
 'cat_3',
 'num_2',
 'num_3',
 'num_4',
 'num_5']

In [10]:
var_interes = ["incmpl_d","num_1","num_2","num_3","cat_1","cat_2"]
list_espvalues = [-99999999,-99999998,-99999997,-99999996,-99999995,-99999994,-99999993,-99999992,-99999991,-99999990]

In [11]:
param_dict = {"var_interes": var_interes,
              "special_values" : {"num_1": list_espvalues,
                                  "num_2": list_espvalues,
                                  "num_3": list_espvalues,
                                  "num_4": list_espvalues,
                                  "num_5": list_espvalues},
              "var_obj": "incmpl_d",
              "lim_none": 0.20,
              "lim_corr": 0.50,
              "lim_variab": 5,
              "analysis_rf": {"lim_imprf": 1,
                              "num_trees": 20,
                              "max_depth": 4,
                              "rf_seed": 12345},
              "black_list": ["cohorte_d","segm"],
              "white_list": []}
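
The `special_values` entry lists, per column, the sentinel codes that stand for missing or non-applicable data rather than real measurements. How `cp_four` consumes them is not shown in this notebook. The sketch below is a plain-Python illustration, under the assumption that such codes are simply excluded before any statistic is computed. The helper name `drop_special` is hypothetical.

```python
# Hypothetical helper: not part of fns_four.
SPECIAL_VALUES = list(range(-99999999, -99999989))  # -99999999 ... -99999990

def drop_special(values, special_values):
    """Return the values that are neither sentinel codes nor None."""
    special = set(special_values)
    return [v for v in values if v is not None and v not in special]

raw_num_1 = [-99999992.0, 0.33, 0.74, 1.59, -99999995.0]
clean_num_1 = drop_special(raw_num_1, SPECIAL_VALUES)
print(clean_num_1)  # [0.33, 0.74, 1.59]
```
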

## Correlation analysis

Here we do more than check the correlations between variables: we want to identify which variables are worth considering for the score model. For this purpose, we also compute the Gini of each variable, so that when two variables are highly correlated we can decide which one to keep.
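
In credit scoring, the Gini of a single variable is commonly defined as `2 * AUC - 1`, where the AUC measures how well the variable alone ranks the target. Whether `correlation_analysis` uses exactly this definition is an assumption. The function `gini_from_scores` below is a hypothetical, plain-Python sketch of that formula; its pairwise loop is quadratic, so it is meant for illustration only.

```python
def gini_from_scores(scores, labels):
    """Gini = 2 * AUC - 1, with AUC computed by pairwise comparison.

    Assumes binary labels (0/1) with at least one of each class.
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    auc = wins / (len(pos) * len(neg))
    return 2.0 * auc - 1.0

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
gini = gini_from_scores(scores, labels)
print(round(gini, 2))  # 0.5
```
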

In [12]:
%%time
w = four.correlation_analysis(baseDf, param_dict)
print("\n----------")

Iteration of correlation (gini): 2/2	Progress: 100.00%
----------
CPU times: user 38.7 ms, sys: 11.1 ms, total: 49.8 ms
Wall time: 24.2 s


In [13]:
w[0]

|   | row | column | correlation | row_gini | column_gini | drop |
|---|---|---|---|---|---|---|
| 0 | cat_1 | cat_2 | 1.0 | 0.161381 | 0.137252 | cat_2 |


In [14]:
w[1]

|   | variable | gini | tndnc |
|---|---|---|---|
| 0 | cat_2 | 0.137252 | d |
| 1 | cat_1 | 0.161381 | d |


In [15]:
w[2]

['cat_2']
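
The result above is consistent with a simple rule: for every pair whose absolute correlation exceeds `lim_corr`, the variable with the lower Gini is marked for removal. The sketch below reproduces that rule on the values printed in `w[0]`. The function name `pick_drops` is hypothetical, and the rule itself is inferred from the output, not taken from the `cp_four` source.

```python
def pick_drops(pairs, lim_corr):
    """For each correlated pair, keep the higher-Gini variable and drop the other."""
    drops = []
    for row, col, corr, row_gini, col_gini in pairs:
        if abs(corr) <= lim_corr:
            continue  # not correlated enough to act on
        loser = col if row_gini >= col_gini else row
        if loser not in drops:
            drops.append(loser)
    return drops

# (row, column, correlation, row_gini, column_gini) as printed in w[0]
pairs = [("cat_1", "cat_2", 1.0, 0.161381, 0.137252)]
to_drop = pick_drops(pairs, lim_corr=0.50)
print(to_drop)  # ['cat_2']
```
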

## Variability analysis

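Features with high variability are considered good candidates for the model. In the output below, the variables whose variability falls under `lim_variab` (5 in `param_dict`) are the ones returned in `z[1]` as candidates to discard.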

In [16]:
z = four.variability_analysis(baseDf, param_dict)


-----
Iteration of variability and none values: 5/5	Progress: 100.00%

In [17]:
z[0]

|   | q_0 | q_25 | median | q_75 | q_100 | variable | IQR | variability |
|---|---|---|---|---|---|---|---|---|
| 4 | -100000000.0 | -100000000.0 | -100000000.0 | 4.0 | 8.0 | cat_2 | 100000000.0 | 99.999996 |
| 3 | -100000000.0 | -100000000.0 | -100000000.0 | 3.0 | 8.0 | cat_1 | 100000000.0 | 99.999995 |
| 1 | -35.14523 | 0.753673 | 1.752023 | 3.687792 | 2068.478261 | num_2 | 2.934119 | 0.139479 |
| 0 | 0.0 | 0.3326354 | 0.7433755 | 1.586602 | 1796.171741 | num_1 | 1.253966 | 0.069813 |
| 2 | 0.0 | 0.02288464 | 0.2335762 | 0.781704 | 1882.0 | num_3 | 0.758819 | 0.04032 |


In [18]:
z[1]

['num_2', 'num_1', 'num_3']
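
The `variability` column in `z[0]` is consistent with the interquartile range expressed as a percentage of the full range, `100 * IQR / (q_100 - q_0)`. This formula is inferred from the printed values rather than from the `cp_four` source, so treat the sketch below as an assumption. `variability_pct` is a hypothetical name.

```python
def variability_pct(q_0, q_25, q_75, q_100):
    """IQR as a percentage of the total range (0 if the range is empty)."""
    total_range = q_100 - q_0
    if total_range == 0:
        return 0.0
    return 100.0 * (q_75 - q_25) / total_range

# Quantiles of num_2 as printed in z[0]
num_2 = variability_pct(q_0=-35.14523, q_25=0.753673, q_75=3.687792, q_100=2068.478261)
print(round(num_2, 4))  # 0.1395

lim_variab = 5
print(num_2 < lim_variab)  # True: num_2 falls below the limit
```

Under this reading, `cat_1` and `cat_2` reach almost 100 because their sentinel code -99999995 is not listed for them in `special_values`, which suggests it stretches both the IQR and the range.
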

## Random Forest analysis

In [19]:
a = four.analysis_rf(baseDf, param_dict)


-----
Feature importance by RF is done.


In [20]:
a[0]

SparseVector(5, {0: 0.0853, 1: 0.1468, 2: 0.0293, 3: 0.5508, 4: 0.1878})

In [21]:
a[1]

|   | idx | name | score |
|---|---|---|---|
| 0 | 3 | cat_1 | 0.550817 |
| 1 | 4 | cat_2 | 0.187829 |
| 2 | 1 | num_2 | 0.146806 |
| 3 | 0 | num_1 | 0.085251 |
| 4 | 2 | num_3 | 0.029298 |


In [22]:
a[2]

['num_3']
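
`a[0]` holds the raw importances by feature index, `a[1]` the same values joined to the feature names, and `a[2]` the variables flagged as least important. With `lim_imprf` set to 1, the output is consistent with flagging the single lowest-scoring feature; that reading of `lim_imprf` is an assumption. The sketch below ranks the printed importances in plain Python, with `flag_low_importance` as a hypothetical helper.

```python
def flag_low_importance(names, importances, lim_imprf):
    """Return the `lim_imprf` feature names with the lowest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1])
    return [name for name, _ in ranked[:lim_imprf]]

# Feature order and scores as printed in a[0] and a[1]
names = ["num_1", "num_2", "num_3", "cat_1", "cat_2"]
importances = [0.0853, 0.1468, 0.0293, 0.5508, 0.1878]

low = flag_low_importance(names, importances, lim_imprf=1)
print(low)  # ['num_3']
```
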

## Feature selection matrix

Variables with a higher value in the **count** column were flagged by more of the analyses above and are the strongest candidates for removal.

In [23]:
b = four.matrix_fs(param_dict,w,z,a)

In [24]:
b[0]

|   | variable | correlation | variability | low_importance | count |
|---|---|---|---|---|---|
| 0 | num_1 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | num_2 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | num_3 | 0.0 | 1.0 | 1.0 | 2.0 |
| 3 | cat_1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | cat_2 | 1.0 | 0.0 | 0.0 | 1.0 |


In [25]:
b[1]

|   | variable | correlation | variability | low_importance | count | correlation with |
|---|---|---|---|---|---|---|
| 0 | num_3 | 0.0 | 1.0 | 1.0 | 2.0 |  |
| 1 | num_1 | 0.0 | 1.0 | 0.0 | 1.0 |  |
| 2 | num_2 | 0.0 | 1.0 | 0.0 | 1.0 |  |
| 3 | cat_2 | 1.0 | 0.0 | 0.0 | 1.0 | cat_1 |
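
The `count` column is the number of analyses that flagged each variable. The sketch below rebuilds the flags of `b[0]` from the three lists returned earlier (`w[2]`, `z[1]`, `a[2]`). `build_matrix` is a hypothetical helper written in plain Python, not the `matrix_fs` implementation.

```python
def build_matrix(variables, flagged_by):
    """Build one row per variable with a 0/1 flag per analysis and their sum."""
    rows = []
    for var in variables:
        row = {"variable": var}
        for analysis, flagged in flagged_by.items():
            row[analysis] = 1 if var in flagged else 0
        row["count"] = sum(row[a] for a in flagged_by)
        rows.append(row)
    return rows

variables = ["num_1", "num_2", "num_3", "cat_1", "cat_2"]
flagged_by = {
    "correlation": ["cat_2"],                    # w[2]
    "variability": ["num_2", "num_1", "num_3"],  # z[1]
    "low_importance": ["num_3"],                 # a[2]
}

matrix = build_matrix(variables, flagged_by)
counts = {row["variable"]: row["count"] for row in matrix}
print(counts)
# {'num_1': 1, 'num_2': 1, 'num_3': 2, 'cat_1': 0, 'cat_2': 1}
```
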
