# Worksheet \# 6


---


by Josué Obregón <br>
DS6011 - Feature Engineering <br>
UVG Masters - Escuela de Negocios<br>


## Objectives

The goal of this worksheet is to introduce the student to different feature selection techniques.

It also aims for the student to practice applying these techniques with the libraries available in Python.

## Importing libraries and loading the data

The libraries we import to get started are pandas and numpy for data handling, and matplotlib, seaborn, and plotly for generating visualizations.



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Loading the data

In [3]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [4]:
credit_g =  fetch_openml('credit-g') # https://www.openml.org/t/31



In [5]:
credit_g.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [6]:
print(credit_g['DESCR'])

**Author**: Dr. Hans Hofmann  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

**German Credit dataset**  
This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix: 
``` 
Good  Bad (predicted)  
Good   0    1   (actual)  
Bad    5    0  
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  

### Attribute description  

1. Status of existing checking account, in Deutsche Mark.  
2. Duration in months  
3. Credit history (credits taken, paid back duly, delays, critical accounts)  
4. Purpose of the credit (car, television,...)  
5. Credit amount  
6. Status of savings account/bonds, in Deutsche Mark.  
7. Present employment, in number of years.  
8. Installment rate in percentage of disposable income  
9. Perso
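
The cost matrix in the dataset description above can be turned into a single misclassification cost for any set of predictions. The following is a minimal sketch, assuming the labels are the strings `good` and `bad`; the function name `total_cost` and the example labels are hypothetical and are not part of the dataset or of scikit-learn.

```python
# Cost of each (actual, predicted) pair, taken from the cost matrix above.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,   # good customer rejected
    ("bad", "good"): 5,   # bad customer accepted: the expensive mistake
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the cost-matrix entries over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical labels for illustration only (not results from this worksheet).
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good"]
example_cost = total_cost(actual, predicted)  # 0 + 1 + 5 + 0 + 0
print(example_cost)
```

Under this matrix, two models with the same accuracy can have very different costs, depending on how many bad customers they classify as good.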

In [7]:
df_credit = credit_g['data']
lbl_enc = LabelEncoder()
df_credit['class']= lbl_enc.fit_transform(credit_g['target'])
df_credit.head()

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<0` | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | `>=7` | 4.0 | male single | none | ... | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes | 1 |
| 1 | `0<=X<200` | 48.0 | existing paid | radio/tv | 5951.0 | `<100` | `1<=X<4` | 2.0 | female div/dep/mar | none | ... | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes | 0 |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | `<100` | `4<=X<7` | 2.0 | male single | none | ... | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes | 1 |
| 3 | `<0` | 42.0 | existing paid | furniture/equipment | 7882.0 | `<100` | `4<=X<7` | 2.0 | male single | guarantor | ... | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes | 1 |
| 4 | `<0` | 24.0 | delayed previously | new car | 4870.0 | `<100` | `1<=X<4` | 3.0 | male single | none | ... | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes | 0 |


We split the data into training and test sets.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df_credit.drop(['class'],axis=1),df_credit['class'],train_size=0.80, random_state=6011, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(800, 20)
(800,)
(200, 20)
(200,)


#### Encoding categorical variables

In [9]:
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.6.1-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.1


In [10]:
from category_encoders import MEstimateEncoder

In [11]:
cat_cols = ['checking_status','credit_history','purpose','savings_status','employment','personal_status','other_parties','residence_since','property_magnitude','other_payment_plans','housing','job','own_telephone','foreign_worker']

In [12]:
mest_enc = MEstimateEncoder(cols=cat_cols)

In [13]:
X_train_cod1 = mest_enc.fit_transform(X_train, y_train)

In [14]:
X_train_cod1

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 621 | 0.874023 | 18.0 | 0.824947 | 0.607955 | 1530.0 | 0.622500 | 0.683036 | 3.0 | 0.723900 | 0.688523 | 0.672262 | 0.672126 | 32.0 | 0.568638 | 0.724373 | 2.0 | 0.692749 | 1.0 | 0.674349 | 0.680114 |
| 199 | 0.611944 | 18.0 | 0.671875 | 0.679789 | 4297.0 | 0.622500 | 0.737377 | 4.0 | 0.597384 | 0.688523 | 0.709500 | 0.550305 | 40.0 | 0.712270 | 0.724373 | 1.0 | 0.629808 | 1.0 | 0.707104 | 0.680114 |
| 360 | 0.611944 | 18.0 | 0.663586 | 0.556090 | 1239.0 | 0.803272 | 0.683036 | 4.0 | 0.723900 | 0.688523 | 0.681835 | 0.550305 | 61.0 | 0.712270 | 0.586596 | 1.0 | 0.692749 | 1.0 | 0.674349 | 0.680114 |
| 65 | 0.874023 | 27.0 | 0.663586 | 0.634375 | 5190.0 | 0.803272 | 0.737377 | 4.0 | 0.723900 | 0.688523 | 0.681835 | 0.672126 | 48.0 | 0.712270 | 0.724373 | 4.0 | 0.692749 | 2.0 | 0.707104 | 0.680114 |
| 981 | 0.874023 | 48.0 | 0.663586 | 0.638117 | 4844.0 | 0.622500 | 0.593750 | 3.0 | 0.723900 | 0.688523 | 0.672262 | 0.679136 | 33.0 | 0.568638 | 0.596391 | 1.0 | 0.629808 | 1.0 | 0.707104 | 0.680114 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 187 | 0.611944 | 16.0 | 0.824947 | 0.607955 | 1175.0 | 0.622500 | 0.593750 | 2.0 | 0.723900 | 0.688523 | 0.709500 | 0.679136 | 68.0 | 0.712270 | 0.586596 | 3.0 | 0.593750 | 1.0 | 0.707104 | 0.680114 |
| 478 | 0.611944 | 12.0 | 0.663586 | 0.638117 | 1037.0 | 0.681882 | 0.784052 | 3.0 | 0.723900 | 0.688523 | 0.681835 | 0.785278 | 39.0 | 0.712270 | 0.724373 | 1.0 | 0.724124 | 1.0 | 0.674349 | 0.680114 |
| 559 | 0.611944 | 18.0 | 0.824947 | 0.679789 | 1928.0 | 0.622500 | 0.558067 | 2.0 | 0.723900 | 0.688523 | 0.672262 | 0.785278 | 31.0 | 0.712270 | 0.724373 | 2.0 | 0.724124 | 1.0 | 0.674349 | 0.680114 |
| 305 | 0.874023 | 6.0 | 0.663586 | 0.679789 | 1543.0 | 0.876453 | 0.683036 | 4.0 | 0.597384 | 0.688523 | 0.672262 | 0.785278 | 33.0 | 0.712270 | 0.724373 | 1.0 | 0.692749 | 1.0 | 0.674349 | 0.680114 |


In [15]:
mest_enc.mapping

{'checking_status': checking_status
  1    0.874023
  2    0.611944
  3    0.477192
  4    0.722656
 -1    0.687500
 -2    0.687500
 dtype: float64,
 'credit_history': credit_history
  1    0.824947
  2    0.671875
  3    0.663586
  4    0.417187
  5    0.391071
 -1    0.687500
 -2    0.687500
 dtype: float64,
 'purpose': purpose
  1     0.607955
  2     0.679789
  3     0.556090
  4     0.634375
  5     0.638117
  6     0.764895
  7     0.810320
  8     0.568750
  9     0.835938
  10    0.607955
 -1     0.687500
 -2     0.687500
 dtype: float64,
 'savings_status': savings_status
  1    0.622500
  2    0.803272
  3    0.809949
  4    0.681882
  5    0.876453
 -1    0.687500
 -2    0.687500
 dtype: float64,
 'employment': employment
  1    0.683036
  2    0.737377
  3    0.593750
  4    0.784052
  5    0.558067
 -1    0.687500
 -2    0.687500
 dtype: float64,
 'personal_status': personal_status
  1    0.723900
  2    0.597384
  3    0.735445
  4    0.623214
 -1    0.687500
 -2    0.6875

In [16]:
X_test_cod1 = mest_enc.transform(X_test)
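
As a rough guide to what the encoder computes: an M-estimate encoder replaces each category with a smoothed mean of the target, pulling categories with few rows toward the global mean (the prior). In the mapping above, the codes -1 and -2 all map to 0.6875, which is consistent with such a prior. The following is a minimal pure-Python sketch of the idea, assuming the smoothing formula (sum of targets + prior · m) / (count + m) with m = 1; the function `m_estimate_encode` is hypothetical and is not the `category_encoders` implementation.

```python
from collections import defaultdict


def m_estimate_encode(categories, targets, m=1.0):
    """Map each category to (sum_of_targets + prior * m) / (count + m)."""
    prior = sum(targets) / len(targets)  # global mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (sums[cat] + prior * m) / (counts[cat] + m) for cat in counts}


# Hypothetical toy data: category "b" has a single row, so its raw mean (1.0)
# is pulled noticeably toward the prior (2/3).
encoding = m_estimate_encode(["a", "a", "b"], [1, 0, 1])
print(encoding)
```

Because the encoding uses the target, it is fitted on the training set only and then applied to the test set with `transform`, as done above.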

In [17]:
from sklearn.preprocessing import PolynomialFeatures

In [18]:
poly_gen = PolynomialFeatures(2)

In [19]:
poly_gen.fit_transform(X_train_cod1).shape

(800, 231)
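
The 231 columns follow from counting all monomials of total degree at most 2 in 20 variables, including the constant term. The sketch below checks that count in two ways using only the standard library; the function name is illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= degree, constant term included."""
    return comb(n_inputs + degree, degree)


# Cross-check by enumerating the monomials of each degree explicitly:
# 1 constant + 20 linear terms + 210 squares and pairwise products.
n_enumerated = sum(
    1
    for d in range(3)  # degrees 0, 1 and 2
    for _ in combinations_with_replacement(range(20), d)
)
print(n_polynomial_features(20, 2), n_enumerated)
```

The count grows quickly with the number of inputs and the degree, which is what makes feature selection relevant after this expansion.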

In [20]:
X_train_cod = pd.DataFrame(poly_gen.fit_transform(X_train_cod1), columns=poly_gen.get_feature_names_out())
X_train_cod

| | 1 | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | ... | job^2 | job num_dependents | job own_telephone | job foreign_worker | num_dependents^2 | num_dependents own_telephone | num_dependents foreign_worker | own_telephone^2 | own_telephone foreign_worker | foreign_worker^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.874023 | 18.0 | 0.824947 | 0.607955 | 1530.0 | 0.622500 | 0.683036 | 3.0 | 0.723900 | ... | 0.479901 | 0.692749 | 0.467155 | 0.471148 | 1.0 | 0.674349 | 0.680114 | 0.454747 | 0.458634 | 0.462555 |
| 1 | 1.0 | 0.611944 | 18.0 | 0.671875 | 0.679789 | 4297.0 | 0.622500 | 0.737377 | 4.0 | 0.597384 | ... | 0.396658 | 0.629808 | 0.445340 | 0.428341 | 1.0 | 0.707104 | 0.680114 | 0.499996 | 0.480911 | 0.462555 |
| 2 | 1.0 | 0.611944 | 18.0 | 0.663586 | 0.556090 | 1239.0 | 0.803272 | 0.683036 | 4.0 | 0.723900 | ... | 0.479901 | 0.692749 | 0.467155 | 0.471148 | 1.0 | 0.674349 | 0.680114 | 0.454747 | 0.458634 | 0.462555 |
| 3 | 1.0 | 0.874023 | 27.0 | 0.663586 | 0.634375 | 5190.0 | 0.803272 | 0.737377 | 4.0 | 0.723900 | ... | 0.479901 | 1.385498 | 0.489846 | 0.471148 | 4.0 | 1.414208 | 1.360227 | 0.499996 | 0.480911 | 0.462555 |
| 4 | 1.0 | 0.874023 | 48.0 | 0.663586 | 0.638117 | 4844.0 | 0.622500 | 0.593750 | 3.0 | 0.723900 | ... | 0.396658 | 0.629808 | 0.445340 | 0.428341 | 1.0 | 0.707104 | 0.680114 | 0.499996 | 0.480911 | 0.462555 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 1.0 | 0.611944 | 16.0 | 0.824947 | 0.607955 | 1175.0 | 0.622500 | 0.593750 | 2.0 | 0.723900 | ... | 0.352539 | 0.593750 | 0.419843 | 0.403817 | 1.0 | 0.707104 | 0.680114 | 0.499996 | 0.480911 | 0.462555 |
| 796 | 1.0 | 0.611944 | 12.0 | 0.663586 | 0.638117 | 1037.0 | 0.681882 | 0.784052 | 3.0 | 0.723900 | ... | 0.524356 | 0.724124 | 0.488312 | 0.492487 | 1.0 | 0.674349 | 0.680114 | 0.454747 | 0.458634 | 0.462555 |
| 797 | 1.0 | 0.611944 | 18.0 | 0.824947 | 0.679789 | 1928.0 | 0.622500 | 0.558067 | 2.0 | 0.723900 | ... | 0.524356 | 0.724124 | 0.488312 | 0.492487 | 1.0 | 0.674349 | 0.680114 | 0.454747 | 0.458634 | 0.462555 |
| 798 | 1.0 | 0.874023 | 6.0 | 0.663586 | 0.679789 | 1543.0 | 0.876453 | 0.683036 | 4.0 | 0.597384 | ... | 0.479901 | 0.692749 | 0.467155 | 0.471148 | 1.0 | 0.674349 | 0.680114 | 0.454747 | 0.458634 | 0.462555 |


In [21]:
X_test_cod = pd.DataFrame(poly_gen.transform(X_test_cod1), columns=poly_gen.get_feature_names_out())

#### Baseline model

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [23]:
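# Unregularized logistic regression as the baseline.
# Note: scikit-learn >= 1.2 expects penalty=None; the string 'none' was removed in 1.4.
base_model = LogisticRegression( penalty='none', max_iter=1000, random_state=6011)
base_model.fit(X_train_cod,y_train)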



In [24]:
print(f'Coefficients greater than zero: {(base_model.coef_>0).sum()}')

Coefficients greater than zero: 199


In [25]:
print(classification_report(y_test, base_model.predict(X_test_cod)))

              precision    recall  f1-score   support

           0       0.63      0.24      0.35        50
           1       0.79      0.95      0.86       150

    accuracy                           0.78       200
   macro avg       0.71      0.60      0.61       200
weighted avg       0.75      0.78      0.73       200



# Intrinsic Methods

## Ridge Regression

In [26]:
from sklearn.linear_model import LogisticRegressionCV  # the only estimator from this module used below

In [27]:
ridge_classifier = LogisticRegressionCV(cv=2, max_iter=100, penalty='l2', solver='liblinear', random_state=6011)
ridge_classifier.fit(X_train_cod,y_train)

In [28]:
ridge_classifier.classes_

array([0, 1])

In [29]:
ridge_classifier.coef_

array([[ 2.34814244e-06,  7.02108252e-06, -5.08499543e-05,
         4.89841388e-06,  3.39891023e-06, -1.81871251e-03,
         3.08778064e-06,  2.86397081e-06,  5.96761952e-06,
         2.28763023e-06,  2.38232781e-06,  1.79728622e-06,
         3.00631565e-06,  8.95614215e-05,  2.55470191e-06,
         2.52087213e-06,  7.07800256e-06,  1.82806802e-06,
         1.00394608e-06,  1.79504957e-06,  2.23917392e-06,
         8.52322058e-06,  6.87946461e-05,  6.95015033e-06,
         6.07883724e-06,  7.75374631e-04,  5.76727968e-06,
         5.72377702e-06,  2.44426177e-05,  5.26211769e-06,
         5.30083156e-06,  4.95505167e-06,  5.78620975e-06,
         2.50870451e-04,  5.59823316e-06,  5.47567812e-06,
         1.09758023e-05,  5.03714497e-06,  6.56231739e-06,
         4.92793021e-06,  5.19260778e-06, -1.88986860e-03,
        -1.70470618e-05, -6.81458053e-06,  2.04620788e-05,
         2.78522046e-06, -1.20122658e-05, -1.86574992e-04,
        -1.84261388e-05, -2.00698057e-05, -2.76095135e-0

In [30]:
print(f'Coefficients greater than zero: {(ridge_classifier.coef_>0).sum()}')

Coefficients greater than zero: 203


In [31]:
ridge_classifier.C_

array([0.00599484])

In [32]:
print(f'Best value for lambda: {1/ridge_classifier.C_}')

Best value for lambda: [166.81005372]
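
`LogisticRegressionCV` is parameterized by `C`, the inverse of the regularization strength, so lambda = 1/C. With its default `Cs=10`, scikit-learn searches ten values spaced logarithmically between 1e-4 and 1e4. The sketch below rebuilds that grid in plain Python (the variable names are illustrative) and shows that the `C_` value reported above coincides with one of the grid points.

```python
# Ten log-spaced candidates between 1e-4 and 1e4, as in np.logspace(-4, 4, 10).
c_grid = [10 ** (-4 + 8 * i / 9) for i in range(10)]

selected_c = c_grid[2]         # the grid point matching the reported C_
lambda_value = 1 / selected_c  # corresponding regularization strength

print(round(selected_c, 8), round(lambda_value, 4))
```

Since only ten candidates are tried, the "best" lambda is the best value on this coarse grid; a finer grid can be passed through `Cs` if more resolution is needed.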


In [33]:
print('================ Ridge Classifier Results =================\n')
print(classification_report(y_test, ridge_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.56      0.40      0.47        50
           1       0.82      0.89      0.85       150

    accuracy                           0.77       200
   macro avg       0.69      0.65      0.66       200
weighted avg       0.75      0.77      0.76       200



## LASSO Regression

In [34]:
lasso_classifier = LogisticRegressionCV(cv=2, penalty='l1', max_iter=100, solver='liblinear', random_state=6011)
lasso_classifier.fit(X_train_cod,y_train)
# 21 sec



In [35]:
lasso_classifier.classes_

array([0, 1])

In [36]:
lasso_classifier.coef_

array([[ 0.00000000e+00,  0.00000000e+00, -4.51882087e-01,
         0.00000000e+00,  0.00000000e+00, -1.83352862e-03,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00, -4.33901140e-01,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.83041275e-01,  0.00000000e+00,
         0.00000000e+00, -5.41904266e-04,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         8.98106469e-02,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00, -1.54779551e-03,
        -3.62326024e-01,  3.11098936e-02,  2.40859918e-05,
         2.00821859e-01,  1.02283097e-01, -1.46884716e-02,
         0.00000000e+00,  0.00000000e+00,  1.05112561e-0

In [37]:
print(f'Coefficients greater than zero: {(lasso_classifier.coef_>0).sum()}')

Coefficients greater than zero: 30
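
The comparison `coef_ > 0` counts only strictly positive coefficients. L1 regularization drives coefficients to exactly zero, and the surviving ones can be negative (several are in the array above), so the number of features the model actually keeps is the count of nonzero coefficients. A small sketch with hypothetical coefficient values:

```python
# Hypothetical coefficients for illustration (not taken from the fitted model).
coefs = [0.0, -0.45, 0.0, 0.18, -0.0018, 0.0, 0.09]

n_positive = sum(c > 0 for c in coefs)   # what `(coef_ > 0).sum()` measures
n_nonzero = sum(c != 0 for c in coefs)   # features the L1 penalty kept

print(n_positive, n_nonzero)
```

With the fitted model, `np.count_nonzero(lasso_classifier.coef_)` gives the equivalent nonzero count.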


In [38]:
lasso_classifier.C_

array([0.35938137])

In [39]:
print(f'Best value for lambda: {1/lasso_classifier.C_}')

Best value for lambda: [2.7825594]


In [40]:
print('================ LASSO Logistic Regressor Results =================\n')
print(classification_report(y_test, lasso_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.57      0.62      0.60        50
           1       0.87      0.85      0.86       150

    accuracy                           0.79       200
   macro avg       0.72      0.73      0.73       200
weighted avg       0.80      0.79      0.79       200
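
The contrast between the Ridge and LASSO coefficient arrays (small but nonzero values versus mostly exact zeros) follows from how each penalty shrinks a coefficient. The following is a simplified sketch for a single coefficient in a least-squares problem with an orthonormal design. That setting is an assumption made for illustration and is not the logistic-regression solver used above; the function names are hypothetical.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1 + lam)


def lasso_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b): soft-thresholding."""
    magnitude = max(abs(z) - lam, 0.0)
    return magnitude if z >= 0 else -magnitude


# z is the unpenalized estimate; lam is the penalty strength.
for z in (3.0, 0.5, -3.0):
    print(z, ridge_shrink(z, 1.0), lasso_shrink(z, 1.0))
```

Ridge scales every coefficient toward zero without reaching it, while LASSO sets any coefficient whose unpenalized magnitude is below the threshold to exactly zero, which is what makes it usable as a feature selector.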



## Elastic Net (combined L1 and L2 regularizers)

In [41]:
elastic_classifier = LogisticRegression( penalty='elasticnet', solver='saga', max_iter=100, l1_ratio=0.5, random_state=6011)
elastic_classifier.fit(X_train_cod,y_train)



In [42]:
elastic_classifier = LogisticRegression(C=0.0001, penalty='elasticnet', solver='saga', max_iter=100, l1_ratio=0.5, random_state=6011)
elastic_classifier.fit(X_train_cod,y_train)



In [43]:
elastic_classifier.classes_

array([0, 1])

In [44]:
elastic_classifier.coef_

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 4.98301911e-10, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.05555493e-12, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 4.61757862e-10, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 4.12317916e-11, 0.00000000e+00, 0.00000000e+00,
        6.67893866e-09, 0.00000000e+00, 0.00000000e+00, 1.17191012e-12,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        1.24893009e-10, 0.00000000e+00, 0.00000000e+00, 0.000000

In [45]:
print(f'Coefficients greater than zero: {(elastic_classifier.coef_>0).sum()}')

Coefficients greater than zero: 29


In [46]:
elastic_classifier.C

0.0001

In [47]:
print(f'Lambda value (C was set manually): {1/elastic_classifier.C}')

Lambda value (C was set manually): 10000.0


In [48]:
print('================ Elastic Net Results =================\n')
print(classification_report(y_test, elastic_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00        50
           1       0.75      1.00      0.86       150

    accuracy                           0.75       200
   macro avg       0.38      0.50      0.43       200
weighted avg       0.56      0.75      0.64       200
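
The elastic net penalty mixes the two previous penalties through `l1_ratio`. The following is a minimal sketch of the penalty term, assuming the form l1_ratio · ‖w‖₁ + (1 − l1_ratio)/2 · ‖w‖₂² that scikit-learn documents for `LogisticRegression`; the function name and the weights are hypothetical.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return l1_ratio * l1 + (1 - l1_ratio) / 2 * l2_squared


weights = [1.0, -2.0]
pure_l1 = elastic_net_penalty(weights, 1.0)  # LASSO-style penalty
pure_l2 = elastic_net_penalty(weights, 0.0)  # Ridge-style penalty
mixed = elastic_net_penalty(weights, 0.5)    # the mix used above
print(pure_l1, pure_l2, mixed)
```

In scikit-learn the data-fit term is multiplied by `C`, so a very small value such as `C=0.0001` lets the penalty dominate. That is consistent with the near-zero coefficients and with the report above, where every test example is predicted as class 1.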





## Decision Trees

In [49]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [50]:
dt = DecisionTreeClassifier()
dt.fit(X_train_cod, y_train)

In [51]:
dt.feature_importances_

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00620606, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00659123, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.01570909, 0.        ,
       0.0080881 , 0.14872521, 0.00561755, 0.        , 0.00551196,
       0.        , 0.        , 0.        , 0.03155642, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00387879, 0.01206988, 0.        ,
       0.        , 0.        , 0.02188903, 0.        , 0.        ,
       0.        , 0.03660315, 0.00184078, 0.        , 0.00930909,
       0.00465455, 0.        , 0.0097467 , 0.        , 0.        ,
       0.        , 0.01994834, 0.0724463 , 0.        , 0.00908212,
       0.        , 0.        , 0.        , 0.        , 0.04732802,
       0.        , 0.00555372, 0.        , 0.00487378, 0.     

In [52]:
np.array(dt.feature_importances_).sum()

0.9999999999999998

In [53]:
dt.get_depth()

15

In [54]:
print(f'Importances greater than zero: {(dt.feature_importances_>0).sum()}')

Importances greater than zero: 78
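
`feature_importances_` in a scikit-learn decision tree is impurity-based: each feature is credited with the weighted impurity decrease of the splits that use it, and the totals are normalized, which is why the sum printed above is 1 up to rounding and why features never used in a split get exactly 0. The sketch below computes the impurity decrease of a single split with the Gini criterion (the default); the helper names and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 1, 1]
perfect_split = impurity_decrease(parent, [0, 0], [1, 1])
useless_split = impurity_decrease(parent, [0, 1], [0, 1])
print(perfect_split, useless_split)
```

A single fully grown tree uses only the features it happens to split on, so the set of nonzero importances acts as an implicit feature selection.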


In [55]:
print('================ Decision Tree Results =================\n')
print(classification_report(y_test, dt.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.41      0.58      0.48        50
           1       0.84      0.73      0.78       150

    accuracy                           0.69       200
   macro avg       0.63      0.65      0.63       200
weighted avg       0.73      0.69      0.70       200



## Tree Ensembles

In [56]:
rf = RandomForestClassifier()
rf.fit(X_train_cod, y_train)

In [57]:
rf.feature_importances_

array([0.        , 0.00989786, 0.00306819, 0.00216456, 0.00404909,
       0.00563471, 0.00107658, 0.00119637, 0.00146869, 0.00149616,
       0.0005221 , 0.00173337, 0.00116795, 0.00439751, 0.0007423 ,
       0.0005829 , 0.00079686, 0.00101943, 0.00026098, 0.00095076,
       0.00015588, 0.00581333, 0.00385711, 0.0151038 , 0.01838758,
       0.00695825, 0.0156637 , 0.01134295, 0.00363367, 0.00834582,
       0.0071097 , 0.00796204, 0.01428231, 0.01114811, 0.00959409,
       0.01072526, 0.00397334, 0.00922655, 0.00261692, 0.00429794,
       0.00820423, 0.00420623, 0.00582147, 0.0046914 , 0.00757952,
       0.00652568, 0.00612687, 0.0079216 , 0.00329559, 0.00353846,
       0.00542293, 0.00560721, 0.01000432, 0.00493863, 0.00512086,
       0.00492237, 0.00687372, 0.00502728, 0.00649631, 0.00457613,
       0.00186857, 0.00837481, 0.00719787, 0.00591286, 0.00704318,
       0.00365068, 0.0057426 , 0.00440407, 0.00462872, 0.00992412,
       0.00616739, 0.00557593, 0.00363271, 0.00124538, 0.00662

In [58]:
print(f'Importances greater than zero: {(rf.feature_importances_>0).sum()}')

Importances greater than zero: 230


In [59]:
print('================ Random Forest Results =================\n')
print(classification_report(y_test, rf.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.56      0.62      0.59        50
           1       0.87      0.84      0.85       150

    accuracy                           0.79       200
   macro avg       0.72      0.73      0.72       200
weighted avg       0.79      0.79      0.79       200



In [60]:
gbt = GradientBoostingClassifier( max_depth=2)
gbt.fit(X_train_cod, y_train)

In [61]:
gbt.feature_importances_

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.39347352e-02,
       1.45182677e-03, 2.21691766e-02, 1.32013404e-01, 1.86747129e-02,
       0.00000000e+00, 1.56628888e-02, 0.00000000e+00, 0.00000000e+00,
       6.32481421e-02, 3.18511703e-02, 1.27724793e-02, 1.11494926e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.47754538e-03, 0.00000000e+00,
       1.75061732e-02, 2.56302319e-02, 0.00000000e+00, 3.88373701e-02,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.13251584e-03,
       1.41513432e-02, 0.00000000e+00, 0.00000000e+00, 3.06649711e-03,
      

In [62]:
print(f'Importances greater than zero: {(gbt.feature_importances_>0).sum()}')

Importances greater than zero: 95


In [63]:
print('================ Gradient Boosting Trees Results =================\n')
print(classification_report(y_test, gbt.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.57      0.58      0.57        50
           1       0.86      0.85      0.86       150

    accuracy                           0.79       200
   macro avg       0.71      0.72      0.72       200
weighted avg       0.79      0.79      0.79       200



# Wrapper Methods

In [64]:
from sklearn.feature_selection import SequentialFeatureSelector, RFECV, RFE
from sklearn.neighbors import KNeighborsClassifier

## Stepwise selection

### Forward

In [65]:
rf = RandomForestClassifier(n_estimators=25)

In [66]:
n_features = 5

In [67]:
forward_stepwise = SequentialFeatureSelector(rf,n_features_to_select=n_features, direction='forward', cv=2 )

In [68]:
%%time
forward_stepwise.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features  duration:7.37s

CPU times: user 10.1 s, sys: 64.1 ms, total: 10.2 s
Wall time: 10.3 s


In [69]:
X_train_cod1.shape

(800, 20)

In [70]:
forward_stepwise.get_support()

array([ True, False,  True, False, False,  True, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False,  True])
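
Forward selection is a greedy search: it starts from an empty set and, at each step, adds the single feature that most improves the score. The following is a minimal sketch of that loop, where `score` is a hypothetical callback standing in for the cross-validated estimator score that `SequentialFeatureSelector` uses; the toy score is invented for illustration.

```python
def forward_select(n_features, k, score):
    """Greedy forward selection: start empty, add the best feature k times."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k:
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Hypothetical usefulness per feature; features 2 and 3 are redundant with
# each other, so having both in the subset is penalized.
usefulness = [0.1, 0.9, 0.5, 0.4]


def toy_score(subset):
    penalty = 0.5 if {2, 3} <= set(subset) else 0.0
    return sum(usefulness[j] for j in subset) - penalty


chosen = forward_select(4, 3, toy_score)
print(chosen)
```

Feature 3 looks useful on its own but is skipped once feature 2 is in the subset, which is the kind of redundancy a univariate filter would not detect.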

In [71]:
X_train_frw = forward_stepwise.transform(X_train_cod1)
X_test_frw = forward_stepwise.transform(X_test_cod1)

In [72]:
X_train_frw.shape

(800, 5)

In [73]:
rf_frw = RandomForestClassifier(n_estimators=25)
rf_frw.fit(X_train_frw, y_train)

In [74]:
print('================ Forward Stepwise Random Forest Results =================\n')
print(classification_report(y_test, rf_frw.predict(X_test_frw)))


              precision    recall  f1-score   support

           0       0.51      0.58      0.54        50
           1       0.85      0.81      0.83       150

    accuracy                           0.76       200
   macro avg       0.68      0.70      0.69       200
weighted avg       0.77      0.76      0.76       200



### Backward

In [75]:
rf = RandomForestClassifier(n_estimators=25)

In [76]:
n_features = 5

In [77]:
backward_stepwise = SequentialFeatureSelector(rf,n_features_to_select=n_features, direction='backward', cv=2)

In [78]:
%%time
backward_stepwise.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features duration: 17s

CPU times: user 24.5 s, sys: 125 ms, total: 24.6 s
Wall time: 24.9 s


In [79]:
backward_stepwise.get_support()

array([ True, False, False, False, False,  True, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
        True, False])
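
Backward selection took noticeably longer than forward selection here, and a rough count of model fits suggests why. The sketch below assumes that each step evaluates every remaining candidate with one cross-validation, which is how greedy sequential selection is usually described; the function name is hypothetical.

```python
def sfs_model_fits(n_features, n_select, cv, direction):
    """Count estimator fits: one cross-validation per candidate per step."""
    if direction == "forward":
        n_steps = n_select               # features are added one at a time
    else:
        n_steps = n_features - n_select  # features are removed one at a time
    candidates = sum(n_features - step for step in range(n_steps))
    return candidates * cv


forward_fits = sfs_model_fits(20, 5, cv=2, direction="forward")
backward_fits = sfs_model_fits(20, 5, cv=2, direction="backward")
print(forward_fits, backward_fits)
```

With 20 features and 5 to keep, backward selection needs roughly twice as many fits, and each of them is trained on more features. When the target subset is small relative to the total, forward selection is therefore usually the cheaper direction.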

In [80]:
X_train_bkw = backward_stepwise.transform(X_train_cod1)
X_test_bkw = backward_stepwise.transform(X_test_cod1)

In [81]:
rf_bkw = RandomForestClassifier(n_estimators=25)
rf_bkw.fit(X_train_bkw, y_train)

In [82]:
print('================ Backward Stepwise Random Forest Results =================\n')
print(classification_report(y_test, rf_bkw.predict(X_test_bkw)))


              precision    recall  f1-score   support

           0       0.50      0.58      0.54        50
           1       0.85      0.81      0.83       150

    accuracy                           0.75       200
   macro avg       0.68      0.69      0.68       200
weighted avg       0.76      0.75      0.76       200



## Recursive Feature Elimination

In [83]:
rf = RandomForestClassifier(n_estimators=25)

In [84]:
n_features = 5

In [85]:
rfe = RFECV(rf,min_features_to_select=n_features, cv=2)

In [86]:
%%time 
rfe.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features duration:1.73s

CPU times: user 2.74 s, sys: 21.1 ms, total: 2.77 s
Wall time: 2.8 s


In [87]:
rfe.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False, False, False,  True, False,
       False, False])

In [88]:
X_train_rfe = rfe.transform(X_train_cod1)
X_test_ref = rfe.transform(X_test_cod1)

In [89]:
X_train_rfe.shape

(800, 13)

In [90]:
rf_rfe = RandomForestClassifier(n_estimators=25)
rf_rfe.fit(X_train_rfe, y_train)

In [91]:
print('================ Recursive Feature Elimination Random Forests Results =================\n')
print(classification_report(y_test, rf_rfe.predict(X_test_ref)))


              precision    recall  f1-score   support

           0       0.57      0.54      0.56        50
           1       0.85      0.87      0.86       150

    accuracy                           0.79       200
   macro avg       0.71      0.70      0.71       200
weighted avg       0.78      0.79      0.78       200



Without cross-validation (plain `RFE` with a fixed number of features)

In [92]:
rf = RandomForestClassifier(n_estimators=25)

In [93]:
n_features = 5

In [94]:
rfe = RFE(rf,n_features_to_select=n_features)

In [95]:
%%time 
rfe.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features duration:1.73s

CPU times: user 1.1 s, sys: 5.99 ms, total: 1.11 s
Wall time: 1.12 s


In [96]:
rfe.get_support()

array([ True,  True, False,  True,  True, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False])
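
Recursive feature elimination refits the model on the surviving features, drops the least important one, and repeats until the requested number remains. The sketch below shows that loop with a hypothetical `importance` callback standing in for refitting the estimator and reading `feature_importances_`; the toy importances are invented for illustration.

```python
def recursive_elimination(n_features, k, importance):
    """Drop the least important feature until k remain.

    `importance(active)` must return one score per active feature and is
    called again on every round (in RFE this means refitting the model).
    """
    active = list(range(n_features))
    while len(active) > k:
        scores = importance(active)
        weakest = min(range(len(active)), key=lambda i: scores[i])
        del active[weakest]
    return active


# Hypothetical importances: fixed base scores, except that feature 3 only
# matters while feature 0 is still present.
base = {0: 0.05, 1: 0.4, 2: 0.3, 3: 0.25}


def toy_importance(active):
    return [0.0 if (f == 3 and 0 not in active) else base[f] for f in active]


kept = recursive_elimination(4, 2, toy_importance)
print(kept)
```

Recomputing the importances after every removal is what distinguishes this from ranking the features once. In `RFECV`, the number of features to keep is chosen by cross-validation and `min_features_to_select` is only a lower bound, which is why the earlier `RFECV` run kept 13 features rather than 5.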

In [97]:
X_train_rfe = rfe.transform(X_train_cod1)
X_test_ref = rfe.transform(X_test_cod1)

In [98]:
X_train_rfe.shape

(800, 5)

In [99]:
rf_rfe = RandomForestClassifier(n_estimators=25)
rf_rfe.fit(X_train_rfe, y_train)

In [100]:
print('================ Recursive Feature Elimination Random Forests Results =================\n')
print(classification_report(y_test, rf_rfe.predict(X_test_ref)))


              precision    recall  f1-score   support

           0       0.37      0.42      0.39        50
           1       0.80      0.76      0.78       150

    accuracy                           0.68       200
   macro avg       0.58      0.59      0.59       200
weighted avg       0.69      0.68      0.68       200



# Filters

In [101]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

## Filter using Chi2

In [102]:
chi2_best = SelectKBest(chi2, k=5)
chi2_best.fit_transform(X_train_cod1, y_train)

array([[8.74023438e-01, 1.80000000e+01, 1.53000000e+03, 3.00000000e+00,
        3.20000000e+01],
       [6.11944444e-01, 1.80000000e+01, 4.29700000e+03, 4.00000000e+00,
        4.00000000e+01],
       [6.11944444e-01, 1.80000000e+01, 1.23900000e+03, 4.00000000e+00,
        6.10000000e+01],
       ...,
       [6.11944444e-01, 1.80000000e+01, 1.92800000e+03, 2.00000000e+00,
        3.10000000e+01],
       [8.74023438e-01, 6.00000000e+00, 1.54300000e+03, 4.00000000e+00,
        3.30000000e+01],
       [8.74023438e-01, 1.20000000e+01, 1.18500000e+03, 3.00000000e+00,
        2.70000000e+01]])

In [103]:
chi2_best.fit_transform(X_train_cod1, y_train).shape

(800, 5)

In [104]:
chi2_best.get_support()

array([ True,  True, False, False,  True, False, False,  True, False,
       False, False, False,  True, False, False, False, False, False,
       False, False])
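
The chi-squared filter compares, for each class, the observed total of a feature with the total expected if the feature were independent of the class. The following is a pure-Python sketch of that statistic for one non-negative feature, assuming the observed-versus-expected construction used for count-like features; the function name and data are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between one non-negative feature and class labels."""
    n = len(labels)
    feature_total = sum(feature)
    score = 0.0
    for cls in set(labels):
        class_prob = sum(1 for y in labels if y == cls) / n
        observed = sum(x for x, y in zip(feature, labels) if y == cls)
        expected = class_prob * feature_total
        score += (observed - expected) ** 2 / expected
    return score


labels = [1, 1, 0, 0]
concentrated = chi2_score([1, 1, 0, 0], labels)  # all mass in class 1
spread = chi2_score([1, 0, 1, 0], labels)        # mass split evenly
rescaled = chi2_score([10, 10, 0, 0], labels)    # same pattern, 10x the scale
print(concentrated, spread, rescaled)
```

The statistic grows with the scale of the feature, so unscaled columns with large values can dominate the ranking. This may partly explain why `duration`, `credit_amount`, and `age` are among the five selected columns above; rescaling the features to a common range before applying `chi2` is one option to consider.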

In [105]:
X_train_chi2 = chi2_best.transform(X_train_cod1)
X_test_chi2 = chi2_best.transform(X_test_cod1)

In [106]:
rf_chi2 = RandomForestClassifier(n_estimators=25)
rf_chi2.fit(X_train_chi2, y_train)

In [107]:
print('================ Chi2 Filter Random Forests Results =================\n')
print(classification_report(y_test, rf_chi2.predict(X_test_chi2)))


              precision    recall  f1-score   support

           0       0.47      0.52      0.50        50
           1       0.83      0.81      0.82       150

    accuracy                           0.73       200
   macro avg       0.65      0.66      0.66       200
weighted avg       0.74      0.73      0.74       200



## Filter using ANOVA F-value

In [108]:
fclass_best = SelectKBest(f_classif, k=15)
fclass_best.fit_transform(X_train_cod1, y_train)

array([[ 0.87402344, 18.        ,  0.82494703, ...,  0.72437284,
         0.69274902,  0.68011364],
       [ 0.61194444, 18.        ,  0.671875  , ...,  0.72437284,
         0.62980769,  0.68011364],
       [ 0.61194444, 18.        ,  0.66358568, ...,  0.58659639,
         0.69274902,  0.68011364],
       ...,
       [ 0.61194444, 18.        ,  0.82494703, ...,  0.72437284,
         0.7241242 ,  0.68011364],
       [ 0.87402344,  6.        ,  0.66358568, ...,  0.72437284,
         0.69274902,  0.68011364],
       [ 0.87402344, 12.        ,  0.82494703, ...,  0.72437284,
         0.69274902,  0.68011364]])

In [109]:
fclass_best.fit_transform(X_train_cod1, y_train).shape

(800, 15)

In [110]:
fclass_best.get_support()

array([ True,  True,  True,  True,  True,  True,  True, False,  True,
        True, False,  True,  True,  True,  True, False,  True, False,
       False,  True])
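
The ANOVA F-value scores a feature by the ratio of the variance between class means to the variance within classes. The following is a minimal sketch of the one-way F statistic for a single feature whose values are grouped by class; the function name and numbers are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical feature values split by class.
f_overlapping = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
f_separated = anova_f([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
print(f_overlapping, f_separated)
```

Unlike the chi-squared score, the F-value does not change when a feature is multiplied by a constant, since numerator and denominator scale together. It captures only differences in class means, so it can miss features whose relation to the target is not a mean shift.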

In [111]:
X_train_fclass = fclass_best.transform(X_train_cod1)
X_test_fclass = fclass_best.transform(X_test_cod1)

In [112]:
rf_fclass = RandomForestClassifier(n_estimators=25)
rf_fclass.fit(X_train_fclass, y_train)

In [113]:
print('================ F-value Filter Random Forests Results =================\n')
print(classification_report(y_test, rf_fclass.predict(X_test_fclass)))


              precision    recall  f1-score   support

           0       0.57      0.54      0.56        50
           1       0.85      0.87      0.86       150

    accuracy                           0.79       200
   macro avg       0.71      0.70      0.71       200
weighted avg       0.78      0.79      0.78       200



## Filter using mutual information

In [114]:
mutual_best = SelectKBest(mutual_info_classif, k=5)
mutual_best.fit_transform(X_train_cod1, y_train)

array([[0.87402344, 0.82494703, 0.56863839, 1.        , 0.67434896],
       [0.61194444, 0.671875  , 0.71226959, 1.        , 0.70710404],
       [0.61194444, 0.66358568, 0.71226959, 1.        , 0.67434896],
       ...,
       [0.61194444, 0.82494703, 0.71226959, 1.        , 0.67434896],
       [0.87402344, 0.66358568, 0.71226959, 1.        , 0.67434896],
       [0.87402344, 0.82494703, 0.71226959, 1.        , 0.67434896]])

In [115]:
mutual_best.fit_transform(X_train_cod1, y_train).shape

(800, 5)

In [116]:
mutual_best.get_support()

array([ True,  True, False, False, False, False, False, False,  True,
       False, False,  True, False,  True, False, False, False, False,
       False, False])
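
Mutual information measures how much knowing a feature reduces uncertainty about the class, and it can detect non-linear dependence. The following is a sketch of the plug-in estimate for two discrete sequences. It is shown only to illustrate the quantity: `mutual_info_classif` uses nearest-neighbor estimators for continuous features and is randomized unless `random_state` is set, so it does not compute this formula directly, and its selected features may vary between runs.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


labels = [0, 0, 1, 1]
mi_dependent = mutual_information([0, 0, 1, 1], labels)    # feature determines label
mi_independent = mutual_information([0, 1, 0, 1], labels)  # feature carries no information
print(mi_dependent, mi_independent)
```

A feature that fully determines a balanced binary label reaches the maximum of ln 2 ≈ 0.693 nats, while an independent feature scores 0.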

In [117]:
X_train_mutual = mutual_best.transform(X_train_cod1)
X_test_mutual = mutual_best.transform(X_test_cod1)

In [118]:
rf_mutual = RandomForestClassifier(n_estimators=25)
rf_mutual.fit(X_train_mutual, y_train)

In [119]:
print('================ Mutual Information Filter Random Forests Results =================\n')
print(classification_report(y_test, rf_mutual.predict(X_test_mutual)))


              precision    recall  f1-score   support

           0       0.42      0.48      0.45        50
           1       0.82      0.78      0.80       150

    accuracy                           0.70       200
   macro avg       0.62      0.63      0.62       200
weighted avg       0.72      0.70      0.71       200

