##**Feature Selection - SequentialFeatureSelector**##



###Instructions:

The notebook can be run linearly with the file train.csv, which is located in the Drive folder named DATASET or can be downloaded from the Kaggle competition page: https://www.kaggle.com/c/petfinder-adoption-prediction/data

##**Installing and importing libraries**##

In [None]:
#Installing lazypredict
!pip install lazypredict

In [None]:
#Installing a pinned version of pandas
!pip install pandas==1.1.0

In [None]:
#Checking the installed pandas version
import pandas as pd
pd.__version__

'1.1.0'
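
Several packages are pinned in this section, so it can help to confirm all of them in one place. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list `PINNED` are illustrative and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages installed with pip in the cells of this section
PINNED = ["lazypredict", "pandas", "folium", "imgaug"]

for name, version in installed_versions(PINNED).items():
    print(f"{name}: {version or 'not installed'}")
```

Unlike `pd.__version__`, this check does not require importing the package first.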

In [None]:
#Installing a pinned version of folium
!pip install folium==0.2.1

In [None]:
#Installing a pinned version of imgaug
!pip install imgaug==0.2.5

In [None]:
#Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(style="darkgrid")
import warnings
warnings.filterwarnings("ignore")
import IPython
import sys
import joblib
sys.modules['sklearn.externals.joblib'] = joblib
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn import svm
import lazypredict
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import lightgbm as lgb

##**Reading the files**##

In [None]:
#Mounting Google Drive in Colab
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#Listing the dataset folder in Drive
!ls '/content/gdrive/My Drive/MONOGRAFIA/DATASET'

 breed_labels.csv		 petfinder-adoption-prediction.zip
 BreedLabels.csv		 PetFinder-BreedLabels.csv
'Clasificación imagenes.ipynb'	 PetFinder-ColorLabels.csv
 color_labels.csv		 PetFinder-StateLabels.csv
 ColorLabels.csv		 state_labels.csv
'Copia de BreedLabels.csv'	 StateLabels.csv
'Copia de ColorLabels.csv'	 test
'Copia de state_labels.csv'	 test_sentiment
'Copia de StateLabels.csv'	 train
 fc9cf8b8d-1.jpg		 train_images
 ImagenesMuestra		 train_metadata
 Imagenes_Org			 train_sentiment


In [None]:
#Reading the training data file
data_set = pd.read_csv('/content/gdrive/My Drive/MONOGRAFIA/DATASET/train/train.csv')
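
Since the instructions mention two possible sources for train.csv (the Drive folder or a download from Kaggle), a small path check can make the read step less fragile. This is only a sketch: `LOCAL_CSV` and `resolve_train_csv` are hypothetical names, and the local file location is an assumption.

```python
from pathlib import Path

# Drive location used in this notebook
DRIVE_CSV = Path("/content/gdrive/My Drive/MONOGRAFIA/DATASET/train/train.csv")
# Hypothetical local copy, e.g. downloaded from the Kaggle competition page
LOCAL_CSV = Path("train.csv")


def resolve_train_csv(candidates=(DRIVE_CSV, LOCAL_CSV)):
    """Return the first candidate path that exists, or raise FileNotFoundError."""
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(
        f"train.csv not found in: {[str(p) for p in candidates]}"
    )
```

The returned path could then be passed to `pd.read_csv` in place of the hard-coded string.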

In [None]:
#Making a copy of the training data so the variables can be modified
data_set_train = data_set.copy()

In [None]:
#Defining the target variable
y = data_set_train[['AdoptionSpeed']]

In [None]:
#Defining the feature matrix x
x = data_set_train[['Type','Age','Breed1','Gender','Color1','Color2','Color3','MaturitySize','FurLength','Vaccinated','Dewormed','Sterilized','Health','Quantity','Fee','State','PhotoAmt']]

##**Feature selection with the KNeighborsClassifier model**##

In [None]:
# Sequential Forward Selection with KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

index_variables = []  # subsets_ of each run (feature indices, names and CV scores)
score = []            # cross-validated accuracy of the final subset of each run
# Run the selector once per target subset size, k = 1..17
for i in range(17):
  sfs = SFS(knn,
            k_features=i+1,
            forward=True,
            floating=False,
            scoring='accuracy',
            cv=4,
            verbose=2)  # n_jobs=-1 was left disabled
  sfs = sfs.fit(x, y)

  score.append(sfs.k_score_)
  index_variables.append(sfs.subsets_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:   22.0s finished

[2021-11-13 02:35:09] Features: 1/1 -- score: 0.27752936179637233
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:   21.8s finished

[2021-11-13 02:35:30] Features: 1/2 -- score: 0.27752936179637233
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:   22.3s finished

[2021-11-13 02:35:53] Features: 2/2 -- score: 0.28146659101978955
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  
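
The log above shows the pattern of forward selection: each step tries every remaining feature (17 candidates, then 16, and so on) and keeps the one that gives the best score. The following is a minimal pure-Python sketch of that greedy loop, not mlxtend's implementation. The names `forward_selection`, `toy_score` and `TOY_GAIN` are illustrative, and the toy scoring function stands in for cross-validated accuracy.

```python
def forward_selection(feature_names, score_fn, k_max):
    """Greedy forward selection.

    At each step, add the single remaining feature that gives the best
    score together with the features already selected.
    Returns {k: (selected_features, score)} for k = 1..k_max.
    """
    selected = []
    remaining = list(feature_names)
    subsets = {}
    for k in range(1, k_max + 1):
        best_feature = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
        subsets[k] = (tuple(selected), score_fn(selected))
    return subsets


# Made-up per-feature gains; NOT derived from the PetFinder data
TOY_GAIN = {"Age": 0.05, "Breed1": 0.04, "Sterilized": 0.06, "Color3": 0.001}


def toy_score(features):
    """Hypothetical score: base rate plus gains, minus a small cost per feature."""
    return 0.25 + sum(TOY_GAIN[f] for f in features) - 0.005 * len(features)


subsets = forward_selection(list(TOY_GAIN), toy_score, k_max=4)
for k, (features, value) in subsets.items():
    print(k, features, round(value, 3))
```

Because each step extends the previous subset, one run up to the largest k already contains every smaller subset. This is consistent with the `subsets_` output shown later in this section, where each run repeats the entries of the shorter runs.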

In [None]:
#Scores of the KNeighborsClassifier feature selection, one per subset size
score

[0.27752936179637233,
 0.28146659101978955,
 0.2702582125777831,
 0.29813868187688897,
 0.3167476819859184,
 0.3276869207099837,
 0.3289539786205528,
 0.3292871517783611,
 0.339958869857291,
 0.34089277240206073,
 0.34382738634251236,
 0.34402752865011604,
 0.34015902995690345,
 0.34195959904498185,
 0.3408257321126972,
 0.33322253063285745,
 0.32548551545442356]
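
The subset size with the highest cross-validated accuracy can be read off this list directly. The sketch below copies the values from the output above and assumes, as in the loop, that position i corresponds to k = i + 1 features.

```python
# Cross-validated accuracies copied from the output above (index i -> k = i + 1)
score = [
    0.27752936179637233,
    0.28146659101978955,
    0.2702582125777831,
    0.29813868187688897,
    0.3167476819859184,
    0.3276869207099837,
    0.3289539786205528,
    0.3292871517783611,
    0.339958869857291,
    0.34089277240206073,
    0.34382738634251236,
    0.34402752865011604,
    0.34015902995690345,
    0.34195959904498185,
    0.3408257321126972,
    0.33322253063285745,
    0.32548551545442356,
]

best_index = max(range(len(score)), key=score.__getitem__)
best_k = best_index + 1
print(best_k, score[best_index])  # 12 0.34402752865011604
```

The maximum falls at k = 12, which matches the 12 features kept later in this section.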

In [None]:
#Indices of the variables and their respective scores from the KNeighborsClassifier feature selection
index_variables[14]

[{1: {'avg_score': 0.278929094716969,
   'cv_scores': array([0.295012  , 0.25320171, 0.30549626, 0.2620064 ]),
   'feature_idx': (11,),
   'feature_names': ('Sterilized',)}},
 {1: {'avg_score': 0.278929094716969,
   'cv_scores': array([0.295012  , 0.25320171, 0.30549626, 0.2620064 ]),
   'feature_idx': (11,),
   'feature_names': ('Sterilized',)},
  2: {'avg_score': 0.28353295492814445,
   'cv_scores': array([0.27393972, 0.25240128, 0.2977588 , 0.31003202]),
   'feature_idx': (2, 11),
   'feature_names': ('Breed1', 'Sterilized')}},
 {1: {'avg_score': 0.278929094716969,
   'cv_scores': array([0.295012  , 0.25320171, 0.30549626, 0.2620064 ]),
   'feature_idx': (11,),
   'feature_names': ('Sterilized',)},
  2: {'avg_score': 0.28353295492814445,
   'cv_scores': array([0.27393972, 0.25240128, 0.2977588 , 0.31003202]),
   'feature_idx': (2, 11),
   'feature_names': ('Breed1', 'Sterilized')},
  3: {'avg_score': 0.3047426485554455,
   'cv_scores': array([0.29821286, 0.32710779, 0.29108858, 0.30

In [None]:
#Indices of the variables from the KNeighborsClassifier feature selection
sfs.k_feature_idx_

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)

In [None]:
#Variables used for the feature selection
sfs.k_feature_names_

('Type',
 'Age',
 'Breed1',
 'Gender',
 'Color1',
 'Color2',
 'Color3',
 'MaturitySize',
 'FurLength',
 'Vaccinated',
 'Dewormed',
 'Sterilized',
 'Health',
 'Quantity',
 'Fee',
 'State',
 'PhotoAmt')

In [None]:
#The 12 features chosen by the feature selection with KNN
('Age',
'Breed1',
'MaturitySize',
'FurLength',
'Vaccinated',
'Dewormed',
'Sterilized',
'Health',
'Quantity',
'Fee',
'State',
'PhotoAmt')

('Age',
 'Breed1',
 'MaturitySize',
 'FurLength',
 'Vaccinated',
 'Dewormed',
 'Sterilized',
 'Health',
 'Quantity',
 'Fee',
 'State',
 'PhotoAmt')
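
To see which of the 17 candidate variables were left out, the selected tuple can be compared with the full list used for `x`. This is a small sketch; the variable names `all_features`, `selected`, `excluded` and `selected_idx` are illustrative.

```python
# The 17 candidate variables, in the order used to build x
all_features = [
    'Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'Color3',
    'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized',
    'Health', 'Quantity', 'Fee', 'State', 'PhotoAmt',
]

# The 12 features listed above
selected = (
    'Age', 'Breed1', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
    'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'PhotoAmt',
)

# Variables not chosen, and the column positions of the chosen ones
excluded = [f for f in all_features if f not in selected]
selected_idx = tuple(all_features.index(f) for f in selected)

print(excluded)      # ['Type', 'Gender', 'Color1', 'Color2', 'Color3']
print(selected_idx)  # (1, 2, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
```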

##**Data preprocessing**##

In [None]:
# Generate dummy (one-hot) variables for the categorical variables
breed1_dummy = pd.get_dummies(data_set_train['Breed1'],prefix='Breed1')
#color1_dummy = pd.get_dummies(data_set_train['Color1'],prefix='Color1')
MaturitySize_dummy = pd.get_dummies(data_set_train['MaturitySize'],prefix='MaturitySize')
FurLength_dummy = pd.get_dummies(data_set_train['FurLength'],prefix='FurLength')
Vaccinated_dummy = pd.get_dummies(data_set_train['Vaccinated'],prefix='Vaccinated')
Dewormed_dummy = pd.get_dummies(data_set_train['Dewormed'],prefix='Dewormed')
Sterilized_dummy = pd.get_dummies(data_set_train['Sterilized'],prefix='Sterilized')
Health_dummy = pd.get_dummies(data_set_train['Health'],prefix='Health')
State_dummy = pd.get_dummies(data_set_train['State'],prefix='State')

In [None]:
# Join the dummy variables to the original data
data_set_train = pd.concat([data_set_train,breed1_dummy],axis=1) # axis=1: concatenate as columns
#data_set_train = pd.concat([data_set_train,color1_dummy],axis=1)
data_set_train = pd.concat([data_set_train,MaturitySize_dummy],axis=1)
data_set_train = pd.concat([data_set_train,FurLength_dummy],axis=1)
data_set_train = pd.concat([data_set_train,Sterilized_dummy],axis=1)
data_set_train = pd.concat([data_set_train,Vaccinated_dummy],axis=1)
data_set_train = pd.concat([data_set_train,Dewormed_dummy],axis=1)
data_set_train = pd.concat([data_set_train,Health_dummy],axis=1)
data_set_train = pd.concat([data_set_train,State_dummy],axis=1)
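
The two cells above turn each categorical code into a set of 0/1 indicator columns named `prefix_value`. The following pure-Python sketch illustrates that encoding on made-up values; it is not how pandas implements `get_dummies`, and `one_hot` and `fur_length` are hypothetical names.

```python
def one_hot(values, prefix):
    """Return (column_names, rows); each row holds one 0/1 flag per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Made-up values for a column coded 1..3, in the style of FurLength
fur_length = [1, 2, 2, 1, 3]
columns, rows = one_hot(fur_length, prefix="FurLength")

print(columns)  # ['FurLength_1', 'FurLength_2', 'FurLength_3']
for row in rows:
    print(row)
```

Each row has exactly one flag set, so the indicator columns carry the same information as the original code without implying an order between categories.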

In [None]:
#Dropping the variables that are not needed to train the model
data_set_final = data_set_train.drop(['Name', 'Gender', 'Sterilized', 'FurLength', 'Vaccinated','Dewormed','Health','Fee', 'Breed1','Breed2','Color1','Color2','Color3','MaturitySize', 'State','RescuerID','VideoAmt','Description','PetID','AdoptionSpeed'], axis=1)
data_set_final

Unnamed: 0,Type,Age,Quantity,PhotoAmt,Breed1_0,Breed1_1,Breed1_3,Breed1_5,Breed1_7,Breed1_10,Breed1_11,Breed1_15,Breed1_16,Breed1_17,Breed1_18,Breed1_19,Breed1_20,Breed1_21,Breed1_23,Breed1_24,Breed1_25,Breed1_26,Breed1_31,Breed1_32,Breed1_39,Breed1_42,Breed1_44,Breed1_49,Breed1_50,Breed1_56,Breed1_58,Breed1_60,Breed1_61,Breed1_64,Breed1_65,Breed1_69,Breed1_70,Breed1_71,Breed1_72,Breed1_75,...,Breed1_301,Breed1_302,Breed1_303,Breed1_304,Breed1_305,Breed1_306,Breed1_307,MaturitySize_1,MaturitySize_2,MaturitySize_3,MaturitySize_4,FurLength_1,FurLength_2,FurLength_3,Sterilized_1,Sterilized_2,Sterilized_3,Vaccinated_1,Vaccinated_2,Vaccinated_3,Dewormed_1,Dewormed_2,Dewormed_3,Health_1,Health_2,Health_3,State_41324,State_41325,State_41326,State_41327,State_41330,State_41332,State_41335,State_41336,State_41342,State_41345,State_41361,State_41367,State_41401,State_41415
0,2,3,1,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,2,1,1,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,1,1,1,7.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,1,4,1,8.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,1,1,1,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14988,2,2,4,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
14989,2,60,2,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
14990,2,2,5,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
14991,2,9,1,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [None]:
#Target variable
y_pet = data_set_train[['AdoptionSpeed']]
y_pet

Unnamed: 0,AdoptionSpeed
0,2
1,0
2,3
3,2
4,2
...,...
14988,2
14989,4
14990,3
14991,4


In [None]:
#Information about the final dataset
data_set_final.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Columns: 213 entries, Type to State_41415
dtypes: float64(1), int64(3), uint8(209)
memory usage: 3.4 MB
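
The column counts in this summary can be cross-checked against the column names printed earlier. The sketch below assumes that all 209 `uint8` columns are indicator columns. The number of `Breed1` indicators is inferred by subtraction, because the full list of names is elided in the output.

```python
# Indicator columns per variable, counted from the column names shown earlier
dummy_counts = {
    "MaturitySize": 4,
    "FurLength": 3,
    "Sterilized": 3,
    "Vaccinated": 3,
    "Dewormed": 3,
    "Health": 3,
    "State": 14,
}

numeric_columns = 4   # Type, Age, Quantity (int64) and PhotoAmt (float64)
uint8_columns = 209   # from data_set_final.info()
total_columns = 213   # from data_set_final.info()

non_breed = sum(dummy_counts.values())
breed1_columns = uint8_columns - non_breed  # inferred, not read from the output

print(non_breed, breed1_columns)                          # 33 176
print(numeric_columns + uint8_columns == total_columns)   # True
```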


In [None]:
#Defining the variables X and y
X = data_set_final
y = y_pet

##**Implementation of classical Machine Learning models**##

In [None]:
#Splitting the data into training and test sets
#Training the models with LazyClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3,random_state=42)

clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

100%|██████████| 29/29 [07:50<00:00, 16.23s/it]


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,0.41,0.33,,0.39,37.97
LGBMClassifier,0.41,0.32,,0.39,2.57
RandomForestClassifier,0.38,0.32,,0.38,3.99
BaggingClassifier,0.36,0.31,,0.36,1.27
AdaBoostClassifier,0.39,0.31,,0.37,2.1
ExtraTreesClassifier,0.35,0.3,,0.35,4.84
NearestCentroid,0.31,0.3,,0.31,0.2
ExtraTreeClassifier,0.34,0.29,,0.34,0.2
DecisionTreeClassifier,0.34,0.29,,0.34,0.29
LinearDiscriminantAnalysis,0.36,0.29,,0.34,0.76
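
In the table above, Balanced Accuracy is lower than Accuracy for every model. Assuming the column follows the usual definition (the mean of per-class recall), this gap appears when a model does well on frequent classes and poorly on rare ones. The sketch below shows the effect on made-up labels; the function names and toy data are illustrative only.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class weighs the same."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up labels: class 4 is frequent, class 0 is rare
y_true = [4, 4, 4, 4, 4, 4, 2, 2, 2, 0]
y_pred = [4, 4, 4, 4, 4, 4, 2, 4, 4, 4]

print(accuracy(y_true, y_pred))                     # 0.7
print(round(balanced_accuracy(y_true, y_pred), 4))  # 0.4444
```

Here the per-class recalls are 1 for class 4, 1/3 for class 2 and 0 for class 0, which averages to 4/9 even though 7 of 10 predictions are correct.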


##**The two best models with hyperparameters**##

##**1. XGBClassifier**##

In [None]:
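#Training the XGBClassifier model
xgb1 = XGBClassifier(
  learning_rate=0.1,
  n_estimators=100,
  max_depth=3,
  min_child_weight=1,
  gamma=0,
  silent=True,                  # listed as possibly unused in the message below
  subsample=0.8,
  colsample_bytree=0.8,
  objective='binary:logistic',  # the fitted model below reports 'multi:softprob' for this multi-class target
  nthread=4,
  scale_pos_weight=1,           # listed as possibly unused in the message below
  seed=27)

xgb1.fit(X_train, y_train)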

Parameters: { scale_pos_weight, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, nthread=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=27, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
              subsample=0.8, tree_method='exact', validate_parameters=1,
              verbosity=None)
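
The fitted model reports `objective='multi:softprob'`, which produces one probability per class; the predicted label is then the class with the highest probability. The sketch below illustrates only that last step, a softmax followed by an argmax, on made-up scores. It is not XGBoost's internal code, and `softmax`, `predict_class` and `raw_scores` are hypothetical names.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up raw scores for the five AdoptionSpeed classes (0..4)
raw_scores = [-1.2, 0.1, 0.8, 0.3, 1.5]

probs = softmax(raw_scores)
print([round(p, 3) for p in probs])
print(predict_class(raw_scores))  # 4
```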

In [None]:
#Predictions on the test set
predict = xgb1.predict(X_test)
predict

array([4, 2, 4, ..., 4, 2, 4])

In [None]:
#Accuracy of the XGBClassifier model
accuracy = accuracy_score(y_test, predict)
print('XGBoost Model accuracy score: {0:0.4f}'.format(accuracy))

XGBoost Model accuracy score: 0.4002


##**2. LGBMClassifier**##

In [None]:
#Training the LGBMClassifier model
clf = lgb.LGBMClassifier(colsample_bytree=0.45, learning_rate=0.1, max_depth=8,
               min_child_weight=20.0, n_estimators=100, num_leaves=5,
               random_state=1, reg_lambda=2.0, subsample=0.99,
               subsample_freq=6)
clf.fit(X_train, y_train)

LGBMClassifier(colsample_bytree=0.45, max_depth=8, min_child_weight=20.0,
               num_leaves=5, random_state=1, reg_lambda=2.0, subsample=0.99,
               subsample_freq=6)

In [None]:
#Predictions on the test set
y_pred=clf.predict(X_test)
y_pred

array([4, 2, 4, ..., 4, 2, 4])

In [None]:
#Accuracy of the LGBMClassifier model
accuracy = accuracy_score(y_test, y_pred)
print('LightGBM Model accuracy score: {0:0.4f}'.format(accuracy))

LightGBM Model accuracy score: 0.4008
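
A single accuracy figure does not show which classes are being confused with each other. A confusion matrix makes that visible. The following is a minimal pure-Python sketch on made-up labels; the names and data are illustrative and do not come from the models above.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts true label i predicted as j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up labels for the five AdoptionSpeed classes (0..4)
labels = [0, 1, 2, 3, 4]
y_true = [4, 4, 2, 2, 0, 3, 1, 4]
y_pred = [4, 4, 4, 2, 4, 2, 2, 4]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Accuracy is the sum of the diagonal divided by the number of samples
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct / len(y_true))  # 0.5
```

Rows are true classes and columns are predicted classes, so off-diagonal entries in a column show which classes the model tends to over-predict.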
