This Kaggle challenge, proposed by Santander Bank, focuses on identifying dissatisfied customers early on. The idea is to predict the probability that a customer is dissatisfied (TARGET = 1) rather than satisfied (TARGET = 0) using a dataset with 369 anonymized numerical features. This would allow the bank to take proactive steps to improve customer satisfaction before customers decide to leave.

### Data:
- **train.csv**: Contains the training set, including the "TARGET" column, which indicates whether the customer is dissatisfied (1) or satisfied (0).
- **test.csv**: Contains the test set, but without the "TARGET" column.
- **sample_submission.csv**: A sample file showing the correct format for predictions.
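
As a rough illustration of the submission layout, the sketch below writes a two-column file with an `ID` column and a `TARGET` probability column, which is the layout the notebook's own submission code produces. The IDs and probabilities are made up for the example.

```python
import csv
import io

# Hypothetical IDs and predicted probabilities, for illustration only.
ids = [2, 5, 6]
probs = [0.03, 0.41, 0.07]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "TARGET"])            # header row
for row_id, p in zip(ids, probs):
    writer.writerow([row_id, f"{p:.6f}"])    # TARGET is a probability, not a 0/1 label

submission_text = buffer.getvalue()
print(submission_text)
```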

### Objective:
Predict the probability that customers in the test set are dissatisfied (TARGET = 1).

### Evaluation:
Predictions are evaluated using the area under the ROC curve (AUC), which measures how well the model distinguishes between satisfied and dissatisfied customers. For each ID in the test set, you must predict the probability of the customer being dissatisfied.
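
One way to read AUC is as the probability that a randomly chosen dissatisfied customer receives a higher score than a randomly chosen satisfied one. The sketch below computes it that way on a handful of made-up labels and scores. The function name `pairwise_auc` is hypothetical, and the double loop is only meant for small illustrations, not for a dataset of this size.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores.
labels = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.90]
auc = pairwise_auc(labels, scores)

# Dividing every score by 10 keeps the ranking, so the AUC does not change.
auc_rescaled = pairwise_auc(labels, [s / 10 for s in scores])
print(auc, auc_rescaled)
```

Because only the ordering of the scores matters, the submitted values do not need to be calibrated probabilities to score well on this metric.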

In summary, the approach is to use anonymized numerical data to build a model that predicts customer dissatisfaction, evaluating the model's performance through AUC.


-----

This code addresses a binary classification problem with the XGBoost algorithm, combining preprocessing, PCA-derived features, and univariate feature selection. Each part of the code is described below:

1. **Library Imports**:
   - `numpy` and `pandas` are standard libraries for data handling.
   - `matplotlib.pyplot` is used for visualizing graphs.
   - `xgboost` is the main library for the XGBoost classification algorithm.
   - `sklearn` provides tools for cross-validation and feature selection.

   ```python
   import numpy as np
   import pandas as pd
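   import matplotlib
   matplotlib.use("Agg")  # Select the non-interactive backend before importing pyplot, so plots are saved instead of displayed
   import matplotlib.pyplot as plt
   # model_selection provides train_test_split (sklearn.cross_validation was removed in scikit-learn 0.20)
   from sklearn import model_selection as cross_validation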
   import xgboost as xgb
   from sklearn.metrics import roc_auc_score
   ```

2. **Data Loading**:
   - Training and test data are loaded from CSV files.
   - `index_col=0` is set to use the first column as the index for each row.

   ```python
   training = pd.read_csv("../input/train.csv", index_col=0)
   test = pd.read_csv("../input/test.csv", index_col=0)
   ```

3. **Data Preprocessing**:
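   - The sentinel value -999999, which appears in the `var3` column, is replaced with that column's most common value (2). Because `DataFrame.replace` is applied to the whole frame, any other column containing -999999 would be changed as well.
   - A column counting the zeros in each row (`n0`) is added as a feature, and the first two principal components from PCA (Principal Component Analysis) of the column-normalized data are appended as `PCA1` and `PCA2`. PCA is used here to add features, not to reduce the number of columns.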

   ```python
   training = training.replace(-999999, 2)  # Sentinel value -> most common value of var3
   X = training.iloc[:, :-1].copy()  # Features (all columns except TARGET); copy so new columns can be added safely
   y = training.TARGET  # Labels

   X['n0'] = (X == 0).sum(axis=1)  # New feature counting zeros per row

   from sklearn.preprocessing import normalize
   from sklearn.decomposition import PCA
   X_normalized = normalize(X, axis=0)
   pca = PCA(n_components=2)
   X_pca = pca.fit_transform(X_normalized)
   X['PCA1'] = X_pca[:, 0]
   X['PCA2'] = X_pca[:, 1]
   ```

4. **Feature Selection**:
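   - Features are scored with two univariate tests from `sklearn`: `chi2`, applied to a binarized copy of the standardized features because `chi2` requires non-negative inputs, and `f_classif`, applied to the raw features. Each test keeps its top 75% of features, and only features kept by both tests are used.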

   ```python
   from sklearn.feature_selection import SelectPercentile, f_classif, chi2
   from sklearn.preprocessing import Binarizer, scale

   X_bin = Binarizer().fit_transform(scale(X))
   selectChi2 = SelectPercentile(chi2, percentile=75).fit(X_bin, y)
   selectF_classif = SelectPercentile(f_classif, percentile=75).fit(X, y)
   
   chi2_selected = selectChi2.get_support()
   f_classif_selected = selectF_classif.get_support()
   selected = chi2_selected & f_classif_selected
   features = [f for f, s in zip(X.columns, selected) if s]
   ```

5. **Dataset Splitting**:
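   - The labeled data is divided into a training part (60%) and a held-out validation part (40%) with a single stratified split, which preserves the class proportions in both parts. This is a hold-out split, not k-fold cross-validation, and `X_test` here is the validation part, not the Kaggle test set.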

   ```python
   X_train, X_test, y_train, y_test = cross_validation.train_test_split(X[features], y, random_state=1301, stratify=y, test_size=0.4)
   ```

6. **XGBoost Model Setup and Training**:
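   - The hyperparameters were chosen by manual search, changing one setting at a time (for example `max_depth`, `subsample`, `min_child_weight`, `colsample_bytree`, and `learning_rate`) and comparing the validation AUC (Area Under the ROC Curve).
   - During training, AUC is monitored on both the training and validation parts, and training stops early once the validation AUC has not improved for 50 rounds.

   ```python
   ratio = float(np.sum(y == 1)) / np.sum(y == 0)  # positives / negatives, passed as scale_pos_weight
   # Early stopping and the evaluation metric are configured on the estimator (xgboost >= 1.6)
   clf = xgb.XGBClassifier(missing=9999999999, max_depth=5, n_estimators=1000, learning_rate=0.1, n_jobs=4, subsample=1.0, colsample_bytree=0.5, min_child_weight=3, scale_pos_weight=ratio, reg_alpha=0.03, random_state=1301, early_stopping_rounds=50, eval_metric="auc")
   clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])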
   ```

7. **Prediction and Evaluation**:
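   - AUC is computed on all labeled rows, that is, the training and validation parts together. Because the model has already seen the training part, this figure is optimistic; the validation AUC reported during training is the more reliable estimate.

   ```python
   # iteration_range limits prediction to the trees up to the best early-stopping round (xgboost >= 1.4)
   print('Overall AUC:', roc_auc_score(y, clf.predict_proba(X[features], iteration_range=(0, clf.best_iteration + 1))[:, 1]))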
   ```

8. **Test Data Preparation and Final Prediction**:
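   - The test data receives the same derived features as the training data: the zero count `n0` and the two PCA components. The PCA fitted on the training data is reused through `transform`, so `PCA1` and `PCA2` mean the same thing in both sets. Column normalization is still computed from the test set's own column norms.
   - The predicted probabilities are saved to a CSV file for submission.

   ```python
   test['n0'] = (test == 0).sum(axis=1)
   test_normalized = normalize(test, axis=0)
   test_pca = pca.transform(test_normalized)  # Project onto the components learned from the training data
   test['PCA1'] = test_pca[:, 0]
   test['PCA2'] = test_pca[:, 1]
   sel_test = test[features]
   # iteration_range limits prediction to the trees up to the best early-stopping round (xgboost >= 1.4)
   y_pred = clf.predict_proba(sel_test, iteration_range=(0, clf.best_iteration + 1))
   submission = pd.DataFrame({"ID": test.index, "TARGET": y_pred[:, 1]})
   submission.to_csv("submission.csv", index=False)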
   ```

9. **Feature Importance**:
   - Finally, the 15 most important features according to the XGBoost model are visualized, and the plot is saved.

   ```python
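   # get_booster() returns the underlying Booster (xgboost >= 0.7); get_fscore() counts how often each feature is used in a split
   ts = pd.Series(clf.get_booster().get_fscore()).sort_values()[-15:]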
   featp = ts.plot(kind='barh', figsize=(6, 10))
   plt.title('XGBoost Feature Importance')
   fig_featp = featp.get_figure()
   fig_featp.savefig('feature_importance_xgb.png', bbox_inches='tight', pad_inches=1)
   ```
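
The stratified hold-out split from step 5 can be sketched with `sklearn.model_selection.train_test_split`, which is where the function lives in current scikit-learn releases. The data below is synthetic (the names `X_demo` and `y_demo` and the 4% positive rate are assumptions chosen to resemble the class imbalance here), so this shows the mechanics only, not the author's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1301)
X_demo = rng.normal(size=(1000, 5))   # synthetic features
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1                       # 4% positives, an assumed imbalance

# stratify=y_demo keeps the share of positives (almost) identical in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=1301
)

print(len(y_tr), len(y_val))
print(y_tr.mean(), y_val.mean())
```

Without `stratify`, a rare class can end up noticeably over- or under-represented in the validation part, which makes the validation AUC noisier.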

 ------

In [10]:

```python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
matplotlib.use("Agg") # Needed to save figures; set before importing pyplot
import matplotlib.pyplot as plt
```

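In [None]:

```python
# model_selection provides train_test_split (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn import model_selection as cross_validation
import xgboost as xgb
from sklearn.metrics import roc_auc_score
```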

In [12]:

```python
training = pd.read_csv("/kaggle/input/santander-customer-satisfaction/train.csv", index_col=0)
test = pd.read_csv("/kaggle/input/santander-customer-satisfaction/test.csv", index_col=0)
```

In [13]:

```python
training
```

| ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | imp_op_var40_ult1 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.170000 | 0 |
| 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.030000 | 0 |
| 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.770000 | 0 |
| 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.970000 | 0 |
| 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 151829 | 2 | 48 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 60926.490000 | 0 |
| 151830 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 118634.520000 | 0 |
| 151835 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74028.150000 | 0 |
| 151836 | 2 | 25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 84278.160000 | 0 |
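
Most entries in the rows shown are zero, which is the property the zeros-per-row feature `n0` captures. A small sketch with pandas on a made-up three-row frame (the column names are borrowed from the dataset, the values are only illustrative):

```python
import pandas as pd

# Tiny illustrative frame; not the real data.
demo = pd.DataFrame(
    {
        "var15": [23, 37, 39],
        "imp_ent_var16_ult1": [0.0, 0.0, 0.0],
        "imp_op_var39_comer_ult1": [0.0, 195.0, 0.0],
    },
    index=pd.Index([1, 8, 10], name="ID"),
)

# (demo == 0) is a boolean frame; summing along axis=1 counts the zeros in each row.
zeros_per_row = (demo == 0).sum(axis=1)
demo = demo.assign(n0=zeros_per_row)
print(demo)
```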


In [14]:

```python
print(training.shape)
print(test.shape)
```

```
(76020, 370)
(75818, 369)
```


In [16]:

```python
training.describe()
```

| | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | imp_op_var40_ult1 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | ... | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 | 76020.0 |
| mean | -1523.199277 | 33.212865 | 86.208265 | 72.363067 | 119.529632 | 3.55913 | 6.472698 | 0.412946 | 0.567352 | 3.160715 | ... | 7.935824 | 1.365146 | 12.21558 | 8.784074 | 31.505324 | 1.858575 | 76.026165 | 56.614351 | 117235.8 | 0.039569 |
| std | 39033.462364 | 12.956486 | 1614.757313 | 339.315831 | 546.266294 | 93.155749 | 153.737066 | 30.604864 | 36.513513 | 95.268204 | ... | 455.887218 | 113.959637 | 783.207399 | 538.439211 | 2013.125393 | 147.786584 | 4040.337842 | 2852.579397 | 182664.6 | 0.194945 |
| min | -999999.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5163.75 | 0.0 |
| 25% | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67870.61 | 0.0 |
| 50% | 2.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 106409.2 | 0.0 |
| 75% | 2.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 118756.3 | 0.0 |
| max | 238.0 | 105.0 | 210000.0 | 12888.03 | 21024.81 | 8237.82 | 11073.57 | 6600.0 | 6600.0 | 8237.82 | ... | 50003.88 | 20385.72 | 138831.63 | 91778.73 | 438329.22 | 24650.01 | 681462.9 | 397884.3 | 22034740.0 | 1.0 |
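
The summary shows a TARGET mean of 0.039569 over 76020 rows, so roughly 4% of customers are dissatisfied. The sketch below works through that arithmetic; the counts are inferred from those two displayed numbers, so treat them as approximate. The workflow passes positives/negatives as `scale_pos_weight`, while XGBoost's parameter documentation suggests negatives/positives as a typical starting value, so both ratios are shown for comparison.

```python
n_rows = 76020            # rows in the training set, from training.shape
target_mean = 0.039569    # mean of TARGET, from training.describe()

n_pos = round(n_rows * target_mean)   # inferred number of dissatisfied customers
n_neg = n_rows - n_pos                # inferred number of satisfied customers

pos_over_neg = n_pos / n_neg          # the `ratio` used in the workflow
neg_over_pos = n_neg / n_pos          # the value XGBoost's documentation suggests as typical

print(n_pos, n_neg)
print(round(pos_over_neg, 4), round(neg_over_pos, 2))
```

Since AUC depends only on the ranking of predictions, the choice between the two weights affects training dynamics more than the scale of the final scores.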


In [None]:

```python
# summary: COMPLETE workflow

# Replace -999999 in var3 column with most common value 2 
# See https://www.kaggle.com/cast42/santander-customer-satisfaction/debugging-var3-999999
# for details
training = training.replace(-999999,2)


# Replace 9999999999 with NaN
# See https://www.kaggle.com/c/santander-customer-satisfaction/forums/t/19291/data-dictionary/111360#post111360
# training = training.replace(9999999999, np.nan)
# training.dropna(inplace=True)
# Leads to validation_0-auc:0.839577

X = training.iloc[:,:-1].copy()  # copy so that new columns are not written into a slice of `training`
y = training.TARGET

# Add zeros per row as extra feature
X['n0'] = (X == 0).sum(axis=1)
# # Add log of var38
# X['logvar38'] = X['var38'].map(np.log1p)
# # Encode var36 as category
# X['var36'] = X['var36'].astype('category')
# X = pd.get_dummies(X)

# Add PCA components as features
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA

X_normalized = normalize(X, axis=0)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_normalized)
X['PCA1'] = X_pca[:,0]
X['PCA2'] = X_pca[:,1]

from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif,chi2
from sklearn.preprocessing import Binarizer, scale

# Percentiles tried, with the number of selected features and the validation AUC
# p = 86 # 308 features validation_1-auc:0.848039
# p = 80 # 284 features validation_1-auc:0.848414
# p = 77 # 267 features validation_1-auc:0.848000
p = 75 # 261 features validation_1-auc:0.848642
# p = 73 # 257 features validation_1-auc:0.848338
# p = 70 # 259 features validation_1-auc:0.848588
# p = 69 # 238 features validation_1-auc:0.848547
# p = 67 # 247 features validation_1-auc:0.847925
# p = 65 # 240 features validation_1-auc:0.846769
# p = 60 # 222 features validation_1-auc:0.848581

# chi2 needs non-negative inputs, so it is run on a binarized copy of the standardized features
X_bin = Binarizer().fit_transform(scale(X))
selectChi2 = SelectPercentile(chi2, percentile=p).fit(X_bin, y)
selectF_classif = SelectPercentile(f_classif, percentile=p).fit(X, y)

chi2_selected = selectChi2.get_support()
chi2_selected_features = [ f for i,f in enumerate(X.columns) if chi2_selected[i]]
print('Chi2 selected {} features {}.'.format(chi2_selected.sum(),
   chi2_selected_features))
f_classif_selected = selectF_classif.get_support()
f_classif_selected_features = [ f for i,f in enumerate(X.columns) if f_classif_selected[i]]
print('F_classif selected {} features {}.'.format(f_classif_selected.sum(),
   f_classif_selected_features))
# Keep only the features selected by both tests
selected = chi2_selected & f_classif_selected
print('Chi2 & F_classif selected {} features'.format(selected.sum()))
features = [ f for f,s in zip(X.columns, selected) if s]
print(features)

X_sel = X[features]

# Stratified 60/40 hold-out split; X_test is the validation part, not the Kaggle test set
X_train, X_test, y_train, y_test = \
  cross_validation.train_test_split(X_sel, y, random_state=1301, stratify=y, test_size=0.4)

# xgboost parameter tuning with p = 75
# recipe: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/19083/best-practices-for-parameter-tuning-on-models/108783#post108783

ratio = float(np.sum(y == 1)) / np.sum(y==0)
# Initial parameters for the parameter exploration
# clf = xgb.XGBClassifier(missing=9999999999,
#                 max_depth = 10,
#                 n_estimators=1000,
#                 learning_rate=0.1, 
#                 nthread=4,
#                 subsample=1.0,
#                 colsample_bytree=0.5,
#                 min_child_weight = 5,
#                 scale_pos_weight = ratio,
#                 seed=4242)

# gives : validation_1-auc:0.845644
# max_depth=8 -> validation_1-auc:0.846341
# max_depth=6 -> validation_1-auc:0.845738
# max_depth=7 -> validation_1-auc:0.846504
# subsample=0.8 -> validation_1-auc:0.844440
# subsample=0.9 -> validation_1-auc:0.844746
# subsample=1.0,  min_child_weight=8 -> validation_1-auc:0.843393
# min_child_weight=3 -> validation_1-auc:0.848534
# min_child_weight=1 -> validation_1-auc:0.846311
# min_child_weight=4 -> validation_1-auc:0.847994
# min_child_weight=2 -> validation_1-auc:0.847934
# min_child_weight=3, colsample_bytree=0.3 -> validation_1-auc:0.847498
# colsample_bytree=0.7 -> validation_1-auc:0.846984
# colsample_bytree=0.6 -> validation_1-auc:0.847856
# colsample_bytree=0.5, learning_rate=0.05 -> validation_1-auc:0.847347
# max_depth=8 -> validation_1-auc:0.847352
# learning_rate = 0.07 -> validation_1-auc:0.847432
# learning_rate = 0.2 -> validation_1-auc:0.846444
# learning_rate = 0.15 -> validation_1-auc:0.846889
# learning_rate = 0.09 -> validation_1-auc:0.846680
# learning_rate = 0.1 -> validation_1-auc:0.847432
# max_depth=7 -> validation_1-auc:0.848534
# learning_rate = 0.05 -> validation_1-auc:0.847347
# 

# Early stopping and the evaluation metric are configured on the estimator (xgboost >= 1.6)
clf = xgb.XGBClassifier(missing=9999999999,
                max_depth = 5,
                n_estimators=1000,
                learning_rate=0.1, 
                n_jobs=4,
                subsample=1.0,
                colsample_bytree=0.5,
                min_child_weight = 3,
                scale_pos_weight = ratio,
                reg_alpha=0.03,
                random_state=1301,
                early_stopping_rounds=50,
                eval_metric="auc")
                
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)])

# iteration_range limits prediction to the trees up to the best early-stopping round (xgboost >= 1.4)
best_range = (0, clf.best_iteration + 1)
# This AUC covers all labelled rows, including the ones used for training, so it is optimistic
print('Overall AUC:', roc_auc_score(y, clf.predict_proba(X_sel, iteration_range=best_range)[:,1]))

test['n0'] = (test == 0).sum(axis=1)
# test['logvar38'] = test['var38'].map(np.log1p)
# # Encode var36 as category
# test['var36'] = test['var36'].astype('category')
# test = pd.get_dummies(test)
test_normalized = normalize(test, axis=0)
# Project the test rows onto the PCA components learned from the training data
test_pca = pca.transform(test_normalized)
test['PCA1'] = test_pca[:,0]
test['PCA2'] = test_pca[:,1]
sel_test = test[features]    
y_pred = clf.predict_proba(sel_test, iteration_range=best_range)

submission = pd.DataFrame({"ID":test.index, "TARGET":y_pred[:,1]})
submission.to_csv("submission.csv", index=False)

mapFeat = dict(zip(["f"+str(i) for i in range(len(features))],features))
# get_booster() returns the underlying Booster (xgboost >= 0.7)
ts = pd.Series(clf.get_booster().get_fscore())
#ts.index = ts.reset_index()['index'].map(mapFeat)

# Plot the 15 features used most often in splits and save the figure
featp = ts.sort_values()[-15:].plot(kind='barh', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
fig_featp = featp.get_figure()
fig_featp.savefig('feature_importance_xgb.png', bbox_inches='tight', pad_inches=1)
```
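
PCA directions are only comparable across datasets when they come from a single fit. The sketch below uses synthetic data (the array names and sizes are made up) to show one way to keep `PCA1` and `PCA2` consistent between training and test rows: fit on the training matrix, then project the test matrix with `transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
train_demo = rng.normal(size=(200, 6))   # synthetic stand-in for the training features
test_demo = rng.normal(size=(50, 6))     # synthetic stand-in for the test features

pca_demo = PCA(n_components=2)
train_pcs = pca_demo.fit_transform(train_demo)   # learn the axes from the training rows only
test_pcs = pca_demo.transform(test_demo)         # project the test rows onto those same axes

# transform() centres with the training mean and projects onto the training components.
manual = (test_demo - pca_demo.mean_) @ pca_demo.components_.T
print(train_pcs.shape, test_pcs.shape, np.allclose(test_pcs, manual))
```

Calling `fit_transform` on the test matrix instead would learn a second set of axes, whose order and sign need not match the ones the model was trained on.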