# Which customers are happy customers?      
  
From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

[Santander Bank](https://www.santanderbank.com/us/personal) is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.  

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/4986/media/santander_custsat_red.png)

## Set the environments for the experiment

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np

## Read data from the given files

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
X = train.iloc[:,:-1]
Y = train.TARGET

## Setting features to use

In [5]:
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression

In [6]:
# Select features in the top 30% by f_regression 
top = 30
f_regression_selected = SelectPercentile(f_regression, 
                                         percentile=top)\
                        .fit(X, Y)\
                        .get_support()         # get a mask

In [7]:
selected = f_regression_selected
features = [f for f,s in zip(X.columns, selected)]

In [8]:
# Extract only selected features
X_sel = X[features]

## Learning the model

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_jobs=-1, n_estimators=1000)
model.fit(train[features], train["TARGET"])
pred = model.predict_proba(test[features])

## Making a data frame to submit

In [10]:
submit = pd.DataFrame()
submit["ID"] = test["ID"]
submit["TARGET"] = pd.DataFrame(pred)[1]

## Saving the result data frame to the submission file.

In [11]:
from time import strftime, localtime

current_time = strftime("%Y.%m.%d %H.%M.%S", localtime())

submit.to_csv("RandomForestClassifier %s.csv" % current_time)

## Scoring function  

Submissions are evaluated on [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target.
The score is measured by 5-fold cross validation.

In [12]:
# dataframe to Ndarray
X = X.as_matrix()
Y = Y.as_matrix()
X_sel = X_sel.as_matrix()

In [13]:
from sklearn import cross_validation
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import StratifiedKFold

list_auc = list()

for train_idx, test_idx in StratifiedKFold(Y, n_folds=5, shuffle=True):
    X_train, X_test = X_sel[train_idx], X_sel[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r = model.fit(X_train, Y_train)
    auc = roc_auc_score(Y_test, model.predict_proba(X_test)[:,1])
    list_auc.append(auc)

score = sum(list_auc)/len(list_auc)

In [14]:
print(score)

0.794303294138
