# Competition

During a marketing campaign, a bank calls potential clients and offers them various products. The task is to predict whether a client will accept the bank's offer.

The training set contains clients, their attributes, and their responses. The test set contains only the client attributes, and the task is to predict whether each client will accept. Because the threshold for choosing one answer or the other may vary, you only need to output the values of the decision function; submissions are then scored with the AUC-ROC metric.
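AUC-ROC depends only on how the scores rank the clients, so any monotone rescaling of the decision function leaves it unchanged, while thresholding the scores into labels discards ranking information. The sketch below illustrates this with a small hand-written pairwise AUC; the helper `pairwise_auc` and the toy labels and scores are made up for illustration and are not part of the competition code.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = client agreed, 0 = client refused
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc_raw = pairwise_auc(y_true, scores)
# A monotone transformation keeps the ranking, so the AUC is identical
auc_scaled = pairwise_auc(y_true, [100 * s + 3 for s in scores])
# Thresholding at 0.5 collapses the scores into two values and creates ties
auc_labels = pairwise_auc(y_true, [1 if s >= 0.5 else 0 for s in scores])

print(auc_raw, auc_scaled, auc_labels)
```

On this toy example the thresholded labels score lower than the raw scores, which is why the submission asks for decision-function values rather than 0/1 answers.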

In [None]:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
 related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone') 
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
 other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
 social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
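
Two of the notes above have practical consequences: `duration` is only known after the call, and `pdays` uses 999 as a sentinel rather than a real number of days. The notebook below keeps both columns unchanged; the following is only a sketch of one possible preprocessing on a hypothetical three-row frame, where the column name `was_contacted` is an illustrative assumption.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the dataset
toy = pd.DataFrame({
    "age": [26, 46, 49],
    "duration": [901, 208, 131],
    "pdays": [999, 3, 999],
})

# 999 is a sentinel for "never contacted before", not a real number of days
toy["was_contacted"] = (toy["pdays"] != 999).astype(int)

# duration is only known after the call, so a realistic model should not see it
realistic = toy.drop(columns=["duration"])

print(realistic)
```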

In [24]:
import pandas as pd
import numpy as np

Load and prepare the data:

In [25]:
X = pd.read_csv('train_data.csv')
X.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,26,student,single,high.school,no,no,no,telephone,jun,mon,901,1,999,0,nonexistent,1.4,94.465,-41.8,4.961,5228.1
1,46,admin.,married,university.degree,no,yes,no,cellular,aug,tue,208,2,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1
2,49,blue-collar,married,basic.4y,unknown,yes,yes,telephone,jun,tue,131,5,999,0,nonexistent,1.4,94.465,-41.8,4.864,5228.1
3,31,technician,married,university.degree,no,no,no,cellular,jul,tue,404,1,999,0,nonexistent,-2.9,92.469,-33.6,1.044,5076.2
4,42,housemaid,married,university.degree,no,yes,no,telephone,nov,mon,85,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8


In [26]:
y_train = pd.read_csv('train_target.csv', header = None)
y_train.head()

Unnamed: 0,0
0,1
1,0
2,0
3,0
4,0


Convert the categorical features into binary (one-hot) features:

In [27]:
X_num = X.select_dtypes(exclude=['object'])
X_num.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,26,901,1,999,0,1.4,94.465,-41.8,4.961,5228.1
1,46,208,2,999,0,1.4,93.444,-36.1,4.963,5228.1
2,49,131,5,999,0,1.4,94.465,-41.8,4.864,5228.1
3,31,404,1,999,0,-2.9,92.469,-33.6,1.044,5076.2
4,42,85,1,999,0,-0.1,93.2,-42.0,4.191,5195.8


In [28]:
X_non_num = X.select_dtypes(include=['object'])
X_non_num.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome
0,student,single,high.school,no,no,no,telephone,jun,mon,nonexistent
1,admin.,married,university.degree,no,yes,no,cellular,aug,tue,nonexistent
2,blue-collar,married,basic.4y,unknown,yes,yes,telephone,jun,tue,nonexistent
3,technician,married,university.degree,no,no,no,cellular,jul,tue,nonexistent
4,housemaid,married,university.degree,no,yes,no,telephone,nov,mon,nonexistent


In [29]:
binom = pd.get_dummies(X_non_num)
binom.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [30]:
X_train = pd.concat([X_num, binom], axis=1)
X_train.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,26,901,1,999,0,1.4,94.465,-41.8,4.961,5228.1,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,46,208,2,999,0,1.4,93.444,-36.1,4.963,5228.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,49,131,5,999,0,1.4,94.465,-41.8,4.864,5228.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,31,404,1,999,0,-2.9,92.469,-33.6,1.044,5076.2,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,42,85,1,999,0,-0.1,93.2,-42.0,4.191,5195.8,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [31]:
X_test = pd.read_csv('test_data.csv')
X_test=X_test.drop(X_test.columns[[0]], axis=1)
X_test.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,49,admin.,divorced,university.degree,no,no,no,cellular,aug,tue,126,1,999,0,nonexistent,1.4,93.444,-36.1,4.968,5228.1
1,31,management,single,university.degree,no,no,no,cellular,aug,mon,1099,2,999,0,nonexistent,1.4,93.444,-36.1,4.965,5228.1
2,36,services,divorced,university.degree,no,no,no,cellular,jul,mon,407,1,999,0,nonexistent,1.4,93.918,-42.7,4.96,5228.1
3,26,blue-collar,single,basic.9y,no,yes,no,cellular,jul,tue,109,1,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228.1
4,41,services,divorced,basic.9y,no,no,no,telephone,jun,thu,147,1,999,0,nonexistent,1.4,94.465,-41.8,4.961,5228.1


In [32]:
X_num_test = X_test.select_dtypes(exclude=['object'])
X_num_test.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,49,126,1,999,0,1.4,93.444,-36.1,4.968,5228.1
1,31,1099,2,999,0,1.4,93.444,-36.1,4.965,5228.1
2,36,407,1,999,0,1.4,93.918,-42.7,4.96,5228.1
3,26,109,1,999,0,1.4,93.918,-42.7,4.962,5228.1
4,41,147,1,999,0,1.4,94.465,-41.8,4.961,5228.1


In [33]:
X_non_num_test = X_test.select_dtypes(include=['object'])
X_non_num_test.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome
0,admin.,divorced,university.degree,no,no,no,cellular,aug,tue,nonexistent
1,management,single,university.degree,no,no,no,cellular,aug,mon,nonexistent
2,services,divorced,university.degree,no,no,no,cellular,jul,mon,nonexistent
3,blue-collar,single,basic.9y,no,yes,no,cellular,jul,tue,nonexistent
4,services,divorced,basic.9y,no,no,no,telephone,jun,thu,nonexistent


In [34]:
binom_test = pd.get_dummies(X_non_num_test)
binom_test.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [35]:
X_test = pd.concat([X_num_test, binom_test], axis=1)
X_test.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,49,126,1,999,0,1.4,93.444,-36.1,4.968,5228.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,31,1099,2,999,0,1.4,93.444,-36.1,4.965,5228.1,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,36,407,1,999,0,1.4,93.918,-42.7,4.96,5228.1,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,26,109,1,999,0,1.4,93.918,-42.7,4.962,5228.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,41,147,1,999,0,1.4,94.465,-41.8,4.961,5228.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
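
Encoding the training and test sets separately with `pd.get_dummies` yields matching columns only when both sets contain the same categories. A minimal sketch of one way to guard against a mismatch, using hypothetical toy frames in which the test set lacks the category `'yes'`:

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks the category 'yes'
train_cat = pd.DataFrame({"default": ["no", "yes", "unknown"]})
test_cat = pd.DataFrame({"default": ["no", "unknown", "no"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Reindex the test columns to the training layout; missing dummies become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

After the reindex both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.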


In [36]:
X_train["Target"]= y_train
X_test["Target"]= np.nan
X_train.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,Target
0,26,901,1,999,0,1.4,94.465,-41.8,4.961,5228.1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1
1,46,208,2,999,0,1.4,93.444,-36.1,4.963,5228.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
2,49,131,5,999,0,1.4,94.465,-41.8,4.864,5228.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
3,31,404,1,999,0,-2.9,92.469,-33.6,1.044,5076.2,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
4,42,85,1,999,0,-0.1,93.2,-42.0,4.191,5195.8,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0


In [37]:
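# DataFrame.append was removed in pandas 2.0; sort=True keeps the alphabetical column order shown below
Data=pd.concat([X_train, X_test], sort=True)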
Data.head()

Unnamed: 0,Target,age,campaign,cons.conf.idx,cons.price.idx,contact_cellular,contact_telephone,day_of_week_fri,day_of_week_mon,day_of_week_thu,...,month_may,month_nov,month_oct,month_sep,nr.employed,pdays,poutcome_failure,poutcome_nonexistent,poutcome_success,previous
0,1.0,26,1,-41.8,94.465,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,5228.1,999,0.0,1.0,0.0,0
1,0.0,46,2,-36.1,93.444,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5228.1,999,0.0,1.0,0.0,0
2,0.0,49,5,-41.8,94.465,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5228.1,999,0.0,1.0,0.0,0
3,0.0,31,1,-33.6,92.469,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5076.2,999,0.0,1.0,0.0,0
4,0.0,42,1,-42.0,93.2,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,5195.8,999,0.0,1.0,0.0,0


In [38]:
X_train= Data[Data['Target'].notnull()]
y_train=Data[Data['Target'].notnull()]['Target']

Find the features that are most strongly correlated with the target:

In [39]:
corr = abs(X_train.corr('pearson')["Target"])
corr

Target                           1.000000
age                              0.032164
campaign                         0.071472
cons.conf.idx                    0.051229
cons.price.idx                   0.136918
contact_cellular                 0.146781
contact_telephone                0.146781
day_of_week_fri                  0.004539
day_of_week_mon                  0.020609
day_of_week_thu                  0.010291
day_of_week_tue                  0.009137
day_of_week_wed                  0.005832
default_no                       0.102057
default_unknown                  0.101981
default_yes                      0.003719
duration                         0.408124
education_basic.4y               0.009280
education_basic.6y               0.025658
education_basic.9y               0.048707
education_high.school            0.004045
education_illiterate             0.010091
education_professional.course    0.002344
education_university.degree      0.047182
education_unknown                0

Drop Target so that its correlation of 1 with itself does not distort the mean correlation computed next:

In [40]:
corr1= corr.drop('Target')  # drop by label rather than by position
corr1.head()

age                 0.032164
campaign            0.071472
cons.conf.idx       0.051229
cons.price.idx      0.136918
contact_cellular    0.146781
Name: Target, dtype: float64

In [41]:
cor_mean=np.mean(corr1)
cor_mean

0.07570834693573818

In [42]:
# Features whose absolute correlation with Target is below the mean
nocorr=corr1[corr1 < cor_mean].index.tolist()
print(nocorr)

['age', 'campaign', 'cons.conf.idx', 'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'default_yes', 'education_basic.4y', 'education_basic.6y', 'education_basic.9y', 'education_high.school', 'education_illiterate', 'education_professional.course', 'education_university.degree', 'education_unknown', 'housing_no', 'housing_unknown', 'housing_yes', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_self-employed', 'job_services', 'job_technician', 'job_unemployed', 'job_unknown', 'loan_no', 'loan_unknown', 'loan_yes', 'marital_divorced', 'marital_married', 'marital_single', 'marital_unknown', 'month_aug', 'month_jul', 'month_jun', 'month_nov', 'poutcome_failure']
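
The selection rule used here (drop every feature whose absolute Pearson correlation with the target is below the mean of those correlations) can be sketched on a hypothetical toy frame. The column names `strong` and `weak` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'strong' tracks the target, 'weak' does not
toy = pd.DataFrame({
    "Target": [0, 0, 1, 1, 0, 1],
    "strong": [1, 2, 8, 9, 2, 7],
    "weak":   [5, 1, 4, 2, 3, 3],
})

# Absolute Pearson correlation of every feature with the target
abs_corr = toy.corr("pearson")["Target"].drop("Target").abs()
threshold = abs_corr.mean()

# Features below the mean correlation are treated as uninformative
weak_features = abs_corr[abs_corr < threshold].index.tolist()
selected = toy.drop(columns=weak_features)

print(abs_corr)
print(weak_features)
```

Note that a mean-based threshold is a heuristic: it always discards a sizeable share of the features regardless of how informative they are in absolute terms, and Pearson correlation only captures linear relationships.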


We now build a new dataframe that keeps only the most important features, i.e. those whose absolute correlation with Target is not below the mean correlation:

In [43]:
newdata=Data.drop(nocorr, axis=1)  # pass the column names directly

In [44]:
X_train1= newdata[newdata['Target'].notnull()]
y_train=newdata[newdata['Target'].notnull()]['Target']

In [45]:
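X_train2=X_train1.drop(['Target'],axis=1)
X_train=X_train1.drop(['Target'],axis=1)
# Test rows are those without a known Target (assumes no missing labels in the training data)
X_test1= newdata[newdata['Target'].isnull()]
X_test=X_test1.drop(['Target'],axis=1)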

In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV



In [47]:
forest = RandomForestClassifier(max_depth=2, random_state=14)
forest.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=14,
            verbose=0, warm_start=False)

In [48]:
y_pred= forest.predict(X_test)
y_pred

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

Get class probabilities instead of hard labels:

In [49]:
y_pred_probab=forest.predict_proba(X_test)
y_pred_probab

array([[ 0.95044498,  0.04955502],
       [ 0.77152978,  0.22847022],
       [ 0.95044498,  0.04955502],
       ..., 
       [ 0.95507378,  0.04492622],
       [ 0.92293821,  0.07706179],
       [ 0.70967171,  0.29032829]])

Evaluate the Random Forest Classifier on a hold-out split of our data and search for the best parameters of the classifier:

In [50]:
from sklearn.cross_validation import train_test_split

In [51]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_train, y_train, test_size=0.4, random_state=14)

In [52]:
forest.fit(X_train2,y_train2)
y_pred_train2=forest.predict(X_test2)

In [53]:
score= roc_auc_score(y_test2, y_pred_train2)
score

0.56585251841525808
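
The score above is computed from `forest.predict`, which returns hard 0/1 labels, whereas the competition metric is defined on continuous scores. The following is a minimal sketch of how the two ways of calling `roc_auc_score` differ. It runs on synthetic data generated with `make_classification`, so its numbers are not comparable to the notebook's, and it uses the `sklearn.model_selection` import path of newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=14
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=14, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=14)
model.fit(X_fit, y_fit)

# AUC from hard 0/1 labels: the ranking information is thrown away
auc_from_labels = roc_auc_score(y_val, model.predict(X_val))
# AUC from the probability of the positive class (column 1 of predict_proba)
auc_from_scores = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(auc_from_labels, auc_from_scores)
```

A shallow forest on imbalanced data tends to predict the majority class almost everywhere, which pushes the label-based AUC toward 0.5 even when the probabilities rank clients well.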

In [55]:
def accuracy_on_param(n_est,max_dep):
    param_grid=dict(n_estimators=n_est,max_depth=max_dep)
    cv=StratifiedShuffleSplit(y_train,n_iter=5,test_size=0.4,random_state=1234)
    grid=GridSearchCV(RandomForestClassifier(),param_grid=param_grid,cv=cv)
    grid.fit(X_train,y_train)
    print(grid.best_params_,grid.best_score_)

In [56]:
accuracy_on_param(np.arange(10,100,5),np.arange(10,100,5))

({'n_estimators': 95, 'max_depth': 10}, 0.9152201485776409)
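
`GridSearchCV` without a `scoring` argument ranks candidates by the estimator's default score, which is accuracy for classifiers, while the competition is judged by AUC-ROC. The sketch below shows the same kind of search with `scoring="roc_auc"`. It uses synthetic data and a deliberately tiny grid, both of which are illustrative assumptions. It also uses `sklearn.model_selection`, which replaced `sklearn.cross_validation` and `sklearn.grid_search` in newer scikit-learn releases; there `StratifiedShuffleSplit` takes `n_splits` instead of the labels and `n_iter`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in data; the grid is deliberately tiny to keep the run short
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1234
)

param_grid = {"n_estimators": [10, 30], "max_depth": [2, 5]}
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.4, random_state=1234)

grid = GridSearchCV(
    RandomForestClassifier(random_state=14),
    param_grid=param_grid,
    cv=cv,
    scoring="roc_auc",  # rank candidates by AUC instead of the default accuracy
)
grid.fit(X_demo, y_demo)

print(grid.best_params_, grid.best_score_)
```

The best `max_depth` reported above sits at the lower edge of the searched range, so it may be worth extending the grid toward smaller depths.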


Now we can obtain predictions using the best parameters found:

In [1]:
forest1 = RandomForestClassifier(max_depth=10, n_estimators= 95, random_state=14)
forest1.fit(X_train,y_train)


As probabilities:

In [59]:
y_pred_probab=forest1.predict_proba(X_test)
y_pred_probab

array([[ 0.98613324,  0.01386676],
       [ 0.41348753,  0.58651247],
       [ 0.96649979,  0.03350021],
       ..., 
       [ 0.99591351,  0.00408649],
       [ 0.94284835,  0.05715165],
       [ 0.70741403,  0.29258597]])

In [60]:
y_pred_probab2=y_pred_probab[:,1]
y_pred_probab2

array([ 0.01386676,  0.58651247,  0.03350021, ...,  0.00408649,
        0.05715165,  0.29258597])

In [61]:
dataframe=pd.DataFrame(data=y_pred_probab2,columns=['Prediction'])

Create the submission file in the required format:

In [62]:
dataframe.to_csv('Sub.csv',index=True, index_label='Id')
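
To see the layout that this `to_csv` call produces without touching the disk, the same arguments can be applied to an in-memory buffer. The three probabilities below are hypothetical stand-ins for `y_pred_probab2`.

```python
import io
import pandas as pd

# Hypothetical probabilities standing in for y_pred_probab2
demo_scores = [0.0139, 0.5865, 0.0335]
demo_frame = pd.DataFrame(data=demo_scores, columns=["Prediction"])

# Write to an in-memory buffer instead of a file to inspect the layout
buffer = io.StringIO()
demo_frame.to_csv(buffer, index=True, index_label="Id")
csv_text = buffer.getvalue()

print(csv_text)
```

The row index becomes the `Id` column, so the order of rows in `X_test` must match the order expected by the grader.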

In [63]:
y_pred1=forest1.predict(X_test)
y_pred1.reshape(-1,1)

array([[ 0.],
       [ 1.],
       [ 0.],
       ..., 
       [ 0.],
       [ 0.],
       [ 0.]])

Check the quality on the hold-out split:

In [64]:
forest1.fit(X_train2,y_train2)
y_pred_train3=forest1.predict(X_test2)
score= roc_auc_score(y_test2, y_pred_train3)
score

0.71829545919001037