# <center> Predicting customer gender </center>

### The goal is to determine a customer's gender from their historical transaction data. The quality metric is [ROC AUC](https://dyakonov.org/2017/07/28/auc-roc-%D0%BF%D0%BB%D0%BE%D1%89%D0%B0%D0%B4%D1%8C-%D0%BF%D0%BE%D0%B4-%D0%BA%D1%80%D0%B8%D0%B2%D0%BE%D0%B9-%D0%BE%D1%88%D0%B8%D0%B1%D0%BE%D0%BA/), which is to be maximized.
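ROC AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below spells out that definition in plain Python. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the task data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    examples, so this is meant for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 (positive, negative) pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ordering of scores matters, the metric rewards a model for ranking customers well, regardless of where a decision threshold is placed.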

## File descriptions
- transactions.csv - historical transactions of bank customers
- gender.csv - gender information for some of the customers (null for test customers)
- tr_mcc_codes.csv - transaction MCC codes
- tr_types.csv - transaction types

## Field descriptions
### transactions.csv
- customer_id - customer identifier
- tr_datetime - day and time of the transaction (days are numbered from the start of the data)
- mcc_code - MCC code of the transaction
- tr_type - transaction type
- amount - transaction amount in conventional units; "+" means funds credited to the customer, "-" means funds debited
- term_id - terminal identifier

### gender.csv
- customer_id - customer identifier
- gender - customer gender (empty values mark test customers)

### tr_mcc_codes.csv
- mcc_code - MCC code of the transaction
- mcc_description - description of the MCC code

### tr_types.csv
- tr_type - transaction type
- tr_description - description of the transaction type
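As a rough illustration of how these tables fit together, the sketch below builds tiny in-memory stand-ins for `transactions.csv` and `gender.csv` and joins them on `customer_id`. All values are made up, and the `HH:MM:SS` time format is an assumption; the notebook itself only relies on the day number being the first space-separated token of `tr_datetime`.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the real CSV files (all values are made up).
transactions_csv = io.StringIO(
    "customer_id,tr_datetime,mcc_code,tr_type,amount,term_id\n"
    "1,0 10:23:26,4814,1030,-2245.92,\n"
    "1,1 10:19:29,6011,7010,56147.89,\n"
    "2,1 10:20:56,4829,2330,-56147.89,\n"
)
gender_csv = io.StringIO("customer_id,gender\n1,1.0\n2,\n")

toy_transactions = pd.read_csv(transactions_csv, index_col="customer_id")
toy_gender = pd.read_csv(gender_csv, index_col="customer_id")

# tr_datetime is assumed to look like "<day> <time>"; keep the day as an integer.
toy_transactions["day"] = (
    toy_transactions["tr_datetime"].str.split(" ").str[0].astype(int)
)

# Attach the label to every transaction; unlabeled (test) customers get NaN.
toy_joined = toy_transactions.join(toy_gender, how="left")
labeled = toy_joined[toy_joined["gender"].notna()]
print(labeled[["day", "mcc_code", "amount", "gender"]])
```

A left join keeps every transaction, so the rows with a missing `gender` are exactly the ones that belong to test customers.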

## Tasks:
- Develop a binary classification model that determines customer gender. There are no restrictions on the model: it can be anything from KNN to transformers. The main requirement is that ROC AUC on the held-out test set exceeds 77.5%.
- Interpret the model's results: the importance of its input variables, plus a demonstration on several examples of why a particular prediction was made. The latter makes it possible to tell which gender corresponds to which target value (0/1). Again, the choice of approach is completely free. Useful keywords: gain, permutation importance, SHAP.
- Convert the results into a report without code (ideally, directly to [html](https://stackoverflow.com/questions/49907455/hide-code-when-exporting-jupyter-notebook-to-html))
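Of the interpretation keywords listed in the tasks, permutation importance is perhaps the easiest to try first. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the feature names `informative` and `noise` and the logistic-regression model are assumptions made for the example and are not the notebook's model or features.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 500
informative = rng.normal(size=n)  # drives the label
noise = rng.normal(size=n)        # unrelated to the label
X = np.column_stack([informative, noise])
y = (informative + 0.3 * rng.normal(size=n) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Shuffle one column at a time and measure how much ROC AUC drops.
result = permutation_importance(
    clf, X, y, scoring="roc_auc", n_repeats=10, random_state=0
)
importances = dict(zip(["informative", "noise"], result.importances_mean))
print(importances)
```

In practice the importance would normally be measured on held-out data rather than on the training set, so that it reflects what the model generalizes from.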

#### P.S. Don't forget about [PEP8](https://www.python.org/dev/peps/pep-0008/)!

In [1]:
#adding requirements
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd

In [3]:
tr_mcc_codes = pd.read_csv("data/tr_mcc_codes.csv", sep=";", index_col="mcc_code")
tr_types = pd.read_csv("data/tr_types.csv", sep=";", index_col="tr_type")

transactions = pd.read_csv("data/transactions.csv", index_col="customer_id")
gender = pd.read_csv("data/gender.csv", index_col="customer_id")

# Data exploration

In [4]:
# import matplotlib.pyplot as plt

# # Plot % missing values
# (transactions.isnull().sum(axis = 0) / transactions.shape[0] * 100).plot.bar()
# plt.title('% Missing')
# plt.xlabel('Feature')
# plt.ylabel('%')
# plt.show()

In [5]:
# #check the classes distribution
# import seaborn as sns; sns.set()
# import matplotlib.pyplot as plt 

# x = gender['gender'].value_counts().values
# sns.barplot([0,1],x)
# plt.title('Target variable count');
# print(x)

# Let's check the data

In [6]:
# print('>>>>transactions\n' , transactions.head(10), '\n')
# print('>>>>tr_mcc_codes\n' , tr_mcc_codes.head(10), '\n')
# print('>>>>tr_types\n' , tr_types.head(10), '\n')
# print('>>>>gender\n' , gender.head(10), '\n')


In [7]:
# from sklearn.decomposition import PCA
# from sklearn.cluster import KMeans
# import numpy as np
# import matplotlib.pyplot as plt
# pca = PCA(2)
# df = pca.fit_transform(X)
 
# kmeans = KMeans(n_clusters= 2)
# label = kmeans.fit_predict(df)
# u_labels = np.unique(label)

# # for i in u_labels:
# #     plt.scatter(df[label == i , 0] , df[label == i , 1] , label = i)
# # plt.legend()
# # plt.show()

In [8]:
# import seaborn as sns; sns.set()
# corr = Data.corr()
# print(corr)
# sns.set(rc={'figure.figsize':(11.7,8.27)})
# sns.heatmap(corr);



# Let’s start 


In [7]:
def train_feature_engineering(Data, gender):

    Data = pd.merge(Data, gender, left_on='customer_id', right_on='customer_id', how='inner')
    #feature aggregating
    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum() > 0, name = 'agg_sign_id', index = Data.index)
    Data['agg_sign_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum(), name = 'agg_sum_id', index = Data.index)
    Data['agg_sum_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].count(), name = 'agg_count_id', index = Data.index)
    Data['agg_count_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].mean(), name = 'agg_mean_id', index = Data.index)
    Data['agg_mean_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].max(), name = 'agg_max_id', index = Data.index)
    Data['agg_max_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].min(), name = 'agg_min_id', index = Data.index)
    Data['agg_min_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].count(), name = 'agg_count_mcc', index = Data.index)
    Data['agg_count_mcc'] = agg_count_id

    keys_1 = dict(Data.groupby(Data['mcc_code'])['amount'].count())
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_1[x])
    Data['agg_count_mcc_unique'] = agg_count_id

    keys_2 = dict(Data.groupby(Data['mcc_code'])['amount'].sum())
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_2[x])
    Data['agg_sum_mcc'] = agg_count_id
    
    keys_3 = dict(Data.groupby(Data['mcc_code'])['amount'].nunique())
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_3[x])
    # Column name must match the one produced by test_feature_engineering
    Data['agg_sum_mcc_unique_type'] = agg_count_id

    mcc_amount_key = dict(Data.set_index(['mcc_code', 'gender']).groupby(level=[0,1])['amount'].mean())
    agg_count_id = Data['mcc_code'].apply(lambda x: max(mcc_amount_key[(x, 1.0)], mcc_amount_key[(x, 0.0)]))
    Data['mcc_amount_key'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].nunique(), name = 'days_unique', index = Data.index)
    Data['days_unique'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].count(), name = 'days_all', index = Data.index)
    Data['days_all'] = agg_count_id
 
    keys_mcc = dict(Data.groupby(Data['mcc_code']).apply(lambda x: x['gender'] == 1.0).groupby(level=['mcc_code']).mean())
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_mcc[x])
    Data['mcc_gender'] = agg_count_id

    keys_type = dict(Data.groupby(Data['tr_type']).apply(lambda x: x['gender'] == 1.0).groupby(level=['tr_type']).mean())
    agg_count_id = Data['tr_type'].apply(lambda x: keys_type[x])
    Data['tr_type_gender'] = agg_count_id


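    # Exploratory 2-cluster projection; `label` is not used further down
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    import numpy as np
    import matplotlib.pyplot as plt
    pca = PCA(2)
    # dropna must be called; passing the bound method itself raises a TypeError
    df = pca.fit_transform(Data.dropna())
    
    kmeans = KMeans(n_clusters=2)
    label = kmeans.fit_predict(df)
    u_labels = np.unique(label)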

    # fill_value = Data['mcc_code'].apply(lambda x: int(keys_mcc[x] > 0.5)) 
    # Data['gender'].fillna(fill_value, inplace = True)

    #print(Data.isnull().sum(axis = 0))

    Data.dropna(inplace = True)

    return Data, keys_mcc, keys_type, mcc_amount_key, keys_1, keys_2, keys_3 #, my_scaler

def test_feature_engineering(Data, keys_mcc, keys_type, mcc_amount_key, keys_1, keys_2, keys_3):

    #feature aggregating
    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum() > 0, name = 'agg_sign_id', index = Data.index)
    Data['agg_sign_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum(), name = 'agg_sum_id', index = Data.index)
    Data['agg_sum_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].count(), name = 'agg_count_id', index = Data.index)
    Data['agg_count_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].mean(), name = 'agg_mean_id', index = Data.index)
    Data['agg_mean_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].max(), name = 'agg_max_id', index = Data.index)
    Data['agg_max_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].min(), name = 'agg_min_id', index = Data.index)
    Data['agg_min_id'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].count(), name = 'agg_count_mcc', index = Data.index)
    Data['agg_count_mcc'] = agg_count_id

    # .get(..., 0) falls back to 0 for MCC codes absent from the training data,
    # mirroring the tr_type handling below
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_1.get(x, 0))
    Data['agg_count_mcc_unique'] = agg_count_id

    agg_count_id = Data['mcc_code'].apply(lambda x: keys_2.get(x, 0))
    Data['agg_sum_mcc'] = agg_count_id
    
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_3.get(x, 0))
    Data['agg_sum_mcc_unique_type'] = agg_count_id

    agg_count_id = Data['mcc_code'].apply(lambda x: max(mcc_amount_key[(x, 1.0)], mcc_amount_key[(x, 0.0)]))
    Data['mcc_amount_key'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].nunique(), name = 'days_unique', index = Data.index)
    Data['days_unique'] = agg_count_id

    agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].count(), name = 'days_all', index = Data.index)
    Data['days_all'] = agg_count_id
 
    # Unseen MCC codes fall back to 0, as for tr_type below
    agg_count_id = Data['mcc_code'].apply(lambda x: keys_mcc.get(x, 0))
    Data['mcc_gender'] = agg_count_id

    agg_count_id = Data['tr_type'].apply(lambda x: keys_type[x] if x in keys_type else 0)
    Data['tr_type_gender'] = agg_count_id
    Data.dropna(inplace = True)
    
    return Data
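The repeated `pd.Series(Data.groupby(Data.index)[...]..., index=Data.index)` pattern broadcasts a per-customer statistic back onto every transaction row. A more compact way to express the same idea might look like the sketch below, which is an alternative formulation on made-up data, not the notebook's own code. It shows both a one-row-per-customer table (named aggregation) and the row-level broadcast (`transform`).

```python
import pandas as pd

# Made-up transactions for two customers.
toy = pd.DataFrame(
    {"amount": [-10.0, 30.0, -5.0, -20.0], "mcc_code": [5411, 6011, 5411, 4814]},
    index=pd.Index([1, 1, 2, 2], name="customer_id"),
)

# One row per customer: named aggregation keeps the output column names explicit.
per_customer = toy.groupby(level="customer_id")["amount"].agg(
    agg_sum_id="sum",
    agg_count_id="count",
    agg_mean_id="mean",
    agg_max_id="max",
    agg_min_id="min",
)

# Row-level broadcast: transform returns one value per original row.
toy["agg_sum_id"] = toy.groupby(level="customer_id")["amount"].transform("sum")

print(per_customer)
print(toy)
```

Since the target is defined per customer, a one-row-per-customer table is also a natural input for a model; the notebook keeps one row per transaction instead.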


In [2]:
# def train_feature_engineering(Data, gender):

#     Data = pd.merge(Data, gender, left_on='customer_id', right_on='customer_id', how='inner')
#     #feature aggregating
#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum() > 0, name = 'agg_sign_id', index = Data.index)
#     Data['agg_sign_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum(), name = 'agg_sum_id', index = Data.index)
#     Data['agg_sum_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].count(), name = 'agg_count_id', index = Data.index)
#     Data['agg_count_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].mean(), name = 'agg_mean_id', index = Data.index)
#     Data['agg_mean_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].max(), name = 'agg_max_id', index = Data.index)
#     Data['agg_max_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].min(), name = 'agg_min_id', index = Data.index)
#     Data['agg_min_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].count(), name = 'agg_count_mcc', index = Data.index)
#     Data['agg_count_mcc'] = agg_count_id

#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].count())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_count_mcc_unique'] = agg_count_id

#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].sum())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_sum_mcc'] = agg_count_id
    
#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].nunique())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_sum_mcc_unqiue_type'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].nunique(), name = 'days_unique', index = Data.index)
#     Data['days_unique'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].count(), name = 'days_all', index = Data.index)
#     Data['days_all'] = agg_count_id
 
#     keys_mcc = dict(Data.groupby(Data['mcc_code']).apply(lambda x: x['gender'] == 1.0).groupby(level=['mcc_code']).mean())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys_mcc[x])
#     Data['mcc_gender'] = agg_count_id

#     keys_type = dict(Data.groupby(Data['tr_type']).apply(lambda x: x['gender'] == 1.0).groupby(level=['tr_type']).mean())
#     agg_count_id = Data['tr_type'].apply(lambda x: keys_type[x])
#     Data['tr_type_gender'] = agg_count_id

#     fill_value = Data['mcc_code'].apply(lambda x: int(keys_mcc[x] > 0.5)) 
#     Data['gender'].fillna(fill_value, inplace = True)

#     return Data, keys_mcc, keys_type #, my_scaler

# def test_feature_engineering(Data, keys_mcc, keys_type):

#     #feature aggregating
#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum() > 0, name = 'agg_sign_id', index = Data.index)
#     Data['agg_sign_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].sum(), name = 'agg_sum_id', index = Data.index)
#     Data['agg_sum_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].count(), name = 'agg_count_id', index = Data.index)
#     Data['agg_count_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].mean(), name = 'agg_mean_id', index = Data.index)
#     Data['agg_mean_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].max(), name = 'agg_max_id', index = Data.index)
#     Data['agg_max_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].min(), name = 'agg_min_id', index = Data.index)
#     Data['agg_min_id'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].count(), name = 'agg_count_mcc', index = Data.index)
#     Data['agg_count_mcc'] = agg_count_id

#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].count())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_count_mcc_unique'] = agg_count_id

#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].sum())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_sum_mcc'] = agg_count_id
    
#     keys = dict(Data.groupby(Data['mcc_code'])['amount'].nunique())
#     agg_count_id = Data['mcc_code'].apply(lambda x: keys[x])
#     Data['agg_sum_mcc_unqiue_type'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].nunique(), name = 'days_unique', index = Data.index)
#     Data['days_unique'] = agg_count_id

#     agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].count(), name = 'days_all', index = Data.index)
#     Data['days_all'] = agg_count_id

#     agg_count_id = Data['mcc_code'].apply(lambda x: keys_mcc[x])
#     Data['mcc_gender'] = agg_count_id

#     agg_count_id = Data['tr_type'].apply(lambda x: keys_type[x] if x in keys_type else 0)
#     Data['tr_type_gender'] = agg_count_id

 
#     Data.dropna(inplace = True)
    
#     # #scaling
#     # indexes = Data.index   
#     # columns = Data.columns
#     # Data = np.nan_to_num(Data)
#     # Data = pd.DataFrame(my_scaler.transform(Data), index=indexes, columns=columns)

#     return Data


In [13]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
pca = PCA(2)
# dropna must be called; passing the bound method itself raises a TypeError
df = pca.fit_transform(train.dropna())

kmeans = KMeans(n_clusters=2)
label = kmeans.fit_predict(df)
u_labels = np.unique(label)
label

In [3]:
# df = train.set_index([train.index, 'mcc_code' ,'gender'])
# dict(df.groupby(level=[1,2])['amount'].mean())

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np

def preprocess_pipeline(drop = True, verbose = False):

    split_ratio = 0.99
    
    tr_mcc_codes = pd.read_csv("data/tr_mcc_codes.csv", sep=";", index_col="mcc_code")
    tr_types = pd.read_csv("data/tr_types.csv", sep=";", index_col="tr_type")

    transactions = pd.read_csv("data/transactions.csv", index_col="customer_id")
    gender = pd.read_csv("data/gender.csv", index_col="customer_id")

    transactions.drop(columns=['term_id'], inplace=True)
    transactions['tr_datetime'] = transactions['tr_datetime'].apply(lambda x: int(x.split(' ')[0]))

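    # Split by customer id so that no customer appears in both sets:
    # customers with id >= ind go to train, the rest to test.
    unique_ids = transactions.index.unique()
    ind = unique_ids[int(len(unique_ids) * split_ratio)]
    # .copy() avoids SettingWithCopyWarning when feature columns are added later
    train = transactions[transactions.index >= ind].copy()
    test = transactions[transactions.index < ind].copy()

    if verbose:
        print(train.shape, test.shape)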

    train, keys_mcc, keys_type, mcc_amount_key, keys_1, keys_2, keys_3 = train_feature_engineering(train, gender)
    test = test_feature_engineering(test, keys_mcc, keys_type, mcc_amount_key, keys_1, keys_2, keys_3)
    
    #train = pd.merge(train, gender, left_on='customer_id', right_on='customer_id', how='inner')
    test = pd.merge(test, gender, left_on='customer_id', right_on='customer_id', how='inner')
    
    train = train[train['gender'].notna()]
    test = test[test['gender'].notna()]

    return train, test

train, test = preprocess_pipeline(verbose = True)
print(train.shape, test.shape )


(6739028, 4) (110318, 4)
(3663755, 20) (66959, 20)
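The pipeline above separates train and test by a customer-id threshold. One possible alternative, sketched below on made-up data, is to shuffle the unique customer ids with a fixed seed and hold out a fraction of them. The `test_fraction` value and variable names are hypothetical; the notebook itself holds out roughly 1% of customers.

```python
import numpy as np
import pandas as pd

# Made-up transactions indexed by customer_id.
toy = pd.DataFrame(
    {"amount": np.arange(10, dtype=float)},
    index=pd.Index([1, 1, 2, 3, 3, 3, 4, 5, 5, 6], name="customer_id"),
)

rng = np.random.RandomState(42)
customer_ids = toy.index.unique().to_numpy()
shuffled = rng.permutation(customer_ids)

test_fraction = 0.34  # hypothetical value chosen so the toy split is visible
n_test = max(1, int(round(len(shuffled) * test_fraction)))
test_ids = shuffled[:n_test]

# All transactions of a customer end up on the same side of the split.
is_test = toy.index.isin(test_ids)
toy_train, toy_test = toy[~is_test].copy(), toy[is_test].copy()
print(toy_train.index.unique().tolist(), toy_test.index.unique().tolist())
```

Splitting on customers rather than on rows matters here because per-customer aggregates repeat on every transaction, so a row-level split would leak the same customer into both sets.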


In [163]:
# x_train.columns, x_test.columns

In [106]:
# #clustering
# from sklearn.cluster import KMeans
# kmeans = KMeans(n_clusters=2)
# train["Cluster"] = kmeans.fit_predict(train)
# train["Cluster"] = train["Cluster"].astype("int64")

# agg_count_id = pd.Series(train['mcc_code'.agg('Cluster')].count(), name = 'mcc_gender', index = train.index)
# Data = pd.concat([agg_count_id, train], axis = 1)

In [107]:
# for sex in train.index
# train.loc[(1.0, 6011), :]['amount'].mean()

In [108]:
# # Data = pd.read_csv("data/transactions.csv", index_col="customer_id")
# # gender = pd.read_csv("data/gender.csv", index_col="customer_id")
# # Data = pd.merge(Data, gender, left_on='customer_id', right_on='customer_id', how='inner')
# # group = Data.groupby(Data['mcc_code']).apply(lambda x: x['gender'] == 1.0).groupby(level=['mcc_code']).mean()
# # group


# #train.set_index(['gender', 'mcc_code'], inplace = True)
# train.groupby(train.index).head()
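The per-MCC share of `gender == 1` explored in the commented cell above is a form of target encoding: the mapping is learned from labeled training rows and then applied to other data. A minimal sketch of that idea follows, with made-up data. Falling back to the global share for unseen codes is an assumption of this example; the notebook falls back to 0 for unseen `tr_type` values.

```python
import pandas as pd

# Made-up labeled training rows and unlabeled test rows.
train_toy = pd.DataFrame(
    {
        "mcc_code": [5411, 5411, 6011, 6011, 6011],
        "gender": [1.0, 0.0, 1.0, 1.0, 0.0],
    }
)
test_toy = pd.DataFrame({"mcc_code": [5411, 6011, 9999]})

# Share of gender == 1 per MCC code, learned on training rows only.
mcc_share = train_toy.groupby("mcc_code")["gender"].mean()
global_share = train_toy["gender"].mean()  # fallback for unseen codes (assumption)

# map() yields NaN for codes missing from the training data; fill with the fallback.
test_toy["mcc_gender"] = test_toy["mcc_code"].map(mcc_share).fillna(global_share)
print(test_toy)
```

Note that when such an encoding is also applied to the very rows it was learned from, each row's feature contains its own label, which can make training-set metrics hard to interpret.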

In [10]:
y_train = train['gender']
x_train = train.drop(columns = ['gender'])

y_test = test['gender']
x_test = test.drop(columns = ['gender'])

# from sklearn.model_selection import train_test_split
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)


# XGBClassifier

In [167]:
len(x_test.index.unique())/(len(x_test.index.unique()) + len(x_train.index.unique())), len(x_test.index.unique())

(0.011799163179916318, 141)

In [12]:
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

print('>>>>>>>>>', 'XGBClassifier')

model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1.0, gamma=1, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.02, max_delta_step=0, max_depth = 4,
              min_child_weight=5, monotone_constraints='()',
              n_estimators=100, n_jobs=1, nthread=1, num_parallel_tree=1,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=0.8, tree_method='exact',
              validate_parameters=1, verbosity=None)
              
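model.fit(x_train, y_train)
# NOTE: predict() returns hard 0/1 labels, so the ROC AUC values below are
# computed at a single threshold; scoring predict_proba(...)[:, 1] instead
# would measure the ranking quality that the task metric refers to.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
roc_auc_test = roc_auc_score(y_test, y_pred)
y_pred = model.predict(x_train)
roc_auc_train = roc_auc_score(y_train, y_pred)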

print('mean accuracy score on test: ' + str(accuracy))
print('roc_auc_score on test: '+ str(roc_auc_test))
print('roc_auc_score on train: '+ str(roc_auc_train))

>>>>>>>>> XGBClassifier
mean accuracy score on test: 0.7044758732955988
roc_auc_score on test: 0.6977149793373514
roc_auc_score on train: 0.6647218609548196
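The scores above are computed from `model.predict`, which returns class labels. ROC AUC is a ranking metric, so it is usually fed the predicted probability of the positive class, for example `model.predict_proba(x_test)[:, 1]`. The sketch below uses made-up probabilities, not output from the notebook's model, to show that thresholding before scoring can change the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Made-up predicted probabilities of class 1.
proba = np.array([0.1, 0.3, 0.45, 0.4, 0.6, 0.9])

# AUC from the full ranking of probabilities.
auc_from_proba = roc_auc_score(y_true, proba)

# AUC after collapsing probabilities to hard labels at a 0.5 threshold.
hard_labels = (proba >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, auc_from_labels)
```

With hard labels, every example on the same side of the threshold is tied, so information about how confidently the model ranks them is lost.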


In [168]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

print('>>>>>>>>>', 'LogisticRegression')
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
print("mean accuracy score on test: " + str(lr.score(x_test, y_test)))
print("mean accuracy score on train: "+ str(lr.score(x_train, y_train)))

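# NOTE: predict() returns hard 0/1 labels, so these ROC AUC values are computed
# at a single threshold rather than from predicted probabilities.
y_pred = lr.predict(x_test)
print('roc_auc_score on test', roc_auc_score(y_test, y_pred))
y_pred = lr.predict(x_train)
print('roc_auc_score on train', roc_auc_score(y_train, y_pred))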


>>>>>>>>> LogisticRegression
mean accuracy score on test: 0.5329081975537269
mean accuracy score on train: 0.6610591265460927
roc_auc_score on test 0.5497004046752714
roc_auc_score on train 0.5293924102792655
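Logistic regression is sensitive to feature scale, and the engineered features here span very different magnitudes (counts, sums of amounts, shares). One thing that might help is standardizing the inputs inside a pipeline, as sketched below on synthetic data. The data, the 300/100 split, and the name `scaled_lr` are assumptions for illustration; whether this improves the notebook's scores has not been checked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 400
signal = rng.normal(size=n)
# Two made-up features on very different scales.
X = np.column_stack([signal * 1e5, rng.normal(size=n)])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

# The scaler is fitted on the training part only, then reused for the holdout.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X[:300], y[:300])

# Score the holdout with probabilities of the positive class.
holdout_auc = roc_auc_score(y[300:], scaled_lr.predict_proba(X[300:])[:, 1])
print(holdout_auc)
```

Keeping the scaler inside the pipeline means its statistics are never computed on the evaluation rows.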


# Grid search

In [6]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

x_train.shape, x_test.shape, y_train.shape, y_test.shape

NameError: name 'x_train' is not defined

In [5]:
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [5, 6, 7, 8, 9]
        }

# `silent` is deprecated in recent XGBoost releases; `verbosity=0` replaces it
xgb = XGBClassifier(learning_rate=0.02, n_estimators=100, objective='binary:logistic',
                    verbosity=0, nthread=1)

folds = 5
param_comb = 3
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
# Passing the splitter itself, rather than a one-shot generator from skf.split(),
# keeps the search object reusable
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb,
                                   scoring='roc_auc', n_jobs=4, cv=skf,
                                   verbose=3, random_state=1001)

# Here we go
start_time = timer(None) 
random_search.fit(x_train, y_train)
timer(start_time) 

NameError: name 'XGBClassifier' is not defined
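The search above stratifies folds over transaction rows, so the same customer can land in both the training and validation part of a fold. A possible alternative is to group folds by customer with scikit-learn's `GroupKFold`, sketched below on made-up ids. This swaps in a different splitter from the one the notebook uses.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up rows: one customer id per transaction.
customer_ids = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 6])
X = np.arange(len(customer_ids)).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])

gkf = GroupKFold(n_splits=3)
fold_overlaps = []
for train_idx, valid_idx in gkf.split(X, y, groups=customer_ids):
    # Customers that appear on both sides of this fold (should be none).
    overlap = set(customer_ids[train_idx]) & set(customer_ids[valid_idx])
    fold_overlaps.append(len(overlap))

print(fold_overlaps)
```

Such a splitter can be handed to a search object through its `cv` argument, with the group labels supplied when fitting.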

In [14]:
print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
print('\n Best normalized gini score for %d-fold search with %d parameter combinations:' % (folds, param_comb))
print(random_search.best_score_ * 2 - 1)
print('\n Best hyperparameters:')
print(random_search.best_params_)
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('xgb-random-grid-search-results-01.csv', index=False)


 All results:
{'mean_fit_time': array([217.52383785, 244.17107964, 401.51170998, 182.82168097,
       218.50906525]), 'std_fit_time': array([ 3.78032963,  4.67087168, 10.44192738,  6.15033259,  7.7312049 ]), 'mean_score_time': array([1.55832853, 1.84535103, 1.89148493, 1.16086679, 1.07255349]), 'std_score_time': array([0.36971757, 0.85082337, 0.36618341, 0.12015103, 0.00763959]), 'param_subsample': masked_array(data=[0.6, 0.8, 0.8, 1.0, 0.6],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_min_child_weight': masked_array(data=[1, 10, 5, 1, 1],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_max_depth': masked_array(data=[6, 6, 8, 5, 5],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_gamma': masked_array(data=[5, 1, 1, 0.5, 1],
             mask=[False, False, False, False, False],
       fill_va
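The raw `cv_results_` dictionary printed above is hard to read. A small table sorted by rank is often easier to work with; the sketch below builds one for a toy search over a logistic regression on synthetic data. The model, data, and parameter grid are assumptions for illustration. The `normalized_gini` column uses the same `2 * AUC - 1` relation as the print statement in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1001),
    random_state=1001,
)
search.fit(X, y)

# Keep only the columns needed to compare candidates, best first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
summary["normalized_gini"] = 2 * summary["mean_test_score"] - 1
print(summary)
```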

In [59]:
# from sklearn.ensemble import GradientBoostingClassifier

# print('>>>>>>>>>', 'GradientBoostingClassifier')
# clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=3, random_state=0).fit(x_train, y_train)
# print("mean accuracy score on test: " + str(clf.score(x_test, y_test)))
# print('roc_auc_score', roc_auc_score(y_test, y_pred))

>>>>>>>>> GradientBoostingClassifier
mean accuracy score on test: 0.687398109241263
roc_auc_score 0.49999602154049105


In [60]:
# from sklearn.svm import SVC
# from sklearn.ensemble import BaggingClassifier

# clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0).fit(x_train, y_train)
# y_pred = clf.predict(x_test)
# print('roc_auc_score', roc_auc_score(y_test, y_pred))

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# clf = RandomForestClassifier(max_depth=2, random_state=0, criterion = 'entropy').fit(x_train, y_train)
# y_pred = clf.predict(x_test)
# print('roc_auc_score', roc_auc_score(y_test, y_pred))

roc_auc_score 0.5144229010465293


In [None]:
# from sklearn import svm

# clf = svm.SVC()
# clf.fit(x_train, y_train)
# y_pred = clf.predict(x_test)
# print('roc_auc_score', roc_auc_score(y_test, y_pred))


In [None]:
# from sklearn.impute import SimpleImputer

# def count(Data, ind, x):
#     y = []
#     for mcc in x:
#         y.append((Data[Data.index == ind]['mcc_code'] == mcc).sum())
#     return y


# def preprocess_pipeline(drop = True, verbose = False):
    
#     tr_mcc_codes = pd.read_csv("data/tr_mcc_codes.csv", sep=";", index_col="mcc_code")
#     tr_types = pd.read_csv("data/tr_types.csv", sep=";", index_col="tr_type")

#     transactions = pd.read_csv("data/transactions.csv", index_col="customer_id")
#     gender = pd.read_csv("data/gender.csv", index_col="customer_id")

#     transactions.drop(columns=['term_id'], inplace=True)
#     transactions['tr_datetime'] = transactions['tr_datetime'].apply(lambda x: int(x.split(' ')[0]))
#     Data = pd.merge(transactions, gender, left_on='customer_id', right_on='customer_id', how='inner')

#     #feature aggregating
#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].count(), name = 'agg_count_id', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].mean(), name = 'agg_mean_id', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].max(), name = 'agg_max_id', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['amount'].std(), name = 'agg_std_id', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].count(), name = 'agg_count_mcc', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['mcc_code'].nunique(), name = 'agg_count_mcc_unique', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     agg_count_id = pd.Series(Data.groupby(Data.index)['tr_datetime'].nunique(), name = 'days_unique', index = Data.index)
#     Data = pd.concat([agg_count_id, Data], axis = 1)

#     #keys = {}
    
#     # group = Data.groupby(Data['mcc_code']).apply(lambda x: x['gender'] == 1.0).groupby(level=['mcc_code']).mean()
#     # for i in range(group.shape[0]):
#     #     keys[group.index[i]] = group.iloc[i]
    
#     # agg_count_id = pd.Series(Data['mcc_code'].apply(lambda x: keys[x]), name = 'mcc_gender', index = Data.index)
#     # Data = pd.concat([agg_count_id, Data], axis = 1)

#     Data = Data[Data['gender'].notna()]
    
#     # columns  = [str(mcc) for mcc in Data['mcc_code'].unique()]
#     # tempdata = pd.DataFrame(np.zeros((len(Data.index.unique()), len(Data['mcc_code'].unique()))), index=Data.index.unique(), columns=columns)
#     # print(tempdata.shape)
#     # indexes = Data.index.unique()
#     # mcc_codes = Data['mcc_code'].unique()
#     # for j, ind in enumerate(indexes):
#     #     tempdata.apply(lambda x: count(Data, ind, mcc_codes), axis = 1)

#     # Data = pd.concat([tempdata, Data], axis = 1)

#     # agg_count_id = pd.Series(Data.groupby(Data['mcc_code']).  index.count(), name = 'agg_count_mcc', index = Data.index)
#     # Data = pd.concat([agg_count_id, Data], axis = 1)

#     y = Data['gender']
#     x = Data.drop(columns = ['gender'])

#     indexes = x.index   
#     columns = x.columns
#     x = np.nan_to_num(x)
#     x = pd.DataFrame(StandardScaler().fit_transform(x), index=indexes, columns=columns)

#     Data.index

#     x_train = Data .iloc[:int(len(Data)*0.7)]
#     x_test = Data.iloc[int(len(Data)*0.7):]
#     x_train.index[pd.merge(x_test, x_train, left_on='customer_id', right_on='customer_id', how='inner').index].drop(inplace = True)
    
#     y_train = x_train['gender']
#     x_train.drop(columns=['gender'], inplace = True)

#     y_test = x_test['gender']
#     x_test.drop(columns=['gender'], inplace = True)

#     # from sklearn.model_selection import train_test_split
#     # x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42, stratify = y, shuffle = True)

#     X_train_plus = x_train.copy()
#     X_valid_plus = x_test.copy()
    
#     from sklearn.impute import SimpleImputer
#     my_imputer = SimpleImputer()
#     cols_with_missing = list(X_train_plus[X_train_plus.isna()].columns)

#     for col in cols_with_missing:
#         X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
#         X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

#     # Imputation
#     my_imputer = SimpleImputer()
#     imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
#     imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

#     # Imputation removed column names; put them back
#     imputed_X_train_plus.columns = X_train_plus.columns
#     imputed_X_valid_plus.columns = X_valid_plus.columns

#     x_train = imputed_X_train_plus
#     x_test = imputed_X_valid_plus

#     if verbose:
#         print(x_train.head())
#     return x_train, x_test, y_train, y_test

# x_train, x_test, y_train, y_test = preprocess_pipeline(verbose = True)