# «Олимпиада Кружкового движения Национальной технологической инициативы» по профилю «Искусственный интеллект» 2019/20 учебный год

**Задача II.1.0.1. (100 баллов)**

Можно ли узнать возраст клиента на основе информации о его расходах по карте?

Мы подготовили задачу на базе реальных банковских транзакций. Совершенствуя свои продукты, банк использует информацию о пользователях, в том числе и возраст. Это помогает сделать персонализированные продукты, которые удовлетворяют реальные потребности клиентов. Но всегда ли календарный возраст соответствует образу жизни (и покупок) человека?

Ваша задача — по информации о расходах клиента банка предсказать, в какую из возрастных групп он попадает. Даны обучающие train данные для построения признаков и обучения моделей, и тестовые test данные для проверки алгоритмов.

Это специальным образом подготовленная и анонимизированная информация, на которой можно обучать модели, сохраняя полную безопасность реальных данных клиентов. Решением задачи являются предсказания алгоритмов на тестовых данных.


**ДАННЫЕ**

Для решения задачи участникам была предоставлена информация о транзакциях клиентов банка, объемом около 27 000 000 миллионов записей.

Каждая запись описывает одну банковскую транзакцию. Для каждого из ≈20 000 тестовых id, участникам было нужно с помощью обученной модели предсказать, в какую из возрастных групп попадает клиент.

***Были подготовлены два набора данных:***

- Обучающий `transactions_train.csv`, в котором для каждой транзакции известна дата, сумма, тип и id клиента;
-Тестовый `transactions_test.csv`, содержащий те же поля:
  - сlient_id – уникальный номер клиента;
  - trans_date – дата транзакции (представляет из себя просто номер дня в хронологическом порядке, начиная от заданной даты);
  - small_group – группа транзакций, характеризующих тип транзакции (например, продуктовые магазины, одежда, заправки, детские товары и т.п.);
  - amount_rur – сумма транзакции (для анонимизации данные суммы были трансформированы без потери структуры).

На базе данных файлов можно строить различные признаки, которые характеризуют возрастные группы.

Целевая переменная для обучающего датасета находится в файле `train_target.csv`. В нем содержится информация о Клиенте и метка возрастной группы, к которой он относится:
- client_id – уникальный номер Клиента (соответствует client_id из файла transactions_train.csv);
- bins – метка возраста. В файлe test.csv тебе необходимо предсказать для указанных client_id соответствующую метку группы возраста.

Участникам также был предоставлен информационный файл small_group_description.csv, который содержит расшифровку типов транзакций.

Свое решение можно протестировать на соревновании на kaggle (оттуда же можно взять данные):

https://www.kaggle.com/c/clients-age-group/data

In [None]:
# Библиотека позволяет загрузить файлы с локального компьютера
# в облако colab
from google.colab import files
uploaded = files.upload()

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import re
import pandas as pd
import os

url = os.chdir(r"./gdrive/MyDrive/Я-профи подготовка по машинному обучению/data")
os.getcwd()

'/content/gdrive/MyDrive/Я-профи подготовка по машинному обучению/data'

In [4]:
os.listdir(os.getcwd())

['transactions_test.csv',
 'train_target.csv',
 'test.csv',
 'transactions_train.csv',
 'small_group_description.csv']

In [12]:
# Библиотека pandas предназначена для работы с csv файлами и DataFrame
import pandas as pd

transactions_train = pd.read_csv("transactions_train.csv")
train_target = pd.read_csv("train_target.csv")

train_target.head()

Unnamed: 0,client_id,bins
0,24662,2
1,1046,0
2,34089,2
3,34848,1
4,47076,3


In [13]:
transactions_train.head()

Unnamed: 0,client_id,trans_date,small_group,amount_rur
0,33172,6,4,71.463
1,33172,6,35,45.017
2,33172,8,11,13.887
3,33172,9,11,15.983
4,33172,10,11,21.341


In [8]:
# shape - возвращает размерность датафрейма (кол-во строк, кол-во столбцов)
transactions_train.shape

(26450577, 4)

In [9]:
# unique() - возвращает список уникальных значений в столбце client_id
# len() - длинна этого списка
# видим, что всего клиентов 30 000, в то время как всего строк 26450577
len(transactions_train['client_id'].unique())

30000

In [15]:
transactions_train[transactions_train['client_id'] == 33172]
# 772 всего транзацкий у клиента 33172, т е 772 строчки соответсвует этому клиенту

Unnamed: 0,client_id,trans_date,small_group,amount_rur
0,33172,6,4,71.463
1,33172,6,35,45.017
2,33172,8,11,13.887
3,33172,9,11,15.983
4,33172,10,11,21.341
...,...,...,...,...
767,33172,717,11,8.195
768,33172,718,11,16.604
769,33172,727,82,13.189
770,33172,729,83,54.038


Как мы можем заметить в таблице данных каждый клиент представлен несколько раз (каждое упоминание клиента - это его отдельная транзацкция).

Но одна строка для алгоритма классификации - это один объект. Целесообразно подумать как из нескольких строчек по каждому клиенту сделать одну.

Самый простой способ посчитать агрегациионные признаки по каждому клиенту: сумма всех покупок (sum), средняя сумма всех покупок (mean), медиана суммы всех покупок (std), максимальная и минимальная сумма всех покупок.

In [16]:
# Создадим новый датафрейм agg_features, в котором будет 30 000 (что соответствует кол-ву уникальных клиентов)
# каждому клиенту будет соответствовать его агрегационные признаки
agg_features = transactions_train.groupby('client_id')['amount_rur'].agg(['sum','mean','std','min','max']).reset_index()

agg_features

Unnamed: 0,client_id,sum,mean,std,min,max
0,4,28404.121,39.450168,73.511624,0.043,1341.802
1,6,15720.739,21.535259,26.200397,0.045,315.781
2,7,53630.036,69.379089,253.261383,0.043,4505.971
3,10,34419.365,48.752642,63.191701,0.045,654.893
4,11,26789.404,32.991877,107.395139,0.388,2105.058
...,...,...,...,...,...,...
29995,49993,24247.544,26.911814,73.592787,0.211,1315.470
29996,49995,27951.156,28.845362,64.723186,0.690,1243.601
29997,49996,80900.345,71.089934,125.642727,0.458,1657.546
29998,49997,13293.115,18.591769,38.841011,0.432,858.240


In [17]:
agg_features[agg_features['client_id'] == 33172]
# теперь видим, что клиенту 33172 соответсвует 1 строчка, а не 772

Unnamed: 0,client_id,sum,mean,std,min,max
19871,33172,46194.217,59.837069,142.537535,0.916,2111.162


In [20]:
counter_df_train = transactions_train.groupby(['client_id','small_group'])['amount_rur'].count()
# Посчитаем для каждого клиента количество транзакций по каждой категории,
# т.е. сколько покупок совершил клиент по категории "Одежда", сколько по категории "Продукты питания" и т д
# и запишем результаты в новый датафрейм cat_counts_train

cat_counts_train = counter_df_train.reset_index().pivot(index='client_id', \
                                                      columns='small_group',values='amount_rur')

cat_counts_train.head()

small_group,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,195,196,197,198,199,200,202,203
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
4,,447.0,1.0,44.0,93.0,,,,1.0,13.0,1.0,19.0,,,,37.0,,,5.0,,2.0,,,3.0,6.0,,,,1.0,,,,2.0,,9.0,,12.0,,,3.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,2.0,397.0,,172.0,10.0,,,,,6.0,3.0,9.0,,,,103.0,,,1.0,,2.0,,,,,,,,3.0,,1.0,,,,2.0,,,5.0,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,2.0,79.0,5.0,27.0,19.0,1.0,,2.0,1.0,39.0,,187.0,136.0,5.0,,15.0,3.0,,64.0,3.0,,3.0,2.0,7.0,15.0,55.0,,,6.0,,,1.0,1.0,2.0,8.0,5.0,16.0,5.0,,3.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
10,12.0,309.0,1.0,71.0,65.0,,,,3.0,19.0,,58.0,,,,40.0,,,6.0,2.0,,,,3.0,26.0,11.0,,,6.0,,1.0,2.0,,,8.0,,27.0,,,12.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11,2.0,423.0,,59.0,23.0,3.0,,,,10.0,4.0,107.0,,1.0,,7.0,1.0,,22.0,,1.0,1.0,18.0,3.0,14.0,17.0,3.0,,,,,,1.0,,2.0,10.0,14.0,,,1.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [23]:
cat_counts_train.isna().sum()

small_group
0      14203
1          1
2       7402
3         76
4         34
       ...  
198    29993
199    29999
200    29999
202    29998
203    29999
Length: 202, dtype: int64

In [24]:
cat_counts_train = cat_counts_train.fillna(0)
# заполним пропуски 0

In [25]:
cat_counts_train.columns=['small_group_'+str(i) for i in cat_counts_train.columns]
# допишем префикс с названиям столбцов

In [26]:
cat_counts_train.head()

Unnamed: 0_level_0,small_group_0,small_group_1,small_group_2,small_group_3,small_group_4,small_group_5,small_group_6,small_group_7,small_group_8,small_group_9,small_group_10,small_group_11,small_group_12,small_group_13,small_group_14,small_group_15,small_group_16,small_group_17,small_group_18,small_group_19,small_group_20,small_group_21,small_group_22,small_group_23,small_group_24,small_group_25,small_group_26,small_group_27,small_group_28,small_group_29,small_group_30,small_group_31,small_group_32,small_group_33,small_group_34,small_group_35,small_group_36,small_group_37,small_group_38,small_group_39,...,small_group_162,small_group_163,small_group_164,small_group_165,small_group_166,small_group_167,small_group_168,small_group_169,small_group_170,small_group_171,small_group_172,small_group_173,small_group_174,small_group_175,small_group_176,small_group_177,small_group_178,small_group_179,small_group_180,small_group_181,small_group_182,small_group_183,small_group_184,small_group_185,small_group_186,small_group_187,small_group_188,small_group_189,small_group_190,small_group_191,small_group_192,small_group_193,small_group_195,small_group_196,small_group_197,small_group_198,small_group_199,small_group_200,small_group_202,small_group_203
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
4,0.0,447.0,1.0,44.0,93.0,0.0,0.0,0.0,1.0,13.0,1.0,19.0,0.0,0.0,0.0,37.0,0.0,0.0,5.0,0.0,2.0,0.0,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,9.0,0.0,12.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2.0,397.0,0.0,172.0,10.0,0.0,0.0,0.0,0.0,6.0,3.0,9.0,0.0,0.0,0.0,103.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2.0,79.0,5.0,27.0,19.0,1.0,0.0,2.0,1.0,39.0,0.0,187.0,136.0,5.0,0.0,15.0,3.0,0.0,64.0,3.0,0.0,3.0,2.0,7.0,15.0,55.0,0.0,0.0,6.0,0.0,0.0,1.0,1.0,2.0,8.0,5.0,16.0,5.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,12.0,309.0,1.0,71.0,65.0,0.0,0.0,0.0,3.0,19.0,0.0,58.0,0.0,0.0,0.0,40.0,0.0,0.0,6.0,2.0,0.0,0.0,0.0,3.0,26.0,11.0,0.0,0.0,6.0,0.0,1.0,2.0,0.0,0.0,8.0,0.0,27.0,0.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,2.0,423.0,0.0,59.0,23.0,3.0,0.0,0.0,0.0,10.0,4.0,107.0,0.0,1.0,0.0,7.0,1.0,0.0,22.0,0.0,1.0,1.0,18.0,3.0,14.0,17.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,10.0,14.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# объединим все датафреймы в один
train = pd.merge(train_target, agg_features, on='client_id')
train = pd.merge(train, cat_counts_train.reset_index(), on='client_id')
train.head()

Unnamed: 0,client_id,bins,sum,mean,std,min,max,small_group_0,small_group_1,small_group_2,small_group_3,small_group_4,small_group_5,small_group_6,small_group_7,small_group_8,small_group_9,small_group_10,small_group_11,small_group_12,small_group_13,small_group_14,small_group_15,small_group_16,small_group_17,small_group_18,small_group_19,small_group_20,small_group_21,small_group_22,small_group_23,small_group_24,small_group_25,small_group_26,small_group_27,small_group_28,small_group_29,small_group_30,small_group_31,small_group_32,...,small_group_162,small_group_163,small_group_164,small_group_165,small_group_166,small_group_167,small_group_168,small_group_169,small_group_170,small_group_171,small_group_172,small_group_173,small_group_174,small_group_175,small_group_176,small_group_177,small_group_178,small_group_179,small_group_180,small_group_181,small_group_182,small_group_183,small_group_184,small_group_185,small_group_186,small_group_187,small_group_188,small_group_189,small_group_190,small_group_191,small_group_192,small_group_193,small_group_195,small_group_196,small_group_197,small_group_198,small_group_199,small_group_200,small_group_202,small_group_203
0,24662,2,30254.011,34.774725,72.037354,0.074,1227.314,0.0,174.0,2.0,64.0,33.0,0.0,0.0,0.0,1.0,3.0,0.0,92.0,365.0,0.0,0.0,11.0,0.0,0.0,20.0,0.0,0.0,4.0,3.0,3.0,9.0,16.0,4.0,0.0,4.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1046,0,42548.57,52.015367,106.540962,0.55,1210.506,1.0,187.0,61.0,47.0,13.0,1.0,0.0,0.0,2.0,8.0,1.0,27.0,3.0,0.0,1.0,79.0,0.0,0.0,142.0,0.0,2.0,0.0,4.0,2.0,5.0,4.0,3.0,0.0,6.0,1.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,34089,2,26842.816,34.325852,59.92745,0.043,782.641,0.0,372.0,0.0,72.0,37.0,10.0,0.0,0.0,0.0,17.0,0.0,47.0,9.0,0.0,0.0,49.0,15.0,1.0,6.0,0.0,2.0,2.0,1.0,5.0,26.0,21.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,34848,1,15773.126,16.16099,14.224936,0.043,109.59,0.0,359.0,1.0,0.0,41.0,0.0,0.0,0.0,0.0,38.0,0.0,116.0,0.0,0.0,0.0,306.0,0.0,0.0,45.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,47076,3,12488.375,15.92905,35.473591,0.432,541.165,0.0,378.0,0.0,150.0,44.0,0.0,0.0,0.0,0.0,122.0,0.0,33.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,8.0,31.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Теперь подгрузим тестовые данные для того, чтобы сделать предсказание. Проделаем с ними те же самые манипуляции, как и с обучающими данными.

In [49]:
transactions_test = pd.read_csv('transactions_test.csv')
test_id = pd.read_csv('test.csv')

print(transactions_test.shape)
print(test_id.shape)

(17667328, 4)
(20000, 1)


In [50]:
len(test_id['client_id'].unique())

20000

In [51]:
# проделаем все операции выше также и для тестового набора данных

agg_features_test = transactions_test.groupby('client_id')['amount_rur'].agg(['sum','mean','std','min','max']).reset_index()
print(agg_features_test.shape)

counter_df_test = transactions_test.groupby(['client_id','small_group'])['amount_rur'].count()
cat_counts_test=counter_df_test.reset_index().pivot(index='client_id', columns='small_group',values='amount_rur')
cat_counts_test=cat_counts_test.fillna(0)
cat_counts_test.columns=['small_group_'+str(i) for i in cat_counts_test.columns]

test = pd.merge(test_id,agg_features_test,on='client_id')
test = pd.merge(test,cat_counts_test.reset_index(),on='client_id')

(20000, 6)


В тесте не было некоторых категорий трат, поэтому для того, чтобы обучить модель, нам нужно объединить пространство признаков и train и test.

In [53]:
common_features = list(set(train.columns).intersection(set(test.columns)))

In [54]:
y_train = train['bins'] # вектор целевой переменной 
X_train = train[common_features]
X_test = test[common_features]

Классификаторы:

- Дерево решений
- Логистическая регрессия
- xbgboost
- метод опорных векторов

In [32]:
# подключаете библиотеку
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

x_train, x_test, Y_train, Y_test = train_test_split(X_train, y_train, test_size=0.3)

tree = DecisionTreeClassifier() # создаем модель

tree.fit(x_train, Y_train) # обучение

y_pred = tree.predict(x_test)

accuracy_score(Y_test, y_pred)
# 0 - плохо
# 1 - идеально

0.4612222222222222

In [None]:
y_final_pred = tree.predict(test)

In [None]:
from sklearn.linear_model import LogisticRegression

log = LogisticRegression() # создаем модель

log.fit(X_train, y_train) # обучение

y_pred = log.predict(X_test) # предсказание

Обучим xgboost на текущих признаках.

In [33]:
param={'objective':'multi:softprob','num_class':4,'n_jobs':4,'seed':42}

In [55]:
%%time
import xgboost as xgb

x_train, x_test, Y_train, Y_test = train_test_split(X_train, y_train, test_size=0.3)

model = xgb.XGBClassifier()
model.fit(x_train, Y_train)

y_pred = tree.predict(x_test)

accuracy_score(Y_test, y_pred)

CPU times: user 37.6 s, sys: 34.1 ms, total: 37.6 s
Wall time: 37.4 s


In [56]:
accuracy_score(Y_test, y_pred)

0.8332222222222222

In [57]:
# Сделаем предсказание для тестового набора данных
pred = model.predict(X_test)

In [58]:
test.shape

(20000, 208)

In [59]:
pred

array([0, 2, 0, ..., 2, 2, 3])

In [60]:
len(pred)

20000

Подготовим файл для отправки в систему

In [61]:
submission = pd.DataFrame({'client_id': test['client_id'], 'bins': pred})
submission.head()

Unnamed: 0,client_id,bins
0,28571,0
1,27046,2
2,13240,0
3,19974,3
4,10505,1


In [63]:
submission = submission[:6000]

In [64]:
submission.to_csv('submit_Adelya.csv', index=False)
files.download('submit_Adelya.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>