## Predicting gender and age from website visits

First, mount the S3 bucket that contains the source data:

In [4]:
!chmod 600 .s3_passwd
!mkdir -p mnt
!s3fs hsevkhack mnt -o url=http://hb.vkcs.cloud -o use_path_request_style -o passwd_file=.s3_passwd -o ro

Check that the `mnt` directory contains the expected tables:

In [5]:
!ls mnt/

geo_dataframe.csv  requests  train_users.csv


If the directory could not be mounted for some reason, use the code below to download the data to local storage. Otherwise, **skip the next cell**.

In [6]:
import subprocess
!mkdir -p data
!wget https://hsehack.hb.ru-msk.vkcs.cloud/geo_dataframe.csv -P data
!wget https://hsehack.hb.ru-msk.vkcs.cloud/train_users.csv -P data
!mkdir -p data/requests
for i in range(30):
    print(f"Downloading part {i}...")
    subprocess.call(["wget", f"https://hsehack.hb.ru-msk.vkcs.cloud/requests/part_{i}.parquet", "-q", "-P", "data/requests"])

--2024-04-20 10:30:46--  https://hsehack.hb.ru-msk.vkcs.cloud/geo_dataframe.csv
Resolving hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)... 95.163.53.117
Connecting to hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)|95.163.53.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63042 (62K) [text/csv]
Saving to: ‘data/geo_dataframe.csv.1’


2024-04-20 10:30:46 (25.8 MB/s) - ‘data/geo_dataframe.csv.1’ saved [63042/63042]

--2024-04-20 10:30:46--  https://hsehack.hb.ru-msk.vkcs.cloud/train_users.csv
Resolving hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)... 95.163.53.117
Connecting to hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)|95.163.53.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66826127 (64M) [text/csv]
Saving to: ‘data/train_users.csv.1’


2024-04-20 10:30:48 (53.3 MB/s) - ‘data/train_users.csv.1’ saved [66826127/66826127]

Downloading part 0...
Downloading part 1...
Downl
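Later cells read from `mnt/`, so after the download fallback the paths need to point at `data/` instead. Below is a minimal sketch of one way to pick the directory once and reuse it. The helper name `pick_data_dir` and the `DATA_DIR` variable are hypothetical and not part of the original notebook.

```python
import os

def pick_data_dir(candidates=("mnt", "data"), marker="train_users.csv"):
    """Return the first candidate directory that contains the marker file."""
    for path in candidates:
        if os.path.isfile(os.path.join(path, marker)):
            return path
    raise FileNotFoundError(f"{marker} not found in any of {candidates}")

# Possible usage in the notebook (assumes pandas is imported as pd):
# DATA_DIR = pick_data_dir()
# users = pd.read_csv(os.path.join(DATA_DIR, "train_users.csv"))
```

Checking for a known file rather than for the directory itself avoids picking an empty `mnt` folder that was created but never mounted.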

## Loading the data

We use pandas to work with the tables. `geo_dataframe` contains geolocation data: region and country.

In [3]:
import pandas as pd
from collections import Counter
from tqdm import tqdm
import numpy as np

from sklearn.model_selection import train_test_split

The `train_users` table is your training set; it contains each user's gender and age. A matching table for the test dataset will be provided one hour before stop coding.

In [2]:
users = pd.read_csv('mnt/train_users.csv')
users

Unnamed: 0,user_id,gender,age
0,2,1,61
1,3,1,55
2,6,0,46
3,14,0,66
4,17,0,53
...,...,...,...
4999995,17588859,1,64
4999996,17588860,0,69
4999997,17588861,1,51
4999998,17588864,0,30


For example, here is the distribution of users by gender:
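The distribution itself is not shown in this export. A small sketch of how it could be computed with `value_counts` follows. The toy frame `users_demo` is an illustration built from the first five rows shown above; in the notebook the same call would be applied to `users`.

```python
import pandas as pd

# Toy stand-in for `users`, using the first rows shown in the preview above.
users_demo = pd.DataFrame({
    "user_id": [2, 3, 6, 14, 17],
    "gender": [1, 1, 0, 0, 0],
    "age": [61, 55, 46, 66, 53],
})

# Share of each gender label; the source does not document what 0 and 1 mean.
gender_share = users_demo["gender"].value_counts(normalize=True).sort_index()
print(gender_share)
```

On the real table, `users["gender"].value_counts(normalize=True)` gives the class balance, which is useful context when reading accuracy figures later on.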

The main data on users' site visits is in the `requests` table, stored in Parquet format. You can load a single part of the table, or the whole table if it fits in memory:

In [4]:
df = pd.read_parquet("featured/part_0.parquet")
df = df.drop(["timestamp", "user_agent", "referer"], axis=1)
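To load several parts, or to keep memory use down by reading only the columns you need, something along these lines may help. This is a sketch under assumptions: the helper names `list_parts` and `load_requests` are hypothetical, and the files are assumed to follow the `part_{i}.parquet` naming used in the download cell.

```python
import glob
import os

import pandas as pd

def list_parts(requests_dir):
    """Return the part_*.parquet files in requests_dir, ordered by part number."""
    paths = glob.glob(os.path.join(requests_dir, "part_*.parquet"))
    prefix, suffix = "part_", ".parquet"
    return sorted(
        paths,
        key=lambda p: int(os.path.basename(p)[len(prefix):-len(suffix)]),
    )

def load_requests(paths, columns=None):
    """Read the given parts and concatenate them.

    `columns` limits what is read from disk, which is cheaper than
    reading everything and dropping columns afterwards.
    """
    frames = [pd.read_parquet(p, columns=columns) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Possible usage: first three parts, two columns only.
# df = load_requests(list_parts("mnt/requests")[:3], columns=["user_id", "domain"])
```

Sorting by the numeric part index keeps `part_10` after `part_2`, which a plain string sort would not.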

In [52]:
df.head()

Unnamed: 0,user_id,gender,age,geo_id,region_id,country_id,year,month,day,hour,minute,second,weekday,browser_family,os_family,brand,device_type,domain
0,2,1,61,708,7440,40,2024,4,2,0,21,37,1,Chrome,Android,Huawei,mobile,domain_1654
1,251,0,26,708,7440,40,2024,4,1,6,4,27,0,Chrome Mobile,Android,Generic_Android,mobile,domain_381
2,273,0,33,708,7440,40,2024,4,1,14,23,12,0,Chrome Mobile,Android,Generic_Android,mobile,domain_609
3,273,0,33,708,7440,40,2024,4,1,1,56,40,0,Chrome Mobile,Android,Generic_Android,mobile,www.domain_325
4,273,0,33,708,7440,40,2024,4,1,1,55,19,0,Chrome Mobile,Android,Generic_Android,mobile,www.domain_325


In [3]:
# For each feature, how many of a user's most frequent values to keep.
user_features = {
    "domain": 5,
    "device_type": 3,
    "brand": 2,
    "hour": 3,
    "weekday": 2
}

In [6]:
temp = df[["user_id", "gender", "age", "domain"]]

In [7]:
user_featured = df[["user_id", "gender", "age"]]

In [6]:
# grouped = temp.groupby('user_id')["domain"].apply(list).to_dict()
# for el in grouped:
#     counts = Counter(grouped[el])
#     grouped[el] = sorted(counts, key=lambda x: counts[x], reverse=True)[:5]
        
#     if len(grouped[el]) < 5:
#         grouped[el] += [np.nan] * (5 - len(grouped[el]))


for user_feature, top_n in tqdm(user_features.items()):
    # Collect every observed value of this feature for each user.
    grouped = df.groupby('user_id')[user_feature].apply(list).to_dict()
    for el in grouped:
        counts = Counter(grouped[el])
        # Keep the top_n most frequent values, most frequent first.
        grouped_el = sorted(counts, key=lambda x: counts[x], reverse=True)[:top_n]
        # Pad by the number of distinct values kept, not by the number of requests.
        if len(grouped_el) < top_n:
            grouped_el += [np.nan] * (top_n - len(grouped_el))
        grouped[el] = grouped_el
    columns = [f'{user_feature}_top{i}' for i in range(1, top_n + 1)]
    user_featured = user_featured.merge(pd.DataFrame.from_dict(grouped, orient='index', columns=columns), left_on='user_id', right_index=True)

100%|██████████| 5/5 [04:36<00:00, 55.28s/it]
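The progress bar shows the loop taking about four and a half minutes on one part. A possible alternative is to let pandas do the counting and ranking, as in the sketch below. The function name `top_k_per_user` is hypothetical, and the result differs from the `Counter` version in two ways: ties are broken by sorted value order rather than by first occurrence, and missing values are not counted as a value.

```python
import pandas as pd

def top_k_per_user(df, col, k):
    """Return one row per user with the k most frequent values of `col`."""
    # Count occurrences of each (user, value) pair.
    counts = df.groupby(["user_id", col]).size().rename("n").reset_index()
    # Within each user, put the most frequent values first.
    counts = counts.sort_values(["user_id", "n"], ascending=[True, False])
    counts["rank"] = counts.groupby("user_id").cumcount() + 1
    top = counts[counts["rank"] <= k]
    # One column per rank; users with fewer than k distinct values get NaN.
    wide = top.pivot(index="user_id", columns="rank", values=col)
    wide = wide.reindex(columns=range(1, k + 1))
    wide.columns = [f"{col}_top{i}" for i in range(1, k + 1)]
    return wide

demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "domain": ["a", "b", "a", "c"],
})
demo_top = top_k_per_user(demo, "domain", 2)
print(demo_top)
```

The result is indexed by `user_id`, so it can be joined onto a per-user table with `merge(..., left_on='user_id', right_index=True)` in the same way as the frames built in the loop.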


In [7]:
temp1 = pd.DataFrame.from_dict(grouped, orient='index', columns=['t1', 't2', 't3', 't4', 't5'])

In [8]:
ddd = pd.merge(temp, temp1, left_on="user_id", right_index=True)

In [16]:
# grouped_device_types = temp.groupby('user_id')['device_type'].apply(list).to_dict()
# for el in tqdm(list(grouped_device_types.keys())):
#     counts = Counter(grouped_device_types[el])
#     grouped_device_types[el] = sorted(counts, key=lambda x: counts[x], reverse=True)[:3]
#     if len(grouped_device_types[el]) < 3:
#         grouped_device_types[el] += [np.nan] * (3 - len(grouped_device_types[el]))

100%|██████████| 2640892/2640892 [00:07<00:00, 365300.77it/s]


In [15]:
# print(len(temp1), len(grouped_device_types))

2640892 2640892


In [17]:
# temp_devices = pd.DataFrame.from_dict(grouped_device_types, orient='index', columns=['device_top1', 'device_top2', 'device_top3'])
# ddd = pd.merge(ddd, temp_devices, left_on='user_id', right_index=True)

In [5]:
user_featured = pd.read_parquet("for_training.pqt")

In [6]:
user_featured = user_featured.drop(["user_id"], axis=1)

In [10]:
user_featured

Unnamed: 0,gender,age,domain_top1,domain_top2,domain_top3,domain_top4,domain_top5,device_type_top1,device_type_top2,device_type_top3,brand_top1,brand_top2,hour_top1,hour_top2,hour_top3,weekday_top1,weekday_top2
0,1,61,domain_1654,,,,,mobile,,,Huawei,,0,,,1,
1,0,26,domain_381,,,,,mobile,,,Generic_Android,,6,,,0,
2,0,33,www.domain_325,domain_609,,,,mobile,,,Generic_Android,,1,14.0,,0,
3,0,33,www.domain_325,domain_609,,,,mobile,,,Generic_Android,,1,14.0,,0,
4,0,33,www.domain_325,domain_609,,,,mobile,,,Generic_Android,,1,14.0,,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12157406,0,75,domain_2042,,,,,PC,,,,,14,,,0,
12157407,0,47,domain_2238,,,,,mobile,,,Generic_Android,,11,,,0,
12157408,0,68,domain_2194,,,,,PC,,,,,18,,,2,
12157409,1,37,domain_1498,,,,,mobile,,,Generic_Android,,22,,,0,
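Rows 2 to 4 in the preview are identical, which suggests that the per-user features were merged back onto per-request rows. If so, the same user can appear in both halves of a random train/validation split, which may make validation scores look better than they are. One possible remedy, sketched on a toy frame, is to keep one row per user before `user_id` is dropped. The frame `per_request` is illustrative only.

```python
import pandas as pd

# Toy frame mimicking the per-request layout: user 273 appears three times.
per_request = pd.DataFrame({
    "user_id": [2, 251, 273, 273, 273],
    "gender": [1, 0, 0, 0, 0],
    "age": [61, 26, 33, 33, 33],
    "domain_top1": [
        "domain_1654", "domain_381",
        "www.domain_325", "www.domain_325", "www.domain_325",
    ],
})

# Keep a single row per user; do this before dropping user_id and splitting.
per_user = per_request.drop_duplicates(subset="user_id").reset_index(drop=True)
print(len(per_request), "->", len(per_user))
```

Because the features are already aggregated per user, dropping the repeated rows loses no information.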


## Go for it!

You need to build a model that predicts users' gender and age from their site visits. The table of test users will be provided one hour before stop coding.

IMPORTANT:
* The test table must not be used to train the model. If the jury sees that you used the test data for training, the model's accuracy will be scored as 0 points.
* During your presentation you must demonstrate the model's accuracy on the test data.

### Installing libraries

In [None]:
from catboost import CatBoostClassifier

cat_cols = ["domain_top1", "domain_top2", "domain_top3", "domain_top4", "domain_top5", "device_type_top1", "device_type_top2", "device_type_top3", "brand_top1", "brand_top2"]

X_train, X_test, y_train, y_test = train_test_split(user_featured.drop(["gender", "age"], axis=1), user_featured[["gender", "age"]], train_size=0.8, stratify=user_featured["gender"])

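# Cast only the categorical columns to strings (NaN becomes the string "nan");
# numeric columns such as hour_top1 stay numeric.
X_train, X_test = X_train.copy(), X_test.copy()
for c in cat_cols:
    X_train[c] = X_train[c].astype("str")
    X_test[c]  = X_test[c].astype("str")

cat = CatBoostClassifier(iterations=200, learning_rate=0.3, depth=7, cat_features=cat_cols)

cat.fit(X_train, y_train["gender"], eval_set=(X_test, y_test["gender"]), verbose=10)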

0:	learn: 0.6570226	test: 0.6563849	best: 0.6563849 (0)	total: 3.72s	remaining: 12m 20s


In [18]:
cat.get_feature_importance(prettified=True)

Unnamed: 0,Feature Id,Importances
0,t1,29.940011
1,t4,20.99511
2,t3,19.850417
3,t2,18.341735
4,t5,10.872727


In [19]:
from sklearn.metrics import classification_report

print(classification_report(y_test["gender"], cat.predict(X_test)))

              precision    recall  f1-score   support

           0       0.75      0.54      0.63   1122361
           1       0.68      0.84      0.75   1309122

    accuracy                           0.70   2431483
   macro avg       0.71      0.69      0.69   2431483
weighted avg       0.71      0.70      0.69   2431483
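To put the 0.70 accuracy in context, it helps to compare it with what a model would score by always predicting the larger class. The short calculation below uses only the support and recall figures from the report above. The recall values are rounded to two digits, so the implied accuracy is approximate.

```python
# Support and recall copied from the classification report above.
support = {0: 1122361, 1: 1309122}
recall = {0: 0.54, 1: 0.84}

total = sum(support.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / total

# Accuracy implied by the per-class recalls (a consistency check on the report).
implied_accuracy = sum(recall[c] * support[c] for c in support) / total

print(f"majority baseline: {majority_baseline:.3f}")
print(f"implied accuracy:  {implied_accuracy:.3f}")
```

The gap between the two numbers is the part of the accuracy that actually comes from the features.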



In [None]:
from catboost import CatBoostRegressor

cat_cols = ["domain_top1", "domain_top2", "domain_top3", "domain_top4", "domain_top5", "device_type_top1", "device_type_top2", "device_type_top3", "brand_top1", "brand_top2"]

X_train, X_test, y_train, y_test = train_test_split(user_featured.drop(["gender", "age"], axis=1), user_featured[["gender", "age"]], train_size=0.8, stratify=user_featured["gender"])

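# Cast only the categorical columns to strings (NaN becomes the string "nan");
# numeric columns such as hour_top1 stay numeric.
X_train, X_test = X_train.copy(), X_test.copy()
for c in cat_cols:
    X_train[c] = X_train[c].astype("str")
    X_test[c]  = X_test[c].astype("str")

cat = CatBoostRegressor(iterations=200, learning_rate=0.3, depth=7, cat_features=cat_cols)

cat.fit(X_train, y_train["age"], eval_set=(X_test, y_test["age"]), verbose=10)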
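The source does not say how age predictions are scored, so the metric below is an assumption. As a sketch, mean absolute error (MAE) can be compared against a constant prediction to see whether the regressor learns anything beyond the typical age. The toy ages are the first five values from `train_users`.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy ages; in practice take the median from the training split
# and evaluate on the validation split.
ages = np.array([61, 55, 46, 66, 53])
baseline_pred = np.full(len(ages), np.median(ages))
baseline_mae = mean_absolute_error(ages, baseline_pred)
print(f"median-baseline MAE: {baseline_mae:.1f}")
```

In the notebook the model's score would be `mean_absolute_error(y_test["age"], cat.predict(X_test))`, to be read against the baseline.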

In [30]:
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.preprocessing import TargetEncoder

cat_cols = ["domain_top1", "domain_top2", "domain_top3", "domain_top4", "domain_top5", "device_type_top1", "device_type_top2", "device_type_top3", "brand_top1", "brand_top2"]
X_train, X_test, y_train, y_test = train_test_split(user_featured.drop(["gender", "age"], axis=1), user_featured[["gender", "age"]], train_size=0.8, stratify=user_featured["gender"])

train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols)
valid_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'max_depth': 7,
    'learning_rate': 0.3,
}

num_round = 200
bst = lgb.train(params, train_data, num_round,
                valid_sets=[train_data, valid_data])

ValueError: pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: domain_top1: object, domain_top2: object, domain_top3: object, domain_top4: object, domain_top5: object, device_type_top1: object, device_type_top2: object, device_type_top3: object, brand_top1: object, brand_top2: object
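The error arises because LightGBM does not accept `object` columns. A possible fix, sketched below, is to cast the categorical columns to the pandas `category` dtype and to pass a single label column (`y_train["gender"]`) instead of the two-column frame. The helper names `to_category` and `train_gender_lgb` are hypothetical, and the parameters simply mirror the cell above.

```python
import pandas as pd

def to_category(frame, cat_cols):
    """Return a copy with the given columns cast to pandas 'category' dtype."""
    out = frame.copy()
    for c in cat_cols:
        out[c] = out[c].astype("category")
    return out

def train_gender_lgb(X_train, y_train, X_valid, y_valid, cat_cols, num_round=200):
    """Train a binary LightGBM model; X_* must already use category dtype."""
    import lightgbm as lgb  # imported lazily so to_category works without LightGBM

    params = {
        "objective": "binary",
        "metric": "binary_logloss",
        "boosting_type": "gbdt",
        "max_depth": 7,
        "learning_rate": 0.3,
    }
    train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
    return lgb.train(params, train_data, num_round,
                     valid_sets=[train_data, valid_data])

# Small demonstration of the dtype cast on toy data.
demo = pd.DataFrame({
    "domain_top1": ["domain_1654", "domain_381", None],
    "hour_top1": [0, 6, 1],
})
demo_cat = to_category(demo, ["domain_top1"])
print(demo_cat.dtypes)
```

Casting before `train_test_split` rather than after keeps the category sets identical in the training and validation frames. The labels would then be passed as `y_train["gender"]` and `y_test["gender"]`.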

In [26]:
import numpy as np
from sklearn.preprocessing import TargetEncoder


X = np.array(X_train["domain_top1"]).reshape(1, -1)
y = y_train["gender"]
enc_auto = TargetEncoder(smooth="auto")
X_trans = enc_auto.fit_transform(X, y)

ValueError: Found input variables with inconsistent numbers of samples: [1, 9725928]

In [24]:
y.shape

(9725928,)
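The mismatch comes from `reshape(1, -1)`, which produces one sample with 9,725,928 features, while `y` has 9,725,928 samples. scikit-learn expects `X` with shape `(n_samples, n_features)`, so the column needs `reshape(-1, 1)` or a two-dimensional selection. A sketch follows; the helper `encode_domain` is hypothetical, and `TargetEncoder` requires scikit-learn 1.3 or newer.

```python
import numpy as np

values = np.array(["domain_1654", "domain_381", "www.domain_325", "domain_381"])

wrong = values.reshape(1, -1)  # shape (1, 4): one sample with four features
right = values.reshape(-1, 1)  # shape (4, 1): four samples with one feature
print(wrong.shape, right.shape)

def encode_domain(X_train, y_train):
    """Target-encode domain_top1 against the gender label."""
    from sklearn.preprocessing import TargetEncoder  # scikit-learn >= 1.3

    enc = TargetEncoder(smooth="auto")
    # Selecting with a list keeps the result two-dimensional: (n_samples, 1).
    return enc.fit_transform(X_train[["domain_top1"]], y_train["gender"])
```

Selecting `X_train[["domain_top1"]]` with double brackets avoids the manual reshape altogether.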

## Cleaning up

When you are done, unmount the directory:

In [19]:
!umount mnt

umount: /home/datadisk/jupyter-vkhack/vkhack/gender_prediction/mnt: not mounted.
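The `not mounted` message is what `umount` reports when the directory is not currently a mount point, for example if the mount step failed or the download fallback was used instead. A small sketch that checks first; the helper name `unmount_if_mounted` is hypothetical.

```python
import os
import subprocess

def unmount_if_mounted(path="mnt"):
    """Run `umount path` only when path is currently a mount point.

    Returns True if an unmount was performed, False otherwise.
    """
    if not os.path.ismount(path):
        return False
    subprocess.run(["umount", path], check=True)
    return True
```

Calling `unmount_if_mounted("mnt")` at the end of the notebook then does nothing on machines where the bucket was never mounted.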
