## Предсказание пола/возраста по посещениям сайтов

Для начала, подключим хранилище s3, содержащее исходные данные:

In [4]:
!chmod 600 .s3_passwd
!mkdir -p mnt
!s3fs hsevkhack mnt -o url=http://hb.vkcs.cloud -o use_path_request_style -o passwd_file=.s3_passwd -o ro

Убедимся, что в директории `mnt` содержатся соответствующие таблицы:

In [5]:
!ls mnt/

geo_dataframe.csv  requests  train_users.csv


Если по каким-то причинам директорию смонтировать не удалось, используйте код ниже, чтобы скачать данные к себе в локальное хранилище. В противном случае **пропустите следующую ячейку**.

In [6]:
import subprocess
!mkdir -p data
!wget https://hsehack.hb.ru-msk.vkcs.cloud/geo_dataframe.csv -P data
!wget https://hsehack.hb.ru-msk.vkcs.cloud/train_users.csv -P data
!mkdir -p data/requests
for i in range(30):
    print(f"Downloading part {i}...")
    subprocess.call(["wget", f"https://hsehack.hb.ru-msk.vkcs.cloud/requests/part_{i}.parquet", "-q", "-P", "data/requests"])

--2024-04-20 10:30:46--  https://hsehack.hb.ru-msk.vkcs.cloud/geo_dataframe.csv
Resolving hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)... 95.163.53.117
Connecting to hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)|95.163.53.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63042 (62K) [text/csv]
Saving to: ‘data/geo_dataframe.csv.1’


2024-04-20 10:30:46 (25.8 MB/s) - ‘data/geo_dataframe.csv.1’ saved [63042/63042]

--2024-04-20 10:30:46--  https://hsehack.hb.ru-msk.vkcs.cloud/train_users.csv
Resolving hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)... 95.163.53.117
Connecting to hsehack.hb.ru-msk.vkcs.cloud (hsehack.hb.ru-msk.vkcs.cloud)|95.163.53.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66826127 (64M) [text/csv]
Saving to: ‘data/train_users.csv.1’


2024-04-20 10:30:48 (53.3 MB/s) - ‘data/train_users.csv.1’ saved [66826127/66826127]

Downloading part 0...
Downloading part 1...
Downl

## Загружаем данные

Используем Pandas для работы с таблицами. `geo_dataframe` содержит данные о геолокации - регионе и стране.

In [6]:
import pandas as pd
from collections import Counter
from tqdm import tqdm
import numpy as np

from sklearn.model_selection import train_test_split

Таблица `train_users` - это ваша обучающая выборка, содержащая пол и возраст пользователей. Аналогичная таблица для тестового датасета будет вам предоставлена за час до stop coding.

In [7]:
users = pd.read_csv('mnt/train_users.csv')
users

Unnamed: 0,user_id,gender,age
0,2,1,61
1,3,1,55
2,6,0,46
3,14,0,66
4,17,0,53
...,...,...,...
4999995,17588859,1,64
4999996,17588860,0,69
4999997,17588861,1,51
4999998,17588864,0,30


Для примера - вот распределение пользователей по полу:

Основные данные о посещениях пользователями сайтов содержатся в табличке `requests`, представленной в формате parquet. Вы можете загрузить только одну часть таблички, или же всю таблицу, если она поместиться в памяти:

In [37]:
# from sklearn.neighbors import NearestNeighbors
# from sklearn.preprocessing import LabelEncoder
# from sklearn.model_selection import train_test_split

# df = pd.read_parquet('mnt/requests/part_0.parquet')
# req1 = req1.merge(users, left_on='user_id', right_on='user_id')

# le = LabelEncoder()

# le.fit(req1["referer"])
# req1["referer"] = le.transform(req1["referer"])

# le.fit(req1["user_agent"])
# req1["user_agent"] = le.transform(req1["user_agent"])

# for col in req1.columns:
#     req1[col] = req1[col].astype("int")
    
# # X_train, X_test, y_train, y_test = train_test_split(req1.drop(["gender", "age", "user_id"], axis=1), req1[["gender", "age"]], train_size=0.8, stratify=req1[["gender", "age"]])

# # # for col in X_train.columns:
# # #     X_train[col] = (X_train[col] - X_train[col].mean()) / X_train[col].std()
    
# # # for col in X_test.columns:
# # #     X_test[col] = (X_test[col] - X_test[col].mean()) / X_test[col].std()

# # X_train[["gender", "age"]] = y_train
# # X_test[["gender", "age"]] = y_test * 0 - 1

# # nn = NearestNeighbors(n_neighbors=7).fit(X_train)
# # neights = nn.kneighbors(X_test, 15, return_distance=False)

In [3]:
df = pd.read_parquet("featured/part_0.parquet")
df = df.drop(["timestamp", "user_agent", "referer"], axis=1)

In [52]:
df.head()

Unnamed: 0,user_id,gender,age,geo_id,region_id,country_id,year,month,day,hour,minute,second,weekday,browser_family,os_family,brand,device_type,domain
0,2,1,61,708,7440,40,2024,4,2,0,21,37,1,Chrome,Android,Huawei,mobile,domain_1654
1,251,0,26,708,7440,40,2024,4,1,6,4,27,0,Chrome Mobile,Android,Generic_Android,mobile,domain_381
2,273,0,33,708,7440,40,2024,4,1,14,23,12,0,Chrome Mobile,Android,Generic_Android,mobile,domain_609
3,273,0,33,708,7440,40,2024,4,1,1,56,40,0,Chrome Mobile,Android,Generic_Android,mobile,www.domain_325
4,273,0,33,708,7440,40,2024,4,1,1,55,19,0,Chrome Mobile,Android,Generic_Android,mobile,www.domain_325


In [3]:
user_features = {
    "domain": 3,
    "device_type": 2,
    "brand": 1,
    "hour": 3,
    "weekday": 2
}

In [4]:
temp = df[["user_id", "gender", "age", "domain", "device_type", "brand", "hour", "weekday"]]

NameError: name 'df' is not defined

In [5]:
user_featured = users[["user_id", "gender", "age"]]

NameError: name 'users' is not defined

In [15]:
# for user_feature in tqdm(user_features):
#     grouped = temp.groupby('user_id')[user_feature].apply(list).to_dict()
#     for el in grouped:
#         counts = Counter(grouped[el])
#         grouped_el = sorted(counts, key=lambda x: counts[x], reverse=True)[:5]
#         if len(grouped[el]) < 5:
#             grouped[el] += [np.nan] * (5 - len(grouped[el]))
#     temp_ = pd.DataFrame.from_dict(grouped, orient='index', columns=[f'{user_feature}_top{i}' for i in range(1, user_features[user_feature]+1)])
#     user_featured = user_featured.merge(user_featured, temp_, left_on='user_id', right_index=True)
#     del temp_
#     del grouped
user_features = {
    "domain": 3,
    "device_type": 2,
    "brand": 1,
    "hour": 3,
    "weekday": 2
}
user_featured = users[["user_id", "gender", "age"]]
all_dfs = []
for i in tqdm(range(30)):
    try:
        df = pd.read_parquet(f"featured/part_{i}.parquet")
    except FileNotFoundError: continue
    all_dfs.append(df[["user_id", "gender", "age", "domain", "device_type", "brand", "hour", "weekday"]])

temp = pd.concat(all_dfs)
    
for user_feature in user_features:
    grouped = temp.groupby('user_id')[user_feature].apply(list).to_dict()
    for el in tqdm(grouped):
        counts = Counter(grouped[el])
        grouped_el = sorted(counts, key=lambda x: counts[x], reverse=True)[:user_features[user_feature]]
        if len(grouped[el]) < user_features[user_feature]:
            grouped_el += [np.nan] * (user_features[user_feature] - len(grouped[el]))
        grouped[el] = grouped_el
    columns = [f'{user_feature}_top{i}' for i in range(1, user_features[user_feature]+1)]
    pre = pd.DataFrame.from_dict(grouped, orient='index', columns=columns)
    user_featured = user_featured.merge(pre, left_on='user_id', right_index=True)

100%|██████████| 30/30 [00:30<00:00,  1.02s/it]
100%|██████████| 4468805/4468805 [00:15<00:00, 290847.64it/s]
100%|██████████| 4468805/4468805 [00:13<00:00, 321357.94it/s]
100%|██████████| 4468805/4468805 [00:13<00:00, 330902.40it/s]
100%|██████████| 4468805/4468805 [00:15<00:00, 290718.61it/s]
100%|██████████| 4468805/4468805 [00:14<00:00, 313736.04it/s]


In [19]:
user_featured.to_parquet('4datasets.parquet')

In [8]:
user_featured = user_featured.replace([None], np.nan)

In [9]:
user_featured

Unnamed: 0,user_id,gender,age,domain_top1,domain_top2,domain_top3,device_type_top1,device_type_top2,brand_top1,hour_top1,hour_top2,hour_top3,weekday_top1,weekday_top2
0,2,1,61,domain_1654,,,mobile,,Huawei,0,,,1,
1,3,1,55,domain_2867,www.domain_78,,mobile,,Generic_Android,15,18.0,,2,1.0
2,6,0,46,domain_3194,domain_1834,www.domain_1123,mobile,,Generic_Android,23,8.0,19.0,0,1.0
3,14,0,66,domain_2238,,,PC,,,17,,,1,
4,17,0,53,domain_2285,www.domain_2582,www.domain_824,mobile,,Generic_Android,14,17.0,15.0,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4999994,17588855,1,50,domain_145,,domain_3019,mobile,,Generic_Android,0,18.0,13.0,0,2.0
4999996,17588860,0,69,domain_2238,domain_2042,domain_609,PC,,,1,16.0,12.0,1,0.0
4999997,17588861,1,51,domain_21,,,mobile,,Generic_Android,5,,,2,
4999998,17588864,0,30,domain_2194,,,PC,,,2,4.0,6.0,0,1.0


In [6]:
# temp1 = pd.DataFrame.from_dict(grouped_domains, orient='index', columns=['t1', 't2', 't3', 't4', 't5'])

In [7]:
# ddd = pd.merge(temp, temp1, left_on="user_id", right_index=True)

In [16]:
# grouped_device_types = temp.groupby('user_id')['device_type'].apply(list).to_dict()
# for el in tqdm(list(grouped_device_types.keys())):
#     counts = Counter(grouped_device_types[el])
#     grouped_device_types[el] = sorted(counts, key=lambda x: counts[x], reverse=True)[:3]
#     if len(grouped_device_types[el]) < 3:
#         grouped_device_types[el] += [np.nan] * (3 - len(grouped_device_types[el]))

100%|██████████| 2640892/2640892 [00:07<00:00, 365300.77it/s]


In [15]:
# print(len(temp1), len(grouped_device_types))

2640892 2640892


In [17]:
# temp_devices = pd.DataFrame.from_dict(grouped_device_types, orient='index', columns=['device_top1', 'device_top2', 'device_top3'])
# ddd = pd.merge(ddd, temp_devices, left_on='user_id', right_index=True)

In [44]:
user_featured = user_featured.drop(["user_id"] + list(user_features.keys()), axis=1)

KeyError: "['domain', 'device_type', 'brand', 'hour', 'weekday'] not found in axis"

In [None]:
user_featured

In [10]:
X_train, X_test, y_train, y_test = train_test_split(user_featured.drop(["gender", "age"], axis=1), user_featured[["gender", "age"]], train_size=0.8, stratify=user_featured["gender"])

In [43]:
X_train

Unnamed: 0,user_id,domain_top1,domain_top2,domain_top3,device_type_top1,device_type_top2,brand_top1,hour_top1,hour_top2,hour_top3,weekday_top1,weekday_top2
2076935,7279826,www.domain_1707,,,PC,,,12,,,2,
3621540,12728891,domain_1654,,,mobile,,Samsung,3,,,2,
728020,2549944,domain_2194,,,PC,,,5,15.0,,1,2.0
4313062,15174688,domain_1406,www.domain_403,,PC,,,4,,,2,
4021267,14143791,domain_2238,,,PC,,,10,,,0,
...,...,...,...,...,...,...,...,...,...,...,...,...
1888168,6677767,domain_11,,,PC,,,15,,,0,
4955815,17399069,domain_1180,,,PC,,,8,,,0,
4812317,16917419,domain_609,,,PC,,,9,,,0,
675966,2357748,domain_3357,,,mobile,,Samsung,13,,,1,


In [20]:
from tqdm import tqdm

pred_gender = []
pred_age = []

for a in tqdm(neights[:50000]):
    temp = X_train.iloc[a].groupby("gender").agg({"gender": "count", "age": "mean"})
    pred_gender.append(temp[temp["gender"] == temp["gender"].max()].index[0])
    pred_age.append(temp[temp["gender"] == temp["gender"].max()].age)

NameError: name 'neights' is not defined

## Дерзайте!

Вам необходимо построить предсказательную модель для прогнозирования пола и возраста пользователей по их посещениям. Таблица с тестовыми пользователями будет предоставлена вам за час до stop coding.

ВАЖНО:
* Таблицу с тестовыми данными нельзя использовать для обучения модели. Если жюри увидит, что вы использовали табличку с тестовыми данными - точность модели будет оцениваться в 0 баллов.
* В ходе выступления необходимо продемонстрировать точность модели на тестовых данных.

### Установка библиотек

In [27]:
X_train

Unnamed: 0,domain_top1,domain_top2,domain_top3,domain_top4,domain_top5,device_type_top1,device_type_top2,device_type_top3,brand_top1,brand_top2,hour_top1,hour_top2,hour_top3,hour_top4,hour_top5,hour_top6,hour_top7,weekday_top1,weekday_top2,weekday_top3
567788,domain_1081,domain_2042,domain_609,,,PC,,,,,12,8.0,,,,,,2,0.0,
11544885,domain_2998,domain_609,,,,PC,,,,,8,,,,,,,2,,
2244153,domain_2206,domain_609,,,,PC,,,,,7,12.0,10.0,,,,,0,1.0,
10265454,domain_609,,,,,PC,,,,,10,,,,,,,0,,
10077815,domain_2042,domain_2206,,,,PC,,,,,8,11.0,,,,,,2,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8874200,domain_2194,,,,,PC,,,,,11,10.0,,,,,,2,,
3774263,domain_2042,domain_609,www.domain_2395,,,mobile,,,Generic_Android,,6,4.0,7.0,5.0,,,,1,0.0,
5814635,domain_2194,domain_2206,domain_1180,domain_390,domain_573,PC,,,,,21,9.0,23.0,16.0,,,,6,0.0,1.0
7882427,domain_2042,domain_1081,domain_609,,,PC,,,,,14,19.0,,,,,,2,1.0,


In [11]:
from catboost import CatBoostClassifier
cat_cols = []
for user_feature in user_features:
    for n in range(1, user_features[user_feature]+1):
        cat_cols.append(f'{user_feature}_top{n}')


for c in X_train.columns:
    X_train[c] = X_train[c].astype("str")
    X_test[c]  = X_test[c].astype("str")

cat = CatBoostClassifier(iterations=100, learning_rate=0.997, depth=7, cat_features=cat_cols)

cat.fit(X_train, y_train["gender"], eval_set=(X_test, y_test["gender"]), verbose=10)

0:	learn: 0.6560463	test: 0.6558102	best: 0.6558102 (0)	total: 520ms	remaining: 51.5s
10:	learn: 0.6375161	test: 0.6371301	best: 0.6371301 (10)	total: 4.25s	remaining: 34.4s
20:	learn: 0.6363484	test: 0.6359639	best: 0.6359639 (20)	total: 7.71s	remaining: 29s
30:	learn: 0.6359353	test: 0.6356588	best: 0.6356588 (30)	total: 11.1s	remaining: 24.6s
40:	learn: 0.6357550	test: 0.6355643	best: 0.6355643 (40)	total: 14.4s	remaining: 20.7s
50:	learn: 0.6355897	test: 0.6354326	best: 0.6354319 (49)	total: 18.1s	remaining: 17.4s
60:	learn: 0.6354370	test: 0.6353438	best: 0.6353438 (60)	total: 23.6s	remaining: 15.1s
70:	learn: 0.6353458	test: 0.6353396	best: 0.6353396 (70)	total: 29s	remaining: 11.8s
80:	learn: 0.6352551	test: 0.6353108	best: 0.6353108 (80)	total: 34.5s	remaining: 8.1s
90:	learn: 0.6352099	test: 0.6353211	best: 0.6353104 (81)	total: 38.4s	remaining: 3.8s
99:	learn: 0.6351471	test: 0.6353001	best: 0.6352992 (98)	total: 41.3s	remaining: 0us

bestTest = 0.6352991541
bestIteration = 9

<catboost.core.CatBoostClassifier at 0x7f42b4e45130>

In [12]:
cat.get_feature_importance(prettified=True)

Unnamed: 0,Feature Id,Importances
0,domain_top1,38.748966
1,domain_top2,29.422937
2,domain_top3,14.319666
3,brand_top1,9.341919
4,hour_top3,2.17781
5,hour_top1,2.111354
6,hour_top2,1.427463
7,device_type_top2,1.150699
8,weekday_top2,0.357017
9,device_type_top1,0.327036


In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_test["gender"], cat.predict(X_test)))

              precision    recall  f1-score   support

           0       0.63      0.57      0.60    443726
           1       0.61      0.68      0.64    450035

    accuracy                           0.62    893761
   macro avg       0.62      0.62      0.62    893761
weighted avg       0.62      0.62      0.62    893761



## Убираем за собой

В конце работы можем размонтировать директорию:

In [19]:
!umount mnt

umount: /home/datadisk/jupyter-vkhack/vkhack/gender_prediction/mnt: not mounted.
