# **Lab 6:**
## "Developing a behavior prediction system based on graph models"

---

**Student:** Кривцов Н.А.  
**Group:** ИУ5-22М   

---

*Goal*: learn to work with graph-structured data and graph neural networks.

*Task*: build a graph dataset from a purchase database and train a model that predicts whether a purchase will be made.

---

## Graph neural networks

**Graph neural networks** (GNNs) are a type of neural network that operates directly on the structure of a graph. Typical applications of GNNs include:
- Node classification;
- Link prediction;
- Graph classification;
- Motion recognition;
- Recommender systems.

This lab works with **graph convolutional networks**. Unlike conventional convolutional neural networks, they operate on inputs that have no fixed structure.

More details can be found here: https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b

Current approaches to using graph convolutional networks are collected here:
https://paperswithcode.com/method/gcn
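
To make the idea of convolution on a graph more concrete, here is a minimal sketch of one simplified message-passing step in plain Python. It is not the layer used later in this notebook: the function name `mean_aggregate`, the toy graph, and the choice of mean aggregation with a self-loop are all assumptions made for illustration, and no learnable weights are involved.

```python
def mean_aggregate(features, edges):
    """One simplified message-passing step.

    Each node averages its own feature vector with the feature vectors
    of the nodes that have an edge pointing to it.
    """
    dim = len(features[0])
    # Start from the node's own features (acts as a self-loop)
    sums = [list(f) for f in features]
    counts = [1] * len(features)
    for src, dst in edges:
        for k in range(dim):
            sums[dst][k] += features[src][k]
        counts[dst] += 1
    return [[value / count for value in row] for row, count in zip(sums, counts)]


# Toy graph (hypothetical): 0 -> 1 -> 2, two features per node
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
updated = mean_aggregate(features, edges)
print(updated)
```

The point of the sketch is that the neighbourhood of each node can have any size, which is why the operation is defined per edge rather than over a fixed grid as in an image convolution.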

---

## Dataset
As the data source we suggest the RecSys Challenge 2015 dataset of user purchases in a single store (https://www.kaggle.com/datasets/chadgostopp/recsys-challenge-2015). 

The dataset can be downloaded from: https://drive.google.com/drive/folders/1gtAeXPTj-c0RwVOKreMrZ3bfSmCwl2y-?usp=sharing
(the lite version is a reduced version of the original dataset; we recommend using it)

We also recommend downloading the data as an archive and unpacking it with the zipfile package, and/or copying the dataset to your own Google Drive and mounting it in Colab.

---

### Installing libraries and loading the source datasets

In [1]:
# Slow method of installing pytorch geometric
# !pip install torch_geometric
# !pip install torch_sparse
# !pip install torch_scatter

# Install pytorch geometric
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-geometric -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.11.0%2Bcu113.html

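
The install commands above point pip at a wheel index whose URL encodes the PyTorch version and CUDA tag, with the `+` written as `%2B`. The sketch below shows how such a URL can be assembled. The helper name `wheel_index_url` is hypothetical, the URL pattern is simply copied from the last command above, and it is an assumption that an index page exists for any other version pair.

```python
from urllib.parse import quote


def wheel_index_url(torch_version, cuda_tag, base="https://data.pyg.org/whl"):
    """Build a wheel index URL for a PyTorch/CUDA pair.

    quote() percent-encodes the '+' between version and CUDA tag as %2B.
    """
    return f"{base}/torch-{quote(torch_version + '+' + cuda_tag)}.html"


url = wheel_index_url("1.11.0", "cu113")
print(url)
```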

In [2]:
import numpy as np
import pandas as pd
import pickle
import csv
import os

from sklearn.preprocessing import LabelEncoder

import torch

# PyG - PyTorch Geometric
from torch_geometric.data import Data, InMemoryDataset
# In PyG 2.x DataLoader lives in torch_geometric.loader
from torch_geometric.loader import DataLoader

from tqdm import tqdm


RANDOM_SEED = 125 #@param { type: "integer" }
BASE_DIR = '/content/' #@param { type: "string" }
np.random.seed(RANDOM_SEED) 

In [3]:
# Check if CUDA is available in Colab (the function must be called, not just referenced)
torch.cuda.is_available()

In [5]:
# Unpack files from zip-file
import zipfile
with zipfile.ZipFile(BASE_DIR + 'yoochoose-data-lite.zip', 'r') as zip_ref:
    zip_ref.extractall(BASE_DIR)

### Exploring the source data

In [6]:
# Read dataset of items in store
df = pd.read_csv(BASE_DIR + 'yoochoose-clicks-lite.dat')
# df.columns = ['session_id', 'timestamp', 'item_id', 'category'] 
df.head()

| Unnamed: 0 | session_id | timestamp | item_id | category |
|---|---|---|---|---|
| 0 | 9 | 2014-04-06T11:26:24.127Z | 214576500 | 0 |
| 1 | 9 | 2014-04-06T11:28:54.654Z | 214576500 | 0 |
| 2 | 9 | 2014-04-06T11:29:13.479Z | 214576500 | 0 |
| 3 | 19 | 2014-04-01T20:52:12.357Z | 214561790 | 0 |
| 4 | 19 | 2014-04-01T20:52:13.758Z | 214561790 | 0 |


In [7]:
# Read dataset of purchases
buy_df = pd.read_csv(BASE_DIR + 'yoochoose-buys-lite.dat')
# buy_df.columns = ['session_id', 'timestamp', 'item_id', 'price', 'quantity']
buy_df.head()

| Unnamed: 0 | session_id | timestamp | item_id | price | quantity |
|---|---|---|---|---|---|
| 0 | 420374 | 2014-04-06T18:44:58.314Z | 214537888 | 12462 | 1 |
| 1 | 420374 | 2014-04-06T18:44:58.325Z | 214537850 | 10471 | 1 |
| 2 | 489758 | 2014-04-06T09:59:52.422Z | 214826955 | 1360 | 2 |
| 3 | 489758 | 2014-04-06T09:59:52.476Z | 214826715 | 732 | 2 |
| 4 | 489758 | 2014-04-06T09:59:52.578Z | 214827026 | 1046 | 1 |


In [8]:
# Keep only sessions with more than 2 clicks (sessions of length <= 2 are dropped)
df['valid_session'] = df.session_id.map(df.groupby('session_id')['item_id'].size() > 2)
df = df.loc[df.valid_session].drop('valid_session',axis=1)
df.nunique()

session_id    1000000
timestamp     5557758
item_id         37644
category          275
dtype: int64
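
The filter above counts clicks per session and keeps only the rows whose session is long enough. The same logic can be sketched without pandas; the click rows below are toy values invented for illustration, not rows from the dataset.

```python
from collections import Counter

# Toy (session_id, item_id) click rows, hypothetical values
clicks = [
    (1, 10), (1, 11), (1, 12),
    (2, 20), (2, 21),
    (3, 30), (3, 30), (3, 31), (3, 32),
]

# Number of clicks per session, the analogue of groupby(...).size()
session_sizes = Counter(session_id for session_id, _ in clicks)

# Keep rows whose session has more than 2 clicks
kept = [row for row in clicks if session_sizes[row[0]] > 2]
kept_sessions = sorted({session_id for session_id, _ in kept})
print(kept_sessions, len(kept))
```

Session 2 has exactly two clicks and is removed, which matches the strict `> 2` comparison in the cell above.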

In [9]:
# Randomly sample NUM_SESSIONS sessions (without replacement) to keep the dataset small
NUM_SESSIONS = 50000 #@param { type: "integer" }
sampled_session_id = np.random.choice(df.session_id.unique(), NUM_SESSIONS, replace=False)
df = df.loc[df.session_id.isin(sampled_session_id)]
df.nunique()

session_id     50000
timestamp     279522
item_id        18732
category         103
dtype: int64

In [10]:
# Average length of session
df.groupby('session_id')['item_id'].size().mean()

5.5907

In [11]:
# Encode item and category ids as consecutive integers in [0, n_unique) so they can index embedding tables
item_encoder = LabelEncoder()
category_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
df['category']= category_encoder.fit_transform(df.category.apply(str))
df.head()

| Unnamed: 0 | session_id | timestamp | item_id | category |
|---|---|---|---|---|
| 27 | 26 | 2014-04-06T16:42:55.741Z | 3639 | 0 |
| 28 | 26 | 2014-04-06T16:44:58.482Z | 11053 | 0 |
| 29 | 26 | 2014-04-06T16:45:11.344Z | 7533 | 0 |
| 30 | 26 | 2014-04-06T16:46:19.569Z | 4866 | 0 |
| 105 | 187 | 2014-04-02T18:05:22.418Z | 2395 | 0 |
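
As a rough illustration of what the label encoding does, the sketch below maps raw ids to consecutive integers in sorted order, which is how `LabelEncoder` assigns its classes. The helper `fit_label_encoder` is hypothetical; the raw item ids are taken from the tables shown earlier in this notebook.

```python
def fit_label_encoder(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}


raw_item_ids = [214576500, 214561790, 214576500, 214537888]
mapping = fit_label_encoder(raw_item_ids)
encoded = [mapping[value] for value in raw_item_ids]
print(encoded)
```

Compact ids matter here because the model later uses them as row indices into an embedding table, so the table needs only as many rows as there are distinct items.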


In [12]:
# Keep purchases from the sampled sessions only and encode their item ids
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
buy_df = buy_df.loc[buy_df.session_id.isin(df.session_id)].copy()
buy_df['item_id'] = item_encoder.transform(buy_df.item_id)
buy_df.head()


| Unnamed: 0 | session_id | timestamp | item_id | price | quantity |
|---|---|---|---|---|---|
| 0 | 420374 | 2014-04-06T18:44:58.314Z | 1193 | 12462 | 1 |
| 1 | 420374 | 2014-04-06T18:44:58.325Z | 1186 | 10471 | 1 |
| 57 | 396 | 2014-04-06T17:53:45.147Z | 13004 | 523 | 1 |
| 105 | 351689 | 2014-04-03T07:29:02.313Z | 12598 | 2092 | 1 |
| 141 | 420229 | 2014-04-02T18:51:54.172Z | 14742 | 1883 | 2 |


In [13]:
# Map each session_id to the list of encoded item ids bought in that session
buy_item_dict = dict(buy_df.groupby('session_id')['item_id'].apply(list))
buy_item_dict

{396: [13004],
 5332: [1284, 9643],
 5717: [11286],
 8427: [694, 694],
 10019: [15409, 11598],
 11527: [11395],
 11718: [13042, 13044, 11952],
 14007: [772, 772],
 14484: [11938, 11940, 11992, 11938, 11940, 11992],
 19094: [5240, 10440, 5247],
 22427: [13029, 13017],
 22908: [15458],
 28667: [10764, 10761],
 28834: [4728, 13008, 1924, 8286, 1910, 1909, 14973],
 33131: [4366, 8255, 8255, 4366],
 35436: [12742, 12490],
 35699: [1668],
 37807: [12819, 12819],
 39237: [12692, 4146],
 41849: [13008],
 46922: [2982, 2971],
 47917: [14973, 4999, 3717],
 51736: [12679, 12819],
 56174: [12666, 10518],
 60584: [4881],
 63126: [12980, 12827],
 63799: [12793],
 64477: [12981],
 64682: [9682, 9683],
 65192: [9006],
 68052: [13053, 12971, 12683],
 69884: [12666],
 72136: [13053, 12971, 13033],
 74741: [15954, 12682],
 78072: [13958],
 78373: [15745, 15729, 8762, 15499],
 80051: [12820, 12667, 13908],
 83842: [15192],
 84542: [10568],
 86688: [14973],
 88254: [14973],
 89226: [14973, 12970, 12961, 12
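
The grouping step above can be pictured as filling a dictionary of lists. Below is a minimal sketch in plain Python; the `(session_id, item_id)` pairs reproduce the first three sessions of the output above, and the variable names are assumptions.

```python
from collections import defaultdict

# (session_id, item_id) purchase rows matching the first entries printed above
purchases = [(396, 13004), (5332, 1284), (5332, 9643), (8427, 694), (8427, 694)]

buy_items = defaultdict(list)
for session_id, item_id in purchases:
    # Repeated purchases of the same item stay in the list as duplicates
    buy_items[session_id].append(item_id)
buy_items = dict(buy_items)
print(buy_items)
```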

### Building the training set

In [14]:
# Convert each session into a PyG Data graph (nodes = unique items, edges = consecutive clicks)
def transform_dataset(df, buy_item_dict):
    data_list = []

    # Group by session
    grouped = df.groupby('session_id')
    for session_id, group in tqdm(grouped):    
        le = LabelEncoder()
        sess_item_id = le.fit_transform(group.item_id)
        group = group.reset_index(drop=True)
        group['sess_item_id'] = sess_item_id

        # Node features: one (item_id, category) row per unique item, ordered by
        # the session-local index so that row i describes node i
        node_features = group.loc[group.session_id==session_id,
                                    ['sess_item_id','item_id','category']].sort_values('sess_item_id')[['item_id','category']].drop_duplicates().values
        node_features = torch.LongTensor(node_features).unsqueeze(1)
        # Edges link each click to the next click in the session
        target_nodes = group.sess_item_id.values[1:]
        source_nodes = group.sess_item_id.values[:-1]

        # Stack into one array first: building a tensor from a list of arrays is slow
        edge_index = torch.tensor(np.array([source_nodes,
                                target_nodes]), dtype=torch.long)
        x = node_features

        # Node labels: 1 if the item was bought in this session, otherwise 0
        if session_id in buy_item_dict:
            positive_indices = le.transform(buy_item_dict[session_id])
            label = np.zeros(len(node_features))
            label[positive_indices] = 1
        else:
            label = [0] * len(node_features)

        y = torch.FloatTensor(label)

        data = Data(x=x, edge_index=edge_index, y=y)

        data_list.append(data)
    
    return data_list

# Pytorch class for creating datasets
class YooChooseDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []

    @property
    def processed_file_names(self):
        # File names are resolved relative to self.processed_dir, so no BASE_DIR prefix is needed
        return ['yoochoose_click_binary_100000_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        data_list = transform_dataset(df, buy_item_dict)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
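
To see what `transform_dataset` produces for a single session, here is a minimal sketch without pandas or PyTorch. The function name `session_to_graph` is hypothetical. The first three item ids come from session 26 in the table above, while the repeated fourth click and the purchased item are assumptions added to show how repeated clicks and labels behave.

```python
def session_to_graph(clicked_items, bought_items):
    """Turn one click sequence into (edge_index, node_labels)."""
    # Session-local node index: rank of each item among the sorted unique ids
    local_index = {item: i for i, item in enumerate(sorted(set(clicked_items)))}
    nodes = [local_index[item] for item in clicked_items]

    # Each click is connected to the click that follows it
    sources, targets = nodes[:-1], nodes[1:]
    edge_index = [sources, targets]

    # One label per node: 1.0 if the item was bought in this session
    labels = [0.0] * len(local_index)
    for item in bought_items:
        labels[local_index[item]] = 1.0
    return edge_index, labels


edge_index, labels = session_to_graph([3639, 11053, 7533, 11053], bought_items=[7533])
print(edge_index, labels)
```

A repeated click does not create a new node; it only adds another edge into the existing node, so the graph has three nodes and three edges here.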

In [15]:
# Prepare dataset
dataset = YooChooseDataset('./')

Processing...
100%|██████████| 50000/50000 [02:56<00:00, 283.83it/s]
Done!


### Splitting the dataset

In [16]:
# Shuffle, then split 80/10/10 into train, validation and test sets
dataset = dataset.shuffle()
one_tenth_length = int(len(dataset) * 0.1)
train_dataset = dataset[:one_tenth_length * 8]
val_dataset = dataset[one_tenth_length*8:one_tenth_length * 9]
test_dataset = dataset[one_tenth_length*9:]
len(train_dataset), len(val_dataset), len(test_dataset)

(40000, 5000, 5000)
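
The slicing arithmetic of the split can be checked on a plain list. This is only a sketch of the index computation used above; `split_80_10_10` is a hypothetical helper and no shuffling is done here.

```python
def split_80_10_10(items):
    """Split a sequence into 80% / 10% / remainder using the same slices as above."""
    tenth = int(len(items) * 0.1)
    return items[:tenth * 8], items[tenth * 8:tenth * 9], items[tenth * 9:]


train, val, test = split_80_10_10(list(range(50000)))
sizes = (len(train), len(val), len(test))

# When the length is not divisible by 10, the leftover items land in the test part
small_sizes = tuple(len(part) for part in split_80_10_10(list(range(25))))
print(sizes, small_sizes)
```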

In [17]:
# Load dataset into PyG loaders 
batch_size= 512
train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)



In [18]:
# Number of distinct items and categories (sizes of the embedding tables)
num_items = df.item_id.max() +1
num_categories = df.category.max()+1
num_items , num_categories

(18732, 102)

### Configuring the model for training

In [19]:
embed_dim = 128
from torch_geometric.nn import GraphConv, TopKPooling, GatedGraphConv, SAGEConv, SGConv
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Model Structure
        self.conv1 = GraphConv(embed_dim * 2, 128)
        self.pool1 = TopKPooling(128, ratio=0.9)
        self.conv2 = GraphConv(128, 128)
        self.pool2 = TopKPooling(128, ratio=0.9)
        self.conv3 = GraphConv(128, 128)
        self.pool3 = TopKPooling(128, ratio=0.9)
        self.item_embedding = torch.nn.Embedding(num_embeddings=num_items, embedding_dim=embed_dim)
        self.category_embedding = torch.nn.Embedding(num_embeddings=num_categories, embedding_dim=embed_dim)        
        self.lin1 = torch.nn.Linear(256, 256)
        self.lin2 = torch.nn.Linear(256, 128)
        # Note: bn1 and bn2 are defined but never called in forward()
        self.bn1 = torch.nn.BatchNorm1d(128)
        self.bn2 = torch.nn.BatchNorm1d(64)
        self.act1 = torch.nn.ReLU()
        self.act2 = torch.nn.ReLU()        
  
    # Forward step of a model
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        
        item_id = x[:,:,0]
        category = x[:,:,1]
        

        emb_item = self.item_embedding(item_id).squeeze(1)
        emb_category = self.category_embedding(category).squeeze(1)
        
        x = torch.cat([emb_item, emb_category], dim=1)
        x = F.relu(self.conv1(x, edge_index))
        # TopKPooling returns (x, edge_index, edge_attr, batch, perm, score)
        x, edge_index, _, batch, _, _ = self.pool1(x, edge_index, None, batch)
        x1 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv2(x, edge_index))
     
        x, edge_index, _, batch, _, _ = self.pool2(x, edge_index, None, batch)
        x2 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv3(x, edge_index))

        x, edge_index, _, batch, _, _ = self.pool3(x, edge_index, None, batch)
        x3 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = x1 + x2 + x3

        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.act2(x)      
        
        outputs = []
        for i in range(x.size(0)):
            output = torch.matmul(emb_item[data.batch == i], x[i,:])

            outputs.append(output)
              
        x = torch.cat(outputs, dim=0)
        x = torch.sigmoid(x)
        
        return x
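
The last part of `forward` scores every item of a session by taking the dot product of its item embedding with the session-level vector and passing the result through a sigmoid. A minimal sketch of that scoring step in plain Python follows; the function name `score_items` and the two-dimensional toy vectors are assumptions chosen so the numbers are easy to follow.

```python
import math


def score_items(item_embeddings, session_vector):
    """Return sigmoid(dot(embedding, session_vector)) for every item."""
    scores = []
    for embedding in item_embeddings:
        dot = sum(e * s for e, s in zip(embedding, session_vector))
        scores.append(1.0 / (1.0 + math.exp(-dot)))
    return scores


# Three toy items scored against one toy session vector
probs = score_items([[1.0, 0.0], [0.0, 0.0], [-1.0, 1.0]], [2.0, 1.0])
print(probs)
```

An item whose embedding is orthogonal to the session vector gets a probability of exactly 0.5, while aligned embeddings are pushed towards 1 and opposed ones towards 0.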

### Training the graph convolutional network

In [20]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
# Choose optimizer and criterion for learning
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
crit = torch.nn.BCELoss()
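
For reference, binary cross-entropy averages the negative log-likelihood of the true labels under the predicted probabilities. The sketch below computes it by hand; `binary_cross_entropy` is a hypothetical helper and omits the numerical safeguards a library implementation would have (it assumes every probability is strictly between 0 and 1).

```python
import math


def binary_cross_entropy(probabilities, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all predictions."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)


# A predictor that always says 0.5 versus one that is confidently right
loss_uninformed = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])
loss_confident = binary_cross_entropy([0.9, 0.1], [1.0, 0.0])
print(loss_uninformed, loss_confident)
```

The uninformed predictor scores ln 2 (about 0.693), which is a useful baseline when reading the loss values printed during training.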

In [21]:
# Train function
def train():
    model.train()

    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)

        label = data.y.to(device)
        loss = crit(output, label)
        loss.backward()
        loss_all += data.num_graphs * loss.item()
        optimizer.step()
    return loss_all / len(train_dataset)

In [22]:
# Evaluate result of a model
from sklearn.metrics import roc_auc_score
def evaluate(loader):
    model.eval()

    predictions = []
    labels = []

    with torch.no_grad():
        for data in loader:

            data = data.to(device)
            pred = model(data).detach().cpu().numpy()

            label = data.y.detach().cpu().numpy()
            predictions.append(pred)
            labels.append(label)

    predictions = np.hstack(predictions)
    labels = np.hstack(labels)
    
    return roc_auc_score(labels, predictions)
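
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following sketch computes it by brute-force pair counting. It is meant only to illustrate the metric, not to replace `roc_auc_score`; the function name `roc_auc` and the toy inputs are assumptions.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because only the ordering of scores matters, a value of 0.5 corresponds to random ranking regardless of how the probabilities are calibrated.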

In [23]:
# Train the model and report ROC AUC on each split after every epoch
NUM_EPOCHS =  15 #@param { type: "integer" }
for epoch in tqdm(range(NUM_EPOCHS)):
    loss = train()
    train_auc = evaluate(train_loader)
    val_auc = evaluate(val_loader)
    test_auc = evaluate(test_loader)
    print('Epoch: {:03d}, Loss: {:.5f}, Train Auc: {:.5f}, Val Auc: {:.5f}, Test Auc: {:.5f}'.
          format(epoch, loss, train_auc, val_auc, test_auc))

Epoch: 000, Loss: 0.67669, Train Auc: 0.52441, Val Auc: 0.53001, Test Auc: 0.52498
Epoch: 001, Loss: 0.48399, Train Auc: 0.56859, Val Auc: 0.55746, Test Auc: 0.55136
Epoch: 002, Loss: 0.40272, Train Auc: 0.60059, Val Auc: 0.56321, Test Auc: 0.56091
Epoch: 003, Loss: 0.36485, Train Auc: 0.62781, Val Auc: 0.57539, Test Auc: 0.57249
Epoch: 004, Loss: 0.34037, Train Auc: 0.65900, Val Auc: 0.58490, Test Auc: 0.59073
Epoch: 005, Loss: 0.32349, Train Auc: 0.68062, Val Auc: 0.58279, Test Auc: 0.58951
Epoch: 006, Loss: 0.31069, Train Auc: 0.71464, Val Auc: 0.59796, Test Auc: 0.59507
Epoch: 007, Loss: 0.29609, Train Auc: 0.73714, Val Auc: 0.60001, Test Auc: 0.60194
Epoch: 008, Loss: 0.28586, Train Auc: 0.77141, Val Auc: 0.60433, Test Auc: 0.61670
Epoch: 009, Loss: 0.27251, Train Auc: 0.78842, Val Auc: 0.61848, Test Auc: 0.62342
Epoch: 010, Loss: 0.26538, Train Auc: 0.82290, Val Auc: 0.61437, Test Auc: 0.61963
Epoch: 011, Loss: 0.24979, Train Auc: 0.85696, Val Auc: 0.61783, Test Auc: 0.62906
Epoch: 012, Loss: 0.23585, Train Auc: 0.87090, Val Auc: 0.61519, Test Auc: 0.63079
Epoch: 013, Loss: 0.22456, Train Auc: 0.89360, Val Auc: 0.61035, Test Auc: 0.62640
Epoch: 014, Loss: 0.20975, Train Auc: 0.91882, Val Auc: 0.61920, Test Auc: 0.63559





### Checking the result on examples

In [24]:
# Approach 1: evaluate on a slice of the test dataset
evaluate(DataLoader(test_dataset[40:60], batch_size=10))



0.5656565656565656

In [25]:
# Approach 2: build click sessions by hand and run the model on them
test_df = pd.DataFrame([
      [-1, 15219, 0],
      [-1, 15431, 0],
      [-1, 14371, 0],
      [-1, 15745, 0],
      [-2, 14594, 0],
      [-2, 16972, 11],
      [-2, 16943, 0],
      [-3, 17284, 0]
], columns=['session_id', 'item_id', 'category'])

test_data = transform_dataset(test_df, buy_item_dict)
test_data = DataLoader(test_data, batch_size=1)

with torch.no_grad():
    model.eval()
    for data in test_data:
        data = data.to(device)
        pred = model(data).detach().cpu().numpy()

        print(data, pred)

100%|██████████| 3/3 [00:00<00:00, 199.34it/s]

DataBatch(x=[1, 1, 2], edge_index=[2, 0], y=[1], batch=[1], ptr=[2]) [0.00087506]
DataBatch(x=[3, 1, 2], edge_index=[2, 2], y=[3], batch=[3], ptr=[2]) [0.000564   0.01551606 0.07247568]
DataBatch(x=[4, 1, 2], edge_index=[2, 3], y=[4], batch=[4], ptr=[2]) [2.5984054e-05 1.0836762e-06 1.7492797e-06 3.5187886e-06]
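
The prediction arrays above contain one probability per node, and nodes are ordered by the session-local index, which follows the sorted item ids of the session. Under that assumption, the sketch below pairs the probabilities printed for the three-item session with its item ids and ranks them. The helper `rank_items` is hypothetical and expects one probability per distinct item.

```python
def rank_items(item_ids, probabilities):
    """Pair probabilities with sorted unique item ids, most likely purchase first."""
    ordered_ids = sorted(set(item_ids))
    pairs = zip(ordered_ids, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Items of the hand-made session -2 and the probabilities printed for it above
ranking = rank_items([14594, 16972, 16943], [0.000564, 0.01551606, 0.07247568])
print(ranking)
```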



