## **Laboratory Work No. 6**

**Completed by: Курьнояв А.И.**


**Group: ИУ5-21М**

---

*Goal*: learn to work with graph-structured data and graph neural networks.

*Task*: build a graph dataset from a purchase database and train a model that predicts whether a purchase will be made.

---

## Graph neural networks

**Graph neural networks** (GNNs) are a type of neural network that operates directly on the structure of a graph. Typical applications of GNNs include:
- Node classification;
- Link prediction;
- Graph classification;
- Motion recognition;
- Recommender systems.

This lab works with **graph convolutional networks**. They differ from ordinary convolutional neural networks in that the input structure is not fixed; the convolution function is not ...

Further reading: https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b

An overview of current approaches to using graph convolutional networks:
https://paperswithcode.com/method/gcn
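
To make the idea of a convolution over a non-fixed structure more concrete, here is a minimal sketch of one graph-convolution step in the common normalized-adjacency formulation, ReLU(D^-1/2 (A + I) D^-1/2 X W). It is written in plain NumPy on a toy three-node graph; the function name `gcn_layer` and the toy data are assumptions made for illustration. The model later in this notebook uses PyG's `GraphConv`, which is a different operator, so treat this only as an intuition aid.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1)                  # node degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization of the adjacency matrix
    norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)

# Toy undirected path graph 0 - 1 - 2 with two features per node (hypothetical data)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
weight = np.eye(2)  # identity weights: the output is the normalized sum over each neighbourhood

hidden = gcn_layer(adj, features, weight)
print(hidden.shape)  # (3, 2)
```

The same function works for a graph with any number of nodes and any edge pattern, which is the practical meaning of "non-fixed structure": only the adjacency matrix changes, not the layer.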

---

## Dataset
As the data source, we suggest using RecSys Challenge 2015, a dataset of user purchases in a single store (https://www.kaggle.com/datasets/chadgostopp/recsys-challenge-2015).

The dataset can be downloaded here: https://drive.google.com/drive/folders/1gtAeXPTj-c0RwVOKreMrZ3bfSmCwl2y-?usp=sharing
(the lite version is a reduced version of the original dataset; we recommend using it)

We also recommend uploading the data as an archive and unpacking it with the zipfile package, and/or copying the dataset to your own Google Drive and mounting it in Colab.

---

### Installing libraries and loading the source datasets

In [None]:
# Slow method of installing pytorch geometric
# !pip install torch_geometric
# !pip install torch_sparse
# !pip install torch_scatter

# Install pytorch geometric
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-geometric -f https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
!pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.11.0%2Bcu113.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-1.11.0%2Bcu113/torch_sparse-0.6.13-cp37-cp37m-linux_x86_64.whl (3.5 MB)
     |████████████████████████████████| 3.5 MB 6.5 MB/s 
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.13
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://pytorch-geometric.com/whl/torch-1.11.0%2Bcu113.html
Collecting torch-cluster
  Downloading https://data.pyg.org/whl/torch-1.11.0%2Bcu113/torch_cluster-1.6.0-cp37-cp37m-linux_x86_64.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 9.1 MB/s 
Installing collected packages: torch-cluster
Successfully installed torch-cluster-1.6.0
Looking in indexes: https://pypi.org/simple, https://us-python

In [None]:
import numpy as np
import pandas as pd
import pickle
import csv
import os

from sklearn.preprocessing import LabelEncoder

import torch

# PyG - PyTorch Geometric
from torch_geometric.data import Data, InMemoryDataset
# In PyG 2.x the DataLoader lives in torch_geometric.loader
from torch_geometric.loader import DataLoader

from tqdm import tqdm


RANDOM_SEED =  7#@param { type: "integer" }
BASE_DIR = '/content/' #@param { type: "string" }
np.random.seed(RANDOM_SEED) 

In [None]:
# Check if CUDA is available for colab (call the function to get True/False)
torch.cuda.is_available()

In [None]:
# Unpack files from zip-file
import zipfile
with zipfile.ZipFile(BASE_DIR + 'yoochoose-data-lite.zip', 'r') as zip_ref:
    zip_ref.extractall(BASE_DIR)

### Exploring the source data

In [None]:
# Read the click dataset (one row per item click within a session)
df = pd.read_csv(BASE_DIR + 'yoochoose-clicks-lite.dat')
df.columns = ['session_id', 'timestamp', 'item_id', 'category'] 
df.head()



Unnamed: 0,session_id,timestamp,item_id,category
0,9,2014-04-06T11:26:24.127Z,214576500,0
1,9,2014-04-06T11:28:54.654Z,214576500,0
2,9,2014-04-06T11:29:13.479Z,214576500,0
3,19,2014-04-01T20:52:12.357Z,214561790,0
4,19,2014-04-01T20:52:13.758Z,214561790,0


In [None]:
# Read dataset of purchases
buy_df = pd.read_csv(BASE_DIR + 'yoochoose-buys-lite.dat')
buy_df.columns = ['session_id', 'timestamp', 'item_id', 'price', 'quantity']
buy_df.head()

Unnamed: 0,session_id,timestamp,item_id,price,quantity
0,420374,2014-04-06T18:44:58.314Z,214537888,12462,1
1,420374,2014-04-06T18:44:58.325Z,214537850,10471,1
2,489758,2014-04-06T09:59:52.422Z,214826955,1360,2
3,489758,2014-04-06T09:59:52.476Z,214826715,732,2
4,489758,2014-04-06T09:59:52.578Z,214827026,1046,1


In [None]:
# Keep only sessions with more than 2 clicks (drop sessions of length <= 2)
df['valid_session'] = df.session_id.map(df.groupby('session_id')['item_id'].size() > 2)
df = df.loc[df.valid_session].drop('valid_session',axis=1)
df.nunique()

session_id    1000000
timestamp     5557758
item_id         37644
category          275
dtype: int64
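
As a quick check of what the session filter above does, here is a small sketch on a made-up click table. The data and the variable names (`clicks`, `session_len`, `filtered`) are assumptions for illustration; it uses `groupby(...).transform("size")` instead of the `map` call in the cell, which should give the same row mask.

```python
import pandas as pd

# Hypothetical click log: session 1 has 3 clicks, session 2 has 2, session 3 has 4
clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "item_id":    [10, 11, 10, 12, 13, 14, 15, 16, 14],
})

# Number of clicks in the session that each row belongs to
session_len = clicks.groupby("session_id")["item_id"].transform("size")

# Keep rows from sessions with more than 2 clicks
filtered = clicks[session_len > 2].copy()
kept_sessions = sorted(filtered.session_id.unique().tolist())
print(kept_sessions)  # [1, 3]
```

Taking an explicit `.copy()` of the filtered frame is one way to avoid pandas' "value is trying to be set on a copy of a slice" warning when columns are overwritten later.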

In [None]:
# Randomly sample a couple of them
NUM_SESSIONS =  10000#@param { type: "integer" }
sampled_session_id = np.random.choice(df.session_id.unique(), NUM_SESSIONS, replace=False)
df = df.loc[df.session_id.isin(sampled_session_id)]
df.nunique()

session_id    10000
timestamp     56127
item_id       10099
category         40
dtype: int64

In [None]:
# Average length of session
df.groupby('session_id')['item_id'].size().mean()

5.6128

In [None]:
# Encode item and category id in item dataset so that ids will be in range (0,len(df.item.unique()))
item_encoder = LabelEncoder()
category_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
df['category']= category_encoder.fit_transform(df.category.apply(str))
df.head()



Unnamed: 0,session_id,timestamp,item_id,category
1147,1986,2014-04-02T15:57:08.961Z,6768,0
1148,1986,2014-04-02T15:59:05.847Z,6768,0
1149,1986,2014-04-02T15:59:27.500Z,6679,0
1216,2182,2014-04-01T19:13:13.130Z,2216,0
1217,2182,2014-04-01T19:17:35.106Z,6935,0


In [None]:
# Keep purchases from the sampled sessions and encode their item ids
buy_df = buy_df.loc[buy_df.session_id.isin(df.session_id)]
buy_df['item_id'] = item_encoder.transform(buy_df.item_id)
buy_df.head()



Unnamed: 0,session_id,timestamp,item_id,price,quantity
34,141007,2014-04-01T17:36:22.260Z,6638,941,1
35,141007,2014-04-01T17:36:22.277Z,1840,523,1
61,70353,2014-04-06T10:55:06.086Z,7649,41783,1
201,421868,2014-04-07T07:26:16.286Z,6684,627,1
202,421868,2014-04-07T07:26:16.291Z,6685,732,1


In [None]:
# Get item dictionary with grouping by session
buy_item_dict = dict(buy_df.groupby('session_id')['item_id'].apply(list))
buy_item_dict

{7189: [7874],
 16131: [5293],
 41692: [5900, 7127],
 45803: [4065, 799, 5212, 1663],
 56414: [5920],
 61026: [6636, 2225, 6350],
 66709: [5310],
 70353: [7649],
 74281: [528, 7971],
 84542: [5534],
 115778: [6683, 6685, 6736, 6740, 6684, 6622],
 128039: [6606, 6607],
 130063: [6514, 2723, 1430, 6720, 6695, 6596, 6505, 7408],
 141007: [6638, 1840],
 154074: [8153],
 171421: [5016, 5404],
 176047: [7873],
 182097: [2530],
 188022: [8090],
 202037: [7127, 5890, 5867],
 212159: [2618, 5519, 5596],
 238478: [4290],
 245178: [5264],
 251513: [7108, 7107],
 253064: [4582],
 253149: [5885, 5885, 7221],
 255777: [6600, 6780, 6689, 5266],
 257602: [409],
 277862: [2194],
 293011: [6623, 6607],
 295229: [4315, 4142, 4121, 3443, 6546, 6184, 6548, 6739, 4135],
 313988: [8360, 8397, 8350, 6177],
 315852: [6511],
 327128: [5212, 8041, 8038, 8040, 8039, 8042],
 336891: [6548, 5587],
 342258: [5870, 5870],
 345964: [6721, 6720, 6599, 6721, 6599, 6720],
 353196: [3736, 6689, 6720, 6688, 6680],
 355006:

### Building the training set

In [None]:
# Transform df into tensor data
def transform_dataset(df, buy_item_dict):
    data_list = []

    # Group by session
    grouped = df.groupby('session_id')
    for session_id, group in tqdm(grouped):    
        le = LabelEncoder()
        sess_item_id = le.fit_transform(group.item_id)
        group = group.reset_index(drop=True)
        group['sess_item_id'] = sess_item_id

        #get input features
        node_features = group.loc[group.session_id==session_id,
                                    ['sess_item_id','item_id','category']].sort_values('sess_item_id')[['item_id','category']].drop_duplicates().values
        node_features = torch.LongTensor(node_features).unsqueeze(1)
        target_nodes = group.sess_item_id.values[1:]
        source_nodes = group.sess_item_id.values[:-1]

        edge_index = torch.tensor([source_nodes,
                                target_nodes], dtype=torch.long)
        x = node_features

        #get result
        if session_id in buy_item_dict:
            positive_indices = le.transform(buy_item_dict[session_id])
            label = np.zeros(len(node_features))
            label[positive_indices] = 1
        else:
            label = [0] * len(node_features)

        y = torch.FloatTensor(label)

        data = Data(x=x, edge_index=edge_index, y=y)

        data_list.append(data)
    
    return data_list

# PyG in-memory dataset that builds and caches one graph per session
class YooChooseDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []

    @property
    def processed_file_names(self):
        return [BASE_DIR+'yoochoose_click_binary_100000_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        data_list = transform_dataset(df, buy_item_dict)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
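
To see what `transform_dataset` produces for a single session, the following sketch rebuilds the node numbering, the edge list and the labels in plain Python. It uses the click sequence of session 1986 from the table above (encoded items 6768, 6768, 6679); the helper name `session_to_graph` and the purchased set passed to it are assumptions for illustration only.

```python
def session_to_graph(item_ids, bought_items):
    """Turn an ordered click sequence into local node ids, edges and node labels."""
    # Local node id = rank of the item among the session's distinct items
    unique_items = sorted(set(item_ids))
    local_id = {item: idx for idx, item in enumerate(unique_items)}
    nodes = [local_id[item] for item in item_ids]

    # One directed edge per consecutive pair of clicks
    source_nodes = nodes[:-1]
    target_nodes = nodes[1:]

    # A node is labelled 1.0 if its item was bought in this session
    labels = [1.0 if item in bought_items else 0.0 for item in unique_items]
    return local_id, [source_nodes, target_nodes], labels

# Clicks of session 1986; the purchased set {6679} is hypothetical
local_id, edge_index, labels = session_to_graph([6768, 6768, 6679], {6679})
print(local_id)    # {6679: 0, 6768: 1}
print(edge_index)  # [[1, 1], [1, 0]]
print(labels)      # [1.0, 0.0]
```

Note that clicking the same item twice in a row yields a self-loop (the edge 1 -> 1 here), and that every distinct item in a session becomes exactly one node.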

In [None]:
# Prepare dataset
dataset = YooChooseDataset('./')

Processing...
100%|██████████| 10000/10000 [00:34<00:00, 291.49it/s]
Done!


### Splitting the dataset

In [None]:
# train_test_split
dataset = dataset.shuffle()
one_tenth_length = int(len(dataset) * 0.1)
train_dataset = dataset[:one_tenth_length * 8]
val_dataset = dataset[one_tenth_length*8:one_tenth_length * 9]
test_dataset = dataset[one_tenth_length*9:]
len(train_dataset), len(val_dataset), len(test_dataset)

(8000, 1000, 1000)
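
The split above is an 80/10/10 slice of a shuffled dataset. The same arithmetic can be sketched on plain index lists; `split_indices` and its `seed` argument are hypothetical names, and the shuffle here uses Python's `random` module rather than PyG's `dataset.shuffle()`.

```python
import random

def split_indices(n, seed=7):
    """Shuffle indices 0..n-1 and cut them into 80% / 10% / remainder."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    tenth = int(n * 0.1)
    train = indices[:tenth * 8]
    val = indices[tenth * 8:tenth * 9]
    test = indices[tenth * 9:]  # receives any remainder when n is not divisible by 10
    return train, val, test

train_idx, val_idx, test_idx = split_indices(10000)
print(len(train_idx), len(val_idx), len(test_idx))  # 8000 1000 1000
```

Because the three parts are slices of one shuffled list, no session can end up in more than one of them.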

In [None]:
# Load dataset into PyG loaders 
batch_size= 512
train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)



In [None]:
# Number of distinct items and categories (sizes of the embedding tables)
num_items = df.item_id.max() +1
num_categories = df.category.max()+1
num_items , num_categories

(10099, 39)

### Configuring the model for training

In [None]:
embed_dim = 128
from torch_geometric.nn import GraphConv, TopKPooling, GatedGraphConv, SAGEConv, SGConv
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Model Structure
        self.conv1 = GraphConv(embed_dim * 2, 128)
        self.pool1 = TopKPooling(128, ratio=0.9)
        self.conv2 = GraphConv(128, 128)
        self.pool2 = TopKPooling(128, ratio=0.9)
        self.conv3 = GraphConv(128, 128)
        self.pool3 = TopKPooling(128, ratio=0.9)
        self.item_embedding = torch.nn.Embedding(num_embeddings=num_items, embedding_dim=embed_dim)
        self.category_embedding = torch.nn.Embedding(num_embeddings=num_categories, embedding_dim=embed_dim)        
        self.lin1 = torch.nn.Linear(256, 256)
        self.lin2 = torch.nn.Linear(256, 128)
        self.bn1 = torch.nn.BatchNorm1d(128)
        self.bn2 = torch.nn.BatchNorm1d(64)
        self.act1 = torch.nn.ReLU()
        self.act2 = torch.nn.ReLU()        
  
    # Forward step of a model
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        
        item_id = x[:,:,0]
        category = x[:,:,1]
        

        emb_item = self.item_embedding(item_id).squeeze(1)
        emb_category = self.category_embedding(category).squeeze(1)
        
        x = torch.cat([emb_item, emb_category], dim=1)  
        x = F.relu(self.conv1(x, edge_index))
        # TopKPooling returns (x, edge_index, edge_attr, batch, perm, score)
        x, edge_index, _, batch, _, _ = self.pool1(x, edge_index, None, batch)
        x1 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv2(x, edge_index))
     
        x, edge_index, _, batch, _, _ = self.pool2(x, edge_index, None, batch)
        x2 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv3(x, edge_index))

        x, edge_index, _, batch, _, _ = self.pool3(x, edge_index, None, batch)
        x3 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = x1 + x2 + x3

        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.act2(x)      
        
        outputs = []
        for i in range(x.size(0)):
            output = torch.matmul(emb_item[data.batch == i], x[i,:])

            outputs.append(output)
              
        x = torch.cat(outputs, dim=0)
        x = torch.sigmoid(x)
        
        return x
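
The last part of `forward` scores every node by the dot product between its item embedding and the vector of the session it belongs to, then applies a sigmoid. The sketch below reproduces that step in NumPy on random toy data and compares the per-graph loop with a vectorized form. The array names and sizes are assumptions, and the comparison relies on nodes being ordered by graph index, as they are in a PyG batch.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_item = rng.normal(size=(5, 4))     # 5 nodes in the batch, embedding size 4
session_vec = rng.normal(size=(2, 4))  # one vector per graph (2 sessions)
batch = np.array([0, 0, 0, 1, 1])      # graph index of every node, sorted

# Loop form, mirroring the forward pass: one matrix-vector product per session
loop_scores = np.concatenate(
    [emb_item[batch == i] @ session_vec[i] for i in range(session_vec.shape[0])]
)

# Vectorized form: pair each node with its own session vector, then row-wise dot product
vec_scores = (emb_item * session_vec[batch]).sum(axis=1)

# Sigmoid turns the scores into purchase probabilities, one per node
probs = 1.0 / (1.0 + np.exp(-vec_scores))
print(np.allclose(loop_scores, vec_scores))  # True
```

In the model this only works because `lin2` outputs 128 values, the same size as `embed_dim`, so the session vector and the item embeddings live in the same space.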

### Training the convolutional neural network

In [None]:
# Use CUDA when it is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
# Choose optimizer and criterion for learning
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
crit = torch.nn.BCELoss()
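
`BCELoss` expects probabilities (hence the sigmoid at the end of `forward`) and averages the binary cross-entropy over all nodes in the batch. A minimal pure-Python sketch of that quantity is shown below; the function name and the `eps` clipping constant are assumptions for illustration, and PyTorch's own numerical safeguards differ in detail.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all predictions."""
    losses = []
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

# Three hypothetical node predictions: one bought item, two items not bought
loss = binary_cross_entropy([0.9, 0.2, 0.5], [1.0, 0.0, 0.0])
print(round(loss, 4))  # 0.3406
```

A model that outputs 0.5 for every node gets a loss of ln 2, about 0.693, which is a useful reference point when reading the first epochs of a training log.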

In [None]:
# Train function
def train():
    model.train()

    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)

        label = data.y.to(device)
        loss = crit(output, label)
        loss.backward()
        loss_all += data.num_graphs * loss.item()
        optimizer.step()
    return loss_all / len(train_dataset)

In [None]:
# Evaluate result of a model
from sklearn.metrics import roc_auc_score
def evaluate(loader):
    model.eval()

    predictions = []
    labels = []

    with torch.no_grad():
        for data in loader:

            data = data.to(device)
            pred = model(data).detach().cpu().numpy()

            label = data.y.detach().cpu().numpy()
            predictions.append(pred)
            labels.append(label)

    predictions = np.hstack(predictions)
    labels = np.hstack(labels)
    
    return roc_auc_score(labels, predictions)
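
ROC AUC can be read as the probability that a randomly chosen positive node receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pair counting. It is quadratic in the number of samples and meant only to illustrate the metric, not to replace `roc_auc_score`; the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """ROC AUC by counting correctly ordered (positive, negative) pairs."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking, so AUC values close to 0.5 mean the scores barely separate bought items from the rest.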

In [None]:
# Train a model and report ROC AUC on every split after each epoch
NUM_EPOCHS =   30#@param { type: "integer" }
for epoch in tqdm(range(NUM_EPOCHS)):
    loss = train()
    train_auc = evaluate(train_loader)
    val_auc = evaluate(val_loader)    
    test_auc = evaluate(test_loader)
    print('Epoch: {:03d}, Loss: {:.5f}, Train Auc: {:.5f}, Val Auc: {:.5f}, Test Auc: {:.5f}'.
          format(epoch, loss, train_auc, val_auc, test_auc))

Epoch: 000, Loss: 0.77482, Train Auc: 0.51656, Val Auc: 0.48422, Test Auc: 0.49989
Epoch: 001, Loss: 0.71011, Train Auc: 0.51969, Val Auc: 0.49152, Test Auc: 0.48564
Epoch: 002, Loss: 0.68600, Train Auc: 0.53608, Val Auc: 0.52194, Test Auc: 0.45473
Epoch: 003, Loss: 0.64709, Train Auc: 0.54575, Val Auc: 0.51009, Test Auc: 0.48483
Epoch: 004, Loss: 0.60906, Train Auc: 0.57162, Val Auc: 0.53880, Test Auc: 0.51775
Epoch: 005, Loss: 0.57585, Train Auc: 0.57618, Val Auc: 0.55123, Test Auc: 0.50275
Epoch: 006, Loss: 0.54121, Train Auc: 0.60232, Val Auc: 0.55214, Test Auc: 0.48655
Epoch: 007, Loss: 0.51366, Train Auc: 0.63360, Val Auc: 0.57024, Test Auc: 0.49352
Epoch: 008, Loss: 0.49416, Train Auc: 0.64595, Val Auc: 0.55466, Test Auc: 0.49792
Epoch: 009, Loss: 0.47487, Train Auc: 0.66688, Val Auc: 0.55910, Test Auc: 0.49562
Epoch: 010, Loss: 0.45819, Train Auc: 0.69108, Val Auc: 0.57592, Test Auc: 0.50079
Epoch: 011, Loss: 0.45138, Train Auc: 0.70885, Val Auc: 0.57731, Test Auc: 0.51225
Epoch: 012, Loss: 0.43603, Train Auc: 0.72012, Val Auc: 0.56193, Test Auc: 0.50717
Epoch: 013, Loss: 0.42780, Train Auc: 0.74507, Val Auc: 0.56002, Test Auc: 0.51982
Epoch: 014, Loss: 0.41499, Train Auc: 0.76193, Val Auc: 0.56542, Test Auc: 0.51138
Epoch: 015, Loss: 0.40590, Train Auc: 0.78428, Val Auc: 0.56746, Test Auc: 0.52130
Epoch: 016, Loss: 0.38678, Train Auc: 0.79328, Val Auc: 0.56112, Test Auc: 0.51715
Epoch: 017, Loss: 0.38136, Train Auc: 0.81786, Val Auc: 0.56870, Test Auc: 0.51299
Epoch: 018, Loss: 0.36825, Train Auc: 0.83982, Val Auc: 0.56547, Test Auc: 0.51840
Epoch: 019, Loss: 0.35655, Train Auc: 0.85414, Val Auc: 0.57431, Test Auc: 0.51164
Epoch: 020, Loss: 0.34592, Train Auc: 0.87007, Val Auc: 0.58486, Test Auc: 0.51821
Epoch: 021, Loss: 0.34278, Train Auc: 0.88075, Val Auc: 0.58232, Test Auc: 0.51728
Epoch: 022, Loss: 0.32791, Train Auc: 0.88459, Val Auc: 0.57686, Test Auc: 0.51932
Epoch: 023, Loss: 0.32186, Train Auc: 0.90880, Val Auc: 0.58117, Test Auc: 0.51875
Epoch: 024, Loss: 0.31486, Train Auc: 0.91513, Val Auc: 0.58093, Test Auc: 0.51364
Epoch: 025, Loss: 0.30628, Train Auc: 0.92179, Val Auc: 0.57345, Test Auc: 0.53243
Epoch: 026, Loss: 0.29576, Train Auc: 0.91953, Val Auc: 0.58989, Test Auc: 0.52054
Epoch: 027, Loss: 0.29564, Train Auc: 0.92852, Val Auc: 0.57635, Test Auc: 0.51246
Epoch: 028, Loss: 0.28643, Train Auc: 0.92933, Val Auc: 0.58100, Test Auc: 0.50909
Epoch: 029, Loss: 0.28846, Train Auc: 0.93542, Val Auc: 0.57662, Test Auc: 0.50635
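
In this log the train AUC rises to about 0.93 while the validation AUC stays near 0.58 and the test AUC near 0.51, so the final epoch is not necessarily the one to report. One common option is to pick the epoch with the best validation AUC and read the test AUC at that epoch. The sketch below does this for the last six epochs of the log; the `history` list is copied from the output above, and the selection rule is an assumption rather than part of the original notebook.

```python
# (epoch, val_auc, test_auc) for epochs 24-29, copied from the training log
history = [
    (24, 0.58093, 0.51364),
    (25, 0.57345, 0.53243),
    (26, 0.58989, 0.52054),
    (27, 0.57635, 0.51246),
    (28, 0.58100, 0.50909),
    (29, 0.57662, 0.50635),
]

# Select the epoch by validation AUC only; the test AUC is just read off afterwards
best_epoch, best_val_auc, test_auc_at_best = max(history, key=lambda row: row[1])
print(best_epoch, best_val_auc, test_auc_at_best)  # 26 0.58989 0.52054
```

Choosing by validation score keeps the test set out of model selection, so the reported test AUC remains an honest estimate.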



