# Financial asset recommender - Collaborative filtering with LightGCN

## Introduction

This notebook introduces a recommender that operates on a proprietary dataset of user-financial asset transactions and recommends assets similar to those a user has already invested in. It uses Beta-RecSys, a library developed by the University of Glasgow for training and evaluating deep learning recommendation models. The library is available on GitHub at: https://github.com/beta-team/beta-recsys

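To get started, we install the Beta-RecSys framework with pip; alternative installation methods are described in the GitHub repository. Note that the package pins older versions of some dependencies (the output below shows pandas and nest-asyncio being downgraded, which conflicts with the installed Jupyter packages), so a dedicated virtual environment is advisable.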


In [2]:
!pip install beta-rec

Collecting beta-rec
  Using cached beta_rec-0.3.2-py3-none-any.whl (178 kB)
Collecting gputil==1.4.0
  Using cached GPUtil-1.4.0-py3-none-any.whl
Collecting mock==4.0.1
  Using cached mock-4.0.1-py3-none-any.whl (28 kB)
Collecting munch==2.5.0
  Using cached munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting pandas==1.0.3
  Using cached pandas-1.0.3-cp37-cp37m-win_amd64.whl (8.7 MB)
Collecting aiofiles~=0.4.0
  Using cached aiofiles-0.4.0-py3-none-any.whl (9.2 kB)
Collecting nest-asyncio~=1.3.3
  Using cached nest_asyncio-1.3.3-py3-none-any.whl (4.7 kB)
Installing collected packages: pandas, nest-asyncio, munch, mock, gputil, aiofiles, beta-rec
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.4
    Uninstalling pandas-1.3.4:
      Successfully uninstalled pandas-1.3.4
  Attempting uninstall: nest-asyncio
    Found existing installation: nest-asyncio 1.5.1
    Uninstalling nest-asyncio-1.5.1:
      Successfully uninstalled nest-asyncio-1.5.1
Successfully insta

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
notebook 6.4.6 requires nest-asyncio>=1.5, but you have nest-asyncio 1.3.3 which is incompatible.
jupyter-client 7.1.0 requires nest-asyncio>=1.5, but you have nest-asyncio 1.3.3 which is incompatible.




## Dataset setup

We use a proprietary dataset of investment transactions provided by the National Bank of Greece. It can be downloaded through the Infinitech Marketplace.

Beta-RecSys supports a range of well-known interaction datasets used for recommendation models. Because this dataset is proprietary and not publicly distributed, it is not among them, so we load its interactions ourselves. We first clean the raw data and then wrap it in a dataset object whose interactions Beta-RecSys can process.

In [1]:
import sys, os, beta_rec
import pandas as pd
import numpy as np
from functools import partial

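# Recent NumPy versions default to allow_pickle=False; wrap np.load so that
# pickled object arrays can be loaded without passing the flag on every call.
np_load_old = np.load
np.load = partial(np_load_old, allow_pickle=True)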

In [2]:
from beta_rec.data import BaseData
from beta_rec.datasets.dataset_base import DatasetBase
from beta_rec.utils.constants import (
    DEFAULT_ITEM_COL,
    DEFAULT_RATING_COL,
    DEFAULT_TIMESTAMP_COL,
    DEFAULT_USER_COL,
)

In [3]:
data = pd.read_csv(
            'C:\\Users\\Javier\\PycharmProjects\\pythonProject\\dataset\\nbg_instruments_large_v3.csv',
            skiprows=[0],
            engine="python",
            names=[
                DEFAULT_USER_COL,
                DEFAULT_ITEM_COL,
                DEFAULT_RATING_COL,
                DEFAULT_TIMESTAMP_COL,
            ],
        )
data

Unnamed: 0,col_user,col_item,col_rating,col_timestamp
0,-1000399803742468017,12541,1.0,2018-05-09
1,-1000476326435445839,11907,1.0,2018-02-01
2,-1000476326435445839,13233,1.0,2018-10-12
3,-1000476326435445839,11907,1.0,2018-10-12
4,-1000570356359468369,12917,1.0,2018-07-18
...,...,...,...,...
312999,998868334061509693,89,1.0,2021-05-07
313000,998868334061509693,268,1.0,2021-05-07
313001,998987671268040482,12885,1.0,2021-02-09
313002,998987671268040482,12885,1.0,2021-02-15


In [None]:
# Map each raw user ID to a consecutive integer, in order of first appearance.
mapping = dict()
for uid in data[DEFAULT_USER_COL]:
    if uid not in mapping:
        mapping[uid] = len(mapping)

In [None]:
# Save the raw ID to serialized ID mapping; rows use the same comma delimiter as the header.
with open('C:\\Users\\Javier\\PycharmProjects\\pythonProject\\dataset\\map.csv', 'w') as f:
    f.write("ID,SERIALIZED_ID")
    for key, value in mapping.items():
        f.write("\n" + str(key) + "," + str(value))

In [None]:
# Write the interactions back out with each user replaced by its serialized ID.
with open('C:\\Users\\Javier\\PycharmProjects\\pythonProject\\dataset\\cleaned.csv', 'w') as f:
    f.write(",".join([DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL]))
    for index, row in data.iterrows():
        uid = mapping[row[DEFAULT_USER_COL]]
        f.write("\n" + ",".join(str(v) for v in (uid, row[DEFAULT_ITEM_COL], row[DEFAULT_RATING_COL], row[DEFAULT_TIMESTAMP_COL])))
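
The ID serialization performed in the cells above can also be expressed with pandas alone. The following is a minimal sketch on a toy frame, not the code used to produce the files in this notebook; the toy rows and the names `raw`, `cleaned`, and `id_map` are illustrative assumptions. It relies on `pd.factorize`, which numbers values in order of first appearance, the same order the `mapping` dictionary uses.

```python
import pandas as pd

# Toy stand-in for the raw transactions; column names follow the notebook output.
raw = pd.DataFrame(
    {
        "col_user": [
            -1000399803742468017,
            -1000476326435445839,
            -1000476326435445839,
            -1000570356359468369,
        ],
        "col_item": [12541, 11907, 13233, 12917],
        "col_rating": [1.0, 1.0, 1.0, 1.0],
        "col_timestamp": ["2018-05-09", "2018-02-01", "2018-10-12", "2018-07-18"],
    }
)

# factorize assigns the codes 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(raw["col_user"])

# Interactions with serialized user IDs (the role played by cleaned.csv).
cleaned = raw.copy()
cleaned["col_user"] = codes

# Lookup table from raw ID to serialized ID (the role played by map.csv).
id_map = pd.DataFrame({"ID": uniques, "SERIALIZED_ID": range(len(uniques))})

print(cleaned)
print(id_map)
```

Both frames can then be written with `DataFrame.to_csv(path, index=False)`, which avoids looping over rows on the full dataset.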

In [3]:
class NBGSampleInstTrans(DatasetBase):
    """NBGSample Instruments Transactions Dataset."""

    def __init__(self, dataset_name="nbg_marketplace_cf_1", min_u_c=0, min_i_c=0, root_dir=None):
        """Init NBGSampleInstTrans Class."""
        super().__init__(
            dataset_name=dataset_name,
            min_u_c=min_u_c,
            min_i_c=min_i_c,
            root_dir=root_dir,
        )

    def preprocess(self):
        """Preprocess the raw file.

        Preprocess the file, convert it to a dataframe consisting of the user-item
        interactions and save it in the processed directory.
        """
        file_name = 'C:\\Users\\Javier\\PycharmProjects\\pythonProject\\dataset\\cleaned.csv'

        data = pd.read_csv(
            file_name,
            skiprows=[0],
            engine="python",
            names=[
                DEFAULT_USER_COL,
                DEFAULT_ITEM_COL,
                DEFAULT_RATING_COL,
                DEFAULT_TIMESTAMP_COL,
            ],
        )
        data[DEFAULT_TIMESTAMP_COL] = pd.to_datetime(data[DEFAULT_TIMESTAMP_COL], format='%Y-%m-%d')

        # Store the interactions in the processed directory and return the path with the data.
        processed_file = os.path.join(self.processed_path, f"{self.dataset_name}_interaction.npz")
        self.save_dataframe_as_npz(data, processed_file)

        return processed_file, data

nbgdata = NBGSampleInstTrans()
result, data = nbgdata.preprocess()

In [6]:
result

'C:\\Users\\Javier\\AppData\\Roaming\\Python\\Python38\\site-packages\\datasets\\nbg_marketplace_cf_1\\processed\\nbg_marketplace_cf_1_interaction.npz'

In [7]:
data

Unnamed: 0,col_user,col_item,col_rating,col_timestamp
0,0,12541,1.0,2018-05-09
1,1,11907,1.0,2018-02-01
2,1,13233,1.0,2018-10-12
3,1,11907,1.0,2018-10-12
4,2,12917,1.0,2018-07-18
...,...,...,...,...
312999,32904,89,1.0,2021-05-07
313000,32904,268,1.0,2021-05-07
313001,32905,12885,1.0,2021-02-09
313002,32905,12885,1.0,2021-02-15


## Model configuration

Recommender models such as this one expose many adjustable parameters, covering where intermediate and output files are saved as well as settings that affect performance. We collect these parameters in a configuration file that is passed to the model, which makes it easy to reuse a setup for further models and to change parameters in one place.

In [4]:
config_file = {
    "config_file":"C:\\Users\\Javier\\Documents\\glasgow\\Infinitech-FAR-CollaborativeFiltering\\configs\\lightgcn.json"
}

## Data splitting

We then perform a temporal split on the data through the dataset object we just created. Beta-RecSys can split data in several ways, including leave-one-out, random, and several types of basket splits. A temporal split assigns interactions to training, validation, and test sets according to their timestamps: earlier interactions are used for training and later ones for validation and testing. This simulates a real-world setting in which future behaviour is predicted from past events, and it suits our data because investment transactions depend on temporal factors. Other splitting strategies can be tried if needed.
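
To make the idea concrete, here is a minimal sketch of a temporal split based on two cutoff dates. It is an illustration only and is not how Beta-RecSys implements `load_temporal_split`; the helper `temporal_split`, its parameters, the cutoff dates, and the toy rows are all assumptions made for this example.

```python
import pandas as pd


def temporal_split(df, valid_start, test_start, time_col="col_timestamp"):
    """Split interactions chronologically using two cutoff dates (hypothetical helper)."""
    ts = pd.to_datetime(df[time_col])
    valid_start, test_start = pd.Timestamp(valid_start), pd.Timestamp(test_start)
    train = df[ts < valid_start]                          # earliest interactions
    valid = df[(ts >= valid_start) & (ts < test_start)]   # middle period
    test = df[ts >= test_start]                           # most recent interactions
    return train, valid, test


# Toy interactions in the same column layout as the notebook's dataframe.
toy = pd.DataFrame(
    {
        "col_user": [0, 1, 1, 2, 0, 2],
        "col_item": [12541, 11907, 13233, 12917, 89, 268],
        "col_rating": [1.0] * 6,
        "col_timestamp": [
            "2018-05-09", "2018-02-01", "2018-10-12",
            "2018-07-18", "2021-05-07", "2021-02-09",
        ],
    }
)

train, valid, test = temporal_split(toy, "2018-10-01", "2021-01-01")
print(len(train), len(valid), len(test))
```

Every training interaction precedes every validation interaction, which in turn precedes every test interaction, so the model is never evaluated on events older than those it learned from.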

The cell below runs the split with Beta-RecSys and prints statistics for each resulting set.

In [5]:
split_dataset = nbgdata.load_temporal_split(n_test=1, n_negative=100)
new_data = BaseData(split_dataset)

--------------------------------------------------------------------------------
Loaded training set statistics
+---------+------------+------------+--------------+-----------------+
|         | col_user   | col_item   | col_rating   | col_timestamp   |
|---------+------------+------------+--------------+-----------------|
| count   | 214521     | 214521     | 214521       | 214521          |
| nunique | 10894      | 926        | 2            | 693             |
+---------+------------+------------+--------------+-----------------+
valid_data_0 statistics
+---------+------------+------------+--------------+-----------------+
|         | col_user   | col_item   | col_rating   | col_timestamp   |
|---------+------------+------------+--------------+-----------------|
| count   | 531566     | 531566     | 531566       | 531566          |
| nunique | 5150       | 938        | 2            | 1               |
+---------+------------+------------+--------------+-----------------+
test_data_0 

## Model training

Collaborative filtering is a recommendation approach that uses user-item interactions to recommend items a user might be interested in, based on the behaviour of similar users. We choose LightGCN (paper: https://arxiv.org/abs/2002.02126), which simplifies the Neural Graph Collaborative Filtering (NGCF) model by keeping only its most essential component for collaborative filtering, neighbourhood aggregation over the user-item graph. The model is trained on the temporally split data we just constructed.
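
The neighbourhood aggregation can be sketched in a few lines of NumPy. This follows the propagation rule described in the LightGCN paper (symmetrically normalized adjacency, no feature transformation or nonlinearity, uniform averaging over layers) and is not the Beta-RecSys implementation; the function `lightgcn_propagate`, the toy interaction matrix, and the random embeddings are assumptions for illustration.

```python
import numpy as np


def lightgcn_propagate(interactions, user_emb, item_emb, n_layers=3):
    """Propagate embeddings over the user-item graph (illustrative sketch)."""
    n_users, n_items = interactions.shape

    # Bipartite adjacency matrix: users and items share one node index space.
    size = n_users + n_items
    adj = np.zeros((size, size))
    adj[:n_users, n_users:] = interactions
    adj[n_users:, :n_users] = interactions.T

    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep a zero row.
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[degree > 0] = degree[degree > 0] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]

    # Each layer only aggregates neighbour embeddings.
    layer = np.vstack([user_emb, item_emb])
    layers = [layer]
    for _ in range(n_layers):
        layer = norm_adj @ layer
        layers.append(layer)

    # Final embedding: uniform average over layer 0 .. n_layers.
    final = np.mean(layers, axis=0)
    return final[:n_users], final[n_users:]


# Toy data: 3 users, 4 items (the last item has no interactions).
interactions = np.array(
    [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 0]],
    dtype=float,
)
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(3, 8))
item_emb = rng.normal(size=(4, 8))

users, items = lightgcn_propagate(interactions, user_emb, item_emb)
scores = users @ items.T  # predicted preference of every user for every item
print(scores.shape)
```

In the training log below, the adjacency matrix has shape (11820, 11820), which matches this construction: 10894 users plus 926 items.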

In [6]:
from beta_rec.recommenders import LightGCN

# Build the model from the configuration and train it on the temporally split data.
engine = LightGCN(config_file)
engine.train(new_data)
print('Model training complete')

loading config file C:\Users\Javier\Documents\glasgow\Infinitech-FAR-CollaborativeFiltering\configs\lightgcn.json
--------------------------------------------------------------------------------
Received parameters from command line (or default):
+--------+----------+
| keys   | values   |
|--------+----------|
+--------+----------+
--------------------------------------------------------------------------------
logs will save in file: C:\Users\Javier\PycharmProjects\pythonProject\IPythonNotebooks\default\logs/lightgcn_default_20220127_180546_qdsifk .stdout.log .stderr.log
2022-01-27 18:05:48 [INFO]-Python version: 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
2022-01-27 18:05:48 [INFO]-pytorch version: 1.10.1+cu113
2022-01-27 18:05:48 [INFO]-The intermediate running statuses will be reported in folder: C:\Users\Javier\PycharmProjects\pythonProject\IPythonNotebooks\default\runs/lightgcn_default_20220127_180546_qdsifk
2022-01-27 18:05:48 [INFO]-Model c

2022-01-27 18:05:48 [ERROR]-
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ratings[DEFAULT_RATING_COL][ratings[DEFAULT_RATING_COL] > 0] = 1.0


2022-01-27 18:05:50 [INFO]-C:\Users\Javier\PycharmProjects\pythonProject\IPythonNotebooks\default\processes/nbg_marketplace_cf_1/ngcf_nbg_marketplace_cf_1_temporal
2022-01-27 18:05:56 [INFO]-already create adjacency matrix (11820, 11820)
2022-01-27 18:05:56 [INFO]-generate single-normalized adjacency matrix.
2022-01-27 18:05:57 [INFO]-generate single-normalized adjacency matrix.
2022-01-27 18:05:57 [INFO]-already normalize adjacency matrix
2022-01-27 18:05:57 [INFO]-Setting device for torch_engine cuda:0
2022-01-27 18:05:57 [INFO]-
LightGCN(
  (user_embedding): Embedding(10894, 64)
  (item_embedding): Embedding(926, 64)

2022-01-27 18:05:57 [INFO]-
2022-01-27 18:06:30 [INFO]-Making PairwiseNegativeDataset of length 214521
  0%|                                                                                                                                                                                                                                                | 0/5 [00:00<?, ?it/s]E

2022-01-27 18:07:17 [ERROR]-Exception in thread Thread-22:
2022-01-27 18:07:17 [ERROR]-Traceback (most recent call last):
2022-01-27 18:07:17 [ERROR]-  File "C:\Users\Javier\AppData\Local\Programs\Python\Python38\lib\threading.py", line 932, in _bootstrap_inner
2022-01-27 18:07:17 [ERROR]-    self.run()
2022-01-27 18:07:17 [ERROR]-  File "C:\Users\Javier\AppData\Local\Programs\Python\Python38\lib\threading.py", line 870, in run
2022-01-27 18:07:17 [ERROR]-    self._target(*self._args, **self._kwargs)
2022-01-27 18:07:17 [ERROR]-  File "C:\Users\Javier\AppData\Roaming\Python\Python38\site-packages\beta_rec\utils\common_util.py", line 232, in wrapper
2022-01-27 18:07:17 [ERROR]-    result = method(*args, **kw)
2022-01-27 18:07:17 [ERROR]-  File "C:\Users\Javier\AppData\Roaming\Python\Python38\site-packages\beta_rec\core\eval_engine.py", line 112, in train_eval_worker
2022-01-27 18:07:17 [ERROR]-    testEngine.expose_performance(valid_result, test_result)
2022-01-27 18:07:17 [ERROR]-Attri

2022-01-27 18:07:32 [INFO]-[Training Epoch 1], Loss 0.010593081824481487
 40%|############################################################################################8                                                                                                                                           | 2/5 [01:02<01:40, 33.42s/it]Epoch 2 starts !
2022-01-27 18:07:32 [INFO]---------------------------------------------------------------------------------
2022-01-27 18:08:15 [INFO]-[Training Epoch 2], Loss 0.013884486630558968
 60%|###########################################################################################################################################2                                                                                            | 3/5 [01:45<01:15, 37.65s/it]Epoch 3 starts !
2022-01-27 18:08:15 [INFO]---------------------------------------------------------------------------------
2022-01-27 18:09:13 [INFO]-[Training Epoch 3], Loss 0.0259892866015434

## Obtaining recommendations

Once the model is trained, it can score every financial asset for each user and rank the assets by predicted score. The cell below lists the serialized user IDs for which recommendations can be requested; user 0 is the first of these.

In [None]:
data[DEFAULT_USER_COL].unique()
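
As an illustration of the ranking step, the sketch below scores all items for user 0 with a dot product between embeddings and keeps the top 5. It does not call the Beta-RecSys prediction API; the random embeddings stand in for trained ones, and `top_k_items` with its `seen` parameter is a hypothetical helper.

```python
import numpy as np

# Random stand-ins for trained embeddings (assumption: 5 users, 20 items, 8 dimensions).
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(5, 8))
item_emb = rng.normal(size=(20, 8))


def top_k_items(user_id, user_emb, item_emb, k=5, seen=()):
    """Return the k highest-scoring item indices and their scores (hypothetical helper)."""
    scores = item_emb @ user_emb[user_id]
    # Optionally exclude items the user has already interacted with.
    seen_idx = np.asarray(list(seen), dtype=int)
    scores[seen_idx] = -np.inf
    top = np.argsort(-scores)[:k]
    return top, scores[top]


items, scores = top_k_items(0, user_emb, item_emb, k=5)
for rank, (item, score) in enumerate(zip(items, scores), start=1):
    print(rank, item, round(float(score), 3))
```

Passing the user's past assets as `seen` removes them from the ranking, which is useful when only new assets should be suggested.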

## Evaluation

Beta-RecSys provides several built-in metrics for evaluating collaborative filtering recommenders. These can be specified in the configuration file together with their cutoffs, i.e. the number of top-ranked predictions over which each metric is calculated. We use four metrics commonly used to evaluate recommenders:

1. <b>Normalized discounted cumulative gain (NDCG)</b>: A measure of ranking quality that rewards placing highly relevant items at higher ranks, where they are more useful as recommendations than less relevant items. It is the ratio of the discounted cumulative gain (DCG) of the produced ranking to the ideal DCG (IDCG), i.e. the DCG of the best possible ordering. With $rel_i$ denoting the relevance of the item at rank $i$, DCG and NDCG at position $p$ are:

\begin{equation}
\text{DCG}_{p} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_2 (i+1)}
\end{equation}

\begin{equation}
\text{NDCG @ position p} = \frac{\text{DCG}_{p}}{\text{IDCG}_{p}}
\end{equation}


2. <b>Precision</b>: The fraction of retrieved items that are relevant, i.e. the true positives among all retrieved items.

\begin{equation}
\text{Precision} = \frac{True \ Positives}{True \ Positives \ + \ False \ Positives}
\end{equation}

3. <b>Recall</b>: Also referred to as sensitivity, this refers to the fraction of all relevant instances that were retrieved.

\begin{equation}
\text{Recall} = \frac{True \ Positives}{True \ Positives \ + \ False \ Negatives}
\end{equation}

4. <b>Mean average precision (MAP)</b>: Average precision (AP) summarizes the precision-recall curve of a single ranking as the area under that curve, computed as a sum of precisions weighted by the increase in recall at each rank. MAP is the mean of the AP values over all users. In the formula below, $P_n$ and $R_n$ are the precision and recall at rank $n$.

\begin{equation}
\text{Average Precision} = \sum_n (R_n - R_{n-1}) P_n
\end{equation}

These metrics are calculated at four cutoffs: the top 1, 3, 5 and 10 predictions. When the model is tested, Beta-RecSys automatically calculates them for the cutoffs specified in the configuration file.
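
The four definitions above can be checked on a small example. The following is a minimal sketch in plain Python, independent of the Beta-RecSys evaluation code; the function names and the toy ranking are assumptions. Here `relevances` is the relevance of each item in a user's full ranked list, best-ranked first, and `n_relevant` is the total number of relevant items for that user.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def precision_at_k(relevances, k):
    """Fraction of the top k items that are relevant."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def recall_at_k(relevances, k, n_relevant):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for rel in relevances[:k] if rel > 0)
    return hits / n_relevant if n_relevant else 0.0


def average_precision(relevances, n_relevant):
    """Sum of (R_n - R_{n-1}) * P_n; recall only increases at relevant ranks."""
    hits, total = 0, 0.0
    for n, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / n  # precision at rank n
    return total / n_relevant if n_relevant else 0.0


# Toy ranking of 10 items; the relevant ones sit at ranks 1, 3 and 6.
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
for k in (1, 3, 5, 10):
    print(k, ndcg_at_k(ranked, k), precision_at_k(ranked, k), recall_at_k(ranked, k, 3))
print("AP", average_precision(ranked, 3))
```

MAP is then the mean of `average_precision` over all evaluated users.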

In [13]:
engine.config["save_mode"] = "average"
temporal_result = engine.test(new_data.test[0])
print('Model testing complete')

KeyError: 'save_mode'

The results are displayed in JSON form below.

In [None]:
temporal_result