# Financial asset recommender - Collaborative filtering with user-based kNN

## Introduction

This notebook introduces how to build a recommender that operates on a proprietary dataset of transactions between users and financial assets, in order to recommend assets similar to those the user has already invested in. It makes use of the Beta-RecSys library developed by the University of Glasgow, a package for training and evaluating deep learning recommendation models. The library is available on GitHub at: https://github.com/beta-team/beta-recsys

To get started, we install the Beta-RecSys framework. This can be done through pip - alternative installation methods are provided on the GitHub repository.


## Dataset setup

We use a proprietary dataset of investment transactions provided by the National Bank of Greece. This can be downloaded through the Infinitech Marketplace.

Beta-RecSys supports a range of well-known interaction datasets used for recommendation models. However, because this dataset is proprietary and not publicly distributed, we have to load its interactions ourselves into a format the library can use. To do so, we create a dataset object from the raw data, which Beta-RecSys can then process as interactions.

In [1]:
import sys, os
import pandas as pd
import numpy as np
from functools import partial

In [5]:
DEFAULT_USER_COL = "col_user"
DEFAULT_ITEM_COL = "col_item"
DEFAULT_RATING_COL = "col_rating"
DEFAULT_TIMESTAMP_COL = "col_timestamp"

In [7]:
data = pd.read_csv(
    'interactions-stocks.csv',
    skiprows=[0],
    engine="python",
    names=[
        DEFAULT_USER_COL,
        DEFAULT_ITEM_COL,
        DEFAULT_TIMESTAMP_COL,
        DEFAULT_RATING_COL,
    ],
)
data


Unnamed: 0,col_user,col_item,col_timestamp,col_rating
0,0,MSPR,2018-09-11,1.0
1,0,REGN,2019-11-04,1.0
2,0,RCM,2018-05-03,1.0
3,0,TXMD,2020-07-20,1.0
4,0,ZWS,2019-08-14,1.0
...,...,...,...,...
21447,999,OTEX,2019-02-12,1.0
21448,999,AREN,2020-03-17,1.0
21449,999,ONB,2020-02-08,1.0
21450,999,VLTA,2021-01-06,1.0


In [9]:
data[DEFAULT_TIMESTAMP_COL] = pd.to_datetime(data[DEFAULT_TIMESTAMP_COL], format='%Y-%m-%d')


In [10]:
data

Unnamed: 0,col_user,col_item,col_timestamp,col_rating
0,0,MSPR,2018-09-11,1.0
1,0,REGN,2019-11-04,1.0
2,0,RCM,2018-05-03,1.0
3,0,TXMD,2020-07-20,1.0
4,0,ZWS,2019-08-14,1.0
...,...,...,...,...
21447,999,OTEX,2019-02-12,1.0
21448,999,AREN,2020-03-17,1.0
21449,999,ONB,2020-02-08,1.0
21450,999,VLTA,2021-01-06,1.0


## Data splitting

We then perform a temporal split on the data. Beta-RecSys provides methods to split data in several ways, such as leave-one-out, random, and several types of basket splits. A temporal split divides the data into training, validation, and test sets on the basis of the timestamp of each data point: earlier data points are used for training and later ones for validation and testing. This simulates a real-world setting in which future behaviour is predicted from past events, and it suits our data, since investment transactions can change depending on temporal factors. Other splitting strategies can be tried if needed.

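The cell below performs the split: transactions before 2020-06-30 form the training set, and transactions on or after that date form the test set. No validation set is created here.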

In [12]:
train = data[data[DEFAULT_TIMESTAMP_COL] < "2020-06-30"]
test = data[data[DEFAULT_TIMESTAMP_COL] >= "2020-06-30"]
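
The description above also mentions a validation set, which the cell does not create. A minimal sketch of a three-way temporal split, assuming two cutoff dates are chosen by hand, could look as follows. The helper `temporal_split`, its parameters `valid_start` and `test_start`, and the toy data are hypothetical and not part of the original notebook or of Beta-RecSys.

```python
import pandas as pd


def temporal_split(df, time_col, valid_start, test_start):
    """Hypothetical helper: split df chronologically at two cutoff timestamps."""
    valid_start = pd.Timestamp(valid_start)
    test_start = pd.Timestamp(test_start)
    # Rows strictly before the first cutoff are used for training
    train = df[df[time_col] < valid_start]
    # Rows between the two cutoffs are used for validation
    valid = df[(df[time_col] >= valid_start) & (df[time_col] < test_start)]
    # Rows on or after the second cutoff are used for testing
    test = df[df[time_col] >= test_start]
    return train, valid, test


# Made-up toy interactions that reuse the notebook's column names
toy = pd.DataFrame({
    "col_user": [0, 0, 1, 1, 2, 2, 3],
    "col_item": ["A", "B", "A", "C", "B", "D", "C"],
    "col_timestamp": pd.to_datetime([
        "2019-01-10", "2019-06-01", "2020-02-15", "2020-05-20",
        "2020-06-30", "2020-09-01", "2021-01-06",
    ]),
    "col_rating": 1.0,
})

train_toy, valid_toy, test_toy = temporal_split(
    toy, "col_timestamp", valid_start="2020-01-01", test_start="2020-06-30"
)
print(len(train_toy), len(valid_toy), len(test_toy))
```

The validation set would typically be used to choose hyperparameters such as the number of neighbours, leaving the test set untouched until the final evaluation.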

## Model training

Collaborative filtering is a recommendation approach that uses user-item interactions to recommend items the user might be interested in, based on the habits of similar users. In this example, we build a user-based k-nearest-neighbours (kNN) model.

We use cosine similarity between users. Therefore, we first need to compute the norm of each user's rating vector (called the "module" in the code).
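
For reference, the cosine similarity between two users can be sketched as follows. The notation is ours and is not taken from the original notebook: $r_{ui}$ denotes the rating of user $u$ for asset $i$ in the training set, taken as 0 when there is no interaction.

\begin{equation}
\text{sim}(u, v) = \frac{\sum_{i} r_{ui} \, r_{vi}}{\sqrt{\sum_{i} r_{ui}^2} \ \sqrt{\sum_{i} r_{vi}^2}}
\end{equation}

The two square roots in the denominator are the per-user values stored in the `modules` dictionary in the following cell.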

In [14]:
import math

# Euclidean norm ("module") of each user's rating vector in the training set
modules = dict()
for u, u_df in train.groupby(DEFAULT_USER_COL):
    modules[u] = math.sqrt((u_df[DEFAULT_RATING_COL] ** 2).sum())

Then, we find the 10 most similar neighbours of each customer.

In [20]:
k = 10  # number of neighbours kept per user

user_similarities = dict()
for u in train[DEFAULT_USER_COL].unique().flatten():
    # Accumulates the dot product between u and every user sharing an asset with u
    similarity_dict = dict()

    items = train[train[DEFAULT_USER_COL] == u]
    for index, row in items.iterrows():
        i = row[DEFAULT_ITEM_COL]
        ui = row[DEFAULT_RATING_COL]
        # All training interactions with the same asset i
        i_df = train[train[DEFAULT_ITEM_COL] == i]
        for index2, row2 in i_df.iterrows():
            v = row2[DEFAULT_USER_COL]
            vi = row2[DEFAULT_RATING_COL]
            if u != v:
                if v in similarity_dict:
                    similarity_dict[v] = similarity_dict[v] + ui*vi
                else:
                    similarity_dict[v] = ui*vi

    # Divide each dot product by the product of the norms to obtain the cosine
    for v in similarity_dict:
        if modules[u]*modules[v] > 0.0:
            similarity_dict[v] = similarity_dict[v]/(modules[u]*modules[v])
    # Keep the k most similar users
    user_similarities[u] = pd.DataFrame(similarity_dict.items(), columns=["v","sim"])
    user_similarities[u] = user_similarities[u].sort_values(by=["sim"], ascending=False).head(k)
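
The nested loops above can be slow on larger datasets. A vectorised sketch of the same idea on made-up toy data is shown below. It assumes that each user-asset pair appears at most once, and, unlike the loop above, it also assigns a similarity of 0 to users with no asset in common instead of leaving them out. The toy data and the names `matrix`, `sim_df`, and `top_k` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up toy interactions that reuse the notebook's column names
toy = pd.DataFrame({
    "col_user": [0, 0, 1, 1, 2],
    "col_item": ["A", "B", "A", "C", "D"],
    "col_rating": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Users x assets rating matrix, 0.0 where there is no interaction
matrix = toy.pivot_table(
    index="col_user", columns="col_item", values="col_rating", fill_value=0.0
)
values = matrix.to_numpy(dtype=float)

# Cosine similarity: dot products divided by the product of the row norms
norms = np.linalg.norm(values, axis=1)
dots = values @ values.T
denom = np.outer(norms, norms)
sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
np.fill_diagonal(sims, 0.0)  # a user is not their own neighbour

sim_df = pd.DataFrame(sims, index=matrix.index, columns=matrix.index)

# Keep the k most similar users for each user
k = 1
top_k = {u: sim_df.loc[u].sort_values(ascending=False).head(k) for u in sim_df.index}
print(top_k[0])
```

Users 0 and 1 share one asset out of two each, so their similarity is 1 / (sqrt(2) * sqrt(2)) = 0.5, while user 2 shares no asset with anyone.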

## Obtaining recommendations

Once we have trained the model, we can predict a score for each financial asset and each user, and rank the assets by that score. Here, we show an example for user 999.
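
In our own notation (not taken from the original notebook), the score computed in the next cell can be sketched as

\begin{equation}
\text{score}(u, i) = \sum_{v \in N_k(u)} \text{sim}(u, v) \, r_{vi}
\end{equation}

where $N_k(u)$ is the set of the $k$ most similar users to $u$, $\text{sim}(u, v)$ is their cosine similarity, and $r_{vi}$ is the rating of neighbour $v$ for asset $i$ in the training set. Assets that $u$ already holds in the training set are removed, and the remaining assets are sorted by decreasing score.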

In [24]:
recs = []
for u in user_similarities:
    # Score each asset held by a neighbour as the sum of similarity * rating
    item_dict = dict()
    for index, v in user_similarities[u].iterrows():
        for index2, j in train[train[DEFAULT_USER_COL] == v["v"]].iterrows():
            j_item = j[DEFAULT_ITEM_COL]
            if j_item in item_dict:
                item_dict[j_item] = item_dict[j_item] + v["sim"] * j[DEFAULT_RATING_COL]
            else:
                item_dict[j_item] = v["sim"] * j[DEFAULT_RATING_COL]
    item_df = pd.DataFrame(item_dict.items(), columns=[DEFAULT_ITEM_COL, DEFAULT_RATING_COL])
    item_df[DEFAULT_USER_COL] = u
    # Drop the assets the user already holds in the training set
    item_df = item_df[~item_df[DEFAULT_ITEM_COL].isin(train[train[DEFAULT_USER_COL]==u][DEFAULT_ITEM_COL])]
    # Keep the 10 best-scored assets for this user
    recs.append(item_df.sort_values(by=DEFAULT_RATING_COL, ascending=False).head(10))

recomms = pd.concat(recs)

In [40]:
recomms[recomms[DEFAULT_USER_COL] == 999]

Unnamed: 0,col_item,col_rating,col_user
13,SBI,1.125833,999
2,VERB,0.979762,999
0,BILI,0.963859,999
1,PHD,0.657673,999
11,WOLF,0.622008,999
36,AIRTP,0.594861,999
16,CARA,0.594861,999
18,GOTU,0.589255,999
24,ARR,0.589255,999
25,HTD,0.578959,999


## Evaluation

For our purposes we will use three key metrics commonly used to evaluate recommenders. These are:

1. <b>Normalized discounted cumulative gain (NDCG)</b>: This is a measure of ranking quality that rewards placing highly relevant items at higher ranks, since they are more useful as recommendations than less relevant items. It is the ratio of the discounted cumulative gain (DCG) to the ideal DCG (IDCG), which is the DCG of a perfect ranking. With $rel_{i}$ the relevance of the item at rank $i$, the DCG at position $p$ is

\begin{equation}
DCG_{p} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_2 (i+1)}
\end{equation}

\begin{equation}
\text{NDCG @ position p} = \frac{DCG_{p}}{IDCG_{p}}
\end{equation}


2. <b>Precision</b>: This commonly used metric is the fraction of retrieved items that are relevant (true positives).

\begin{equation}
\text{Precision} = \frac{True \ Positives}{True \ Positives \ + \ False \ Positives}
\end{equation}

3. <b>Recall</b>: Also referred to as sensitivity, this refers to the fraction of all relevant instances that were retrieved.

\begin{equation}
\text{Recall} = \frac{True \ Positives}{True \ Positives \ + \ False \ Negatives}
\end{equation}

These metrics are calculated at rank 10 (over the top 10 recommendation results).
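
To make the three definitions concrete, here is a small worked sketch for a single user with binary relevance. The recommended list, the set of relevant assets, and the helper `precision_recall_ndcg_at_k` are made up for illustration. The sketch caps the ideal DCG at the number of relevant assets, which is one common convention.

```python
import math


def precision_recall_ndcg_at_k(recommended, relevant, k):
    """Hypothetical helper: binary-relevance metrics at rank k for one user."""
    top_k = recommended[:k]
    # 1 if the asset at that rank is relevant, 0 otherwise
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / len(top_k) if top_k else 0.0
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts each hit by log2(rank + 1)
    dcg = sum(h / math.log2(i + 1) for i, h in enumerate(hits, start=1))
    # Ideal DCG: all relevant assets ranked first, limited to k positions
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


# Made-up example: 4 recommended assets, 3 assets held in the test period
recommended = ["SBI", "VERB", "BILI", "PHD"]
relevant = {"VERB", "PHD", "WOLF"}

p, r, n = precision_recall_ndcg_at_k(recommended, relevant, k=4)
print(p, r, n)
```

Two of the four recommended assets are relevant, so precision is 2/4 and recall is 2/3. NDCG is below both because the hits sit at ranks 2 and 4 rather than at the top of the list.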

In [36]:
def precision(u, rec, test, k):
    # Fraction of the top-k recommended assets that user u holds in the test set
    assets = set(test[test[DEFAULT_USER_COL] == u][DEFAULT_ITEM_COL])
    rec_assets = set(rec[DEFAULT_ITEM_COL].head(k))
    if len(assets) == 0 or len(rec_assets) == 0:
        return 0.0
    return len(assets & rec_assets) / len(rec_assets)

def recall(u, rec, test, k):
    # Fraction of user u's test assets that appear in the top-k recommendations
    assets = set(test[test[DEFAULT_USER_COL] == u][DEFAULT_ITEM_COL])
    if len(assets) == 0:
        return 0.0
    rec_assets = set(rec[DEFAULT_ITEM_COL].head(k))
    return len(assets & rec_assets) / len(assets)

def ndcg(u, rec, test, k):
    # Binary-relevance NDCG@k; log(2)/log(1+i) equals 1/log2(1+i)
    assets = set(test[test[DEFAULT_USER_COL] == u][DEFAULT_ITEM_COL])
    if len(assets) == 0:
        return 0.0
    # Ideal DCG: every relevant asset ranked first, limited to k positions
    idcg = sum(math.log(2.0) / math.log(1 + i) for i in range(1, min(k, len(assets)) + 1))
    dcg = 0.0
    for i, item in enumerate(rec[DEFAULT_ITEM_COL].head(k), start=1):
        if item in assets:
            dcg += math.log(2.0) / math.log(1 + i)
    return dcg / idcg


In [37]:
precs = []
recalls = []
ndcgs = []

# Evaluate only users that appear in both the training and the test set
valid_users = set(train[DEFAULT_USER_COL]) & set(test[DEFAULT_USER_COL])
for r in recs:
    if r.empty:
        continue
    # Positional access: the row labels are not guaranteed to include 0
    u = r[DEFAULT_USER_COL].iloc[0]
    if u in valid_users:
        precs.append(precision(u, r, test, 10))
        recalls.append(recall(u, r, test, 10))
        ndcgs.append(ndcg(u, r, test, 10))

# Average each metric over the evaluated users, without overwriting
# the metric functions or the lists defined above
metrics = [
    {"metric": "precision", "value": sum(precs) / len(precs)},
    {"metric": "recall", "value": sum(recalls) / len(recalls)},
    {"metric": "ndcg", "value": sum(ndcgs) / len(ndcgs)},
]

metrics_df = pd.DataFrame(metrics)
metrics_df


Unnamed: 0,metric,value
0,precision,0.0
1,recall,0.0
2,ndcg,0.0


Note: in this case, all three metrics are 0.0. This is because the interactions were generated randomly, so a user's past investments carry no information about their future ones.