## Learning to Rank for Peers

Peer selection can be thought of as a ranking problem: for a given borrower company (hereafter referred to as the query
company, or just query), we aim to rank the companies present in the data platform by how likely they are to be a suitable peer for the query.

Our initial attempts using ElasticSearch's built-in functionality produced a useful baseline, but we still
retrieve too many false positives (i.e. we suggest peers that are in fact unrelated to the query). We hope that learning to rank (LTR) based on annotations provided by analysts can be used to retrieve useful peers more accurately.

Next, we explain how we intend to use annotations and the LTR methodology explained [here](https://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) to improve the model.

### Training for LTR

Before explaining how to use the annotations for training, it is useful to briefly explain the LTR methodology we
intend to use. At a high level, LTR attempts to learn whether a given document is relevant for a given query based on a set of features extracted for the document/query pair. These features, together with a label giving the relevance of the document for the query (provided by analysts in our case, see below), form the basis of our training routine. We next explain the above more formally.

###### Form of training data

The training data consists of a set $\mathcal{Q}$ of queries (base borrowers in our case), each with a set of retrieved documents $D_q$ together with their relevance scores $R_q$ (annotated peers in our case), with $q\in \mathcal{Q}$.

Let us see what the training data would look like for a given query $q$. For simplicity, let $R_i$ denote the relevance of document $i$ and let $x_i$ denote the vector of features extracted for $i$, with $i\in D_q$. For example, $x_i$ can be a vector with two components: the dot product of the tf-idf bag-of-words representations of the query company and peer company descriptions, and a time-series distance between the query company and the peer company.
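As an illustration of the first feature, the sketch below computes a tf-idf dot product between company descriptions using only the standard library. The whitespace tokenisation, the smoothed idf variant, the L2 normalisation and the example descriptions are all assumptions made for the example, not choices taken from our pipeline. Descriptions are assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised sparse tf-idf dict per (non-empty) document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Assumption: smoothed idf, so terms present everywhere keep a non-zero weight.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical descriptions: a query company, a plausible peer, an unrelated company.
descriptions = [
    "regional airline operating passenger flights",
    "low cost airline operating passenger flights",
    "manufacturer of industrial pumps",
]
query_vec, peer_vec, other_vec = tfidf_vectors(descriptions)
text_sim_peer = sparse_dot(query_vec, peer_vec)
text_sim_other = sparse_dot(query_vec, other_vec)
```

Because the vectors are normalised, the dot product is a cosine similarity in $[0, 1]$, which keeps this component of $x_i$ on a fixed scale.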

Thus, over all queries $q\in \mathcal{Q}$, the training data comprises tuples of the form $(x^q_i, R^q_i)$, with $x^q_i$ being the features of a document $i\in D_q$ and $R^q_i$ being the relevance of that document for query $q$.

With data in this form, we can define a per-query loss and accumulate it over all queries to train a model. We describe this next.

###### Loss function

In order to train an LTR model, we need to define a loss function. We begin by showing what such a function might look like for a single query $q\in\mathcal{Q}$. We then simply accumulate this over all queries.

Suppose we have a query $q\in\mathcal{Q}$, with pairs $(x_i, R_i)$ as above (with the superscript indexing the query omitted for brevity). Suppose also that $x_i$ has dimension $d$. We then want to learn a function $f: \mathbb{R}^d \to \mathbb{R}$ such that $f(x_i) > f(x_j)$ if $R_i > R_j$. In other words, $f$ takes the extracted query-document features and attempts to map more relevant documents to larger numbers. $f$ can be anything, but a natural and flexible choice is a neural network.

Now the question becomes how we can train the neural network for $f$ in a principled way. For this, we can adopt a probabilistic interpretation of ranking. Suppose that for every pair $i,j$ of documents we define $P_{ij}$, the probability of document $i$ being ranked higher than document $j$. We use the relevance annotations of documents $i$ and $j$ to obtain target values for $P_{ij}$, by setting $P_{ij} = 1$ if $R_i > R_j$, or $P_{ij} = 0.5$ if $R_i = R_j$ (the remaining case $R_i < R_j$ corresponds to $P_{ij} = 0$ and is covered by swapping the roles of $i$ and $j$). Thus we have a principled way of obtaining targets for pairs of documents, even when their relevances are equal and there is no inherent ordering between the two.
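The target rule can be sketched as a small function over relevance labels. The third case (target 0 when $R_i < R_j$) is taken here as the symmetric complement of the first; the function name and the example labels are illustrative.

```python
def pairwise_target(r_i, r_j):
    """Target probability that document i should rank above document j."""
    if r_i > r_j:
        return 1.0
    if r_i == r_j:
        return 0.5
    return 0.0  # assumption: symmetric complement of the first case

# Hypothetical relevance labels for four retrieved documents of one query.
relevance = [4, 2, 2, 1]
targets = [[pairwise_target(r_i, r_j) for r_j in relevance] for r_i in relevance]
```

By construction `targets[i][j] + targets[j][i]` is always 1, so only one orientation of each pair needs to be kept.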

So now it remains to understand how we can use these computed $P_{ij}$ targets in order to train $f$. We can do this by computing the difference in $f$ values for pairs of documents and use this difference to attempt to compute a predicted value of the probability of $i$ being ranked higher than $j$, denoted $\overline{P}_{ij}$. A natural choice for this is the sigmoid function. 

More formally, let $o_{ij} = f(x_i) - f(x_j)$ and let $\overline{P}_{ij} = \sigma(o_{ij})$, where $\sigma$ is the sigmoid function. We can then define a cross-entropy loss for this pair as
$$C(i,j) = -P_{ij}\log(\overline{P}_{ij}) - (1-P_{ij})\log(1-\overline{P}_{ij}).$$
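A minimal sketch of the pairwise loss, taking the two scores $f(x_i)$ and $f(x_j)$ and the target $P_{ij}$ as plain floats. The clipping constant `eps` is an assumption added only to keep the logarithms finite.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def pairwise_loss(score_i, score_j, target, eps=1e-12):
    """Cross-entropy between the target P_ij and sigmoid(f(x_i) - f(x_j))."""
    p = sigmoid(score_i - score_j)
    p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)

# Document i is more relevant than document j (target 1.0).
loss_correct_order = pairwise_loss(2.0, 0.0, 1.0)  # f ranks i above j
loss_wrong_order = pairwise_loss(0.0, 2.0, 1.0)    # f ranks j above i
# Equally relevant documents with equal scores (target 0.5).
loss_tie = pairwise_loss(1.0, 1.0, 0.5)
```

For a tie with equal scores the loss equals $\log 2$, which is its minimum for a target of 0.5; a correctly ordered pair costs less than that and an inverted pair costs more.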

By the above, we can define a per-query loss function as follows:

$$\mathcal{L}_q = \sum_{i=1}^{m_q}\sum_{j=1, R_j \leq R_i }^{m_q} C(i,j)$$, with $m_q$ denoting the number of retrieved documents (suggested peers in our case) for query $q$. In plain terms, the above loss trains a function $f$ such that $f(x_i) > f(x_j)$ if $R_i > R_j$, with the cross-entropy loss $C(i,j)$ penalising pairs for which this does not hold.

Now this can be summed over all queries to obtain the final loss:

$$\mathcal{L}_{\mathcal{Q}} = \sum_{q\in\mathcal{Q}}\mathcal{L}_q$$
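The per-query and total losses can be sketched as follows. For brevity, $f$ is a linear scorer standing in for the neural network, and the dataset is hypothetical. Terms with $i = j$ are skipped because they contribute only the constant $\log 2$, which does not affect training.

```python
import math

def pair_loss(s_i, s_j, target):
    """Cross-entropy C(i, j) for scores s_i, s_j and target P_ij."""
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)

def query_loss(scores, relevance):
    """Sum C(i, j) over ordered pairs with R_j <= R_i (and j != i)."""
    total = 0.0
    for i, (s_i, r_i) in enumerate(zip(scores, relevance)):
        for j, (s_j, r_j) in enumerate(zip(scores, relevance)):
            if i == j or r_j > r_i:
                continue
            target = 1.0 if r_i > r_j else 0.5
            total += pair_loss(s_i, s_j, target)
    return total

def total_loss(weights, dataset):
    """Accumulate the per-query loss over all queries."""
    loss = 0.0
    for docs in dataset.values():
        # Assumption: linear f(x) = w . x as a stand-in for the network.
        scores = [sum(w * x for w, x in zip(weights, feats)) for feats, _ in docs]
        loss += query_loss(scores, [rel for _, rel in docs])
    return loss

# Hypothetical data: query id -> list of (features x_i, relevance R_i).
dataset = {
    "query_a": [([0.9, 0.1], 4), ([0.5, 0.4], 2), ([0.1, 0.9], 1)],
    "query_b": [([0.8, 0.2], 3), ([0.2, 0.7], 1)],
}
loss_good = total_loss([3.0, -3.0], dataset)  # orders documents as annotated
loss_bad = total_loss([-3.0, 3.0], dataset)   # inverts the annotated order
```

In practice the same loss would be written in an autodiff framework so that gradients with respect to the parameters of $f$ are available; the plain-Python version here is only meant to make the summation explicit.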

### Application in our case

Here we explain how the above LTR framework can be used to train a shallow neural-net model that can be a first POC for our peers use case.

The training data consists of a set of carefully chosen query companies for which we generated peer suggestions using the ElasticSearch baseline. Annotators then scored these peers based on how relevant they judged them to be, with "most relevant", "relevant", "least relevant" and "non-relevant" being the levels (see the consensus experiment [doc](https://acornlab.atlassian.net/wiki/spaces/ML/pages/1032716566/Consensus+Experiment) for details).

Restricting our attention to a single query $q$ for now, we can use the above information as follows. Define four relevance levels to match the above and assign the following numeric values:
* 4 - Most Relevant
* 3 - Relevant
* 2 - Least Relevant
* 1 - Non-relevant (negative)

Thus, for a document $i\in D_q$ we can set $R_i$ to one of 1, 2, 3, 4 as above. As far as $x_i$ is concerned, we are free to extract any document-query features we feel are relevant. As an example, we set $x_i$ to be a 2-dimensional vector where the first component is the tf-idf dot product between the query and suggested peer company descriptions, and the second component is the dynamic time warping (DTW) distance between the total revenue time series of the query and suggested companies.
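The DTW component can be sketched with the classic dynamic-programming recurrence. The absolute-difference local cost, the absence of a warping window, and the revenue figures are assumptions for illustration; the text-similarity value is a hypothetical placeholder.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best alignment cost of the first i points of a and first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumption: absolute difference
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical total revenue series (arbitrary units).
query_revenue = [1.0, 1.2, 1.5, 1.1]
peer_revenue = [1.0, 1.0, 1.2, 1.5, 1.1]  # same shape, one extra flat period
other_revenue = [3.0, 2.0, 1.0, 0.5]

text_similarity = 0.62  # hypothetical tf-idf dot product for this pair
x_peer = [text_similarity, dtw_distance(query_revenue, peer_revenue)]
```

Since DTW allows one point to align with several, the series with an extra flat period is at distance zero from the query. Whether revenues should be normalised (for example by company size) before computing the distance is left open here.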

We can thus use the above training framework to obtain a principled way of training model parameters for the shallow neural network $f$.

It should be noted that the above numeric values apply only to the most recent batch of annotations, while the V1 annotations and the initial set of 40 public companies have different scoring strategies. We can, however, analogously apply this methodology to those annotations as well, by setting $P_{ij} = 0.5$ for two equally relevant documents $i,j$ and setting $P_{ij} = 1$ when document $i$ is judged to be more relevant than document $j$.

Another thing to be careful about is how we handle inter-annotator disagreement on the relevance of certain documents. An elegant way to solve this is to treat each annotator as independent and sum the per-query loss over the annotators of that query. Thus, $\mathcal{L}_q$ becomes

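$$\mathcal{L}_q = \sum_a\sum_{i=1}^{m_q}\sum_{j=1, R^a_j \leq R^a_i }^{m_q} C^a(i,j)$$ where we now sum over all annotators $a$ of a query $q$, $R^a_i$ denotes annotator $a$'s relevance call for document $i$, and $C^a(i,j)$ is the pairwise loss $C(i,j)$ computed with the targets $P_{ij}$ derived from annotator $a$'s labels.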
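A sketch of the annotator-summed loss for a single query. It assumes every annotator has labelled all $m_q$ documents of the query; the annotator ids, labels and scores are hypothetical.

```python
import math

def pair_loss(s_i, s_j, target):
    """Cross-entropy C(i, j) for scores s_i, s_j and target P_ij."""
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)

def query_loss(scores, relevance):
    """Per-query loss for one annotator's labels (i == j terms skipped)."""
    total = 0.0
    for i, (s_i, r_i) in enumerate(zip(scores, relevance)):
        for j, (s_j, r_j) in enumerate(zip(scores, relevance)):
            if i == j or r_j > r_i:
                continue
            total += pair_loss(s_i, s_j, 1.0 if r_i > r_j else 0.5)
    return total

def multi_annotator_query_loss(scores, labels_by_annotator):
    """Sum the per-query loss over annotators, each treated as independent."""
    return sum(query_loss(scores, labels) for labels in labels_by_annotator.values())

scores = [2.0, 1.0, 0.0]  # f(x_i) for three suggested peers
labels_by_annotator = {
    "annotator_1": [4, 3, 1],
    "annotator_2": [4, 2, 2],  # disagrees on the last two documents
}
loss = multi_annotator_query_loss(scores, labels_by_annotator)
```

One consequence of this choice is that pairs on which annotators disagree pull the scores in opposite directions, so the model is pushed towards a compromise rather than towards either annotator's ordering.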

At test time, we can use the output of $f$ to reorder documents from highest value to lowest for a given query.
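Reordering at test time amounts to a sort by score. In the sketch below, `linear_score` is a hypothetical stand-in for the trained $f$, and the candidate names and features are made up.

```python
def rerank(candidates, score_fn):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda c: score_fn(c["features"]), reverse=True)

def linear_score(features, weights=(3.0, -1.0)):
    """Hypothetical stand-in for the trained f."""
    return sum(w * x for w, x in zip(weights, features))

# Features: [text similarity, time-series distance] for each suggested peer.
candidates = [
    {"name": "peer_a", "features": [0.2, 1.5]},
    {"name": "peer_b", "features": [0.9, 0.1]},
    {"name": "peer_c", "features": [0.5, 0.4]},
]
ranked_names = [c["name"] for c in rerank(candidates, linear_score)]
```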

###### Potentially relevant features

When constructing $x_i$, we need to decide which query-document features to extract. Fortunately, the annotation exercise has provided us with some valuable insights into what analysts care about when making a call about peers.

These are:

* Sector

* Business Area (Roughly captured by sub-sector)

* Time Series of Revenues (from the discussion with Credit, it seems that they are particularly interested in comparing similarities of time-series financials during "crisis" periods, e.g. 2012 EEA, 2008 Global Recession, etc. We might want to take this into consideration)

* Geolocation (Same continents or not?)

* Time Series of EBITDA (profitability)*

* Time Series of Total equity/ Total asset/ Market cap*

* Total employees

* Time Series of Net Debt to Equity*

* Are both companies related? (the same subsidiary company?)

* The length of the company description

For measuring similarity between the time series of two companies, we will try different baselines and choose the one that performs best. A short list of candidates includes, but is not limited to:

* Dynamic Time Warping, experiments from Credit Science team.

* DFT, DWT, DCT

* SVD

* Edit Distance
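As one possible reading of the DFT candidate, the sketch below compares two series by the Euclidean distance between their first few Fourier coefficients. The number of coefficients `k`, the normalisation by series length, the requirement that both series have equal length, and the example data are all assumptions.

```python
import cmath
import math

def dft_coefficients(series, k):
    """First k discrete Fourier coefficients, normalised by series length."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(series)) / n
        for f in range(k)
    ]

def dft_distance(a, b, k=3):
    """Euclidean distance between low-frequency DFT coefficients.

    Assumes len(a) == len(b); resampling would be needed otherwise.
    """
    ca, cb = dft_coefficients(a, k), dft_coefficients(b, k)
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(ca, cb)))

# Hypothetical series: a noisy copy of the query and a differently shaped one.
query_series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
similar_series = [1.1, 2.0, 2.9, 4.0, 3.1, 2.0, 1.0, 2.1]
different_series = [4.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 4.0]

d_similar = dft_distance(query_series, similar_series)
d_different = dft_distance(query_series, different_series)
```

Keeping only low-frequency coefficients discards short-term noise, which may or may not be desirable if the analysts care about behaviour during specific crisis periods.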

###### Performance metric for Dev purposes

The ideal way of understanding whether changes to our model are an improvement is to have the analysts take another look at the resulting suggestions. This, however, is not feasible, so we need to define an evaluation strategy that is a good proxy for this but can work in an automated fashion based on the data we have. We describe this now.

A sensible choice for an 'offline' evaluation metric that is easily available is simply the above loss on a hold-out set of queries. That is to say we train and tune the model on a certain set of queries, and evaluate how these learned parameters perform on an unseen (during training) set of queries for which we know the answer. This number will give us a proxy for how well the model picks up on positives without the requirement of an analyst investigating the new order.
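The hold-out split should be made at the level of queries, so that no query contributes documents to both sets. A minimal sketch, where the hold-out fraction, the seed and the query ids are assumptions:

```python
import random

def split_queries(query_ids, holdout_fraction=0.2, seed=0):
    """Split query ids into (train, hold-out) with a reproducible shuffle."""
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, int(round(len(ids) * holdout_fraction)))
    return ids[n_holdout:], ids[:n_holdout]

# Hypothetical query ids.
query_ids = [f"query_{k}" for k in range(10)]
train_ids, holdout_ids = split_queries(query_ids)
```

The model would then be trained and tuned on the documents of `train_ids` only, and the loss $\mathcal{L}_{\mathcal{Q}}$ reported on the documents of `holdout_ids`.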