# Introduction

This code is based on SBERT (Sentence-BERT), a sentence-embedding framework built on BERT that uses a siamese network architecture. See https://arxiv.org/pdf/1908.10084.pdf.

## Model Architecture

The simplest variant of our model is stock SBERT. It consists of two major parts: embedding and searching. After embedding the company descriptions into representation space (see the vertical flow going downwards in the chart), we can query the space for similar companies using a cosine similarity matching algorithm (see the blue box). The settings are held as shared state, so each model parameter can be reconfigured individually, at any time, in one step. This makes the model easy to tune. Default parameters are provided in the file `Settings.custom`.

![image](report/uml.jpeg)
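
The searching part can be illustrated with a minimal sketch. It is written in plain Python on toy 3-dimensional vectors, and the helper names `cosine_similarity` and `top_n_similar` are hypothetical; the project's `Query` class may be implemented differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_n_similar(query_vec, universe, n=10):
    """Return (index, score) pairs for the n most similar rows, best first."""
    scored = [(i, cosine_similarity(query_vec, row)) for i, row in enumerate(universe)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy embeddings standing in for the embedded company universe
universe = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
matches = top_n_similar([1.0, 0.0, 0.0], universe, n=2)
```

In practice the same computation would be vectorised over the whole universe rather than looped in Python.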



# Configure settings and data

After installing dependencies, import all relevant modules and configure the settings:

In [123]:
%load_ext autoreload
%autoreload 2
#!pip install scikit-learn

from Settings.settings import *
from Data.config_data import *
from Model.model import *
from Query.query import *

#configure settings
ms = Settings()
ms.configure()



Configure data: this step goes through the CSV file, extracts and trims the company descriptions, and returns a list of companies. `ConfigData` can be called in three ways:
1. To work with the entire universe, call `ConfigData.run()`
2. To work with IT companies only, call `ConfigData.run_IT()`
3. To work with a small dataset of the first 10000 rows, call `ConfigData.run_test()`

In [5]:
data = ConfigData().run_test()

We can see that this returns a list of companies with their descriptions. Each entry holds the company name at index 0 and the description at index 1.

In [6]:
data[5:7]

[['Klink Mobile',
  'Klink Mobile, Inc. mobile payments company provides secure global application wireless infrastructure enables people from around globe transfer money exchange'],
 ['CompareTheJobBoards.com',
  'A comparison site just jobs. A job seeker can search array adverts across hundreds job boards all in place making']]
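
Both sample descriptions above contain exactly 20 words and are missing common stop words, which suggests that trimming removes stop words and then truncates. The sketch below shows one way this could be done. The stop-word list, the 20-word cap, and the function name `trim_description` are assumptions, and the input sentence is made up; `ConfigData` may work differently.

```python
# Hypothetical stop-word list; the project's actual list is not shown here.
STOP_WORDS = {"is", "a", "an", "the", "that", "to", "and", "of", "for"}

def trim_description(text, max_words=20):
    """Drop stop words, then keep at most max_words of the remaining words."""
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(kept[:max_words])

# Made-up input, for illustration only
raw = "Acme Corp is a company that builds the tools for a team"
trimmed = trim_description(raw)
```

The match is case-sensitive, so a capitalised "A" at the start of a sentence survives, as it does in the second sample description.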

# Word Embedding
We then use `Model()` to embed our selected company universe into representation space:

In [None]:
model = Model(data)
embed_univ = model.run()

As we can see, this is a 768-dimensional vector space:

In [9]:
len(embed_univ[0])

768
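
Each description becomes a single fixed-length vector because the per-token outputs of BERT are pooled. Below is a minimal sketch of mean pooling, the default pooling strategy in the SBERT paper, using toy 4-dimensional token vectors in place of 768-dimensional ones. Whether this project uses mean pooling is an assumption, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                totals[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]

# Two real tokens and one padding token
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
```

The output length equals the token vector length, regardless of how many tokens the description has.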

In [14]:
import Settings
Settings.custom.top_n

10
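
The `top_n` value above is read from the shared settings. As a rough illustration of what shared state means, the sketch below uses a pattern in which every instance points at the same attribute dictionary. The class name `SharedSettings` and the `threshold` default are hypothetical; the project's `Settings` class may be built differently.

```python
class SharedSettings:
    """Every instance shares one attribute dictionary."""

    # top_n mirrors the value shown above; threshold is a made-up default
    _shared_state = {"top_n": 10, "threshold": 0.8}

    def __init__(self):
        # Point this instance's attributes at the shared dictionary
        self.__dict__ = self._shared_state

    def configure(self, **overrides):
        """Update any subset of parameters; all instances see the change."""
        self._shared_state.update(overrides)

first = SharedSettings()
second = SharedSettings()
first.configure(top_n=5)  # second.top_n is now 5 as well
```

With this design, a parameter changed in one place takes effect everywhere the settings are read, without passing a settings object around.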

## Saving the embedded universe
You might have noticed that the embedding takes some time, so we should save the result for later use. We can write it to a CSV file as follows:

In [163]:
import numpy as np
import pandas as pd

# Convert the embeddings to a NumPy array and write one row per company
emb_univ = embed_univ.numpy()
df = pd.DataFrame(emb_univ)
df.to_csv("random_select_train", index=False)
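
To reuse a saved universe, the file has to be read back into the same shape. The sketch below shows the round trip with the standard-library `csv` module on a toy matrix, so that it runs without dependencies; the helper names and the temporary file path are hypothetical. With pandas, the equivalent reload would be `pd.read_csv(path).to_numpy()`.

```python
import csv
import os
import tempfile

def save_embeddings(path, rows):
    """Write one embedding per line, preceded by a header of column indices."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(range(len(rows[0])))
        writer.writerows(rows)

def load_embeddings(path):
    """Read the file back, skipping the header and parsing floats."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return [[float(cell) for cell in row] for row in reader]

rows = [[0.25, -1.5, 3.0], [0.1, 0.2, 0.3]]
path = os.path.join(tempfile.mkdtemp(), "embeddings.csv")
save_embeddings(path, rows)
restored = load_embeddings(path)
```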

# Querying

Querying can then be done in one step: we pass in a description sentence and get the most similar companies back. Note the two scores at the bottom of the output: the first is the number of sentences above a certain cosine similarity threshold, and the second is SBERT's model evaluation score (see next section).

In [15]:
query = Query(embed_univ, data)

In [16]:
query.run("always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many")

([['Ammyy',
   'Ammyy always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many'],
  ['Crescent Technologies',
   'We provide services support achieve business goal by combines tech expertise business intelligent our customers. We take each project seriously'],
  ['Invent Orbit',
   'Invent Orbit technology start-up our mission help people organize share useful information make them accessible globally, currently we working exciting'],
  ['Aventus software',
   'We help Internet-based businesses product companies design develop cloud-native web mobile solutions. We drive digital transformation businesses by helping them'],
  ['Akson Engineering',
   'We use pioneer communication technologies enhance our clients’ business. We listen our clients, share our expertise collaborate build them perfect'],
  ['WamiTech',
   'WamiTech Technology Blog all kinds your day day technology concerns. 
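
The first of the two scores, the number of results above a similarity threshold, can be sketched as follows. The similarity values and the threshold of 0.8 are made up, and `count_above_threshold` is a hypothetical helper.

```python
def count_above_threshold(scores, threshold):
    """Count how many similarity scores are strictly above the threshold."""
    return sum(1 for score in scores if score > threshold)

# Made-up cosine similarities for four query results
scores = [0.97, 0.81, 0.79, 0.42]
n_confident = count_above_threshold(scores, threshold=0.8)
```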

# Model evaluation

We have three ways of evaluating the model. 

1. Check how many outputs are above a certain similarity threshold level
2. Evaluate the model on the similarity of its embeddings by calculating the Spearman rank correlation against "gold standard" scores. The calibration sentences are set in `Settings.custom` as follows:

In [56]:
# The reference sentence is repeated once per comparison sentence
custom.sentences1 = ['Grabango is the leader in checkout-free technology for existing, large-scale grocery and convenience stores'] * 5
custom.sentences2 = [
    'Zippin is the next generation of checkout-free technology enabling retailers to quickly deploy frictionless shopping in their stores.',
    'Standard Cognition provides an autonomous checkout tool that can be installed into retailers’ existing stores.',
    'AiFi enables reliable, cost-effective, and contactless autonomous shopping with AI-powered computer vision technology.',
    'Moveworks offers an AI platform that revolutionizes how companies support their employees.',
    'Tonkean provides an enterprise no-code process orchestration platform.',
]
# Gold-standard similarity of each pair
custom.scores = [1, 1, 1, 0.03, 0.05]

(Whenever we run `Query`, we get these scores along with the results.)

3. Out of N iterations, count how many of the company pairs produced are of sufficient correlation (see the section on training data below).
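
The second method can be sketched in plain Python. Spearman's rank correlation is the Pearson correlation of the ranks, with tied values sharing the mean of their ranks (which matters here, since three gold scores are equal). The `predicted` similarities are made up, and the helper names are hypothetical; the project itself presumably relies on SBERT's built-in evaluator.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

gold = [1, 1, 1, 0.03, 0.05]            # custom.scores from the settings
predicted = [0.9, 0.8, 0.85, 0.1, 0.2]  # made-up model similarities
rho = spearman(gold, predicted)
```

A value of `rho` close to 1 means the model orders the pairs the same way as the gold scores do.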

# Model fine tuning
So far, we have used the model "as-is". For better results, we have to retrain it. The first two modules within the model, BERT and the pooling algorithm, are effectively the same as in stock SBERT. However, we attach a fully-connected neural network (a dense layer) to the pooling module, which condenses the 768-dimensional vector space into 256 dimensions. Learning this condensation lets the model form connections between words that are not immediately obvious to the pre-trained SBERT model.
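
As an illustration of what the dense layer computes, the sketch below applies a single fully-connected layer to one pooled vector. It uses tiny dimensions (6 to 2) as stand-ins for 768 to 256, random untrained weights, and a tanh activation; the activation and the helper name `dense_forward` are assumptions.

```python
import math
import random

def dense_forward(x, weights, bias):
    """Compute tanh(W x + b) for one input vector."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

IN_DIM, OUT_DIM = 6, 2  # stand-ins for 768 and 256
rng = random.Random(0)
# Untrained weights; training would adjust these values
weights = [[rng.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
bias = [0.0] * OUT_DIM

pooled = [0.5] * IN_DIM  # toy pooled sentence vector
condensed = dense_forward(pooled, weights, bias)
```

Only the weights and bias of this layer need to be learned for the condensation itself; the earlier modules supply the pooled input.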

The dense layer trains on pairs of companies, each labelled with a cosine similarity. The network embeds both companies of a pair and computes a similarity score, which is passed to a similarity loss function along with the labelled similarity. This loss function is minimised during the training process.
![image](report/training.jpeg)
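
A minimal sketch of such a loss, assuming it is the mean squared error between the predicted cosine similarity and the label (the form used by SBERT's cosine similarity loss). The vectors and labels below are toy values, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between predicted cosine similarity and label."""
    errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Two toy embedding pairs: identical vectors and orthogonal vectors
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.5]
loss = cosine_similarity_loss(pairs, labels)
```

The first pair matches its label exactly and contributes nothing; the second is predicted as 0 against a label of 0.5, so it alone drives the loss.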

## Producing training data
The training data should consist of pairs of sentences, each with a labelled score. We prepare the data by running the algorithm to select sentences that are close together, then apply non-linear scaling to the result:

In [166]:
from Data.train_prep import *

datapreparing = TrainPrep(embed_univ, data)

In [167]:
pairs_data = datapreparing.run()

In [168]:
pd.DataFrame(pairs_data).to_csv("test.csv")
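
The preparation step can be pictured roughly as below. The actual selection rule and scaling function inside `TrainPrep` are not shown in this document, so the threshold of 0.7, the power-curve scaling, and both function names are purely hypothetical.

```python
def scale_score(similarity, floor=0.5, power=2.0):
    """Map similarities in [floor, 1] onto [0, 1] with a power curve (hypothetical)."""
    clipped = min(max(similarity, floor), 1.0)
    return ((clipped - floor) / (1.0 - floor)) ** power

def build_pairs(names, similarity, threshold=0.7):
    """Collect (name_a, name_b, label) for pairs at or above the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i][j] >= threshold:
                pairs.append((names[i], names[j], scale_score(similarity[i][j])))
    return pairs

# Toy symmetric similarity matrix for three companies
names = ["A", "B", "C"]
similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.75],
    [0.2, 0.75, 1.0],
]
training_pairs = build_pairs(names, similarity)
```

A non-linear scaling like this spreads out labels that would otherwise cluster near the top of the similarity range.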

Alternatively, we can import an existing dataset:

In [37]:
train_set = pd.read_csv("Pairs.csv", index_col=False)
train_set.drop("Unnamed: 0", axis=1, inplace=True)
train = train_set.to_numpy()

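We write a bit of code to drop self-pairs, i.e., rows whose first and third columns are identical:

In [84]:
# Start from a placeholder row so that np.vstack has something to stack onto
pairs_data = np.array([[0, 0, 0, 0]])
for i in range(0, len(train)):
    # Keep the row only if columns 0 and 2 differ
    if train[i][0]!=train[i][2]:
        pair = np.array([train[i][0], train[i][1], train[i][2], train[i][3]])
        pairs_data = np.vstack([pairs_data, pair])
# Remove the placeholder row
good_pairs = np.delete(pairs_data, 0, 0)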

In [122]:
len(good_pairs)

1010
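
The filter above removes only rows whose first and third columns match. If mirrored pairs such as (A, B) and (B, A) should also count as duplicates, a stricter filter could look like the sketch below. It assumes columns 0 and 2 hold the two company names; the function name and the sample rows are hypothetical.

```python
def drop_self_and_mirrored_pairs(rows):
    """Keep the first occurrence of each unordered company pair."""
    seen = set()
    kept = []
    for row in rows:
        name_a, name_b = row[0], row[2]
        if name_a == name_b:
            continue  # a company paired with itself
        key = frozenset((name_a, name_b))
        if key in seen:
            continue  # (A, B) already kept, so skip (B, A) or an exact repeat
        seen.add(key)
        kept.append(row)
    return kept

# Made-up rows in an assumed layout: name, description, name, score
rows = [
    ["A", "desc A", "B", 0.9],
    ["B", "desc B", "A", 0.9],
    ["C", "desc C", "C", 1.0],
    ["A", "desc A", "D", 0.8],
]
unique_rows = drop_self_and_mirrored_pairs(rows)
```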

## Train model

We then pass the pairs to the `train_model` module to train the dense layer. Please note that this part is not finished. As of right now, we only train on "good" pairs, i.e., pairs that are very similar. For the dense layer to learn our data, we also need to pass in "bad" pairs. To do this, the private method `TrainModel()._input_good_data` will have to be modified to accept both good and bad pairs. I leave this to the next engineer.

Once this is complete, the model trains without issues.
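
One possible starting point for the missing "bad" pairs is to sample random company pairs that are not in the good set and give them a low label. This is only a sketch of that idea: the label of 0.0, the function name `sample_bad_pairs`, and the toy inputs are assumptions, and randomly drawn companies can still turn out to be similar.

```python
import random

def sample_bad_pairs(companies, good_pairs, count, seed=0):
    """Draw random pairs not listed as good and label them 0.0."""
    rng = random.Random(seed)
    used = {frozenset(pair) for pair in good_pairs}
    bad = []
    attempts = 0
    # The attempt cap avoids looping forever when too few pairs are available
    while len(bad) < count and attempts < 100 * count:
        attempts += 1
        a, b = rng.sample(companies, 2)  # two distinct companies
        key = frozenset((a, b))
        if key in used:
            continue
        used.add(key)  # also prevents drawing the same bad pair twice
        bad.append((a, b, 0.0))
    return bad

companies = ["A", "B", "C", "D"]
good_pairs = [("A", "B")]
bad_pairs = sample_bad_pairs(companies, good_pairs, count=3)
```

A more careful variant would check the current cosine similarity of each sampled pair and keep only those below some threshold.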

In [95]:
from Model.train_model import *
ms.configure()
trainer = TrainModel(good_pairs)

In [113]:
trainer.train(1)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/101 [00:00<?, ?it/s]

0.7826237921249264


# Using the fine-tuned model

Once training is complete, we can use the fine-tuned model as before. The only modification is to pass the trained model as a keyword argument.

For the embedding, we write:

In [117]:
model = Model(data, model = trainer.model)
embed_univ = model.run()

and for the querying, we write:

In [121]:
query = Query(embed_univ, data, model = trainer.model)
query.run("always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many")

([['Ammyy',
   'Ammyy always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many'],
  ['miracl3',
   'We leading web development web design agency, providing best web designing development services from last 10+ years at reasonable cost.We'],
  ['Klink Mobile',
   'Klink Mobile, Inc. mobile payments company provides secure global application wireless infrastructure enables people from around globe transfer money exchange'],
  ['Decimator Design',
   'Decimator Design Australian technology company founded in 2006 focusing design manufacture quality products service broadcast environment but applications in other'],
  ['Juick', 'IM-based social network microblogging service.'],
  ['Mongo',
   'Mongo LLP independent application studio has developed social networking app iOS devices named Mongo - location-based networkin'],
  ['Zebel',
   'Zebel software offers data analytics tools multifa

# Potential problems

1. We currently don't use the fact that these companies are categorized. It is unclear whether this helps or hurts the results.
2. Users with differently formatted datasets will have to look at the `config_data` module and adapt it to their format.
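
For the second point, one way to reduce the adaptation effort would be a loader that takes the column names as parameters. The sketch below is hypothetical and not part of `config_data`; the column names and sample rows are made up.

```python
import csv
import io

def load_companies(handle, name_col, description_col):
    """Return [name, description] pairs from a CSV with arbitrary column names."""
    reader = csv.DictReader(handle)
    return [[row[name_col], row[description_col]] for row in reader]

# In-memory stand-in for a differently formatted CSV file
sample = io.StringIO("org,about\nAcme,Builds tools\nGlobex,Ships parcels\n")
companies = load_companies(sample, name_col="org", description_col="about")
```

The output has the same shape as the list returned by `ConfigData`, with the name at index 0 and the description at index 1.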
