# Introduction

This code is built on SBERT (Sentence-BERT), a streamlined API for comparing sentences that is itself built on BERT. See the documentation at https://www.sbert.net/docs/training/overview.html

## Model Architecture

![image](report/uml.jpeg)



# Configure settings and data

In [1]:
%load_ext autoreload
%autoreload 2
#!pip install sklearn

from Settings.settings import *
from Data.config_data import *
from Model.model import *
from Query.query import *

ms = Settings()
ms.configure()

Configure data: this step reads the CSV file, extracts the company descriptions, and trims them.

In [5]:
data = ConfigData().run_test()

This returns a list of companies with their descriptions: each entry holds the company name at index 0 and the description at index 1.

In [6]:
data[5:7]

[['Klink Mobile',
  'Klink Mobile, Inc. mobile payments company provides secure global application wireless infrastructure enables people from around globe transfer money exchange'],
 ['CompareTheJobBoards.com',
  'A comparison site just jobs. A job seeker can search array adverts across hundreds job boards all in place making']]

We then embed our company universe into representation space:

In [7]:
#!pip install -U transformers tokenizers
model = Model(data)
embed_univ = model.run()

In [9]:
len(embed_univ[0])

768

In [14]:
import Settings
Settings.custom.top_n

10

We can also save this embedding to a CSV file:

In [163]:
import pandas as pd

# Convert the embedding tensor to a NumPy array and write it out as CSV.
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("random_select_train", index=False)

Querying can then be done in one step: we pass in a description sentence and get back the most similar companies.
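The internals of `Query` are not shown here. As a rough sketch, assuming it ranks the company universe by cosine similarity to the embedded query, the lookup could work as follows. The function names `cosine_similarity` and `top_n_matches` are hypothetical, the default `top_n=10` mirrors the `Settings.custom.top_n` value shown above, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_n_matches(query_vec, universe_vecs, records, top_n=10):
    """Return (similarity, record) tuples for the top_n closest records."""
    scored = [
        (cosine_similarity(query_vec, vec), record)
        for vec, record in zip(universe_vecs, records)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Toy data in the same [name, description] layout as `data`.
toy_records = [["A", "payments"], ["B", "job search"], ["C", "mobile payments"]]
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
matches = top_n_matches([1.0, 0.0, 0.0], toy_vecs, toy_records, top_n=2)
```

Here `matches` holds the two records whose vectors point most nearly in the same direction as the query vector.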

In [15]:
ms.configure()
query = Query(embed_univ, data)

In [16]:
query.run("always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many")

([['Ammyy',
   'Ammyy always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many'],
  ['Crescent Technologies',
   'We provide services support achieve business goal by combines tech expertise business intelligent our customers. We take each project seriously'],
  ['Invent Orbit',
   'Invent Orbit technology start-up our mission help people organize share useful information make them accessible globally, currently we working exciting'],
  ['Aventus software',
   'We help Internet-based businesses product companies design develop cloud-native web mobile solutions. We drive digital transformation businesses by helping them'],
  ['Akson Engineering',
   'We use pioneer communication technologies enhance our clients’ business. We listen our clients, share our expertise collaborate build them perfect'],
  ['WamiTech',
   'WamiTech Technology Blog all kinds your day day technology concerns. 

In [20]:
embed_univ

tensor([[ 0.0088, -0.1361, -0.0023,  ..., -0.0164, -0.0305, -0.0437],
        [ 0.0056,  0.0056, -0.0190,  ..., -0.0190, -0.0209,  0.0043],
        [ 0.0603,  0.0040,  0.0058,  ..., -0.0563,  0.0562, -0.0343],
        ...,
        [ 0.0457, -0.0057,  0.0097,  ..., -0.0070,  0.0300, -0.0643],
        [ 0.0057,  0.0242, -0.0316,  ...,  0.0191, -0.0464, -0.0209],
        [ 0.0288, -0.0094, -0.0296,  ..., -0.0153,  0.0108, -0.0491]])

# Saving embedding
Computing the embedding takes some time, so we should save it for later use. We can write it to a CSV file as follows:

In [197]:
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("first10000", index=False)
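Saving only pays off if the embedding can be read back later. The following is a dependency-free sketch of the round trip using the standard `csv` module; the helper names and the toy values are hypothetical. Unlike `DataFrame.to_csv` as used above, this sketch writes no header row. With pandas, `pd.read_csv(path).to_numpy()` would play the role of the loader.

```python
import csv
import os
import tempfile


def save_embeddings(rows, path):
    """Write one embedding vector per CSV row (no header)."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def load_embeddings(path):
    """Read the CSV back into a list of float vectors."""
    with open(path, newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle)]


# Toy 3-dimensional vectors stand in for the real embedding rows.
toy_embeddings = [[0.0088, -0.1361, -0.0023], [0.0056, 0.0056, -0.019]]
csv_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.csv")
save_embeddings(toy_embeddings, csv_path)
restored = load_embeddings(csv_path)
```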

# Experiments

In [None]:
from Experiments.experiments import *

num_words, score = Experiment().run()

# Potential problems


1. We truncate descriptions by character count, which can cut a word in half, so some of the embedded words are nonsense.
2. We currently don't use the fact that the companies are categorized. It is unclear whether this helps or hurts.
3. We don't fine-tune the model.
4. Users with differently formatted data sets will have to look at the `config_data` module.
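For the first problem, one possible remedy is to truncate at word boundaries rather than at a fixed character offset. The sketch below contrasts the two approaches. Both function names are hypothetical, and this is not the logic in `config_data`, which is not shown here.

```python
def trim_by_chars(text, max_chars):
    """Naive cut: may split the last word in the middle."""
    return text[:max_chars]


def trim_by_words(text, max_chars):
    """Keep only whole words whose joined length fits within max_chars."""
    kept = []
    length = 0
    for word in text.split():
        extra = len(word) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(word)
        length += extra
    return " ".join(kept)


description = "mobile payments company provides secure global infrastructure"
char_cut = trim_by_chars(description, 20)  # ends in the fragment "comp"
word_cut = trim_by_words(description, 20)  # ends on a complete word
```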


# Model evaluation

We evaluate the model by computing the Spearman rank correlation between the similarities of its embeddings and "gold standard" scores. The calibration sentences and their scores are set in `settings.custom` as follows:

In [56]:
# The same reference sentence is compared against five different companies.
grabango = 'Grabango is the leader in checkout-free technology for existing, large-scale grocery and convenience stores'
custom.sentences1 = [grabango] * 5
custom.sentences2 = [
    'Zippin is the next generation of checkout-free technology enabling retailers to quickly deploy frictionless shopping in their stores.',
    'Standard Cognition provides an autonomous checkout tool that can be installed into retailers’ existing stores.',
    'AiFi enables reliable, cost-effective, and contactless autonomous shopping with AI-powered computer vision technology.',
    'Moveworks offers an AI platform that revolutionizes how companies support their employees.',
    'Tonkean provides an enterprise no-code process orchestration platform.',
]
custom.scores = [1, 1, 1, 0.03, 0.05]

Then, whenever we run `Query`, we get the evaluation scores along with the query results.
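To make the metric concrete, here is a minimal, dependency-free sketch of the Spearman rank correlation: rank both lists (tied values share the mean of their ranks), then take the Pearson correlation of the ranks. The `predicted` similarities are made-up illustration values, not model output, and the function names are hypothetical. `scipy.stats.spearmanr` computes the same statistic.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


gold_scores = [1, 1, 1, 0.03, 0.05]          # custom.scores from above
predicted = [0.81, 0.74, 0.77, 0.12, 0.20]   # made-up cosine similarities
rho = spearman(predicted, gold_scores)
```

A value of `rho` close to 1 means the model orders the sentence pairs the same way as the gold standard, even if the raw similarity values differ from the gold scores.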

# Model fine tuning
So far, we have used the model "as-is". For better results, we will have to fine-tune it.

![image](report/training.jpeg)

## Training data
The training data should consist of sentence pairs, each with a labelled similarity score.
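As an illustration of this format, the sketch below bundles two sentence lists and a score list into labelled pairs, with basic validation. `LabeledPair` and `make_pairs` are hypothetical names, and the assumption that scores lie in [0, 1] is based only on the calibration scores used earlier. In the sentence-transformers library, `InputExample(texts=[sentence1, sentence2], label=score)` holds the same information.

```python
from typing import List, NamedTuple


class LabeledPair(NamedTuple):
    sentence1: str
    sentence2: str
    score: float  # similarity label, assumed here to lie in [0, 1]


def make_pairs(sentences1, sentences2, scores) -> List[LabeledPair]:
    """Zip parallel lists into labelled pairs, rejecting malformed input."""
    if not (len(sentences1) == len(sentences2) == len(scores)):
        raise ValueError("sentences1, sentences2 and scores must have equal length")
    pairs = []
    for first, second, score in zip(sentences1, sentences2, scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
        pairs.append(LabeledPair(first, second, float(score)))
    return pairs


toy_pairs = make_pairs(
    ["checkout-free technology for grocery stores"] * 2,
    ["autonomous checkout for retailers", "no-code process orchestration"],
    [1, 0.05],
)
```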

In [165]:
len(embed_univ)

50000

In [166]:
from Data.train_prep import *

datapreparing = TrainPrep(embed_univ, data)

In [167]:
pairs_data = datapreparing.run()

In [168]:
pd.DataFrame(pairs_data).to_csv("test.csv")

1741

Train the model:

In [None]:
from Model.train_model import *

Trainer = TrainModel(sentences1, sentences2, scores)