# Introduction

This code is based on SBERT (Sentence-BERT), a streamlined API for embedding and comparing sentences, built on BERT. See the documentation at https://www.sbert.net/docs/training/overview.html

## Model Architecture

![image](report/uml.jpeg)



# Configure settings and data

In [6]:
%load_ext autoreload
%autoreload 2

from Settings.settings import *
from Data.config_data import *
from Model.model import *
from Query.query import *

ms = Settings()
ms.configure()



Configure data: this step reads the CSV file, extracts each company's description, and trims it.

In [7]:
data = ConfigData().run_test()

We can see that this returns a list of companies with their descriptions: each entry holds the company name at index 0 and the description at index 1.

In [8]:
data[5:7]

[['Klink Mobile',
  'Klink Mobile, Inc. mobile payments company provides secure global application wireless infrastructure enables people from around globe transfer money exchange'],
 ['CompareTheJobBoards.com',
  'A comparison site just jobs. A job seeker can search array adverts across hundreds job boards all in place making']]
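
The internals of `ConfigData` are not shown here. As a rough sketch of what such a step might look like, assuming a CSV with hypothetical `name` and `description` columns and a hypothetical word limit, the descriptions could be extracted and trimmed like this (the sample row is made up):

```python
import csv
import io


def load_descriptions(csv_text, max_words=20):
    """Return [name, trimmed_description] pairs from CSV text.

    The column names and max_words are assumptions for illustration,
    not the actual ConfigData settings.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        description = (record.get("description") or "").strip()
        if not description:
            continue  # skip companies without a description
        words = description.split()
        rows.append([record["name"], " ".join(words[:max_words])])
    return rows


# Made-up sample input: one usable row and one row with no description.
sample = "name,description\nAcme,Acme builds rockets for coyotes\nEmpty,\n"
print(load_descriptions(sample, max_words=3))
```

Trimming by whole words, as in this sketch, keeps every remaining token intact; the actual module may trim differently.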

We then embed our company universe into the representation space.

In [9]:
model = Model(data)
embed_univ = model.run()

In [54]:
import pandas as pd

# Convert the embedding tensor to a NumPy array and save it as CSV.
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("ITonly_short_desc", index=False)

Querying can then be done in one step: we input a description sentence and get the most similar companies as output.

In [13]:
ms.configure()
query = Query(embed_univ, data)

In [14]:
query.run("is an enterprise automation platform that helps organizations work faster and smarter without compromising governance and security.")

([['Ammyy',
   'Ammyy always developing cutting edge internet solutions. Our team has researched informational technologies automation management remote computer access services many'],
  ['IKA Platform',
   'IKA Siri Office: Enterprise Automation vía Conversational Artificial Intelligence. Automate your processes workflows in 60 seconds, not in 60 days,'],
  ['oInvoices',
   'Online invoicing service cloud. It allows small companies startups manage invoicing from everywhere. Compatible tablets smarphones, you can create manage'],
  ['Thorium Data',
   "Thorium Data Information Asset Management (IAM) solution helps companies better monetize data. It's ML-driven IAM platform gives all your departments"],
  ['SmartOrg',
   'SmartOrg® provides software services help companies evaluate opportunities make best decisions about where invest. Companies like HP use us in'],
  ['QuaNode',
   'QuaNode, established in 2016 as software company working in-house software projects enabling multiple st
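
The internals of `Query` are not shown here. A common way to implement this kind of search is to embed the query sentence and rank the universe by cosine similarity. The sketch below assumes that approach and uses made-up 2-dimensional vectors in place of real SBERT embeddings; the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, universe_vecs, entries, k=2):
    """Return the k entries whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), entry)
              for vec, entry in zip(universe_vecs, entries)]
    # Sort on the score only, highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


# Toy [name, description] pairs with invented embeddings.
toy_data = [["A", "payments"], ["B", "job search"], ["C", "mobile payments"]]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_vecs, toy_data, k=2))
```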

# Saving embedding
Computing the embedding takes some time, so we should save it for later use. We can write it to a CSV file as follows:

In [197]:
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("first10000", index=False)
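
To reuse a saved embedding later, the file has to be read back in. With pandas this would typically be done with `pd.read_csv`. The dependency-free sketch below shows the same idea with the standard library, including the header row of column indices that `to_csv(..., index=False)` writes by default. The function name and the temporary toy file are illustrative assumptions.

```python
import csv
import os
import tempfile


def load_embeddings(path):
    """Read a CSV of floats into a list of rows, skipping the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row of column indices
        return [[float(cell) for cell in row] for row in reader if row]


# Round trip on a tiny made-up file shaped like the pandas output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embeddings")
    with open(path, "w", newline="") as handle:
        handle.write("0,1\n0.5,-0.25\n1.0,2.0\n")
    loaded = load_embeddings(path)

print(loaded)
```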

# Potential problems


1. We truncate descriptions by character count, which can cut a word in half, so we end up embedding some nonsense words.
2. We currently don't use the fact that these companies are categorized. It is unclear whether this helps or hurts.
3. We don't fine-tune the model.
4. Users will have to adapt the `config_data` module for data sets with a different format.
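
For the first point in the list above, one possible remedy is to trim at the last whitespace before the limit rather than at a fixed character offset. A minimal sketch, with a hypothetical helper name and hypothetical limits:

```python
def trim_at_word_boundary(text, max_chars):
    """Trim text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the next character is whitespace, the cut already ends on a word.
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word.
    parts = cut.rsplit(None, 1)
    # A single word longer than the limit has no boundary: keep the hard cut.
    return parts[0] if len(parts) == 2 else cut


print(trim_at_word_boundary("mobile payments company", 10))
```

In this sketch a description that consists of one very long word is still cut by character, since there is no earlier boundary to fall back on.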


# Model evaluation

We evaluate the model by comparing the similarity of its embeddings against "gold standard" scores, using the Spearman rank correlation. These calibration sentences are set in `settings.custom` as follows:

In [56]:
grabango = 'Grabango is the leader in checkout-free technology for existing, large-scale grocery and convenience stores'
custom.sentences1 = [grabango] * 5
custom.sentences2 = [
    'Zippin is the next generation of checkout-free technology enabling retailers to quickly deploy frictionless shopping in their stores.',
    'Standard Cognition provides an autonomous checkout tool that can be installed into retailers’ existing stores.',
    'AiFi enables reliable, cost-effective, and contactless autonomous shopping with AI-powered computer vision technology.',
    'Moveworks offers an AI platform that revolutionizes how companies support their employees.',
    'Tonkean provides an enterprise no-code process orchestration platform.',
]
custom.scores = [1, 1, 1, 0.03, 0.05]

Then, whenever we run `Query`, we get the evaluation scores along with the results.
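
How the correlation is computed inside `Query` is not shown here. As a sketch of the metric itself: the Spearman rank correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names below are hypothetical, and the predicted similarities are invented for illustration, not model output.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation; assumes neither input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


gold = [1, 1, 1, 0.03, 0.05]                # the scores set above
predicted = [0.82, 0.74, 0.69, 0.21, 0.18]  # invented similarities
print(spearman(gold, predicted))
```

Because three of the gold scores are tied, the coefficient cannot reach 1 unless the predictions are tied in the same way, which is worth keeping in mind when reading the reported value.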

# Model fine-tuning
So far, we have used the model "as-is". For better results, we will have to fine-tune it.

## Training data
The training data should consist of pairs of sentences, each pair labelled with a similarity score.

In [29]:
from Model.train_model import *

Trainer = TrainModel(sentences1, sentences2, scores)