# Introduction

This code is based on SBERT (Sentence-BERT, distributed as the `sentence-transformers` library), which provides a streamlined API for comparing sentences using BERT-style models. See the documentation at https://www.sbert.net/docs/training/overview.html

## Model Architecture

![image](report/uml.jpeg)



# Configure settings and data

In [2]:
%load_ext autoreload
%autoreload 2

from Settings.settings import *
from Data.config_data import *
from Model.model import *
from Query.query import *

ms = Settings()
ms.configure()



Configure data: this step reads the CSV file, extracts each company's description, and trims it.

In [4]:
data = ConfigData().run_test()

We can see that this returns a list of companies with their respective descriptions: the company name is at index 0 and the description at index 1.

In [5]:
data[4:5]

[['Juick', 'IM-based social network and microblogging service.']]
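For readers without access to `config_data`, the shape of this output can be reproduced with a minimal sketch. The helper `load_companies`, the column names `name` and `short_description`, the 200-character limit, and the second sample row are all assumptions made for illustration; they are not taken from the project's actual implementation.

```python
import csv
import io

def load_companies(csv_text, name_col="name", desc_col="short_description", max_chars=200):
    """Hypothetical helper: return [[name, description], ...] from CSV text."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        name = (record.get(name_col) or "").strip()
        desc = (record.get(desc_col) or "").strip()
        if not name or not desc:
            continue  # skip companies without a usable description
        # Character-based trim, as described above (may cut a word in half)
        rows.append([name, desc[:max_chars]])
    return rows

# Tiny in-memory CSV; the second row is made up to show the skip logic
sample = (
    "name,short_description\n"
    "Juick,IM-based social network and microblogging service.\n"
    "NoDescription Inc,\n"
)
companies = load_companies(sample)
```

Each entry is a two-element list, so `companies[0][0]` is the name and `companies[0][1]` is the description, matching the layout shown above.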

We then embed our company universe into the representation (embedding) space.

In [6]:
model = Model(data)
embed_univ = model.run()


In [54]:
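import numpy as np
import pandas as pd  # may already be in scope via the star imports above

# Convert the embedding tensor to a NumPy array and write it out as CSV
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("ITonly_short_desc", index=False)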

Querying can then be done in one step: we pass in a description sentence and get back the most similar companies.

In [55]:
ms.configure()
query = Query(embed_univ, data)

In [61]:
query.run("is an enterprise automation platform that helps organizations work faster and smarter without compromising governance and security.")

[['MergeYourData.com', 'Business Automation Specialists'],
 ['Jareva Technologies',
  'Provides information technology automation software'],
 ['Innovative Business Software',
  'Innovative Business Software develops automation for the world’s leading security monitoring providers, and those that monitor their own.'],
 ['Professional Computing Resources (PCR)',
  'Provides an Enterprise-level management tool that tracks, manages, and bills – assets, people, operations and workflow.'],
 ['Automation Technology',
  'Automation Technology is a provider of web-based collaborative asset management software for the electrical power generation industry.'],
 ['Epic-Premier Insurance Solutions',
  'Delivering Intelligent Automation Tools & Services.'],
 ['AutomationEdge',
  'AutomationEdge is the leading IT Automation and Robotic Process Automation Solution.'],
 ['CloudRunner.io Inc',
  'CloudRunner designs and develops IT infrastructure automation framework.'],
 ['Smart Integration', 'Smart In
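The internals of `Query.run` are not shown in this notebook. One common way to implement such a lookup is to embed the query sentence with the same model and rank the universe by cosine similarity. The sketch below illustrates only that ranking step, using toy 3-dimensional vectors in place of real embeddings; `cosine`, `top_k_similar`, and the toy data are hypothetical and not the project's code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_similar(query_vec, universe_vecs, data, k=3):
    """Return the k rows of `data` whose vectors are most similar to `query_vec`."""
    scored = [(cosine(query_vec, vec), row) for vec, row in zip(universe_vecs, data)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [row for _, row in scored[:k]]

# Toy stand-ins for [name, description] rows and their embeddings
toy_data = [["A", "automation software"], ["B", "grocery checkout"], ["C", "IT automation"]]
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
result = top_k_similar([1.0, 0.0, 0.0], toy_vecs, toy_data, k=2)
```

With real embeddings, `sentence_transformers.util.cos_sim` computes the same similarity on tensors.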

# Saving embedding
Computing the embedding takes some time, so we should save it for later use. We can write it to a CSV file as follows:

In [197]:
emb_univ = embed_univ.numpy()
dff = pd.DataFrame(emb_univ)
dff.to_csv("first10000", index=False)
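Saving is only useful if the matrix can be read back in the same shape. With pandas, `pd.read_csv` is the counterpart of `to_csv`. The sketch below shows the same round trip using only the standard library, so it runs without the project's dependencies. The helper names, the file name, and the two toy vectors are placeholders, not the project's data.

```python
import csv
import os
import tempfile

def save_embeddings(path, vectors):
    """Write one vector per row, with a header of column indices
    (the same layout DataFrame.to_csv(index=False) produces for an unnamed array)."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(range(len(vectors[0])))
        writer.writerows(vectors)

def load_embeddings(path):
    """Read the file back into a list of float vectors."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        return [[float(x) for x in row] for row in reader]

toy = [[0.25, -1.5, 3.0], [0.0, 2.0, -0.125]]  # toy vectors, not real embeddings
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.csv")
save_embeddings(path, toy)
restored = load_embeddings(path)
```

Row order is what ties each vector back to its company, so the saved matrix should always be reloaded together with the same `data` list it was computed from.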

# Potential problems


1. We truncate descriptions by character count, which can cut a word in half, so some of the embedded text contains nonsense word fragments.
2. We currently don't use the fact that these companies are categorized. It is unclear whether this helps or hurts the results.
3. We don't fine-tune the model.
4. For differently formatted data sets, the user will have to adapt the `config_data` module.
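The first problem has a simple possible mitigation: truncate at the last word boundary before the limit instead of at an arbitrary character. The function `trim_at_word` below is a hypothetical sketch of that idea, not something the project currently does, and the limit of 20 characters is chosen only to make the effect visible.

```python
def trim_at_word(text, max_chars):
    """Truncate text to at most max_chars characters without splitting a word."""
    text = text.strip()
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the character right after the cut is not whitespace, the cut landed
    # inside a word, so drop the partial word at the end.
    if not text[max_chars].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:  # a single over-long word falls back to the plain character cut
            cut = head
    return cut.rstrip()

desc = "IM-based social network and microblogging service."
by_char = desc[:20]               # 'IM-based social netw'
by_word = trim_at_word(desc, 20)  # 'IM-based social'
```

The word-boundary version loses a few characters of context but never feeds the model a fragment such as `netw`.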


# Model evaluation

We evaluate the model by comparing the similarity of its embeddings for sentence pairs against hand-assigned "gold standard" scores, using the Spearman rank correlation.

In [56]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# The same reference description is compared against five other companies
grabango = 'Grabango is the leader in checkout-free technology for existing, large-scale grocery and convenience stores'
sentences1 = [grabango] * 5
sentences2 = [
    'Zippin is the next generation of checkout-free technology enabling retailers to quickly deploy frictionless shopping in their stores.',
    'Standard Cognition provides an autonomous checkout tool that can be installed into retailers’ existing stores.',
    'AiFi enables reliable, cost-effective, and contactless autonomous shopping with AI-powered computer vision technology.',
    'Moveworks offers an AI platform that revolutionizes how companies support their employees.',
    'Tonkean provides an enterprise no-code process orchestration platform.',
]
# Gold-standard similarity of each pair (1 = direct competitor, near 0 = unrelated)
scores = [1, 1, 1, 0.03, 0.05]


In [57]:

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, write_csv=True)

In [29]:
from sentence_transformers import SentenceTransformer, util
mod = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [58]:
evaluator(mod)

0.8944271909999159
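To interpret this number, it helps to compute a Spearman correlation by hand against the same gold scores. The sketch below is a plain-Python version that assigns average ranks to ties, which is how `scipy.stats.spearmanr` treats them; we assume the evaluator behaves the same way. The `predicted` similarities are made up for illustration and are not the model's actual outputs.

```python
import math

def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

gold = [1, 1, 1, 0.03, 0.05]
predicted = [0.9, 0.8, 0.7, 0.1, 0.2]  # made-up similarities, not model output
rho = spearman(predicted, gold)        # 2 / sqrt(5), about 0.8944
```

Because three gold scores are tied at 1, any tie-free set of predictions that ranks the three competitors above the two unrelated companies, and the 0.05 pair above the 0.03 pair, yields exactly 2/√5 ≈ 0.8944. Under the tie-handling assumption above, the reported value is therefore the highest this gold set allows for tie-free predictions, rather than a sign of a remaining ranking error. Using distinct gold scores would make the metric more informative.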

In [61]:
a = np.array([2,3,4,5,56,4])

In [64]:
len(np.where(a >2)[0])

5