## Text Vectorisation

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

import fetch
import modules.urls as urls
from fetch.utils import json_or_fetch

In [2]:
df_company = pd.read_csv('data/3_geolocation.csv', index_col=0)

In [3]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [4]:
corpus_embeddings = embedder.encode(df_company['industrydesc'].tolist())  # pass a plain list of strings

### Which two companies are closest?

In [5]:
idx1 = 0
idx2 = 0
score = -1.0  # lower bound of cosine similarity, so negative scores can still win

for i, query_embedding in enumerate(corpus_embeddings):
    for j, corpus_embedding in enumerate(corpus_embeddings):
        sim = util.cos_sim(query_embedding, corpus_embedding)
        val = sim.tolist()[0][0]
        # if new max similarity and not matching self
        if score < val and i != j:
            idx1 = i
            idx2 = j
            score = val

print(idx1, idx2, score)
print(df_company.iloc[[idx1, idx2]][['name', 'industrydesc']])

6 8 1.0
                                name  \
KMD A/S                      KMD A/S   
Alpha Solutions  ALPHA SOLUTIONS A/S   

                                                      industrydesc  
KMD A/S          Konsulentbistand vedrørende informationsteknologi  
Alpha Solutions  Konsulentbistand vedrørende informationsteknologi  


The [SentenceTransformers website](https://sbert.net/docs/usage/semantic_textual_similarity.html) describes another way of doing this that is likely more efficient, since it computes all pairwise similarities in one call instead of one pair at a time.
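As a rough illustration of the "all pairs at once" idea, here is a minimal sketch using plain NumPy. The function name `closest_pair` and the toy vectors are made up for this example; in the notebook the input would be `corpus_embeddings`, and it assumes no embedding is the zero vector.

```python
import numpy as np

def closest_pair(embeddings: np.ndarray) -> tuple[int, int, float]:
    """Return (i, j, similarity) for the most similar pair of distinct rows."""
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T  # full (n, n) similarity matrix
    # Mask the diagonal so a row is never matched with itself.
    np.fill_diagonal(sims, -np.inf)
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(i), int(j), float(sims[i, j])

# Made-up 2-D vectors standing in for real sentence embeddings.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 0.1],
    [-1.0, 0.5],
])
i, j, score = closest_pair(toy_embeddings)
print(i, j, round(score, 4))
```

Masking the diagonal plays the same role as the `i != j` check in the loop, and the matrix product replaces the two nested loops.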

### Scraping

In [6]:
df_company.index

Index(['Dynatest A/S', 'Eriksholm Research Centre, Oticon', 'Formpipe',
       'Novo Nordisk', 'PFA', 'Topdanmark', 'KMD A/S', 'NorthTech ApS',
       'Alpha Solutions', 'Dafolo', 'Nuuday A/S', 'Netcompany A/S',
       'Wash World', 'Carve', 'GroupM', 'Brøndbyernes I.F.', 'Meew', 'Funelo',
       'PreCure', 'Wilke', 'OOONO', 'Elbek & Vejrup', 'Ellab', 'Lejka',
       'Firi'],
      dtype='object')

In [7]:
keys = urls.websites
args = tuple(zip(keys))  # zip over a single iterable wraps each key in a 1-tuple: (key,)

In [8]:
texts = json_or_fetch(fetch.scrapetext, keys, args, path='data/website_text.json')
texts = {k: ' '.join(v) for k, v in texts.items()}
# Join each website's list of paragraphs into a single string, so that encode() yields one embedding per website (which I think is what I actually need).
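The source of `json_or_fetch` is not shown in this notebook. Judging only by how it is called, it looks like a "load from JSON if cached, otherwise fetch and save" helper. The sketch below is a guess at that pattern, not the actual implementation: `load_or_fetch` and `fake_scrape` are hypothetical names, and the scraper is replaced by a stand-in that needs no network access.

```python
import json
import os
import tempfile

def load_or_fetch(func, keys, args, path):
    """Hypothetical cache helper: reuse the JSON file at `path` if it exists,
    otherwise call func(*a) for each key and save the results."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    results = {key: func(*a) for key, a in zip(keys, args)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False)
    return results

# Stand-in for a scraper: returns a list of "paragraphs" without touching the network.
def fake_scrape(url):
    return [f"first paragraph of {url}", f"second paragraph of {url}"]

keys = ["https://example.com", "https://example.org"]
args = tuple(zip(keys))  # one-element argument tuple per key
cache_path = os.path.join(tempfile.mkdtemp(), "website_text.json")

texts = load_or_fetch(fake_scrape, keys, args, cache_path)   # calls fake_scrape
cached = load_or_fetch(fake_scrape, keys, args, cache_path)  # reads the JSON file
joined = {k: " ".join(v) for k, v in cached.items()}
print(joined["https://example.com"])
```

A cache like this means the websites are only scraped once; deleting the JSON file forces a fresh scrape.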

In [9]:
corpus_embeddings = embedder.encode(list(texts.values()))

### Querying

In [10]:
queries = ['Software', 'development', 'udvikling', 'programmering', 'programming']
query_embedding = embedder.encode(queries)

In [11]:
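# NOTE: names are looked up in df_company by position, which assumes corpus_embeddings
# (built from texts.values()) is in the same order as the rows of df_company.
for idx, embedding in enumerate(corpus_embeddings):
    sim = util.cos_sim(query_embedding, embedding)  # shape: (len(queries), 1)
    # [0][0] is the score for the first query only ('Software'); the other queries are ignored here
    print("{}:\t\t{:.4f}".format(df_company.iloc[idx]['name'], sim.tolist()[0][0]))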

A/S DYNATEST ENGINEERING:		0.1169
PROPOLIS RESEARCH CENTRE A/S:		0.1714
FORMPIPE LASERNET A/S:		0.1248
NOVO NORDISK A/S:		0.1259
PFA BANK A/S:		-0.0880
TOPDANMARK A/S:		0.1361
KMD A/S:		0.1953
NORTHTECH ApS:		0.2179
ALPHA SOLUTIONS A/S:		0.2515
DAFOLO A/S:		0.1993
Nuuday A/S:		0.1135
Netcompany A/S:		0.1811
WASH WORLD ApS:		0.1309
CARVE KOMPLEMENTAR ApS:		0.0936
GROUPM DENMARK A/S:		0.1012
BRØNDBYERNES I.F. FODBOLD A/S:		0.0159
MeeW A/S:		0.1165
Funelo ApS:		0.2638
PreCure ApS:		0.1899
Wilke A/S:		0.1724
ooono A/S:		0.2770
ELBEK & VEJRUP A/S:		0.1455
ELLAB A/S:		0.3604
LEJKA ApS:		0.0702
Firi, filial af Firi AS:		-0.0107
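The scores above come from `sim.tolist()[0][0]`, which only reads the similarity to the first query ('Software'). One possible way to use all queries is to take, for each document, the best score over the queries. The sketch below shows that with plain NumPy; `score_documents`, the names, and the vectors are made up for illustration. Taking the maximum rather than the mean is an assumption about what is wanted here.

```python
import numpy as np

def score_documents(query_emb: np.ndarray, doc_emb: np.ndarray) -> np.ndarray:
    """Return, for each document, its highest cosine similarity to any query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T            # shape: (n_queries, n_documents)
    return sims.max(axis=0)   # best-matching query per document

# Made-up vectors and names, only to show the shapes involved.
names = ["site-a", "site-b", "site-c"]
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, -2.0]])

scores = score_documents(query_emb, doc_emb)
# Sort so the best-matching documents are printed first.
for name, s in sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{name}:\t{s:.4f}")
```

In the notebook, taking the labels from `texts.keys()` instead of `df_company` would keep names and embeddings aligned, since both then come from the same dictionary.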


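Whelp. Nuuday came out as the highest scoring even though it had literally 0 `<p>` tags. It's a JavaScript-rendered website, so we need Selenium to get proper data there. (In the output shown above, ELLAB A/S is highest at 0.3604 and Nuuday scores 0.1135, so this remark seems to refer to a run from before the selector change described below.)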

I've changed the selector to include `<h1>` and `<h2>` as well as `<p>`. That might make the scores more accurate, since sites with few paragraphs still contribute some text.
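The scraper itself (`fetch.scrapetext`) is not shown here, so the following is only a sketch of what "selecting `<p>`, `<h1>` and `<h2>`" can look like, using the standard library's `html.parser`. The class `TextExtractor` and the function `extract_text` are hypothetical, and the sample HTML is made up. It assumes reasonably well-formed markup where the selected tags are closed.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text inside selected tags (hypothetical stand-in for the scraper)."""

    def __init__(self, tags=("p", "h1", "h2")):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # > 0 while inside one of the selected tags
        self.chunks = []    # one entry per selected element

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1
            if self.depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is kept as part of the chunk.
        if self.depth > 0:
            self.chunks[-1] += data

def extract_text(html, tags=("p", "h1", "h2")):
    parser = TextExtractor(tags)
    parser.feed(html)
    parser.close()
    return [c.strip() for c in parser.chunks if c.strip()]

sample = ("<html><body><h1>Acme</h1><div>menu</div>"
          "<h2>What we do</h2><p>We build <b>software</b>.</p></body></html>")
print(extract_text(sample))               # headings and paragraph
print(extract_text(sample, tags=("p",)))  # paragraph only
```

This only sees the HTML as delivered by the server, so a JavaScript-rendered page would still come back mostly empty.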