# Background key phrase matching

This example matches a fixed set of background key phrases against the key phrases extracted from a text,
using the [EmbedRank implementation](https://github.com/swisscom/ai-research-keyphrase-extraction).

EmbedRank embeds both the document and the background phrases into the same embedding space.
The current background phrases are:

**"hill", "beach", "mountain", "valley", "city"**

A suitable tag is selected with [Maximal Marginal Relevance](https://medium.com/tech-that-works/maximal-marginal-relevance-to-rerank-results-in-unsupervised-keyphrase-extraction-22d95015c7c5#:~:text=Maximal%20Marginal%20Relevance%20a.k.a.%20MMR,already%20ranked%20documents%2Fphrases%20etc.) (MMR).
The cosine similarity between each background tag and the document
models informativeness, and the cosine similarity between
the tags themselves models diversity.
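
The trade-off described above can be sketched in a few lines of plain Python. This is a simplified illustration of greedy MMR selection, not the code used by the EmbedRank implementation (it omits any score normalization, for instance). The function names `cosine` and `mmr`, the `beta` parameter, and the toy 2-D vectors are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, phrase_vecs, top_n, beta=0.5):
    """Greedy MMR selection; returns phrases in the order they were picked.

    beta (hypothetical name) weights informativeness against diversity:
    beta=1.0 ranks purely by similarity to the document.
    """
    selected = []
    remaining = list(phrase_vecs)
    while remaining and len(selected) < top_n:

        def score(phrase):
            # Informativeness: similarity between the phrase and the document.
            informativeness = cosine(phrase_vecs[phrase], doc_vec)
            # Redundancy: highest similarity to any phrase already selected.
            redundancy = max(
                (cosine(phrase_vecs[phrase], phrase_vecs[s]) for s in selected),
                default=0.0,
            )
            return beta * informativeness - (1 - beta) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D vectors invented for illustration; real embeddings are high-dimensional.
doc = [1.0, 0.0]
phrases = {
    "mountain": [0.9, 0.1],
    "hill": [0.88, 0.12],   # nearly parallel to "mountain"
    "beach": [0.8, -0.6],   # less similar to the document, but different
}

print(mmr(doc, phrases, top_n=2))            # balanced relevance and diversity
print(mmr(doc, phrases, top_n=2, beta=1.0))  # relevance only
```

With these toy vectors, the balanced setting picks "mountain" and then "beach", because "hill" is almost identical to the phrase already chosen. With `beta=1.0` the diversity term vanishes and "hill" is picked second.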

[Example text](text.txt)

### Libraries

In [16]:
import sys
import pandas as pd

sys.path.append("..")

from swisscom import launch

### Reading text

In [17]:
# The context manager closes the file once the text has been read
with open("text.txt") as file:
    raw_text = file.read()

### Creating embedding distributor and part-of-speech tagger

In [18]:
embedding_distributor = launch.load_local_embedding_distributor()
pos_tagger = launch.load_local_corenlp_pos_tagger()

### Matching key phrases

In [19]:
kp = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 5, 'en')  # extract 5 keyphrases

phrases, relevances, aliases = kp

['hill' 'beach' 'mountain' 'valley' 'city']
['area countryside right words week look words phrases different landscapes basic description area land grass trees green few green spaces city area green way attractive lush lush green valleys literary word verdant verdant meadows landscape few plants little rain arid few animals arid desert landscape technical description area little rain dry semi-arid semi-arid zone land dry rain long time earth/fields sun-baked land hard dry little rain long sun-baked earth full cracks other words shape land hilly area lots hills countryside round phrase hills descriptions attractive landscapes many gentle hills hills literary word undulating type landscape picturesque village hills landscape hills mountains mountainous mountainous region mountains snow top snow-capped snow-capped mountain range shape land craggy area lots rocks craggy coastline rugged similar area land wild flat photographs rugged landscape region course landscapes green hilly area flat 

In [20]:
data = { 'Phrase': phrases,
         'Relevance': relevances,
         'Aliases': aliases
         }

df = pd.DataFrame(data, columns=['Phrase', 'Relevance', 'Aliases'])
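# Styler.hide_index() was deprecated in pandas 1.4 and removed in 2.0;
# on older pandas versions use df.style.hide_index() instead
df.style.hide(axis="index")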


| Phrase | Relevance | Aliases |
|---|---|---|
| beach | 1.0 | ['city', 'mountain', 'valley', 'hill'] |
| mountain | 0.990368 | ['valley', 'city', 'hill', 'beach'] |
| hill | 0.954817 | ['valley', 'mountain', 'city', 'beach'] |
| city | 0.957754 | ['mountain', 'beach', 'valley', 'hill'] |
| valley | 0.955975 | ['mountain', 'hill', 'city', 'beach'] |
