# Semantic text search
In this notebook we implement a semantic search engine for similar texts using a *Sentence Transformer* model (https://github.com/UKPLab/sentence-transformers). \
We use the news articles from the Lee dataset, the same one used in NLP_07.\
The `sentence-transformers` library must be installed first:\
```pip install -U sentence-transformers```

In [1]:
import os
import re
import numpy as np
import pandas as pd
import gensim
from sentence_transformers import SentenceTransformer


In [2]:
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_data_file = os.path.join(data_dir, 'lee_background.cor')

In [3]:
# Read all the news articles (one document per line)
# With transformers we can skip text preprocessing
with open(lee_data_file) as f:
    corpus = f.readlines()

In [4]:
len(corpus)

300

In [5]:
display(corpus[0])

'Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year\'s Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at th

In [6]:
# With this library, tokenization and encoding happen in a single step

model = SentenceTransformer('all-MiniLM-L6-v2')
# Call model.encode() to create one embedding per document
doc_embeddings = model.encode(corpus)

In [7]:
doc_embeddings.shape

(300, 384)

The embeddings generated for each document are what we use to compute the similarity between documents (with cosine similarity). This is known as the *Bi-encoder* technique:\
>A Bi-Encoder Sentence Transformer model takes in one text at a time as input and outputs a fixed dimension embedding vector as the output. We can then compare any two documents by computing the cosine similarity between the embeddings of those two documents.
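
To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration (the embeddings in this notebook have 384 dimensions), and `cosine_sim` is a hypothetical helper, not part of `sentence-transformers` or scikit-learn.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings", only to show how the score behaves
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a -> similarity close to 1
doc_c = [0.0, 0.0, 3.0]   # orthogonal to doc_a     -> similarity 0

print(cosine_sim(doc_a, doc_b))
print(cosine_sim(doc_a, doc_c))
```

The score depends only on the angle between the vectors, not on their length, which is why `doc_a` and `doc_b` count as near-identical even though one is twice as long as the other.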

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarity of every document against every other document
sims = cosine_similarity(doc_embeddings, doc_embeddings)
sims.shape

(300, 300)

Similarity of the first document to all the others:

In [15]:
sims[0, :]

array([ 1.0000002 ,  0.1862439 ,  0.36024007,  0.06647055,  0.16054457,
        0.25456682,  0.07130697,  0.1448866 ,  0.71996063,  0.45955154,
        0.43107116,  0.36462575,  0.09600542,  0.05139546,  0.35125983,
        0.22654007,  0.10755338,  0.1285992 ,  0.06746032,  0.6088064 ,
        0.00728194,  0.3583024 ,  0.10069869,  0.22394757,  0.22305846,
        0.46384364,  0.1385611 ,  0.1459111 ,  0.12593803,  0.19521032,
        0.03333723,  0.18033558,  0.33452058,  0.786766  ,  0.11982653,
        0.15185414,  0.13297436,  0.17609438,  0.28221866,  0.19795886,
        0.7253157 ,  0.11417238,  0.14895943,  0.31006202,  0.34341425,
        0.06606716,  0.27094966,  0.15648553,  0.74195844,  0.08219413,
        0.2194731 ,  0.03587104,  0.29266202,  0.1617104 ,  0.03347444,
        0.28007174,  0.05729619,  0.11395875,  0.20978498,  0.15481192,
        0.17262931,  0.00703743,  0.11809213,  0.13129808,  0.21023971,
        0.14367817,  0.07729248,  0.10905086,  0.02731548,  0.10

In [16]:
# Sort from highest to lowest similarity
sims_sorted = sorted(enumerate(sims[0,:]), key=lambda item: -item[1])
print(sims_sorted[:10])

[(0, 1.0000002), (33, 0.786766), (48, 0.74195844), (40, 0.7253157), (8, 0.71996063), (264, 0.63760084), (19, 0.6088064), (255, 0.52170897), (272, 0.46970248), (25, 0.46384364)]
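
The first entry in the ranking is the query document itself, with a similarity of about 1. A small helper can skip it explicitly instead of relying on slicing. This is a sketch with made-up scores; `top_k` and its `exclude` parameter are hypothetical names introduced here.

```python
def top_k(scores, k, exclude=None):
    """Return the k (index, score) pairs with the highest score.

    `exclude` is an optional index to skip, e.g. the query document itself.
    """
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return [(i, s) for i, s in ranked if i != exclude][:k]

# Made-up similarity scores for a 5-document corpus; document 0 is the query
scores = [1.0, 0.2, 0.8, 0.1, 0.5]
print(top_k(scores, k=3, exclude=0))
# -> [(2, 0.8), (4, 0.5), (1, 0.2)]
```

With the notebook's variables this would presumably be called as `top_k(sims[0, :], k=5, exclude=0)`.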


In [17]:
# Closest article (position 0 in the ranking is the query document itself)
display(corpus[sims_sorted[1][0]])

'New South Wales firefighters are hoping lighter winds will help ease their workload today but are predicting "nasty" conditions over the weekend. While the winds are expected to ease somewhat today, the weather bureau says temperatures will be higher. More than 100 fires are still burning across New South Wales. The Rural Fire Service says the change may allow it to concentrate more on preventative action, but there is no room for complacency. Mark Sullivan, from the Rural Fire Service, says while conditions may be a little kinder to them today, the outlook for the weekend has them worried. "It certainly appears from the weather forecast, with very high temperatures and high winds that it certainly could be a nasty couple of days ahead," Mr Sullivan said. One of the areas causing greatest concern today is the 30-kilometre long blaze in the lower Blue Mountains. Firefighters are also keeping a close eye on a blaze at Spencer north of Sydney, which yesterday broke through containment li

In [18]:
# The 5 most similar articles (skipping the query document itself)
for idx, score in sims_sorted[1:6]:
    print(corpus[idx], f"(Score: {score})")

New South Wales firefighters are hoping lighter winds will help ease their workload today but are predicting "nasty" conditions over the weekend. While the winds are expected to ease somewhat today, the weather bureau says temperatures will be higher. More than 100 fires are still burning across New South Wales. The Rural Fire Service says the change may allow it to concentrate more on preventative action, but there is no room for complacency. Mark Sullivan, from the Rural Fire Service, says while conditions may be a little kinder to them today, the outlook for the weekend has them worried. "It certainly appears from the weather forecast, with very high temperatures and high winds that it certainly could be a nasty couple of days ahead," Mr Sullivan said. One of the areas causing greatest concern today is the 30-kilometre long blaze in the lower Blue Mountains. Firefighters are also keeping a close eye on a blaze at Spencer north of Sydney, which yesterday broke through containment lin

In [19]:
# Create a new text and look for the most similar article
texto = """the new Pakistan government falled in the terrorist attack by the islamic group Hamas"""
texto_embedding = model.encode(texto)

In [20]:
texto_embedding.shape

(384,)

In [21]:
# Compare the new text against all the documents
sims = cosine_similarity(texto_embedding.reshape(1, -1), doc_embeddings)[0]
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims_sorted[:10])

[(34, 0.51401895), (143, 0.5119651), (58, 0.47483134), (110, 0.4738132), (1, 0.46610594), (93, 0.46422002), (220, 0.45312607), (12, 0.43502474), (227, 0.43194503), (26, 0.4288589)]
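
The encode, compare and rank steps can be wrapped into a single function. The sketch below runs without downloading a model by using a toy word-count encoder, which is a stand-in and not how a Sentence Transformer builds embeddings. `search`, `toy_encode`, `VOCAB` and the tiny corpus are all hypothetical names and data introduced for illustration.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query, encode, doc_vectors, k=5):
    """Encode the query, score it against precomputed vectors, return the top k."""
    q = encode(query)
    scored = [(i, cosine_sim(q, d)) for i, d in enumerate(doc_vectors)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Toy stand-in for model.encode: word counts over a tiny, made-up vocabulary
VOCAB = ["fire", "wind", "pakistan", "india", "attack"]

def toy_encode(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

toy_corpus = [
    "fire crews battle strong wind",
    "pakistan and india trade sanctions",
    "attack reported near the border",
]
toy_vectors = [toy_encode(doc) for doc in toy_corpus]  # computed once, reused per query

results = search("attack in pakistan", toy_encode, toy_vectors, k=2)
for idx, score in results:
    print(idx, round(score, 3), toy_corpus[idx])
```

In this notebook, `encode` would be `model.encode` and `doc_vectors` would be `doc_embeddings`. The document embeddings are computed once and only the query is encoded at search time, which is the practical advantage of the Bi-encoder setup.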


In [22]:
# The 5 most similar articles (the query is not in the corpus, so start at 0)
for idx, score in sims_sorted[0:5]:
    print(corpus[idx], f"(Score: {score})")

Pakistan's Foreign Ministry has announced retaliatory sanctions against India, saying it would also downgrade embassy representation and ban Indian planes from its airspace. "Pakistan regrets the Indian decision to downgrade embassy representation by 50 per cent and confine staff to the municipal limits of New Delhi and ban access to airspace," ministry spokesman Aziz Ahmed Khan said. "Such efforts will only increase tension and we are forced to take retaliatory actions. "We will downgrade their embassy staff here, confine them to Islamabad limits, and will also ban their access to Pakistan's airspace." As tension mounted between the two rivals after the December 13 attack on the Parliament complex in new Delhi, Indian Foreign Minister Jaswant Singh earlier announced a set of new sanctions. Pakistani aircraft would not be allowed to fly over Indian airspace from January 1, and the Indian embassy in Islamabad and the Pakistan mission in New Delhi will have to reduce their staff by 50 pe