<a href="https://colab.research.google.com/github/ColinS97/htw_cnn_lecture/blob/main/assignments/transformer/nlp_3_neural_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural search with Transformers

## What are we going to do?

Instead of searching text by comparing characters and words,
we will use transformer models to turn texts into vectors and compare them in vector space.

![](https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif)

## Installing dependencies

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYA

In [2]:
import torch  # needed by mean_pooling below
from transformers import AutoModel, AutoTokenizer

In [31]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
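
To see what the attention mask does in `mean_pooling`, here is a small sketch on hand-made tensors. The values and shapes are made up for illustration; only the pooling logic mirrors the function above.

```python
import torch

def mean_pooling(model_output, attention_mask):
    # Same logic as the notebook function: average the token vectors, ignoring padding.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Toy batch: 2 sentences, 3 token positions, 2 dimensions (made-up values).
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

# The model output is indexed with [0], so a one-element tuple is enough here.
pooled = mean_pooling((toy_embeddings,), toy_mask)
print(pooled)  # first row averages only the two real tokens: [2., 3.]
```

The padded position with the large values does not influence the first sentence vector, which is the point of multiplying by the mask before summing.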


## Loading a model

In [33]:
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)



In [None]:
model

## Transforming a text into a vector

In [34]:
sentences = ['This is an example sentence', 'Each sentence is converted']

In [35]:
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
inputs

{'input_ids': tensor([[     0,   3293,     83,    142,  27781, 149357,      2],
        [     0,  98423, 149357,     83, 117176,     71,      2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]])}

In [6]:
outputs = model(**inputs)
# ** unpacks the dictionary into keyword arguments, so its keys must match the parameter names of the model's forward method
outputs 

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[ 0.2415,  0.6428, -0.5918,  ...,  0.7404,  0.1338, -0.2779],
                                                        [ 0.2890,  0.7354, -0.9872,  ...,  0.7438, -0.6845, -0.1892],
                                                        [ 0.2164,  0.8085, -0.6320,  ...,  0.6260, -0.6735,  0.2411],
                                                        [-0.2210,  0.8246, -0.6606,  ...,  0.8001, -0.1529,  0.0831],
                                                        [-0.0573,  1.1134, -0.8065,  ...,  0.5099, -0.2104, -0.0108],
                                                        [ 0.2753,  1.0079, -0.7109,  ...,  0.3392,  0.1056,  0.1083]],
                                               
                                                       [[ 0.1849,  0.7584, -0.2930,  ...,  0.7572,  0.9876, -0.3735],
                                                        [
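
The `**` unpacking used in the model call can be illustrated without a model. In this minimal sketch, `toy_forward` is a hypothetical stand-in that only echoes its arguments; it is not the real signature of the model's forward method.

```python
def toy_forward(input_ids, attention_mask=None, token_type_ids=None):
    # Hypothetical stand-in for a model call: it just reports what it received.
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# Made-up tokenizer-like output.
example_inputs = {"input_ids": [0, 3293, 2], "attention_mask": [1, 1, 1]}

# These two calls are equivalent: ** turns each dictionary key into a keyword argument.
unpacked = toy_forward(**example_inputs)
explicit = toy_forward(
    input_ids=example_inputs["input_ids"],
    attention_mask=example_inputs["attention_mask"],
)
print(unpacked == explicit)
```

A key that does not match any parameter name would raise a `TypeError`, which is why the tokenizer output and the model have to agree on the names.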

**last_hidden_state**: Sequence of hidden-states at the output of the last layer of the model.

In [30]:
outputs["last_hidden_state"].shape


torch.Size([2, 6, 768])

**pooler_output**: Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

In [8]:
outputs["pooler_output"].shape

torch.Size([2, 768])
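
As a rough sketch of the `pooler_output` description above: take the hidden state of the first token, pass it through a linear layer, and apply tanh. The layer below is randomly initialised and purely illustrative; it does not reproduce the trained pooler weights of any model, and the random tensor only stands in for `last_hidden_state`.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768
# Stand-in for outputs["last_hidden_state"]: 2 sequences, 6 tokens each.
last_hidden_state = torch.randn(2, 6, hidden_size)

# Hypothetical pooler: a randomly initialised Linear layer followed by Tanh.
dense = nn.Linear(hidden_size, hidden_size)

first_token = last_hidden_state[:, 0]      # shape (2, 768): one vector per sequence
pooled = torch.tanh(dense(first_token))    # shape (2, 768), values between -1 and 1
print(pooled.shape)
```

Unlike mean pooling, this uses only one token position per sequence, so the result depends on how well that position summarises the sentence.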

## Loading data

We load a data set of news headlines from German newspapers. It contains the headlines and the corresponding article URLs.
After loading the data, we need to convert all headlines into vectors.

In [9]:
!curl -O https://www2.htw-dresden.de/~guhr/dist/feeds.tsv 


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.0M  100 10.0M    0     0  5192k      0  0:00:01  0:00:01 --:--:-- 5192k


In [10]:
!head feeds.tsv

id	title	text	time	link		
https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5	'Corona-Krise in Deutschland: Wie kommen wir wieder raus?'	'Die Bundesregierung will in der kommenden Woche über mögliche Szenarien für den Exit aus dem Lockdown beraten. Schon jetzt warnen Politiker vor einem überhasteten Aussetzen der Maßnahmen. Der Überblick.'	'Mon, 13 Apr 2020 18:18:00 +0200'	'https://www.spiegel.de/politik/deutschland/corona-krise-in-deutschland-wie-kommen-wir-wieder-raus-a-d8099433-e178-46be-957a-f6c779b3f2f5#ref=rss		
https://www.spiegel.de/wissenschaft/leopoldina-forscher-legen-konkreten-fahrplan-fuer-ende-der-kontaktsperren-vor-a-0cfd0aed-cf48-4dd1-a219-241d818d60ae	'Leopoldina-Forscher legen konkreten Fahrplan für Ende der Kontaktsperren vor'	'Die Nationalakademie Leopoldina empfiehlt eine baldige Rückkehr zur Schule. Auch Geschäfte und Behörden sollen schrittweise eröffnen und Reisen erlaubt werden

In [11]:
import time
import pandas as pd
import numpy as np

feeds_df = pd.read_csv("feeds.tsv", sep="\t", header=0, encoding="utf-8")
# Keep only the title and link columns; the two unnamed columns come from trailing tabs in the file.
feeds_df = feeds_df.drop(columns=["text", "time", "id", "Unnamed: 5", "Unnamed: 6"])
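
An alternative to dropping columns after loading is to select the wanted columns while reading. The sketch below uses a made-up two-row sample in the same tab-separated layout; `sample_tsv`, `sample_df` and the example.org links are placeholders, not part of the real data set.

```python
import io
import pandas as pd

# Made-up sample with the same header (including the two trailing empty columns).
sample_tsv = (
    "id\ttitle\ttext\ttime\tlink\t\t\n"
    "a1\t'First headline'\t'Some text'\t'Mon, 13 Apr 2020'\t'https://example.org/1'\t\t\n"
    "a2\t'Second headline'\t'More text'\t'Tue, 14 Apr 2020'\t'https://example.org/2'\t\t\n"
)

# usecols keeps only the named columns, so no drop() calls are needed afterwards.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=0, usecols=["title", "link"]
)
print(list(sample_df.columns))
```

The single quotes around the values are still part of the strings here, just as in the real file.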

In [12]:
feeds_df.head(5)

Unnamed: 0,title,link
0,'Corona-Krise in Deutschland: Wie kommen wir w...,'https://www.spiegel.de/politik/deutschland/co...
1,'Leopoldina-Forscher legen konkreten Fahrplan ...,'https://www.spiegel.de/wissenschaft/leopoldin...
2,'Philosophie Coronavirus-Lockdown: Wir müssen ...,'https://www.spiegel.de/wissenschaft/philosoph...
3,'Coronavirus in Indonesien: Gefährliche Heimre...,'https://www.spiegel.de/politik/ausland/corona...
4,'Coronavirus News am Montag: Die wichtigsten E...,'https://www.spiegel.de/wissenschaft/medizin/c...


In [13]:
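# We want to remove the quotes here in order to get better results.

def remove_quotes(text):
    # Strip one leading and one trailing single quote, but only if present.
    # Some links have no closing quote, so text[1:-1] would cut off a real character.
    if text.startswith("'"):
        text = text[1:]
    if text.endswith("'"):
        text = text[:-1]
    return text

feeds_df["title"] = feeds_df["title"].map(remove_quotes)
feeds_df["link"] = feeds_df["link"].map(remove_quotes)
feeds_df.head(10)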

Unnamed: 0,title,link
0,Corona-Krise in Deutschland: Wie kommen wir wi...,https://www.spiegel.de/politik/deutschland/cor...
1,Leopoldina-Forscher legen konkreten Fahrplan f...,https://www.spiegel.de/wissenschaft/leopoldina...
2,Philosophie Coronavirus-Lockdown: Wir müssen ü...,https://www.spiegel.de/wissenschaft/philosophi...
3,Coronavirus in Indonesien: Gefährliche Heimreise,https://www.spiegel.de/politik/ausland/coronav...
4,Coronavirus News am Montag: Die wichtigsten En...,https://www.spiegel.de/wissenschaft/medizin/co...
5,Corona-Lockdown: Deutsche sind immer mehr unte...,https://www.spiegel.de/panorama/corona-lockdow...
6,Corona-Krise: Warum Vorhersagen zu Wirtschaft ...,https://www.spiegel.de/wirtschaft/corona-krise...
7,"Corona-Alltags-Heldin: Susanne Rudwill, 56, Ka...",https://www.spiegel.de/panorama/gesellschaft/c...
8,"Corona: Politik darf keine Erwartungen wecken,...",https://www.spiegel.de/politik/deutschland/cor...
9,Trigema-Chef Grupp kämpft gegen die Corona-Kri...,https://www.spiegel.de/wirtschaft/unternehmen/...


In [14]:
# Number of entries in our data set
len(feeds_df)

21257

## Processing the data

In [15]:
# since 21257 entries would take a lot of time to process, we just load
# the first 3000 articles here. But you are welcome to experiment with this 
# parameter. 

titles = list(feeds_df["title"][:3000])
links = list(feeds_df["link"][:3000])
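
If you want to go beyond the first 3000 headlines, one option is to encode the titles in smaller batches instead of all at once. The following is a minimal sketch of that idea; `batches`, `embed_in_batches` and `dummy_encode` are hypothetical helpers, and the dummy encoder only returns text lengths so the example runs without a model.

```python
def batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, encode_batch, batch_size=64):
    # encode_batch is a placeholder: any function that maps a list of texts
    # to a list with one vector per text.
    vectors = []
    for batch in batches(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors

def dummy_encode(batch):
    # Demonstration only: a one-dimensional "vector" holding the text length.
    return [[float(len(text))] for text in batch]

demo_titles = ["Klima", "Corona-Krise", "Wetter"]
demo_vectors = embed_in_batches(demo_titles, dummy_encode, batch_size=2)
print(demo_vectors)
```

In the notebook, `encode_batch` would tokenize the batch, run the model under `torch.no_grad()` and apply `mean_pooling`; the batch size is a parameter to experiment with.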

In [43]:
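import torch

# Use the GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
tokens = tokenizer(titles, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    headline_vectors = model(**tokens)
headline_embeddings = mean_pooling(headline_vectors, tokens["attention_mask"])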


AttributeError: ignored

In [44]:
tokens = tokenizer("Klima", return_tensors="pt", truncation=True, padding=True)
tokens = tokens.to(model.device)  # move the query to the same device as the model
with torch.no_grad():
    query_vector = model(**tokens)
query_embeddings = mean_pooling(query_vector, tokens["attention_mask"])


In [45]:
# Dot product between the query embedding and every headline embedding (one score per headline)
result = torch.sum(query_embeddings * headline_embeddings, dim=1)
result.shape

torch.Size([3000])

In [46]:
result

tensor([8.5284, 3.2245, 5.7746,  ..., 1.7417, 2.5137, 3.3640], device='cuda:0')
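
The raw dot product grows with the length of the vectors, which is why the scores above are not bounded. A common alternative is cosine similarity, which is the dot product of L2-normalised vectors. Here is a small sketch on made-up two-dimensional vectors; `query` and `docs` are illustrative stand-ins for the query and headline embeddings.

```python
import torch
import torch.nn.functional as F

# Made-up embeddings: one query and three "headlines" in 2 dimensions.
query = torch.tensor([[1.0, 0.0]])
docs = torch.tensor([[10.0, 0.0], [1.0, 1.0], [0.0, 5.0]])

# Plain dot product: favours long vectors.
dot_scores = torch.sum(query * docs, dim=1)

# After L2-normalising both sides, the dot product equals the cosine similarity.
cos_scores = torch.sum(F.normalize(query, dim=1) * F.normalize(docs, dim=1), dim=1)

print(dot_scores)  # tensor([10.,  1.,  0.])
print(cos_scores)  # approximately tensor([1.0000, 0.7071, 0.0000])
```

Whether normalising improves the ranking for this data set is something to try out rather than assume.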

## Ranking the results

In [47]:
topk = 20
values, indices = torch.topk(result, topk,largest=True)
print(values,indices)

tensor([16.7430, 16.4768, 16.3090, 15.4628, 14.8662, 14.4715, 14.2645, 14.1040,
        13.7637, 13.5963, 13.4705, 13.3970, 13.2582, 13.1259, 12.7523, 11.8828,
        11.7547, 11.6815, 11.6120, 11.4196], device='cuda:0') tensor([2483, 2780, 2298,  737, 1558,  365, 2700, 2666,  157, 1514, 2743,  304,
        2871, 1977, 2217, 2902, 1526, 2679, 1492, 2680], device='cuda:0')


In [48]:
for i in range(0,topk):
  index = indices[i].item()
  value = int(values[i].item())
  print(value,titles[index],links[index])


16 Klimaschutz: Und was ist mit dem Klima? https://www.zeit.de/2020/17/klimaschutz-corona-krise-oekologie-wirtschaftswachstu
16 Meteorologie: Wie das trockene Frühjahr die Natur belastet https://www.sueddeutsche.de/wissen/trockenheit-wetter-klima-regen-april-1.488363
16 Klimawandel in der Arktis: Das Eis am Nordpol ist nicht mehr zu retten https://www.spiegel.de/wissenschaft/natur/klimawandel-in-der-arktis-das-eis-am-nordpol-ist-nicht-mehr-zu-retten-a-d923c467-e6ff-4e94-92c1-03fe20c20d1b#ref=rs
15 Corona-Krise: Was macht eigentlich der Klimaschutz? https://www.spiegel.de/politik/deutschland/corona-krise-was-macht-eigentlich-der-klimaschutz-a-00000000-0002-0001-0000-000170435621#ref=rs
14 Corona-Krise: Update: Jetzt fürs Klima demonstrieren? https://www.zeit.de/politik/2020-04/corona-krise-demonstrationen-klimaschutz-infektionsschut
14 Die Grünen in der Corona-Krise: Niemand redet mehr über das Klima https://www.zeit.de/politik/deutschland/2020-04/gruene-corona-krise-klimawandel-emissio
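
The ranking step can also be wrapped in a small helper that keeps the decimal part of the score instead of truncating it with `int()`. This is a sketch under the assumption that scores, titles and links are aligned by position; `top_hits` and the demo values are made up for illustration.

```python
import torch

def top_hits(scores, titles, links, k=5):
    # Return the k best (score, title, link) tuples, highest score first.
    k = min(k, len(titles))
    values, indices = torch.topk(scores, k, largest=True)
    return [
        (round(value.item(), 2), titles[index], links[index])
        for value, index in zip(values, indices.tolist())
    ]

# Made-up scores and headlines.
demo_scores = torch.tensor([3.2, 16.7, 8.5])
demo_titles = ["A", "B", "C"]
demo_links = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]

hits = top_hits(demo_scores, demo_titles, demo_links, k=2)
for hit in hits:
    print(hit)
```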

## Let's take a look at the vector space
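Run the cell below to export the vectors and titles, then download the two files (`vectors.tsv` and `titles.tsv`) and upload them to the [TensorFlow Embedding Projector](https://projector.tensorflow.org/).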

In [23]:
# Export the pooled sentence embeddings and the titles for the Embedding Projector
x_np = headline_embeddings.cpu().numpy()
x_df = pd.DataFrame(x_np)
x_df.to_csv('vectors.tsv', sep="\t", index=False, header=False, encoding="utf-8")

with open('titles.tsv', 'w', encoding="utf-8") as writer:
    for title in titles:
        writer.write(title[:150] + "...\n")
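
The projector matches vectors and labels line by line, so each title should end up on exactly one line of the metadata file. A title that contains a tab or a line break would shift the labels. A minimal sketch of a clean-up step follows; `clean_label` is a hypothetical helper and the demo titles are made up.

```python
def clean_label(title, max_length=150):
    # Collapse tabs, newlines and repeated spaces so each label stays on one line.
    label = " ".join(title.split())
    # Append an ellipsis only when the title was actually shortened.
    return label if len(label) <= max_length else label[:max_length] + "..."

demo_titles = ["Klimaschutz:\tUnd was ist mit dem Klima?", "Kurz\nund knapp"]
labels = [clean_label(title) for title in demo_titles]
print(labels)
```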


# Your tasks

Try to improve the search results. Here are some ideas:

* try out sentence transformers like this one: [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852)
* try to adapt the sample code from the [sentence-transformers project](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).

Check your results with the embedding projector and compare them. What do you see?


Bonus:

* try a clustering algorithm such as k-means to group news articles
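
As a possible starting point for the bonus task, here is a minimal clustering sketch using k-means from scikit-learn. k-means is a choice made here for illustration, and the random points only stand in for the headline embeddings; the number of clusters is an assumption you would need to tune for real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in for the headline embeddings: two well-separated groups of points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
demo_embeddings = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(demo_embeddings)

# Each entry of labels is the cluster id assigned to the corresponding row.
print(labels)
```

With real embeddings, the cluster ids can be written next to the titles and inspected in the projector.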
