<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/text_sim_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note! Run this notebook with GPU acceleration! (Runtime -> Change runtime type)**

# Text similarity / neural models

* TF-IDF does not take into account semantic similarity / paraphrasing
* Paraphrase: same meaning, different wording
* Low lexical overlap (different wording) means erroneously low TF-IDF similarity
* In the ideal case, one would capture meaning regardless of wording
* This is ongoing research: there are no silver-bullet solutions, but steady progress in this direction can be observed!
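
To make the lexical-overlap problem concrete, here is a minimal sketch using plain word-count cosine similarity (no IDF weighting, so it is a simplification of TF-IDF). The example sentences are invented and `bow_cosine` is a hypothetical helper name, not something from the course code.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of raw word-count vectors (no IDF weighting)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sentences: a paraphrase pair and a near-copy with a different meaning
paraphrase = bow_cosine("the company dismissed fifty workers",
                        "fifty employees were laid off by the firm")
near_copy = bow_cosine("the company dismissed fifty workers",
                       "the company dismissed sixty workers")
print(round(paraphrase, 2), round(near_copy, 2))
```

The paraphrase pair shares only two words and scores much lower than the near-copy, even though the near-copy states a different fact.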

# BERT-based similarity

* The BERT model can be seen as a device that turns input text into a dense vector representation
* We will see that it does not capture paraphrasing all that well, but it is a suitable model for learning how to manipulate embeddings produced by neural models and for gaining intuition about what such a model can do out of the box
* Let us test the BERT model on sentences from news and see what it can do


In [1]:
!wget http://dl.turkunlp.org/textual-data-analysis-course-data/hs_yle_spring_2020.json.gz

--2021-03-10 15:17:58--  http://dl.turkunlp.org/textual-data-analysis-course-data/hs_yle_spring_2020.json.gz
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20590829 (20M) [application/octet-stream]
Saving to: ‘hs_yle_spring_2020.json.gz’


2021-03-10 15:18:02 (5.24 MB/s) - ‘hs_yle_spring_2020.json.gz’ saved [20590829/20590829]



In [2]:
import json
import gzip
from pprint import pprint  #pprint is prettyprint

with gzip.open("hs_yle_spring_2020.json.gz") as f:
    news_data=json.load(f)


In [3]:
yle=news_data["2020"]["01"]["yle-text"]
hs=news_data["2020"]["01"]["hs-text"]



# Preprocess data with Udpipe

* We need to split the data into sentences
* This is covered in the Intro to NLP course [here](https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/basic_nlp.ipynb); see that notebook for further details as needed

In [4]:
!pip3 install ufal.udpipe

Collecting ufal.udpipe
  Downloading https://files.pythonhosted.org/packages/e5/72/2b8b9dc7c80017c790bb3308bbad34b57accfed2ac2f1f4ab252ff4e9cb2/ufal.udpipe-1.2.0.3.tar.gz (304kB)

In [5]:
# Download the model
# The download link can be found on UDPipe's "Models" page; this picks the Finnish model, but there are models for many more languages
# The URL is quoted so that the shell does not treat the & as "run in background" and cut the URL short
!wget -nc -O fi_model.udpipe "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/finnish-tdt-ud-2.5-191206.udpipe?sequence=25&isAllowed=y"


--2021-03-10 15:21:08--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/finnish-tdt-ud-2.5-191206.udpipe?sequence=25
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21613253 (21M) [application/octet-stream]
Saving to: ‘fi_model.udpipe’


2021-03-10 15:21:15 (5.16 MB/s) - ‘fi_model.udpipe’ saved [21613253/21613253]



In [6]:
#Build a pipeline which tokenizes and sentence-splits text, returning one sentence per line
import ufal.udpipe as udpipe

model = udpipe.Model.load("fi_model.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal")


In [7]:
print(pipeline.process(yle[0]["text"]))

Helsingin Kansalaistorin juhlat sujuivat mallikkaasti – poliisi torui muualla kaupungissa nuoria rakettien ampumisesta ihmisiä päin Helsingin Kansalaistorille oli kerääntynyt juhlimaan vuoden vaihtumista arviolta 85 000 ihmistä .
Helsingin poliisi kertoo saaneensa kymmeniä ilmoituksia ilotulitteiden väärinkäytöksistä ympäri kaupunkia .
Poliisin johtokeskuksesta kerrottiin yöllä , että poliisi oli saanut iltakymmeneen mennessä kymmeniä ilmoituksia väärinkäytöksistä .
Poliisin mukaan nuorisoporukat ovat ampuneet ilotulitteita ihmisiä , autoja ja rakennuksia päin .
Ilotulitteita ammuttiin myös sellaisilla alueilla , missä niiden ampuminen on kiellettyä .
Kansalaistorilla noin 85 000 ihmistä Helsingin kansalaistorilla järjestettiin musiikkia ja ilotulituksen sisältävä uudenvuoden juhla .
Järjestäjän arvion mukaan Kansalaistorille oli kerääntynyt juhlimaan vuoden vaihtumista arviolta 85 000 ihmistä .
Juhlinta sujui mallikkaasti , eikä poliisin tietoon tullut juhlapaikalta ilmoituksia vakava

* To save time, we cannot process all the data
* Let us pick sentences from the first 25% of YLE and the first 25% of HS articles in January 2020 (the `"01"` key selected above)
* Further, let us pick only the first five sentences of every article
* This is simply to keep the sentence count manageable

In [8]:
import tqdm #progress bar
for d in tqdm.tqdm(yle[:len(yle)//4]):
    d["segmented"]=pipeline.process(d["text"]).strip().split("\n")[:5]
for d in tqdm.tqdm(hs[:len(hs)//4]):
    d["segmented"]=pipeline.process(d["text"]).strip().split("\n")[:5]

100%|██████████| 480/480 [00:08<00:00, 55.82it/s]
100%|██████████| 1770/1770 [00:30<00:00, 58.82it/s]


In [9]:
print(yle[0]["segmented"])

['Helsingin Kansalaistorin juhlat sujuivat mallikkaasti – poliisi torui muualla kaupungissa nuoria rakettien ampumisesta ihmisiä päin Helsingin Kansalaistorille oli kerääntynyt juhlimaan vuoden vaihtumista arviolta 85\xa0000 ihmistä .', 'Helsingin poliisi kertoo saaneensa kymmeniä ilmoituksia ilotulitteiden väärinkäytöksistä ympäri kaupunkia .', 'Poliisin johtokeskuksesta kerrottiin yöllä , että poliisi oli saanut iltakymmeneen mennessä kymmeniä ilmoituksia väärinkäytöksistä .', 'Poliisin mukaan nuorisoporukat ovat ampuneet ilotulitteita ihmisiä , autoja ja rakennuksia päin .', 'Ilotulitteita ammuttiin myös sellaisilla alueilla , missä niiden ampuminen on kiellettyä .']


* Now we can build a list of all unique sentences
* We will simply pour together YLE and HS, and not worry about anything else
* Let's keep things simple...

In [10]:
all_sentences=[]
for d in yle:
    all_sentences.extend(d.get("segmented",[])) #I use .get() because only 1/4 of the dictionaries have been segmented
for d in hs:
    all_sentences.extend(d.get("segmented",[]))
print("All sentences",len(all_sentences))
unique_sentences=list(set(all_sentences)) #make the sentences unique
unique_sentences.sort()
print("All unique sentences",len(unique_sentences))
for s in unique_sentences[:10]:
    print(s)

All sentences 10633
All unique sentences 6597

" Ei työssäkäyvän ihmisen talous siihen kaadu "
" Esimiehet pyörisivät yksin töissä ilman työntekijöitä "
" Hankitaan vain kirjoja , jotka ovat saaneet megakohun aikaan "
" Huolestunut vihje voi tulla pankista tai kaupan kassalta "
" Huono yhteistyö ei ole laitonta , se on vain hankalaa "
" Iskujen tarkoituksena ei ollut tappaa USA:n sotilaita , mutta operaatio amerikkalaisjoukkojen pois ajamiseksi jatkuu "
" Juoksi keittiöön ja nappasi leipäveitsen käteen " – vartijat joutuvat yhä useammin turvaamaan kotihoidon työntekijöiden kotikäyntejä
" Lasten ja vanhempien mielikuvitus on laiskistunut " – 8 vinkkiä lumettomiin lomapäiviin
" Menestystarina on auennut "


# BERT model in practice

* Running the model is somewhat involved, even though in the end it is only a few lines of Python :)
* Hang on!
* We will use the Huggingface Transformers library to run the model

In [11]:
!pip3 install transformers
import transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=8d9ea

# BERT tokenization

* BERT does (and needs!) its own sub-word tokenization
* When applying it to data, you need the correct tokenizer for the language you use and for the model you will use
* A large number of models are [easily available and distributed by Huggingface](https://huggingface.co/transformers/pretrained_models.html)
* Conveniently, the Finnish BERT model has been added by TurkuNLP :)


In [12]:
bert_tokenizer=transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")


* [Documentation](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__)
* The tokenizer receives a list of sentences, and returns everything needed to run the BERT model on the data
   * input token IDs
   * token type IDs
   * attention mask
* We will also ask it to:
   * truncate the data to the maximum length BERT can accept (512 tokens per sequence)
   * pad the data to a rectangular shape (add zeros to sentences shorter than the longest one)
   * return the data directly as torch tensors, which we will need (the torch counterpart of numpy's ndarray)
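
What padding and the attention mask mean can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the truncation here simply cuts the list, whereas the real tokenizer also takes care of special tokens.

```python
def pad_batch(token_ids, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in token_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

toy = [[102, 245, 771, 103], [102, 103]]  # made-up ID sequences of different length
input_ids, attention_mask = pad_batch(toy)
print(input_ids)
print(attention_mask)
```

Both lists come out rectangular, which is what allows them to be stacked into a single tensor.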




In [13]:
sents_tokenized=bert_tokenizer(unique_sentences,padding=True,truncation=True,return_tensors="pt")

In [14]:
print("Tokenizer output:",list(sents_tokenized.keys()))
print("input_ids shape:",sents_tokenized["input_ids"].shape)

Tokenizer output: ['input_ids', 'token_type_ids', 'attention_mask']
input_ids shape: torch.Size([6597, 136])


* We have 6597 input sentences
* The longest one is 136 tokens long
* Now we need to batch the data for the model, i.e. chop it to batches of a handful of sentences at a time which will be processed at once

In [15]:
import torch

#1) a Dataset is something which gives individual examples; TensorDataset gives the rows of the tensors in the tokenizer output
ds=torch.utils.data.TensorDataset(sents_tokenized["input_ids"],sents_tokenized["token_type_ids"],sents_tokenized["attention_mask"])
#2) DataLoader can take the TensorDataset and batch it for us (it can do plenty of other things too!)
batched_ds=torch.utils.data.DataLoader(ds,batch_size=20)

In [16]:
#item should be one batch now
for item in batched_ds:
    print(item)
    break

print("item[0] shape",item[0].shape)
print("item[1] shape",item[1].shape)
print("item[2] shape",item[2].shape)

[tensor([[  102,   103,     0,  ...,     0,     0,     0],
        [  102,   245,   771,  ...,     0,     0,     0],
        [  102,   245,  2412,  ...,     0,     0,     0],
        ...,
        [  102,   348, 27617,  ...,     0,     0,     0],
        [  102,   348, 27617,  ...,     0,     0,     0],
        [  102,   348, 27617,  ...,     0,     0,     0]]), tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), tensor([[1, 1, 0,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])]
item[0] shape torch.Size([20, 136])
item[1] shape torch.Size([20, 136])
item[2] shape torch.Size([20, 136])
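
Conceptually, the batching done by `DataLoader` here (no shuffling, batch size 20) amounts to slicing the rows into consecutive chunks. A minimal sketch with a plain list standing in for the tensors; `batches` is a hypothetical helper name:

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

rows = list(range(6597))            # stand-in for the 6597 tokenized sentences
chunks = list(batches(rows, 20))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

With 6597 rows and a batch size of 20, the last batch is shorter than the others, which `DataLoader` also allows by default.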


* Everything is now ready on the input side
* Now we can load the model and push it onto the GPU

In [17]:
model=transformers.BertModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model=model.cuda() #moves the model onto GPU
model=model.eval() #tells the model it will not be training, it will be used to simply process input



* Now we push every batch of data through the model
* Harvest the pooler output which is a single vector representing the whole sentence

In [18]:
all_vectors=[] #store the model outputs here for all batches
with torch.no_grad(): #this tells torch to not collect gradient data, conserves memory big time!
    for input_ids,token_type_ids,att_mask in tqdm.tqdm(batched_ds): #each batch is composed of input_ids, token_type_ids, attention mask
        input_ids=input_ids.cuda() #move data onto GPU since the model is there too
        token_type_ids=token_type_ids.cuda() #...
        att_mask=att_mask.cuda() #...
        model_out=model(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=att_mask) #run the model
        all_vectors.append(model_out.pooler_output.cpu()) #move the pooler output back to CPU and store it


100%|██████████| 330/330 [01:46<00:00,  3.11it/s]


In [20]:
# Combines all of the individual batches into a single tensor
embedded=torch.vstack(all_vectors).numpy()
print(embedded.shape)

(6597, 768)
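
The cell above keeps `pooler_output` as the sentence vector. One widely used alternative, which this notebook does not use, is to average the per-token vectors (`last_hidden_state` in Transformers) while skipping padded positions. Below is a minimal numpy sketch of that averaging only; the numbers are made up rather than real model output, and `masked_mean` is a hypothetical helper name.

```python
import numpy as np

def masked_mean(token_vectors, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_vectors: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # guard against division by zero
    return summed / counts

# One sentence, three positions, hidden size 2; the last position is padding
vecs = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean(vecs, mask)
print(pooled)
```

The padded position does not influence the result, which is the reason for passing the attention mask along.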


* At this point, we are on familiar territory
* We have a matrix with as many rows as we have "documents" and each row is a vector which represents the document
* These vectors are 768-dimensional and dense
* Other than that, it is exactly like what we had from the TfidfVectorizer
* So we can re-use the code to calculate the nearest documents with very few modifications
* Below is mostly copied code:

In [21]:
#2) Compare
import sklearn.metrics.pairwise as pairwise

sent_sims=pairwise.cosine_similarity(embedded) 
print(sent_sims.shape) #we now have all-vs-all cosine similarities of the BERT sentence embeddings


(6597, 6597)
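
As a sanity check on what the all-vs-all cosine similarity contains, the same quantity can be sketched in plain numpy: scale every row to unit length and multiply the matrix by its own transpose. The helper name `cosine_all_vs_all` and the toy vectors are illustrative, and the sketch assumes no row is all zeros.

```python
import numpy as np

def cosine_all_vs_all(x):
    """Cosine similarity between every pair of rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # rows scaled to length 1
    return unit @ unit.T

toy = np.array([[1.0, 0.0],
                [2.0, 0.0],   # same direction as row 0, different length
                [0.0, 3.0]])  # orthogonal to rows 0 and 1
sims = cosine_all_vs_all(toy)
print(sims)
```

Rows pointing in the same direction get similarity 1 regardless of their length, and orthogonal rows get 0.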


In [23]:
#3) Pick most similar
# code unchanged except for the 1:2 slice

import numpy as np
sorted_indices=np.argsort(-sent_sims)[:,1:2] #we cannot take the first, because that would be the sentence itself! (this was [:,:1] previously)
# argsort (argument sort, gives indices rather than sorted values)
# sort is always ascending but we want descending, the solution is to sort -sent_sims
# [:,1:2] means "take all rows and the second column" but do keep as a 2-dim array
print("Sorted_indices shape",sorted_indices.shape) #as many rows as there are unique sentences, and for each the index of its most similar other sentence

#But now we want to see the sentences that have the highest correspondence to any other sentence
#for that we need to sort again. For that, we also need the scores!
scores=np.take_along_axis(sent_sims,sorted_indices,-1)  #pick values from sent_sims using the sorted_indices, on the last axis (does your head spin?)
print("scores.shape",scores.shape)
scores_sorted_indices=np.argsort(-scores.flatten()) #We need to flatten before sort or else the 2nd dimension (which has only one element) will get sorted
#this is now indices to unique_sentences sorted in descending order by their similarity to any other sentence



Sorted_indices shape (6597, 1)
scores.shape (6597, 1)
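
The indexing in the cell above is easier to follow on a toy matrix. The sketch below repeats the same `argsort` / `take_along_axis` steps on an invented 3x3 similarity matrix, and also shows an alternative that masks the diagonal instead of assuming that each sentence's match with itself always sorts first (an assumption that could fail on exact ties).

```python
import numpy as np

# Invented symmetric similarity matrix with 1.0 on the diagonal
sims = np.array([[1.0, 0.2, 0.9],
                 [0.2, 1.0, 0.4],
                 [0.9, 0.4, 1.0]])

nearest = np.argsort(-sims)[:, 1:2]              # second-best column per row, kept 2-dim
scores = np.take_along_axis(sims, nearest, -1)   # similarity to that nearest neighbour
ranking = np.argsort(-scores.flatten())          # rows ordered by that similarity, descending

# Alternative: hide the self-matches, then take the best remaining column
masked = sims.copy()
np.fill_diagonal(masked, -np.inf)
nearest_alt = masked.argmax(axis=1)

print(nearest.flatten(), scores.flatten(), ranking, nearest_alt)
```

On this toy input both approaches pick the same neighbours; the masked variant simply avoids relying on the sort order of equal values.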


In [25]:
#4) Inspect!

#Can we convince ourselves this works?
for i in scores_sorted_indices[1000:1100]: #sentences ranked 1000-1099 by similarity to their nearest neighbour
    #Which is the corresponding one?
    j=sorted_indices[i][0] #index of the most similar other sentence: look it up in sorted_indices, and since that is a 2-dim array, pick the first column (numpy arrays can be a head-spinning experience)
    print("------------------------------------------")
    print("i",i,"j",j) #now we know which row and column we are referring to
    sim=sent_sims[i,j] #this is the similarity
    print("Sim",sim)
    print("*********** ")
    print(unique_sentences[i]) #this is the first one
    print("*********** ")
    print(unique_sentences[j]) #...and this is the second one
    print("------------------------------------------")
    print()

------------------------------------------
i 4454 j 4455
Sim 0.95956564
*********** 
Sopimuksen hyväksyminen merkitsee toteutuessaan päänavausta talven työmarkkinakierrokselle .
*********** 
Sopimuksen hyväksyminen merkitsisi päänavausta talven työmarkkinakierrokselle .
------------------------------------------

------------------------------------------
i 4078 j 4076
Sim 0.95950454
*********** 
Rikollisjengi United Brotherhood ( UB ) ja sen alajärjestö Bad Union ovat päättäneet lakkauttaa toimintansa .
*********** 
Rikollisjengi United Brotherhood ( UB ) ja sen alajärjestö Bad Union ilmoittaa lopettavansa toimintansa , kertovat Helsingin Sanomat ja MTV:n uutiset .
------------------------------------------

------------------------------------------
i 4076 j 4078
Sim 0.95950454
*********** 
Rikollisjengi United Brotherhood ( UB ) ja sen alajärjestö Bad Union ilmoittaa lopettavansa toimintansa , kertovat Helsingin Sanomat ja MTV:n uutiset .
*********** 
Rikollisjengi United Brotherhoo

# That worked like a charm!

* Try also other parts of the ranking, e.g. `[:100]` for the most similar pairs or `[2000:2100]` for less similar ones
* Below is the BERT-specific code in one place, to show that in the end there is not all that much code to master
* Using deep learning models is made easy!

In [None]:
bert_tokenizer=transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
sents_tokenized=bert_tokenizer(unique_sentences,padding=True,truncation=True,return_tensors="pt")

ds=torch.utils.data.TensorDataset(sents_tokenized["input_ids"],sents_tokenized["token_type_ids"],sents_tokenized["attention_mask"])
batched_ds=torch.utils.data.DataLoader(ds,batch_size=20)

model=transformers.BertModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model=model.cuda() #moves the model onto GPU
model=model.eval() #tells the model it will not be training, it will be used to simply process input

all_vectors=[] #store the model outputs here for all batches
with torch.no_grad(): #this tells torch to not collect gradient data, conserves memory big time!
    for input_ids,token_type_ids,att_mask in tqdm.tqdm(batched_ds): #each batch is composed of input_ids, token_type_ids, attention mask
        input_ids=input_ids.cuda() #move data onto GPU since the model is there too
        token_type_ids=token_type_ids.cuda() #...
        att_mask=att_mask.cuda() #...
        model_out=model(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=att_mask) #run the model
        all_vectors.append(model_out.pooler_output.cpu()) #move the pooler output back to CPU and store it

embedded=torch.vstack(all_vectors).numpy()