<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/tda_2025_week2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Load and embed some textual data</h1>



*   Load data as a sequence of texts
*   Run them through a popular embedding model
*   Inspect the result



In [1]:
# News data from the first demo
!wget http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl

--2025-01-19 20:07:35--  http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3385882 (3.2M) [application/octet-stream]
Saving to: ‘news-en-2021.jsonl’


2025-01-19 20:07:37 (2.38 MB/s) - ‘news-en-2021.jsonl’ saved [3385882/3385882]



In [2]:
import json
all_news=[]
with open("news-en-2021.jsonl") as f:
    for line in f:
        one_news = json.loads(line)
        all_news.append(one_news)



In [3]:
print(list(all_news[0].keys()))

['summary', 'tags', 'text', 'timestamp', 'title', 'url']


In [4]:
summaries=[n["summary"] for n in all_news]
print(f"We have in total {len(summaries)} summaries")
print(summaries[:5])

We have in total 1059 summaries
['The decisions follow a meeting of government ministers at the House of the Estates on Thursday afternoon.', 'The median rent for a studio apartment in central Helsinki was 809 euros per month, while they cost around 583 euros in downtown Tampere and 515 euros in the centre of Oulu.', "Emma Terho was a member of Finland's bronze-winning ice hockey teams in 1998 and 2010, and has since held several high profile roles within the IOC.", 'The Regional State Administrative Agency of Southern Finland said it has requested expert analysis on the pandemic situation and will decide on new regulations next week.', 'While some hospital districts have moved healthcare staff up the vaccination queue, a number of workers have not yet had the chance to receive the first dose.']


In [5]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/e5-small-v2")  # alternative: "all-MiniLM-L6-v2"



In [6]:
%%time
# %%time (cell magic) times the whole cell; a bare %time line only times an empty statement
embeddings=model.encode(summaries)


In [7]:
print(embeddings.__class__)
print(embeddings.shape)

<class 'numpy.ndarray'>
(1059, 384)


In [8]:
embeddings[:3]  # the first three embedding vectors

array([[-0.07388607,  0.09047342,  0.0257495 , ..., -0.0038086 ,
        -0.03786822,  0.00333837],
       [-0.05818596,  0.01598927,  0.02870906, ...,  0.00504935,
        -0.06872119,  0.003671  ],
       [-0.08355302,  0.04069423,  0.0364077 , ...,  0.01249816,
        -0.00039737,  0.03977849]], dtype=float32)

Now that was quite easy, wasn't it? Things stay easy as long as your data is small and fits into memory.
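To make "fits into memory" concrete, here is a back-of-the-envelope estimate of how much space a dense float32 embedding matrix takes. The shape (1059, 384) is the one printed above; the 100-million-document corpus is a hypothetical figure chosen only for illustration, and `embedding_bytes` is a helper made up for this sketch.

```python
def embedding_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of a dense embedding matrix (float32 by default)."""
    return n_vectors * dim * bytes_per_value

small = embedding_bytes(1059, 384)          # the news summaries above
large = embedding_bytes(100_000_000, 384)   # hypothetical large corpus

print(f"{small / 1e6:.2f} MB")
print(f"{large / 1e9:.1f} GB")
```

The news summaries need well under 2 MB, while the hypothetical corpus would need over 150 GB for the raw vectors alone, which is where compressed indices start to matter.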

<h1> Simple exhaustive lookup </h1>

* Embed the query
* Compare it with the news embeddings using, e.g., cosine similarity
* Pick the news item with the highest similarity
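The comparison step can be sketched in plain NumPy to show what a cosine-similarity routine computes: normalize every vector to unit length, then take dot products. The toy vectors and the `cosine_sim` helper are invented for this illustration; they are not the notebook's data or a library function.

```python
import numpy as np

def cosine_sim(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query row and every document row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T  # shape: (n_queries, n_docs)

toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_query = np.array([[0.0, 1.0]])

toy_sims = cosine_sim(toy_query, toy_docs)
toy_best = int(toy_sims.argmax(axis=1)[0])  # index of the most similar document
print(toy_sims, toy_best)
```

Vector length does not matter here: the third toy document points in the same direction as the query, so it wins even though it is twice as long.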

In [9]:
query="Comparison of rents in Helsinki and other parts of Finland"

q_emb=model.encode([query]) #we could do without the [...] but this way it is easy to extend to many queries at once
q_emb.shape

(1, 384)
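One caveat worth knowing: the model card for the E5 family describes inputs prefixed with `query: ` for search queries and `passage: ` for the texts being searched. This notebook encodes raw text without prefixes and still gets sensible matches. The helper below is a hypothetical sketch (`with_prefix` is not part of `sentence_transformers`) of how the prefixes could be added before calling `model.encode`.

```python
def with_prefix(texts, prefix):
    """Prepend an E5-style role prefix ("query: " or "passage: ") to each text."""
    return [f"{prefix}{t}" for t in texts]

prefixed_query = with_prefix(
    ["Comparison of rents in Helsinki and other parts of Finland"], "query: "
)
prefixed_docs = with_prefix(
    ["There are signs that Helsinki rents might be dropping."], "passage: "
)

# These lists would then be passed to model.encode(...) in place of the raw texts.
print(prefixed_query[0])
print(prefixed_docs[0])
```

Whether the prefixes change the ranking on this data set is not tested here.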

In [10]:
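from sklearn.metrics.pairwise import cosine_similarity  # import the function explicitly; a bare "import sklearn" does not reliably load submodules
similarities=cosine_similarity(q_emb,embeddings)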
print(similarities.shape)

(1, 1059)


In [11]:
best_match_idxs=similarities.argmax(axis=1)
print(best_match_idxs)

[1]


In [12]:
print(summaries[best_match_idxs[0]])

The median rent for a studio apartment in central Helsinki was 809 euros per month, while they cost around 583 euros in downtown Tampere and 515 euros in the centre of Oulu.


In [13]:
best_match_idxs_sorted=(-similarities).argsort(axis=1) #negate so that the highest similarity comes first
print(best_match_idxs_sorted.shape)
print(best_match_idxs_sorted)

(1, 1059)
[[  1 793 380 ... 192 506 129]]


In [14]:
for i in best_match_idxs_sorted[0,:5]:
    print(f"[{i}] {summaries[i]}")


[1] The median rent for a studio apartment in central Helsinki was 809 euros per month, while they cost around 583 euros in downtown Tampere and 515 euros in the centre of Oulu.
[793] There are signs that Helsinki rents might be dropping.
[380] People in Finland are thinking about returning to the office, but what will that look like?
[732] Women in Finland earn on average about 16 percent less than men.
[547] A day in a Finnish prison costs the same as a stay in a top hotel. Officials now favour cheaper ways for offenders to serve their sentences.
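The full `argsort` above orders all 1059 scores even though only the top five are printed. For larger collections, one common alternative is `np.argpartition`, which selects the k best without sorting the rest. A minimal sketch, with made-up toy scores and a hypothetical helper name `top_k`:

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores in a 1-D array, best first."""
    # argpartition moves the k largest scores into the last k slots (in arbitrary order)
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates by descending score
    return candidates[np.argsort(-scores[candidates])]

toy_scores = np.array([0.10, 0.90, 0.30, 0.75, 0.20, 0.60])
toy_top3 = top_k(toy_scores, 3)
print(toy_top3)
```

With the notebook's variables, the equivalent call would be `top_k(similarities[0], 5)`.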


<h1> FAISS </h1>

* What we had above is fast, simple, and will scale up quite a bit
* Let us try doing the same thing with FAISS

In [15]:
!pip3 install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [16]:

import faiss
dim=embeddings.shape[1]

#This is the simplest, exhaustive-search index (note: it ranks by L2 distance, not cosine similarity)
#basically a fancy implementation of the little thing we've got above
index=faiss.IndexFlatL2(dim)
index.add(embeddings)

In [17]:
distances,indices=index.search(q_emb,5)
for i in indices[0]:
    print(summaries[i])

The median rent for a studio apartment in central Helsinki was 809 euros per month, while they cost around 583 euros in downtown Tampere and 515 euros in the centre of Oulu.
There are signs that Helsinki rents might be dropping.
People in Finland are thinking about returning to the office, but what will that look like?
Women in Finland earn on average about 16 percent less than men.
A day in a Finnish prison costs the same as a stay in a top hotel. Officials now favour cheaper ways for offenders to serve their sentences.
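These five results are the same as the cosine-similarity top five from the exhaustive lookup, even though this index ranks by L2 distance. One plausible explanation, assuming the embeddings are unit-length (not verified in this notebook; checking the row norms of `embeddings` would confirm it), is that for unit vectors the squared L2 distance is a decreasing function of cosine similarity:

```latex
\begin{aligned}
\lVert q - d \rVert^2
  &= \lVert q \rVert^2 + \lVert d \rVert^2 - 2\, q \cdot d \\
  &= 2 - 2\cos(q, d)
  \qquad \text{when } \lVert q \rVert = \lVert d \rVert = 1 .
\end{aligned}
```

Under that assumption, the smallest distance corresponds to the largest cosine similarity, so both rankings agree. Note that `IndexFlatL2` reports squared L2 distances in `distances`.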


In [18]:
# Let's try one of the slightly more advanced indices

coarse_quantizer = faiss.IndexFlatL2(dim) #decides which Voronoi cell each vector is assigned to
# nlist -> how many partitions the data is divided into (number of Voronoi cells)
#          for large indices, this should be a pretty large number!
# n_partitions -> how many sub-vectors each vector is split into for quantization ("m" in lecture slides)
# nbits -> how many bits per quantized value, 8 means quantization into 256 values
nlist=10
n_partitions=8
assert dim%n_partitions==0, f"n_partitions {n_partitions} must divide dim {dim}"
nbits=8
index = faiss.IndexIVFPQ(coarse_quantizer, dim,
                         nlist, n_partitions, nbits)
index.nprobe = 5 #how many Voronoi cells to probe at search time
index.train(embeddings[:500]) #the index must be trained before adding: this learns the cell centroids and the PQ codebooks
index.add(embeddings)

distances,indices=index.search(q_emb,5)
for i in indices[0]:
    print(summaries[i])


The median rent for a studio apartment in central Helsinki was 809 euros per month, while they cost around 583 euros in downtown Tampere and 515 euros in the centre of Oulu.
There are signs that Helsinki rents might be dropping.
Street parking is to become much rarer in central Helsinki.
Women in Finland earn on average about 16 percent less than men.
Drivers rushed to tank up when the pump price at three stations in western Finland plummeted overnight to 14 cents/litre.
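Three of these five results also appear in the exact search above; the other two differ, which is what one would expect from a lossy, approximate index. To get a feel for how lossy the compression is, the sketch below works through the storage arithmetic for the parameters used in the cell (dim 384, 8 sub-vectors, 8 bits each). `pq_sizes` is a hypothetical helper written for this illustration, not a FAISS function, and it ignores vector ids and inverted-list overhead.

```python
def pq_sizes(dim: int, m: int, nbits: int) -> dict:
    """Storage arithmetic for product quantization of float32 vectors."""
    assert dim % m == 0, "m must divide dim"
    sub_dim = dim // m             # dimensions per sub-vector
    n_centroids = 2 ** nbits       # codebook entries per sub-vector
    return {
        "sub_dim": sub_dim,
        "n_centroids": n_centroids,
        "raw_bytes": dim * 4,                              # one uncompressed float32 vector
        "code_bytes": (m * nbits + 7) // 8,                # one compressed PQ code
        "codebook_bytes": m * n_centroids * sub_dim * 4,   # shared by all vectors
    }

sizes = pq_sizes(dim=384, m=8, nbits=8)
print(sizes)
```

Each 1536-byte vector is stored as an 8-byte code, a 192-fold reduction, with each 48-dimensional sub-vector replaced by the index of one of 256 centroids.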
