<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/3_TFIDFandEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text processing with vectors
In this lecture we focus on techniques that model text as vectors of floating-point numbers. This representation lets us easily compute similarities between words, sentences, and documents.
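
To make the idea concrete before using any library, here is a minimal sketch of a bag-of-words vector and cosine similarity in plain Python. The toy documents and the helper names `bag_of_words` and `cosine` are assumptions made for illustration only.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a text to a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (hypothetical, not part of the lecture data)
docs = ["the cat sat on the mat", "the dog sat on the log", "stocks fell sharply today"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(round(cosine(vectors[0], vectors[1]), 2))  # 0.75: the documents share "the", "sat", "on"
print(round(cosine(vectors[0], vectors[2]), 2))  # 0.0: no words in common
```

Documents that share many terms point in similar directions, so their cosine is close to 1; documents with no terms in common are orthogonal.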

In [None]:
!pip install scikit-learn
!pip install nltk



In [None]:
from nltk.tokenize import word_tokenize
import nltk
import numpy as np
import json

nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json

--2025-02-11 12:47:05--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2025-02-11 12:47:05 (49.4 MB/s) - ‘5articles.json’ saved [12566/12566]



In [None]:
with open("5articles.json", "r") as f:
    articles = json.load(f)

articles

[{'title': 'American Airlines orders 60 Overture supersonic jets',
  'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
  'date': '2022-08-18',
  'source': 'The New York Times'},
 {'title': "Conte: 'Chelsea are not in the race to sign Sanchez'",
  'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday evening decided to end thei

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
titles = [a["title"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(titles)

In [None]:
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), index=titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.loc['zz_Document Frequency'] = (tfidf_df > 0).sum()
tfidf_df[['airlines', 'chelsea', 'car', 'murder', 'think', 'one','the', 'to']].sort_index().round(decimals=2)

title,airlines,chelsea,car,murder,think,one,the,to
'One-punch killer's sentence will make others think twice',0.0,0.0,0.0,0.0,0.33,0.33,0.0,0.0
American Airlines orders 60 Overture supersonic jets,0.38,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Conte: 'Chelsea are not in the race to sign Sanchez',0.0,0.32,0.0,0.0,0.0,0.0,0.32,0.26
Gunman opens fire on car just metres from scene of Hamid Sanambar murder,0.0,0.0,0.28,0.28,0.0,0.0,0.0,0.0
Leclerc dedicates win to Hubert,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37
zz_Document Frequency,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
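
As a rough cross-check of the table above, the sketch below recomputes a few weights in plain Python. It assumes what we understand to be scikit-learn's default settings: lowercasing, tokens of at least two word characters, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2-normalized rows. The helper names `tokenize` and `tfidf_rows` are hypothetical.

```python
import math
import re
from collections import Counter

titles = [
    "American Airlines orders 60 Overture supersonic jets",
    "Conte: 'Chelsea are not in the race to sign Sanchez'",
    "Gunman opens fire on car just metres from scene of Hamid Sanambar murder",
    "'One-punch killer's sentence will make others think twice'",
    "Leclerc dedicates win to Hubert",
]

def tokenize(text):
    # Assumed default tokenization: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for counts in term_counts for term in counts)
    # Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in df.items()}
    rows = []
    for counts in term_counts:
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()})
    return rows

rows = tfidf_rows(titles)
print(round(rows[0]["airlines"], 2))  # 0.38, as in the table
print(round(rows[1]["chelsea"], 2))   # 0.32
print(round(rows[1]["to"], 2))        # 0.26: "to" occurs in two titles, so its idf is lower
print(round(rows[4]["to"], 2))        # 0.37: same idf, but a shorter title
```

The same term can get different weights in different documents because each row is normalized separately: "to" carries more weight in the five-token Leclerc title than in the ten-token Conte title.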


In [None]:
def get_top_n_words(documents, tfidf_vectorizer, count_vectorizer, top_n = 10):
  # Fit both vectorizers on the documents, then rank terms by their average weight per document
  tfidf_vectors, count_vectors = tfidf_vectorizer.fit_transform(documents), count_vectorizer.fit_transform(documents)
  feature_names_tfidf, feature_names_count = tfidf_vectorizer.get_feature_names_out(), count_vectorizer.get_feature_names_out()
  avg_tfidf_per_word, avg_count_per_word = np.mean(tfidf_vectors.toarray(), axis=0), np.mean(count_vectors.toarray(), axis=0)
  top_indices_tfidf, top_indices_count = np.argsort(avg_tfidf_per_word)[-top_n:][::-1], np.argsort(avg_count_per_word)[-top_n:][::-1]
  top_words_tfidf = [(feature_names_tfidf[i], round(avg_tfidf_per_word[i]*100)/100) for i in top_indices_tfidf]
  top_words_count = [(feature_names_count[i], round(avg_count_per_word[i]*100)/100) for i in top_indices_count]
  print("TFIDF       -        COUNT")
  for tf, cf in zip(top_words_tfidf, top_words_count):
    print("{} ({})   -   {} ({})".format(tf[0], tf[1], cf[0], cf[1]))

In [None]:
maintexts = [a["maintext"] for a in articles]
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
the (0.38)   -   the (23.8)
to (0.21)   -   to (12.6)
of (0.16)   -   of (8.6)
in (0.14)   -   in (7.6)
that (0.1)   -   and (6.0)
his (0.09)   -   that (6.0)
and (0.09)   -   his (5.6)
on (0.08)   -   was (5.2)
was (0.08)   -   on (5.0)
at (0.07)   -   he (4.0)


In [None]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
count_vectorizer = CountVectorizer(input='content', stop_words="english")
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
area (0.07)   -   said (3.0)
reilly (0.07)   -   reilly (2.4)
said (0.07)   -   ellis (2.2)
hubert (0.06)   -   hall (2.2)
leclerc (0.06)   -   luke (2.2)
ellis (0.06)   -   mr (1.8)
hall (0.06)   -   don (1.6)
luke (0.06)   -   brien (1.4)
concorde (0.06)   -   mother (1.2)
don (0.06)   -   described (1.2)


In [None]:
query = "car"
tfidf_vectors = tfidf_vectorizer.fit_transform(maintexts)
query_vector = tfidf_vectorizer.transform([query])
# get the top 3 results by cosine similarity
cosine_similarities = cosine_similarity(query_vector, tfidf_vectors).flatten()
top_indices = np.argsort(cosine_similarities)[::-1][:3]
print("Top 3 matching documents with \"{}\":".format(query))
for index in top_indices:
    print(f"Score: {cosine_similarities[index]:.4f} - {maintexts[index][:200]}...")

Top 3 matching documents with "car":
Score: 0.1722 - Hamid Sanambar
Gardai are hunting for a gunman who opened fire on a car in north Dublin - just metres from where Hamid Sanambar was gunned down last week.
Emergency services were alerted to reports of...
Score: 0.0000 - Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romping to victory at the Belgian Grand Prix.
Less than 24 hours after Leclerc's French motor racing contempor...
Score: 0.0000 - Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in Tallaght, where Hall Ellis had earlier accused Luke O'Reilly of talking to his girlfriend
The mother of a...


In [None]:
print("Car" in maintexts[1])
print("car" in maintexts[1])

True
False
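
The check above is a plain substring test: it is case-sensitive and also matches inside longer words, whereas the vectorizer works on lowercased tokens. A minimal sketch of the difference follows; the example sentence is hypothetical and not taken from the dataset.

```python
import re

# Hypothetical sentence, used only to illustrate the difference
text = "Chelsea reached the Carabao Cup semi-final."

# Substring tests: case-sensitive, and they match inside longer words
print("Car" in text)  # True, because of "Carabao"
print("car" in text)  # False, no lowercase "car" anywhere

# Token test: lowercase first, then compare whole tokens
tokens = re.findall(r"\b\w\w+\b", text.lower())
print("car" in tokens)  # False, "carabao" is a different token
```

So a document can contain the characters "Car" and still get a similarity of zero for the query "car".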


In [None]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
tokenized_corpus = [doc.split(" ") for doc in maintexts]
bm25 = BM25Okapi(tokenized_corpus)

In [None]:
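# Caveat: get_scores expects a tokenized query (a list of tokens). Passing the bare
# string "car" makes rank_bm25 iterate over its characters ('c', 'a', 'r'), which
# explains the nearly uniform scores below. To score the word itself, call
# bm25.get_scores(["car"]) instead.
scores = bm25.get_scores("car")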
for title, score in zip(titles, scores):
  print(title, " - ", score)

American Airlines orders 60 Overture supersonic jets  -  0.3616455097312771
Conte: 'Chelsea are not in the race to sign Sanchez'  -  0.50168811270542
Gunman opens fire on car just metres from scene of Hamid Sanambar murder  -  0.4852267619383623
'One-punch killer's sentence will make others think twice'  -  0.4880411225200318
Leclerc dedicates win to Hubert  -  0.5073575476960249
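
To see why the form of the query matters, here is a minimal plain-Python sketch of BM25 scoring. It is not the rank_bm25 implementation: the function name `bm25_scores`, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the non-negative idf variant ln(1 + (N - df + 0.5) / (df + 0.5)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against an iterable of query tokens."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency of each term
    df = Counter(term for doc in tokenized_corpus for term in set(doc))
    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Assumed idf variant that never goes negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus (hypothetical)
corpus = ["a gunman opened fire on a car", "the team won the cup final", "a new jet was ordered"]
tokenized = [doc.split(" ") for doc in corpus]

word_scores = bm25_scores(["car"], tokenized)  # query is a list with one token
char_scores = bm25_scores("car", tokenized)    # query is iterated as 'c', 'a', 'r'
print(word_scores)  # only the first document scores above zero
print(char_scores)  # the third document also scores, because it contains the token "a"
```

With a tokenized query only the document that mentions a car is ranked; with a bare string the score is driven by whichever single-character tokens happen to occur in the corpus.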


Applying Machine Learning in order to obtain embedding vectors

In [None]:
!pip install gensim
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/alice.txt
import gensim

--2025-02-11 12:47:54--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/alice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151255 (148K) [text/plain]
Saving to: ‘alice.txt’


2025-02-11 12:47:54 (4.13 MB/s) - ‘alice.txt’ saved [151255/151255]



In [None]:
with open("alice.txt", 'r') as alice_file:
  alice = alice_file.read().replace("\n", " ")
  alice_tokens = []
  for i in nltk.sent_tokenize(alice):
    sentence = []
    for j in word_tokenize(i):
      sentence.append(j.lower())
    alice_tokens.append(sentence)
alice_tokens[0]

['\ufeff\ufeff',
 '*',
 '*',
 '*',
 'start',
 'of',
 'the',
 'project',
 'gutenberg',
 'ebook',
 'alice',
 "'s",
 'adventures',
 'in',
 'wonderland',
 '*',
 '*',
 '*',
 '[',
 'illustration',
 ']',
 'alice',
 '’',
 's',
 'adventures',
 'in',
 'wonderland',
 'by',
 'lewis',
 'carroll',
 'the',
 'millennium',
 'fulcrum',
 'edition',
 '3.0',
 'contents',
 'chapter',
 'i',
 '.']

In [None]:
# CBOW model
cbow_model = gensim.models.Word2Vec(alice_tokens, min_count=1,
                                vector_size=100, window=5)
# Skip-gram model (sg=1)
skipgram_model = gensim.models.Word2Vec(alice_tokens, min_count=1, vector_size=100,
                                window=5, sg=1)
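
The two models differ in the prediction task they are trained on. The sketch below only builds the training examples for a single sentence with a window of 2; the helper name `context_windows` and the sample sentence are hypothetical, and details of real implementations (negative sampling, subsampling, randomly reduced windows) are omitted.

```python
def context_windows(tokens, window):
    """Yield (target, context) where context holds up to `window` words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]

# CBOW: predict the target word from its whole context
cbow_examples = [(context, target) for target, context in context_windows(sentence, 2)]

# Skip-gram: predict each context word from the target word
skipgram_examples = [
    (target, context_word)
    for target, context in context_windows(sentence, 2)
    for context_word in context
]

print(cbow_examples[2])       # (['alice', 'was', 'to', 'get'], 'beginning')
print(skipgram_examples[:2])  # [('alice', 'was'), ('alice', 'beginning')]
```

CBOW produces one example per position, while skip-gram produces one example per (target, context word) pair, which is why skip-gram usually sees more training examples on the same text.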

In [None]:
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
      cbow_model.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'wonderland' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'wonderland'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.98292464
Cosine similarity between 'alice' and 'wonderland' - SkipGram :  0.76471376


In [None]:
print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ",
      cbow_model.wv.similarity('alice', 'machines'))
print("Cosine similarity between 'alice' " + "and 'machines' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'machines' - CBOW :  0.9118765
Cosine similarity between 'alice' and 'machines' - SkipGram :  0.8741628


In [None]:
# get the 5 words whose vectors are most similar to "alice"
cbow_model.wv.most_similar('alice', topn=5)

[(':', 0.9998022317886353),
 ('that', 0.9997600317001343),
 ('the', 0.9997377991676331),
 ('all', 0.9997363090515137),
 (',', 0.9997309446334839)]

Now let's see how to handle phrases with word2vec. This is not the recommended solution, as "full-phrase" models like doc2vec have been shown to outperform word2vec on this task.
We can treat a phrase as a list of word2vec vectors and combine them with simple mathematical operations (e.g., sum, average, subtract).
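
One detail worth checking is that, under cosine similarity, summing and averaging the word vectors give the same result, because cosine ignores vector length. A minimal sketch with made-up 3-dimensional vectors (the values in `word_vectors` are hypothetical, not taken from a trained model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors for illustration only
word_vectors = {
    "alice": [0.9, 0.1, 0.3],
    "in": [0.2, 0.8, 0.1],
    "wonderland": [0.7, 0.2, 0.6],
    "rabbit": [0.8, 0.1, 0.5],
}

phrase = "alice in wonderland".split()
dims = 3

# Combine the word vectors component by component
summed = [sum(word_vectors[word][d] for word in phrase) for d in range(dims)]
averaged = [value / len(phrase) for value in summed]

sim_sum = cosine(summed, word_vectors["rabbit"])
sim_avg = cosine(averaged, word_vectors["rabbit"])
print(sim_sum, sim_avg)  # the two values agree up to floating-point error
```

The choice between sum and average therefore only matters for measures that depend on vector length, such as the dot product or Euclidean distance.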

In [None]:
query_phrase = "alice in wonderland"
#sum the vectors of the individual words
query_vector_sum = np.zeros(100)
for word in query_phrase.split():
  query_vector_sum += cbow_model.wv[word]

In [None]:
print("Cosine similarity with 'machines' - CBOW (SUM) : ",
      cosine_similarity([query_vector_sum], [cbow_model.wv['machines']])[0][0])
print("Cosine similarity with 'the' - CBOW (SUM) : ",
      cosine_similarity([query_vector_sum], [cbow_model.wv['the']])[0][0])

Cosine similarity with 'machines' - CBOW (SUM) :  0.9116884642601564
Cosine similarity with 'the' - CBOW (SUM) :  0.9997253270972859


We can also apply this concept to entity embeddings: Wikipedia2Vec provides vectors for both words and Wikipedia entities.

In [None]:
!pip install wikipedia2vec
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/enwiki_20180420_100d_part.txt

Collecting wikipedia2vec
  Downloading wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.0 kB)
Collecting lmdb (from wikipedia2vec)
  Downloading lmdb-1.6.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Collecting mwparserfromhell (from wikipedia2vec)
  Downloading mwparserfromhell-0.6.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Downloading wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
Downloading lmdb-1.6.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (297 kB)
Downloading mwparserfromhell-0.6.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (196 kB)

In [None]:
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load_text("enwiki_20180420_100d_part.txt")

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('the'), 5)

[ItemWithScore(item=<Word the>, score=1.0000000000000002),
 ItemWithScore(item=<Word of>, score=0.8721518672047108),
 ItemWithScore(item=<Word in>, score=0.8169867648118897),
 ItemWithScore(item=<Word a>, score=0.779299496137427),
 ItemWithScore(item=<Word biology>, score=0.3447563348657311)]

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('biology'), 5)

[ItemWithScore(item=<Word biology>, score=0.9999999999999998),
 ItemWithScore(item=<Word biotechnology>, score=0.7477050583513458),
 ItemWithScore(item=<Entity Biology>, score=0.739285025982951),
 ItemWithScore(item=<Entity Biotechnology>, score=0.6665049773155601),
 ItemWithScore(item=<Word of>, score=0.3983874277237702)]

Embeddings can also be learned for the nodes of a graph.

In [None]:
!pip install networkx node2vec
import networkx as nx
from node2vec import Node2Vec

Collecting node2vec
  Downloading node2vec-0.5.0-py3-none-any.whl.metadata (849 bytes)
Downloading node2vec-0.5.0-py3-none-any.whl (7.2 kB)
Installing collected packages: node2vec
Successfully installed node2vec-0.5.0


We generate random walks of length 30, with 200 walks starting from each node.
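
The walks play the role of "sentences" for a word2vec-style model, with node ids as "words". The sketch below generates uniform random walks in plain Python, which corresponds to node2vec with p = q = 1; the biased second-order walks that node2vec supports are omitted. The function name `random_walks` and the tiny graph are hypothetical.

```python
import random

def random_walks(adjacency, walk_length, num_walks, seed=0):
    """Generate `num_walks` uniform random walks of `walk_length` nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            # Node ids become string tokens, like words in a sentence
            walks.append([str(node) for node in walk])
    return walks

# Tiny undirected graph as an adjacency list (hypothetical)
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adjacency, walk_length=30, num_walks=200)

print(len(walks))     # 800 walks: 200 per node, 4 nodes
print(walks[0][:10])  # first ten steps of one walk starting at node 0
```

Nodes that often appear close together in these walks end up with similar embeddings, in the same way that words sharing contexts do.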

In [None]:
G = nx.fast_gnp_random_graph(n=100, p=0.5)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)

Computing transition probabilities:   0%|          | 0/100 [00:00<?, ?it/s]

In [None]:
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [None]:
model.wv.save_word2vec_format("embeddings_node2vec.txt")

In [None]:
embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}

In [None]:
embeddings["0"]

array([ 0.02811118, -0.19869821,  0.02688526, -0.00462381,  0.09133011,
        0.01881642, -0.20118372, -0.27657598, -0.05830151, -0.12739585,
        0.0033474 , -0.19003619, -0.06725416, -0.1191246 ,  0.09743958,
        0.05735265, -0.14829881,  0.06482732,  0.09081763,  0.16073528,
        0.11511111,  0.18987186,  0.07656778, -0.02186786, -0.07698189,
        0.05943468,  0.02583197,  0.01035995, -0.0456528 ,  0.160254  ,
       -0.15718094, -0.00581779, -0.1072759 , -0.14463033, -0.04525446,
       -0.07229756,  0.02464688,  0.07242435,  0.28599098, -0.21982786,
       -0.00163014, -0.02298314, -0.22220005,  0.05077866, -0.00036252,
       -0.0834906 ,  0.24764869, -0.01062005,  0.06816956, -0.05606081,
       -0.09813996, -0.1181584 ,  0.08337452,  0.10293908,  0.00787122,
        0.03153153, -0.02092545, -0.12008427, -0.16192147,  0.22136568,
       -0.02357381, -0.0193423 , -0.037785  ,  0.01323807], dtype=float32)

## Facebook FAISS
FAISS is a library for efficient similarity search and clustering of dense vectors. It is available in CPU and GPU builds.
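
As a baseline for what "similarity search" means, the sketch below performs an exact brute-force nearest-neighbor search in plain Python. It does not use the FAISS API; the names `squared_l2` and `nearest_neighbors` and the toy vectors are hypothetical. The point of a library like FAISS is to return the same kind of result far faster on large collections.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_neighbors(query, vectors, k):
    """Return the k closest vectors as (index, squared distance), nearest first."""
    distances = [(squared_l2(query, vector), index) for index, vector in enumerate(vectors)]
    distances.sort()  # exhaustive search: every vector is compared to the query
    return [(index, distance) for distance, index in distances[:k]]

# Toy collection of 2-dimensional vectors (hypothetical)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
result = nearest_neighbors([0.9, 0.1], vectors, k=2)

print([index for index, _ in result])  # [1, 0]
```

This exhaustive scan costs one distance computation per stored vector for every query, which is the cost that indexing structures aim to reduce.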

In [None]:
!pip install faiss-cpu
!pip install sentence-transformers

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence-transformers)
 

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/500news.json

--2025-02-11 13:32:03--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/500news.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 147867 (144K) [text/plain]
Saving to: ‘500news.json’


2025-02-11 13:32:03 (4.45 MB/s) - ‘500news.json’ saved [147867/147867]



In [None]:
import json
with open("500news.json", "r") as f:
    articles = json.load(f)
articles[10]

{'date': '2019-12-04',
 'maintext': "Zelimkhan Khangoshvili, former Chechen rebel commander, was murdered on 23 August in a park in Berlin. According to the German prosecutor's office, the murder was carried out 'either on behalf of the Russian state authorities or on behalf of the Autonomous Chechen Republic, part of the Russian Federation'. Replication of Moscow: hostile act, we will respond symmetrically. Merkel: “From Moscow no help”",
 'author': 'Paola Candreva',
 'source': 'La Repubblica'}

In [None]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
sentences = []
for a in [a["maintext"] for a in articles]:
  sent_list = tokenizer.tokenize(a)
  for s in sent_list:
    sentences.append(s)
len(sentences)

776

### Sentence embeddings

We build a dense vector representation of each sentence with a pre-trained sentence-embedding model; the code lists a few alternative models as options.

Other models in: https://sbert.net/docs/pretrained_models.html

We limit the number of sentences for time reasons.

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
from sentence_transformers.util import dot_score
from sentence_transformers.util import cos_sim
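
The two scoring functions imported above differ only in normalization: cosine similarity is the dot product of the vectors after scaling each to unit length. A minimal plain-Python sketch of that relationship follows; the helper names and the two sample vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [3.0, 4.0]
b = [1.0, 2.0]

print(dot(a, b))                        # 11.0: depends on the vector lengths
print(cosine(a, b))                     # length-independent similarity
print(dot(normalize(a), normalize(b)))  # equals the cosine up to floating-point error
```

When a model already outputs unit-length embeddings, the two scores rank results identically and the dot product is the cheaper one to compute.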

In [None]:
## Other models:
##    model = SentenceTransformer('bert-base-nli-mean-tokens')
##    model = SentenceTransformer("hkunlp/instructor-large")

# Initialize sentence transformer model
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# create sentence embeddings - we limit to s sentences because of time reasons
s = 400
sentence_embeddings = model.encode(sentences[:s])

print(f"\n\nnumber of examples = {sentence_embeddings.shape[0]}, and number of dimensions = {sentence_embeddings.shape[1]}")

