<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/2_TFIDFandEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text processing with vectors
In this lecture we focus on techniques that model text as vectors of floating-point numbers. This lets us easily process words, sentences, and documents and compute similarities between them.
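As a first intuition for what "similarity between vectors" means, here is a minimal sketch that turns sentences into term-count vectors and compares them with cosine similarity. The toy sentences and the helper names `bag_of_words` and `cosine` are invented for this illustration; the cells below use scikit-learn instead.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a text to a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy sentences, invented for this illustration
docs = ["the cat sat on the mat", "the cat ate the fish", "stocks fell on monday"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

sim_cats = cosine(vectors[0], vectors[1])    # two sentences about a cat
sim_other = cosine(vectors[0], vectors[2])   # unrelated sentences sharing only "on"
print(round(sim_cats, 3), round(sim_other, 3))
```

The two cat sentences share more terms, so their vectors point in closer directions and the first score is higher than the second.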

In [1]:
!pip install scikit-learn
!pip install nltk



In [2]:
from nltk.tokenize import word_tokenize
import nltk
import numpy as np
import json

nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json

--2025-02-24 16:56:19--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json.2’


2025-02-24 16:56:19 (4.32 MB/s) - ‘5articles.json.2’ saved [12566/12566]



In [3]:
with open("5articles.json", "r") as f:
    articles = json.load(f)

articles

[{'title': 'American Airlines orders 60 Overture supersonic jets',
  'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
  'date': '2022-08-18',
  'source': 'The New York Times'},
 {'title': "Conte: 'Chelsea are not in the race to sign Sanchez'",
  'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday evening decided to end thei

In [4]:
from sklearn.feature_extraction.text import CountVectorizer # Just counts the occurrences of terms
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
titles = [a["title"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(titles)

In [6]:
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), index=titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.loc['__Document Frequency__'] = (tfidf_df > 0).sum()
tfidf_df[['airlines', 'chelsea', 'car', 'murder', 'think', 'one','the', 'to']].sort_index().round(decimals=2)

| Title | airlines | chelsea | car | murder | think | one | the | to |
|---|---|---|---|---|---|---|---|---|
| 'One-punch killer's sentence will make others think twice' | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 | 0.0 |
| American Airlines orders 60 Overture supersonic jets | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Conte: 'Chelsea are not in the race to sign Sanchez' | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32 | 0.26 |
| Gunman opens fire on car just metres from scene of Hamid Sanambar murder | 0.0 | 0.0 | 0.28 | 0.28 | 0.0 | 0.0 | 0.0 | 0.0 |
| Leclerc dedicates win to Hubert | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.37 |
| `__Document Frequency__` | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
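The weights in this table can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings as I understand them: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, L2 normalization of each row, and tokens of two or more word characters. The helper names `tokenize` and `tfidf_row` are made up for this illustration.

```python
import math
import re
from collections import Counter

titles = [
    "American Airlines orders 60 Overture supersonic jets",
    "Conte: 'Chelsea are not in the race to sign Sanchez'",
    "Gunman opens fire on car just metres from scene of Hamid Sanambar murder",
    "'One-punch killer's sentence will make others think twice'",
    "Leclerc dedicates win to Hubert",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumed default pattern)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(t) for t in titles]
n_docs = len(tokenized)
# Document frequency: number of titles that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

leclerc = tfidf_row(tokenized[4])
conte = tfidf_row(tokenized[1])
print(round(leclerc["to"], 2), round(conte["to"], 2), round(conte["chelsea"], 2))
```

Under these assumptions the three values round to 0.37, 0.26, and 0.32, matching the table. The term "to" occurs in two titles, so its idf is lower than that of "chelsea", and it weighs more in the Leclerc title simply because that title is shorter, so normalization shrinks it less.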


In [7]:
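def get_top_n_words(documents, tfidf_vectorizer, count_vectorizer, top_n = 10):
  # Fit both vectorizers on the same documents; each returns a sparse (CSR) matrix
  tfidf_vectors, count_vectors = tfidf_vectorizer.fit_transform(documents), count_vectorizer.fit_transform(documents)
  feature_names_tfidf, feature_names_count = tfidf_vectorizer.get_feature_names_out(), count_vectorizer.get_feature_names_out()
  # Rank the non-zero (document, term) entries by value, largest first.
  # A term can appear once per document, so the same word may be listed several times.
  # Note: the slice [:-(top_n):-1] keeps top_n - 1 entries (9 rows for top_n = 10).
  top_indices_tfidf, top_indices_count = np.argsort(tfidf_vectors.data)[:-(top_n):-1], np.argsort(count_vectors.data)[:-(top_n):-1]
  print("TFIDF       -        COUNT")
  for tfidx, cidx in zip(top_indices_tfidf, top_indices_count):
    # .indices maps a position in .data back to its column, i.e. to the term
    print("{} ({}) - {} ({})".format(feature_names_tfidf[tfidf_vectors.indices[tfidx]], tfidf_vectors.data[tfidx], feature_names_count[count_vectors.indices[cidx]], count_vectors.data[cidx]))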

In [8]:
maintexts = [a["maintext"] for a in articles]
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
the (0.5013872645606862) - the (49)
the (0.41332887693486225) - the (26)
the (0.412342924920705) - to (25)
the (0.37787148056358155) - that (22)
to (0.34351952778507416) - the (22)
of (0.28551663727015064) - of (21)
area (0.2428194172589411) - his (20)
concorde (0.23967548943773337) - to (20)
his (0.23654432389849608) - the (19)


In [9]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
count_vectorizer = CountVectorizer(input='content', stop_words="english")
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
area (0.34430193043881974) - reilly (12)
reilly (0.32845517992013706) - ellis (11)
hubert (0.3027643279414842) - hall (11)
leclerc (0.3027643279414842) - luke (11)
luke (0.3010839149267923) - said (9)
ellis (0.3010839149267923) - mr (8)
hall (0.3010839149267923) - brien (7)
concorde (0.30058998587414587) - area (6)
chelsea (0.2526808530319859) - don (6)


In [10]:
!pip install gensim
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/alice.txt
import gensim

--2025-02-24 17:05:33--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/alice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151255 (148K) [text/plain]
Saving to: ‘alice.txt.4’


2025-02-24 17:05:35 (1.23 MB/s) - ‘alice.txt.4’ saved [151255/151255]



In [11]:
with open("alice.txt", 'r') as alice_file:
  alice = alice_file.read().lower()
sentences = [a for a in alice.split('\n') if a]
print(sentences[:10])
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
get_top_n_words(sentences, tfidf_vectorizer, count_vectorizer)

["\ufeff\ufeff*** start of the project gutenberg ebook alice's adventures in", 'wonderland ***', '[illustration]', 'alice’s adventures in wonderland', 'by lewis carroll', 'the millennium fulcrum edition 3.0', 'contents', ' chapter i.     down the rabbit-hole', ' chapter ii.    the pool of tears', ' chapter iii.   a caucus-race and a long tale']
TFIDF       -        COUNT
wonderland (1.0) - you (5)
person (1.0) - you (5)
the (1.0) - you (5)
know (1.0) - not (5)
moved (1.0) - not (5)
loud (1.0) - you (5)
alice (1.0) - the (5)
telescope (1.0) - mouse (5)
otherwise (1.0) - the (4)


In [12]:
query = "car"
tfidf_vectors = tfidf_vectorizer.fit_transform(maintexts)
query_vector = tfidf_vectorizer.transform([query])
# rank the documents by cosine similarity and keep the top 3
cosine_similarities = cosine_similarity(query_vector, tfidf_vectors).flatten()
top_indices = np.argsort(cosine_similarities)[::-1][:3]
print("Top 3 matching documents with \"{}\":".format(query))
for index in top_indices:
    print(f"\nScore: {cosine_similarities[index]:.4f} - {maintexts[index][:200]}...")

Top 3 matching documents with "car":

Score: 0.1214 - Hamid Sanambar
Gardai are hunting for a gunman who opened fire on a car in north Dublin - just metres from where Hamid Sanambar was gunned down last week.
Emergency services were alerted to reports of...

Score: 0.0000 - Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romping to victory at the Belgian Grand Prix.
Less than 24 hours after Leclerc's French motor racing contempor...

Score: 0.0000 - Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in Tallaght, where Hall Ellis had earlier accused Luke O'Reilly of talking to his girlfriend
The mother of a...


In [13]:
print("Car" in maintexts[1])
print("car" in maintexts[1])

True
False


In [14]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi



In [15]:
# Naive whitespace tokenization: case and punctuation are kept as-is
tokenized_corpus = [doc.split(" ") for doc in maintexts]
bm25 = BM25Okapi(tokenized_corpus)

In [16]:
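print("BM25 score of \"car\"\n")
# Caution: get_scores expects a list of tokens (e.g. ["car"]). A bare string is
# iterated character by character, so the scores below reflect the single-character
# tokens 'c', 'a' and 'r' rather than the word "car".
scores = bm25.get_scores("car")
for title, score in zip(titles, scores):
  print(title, " - ", score)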

BM25 score of "car"

American Airlines orders 60 Overture supersonic jets  -  0.3616455097312771
Conte: 'Chelsea are not in the race to sign Sanchez'  -  0.50168811270542
Gunman opens fire on car just metres from scene of Hamid Sanambar murder  -  0.4852267619383623
'One-punch killer's sentence will make others think twice'  -  0.4880411225200318
Leclerc dedicates win to Hubert  -  0.5073575476960249
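Every document receives a non-zero score above, including those that never mention a car. One likely explanation is that the query was passed as a string rather than as a list of tokens. The sketch below illustrates the difference with a small, self-written BM25 scorer. It is a simplified variant (its idf smoothing is one common choice and is not claimed to match `rank_bm25` exactly), and the toy corpus, the function name `bm25_scores`, and the parameter values `k1=1.5`, `b=0.75` are assumptions made for this illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against a tokenized query with Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            # Number of documents containing the term
            n_t = sum(1 for d in tokenized_corpus if term in d)
            # Smoothed idf that is never negative (one common variant)
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            freq = tf[term]
            # Length normalization: long documents are penalized through b
            denom = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus, invented for this illustration
corpus = ["a gunman opened fire on a car", "a striker signed for a club", "a jet flies fast"]
tokenized = [doc.lower().split() for doc in corpus]

word_scores = bm25_scores("car".split(), tokenized)   # query tokens: ['car']
char_scores = bm25_scores(list("car"), tokenized)     # query tokens: ['c', 'a', 'r']
print(word_scores)
print(char_scores)
```

With the tokenized query only the first toy document scores above zero. With the character-level query every document matches through the token "a", which mirrors the pattern in the output above.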


Applying Machine Learning in order to obtain embedding vectors

In [17]:
alice_tokens = []
for i in nltk.sent_tokenize(alice):
  sentence = []
  for j in word_tokenize(i):
    sentence.append(j.lower())
  alice_tokens.append(sentence)
alice_tokens[0]

['\ufeff\ufeff',
 '*',
 '*',
 '*',
 'start',
 'of',
 'the',
 'project',
 'gutenberg',
 'ebook',
 'alice',
 "'s",
 'adventures',
 'in',
 'wonderland',
 '*',
 '*',
 '*',
 '[',
 'illustration',
 ']',
 'alice',
 '’',
 's',
 'adventures',
 'in',
 'wonderland',
 'by',
 'lewis',
 'carroll',
 'the',
 'millennium',
 'fulcrum',
 'edition',
 '3.0',
 'contents',
 'chapter',
 'i.',
 'down',
 'the',
 'rabbit-hole',
 'chapter',
 'ii',
 '.']

In [18]:
# CBOW model (the default, sg=0)
cbow_model = gensim.models.Word2Vec(alice_tokens, min_count=1,
                                vector_size=100, window=5)
# Skip-gram model (sg=1)
skipgram_model = gensim.models.Word2Vec(alice_tokens, min_count=1, vector_size=100,
                                window=5, sg=1)
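To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a single sentence. It only illustrates how the `window` parameter defines a word's context: real implementations add details such as negative sampling and frequent-word subsampling that are omitted here, and the function name `training_pairs` is made up for this illustration.

```python
def training_pairs(tokens, window=2):
    """Yield (center, context_words) for each position in a tokenized sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]

# CBOW: predict the center word from its whole context
cbow_examples = [(context, center) for center, context in training_pairs(sentence)]
# Skip-gram: predict each context word from the center word
skipgram_examples = [(center, word)
                     for center, context in training_pairs(sentence)
                     for word in context]

print(cbow_examples[2])
print(skipgram_examples[:3])
```

CBOW produces one example per position, while skip-gram produces one example per (center, context word) pair, so it sees many more training examples for the same text.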

In [19]:
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
      cbow_model.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'wonderland' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'wonderland'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9877026
Cosine similarity between 'alice' and 'wonderland' - SkipGram :  0.691606


In [20]:
print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ",
      cbow_model.wv.similarity('alice', 'machines'))
print("Cosine similarity between 'alice' " + "and 'machines' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'machines' - CBOW :  0.8677022
Cosine similarity between 'alice' and 'machines' - SkipGram :  0.86635345


In [21]:
# get the 5 words whose vectors are most similar to "alice"
cbow_model.wv.most_similar('alice', topn=5)

[(':', 0.9997751116752625),
 ('that', 0.9997420310974121),
 ('so', 0.999741792678833),
 ('and', 0.9997307658195496),
 ('this', 0.9997274279594421)]

Now let's see how to handle phrases with word2vec. This is not the recommended approach, as "full-phrase" models like doc2vec have been shown to outperform word2vec.
We can represent a phrase as a list of word2vec vectors and combine them with simple arithmetic (e.g., sum, average, or difference).
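A minimal sketch of the idea, using hypothetical 3-dimensional vectors instead of a trained model (the numbers and the helper `phrase_vector` are invented for this illustration). It also shows that, under cosine similarity, summing and averaging are interchangeable because they differ only by a scale factor.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this illustration
toy_vectors = {
    "alice": [0.9, 0.1, 0.3],
    "in": [0.2, 0.8, 0.1],
    "wonderland": [0.7, 0.2, 0.6],
    "rabbit": [0.8, 0.1, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def phrase_vector(phrase, vectors, average=True):
    """Combine word vectors element-wise, skipping out-of-vocabulary words."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        raise ValueError("no word of the phrase is in the vocabulary")
    summed = [sum(components) for components in zip(*known)]
    return [x / len(known) for x in summed] if average else summed

mean_vec = phrase_vector("alice in wonderland", toy_vectors, average=True)
sum_vec = phrase_vector("alice in wonderland", toy_vectors, average=False)
sim_mean = cosine(mean_vec, toy_vectors["rabbit"])
sim_sum = cosine(sum_vec, toy_vectors["rabbit"])
print(sim_mean, sim_sum)
```

The two similarities agree up to floating-point error. The choice between sum and average only matters for measures that depend on vector length, such as the dot product or Euclidean distance.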

In [22]:
query_phrase = "alice in wonderland"
#sum the vectors of the individual words
query_vector_sum = np.zeros(100)
for word in query_phrase.split():
  query_vector_sum += cbow_model.wv[word]

In [23]:
print("Cosine similarity with 'machines' - CBOW (SUM) : ",
      cosine_similarity([query_vector_sum], [cbow_model.wv['machines']])[0][0])
print("Cosine similarity with 'the' - CBOW (SUM) : ",
      cosine_similarity([query_vector_sum], [cbow_model.wv['the']])[0][0])

Cosine similarity with 'machines' - CBOW (SUM) :  0.8668031009529055
Cosine similarity with 'the' - CBOW (SUM) :  0.9996603074386445


The same idea extends to entity embeddings: Wikipedia2Vec, trained on Wikipedia, places words and Wikipedia entities in a shared vector space.

In [24]:
!pip install wikipedia2vec
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/enwiki_20180420_100d_part.txt

Collecting wikipedia2vec
  Downloading wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.0 kB)
Collecting lmdb (from wikipedia2vec)
  Downloading lmdb-1.6.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Collecting mwparserfromhell (from wikipedia2vec)
  Downloading mwparserfromhell-0.6.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Downloading wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
Downloading lmdb-1.6.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (297 kB)
Downloading mwparserfromhell-0.6.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (196 kB)

In [25]:
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load_text("enwiki_20180420_100d_part.txt")

In [26]:
wiki2vec.most_similar(wiki2vec.get_word('the'), 5)

[ItemWithScore(item=<Word the>, score=1.0000000000000002),
 ItemWithScore(item=<Word of>, score=0.8721518672047108),
 ItemWithScore(item=<Word in>, score=0.8169867648118897),
 ItemWithScore(item=<Word a>, score=0.779299496137427),
 ItemWithScore(item=<Word biology>, score=0.3447563348657311)]

In [27]:
wiki2vec.most_similar(wiki2vec.get_word('biology'), 5)

[ItemWithScore(item=<Word biology>, score=0.9999999999999998),
 ItemWithScore(item=<Word biotechnology>, score=0.7477050583513458),
 ItemWithScore(item=<Entity Biology>, score=0.739285025982951),
 ItemWithScore(item=<Entity Biotechnology>, score=0.6665049773155601),
 ItemWithScore(item=<Word of>, score=0.3983874277237702)]

Embeddings can also be learned for the nodes of a graph.

In [28]:
!pip install networkx node2vec
import networkx as nx
from node2vec import Node2Vec

Collecting node2vec
  Downloading node2vec-0.5.0-py3-none-any.whl.metadata (849 bytes)
Downloading node2vec-0.5.0-py3-none-any.whl (7.2 kB)
Installing collected packages: node2vec
Successfully installed node2vec-0.5.0


Below, node2vec samples random walks of length 30, with 200 walks starting from each node (`num_walks` is a per-node count, not a total).
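The role of the walk parameters can be pictured with a simplified sketch. It samples uniform random walks, that is, without the return and in-out biases that node2vec applies when choosing the next node. The toy graph and the function name `uniform_random_walks` are assumptions made for this illustration.

```python
import random

def uniform_random_walks(adjacency, walk_length, num_walks, seed=0):
    """Generate num_walks uniform random walks of up to walk_length nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            # Nodes become string "words" so a Word2Vec-style model can consume the walk
            walks.append([str(node) for node in walk])
    return walks

# Small toy graph as an adjacency list, invented for this illustration
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = uniform_random_walks(adjacency, walk_length=5, num_walks=3)
print(len(walks), walks[0])
```

Each walk plays the role of a sentence and each node the role of a word, so the resulting list can be fed to a Word2Vec-style model, where nodes that co-occur in walks end up with similar vectors.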

In [29]:
G = nx.fast_gnp_random_graph(n=100, p=0.5)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)


In [30]:
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [31]:
model.wv.save_word2vec_format("embeddings_node2vec.txt")

In [32]:
embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}

In [33]:
embeddings["0"]

array([ 0.02114162, -0.01181947,  0.2531696 ,  0.11677124,  0.1214668 ,
       -0.13313246,  0.07342637, -0.01340989, -0.1097443 , -0.01400342,
        0.1875674 , -0.2735659 , -0.11868636,  0.12196268,  0.08813833,
       -0.02144602,  0.03767652,  0.10469939,  0.1128641 ,  0.28707615,
        0.1911489 ,  0.1577747 ,  0.08131135, -0.03926132, -0.02961335,
        0.1032261 , -0.01775808,  0.10911202, -0.03192254, -0.01620896,
       -0.06426856,  0.00578381,  0.02563978, -0.01992829, -0.0269547 ,
       -0.1699783 ,  0.12017433,  0.1584796 ,  0.0192078 ,  0.13830477,
        0.1028373 ,  0.15709727, -0.03732548, -0.03472867,  0.11252451,
       -0.10906149, -0.06489538, -0.14028272,  0.02249607,  0.12640887,
       -0.06963255, -0.15898845,  0.08090515,  0.07493564,  0.13102487,
        0.18418434, -0.3010741 , -0.16893214, -0.20037538, -0.0446628 ,
        0.0966665 ,  0.00804324,  0.04701839,  0.07428567], dtype=float32)