<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/3_TFIDFandEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text processing with vectors
In this lecture we focus on techniques that model text as vectors of floating-point numbers. This representation makes it easy to compute similarities between words, sentences, and documents.
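As a first intuition, here is a minimal sketch of what "text as a vector" and "similarity" can mean. It assumes the simplest possible representation (raw word counts) and a naive whitespace tokenizer. The helper names `bag_of_words` and `cosine_similarity` and the three toy sentences are made up for illustration and are not part of the lecture code.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: a deliberately naive tokenizer
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # a and b are sparse vectors stored as {word: count}
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy sentences (hypothetical examples)
doc1 = bag_of_words("the jet flies fast")
doc2 = bag_of_words("the jet lands")
doc3 = bag_of_words("chelsea sign a forward")

print(cosine_similarity(doc1, doc2))  # shares "the" and "jet" with doc1
print(cosine_similarity(doc1, doc3))  # shares no words with doc1
```

Sentences that share words get a similarity above zero, while sentences with no words in common get exactly zero. The rest of the notebook replaces raw counts with TF-IDF weights.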

In [1]:
!pip install scikit-learn
!pip install nltk



In [2]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
import numpy as np
import json

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json

--2025-02-07 14:18:19--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2025-02-07 14:18:19 (76.1 MB/s) - ‘5articles.json’ saved [12566/12566]



In [3]:
with open("5articles.json", "r") as f:
    articles = json.load(f)

articles

[{'title': 'American Airlines orders 60 Overture supersonic jets',
  'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
  'date': '2022-08-18',
  'source': 'The New York Times'},
 {'title': "Conte: 'Chelsea are not in the race to sign Sanchez'",
  'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday evening decided to end thei

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
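# input='content' means the vectorizers receive the raw strings directly
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
# Learn the vocabulary from the article titles and compute their TF-IDF vectors
titles = [a["title"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(titles)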

In [6]:
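import pandas as pd
# One row per title, one column per vocabulary word
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), index=titles, columns=tfidf_vectorizer.get_feature_names_out())
# Number of titles containing each word; the 'zz_' prefix makes this row sort last
tfidf_df.loc['zz_Document Frequency'] = (tfidf_df > 0).sum()
# Show a few selected words only
tfidf_df[['airlines', 'chelsea', 'car', 'murder', 'think', 'one','the', 'to']].sort_index().round(decimals=2)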

| title | airlines | chelsea | car | murder | think | one | the | to |
|---|---|---|---|---|---|---|---|---|
| 'One-punch killer's sentence will make others think twice' | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 | 0.0 |
| American Airlines orders 60 Overture supersonic jets | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Conte: 'Chelsea are not in the race to sign Sanchez' | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32 | 0.26 |
| Gunman opens fire on car just metres from scene of Hamid Sanambar murder | 0.0 | 0.0 | 0.28 | 0.28 | 0.0 | 0.0 | 0.0 | 0.0 |
| Leclerc dedicates win to Hubert | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.37 |
| zz_Document Frequency | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
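To see where the numbers in this table come from, the sketch below recomputes one row by hand. It assumes the default `TfidfVectorizer` settings, which use a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scale each document vector to unit L2 norm. The function name `tfidf_weights` is made up for this illustration.

```python
import math

def tfidf_weights(term_counts, doc_freq, n_docs):
    """TF-IDF weights for one document, with smoothed IDF and L2 normalisation."""
    raw = {}
    for term, tf in term_counts.items():
        # Smoothed IDF: acts as if one extra document contained every term
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        raw[term] = tf * idf
    # Scale the vector so that its Euclidean length is 1
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

# Title "Leclerc dedicates win to Hubert": every word occurs once
term_counts = {"leclerc": 1, "dedicates": 1, "win": 1, "to": 1, "hubert": 1}
# Document frequencies over the 5 titles: only "to" occurs in two of them
doc_freq = {"leclerc": 1, "dedicates": 1, "win": 1, "to": 2, "hubert": 1}

weights = tfidf_weights(term_counts, doc_freq, n_docs=5)
print(round(weights["to"], 2))       # 0.37, the value shown in the table
print(round(weights["leclerc"], 2))  # 0.46: rarer words get a higher weight
```

Because "to" appears in two titles, its IDF is lower than that of the other four words, so it ends up with the smallest weight in this title.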


In [7]:
def get_top_n_words(documents, tfidf_vectorizer, count_vectorizer, top_n=10):
  # Fit both vectorizers on the same documents
  tfidf_vectors = tfidf_vectorizer.fit_transform(documents)
  count_vectors = count_vectorizer.fit_transform(documents)
  feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
  feature_names_count = count_vectorizer.get_feature_names_out()
  # Average each word's score over all documents
  avg_tfidf_per_word = np.mean(tfidf_vectors.toarray(), axis=0)
  avg_count_per_word = np.mean(count_vectors.toarray(), axis=0)
  # Indices of the top_n words, highest average first
  top_indices_tfidf = np.argsort(avg_tfidf_per_word)[-top_n:][::-1]
  top_indices_count = np.argsort(avg_count_per_word)[-top_n:][::-1]
  top_words_tfidf = [(feature_names_tfidf[i], round(float(avg_tfidf_per_word[i]), 2)) for i in top_indices_tfidf]
  top_words_count = [(feature_names_count[i], round(float(avg_count_per_word[i]), 2)) for i in top_indices_count]
  # Print the two rankings side by side
  print("TFIDF       -        COUNT")
  for tf, cf in zip(top_words_tfidf, top_words_count):
    print("{} ({})   -   {} ({})".format(tf[0], tf[1], cf[0], cf[1]))

In [8]:
maintexts = [a["maintext"] for a in articles]
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
the (0.38)   -   the (23.8)
to (0.21)   -   to (12.6)
of (0.16)   -   of (8.6)
in (0.14)   -   in (7.6)
that (0.1)   -   and (6.0)
his (0.09)   -   that (6.0)
and (0.09)   -   his (5.6)
on (0.08)   -   was (5.2)
was (0.08)   -   on (5.0)
at (0.07)   -   he (4.0)


In [9]:
# Repeat the ranking, this time dropping scikit-learn's built-in English stop words
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
count_vectorizer = CountVectorizer(input='content', stop_words="english")
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
area (0.07)   -   said (3.0)
reilly (0.07)   -   reilly (2.4)
said (0.07)   -   ellis (2.2)
hubert (0.06)   -   hall (2.2)
leclerc (0.06)   -   luke (2.2)
ellis (0.06)   -   mr (1.8)
hall (0.06)   -   don (1.6)
luke (0.06)   -   brien (1.4)
concorde (0.06)   -   mother (1.2)
don (0.06)   -   described (1.2)


## Facebook FAISS
FAISS is a library from Facebook AI Research (now Meta AI) for efficient similarity search and clustering of dense vectors. It is available in CPU and GPU versions.
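To make "similarity search" concrete, here is a minimal brute-force sketch in plain NumPy. It does not use the FAISS API. It only illustrates the exact nearest-neighbour problem that such libraries are designed to solve faster. The function name `brute_force_search`, the random data, and the dimensions are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical database: 1000 dense vectors of dimension 64
database = rng.random((1000, 64), dtype=np.float32)
# Hypothetical queries: slightly perturbed copies of the first 3 database vectors
queries = database[:3] + 0.001

def brute_force_search(database, queries, k):
    # Squared L2 distance between every query and every database vector
    diff = queries[:, None, :] - database[None, :, :]
    dists = np.sum(diff ** 2, axis=2)
    # For each query, indices of the k closest vectors, nearest first
    idx = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx

distances, indices = brute_force_search(database, queries, k=5)
print(indices[:, 0])  # nearest neighbour of each query
```

This approach compares each query against the whole database, so its cost grows linearly with the number of stored vectors, which is what motivates dedicated index structures for large collections.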

In [4]:
!pip install faiss-cpu

Collecting faiss-cpu
  Using cached faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Using cached faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0
