# Content-based search

The Internet is a marvelous source of information. But simply knowing that information *exists* is not enough to *use* it. For example, we *know* that, somewhere on the Internet, there is a book on Natural Language Processing. But how can we find this book?

In this notebook, we are going to work with the following use case, which was also approached in [Amami et al., "An LDA-Based Approach to Scientific Paper Recommendation", Natural Language Processing and Information Systems, 2016](http://link.springer.com/10.1007/978-3-319-41754-7_17), based on ideas by [Griffiths and Steyvers, "Finding Scientific Topics", Proc. Natl. Acad. Sci. U.S.A., 2004](https://doi.org/10.1073/pnas.0307752101).

Suppose a scientist is writing an article. Articles usually start with a section called the "abstract", which summarizes the contents of the whole paper. We want our system to take the abstract we are working on and find existing articles that are likely to be relevant to it.

We will start by simulating our data with a subset of an ArXiv dataset available at Kaggle:

In [1]:
import pandas as pd 
import os
import kagglehub
from tqdm import tqdm
from pathlib import Path
    
path = kagglehub.dataset_download("tiagoft/arvix-data-filtered-for-cs-only-data")
path = Path(path)
df = pd.read_csv(path / 'arxiv-metadata-oai-snaptshot-cs-only.csv')

Downloading from https://www.kaggle.com/api/v1/datasets/download/tiagoft/arvix-data-filtered-for-cs-only-data?dataset_version_number=1...


100%|██████████| 37.3M/37.3M [00:04<00:00, 9.25MB/s]

Extracting files...





In [11]:
len(df)

62905

In [13]:
df.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0046,Denes Petz,"I. Csiszar, F. Hiai and D. Petz",A limit relation for entropy and channel capac...,"LATEX file, 11 pages","J. Math. Phys. 48(2007), 092102.",10.1063/1.2779138,,quant-ph cs.IT math.IT,,"In a quantum mechanical model, Diosi, Feldma...","[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2009-11-13,"[['Csiszar', 'I.', ''], ['Hiai', 'F.', ''], ['..."
1,704.0062,Tom\'a\v{s} Vina\v{r},"Rastislav \v{S}r\'amek, Bro\v{n}a Brejov\'a, T...",On-line Viterbi Algorithm and Its Relationship...,,Algorithms in Bioinformatics: 7th Internationa...,10.1007/978-3-540-74126-8_23,,cs.DS,,"In this paper, we introduce the on-line Vite...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2010-01-25,"[['Šrámek', 'Rastislav', ''], ['Brejová', 'Bro..."
2,704.0098,Jack Raymond,"Jack Raymond, David Saad",Sparsely-spread CDMA - a statistical mechanics...,"23 pages, 5 figures, figure 1 amended since pu...",J. Phys. A: Math. Theor. 40 No 41 (12 October ...,10.1088/1751-8113/40/41/004,,cs.IT math.IT,,"Sparse Code Division Multiple Access (CDMA),...","[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2009-11-13,"[['Raymond', 'Jack', ''], ['Saad', 'David', '']]"
3,704.0217,Wiroonsak Santipach,Wiroonsak Santipach and Michael L. Honig,Capacity of a Multiple-Antenna Fading Channel ...,,"IEEE Trans. Inf. Theory, vol. 55, no. 3, pp. 1...",10.1109/TIT.2008.2011437,,cs.IT math.IT,http://arxiv.org/licenses/nonexclusive-distrib...,Given a multiple-input multiple-output (MIMO...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2010-08-27,"[['Santipach', 'Wiroonsak', ''], ['Honig', 'Mi..."
4,704.0301,Akitoshi Kawamura,Akitoshi Kawamura,Differential Recursion and Differentially Alge...,"14 pages, 3 figures",Revised and published in ACM Trans. Comput. Lo...,10.1145/1507244.1507252,,cs.CC,,"Moore introduced a class of real-valued ""rec...","[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...",2009-04-19,"[['Kawamura', 'Akitoshi', '']]"


In [2]:
sample_title = "Enhancing Autonomous Agents with Multimodal Generative AI for Improved Human-AI Collaboration"
sample_abstract = """The integration of multimodal generative AI into autonomous agents presents a significant advancement in human-AI collaboration. 
This study explores the development of autonomous agents capable of processing and generating various data types,
including text-to-image and image-to-audio conversions. By leveraging multimodal generative AI, these agents can interpret and generate 
content across different modalities, enhancing their ability to interact with humans in more natural and intuitive ways.
We propose a novel framework that combines generative AI with transfer learning techniques to enable autonomous agents to adapt 
knowledge acquired from one context to another with minimal additional data. Our experiments demonstrate that this approach significantly
improves the agents' performance in tasks requiring human-AI collaboration, such as virtual reality environments and smart city applications.
The results highlight the potential of multimodal generative AI to revolutionize human-AI interaction, paving the way for more immersive 
and adaptive collaborative experiences.
"""
sample_keywords = ["autonomous agents", "multimodal generative AI", "human-AI collaboration", "transfer learning", "virtual reality", "smart city applications"]

## Exercise 1: search by keyword

Searching by keywords is relatively simple because we can use an inverted index. In fact, online search engines typically rely on inverted indexes.

Use your inverted index to find other relevant articles in our dataset, using the keywords provided by the abstract's author.

In [3]:
import re
from collections import defaultdict
from nltk.stem import WordNetLemmatizer

iidx = defaultdict(set)

def adicionar_no_indice(iidx, texto_novo, idx, lematizar=False):
    texto = texto_novo.lower()

    palavras = re.findall(r'\b\w+\b', texto)
    
    if lematizar:
        lematizer = WordNetLemmatizer()
        palavras = [ lematizer.lemmatize(w) for w in palavras ]

    for p in palavras:
        iidx[p].add(idx)

    return iidx

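def encontrar_no_indice(iidx, palavras, lematizar=False):
    # Normalize the query the same way the index was built (lowercase)
    palavras = [p.lower() for p in palavras]

    if lematizar:
        lematizer = WordNetLemmatizer()
        palavras = [lematizer.lemmatize(p) for p in palavras]

    # An empty query matches nothing
    if not palavras:
        return set()

    # .get avoids inserting empty entries into the defaultdict for unknown words
    docs = [iidx.get(p, set()) for p in palavras]
    # AND query: keep only the documents that contain every word
    return set.intersection(*docs)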

iidx = adicionar_no_indice(iidx, "texto novo novo texto eba oba yeyeye", 0)
print(iidx)

defaultdict(<class 'set'>, {'texto': {0}, 'novo': {0}, 'eba': {0}, 'oba': {0}, 'yeyeye': {0}})


In [4]:
from tqdm import tqdm
iidx = defaultdict(set)

for idx, content in tqdm(enumerate (df['abstract']), total=len(df)):
    iidx = adicionar_no_indice(iidx, content, idx, lematizar=False)

100%|██████████| 62905/62905 [00:03<00:00, 20319.71it/s]


In [8]:
docs_encontrados = encontrar_no_indice(iidx, ['covid'], lematizar=False)
print(f"Number of documents found: {len(docs_encontrados)}")
for i in docs_encontrados:
    print(df.iloc[i]['abstract'])

Number of documents found: 531
  The COVID-19 (coronavirus disease 2019) pandemic affected more than 186
million people with over 4 million deaths worldwide by June 2021. The magnitude
of which has strained global healthcare systems. Chest Computed Tomography (CT)
scans have a potential role in the diagnosis and prognostication of COVID-19.
Designing a diagnostic system which is cost-efficient and convenient to operate
on resource-constrained devices like mobile phones would enhance the clinical
usage of chest CT scans and provide swift, mobile, and accessible diagnostic
capabilities. This work proposes developing a novel Android application that
detects COVID-19 infection from chest CT scans using a highly efficient and
accurate deep learning algorithm. It further creates an attention heatmap,
augmented on the segmented lung parenchyma region in the CT scans through an
algorithm developed as a part of this work, which shows the regions of
infection in the lungs. We propose a selectio
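
The author's keywords are multi-word phrases, while the index above stores single words. One possible way to query with them, shown here as a minimal self-contained sketch on a toy collection, is to require every word of a phrase to be present (an AND query, with no check that the words are adjacent) and then rank documents by how many phrases they match. The helper names `tokenize`, `build_index` and `rank_by_keywords` are illustrative and not part of the notebook.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Same tokenization rule as the notebook: lowercase, then word characters
    return re.findall(r"\b\w+\b", text.lower())

def build_index(docs):
    # Map each word to the set of document ids that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def rank_by_keywords(index, keywords):
    # Count, for each document, how many keyword phrases it fully contains
    hits = defaultdict(int)
    for phrase in keywords:
        postings = [index.get(word, set()) for word in tokenize(phrase)]
        if not postings:
            continue
        for doc_id in set.intersection(*postings):
            hits[doc_id] += 1
    # Most matched phrases first; ties broken by document id
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Toy collection (made up for illustration)
docs = [
    "Transfer learning for autonomous agents in virtual reality",
    "A survey of smart city applications",
    "Quantum channel capacity bounds",
]
keywords = ["autonomous agents", "transfer learning",
            "virtual reality", "smart city applications"]

index = build_index(docs)
ranked = rank_by_keywords(index, keywords)
print(ranked)  # list of (document id, number of matched phrases)
```

Ranking by the number of matched phrases is softer than intersecting all keywords at once, which tends to return nothing when the author provides many keywords.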

## Exercise 2: finding better keywords

Keywords are words that differentiate a particular document from the other documents in the collection.

This means that the TF-IDF measure could be useful to find keywords within a document.

To do so, fit a TF-IDF vectorizer on the whole collection of abstracts and then experiment to find out:

1. if the words with the largest TF-IDF in our abstract are the same as the proposed keywords
1. if the words are meaningful with respect to our abstract
1. if searching by the TF-IDF-generated words could lead to better recommendations
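
To illustrate the idea of picking keywords by TF-IDF, here is a minimal standard-library sketch on a toy collection. It uses the plain definition (relative term frequency times the logarithm of the inverse document frequency); Scikit-Learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each document vector by default, so its scores will differ, although the ranking idea is the same. The function name `top_tfidf_terms` is illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def top_tfidf_terms(docs, doc_id, k=3):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Relative term frequency inside the chosen document
    counts = Counter(tokenized[doc_id])
    total = len(tokenized[doc_id])

    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }
    # Highest score first; ties broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy collection (made up for illustration)
docs = [
    "generative agents use generative models",
    "agents use maps",
    "models use data",
]
top_terms = top_tfidf_terms(docs, doc_id=0, k=2)
print(top_terms)
```

Here "use" appears in every document, so its IDF is zero and it can never become a keyword, whereas "generative" is both frequent in the first document and absent from the others.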

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2,
                             max_df=0.2,
                             max_features=1000,
                             stop_words='english',
                            )
vectorizer.fit(df['abstract'])

In [16]:
X = vectorizer.transform(df['abstract']).toarray()

In [18]:
import numpy as np
print(X.shape)

iidx = defaultdict(set)
k = 5
for i in tqdm(range(X.shape[0])):
    amax = np.argsort(X[i,:])
    k_amax = amax[-k:]
    words = vectorizer.get_feature_names_out()[k_amax]
    iidx = adicionar_no_indice(iidx, ' '.join(words), i, lematizar=False)

(62905, 1000)


100%|██████████| 62905/62905 [00:14<00:00, 4257.38it/s]


In [21]:
docs_encontrados = encontrar_no_indice(iidx, ['intelligence', 'artificial'], lematizar=False)
print(f"Numero de docs encontrados: {len(docs_encontrados)}")
for i in docs_encontrados:
    print(df.iloc[i]['abstract'])

Numero de docs encontrados: 77
  Most applications of Artificial Intelligence (AI) are designed for a confined
and specific task. However, there are many scenarios that call for a more
general AI, capable of solving a wide array of tasks without being specifically
designed for them. The term General-Purpose Artificial Intelligence Systems
(GPAIS) has been defined to refer to these AI systems. To date, the possibility
of an Artificial General Intelligence, powerful enough to perform any
intellectual task as if it were human, or even improve it, has remained an
aspiration, fiction, and considered a risk for our society. Whilst we might
still be far from achieving that, GPAIS is a reality and sitting at the
forefront of AI research. This work discusses existing definitions for GPAIS
and proposes a new definition that allows for a gradual differentiation among
types of GPAIS according to their properties and limitations. We distinguish
between closed-world and open-world GPAIS, characteris

## Exercise 3: modelling abstracts with topics

Remember that, in our topic model with LDA, we decompose the word count matrix as:

$$
X \approx BA,
$$

where $B$ contains a representation of each document in terms of its topics, and $A$ contains a representation of each topic in terms of words.

However, we have not discussed how to find an optimal number of topics.

The idea used by [Amami et al.](http://link.springer.com/10.1007/978-3-319-41754-7_17) is to choose the number of topics that minimizes a metric called *perplexity*.

Perplexity measures how uncertain our model is when predicting a word (see [Griffiths and Steyvers](https://doi.org/10.1073/pnas.0307752101)). Lower values are better. With too few topics, the model makes very broad assumptions about the data; with too many topics, the data assigned to each topic become too sparse for a reliable estimation.
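
As a rough intuition, perplexity can be computed as the exponential of the average negative log-probability that the model assigned to each observed word. The sketch below, which is a simplified illustration and not how Scikit-Learn computes it internally (its `perplexity` method relies on an approximate bound), shows that a model guessing uniformly over a vocabulary of $V$ words has perplexity $V$.

```python
import math

def perplexity(word_probs):
    # word_probs: probability the model assigned to each observed word
    n = len(word_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that guesses uniformly over a 1000-word vocabulary
uniform_guess = perplexity([1 / 1000] * 50)

# A model that always gives probability 0.5 to the word that actually occurs
confident_guess = perplexity([0.5] * 50)

print(uniform_guess, confident_guess)
```

Under this reading, a perplexity of about 600 with a 1000-word vocabulary would mean the model is only moderately better than uniform guessing; treat the comparison as indicative only.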

Modify the code below to find an optimal number of topics for our data. Then, decompose all documents in the collection (also, do it to our abstract!) using the topic model.

In [23]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from tqdm import tqdm

print('Fitting vectorizer')
vectorizer = CountVectorizer(stop_words='english', min_df=10, max_df=0.8, max_features=1000).fit(df['abstract'])
abstract_vectorized = vectorizer.transform(df['abstract'].sample(10000))

print('Fitting LDA')
for n_components in tqdm([2, 10, 20, 50, 100]):
    lda = LatentDirichletAllocation(n_components=n_components, random_state=42, n_jobs=-1)
    lda.fit(abstract_vectorized)
    print(f"Number of components: {n_components}. Perplexity: {lda.perplexity(abstract_vectorized)}")


Fitting vectorizer
Fitting LDA


 20%|██        | 1/5 [00:04<00:17,  4.39s/it]

Number of components: 2. Perplexity: 656.7059240772633


 40%|████      | 2/5 [00:07<00:10,  3.64s/it]

Number of components: 10. Perplexity: 601.8735641541717


 60%|██████    | 3/5 [00:11<00:07,  3.64s/it]

Number of components: 20. Perplexity: 593.4533369513492


 80%|████████  | 4/5 [00:15<00:03,  3.93s/it]

Number of components: 50. Perplexity: 630.5809182927723


100%|██████████| 5/5 [00:20<00:00,  4.15s/it]

Number of components: 100. Perplexity: 734.90689729687





## Exercise 4: KL and JS divergences

The decomposition resulting from LDA is a probability distribution over topics. The dissimilarity between two probability distributions can be measured using the Kullback-Leibler (KL) divergence, which is calculated by:

$$
D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$
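
To make the formula concrete, the following minimal sketch evaluates it for two small, made-up distributions, using the natural logarithm. Swapping the arguments changes the result.

```python
import math

def kl_divergence(p, q):
    # Terms with P(i) = 0 contribute nothing to the sum
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)  # the two values differ
```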

However, the KL divergence is not symmetric, which was bothersome to Amami and their colleagues. For this reason, they used the Jensen-Shannon (JS) divergence, which compares both distributions with their average $M$:

$$
D_{JS}(P,Q) = \frac{D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)}{2}, \quad M = \frac{P + Q}{2}
$$

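See the code below demonstrating how this works in practice. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance*, which is the square root of the JS divergence: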


In [24]:
from scipy.spatial.distance import jensenshannon

lda = LatentDirichletAllocation(n_components=5, random_state=42, n_jobs=-1)
lda.fit(abstract_vectorized)

topics1 = lda.transform(abstract_vectorized[0,:])
topics2 = lda.transform(abstract_vectorized[1,:])
topics3 = lda.transform(abstract_vectorized[500,:])

print(topics1)
print(topics2)
print(topics3)

print(jensenshannon(topics1.ravel(), topics2.ravel()))
print(jensenshannon(topics1.ravel(), topics3.ravel()))
print(jensenshannon(topics2.ravel(), topics3.ravel()))

[[0.00287341 0.00282371 0.130757   0.86071659 0.00282928]]
[[0.13270142 0.0027445  0.33616711 0.23410955 0.29427743]]
[[0.00452972 0.00452    0.00451197 0.98193378 0.00450453]]
0.5037107843943454
0.1981790845219522
0.5993884732363441
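
As a sanity check, the second value printed above can be reproduced by hand. The sketch below is a plain-Python version that measures each distribution against their average and takes the square root at the end; it assumes the natural logarithm and copies the rounded topic vectors from the output above, so it should agree only up to rounding.

```python
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    # Average (mixture) of the two distributions
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = (kl_divergence(p, m) + kl_divergence(q, m)) / 2
    # Square root turns the divergence into a distance; max() guards against
    # tiny negative values caused by floating-point rounding
    return math.sqrt(max(divergence, 0.0))

# Rounded values copied from the printed topics1 and topics3
topics1 = [0.00287341, 0.00282371, 0.130757, 0.86071659, 0.00282928]
topics3 = [0.00452972, 0.00452, 0.00451197, 0.98193378, 0.00450453]

d13 = js_distance(topics1, topics3)
print(d13)  # close to the 0.198 printed above
```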


Using the LDA model you fitted in Exercise 3, find the topic distributions for our abstract and for each of the elements in the dataset. Then, write a function that retrieves the $K$ elements (where $K$ is an integer you can choose!) from the dataset that are closest to our abstract!
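
One possible shape for the retrieval function is sketched below on made-up topic distributions. In the notebook, the query vector could come from `lda.transform(vectorizer.transform([sample_abstract]))` and the document vectors from `lda.transform` applied to the vectorized collection; the names `js_distance` and `closest_documents` are illustrative.

```python
import heapq
import math

def js_distance(p, q):
    # Jensen-Shannon distance between two discrete distributions
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(max((kl(p, m) + kl(q, m)) / 2, 0.0))

def closest_documents(query_topics, doc_topics, k=3):
    # Pair each distance with its document index and keep the k smallest
    scored = ((js_distance(query_topics, topics), idx)
              for idx, topics in enumerate(doc_topics))
    return heapq.nsmallest(k, scored)

# Toy topic distributions (made up for illustration)
doc_topics = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
query_topics = [0.7, 0.2, 0.1]

results = closest_documents(query_topics, doc_topics, k=2)
for distance, idx in results:
    print(idx, distance)
```

`heapq.nsmallest` avoids sorting the whole collection, which may help when the dataset has tens of thousands of documents and $K$ is small.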

## Exercise 5

Compare the recommendations provided by keyword searching, by TF-IDF keyword searching, and by topic modelling.

1. Which recommendation seems most useful?
1. Could you combine the techniques above (at least 2 of them) to get a possibly better recommendation?
1. Can you use an LLM to help with this task? How? Implement an LLM-based solution and compare it with the previous ones.


---
### My notes:

### 🔍 Explaining TF-IDF (Term Frequency - Inverse Document Frequency)

#### 📌 What is TF-IDF?
**TF-IDF (Term Frequency - Inverse Document Frequency)** is a metric used in **Natural Language Processing (NLP)** to assess how important a word is to a document within a collection of documents.

It combines two measures:

1️⃣ **TF (Term Frequency)** -> Scikit-Learn's `CountVectorizer` computes the raw counts (the numerator below)
**TF** measures **how many times a word appears in a document**, indicating its relevance.

The formula for TF is:

$$
TF(w) = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}
$$

- If a word appears **many times in a document**, it may be relevant.
- However, common words such as **"the", "a", "of"** appear in almost every text, so we need another factor.

2️⃣ **IDF (Inverse Document Frequency)**
**IDF** measures **how rare a word is** across a collection of documents.

The formula for IDF is:

$$
IDF(w) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents that contain the word}} \right)
$$

- If a word is **in almost every document**, its IDF will be **low**.
- If a word is **in few documents**, its IDF will be **high**.
- This helps reduce the weight of **very common** words and highlight **specific** ones.

3️⃣ **TF-IDF Score**
The final **TF-IDF** is calculated by multiplying **TF × IDF**:

$$
TFIDF(w) = TF(w) \times IDF(w)
$$

This means that:
- Words that are **very frequent in a document** and **rare in the collection** will have a **high TF-IDF**.
- **Very common** words will have a **low TF-IDF**.
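
As a worked example with made-up numbers, and assuming the natural logarithm: a word that appears 3 times in a 100-word document, and occurs in 10 out of 1000 documents, gets

$$
TF(w) = \frac{3}{100} = 0.03, \qquad IDF(w) = \log\left(\frac{1000}{10}\right) \approx 4.605, \qquad TFIDF(w) \approx 0.03 \times 4.605 \approx 0.138
$$

Scikit-Learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its values will differ from this hand calculation.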