# Content-based search

The Internet is a marvelous source of information. However, simply knowing that we *have* information is not enough to *use* it. For example, we *know* that, somewhere on the Internet, there is a book on Natural Language Processing. But how can we find this book?

In this notebook, we are going to work with the following use case, which was also approached in [Amami et al., "An LDA-Based Approach to Scientific Paper Recommendation", Natural Language Processing and Information Systems, 2016](http://link.springer.com/10.1007/978-3-319-41754-7_17), based on ideas by [Griffiths and Steyvers, "Finding Scientific Topics", Proc. Natl. Acad. Sci. U.S.A., 2004](https://doi.org/10.1073/pnas.0307752101).

Suppose a scientist is writing an article. Articles usually start with a section called "abstract", which summarizes the contents of the whole paper. We want our system to take the abstract we are working on and then find related articles we could work with.

We will start by simulating our data with a subset of an ArXiv dataset available at Kaggle:

In [26]:
import pandas as pd
import kagglehub
from tqdm import tqdm
from pathlib import Path

# Download the dataset from Kaggle and load the CSV into a DataFrame
path = kagglehub.dataset_download("tiagoft/arvix-data-filtered-for-cs-only-data")
path = Path(path)
df = pd.read_csv(path / 'arxiv-metadata-oai-snaptshot-cs-only.csv')

In [3]:
sample_title = "Enhancing Autonomous Agents with Multimodal Generative AI for Improved Human-AI Collaboration"
sample_abstract = """The integration of multimodal generative AI into autonomous agents presents a significant advancement in human-AI collaboration. 
This study explores the development of autonomous agents capable of processing and generating various data types,
including text-to-image and image-to-audio conversions. By leveraging multimodal generative AI, these agents can interpret and generate 
content across different modalities, enhancing their ability to interact with humans in more natural and intuitive ways.
We propose a novel framework that combines generative AI with transfer learning techniques to enable autonomous agents to adapt 
knowledge acquired from one context to another with minimal additional data. Our experiments demonstrate that this approach significantly
improves the agents' performance in tasks requiring human-AI collaboration, such as virtual reality environments and smart city applications.
The results highlight the potential of multimodal generative AI to revolutionize human-AI interaction, paving the way for more immersive 
and adaptive collaborative experiences.
"""
sample_keywords = ["autonomous agents", "multimodal generative AI", "human-AI collaboration", "transfer learning", "virtual reality", "smart city applications"]

## Exercise 1: search by keyword

Searching by keywords is relatively simple because we can use an inverted index, which maps each word to the set of documents that contain it. In fact, online search engines usually implement some form of inverted index.

Use your inverted index to find other relevant articles within our dataset using the keywords provided by the abstract's author.

In [33]:
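import re
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from collections import defaultdict

# Only needed if you switch to WordNetLemmatizer; PorterStemmer needs no extra data
nltk.download('wordnet')
iidx = defaultdict(set)

def add_to_index(iidx, new_text, idx, lemmatize=False):
    """Add document `idx` to the inverted index (word -> set of document ids)."""
    # Tokenize: lowercase words with at least two characters
    all_words = re.findall(r'\b\w\w+\b', new_text.lower())

    if lemmatize:
        # Despite the flag's name, this applies Porter stemming
        stemmer = PorterStemmer()
        all_words = [stemmer.stem(word) for word in all_words]

    for word in all_words:
        iidx[word].add(idx)
    return iidx

def find_in_index(iidx, words, lemmatize=False):
    """Return the ids of the documents that contain ALL of `words`."""
    # Normalize the query in the same way as the indexed text
    words = [word.lower() for word in words]
    if lemmatize:
        stemmer = PorterStemmer()
        words = [stemmer.stem(word) for word in words]

    # .get avoids creating empty entries in the defaultdict for unknown words
    docs = [iidx.get(word, set()) for word in words]
    if not docs:
        return set()
    return set.intersection(*docs)

iidx = add_to_index(iidx, "NLP is so cool!", 0, lemmatize=True)
iidx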

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\emend\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


defaultdict(set, {'nlp': {0}, 'is': {0}, 'so': {0}, 'cool': {0}})

In [25]:
iidx = defaultdict(set)

for idx, content in enumerate(df["abstract"]):
    iidx = add_to_index(iidx, content, idx)


In [16]:
find_in_index(iidx, ["artificial", "intelligence"])

{40962,
 49158,
 49160,
 57360,
 49175,
 49184,
 57400,
 57402,
 41022,
 57409,
 ...}
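Requiring every keyword term to appear in a document (a strict AND query) can easily return nothing when the keyword list is long. The sketch below shows one possible alternative: split multi-word keywords into terms and rank documents by how many distinct terms they contain. It is self-contained and uses a toy collection; `tokenize`, `build_index`, `rank_by_keywords`, and `toy_docs` are hypothetical names introduced only for this illustration, not part of the notebook's code.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Same tokenization rule as add_to_index: lowercase words with 2+ characters
    return re.findall(r'\b\w\w+\b', text.lower())

def build_index(docs):
    # Inverted index: word -> set of ids of the documents that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def rank_by_keywords(index, keywords):
    """Score each document by the number of distinct query terms it contains."""
    terms = {term for keyword in keywords for term in tokenize(keyword)}
    scores = Counter()
    for term in terms:
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    # Documents with no matching term are simply absent from the ranking
    return scores.most_common()

# Toy collection (assumption: stands in for df["abstract"])
toy_docs = [
    "Autonomous agents learn to collaborate with humans",
    "Transfer learning for smart city applications",
    "A survey of graph algorithms",
]
toy_index = build_index(toy_docs)
ranking = rank_by_keywords(
    toy_index,
    ["autonomous agents", "transfer learning", "smart city applications"],
)
print(ranking)  # list of (document id, number of matched terms)
```

The same idea should carry over to the full dataset by building the index from the abstracts and passing `sample_keywords` as the query.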

## Exercise 2: finding better keywords

Keywords are words that differentiate a particular document from the other documents in the collection.

This means that the TFIDF measure could be useful to find keywords within a document.

To test this, fit a TFIDF vectorizer on the whole collection of abstracts and then experiment to find out:

1. if the words with largest TFIDF in our abstract are the same as the proposed keywords
1. if the words are meaningful towards our abstract
1. if searching by the TFIDF-generated words could lead to better recommendations

In [36]:
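from sklearn.feature_extraction.text import TfidfVectorizer

def k_most_relevant_words_in_context(context, k):
    """Return the vocabulary indices of the k largest TFIDF weights in the first document of `context`."""
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.2, max_features=1000, stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(context)

    # Densify only the first row; converting the whole matrix wastes memory
    first_doc = tfidf_matrix[0].toarray().ravel()
    # Indices come out sorted by increasing weight. They are column indices,
    # not words: map them with vectorizer.get_feature_names_out() if needed.
    return np.argsort(first_doc)[-k:]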

In [38]:
k_most_relevant_words_in_context(df["abstract"], 5)

array([520, 611, 506, 304, 720])

## Exercise 3: modelling abstracts with topics

Remember that, in our topic model with LDA, we decompose the word count matrix as:

$$
X \approx BA,
$$

where $B$ contains a representation of each document in terms of its topics, and $A$ contains a representation of each topic in terms of its words.

However, we have not discussed how to find an optimal number of topics.

The idea used by [Amami et al.](http://link.springer.com/10.1007/978-3-319-41754-7_17) is to choose the number of topics that minimizes a metric called *perplexity*.

Perplexity measures how uncertain our model is when it has to predict the words of a document (see [Griffiths and Steyvers](https://doi.org/10.1073/pnas.0307752101)). Lower values are better. With too few topics, the model makes very broad assumptions about the data; with too many topics, each topic is estimated from too little data, so the estimates become unreliable.
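To make the metric concrete, a common definition of perplexity is sketched below. This form is an assumption for illustration: the cited papers and scikit-learn's `perplexity` method (which relies on an approximate bound on the likelihood) may differ in details. For $n$ documents, where document $d$ has $N_d$ words $\mathbf{w}_d$:

$$
\mathrm{perplexity} = \exp \left( - \frac{\sum_{d=1}^{n} \log p(\mathbf{w}_d)}{\sum_{d=1}^{n} N_d} \right)
$$

Under this definition, a model that assigns the same probability $1/V$ to each of the $V$ words in the vocabulary gives $\log p(\mathbf{w}_d) = -N_d \log V$, so its perplexity is exactly $V$:

$$
\exp \left( \frac{\sum_{d=1}^{n} N_d \log V}{\sum_{d=1}^{n} N_d} \right) = \exp(\log V) = V
$$

A useful model should therefore score below the vocabulary size.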

Modify the code below to find an optimal number of topics for our data. Then, decompose all documents in the collection (also, do it to our abstract!) using the topic model.

In [39]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

print('Fitting vectorizer')
vectorizer = CountVectorizer(stop_words='english', min_df=10, max_df=0.8, max_features=1000).fit(df['abstract'])
# Use a random sample of 10,000 abstracts to keep the run time manageable.
# The sample is not seeded, so the perplexities can change between runs.
abstract_vectorized = vectorizer.transform(df['abstract'].sample(10000))

print('Fitting LDA')
for n_components in tqdm([2, 10, 20, 50, 100]):
    lda = LatentDirichletAllocation(n_components=n_components, random_state=42, n_jobs=-1)
    lda.fit(abstract_vectorized)
    # Note: this is the perplexity on the same sample used for fitting
    print(f"Number of components: {n_components}. Perplexity: {lda.perplexity(abstract_vectorized)}")


Fitting vectorizer
Fitting LDA
Number of components: 2. Perplexity: 659.4180045242211
Number of components: 10. Perplexity: 603.6368515894242
Number of components: 20. Perplexity: 596.4369294606857
Number of components: 50. Perplexity: 638.2369209586114
Number of components: 100. Perplexity: 734.419984768313
100%|██████████| 5/5 [01:46<00:00, 21.23s/it]
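The loop above measures perplexity on the same documents used to fit each model. One possible refinement, sketched below, is to fit on a training split and compute perplexity on held-out documents, then keep the number of topics with the lowest value. The sketch is self-contained and runs on synthetic counts; `toy_counts`, `held_out_perplexity`, `candidates`, and the small `max_iter` are assumptions made only to keep the example quick, not settings taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Synthetic word counts (assumption: stand-in for the vectorized abstracts)
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(200, 30))

# Hold out a quarter of the documents for evaluation
train_counts, test_counts = train_test_split(
    toy_counts, test_size=0.25, random_state=0
)

def held_out_perplexity(n_topics):
    """Fit LDA on the training split and return perplexity on the held-out split."""
    lda = LatentDirichletAllocation(
        n_components=n_topics, random_state=0, max_iter=5
    )
    lda.fit(train_counts)
    return lda.perplexity(test_counts)

candidates = [2, 5, 10]
perplexities = {k: held_out_perplexity(k) for k in candidates}

# Keep the candidate with the lowest held-out perplexity
best_n_topics = min(perplexities, key=perplexities.get)
print(perplexities)
print("Selected number of topics:", best_n_topics)
```

On purely random counts like these there is no real topic structure, so the selected value is not meaningful in itself; the point is only the selection procedure.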





## Exercise 4: KL and JS divergences

The decomposition resulting from LDA represents each document as a probability distribution over topics. The dissimilarity between two probability distributions can be measured with the Kullback-Leibler (KL) divergence, which is calculated by:

$$
D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$

However, the KL divergence is not symmetric, which was bothersome to Amami and their colleagues. For this reason, they used the Jensen-Shannon (JS) divergence, which compares both distributions with their average $M$:

$$
D_{JS}(P,Q) = \frac{D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)}{2}, \quad M = \frac{P + Q}{2}
$$
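The following sketch checks these properties numerically on two small distributions. It assumes the standard definition of the JS divergence, in which both distributions are compared with their average $M = (P + Q)/2$, and natural logarithms throughout. `kl_divergence`, `js_divergence`, `p`, and `q` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) with natural logarithms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(i) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """JS divergence: average KL divergence of P and Q to their mixture M."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)

# SciPy returns the JS distance, i.e. the square root of the JS divergence
js_distance = jensenshannon(p, q)

print("KL(P||Q):", kl_pq, "KL(Q||P):", kl_qp)        # the two values differ
print("JS(P,Q):", js_pq, "JS(Q,P):", js_qp)          # the two values agree
print("SciPy distance squared:", js_distance ** 2)   # matches JS(P,Q)
```

With natural logarithms the JS divergence never exceeds $\log 2$, so values can be compared across document pairs on a fixed scale.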

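See the code below demonstrating how this works in practice. Note that SciPy's `jensenshannon` returns the Jensen-Shannon *distance*, which is the square root of the JS divergence: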


In [41]:
from scipy.spatial.distance import jensenshannon

lda = LatentDirichletAllocation(n_components=5, random_state=42, n_jobs=-1)
lda.fit(abstract_vectorized)

# Topic distributions (one row, five topics each) of three documents from the sample
topics1 = lda.transform(abstract_vectorized[0,:])
topics2 = lda.transform(abstract_vectorized[1,:])
topics3 = lda.transform(abstract_vectorized[500,:])

print(topics1)
print(topics2)
print(topics3)

# jensenshannon expects 1-D arrays, hence ravel(); smaller values mean more similar documents
print(jensenshannon(topics1.ravel(), topics2.ravel()))
print(jensenshannon(topics1.ravel(), topics3.ravel()))
print(jensenshannon(topics2.ravel(), topics3.ravel()))

[[0.00262729 0.42444138 0.24123015 0.00260597 0.3290952 ]]
[[0.00400545 0.48318949 0.00402044 0.07129568 0.43748893]]
[[0.26500287 0.0031395  0.00311942 0.72560769 0.00313053]]
0.3155663578041288
0.8066210162533497
0.735833206525308


Using the LDA model you fitted in Exercise 3, find the topic distributions for our abstract and for each of the elements in the dataset. Then, make a function that retrieves the $K$ elements (where $K$ is an integer you can choose!) from the dataset that are closest to our abstract!
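As a starting point, the retrieval step could look like the sketch below: vectorize the texts, obtain topic distributions with LDA, and sort the documents by their Jensen-Shannon distance to the query. It is a minimal illustration on a toy collection, not a reference solution; `toy_abstracts`, `toy_query`, `k_closest`, and the choice of two topics are assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy collection and query (assumption: stand-ins for df["abstract"] and sample_abstract)
toy_abstracts = [
    "autonomous agents collaborate with humans using generative models",
    "generative models enable agents to interact with humans",
    "shortest path algorithms on large sparse graphs",
    "graph algorithms for sparse network analysis",
]
toy_query = "generative agents that collaborate with humans"

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(toy_abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document
query_topics = lda.transform(vectorizer.transform([toy_query]))[0]

def k_closest(query_topics, doc_topics, k):
    """Return (document index, JS distance) pairs for the k closest documents."""
    distances = np.array(
        [jensenshannon(query_topics, row) for row in doc_topics]
    )
    order = np.argsort(distances)[:k]  # smallest distance first
    return [(int(i), float(distances[i])) for i in order]

closest = k_closest(query_topics, doc_topics, k=2)
print(closest)
```

On the full dataset, computing the distance to every document in a Python loop may be slow, so it can be worth transforming all abstracts once and storing the resulting topic matrix.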

## Exercise 5

Compare the recommendations provided by keyword searching, by TFIDF keyword searching, and by topic modelling.

1. Which recommendation seems more useful?
1. Could you combine the techniques above (at least 2 of them) to get a possibly better recommendation?
1. Can you use an LLM to help with this task? How? Implement an LLM-based solution and compare it with the previous ones.
