# Content-based search

The Internet is a marvelous source of information, but simply knowing that information *exists* is not enough to *use* it. For example, we *know* that, somewhere on the Internet, there is a book on Natural Language Processing. But how can we find this book?

In this notebook, we are going to work with the following use case, which was also approached in [Amami et al., "An LDA-Based Approach to Scientific Paper Recommendation", Natural Language Processing and Information Systems, 2016](http://link.springer.com/10.1007/978-3-319-41754-7_17), based on ideas by [Griffiths and Steyvers, "Finding Scientific Topics", Proc. Natl. Acad. Sci. U.S.A., 2004](https://doi.org/10.1073/pnas.0307752101).

Suppose a scientist is writing an article. Articles usually start with a section called "abstract", which summarizes the contents of the whole paper. We want our system to take the abstract we are working on and find other articles that may be relevant to it.

We will start by simulating our data with a subset of an ArXiv dataset available at Kaggle:

In [1]:
import pandas as pd 
import os
import kagglehub
from tqdm import tqdm
from pathlib import Path
    
path = kagglehub.dataset_download("tiagoft/arvix-data-filtered-for-cs-only-data")
path = Path(path)
df = pd.read_csv(path / 'arxiv-metadata-oai-snaptshot-cs-only.csv')

In [2]:
df.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0046,Denes Petz,"I. Csiszar, F. Hiai and D. Petz",A limit relation for entropy and channel capac...,"LATEX file, 11 pages","J. Math. Phys. 48(2007), 092102.",10.1063/1.2779138,,quant-ph cs.IT math.IT,,"In a quantum mechanical model, Diosi, Feldma...","[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2009-11-13,"[['Csiszar', 'I.', ''], ['Hiai', 'F.', ''], ['..."
1,704.0062,Tom\'a\v{s} Vina\v{r},"Rastislav \v{S}r\'amek, Bro\v{n}a Brejov\'a, T...",On-line Viterbi Algorithm and Its Relationship...,,Algorithms in Bioinformatics: 7th Internationa...,10.1007/978-3-540-74126-8_23,,cs.DS,,"In this paper, we introduce the on-line Vite...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2010-01-25,"[['Šrámek', 'Rastislav', ''], ['Brejová', 'Bro..."
2,704.0098,Jack Raymond,"Jack Raymond, David Saad",Sparsely-spread CDMA - a statistical mechanics...,"23 pages, 5 figures, figure 1 amended since pu...",J. Phys. A: Math. Theor. 40 No 41 (12 October ...,10.1088/1751-8113/40/41/004,,cs.IT math.IT,,"Sparse Code Division Multiple Access (CDMA),...","[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2009-11-13,"[['Raymond', 'Jack', ''], ['Saad', 'David', '']]"
3,704.0217,Wiroonsak Santipach,Wiroonsak Santipach and Michael L. Honig,Capacity of a Multiple-Antenna Fading Channel ...,,"IEEE Trans. Inf. Theory, vol. 55, no. 3, pp. 1...",10.1109/TIT.2008.2011437,,cs.IT math.IT,http://arxiv.org/licenses/nonexclusive-distrib...,Given a multiple-input multiple-output (MIMO...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2010-08-27,"[['Santipach', 'Wiroonsak', ''], ['Honig', 'Mi..."
4,704.0301,Akitoshi Kawamura,Akitoshi Kawamura,Differential Recursion and Differentially Alge...,"14 pages, 3 figures",Revised and published in ACM Trans. Comput. Lo...,10.1145/1507244.1507252,,cs.CC,,"Moore introduced a class of real-valued ""rec...","[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...",2009-04-19,"[['Kawamura', 'Akitoshi', '']]"


In [3]:
sample_title = "Enhancing Autonomous Agents with Multimodal Generative AI for Improved Human-AI Collaboration"
sample_abstract = """The integration of multimodal generative AI into autonomous agents presents a significant advancement in human-AI collaboration. 
This study explores the development of autonomous agents capable of processing and generating various data types,
including text-to-image and image-to-audio conversions. By leveraging multimodal generative AI, these agents can interpret and generate 
content across different modalities, enhancing their ability to interact with humans in more natural and intuitive ways.
We propose a novel framework that combines generative AI with transfer learning techniques to enable autonomous agents to adapt 
knowledge acquired from one context to another with minimal additional data. Our experiments demonstrate that this approach significantly
improves the agents' performance in tasks requiring human-AI collaboration, such as virtual reality environments and smart city applications.
The results highlight the potential of multimodal generative AI to revolutionize human-AI interaction, paving the way for more immersive 
and adaptive collaborative experiences.
"""
sample_keywords = ["autonomous agents", "multimodal generative AI", "human-AI collaboration", "transfer learning", "virtual reality", "smart city applications"]

In [4]:
# df size
print(f"Number of rows: {df.shape[0]}")

Number of rows: 62905


In [5]:
# small test df
small_df = df.sample(1000)
small_df.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
45530,2203.16538,Athanasios Lentzas,Athanasios Lentzas and Dimitris Vrakas,Machine Learning Approaches for Non-Intrusive ...,"20 pages,submitted to ""Expert Systems with App...","Expert Systems with Applications, 210, 118454 ...",10.1016/j.eswa.2022.118454,,cs.LG cs.AI cs.NE eess.SP,http://creativecommons.org/licenses/by/4.0/,Home absence detection is an emerging field ...,"[{'version': 'v1', 'created': 'Wed, 30 Mar 202...",2022-08-23,"[['Lentzas', 'Athanasios', ''], ['Vrakas', 'Di..."
22961,1810.11556,Yuhang Che,"Yuhang Che, Allison M. Okamura, Dorsa Sadigh",Efficient and Trustworthy Social Navigation Vi...,,"IEEETransactionsonRobotics,pp(99):1-16,2020",10.1109/TRO.2020.2964824,,cs.RO,http://arxiv.org/licenses/nonexclusive-distrib...,"In this paper, we present a planning framewo...","[{'version': 'v1', 'created': 'Fri, 26 Oct 201...",2020-02-25,"[['Che', 'Yuhang', ''], ['Okamura', 'Allison M..."
41793,2109.06045,Nadica Miljkovi\'c,"Nadica Miljkovi\'c, Ana Trisovic and Limor Peer",Towards FAIR Principles for Open Hardware,"12 pages, 3 figures, 2 tables",PSSOH Conference (2021): 90-101,10.5281/zenodo.5524414,,cs.DL cs.SE,http://creativecommons.org/licenses/by/4.0/,The lack of scientific openness is identifie...,"[{'version': 'v1', 'created': 'Mon, 13 Sep 202...",2023-04-18,"[['Miljković', 'Nadica', ''], ['Trisovic', 'An..."
23656,1812.06247,Hock Hung Chieng,"Hock Hung Chieng, Noorhaniza Wahid, Pauline On...",Flatten-T Swish: a thresholded ReLU-Swish-like...,,International Journal of Advances in Intellige...,10.26555/ijain.v4i2.249,,cs.NE cs.LG stat.ML,http://creativecommons.org/licenses/by-nc-sa/4.0/,Activation functions are essential for deep ...,"[{'version': 'v1', 'created': 'Sat, 15 Dec 201...",2018-12-18,"[['Chieng', 'Hock Hung', ''], ['Wahid', 'Noorh..."
22626,1810.01272,Philip Feldman,"Philip Feldman, Aaron Dant, and Wayne Lutters",Disrupting the Coming Robot Stampedes: Designi...,"5 pages, 2 figures","14th International Conference, iConference 2019",10.1007/978-3-030-15742-5,,cs.CY,http://arxiv.org/licenses/nonexclusive-distrib...,Machines are designed to communicate widely ...,"[{'version': 'v1', 'created': 'Tue, 2 Oct 2018...",2019-04-11,"[['Feldman', 'Philip', ''], ['Dant', 'Aaron', ..."


## Exercise 1: search by keyword

Searching by keywords is relatively simple because we can use an inverted index. In fact, online search engines usually implement inverted indexes.

Use your inverted index to try to find other relevant articles within our dataset using the keywords provided by the abstract's author.

In [6]:
from typing import Dict
from sklearn.feature_extraction.text import CountVectorizer
import logging
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

logging.basicConfig(level=logging.WARNING)

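def make_inverted_index_from_df(df : pd.DataFrame, lemmatize = False) -> Dict:
    """Map each vocabulary word to the row positions of the abstracts that contain it."""
    abstracts = df['abstract']
    if lemmatize:
        nltk.download('wordnet')
        lemmatizer = WordNetLemmatizer()
        # Lemmatize a copy of the text so the caller's DataFrame is left untouched
        abstracts = abstracts.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
    # binary=True: we only need to know whether a word occurs, not how often
    vectorizer = CountVectorizer(binary=True, min_df=0.01, max_df=0.8, stop_words='english')
    X = vectorizer.fit_transform(abstracts) # X is a sparse document-term matrix
    inverted_index = {}
    for word, column in vectorizer.vocabulary_.items():
        # Row positions (not index labels) of the documents containing the word
        inverted_index[word] = X[:, column].nonzero()[0]
    return inverted_index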

In [7]:
inverted_index = make_inverted_index_from_df(small_df)

In [8]:
#inverted_index['neural']

In [9]:
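def search(df : pd.DataFrame, inverted_index : Dict, query : str) -> pd.DataFrame:
    """Return the rows of `df` whose abstract contains the single word `query`."""
    query = query.lower()
    if query not in inverted_index:
        return pd.DataFrame()
    # The index stores row positions, so we select with iloc rather than loc
    return df.iloc[inverted_index[query]]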

search(small_df, inverted_index, 'neural')

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
23656,1812.06247,Hock Hung Chieng,"Hock Hung Chieng, Noorhaniza Wahid, Pauline On...",Flatten-T Swish: a thresholded ReLU-Swish-like...,,International Journal of Advances in Intellige...,10.26555/ijain.v4i2.249,,cs.NE cs.LG stat.ML,http://creativecommons.org/licenses/by-nc-sa/4.0/,Activation functions are essential for deep ...,"[{'version': 'v1', 'created': 'Sat, 15 Dec 201...",2018-12-18,"[['Chieng', 'Hock Hung', ''], ['Wahid', 'Noorh..."
39510,2105.08086,Elizabeth Bennewitz,"Elizabeth R. Bennewitz, Florian Hopfmueller, B...",Neural Error Mitigation of Near-Term Quantum S...,"20 pages, 4 main figures, 7 supplementary figu...","Nat Mach Intell 4, 618-624 (2022)",10.1038/s42256-022-00509-0,,quant-ph cs.AI cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,Near-term quantum computers provide a promis...,"[{'version': 'v1', 'created': 'Mon, 17 May 202...",2023-01-31,"[['Bennewitz', 'Elizabeth R.', ''], ['Hopfmuel..."
52406,2304.07711,Wendong Zhang,Wendong Zhang and Qingjie Chai and Quanqi Zhan...,Obstacle-Transformer: A Trajectory Prediction ...,"8 pages, 4 figures","IET Cyber-Systems and Robotics, 2023, 5(1), e1...",10.1049/csy2.12066,,cs.CV,http://arxiv.org/licenses/nonexclusive-distrib...,"Recurrent Neural Network, Long Short-Term Me...","[{'version': 'v1', 'created': 'Sun, 16 Apr 202...",2023-04-18,"[['Zhang', 'Wendong', ''], ['Chai', 'Qingjie',..."
46529,2205.11082,Tanvi Mehta,"Tanvi Mehta, Ganesh Deshmukh",YouTube Ad View Sentiment Analysis using Deep ...,"5 pages, 9 figures, Published with Internation...",International Journal of Computer Applications...,10.5120/ijca2022922078,,cs.LG cs.IR,http://creativecommons.org/licenses/by/4.0/,Sentiment Analysis is currently a vital area...,"[{'version': 'v1', 'created': 'Mon, 23 May 202...",2022-05-24,"[['Mehta', 'Tanvi', ''], ['Deshmukh', 'Ganesh'..."
55751,2310.10664,Dmitrijs Trizna,"Dmitrijs Trizna, Luca Demetrio, Battista Biggi...",Nebula: Self-Attention for Dynamic Malware Ana...,"18 pages, 7 figures, 12 tables, preprint, in r...",IEEE Transactions on Information Forensics and...,10.1109/TIFS.2024.3409083,,cs.CR cs.LG,http://creativecommons.org/licenses/by-sa/4.0/,Dynamic analysis enables detecting Windows m...,"[{'version': 'v1', 'created': 'Tue, 19 Sep 202...",2024-10-29,"[['Trizna', 'Dmitrijs', ''], ['Demetrio', 'Luc..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35977,2011.1356,Mingfu Xue,"Mingfu Xue, Shichang Sun, Zhiyu Wu, Can He, Ji...",SocialGuard: An Adversarial Example Based Priv...,,Journal of Information Security and Applicatio...,10.1016/j.jisa.2021.102993,,cs.CV cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,The popularity of various social platforms h...,"[{'version': 'v1', 'created': 'Fri, 27 Nov 202...",2022-07-05,"[['Xue', 'Mingfu', ''], ['Sun', 'Shichang', ''..."
55387,2309.16032,Yuezhu Xu,Yuezhu Xu and S. Sivaranjani,Learning Dissipative Neural Dynamical Systems,6 pages,IEEE Control Systems Letters 2023,10.1109/LCSYS.2023.3337851,,cs.LG cs.SY eess.SY math.DS math.OC,http://creativecommons.org/licenses/by/4.0/,Consider an unknown nonlinear dynamical syst...,"[{'version': 'v1', 'created': 'Wed, 27 Sep 202...",2024-04-09,"[['Xu', 'Yuezhu', ''], ['Sivaranjani', 'S.', '']]"
46939,2206.05963,M\'aty\'as Sz\'ant\'o,"M\'aty\'as Sz\'ant\'o, Gy\""orgy R. Bog\'ar, L\...",ATDN vSLAM: An all-through Deep Learning-Based...,Published in Periodica Polytechnica Electrical...,Periodica Polytechnica Electrical Engineering ...,10.3311/PPee.20437,,cs.CV,http://creativecommons.org/licenses/by/4.0/,"In this paper, a novel solution is introduce...","[{'version': 'v1', 'created': 'Mon, 13 Jun 202...",2022-07-18,"[['Szántó', 'Mátyás', ''], ['Bogár', 'György R..."
46795,2206.01904,Abhijith Sharma Mr,Abhijith Sharma and Apurva Narayan,Soft Adversarial Training Can Retain Natural A...,"7 pages, 6 figures",In Proceedings of the 14th International Confe...,10.5220/0010871000003116,,cs.LG cs.AI cs.CR,http://creativecommons.org/licenses/by/4.0/,Adversarial training for neural networks has...,"[{'version': 'v1', 'created': 'Sat, 4 Jun 2022...",2022-06-07,"[['Sharma', 'Abhijith', ''], ['Narayan', 'Apur..."


In [19]:
# print some  inverted index entries
for i, (word, idx) in enumerate(inverted_index.items()):
    if i > 10:
        break
    print(f"{word}")

absence
detection
emerging
field
smart
identifying
present
important
numerous
scenarios
possible


In [11]:
# print the inverted index size
print(f"Size of inverted index: {len(inverted_index)}")

Size of inverted index: 1631
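The keywords given with the abstract are multi-word phrases, while `search` above looks up a single word. One possible way to bridge the gap is sketched below: each phrase is split into tokens, the postings of the tokens are intersected, and rows are ranked by how many phrases they match. The function name `search_keywords` and the toy index are hypothetical, and skipping tokens that are missing from the index is an assumption made here, not something the exercise prescribes.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Same token pattern that CountVectorizer uses by default (words of 2+ characters)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def search_keywords(inverted_index: Dict[str, Iterable[int]],
                    keywords: List[str]) -> List[Tuple[int, int]]:
    """Return (row position, number of matched phrases), best matches first."""
    scores = Counter()
    for phrase in keywords:
        tokens = TOKEN_PATTERN.findall(phrase.lower())
        # Assumption: tokens absent from the index (rare words, stop words) are skipped
        postings = [set(int(i) for i in inverted_index[t])
                    for t in tokens if t in inverted_index]
        if not postings:
            continue  # none of the phrase's tokens are indexed
        # A row matches the phrase when it contains every indexed token of the phrase
        scores.update(set.intersection(*postings))
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy index: word -> row positions
toy_index = {
    "autonomous": [0, 2],
    "agents": [0, 2, 3],
    "transfer": [1, 2],
    "learning": [0, 1, 2],
}
ranked = search_keywords(
    toy_index, ["autonomous agents", "transfer learning", "smart city applications"]
)
print(ranked)
```

With the notebook's objects, the call would look like `search_keywords(inverted_index, sample_keywords)`, and the matching rows could then be selected with `small_df.iloc[[pos for pos, _ in ranked]]`.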


## Exercise 2: finding better keywords

Keywords are words that differentiate a particular document from the other documents in the collection.

This means that the TF-IDF measure could be useful to find keywords within a document.

To do so, fit a TF-IDF vectorizer on the whole collection of abstracts and then experiment to find out:

1. whether the words with the largest TF-IDF in our abstract are the same as the proposed keywords
1. whether the words are meaningful with respect to our abstract
1. whether searching by the TF-IDF-generated words could lead to better recommendations

In [12]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

def make_inverted_index_from_df_tfidf(df : pd.DataFrame, lemmatize = False) -> Dict:
    """Build an inverted index from the non-zero entries of a TF-IDF matrix."""
    if lemmatize:
        nltk.download('wordnet')
        lemmatizer = WordNetLemmatizer()
        # Note: this overwrites df['abstract'] in place with the lowercased, lemmatized text
        df['abstract'] = df['abstract'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word.lower()) for word in x.split()]))
    vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=0.01)
    X = vectorizer.fit_transform(df['abstract']) # X is a sparse matrix of TF-IDF weights
    inverted_index = {}
    for word in vectorizer.vocabulary_:
        # Only the row positions are kept; the TF-IDF weights themselves are discarded here
        inverted_index[word] = X[:, vectorizer.vocabulary_[word]].nonzero()[0]
    return inverted_index



In [13]:
inverted_index_tfidf = make_inverted_index_from_df_tfidf(small_df, lemmatize=True)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rodri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
# print the index size
print(f"Number of unique words: {len(inverted_index_tfidf)}")

Number of unique words: 1533


In [15]:
# print some  inverted index entries
for i, (word, idx) in enumerate(inverted_index_tfidf.items()):
    if i > 10:
        break
    print(f"{word}")

absence
detection
emerging
field
smart
identifying
present
important
numerous
scenarios
possible


In [16]:
# search using TF-IDF
search(small_df, inverted_index_tfidf, 'neural')


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
23656,1812.06247,Hock Hung Chieng,"Hock Hung Chieng, Noorhaniza Wahid, Pauline On...",Flatten-T Swish: a thresholded ReLU-Swish-like...,,International Journal of Advances in Intellige...,10.26555/ijain.v4i2.249,,cs.NE cs.LG stat.ML,http://creativecommons.org/licenses/by-nc-sa/4.0/,activation function are essential for deep lea...,"[{'version': 'v1', 'created': 'Sat, 15 Dec 201...",2018-12-18,"[['Chieng', 'Hock Hung', ''], ['Wahid', 'Noorh..."
39510,2105.08086,Elizabeth Bennewitz,"Elizabeth R. Bennewitz, Florian Hopfmueller, B...",Neural Error Mitigation of Near-Term Quantum S...,"20 pages, 4 main figures, 7 supplementary figu...","Nat Mach Intell 4, 618-624 (2022)",10.1038/s42256-022-00509-0,,quant-ph cs.AI cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,near-term quantum computer provide a promising...,"[{'version': 'v1', 'created': 'Mon, 17 May 202...",2023-01-31,"[['Bennewitz', 'Elizabeth R.', ''], ['Hopfmuel..."
52406,2304.07711,Wendong Zhang,Wendong Zhang and Qingjie Chai and Quanqi Zhan...,Obstacle-Transformer: A Trajectory Prediction ...,"8 pages, 4 figures","IET Cyber-Systems and Robotics, 2023, 5(1), e1...",10.1049/csy2.12066,,cs.CV,http://arxiv.org/licenses/nonexclusive-distrib...,"recurrent neural network, long short-term memo...","[{'version': 'v1', 'created': 'Sun, 16 Apr 202...",2023-04-18,"[['Zhang', 'Wendong', ''], ['Chai', 'Qingjie',..."
46529,2205.11082,Tanvi Mehta,"Tanvi Mehta, Ganesh Deshmukh",YouTube Ad View Sentiment Analysis using Deep ...,"5 pages, 9 figures, Published with Internation...",International Journal of Computer Applications...,10.5120/ijca2022922078,,cs.LG cs.IR,http://creativecommons.org/licenses/by/4.0/,sentiment analysis is currently a vital area o...,"[{'version': 'v1', 'created': 'Mon, 23 May 202...",2022-05-24,"[['Mehta', 'Tanvi', ''], ['Deshmukh', 'Ganesh'..."
55751,2310.10664,Dmitrijs Trizna,"Dmitrijs Trizna, Luca Demetrio, Battista Biggi...",Nebula: Self-Attention for Dynamic Malware Ana...,"18 pages, 7 figures, 12 tables, preprint, in r...",IEEE Transactions on Information Forensics and...,10.1109/TIFS.2024.3409083,,cs.CR cs.LG,http://creativecommons.org/licenses/by-sa/4.0/,dynamic analysis enables detecting window malw...,"[{'version': 'v1', 'created': 'Tue, 19 Sep 202...",2024-10-29,"[['Trizna', 'Dmitrijs', ''], ['Demetrio', 'Luc..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35977,2011.1356,Mingfu Xue,"Mingfu Xue, Shichang Sun, Zhiyu Wu, Can He, Ji...",SocialGuard: An Adversarial Example Based Priv...,,Journal of Information Security and Applicatio...,10.1016/j.jisa.2021.102993,,cs.CV cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,the popularity of various social platform ha p...,"[{'version': 'v1', 'created': 'Fri, 27 Nov 202...",2022-07-05,"[['Xue', 'Mingfu', ''], ['Sun', 'Shichang', ''..."
55387,2309.16032,Yuezhu Xu,Yuezhu Xu and S. Sivaranjani,Learning Dissipative Neural Dynamical Systems,6 pages,IEEE Control Systems Letters 2023,10.1109/LCSYS.2023.3337851,,cs.LG cs.SY eess.SY math.DS math.OC,http://creativecommons.org/licenses/by/4.0/,consider an unknown nonlinear dynamical system...,"[{'version': 'v1', 'created': 'Wed, 27 Sep 202...",2024-04-09,"[['Xu', 'Yuezhu', ''], ['Sivaranjani', 'S.', '']]"
46939,2206.05963,M\'aty\'as Sz\'ant\'o,"M\'aty\'as Sz\'ant\'o, Gy\""orgy R. Bog\'ar, L\...",ATDN vSLAM: An all-through Deep Learning-Based...,Published in Periodica Polytechnica Electrical...,Periodica Polytechnica Electrical Engineering ...,10.3311/PPee.20437,,cs.CV,http://creativecommons.org/licenses/by/4.0/,"in this paper, a novel solution is introduced ...","[{'version': 'v1', 'created': 'Mon, 13 Jun 202...",2022-07-18,"[['Szántó', 'Mátyás', ''], ['Bogár', 'György R..."
46795,2206.01904,Abhijith Sharma Mr,Abhijith Sharma and Apurva Narayan,Soft Adversarial Training Can Retain Natural A...,"7 pages, 6 figures",In Proceedings of the 14th International Confe...,10.5220/0010871000003116,,cs.LG cs.AI cs.CR,http://creativecommons.org/licenses/by/4.0/,adversarial training for neural network ha bee...,"[{'version': 'v1', 'created': 'Sat, 4 Jun 2022...",2022-06-07,"[['Sharma', 'Abhijith', ''], ['Narayan', 'Apur..."
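The index above keeps only where each word occurs, so the TF-IDF weights are not used in the search itself. To pick candidate keywords for a document, the weights are what matters. The sketch below makes the scoring explicit using only the standard library: it counts document frequencies on a collection and ranks the words of one document by term frequency times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`. The function `top_tfidf_terms` and the toy corpus are hypothetical. The sketch does not remove stop words or normalize the vectors, so its scores will differ from those of `TfidfVectorizer`.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text: str) -> List[str]:
    return TOKEN_PATTERN.findall(text.lower())

def top_tfidf_terms(document: str, corpus: List[str], top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank the words of `document` by TF-IDF, with IDF estimated on `corpus`."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))  # count each word once per document
    term_freq = Counter(tokenize(document))
    scores = {}
    for term, tf in term_freq.items():
        if doc_freq[term] == 0:
            continue  # word never seen in the collection: no IDF estimate, so skip it
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed IDF
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

# Hypothetical toy collection and query document
toy_corpus = [
    "neural networks learn representations",
    "neural networks for images",
    "graph algorithms for routing",
]
toy_document = "neural networks for graph routing and graph search"
top_terms = top_tfidf_terms(toy_document, toy_corpus, top_n=2)
print(top_terms)
```

In the notebook, the same idea would be to call `transform` on `sample_abstract` with a fitted `TfidfVectorizer`, sort the resulting weights, and compare the top words with `sample_keywords`.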


## Exercise 3: modelling abstracts with topics

Remember that, in our topic model with LDA, we decompose the word count matrix as:

$$
X \approx BA,
$$

where $B$ contains a representation of each document in terms of its topics and $A$ describes each topic in terms of its words.

However, we have not discussed how to find an optimal number of topics.

The idea used by [Amami et al.](http://link.springer.com/10.1007/978-3-319-41754-7_17) is to choose the number of topics that minimizes a metric called *perplexity*.

Perplexity measures how uncertain the model is when it predicts a word (see [Griffiths and Steyvers](https://doi.org/10.1073/pnas.0307752101)), so lower values are better. With too few topics, the model makes very broad assumptions about the data; with too many topics, the data becomes too sparse to estimate each topic reliably.
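To get a feel for the scale of this metric, perplexity can be computed by hand as the exponential of the negative average log-probability per word. The sketch below uses a hypothetical helper, `perplexity`, applied to made-up per-word probabilities; it is not how scikit-learn evaluates an LDA model, which relies on an approximate bound on the likelihood.

```python
import math
from typing import Sequence

def perplexity(word_probs: Sequence[float]) -> float:
    """exp(-(1/N) * sum(log p(w_i))) for the probabilities assigned to N observed words."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# A model that guesses uniformly among 1000 words is "1000-ways uncertain"
uniform_guess = perplexity([1 / 1000] * 50)
# A model that puts probability 0.5 on every observed word is "2-ways uncertain"
coin_flip = perplexity([0.5] * 50)
print(uniform_guess, coin_flip)
```

As a rough reference point, a model that spreads probability uniformly over a vocabulary of 1000 words has perplexity 1000, so useful models over such a vocabulary should score below that.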

Modify the code below to find an optimal number of topics for our data. Then, decompose all documents in the collection (also, do it to our abstract!) using the topic model.

In [17]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from tqdm import tqdm

print('Fitting vectorizer')
vectorizer = CountVectorizer(stop_words='english', min_df=10, max_df=0.8, max_features=1000).fit(df['abstract'])
abstract_vectorized = vectorizer.transform(df['abstract'].sample(10000))

print('Fitting LDA')
for n_components in tqdm([2, 10, 20, 50, 100]):
    lda = LatentDirichletAllocation(n_components=n_components, random_state=42, n_jobs=-1)
    lda.fit(abstract_vectorized)
    print(f"Number of components: {n_components}. Perplexity: {lda.perplexity(abstract_vectorized)}")


Fitting vectorizer
Fitting LDA


 20%|██        | 1/5 [00:07<00:28,  7.14s/it]

Number of components: 2. Perplexity: 657.6043876629792


 40%|████      | 2/5 [00:12<00:18,  6.11s/it]

Number of components: 10. Perplexity: 593.8808316572952


 60%|██████    | 3/5 [00:18<00:11,  5.83s/it]

Number of components: 20. Perplexity: 591.0537313718271


 80%|████████  | 4/5 [00:23<00:05,  5.81s/it]

Number of components: 50. Perplexity: 634.5754592862357


100%|██████████| 5/5 [00:30<00:00,  6.20s/it]

Number of components: 100. Perplexity: 728.4324568523865
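The loop above measures perplexity on the same sample that was used to fit each model. One possible variation, sketched below, evaluates perplexity on held-out documents and keeps the number of topics with the lowest value. The helper `pick_n_topics`, the 80/20 split, and the random toy count matrix are assumptions for illustration only; with the notebook's data, `abstract_vectorized` would take the place of `toy_counts`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

def pick_n_topics(counts, candidates, random_state=42):
    """Fit one LDA model per candidate and return the candidate with lowest held-out perplexity."""
    # Assumption: 20% of the documents are held out for evaluation
    train, held_out = train_test_split(counts, test_size=0.2, random_state=random_state)
    scores = {}
    for n_components in candidates:
        lda = LatentDirichletAllocation(n_components=n_components, random_state=random_state)
        lda.fit(train)
        scores[n_components] = lda.perplexity(held_out)
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical toy document-term counts: 60 documents, 30 words
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(60, 30))
best_k, scores = pick_n_topics(toy_counts, [2, 3, 5])
print(best_k, scores)
```

The toy matrix has no real topic structure, so the selected value is not meaningful here; the point is only the selection procedure.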





## Exercise 4: KL and JS divergences

The decomposition resulting from LDA represents each document as a probability distribution over topics. The dissimilarity between two probability distributions can be measured with the Kullback-Leibler (KL) divergence, given by:

$$
D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$
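A direct translation of this formula into code makes its behavior easy to check. The sketch below is a minimal version that uses the natural logarithm and two made-up distributions; the function name `kl_divergence` is chosen here for illustration. It assumes that $Q(i) > 0$ wherever $P(i) > 0$, otherwise the divergence is undefined.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) with natural logarithm; terms with P(i) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three topics
P = [0.8, 0.1, 0.1]
Q = [0.4, 0.3, 0.3]

forward = kl_divergence(P, Q)
backward = kl_divergence(Q, P)
print(forward, backward)  # the two directions give different values
```

Swapping the arguments changes the result, and the divergence of a distribution from itself is zero.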

However, the KL divergence is not symmetric, which was bothersome to Amami and their colleagues. For this reason, they used the Jensen-Shannon (JS) divergence, which compares both distributions to their average $M = \frac{1}{2}(P + Q)$:

$$
D_{JS}(P,Q) = \frac{D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)}{2}
$$

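The code below demonstrates how this works in practice. Note that `jensenshannon` from SciPy returns the Jensen-Shannon *distance*, which is the square root of the JS divergence: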


In [18]:
from scipy.spatial.distance import jensenshannon

lda = LatentDirichletAllocation(n_components=5, random_state=42, n_jobs=-1)
lda.fit(abstract_vectorized)

topics1 = lda.transform(abstract_vectorized[0,:])
topics2 = lda.transform(abstract_vectorized[1,:])
topics3 = lda.transform(abstract_vectorized[500,:])

print(topics1)
print(topics2)
print(topics3)

print(jensenshannon(topics1.ravel(), topics2.ravel()))
print(jensenshannon(topics1.ravel(), topics3.ravel()))
print(jensenshannon(topics2.ravel(), topics3.ravel()))

[[0.00474498 0.00476342 0.82322269 0.00482352 0.1624454 ]]
[[0.00357595 0.4263573  0.56282776 0.00363762 0.00360137]]
[[0.00299487 0.12900544 0.00295331 0.39454699 0.47049939]]
0.4444883056615585
0.6907981589338485
0.7142871446042901
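To see what `jensenshannon` computes, the same quantity can be written out by hand. The sketch below compares each distribution with their average $M = \frac{1}{2}(P + Q)$, averages the two KL divergences, and takes the square root. It uses the natural logarithm and the topic vectors copied from the output above; the helper names are chosen here for illustration.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence between p and q."""
    # Normalize so that each vector sums to one
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # average distribution M
    js_divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_divergence, 0.0))  # guard against tiny negative rounding errors

# Topic vectors copied from the printed output above
topics1 = [0.00474498, 0.00476342, 0.82322269, 0.00482352, 0.1624454]
topics2 = [0.00357595, 0.4263573, 0.56282776, 0.00363762, 0.00360137]

print(js_distance(topics1, topics2))
```

For the first two vectors this gives approximately 0.4445, in line with the first distance printed above. Unlike the KL divergence, the result is the same when the arguments are swapped.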


Using the LDA model you fitted in Exercise 3, find the topic distributions for our abstract and for each of the elements in the dataset. Then, make a function that retrieves the $K$ elements (where $K$ is an integer you can choose!) from the dataset that are closest to our abstract!
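One possible shape for such a retrieval function is sketched below. It assumes the topic distributions are already available as a matrix with one row per document, as returned by `lda.transform`, and ranks the rows by Jensen-Shannon distance to the query. The names `js_distance_to_all` and `top_k_closest` are hypothetical, and the small `eps` constant is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np

def js_distance_to_all(query, doc_topics, eps=1e-12):
    """Jensen-Shannon distance between one topic vector and every row of doc_topics."""
    query = np.asarray(query, dtype=float).ravel()
    docs = np.asarray(doc_topics, dtype=float)
    query = query / query.sum()
    docs = docs / docs.sum(axis=1, keepdims=True)
    m = 0.5 * (docs + query)  # average distribution, one row per document
    kl_query = np.sum(query * np.log((query + eps) / (m + eps)), axis=1)
    kl_docs = np.sum(docs * np.log((docs + eps) / (m + eps)), axis=1)
    return np.sqrt(np.maximum(0.5 * (kl_query + kl_docs), 0.0))

def top_k_closest(query, doc_topics, k=5):
    """Row positions and distances of the k documents closest to the query."""
    distances = js_distance_to_all(query, doc_topics)
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Toy example using the three topic vectors printed in Exercise 4
doc_topics = np.array([
    [0.00474498, 0.00476342, 0.82322269, 0.00482352, 0.1624454],
    [0.00357595, 0.4263573, 0.56282776, 0.00363762, 0.00360137],
    [0.00299487, 0.12900544, 0.00295331, 0.39454699, 0.47049939],
])
order, distances = top_k_closest(doc_topics[0], doc_topics, k=2)
print(order, distances)
```

If the rows of the topic matrix are in the same order as the rows of the DataFrame that was vectorized, the recommended articles can then be selected with `.iloc[order]` on that DataFrame.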

## Exercise 5

Compare the recommendations provided by keyword searching, by TF-IDF keyword searching, and by topic modelling.

1. Which recommendation seems more useful?
1. Could you combine the techniques above (at least 2 of them) to get a possibly better recommendation?
1. Can you use an LLM to help with this task? How? Implement an LLM-based solution and compare it with the previous ones.
