# arXiv Recommender Architecture

## Purpose of this notebook:

1. Lay out a clear blueprint of the project goal
    - Describe the pipeline between user input and project output
    - Clearly indicate which aspects of the pipeline we will tune
2. Clearly list the next phases of the project.
    - Deadline: Night of **June 2**. 10 days left
    - Break these into subtasks and create a log to keep track of progress.

## Project goal

### Main goal (Rapid prototype)

User input: a list of papers they are interested in

Model output: top 10 papers within a fixed library that are 'most similar' to the input set

Components of this model: 

(Step 0) Prepare the library
- Clean and vectorize the papers
- Run clustering to organize the library into topics

1. Process the input set
    - Clean and vectorize
    - Obtain the 3 most likely topics each input paper belongs to
1. For each input paper, find candidate papers to recommend.
    - Search among the top 3 most likely topics to find its nearest neighbors
1. Reduce the candidate recommendations to the top 10 'best' recommendations
    - Choose some scheme for doing this
1. Return the recommendations in a human-readable format
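
Steps 1 and 2 above can be sketched as follows. This is a minimal sketch, not a settled design: it assumes the library is already vectorized and labeled by topic, and it assumes that a paper's "most likely" topics are those whose centroids are most cosine-similar to it. The names `unit_rows` and `candidates_for_paper` are hypothetical, and the toy vectors are made up for illustration.

```python
import numpy as np


def unit_rows(X):
    """Scale each row to unit length so that dot products are cosine similarities."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)


def candidates_for_paper(query_vec, library_vecs, topic_labels,
                         n_topics=3, n_neighbors=5):
    """Pick the query's likeliest topics, then search only those topics
    for its nearest neighbors (by cosine similarity)."""
    lib = unit_rows(library_vecs)
    q = unit_rows(np.asarray(query_vec, dtype=float)[None, :])[0]
    labels = np.asarray(topic_labels)
    topics = np.unique(labels)

    # Assumption: a topic's "likelihood" is the cosine similarity between the
    # query and the topic centroid (mean of the topic's unit vectors).
    centroids = unit_rows(np.stack([lib[labels == t].mean(axis=0) for t in topics]))
    top_topics = topics[np.argsort(-(centroids @ q))[:n_topics]]

    # Restrict the search to library papers in the selected topics.
    pool = np.flatnonzero(np.isin(labels, top_topics))
    sims = lib[pool] @ q
    order = np.argsort(-sims)[:n_neighbors]
    return pool[order], sims[order]


# Toy library: 8 papers in 4 topics (made-up 3-d vectors).
library = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0],
                    [0, 0, 1], [0, 0.1, 0.9], [-1, 0, 0], [-0.9, -0.1, 0]])
labels = [0, 0, 1, 1, 2, 2, 3, 3]
idx, sims = candidates_for_paper(np.array([1.0, 0.05, 0.0]), library, labels)
```

Here `idx` holds library row indices sorted from most to least similar, drawn only from the three topics nearest the query. Step 3 would then merge these candidate lists across all input papers.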

### By 'a model' we mean a choice of how to carry out these steps


## Current state of progress

1. Current library is 'df_experiment'
    - ~4400 papers cleaned and pre-processed
    - Roughly 1000 from each of the subjects: diff geo, math phys, PDEs, quantum algebra, rep theory
2. Jenia and Ethan's code can vectorize this library according to different vectorization schemes
    - Bag of words/word count
    - tf-idf
    - word2vec
3. Given an input paper, we can vectorize it using the same code and compute its nearest neighbors with respect to cosine distance

### What we do not yet have

- Clustering into topics for any of the three vectorization schemes above
- A function that takes the arXiv ID of an input paper and outputs the top X closest papers in the library
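
One possible shape for the missing lookup function, as a minimal sketch. It assumes the library is held as a list of IDs plus a matching array of vectors. The `get_vector` argument is a hypothetical callable standing in for the existing fetch, clean, and vectorize code, and the IDs and vectors in the usage lines are placeholders.

```python
import numpy as np


def top_x_closest(arxiv_id, get_vector, library_ids, library_vecs, top_x=10):
    """Return the top_x library papers closest to the paper with this arXiv ID.

    get_vector is a hypothetical callable mapping an arXiv ID to the paper's
    vector; it stands in for the existing vectorization code.
    """
    q = np.asarray(get_vector(arxiv_id), dtype=float)
    lib = np.asarray(library_vecs, dtype=float)

    # Cosine distance = 1 - cosine similarity; the epsilon guards against
    # division by zero for all-zero vectors.
    denom = np.linalg.norm(lib, axis=1) * np.linalg.norm(q) + 1e-12
    dists = 1.0 - (lib @ q) / denom

    # Sort ascending by distance; skip the input paper if it is in the library.
    ranked = [(library_ids[i], float(dists[i])) for i in np.argsort(dists)
              if library_ids[i] != arxiv_id]
    return ranked[:top_x]


# Placeholder IDs and word-count-style vectors, for illustration only.
ids = ["id-A", "id-B", "id-C"]
vecs = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 1]])
lookup = dict(zip(ids, vecs))
closest = top_x_closest("id-A", lookup.__getitem__, ids, vecs, top_x=2)
```

The function returns `(id, cosine distance)` pairs, so the same output can feed a human-readable listing later.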




## Suggested next steps

1. Build a bare-bones full model as above (with clustering) that can perform the goal task (maybe badly)
1. Spend the rest of the time looking into tuning each individual aspect of the model pipeline to achieve best results



## Model Architecture

Building a model consists of making the following choices:

1. The choice of (and pre-processing of) the library from which we will pull recommendations
    - Per Andrew's advice, expand our library size and subject breadth
    - Practical size is constrained by the speed at which we can vectorize and cluster it
    - Perhaps give some exploratory breakdown of topics by arXiv subject tag before analyzing
2. The choice of method to vectorize the text
    - Want to use sentence transformers to get the best results
    - How to choose among all of the pre-trained models available?
    - Ways to optimize the speed at which it acts on our library?
3. The choice of how to cluster by topic
    - BERTopic combines this with the previous step; it can vectorize the text using a transformer of our choice and then cluster it
    - We have design choices regarding how to do dimension reduction (PCA, t-SNE, UMAP) and which clustering algorithm to use (HDBSCAN, K-means)
4. The choice of the notion of distance or 'similarity' between papers in the embedding space.
    - Used to pull the nearest neighbors of a new input
    - Cosine distance is standard; what are other candidates?
5. The choice of reduction from all candidate recommendations to the top 10 best
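
On choice 4, one standard identity may help narrow the candidates. It is stated here as a general fact about unit vectors, not as a result about our data. If the embeddings $u, v$ are normalized so that $\|u\| = \|v\| = 1$, then

$$
\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2\,(1 - \cos\theta_{uv}) = 2\,d_{\cos}(u, v),
$$

where $d_{\cos}(u, v) = 1 - \cos\theta_{uv}$ is the cosine distance. On normalized vectors, Euclidean and cosine distance therefore rank neighbors identically; the two only differ when the vector norms carry information.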

## Default choices to be tuned later

1. BERTopic's default sentence transformer
2. UMAP and HDBSCAN default parameters will yield clustering with no further choices by us
3. Cosine distance for finding nearest neighbors
4. Take the top 10 closest in terms of distance to the *set* of inputs?
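
Defaults 1 and 2 can be written out explicitly so that later tuning has a visible starting point. This is a sketch under stated assumptions: the values in `DEFAULTS` are what we understand BERTopic to use when no sub-models are passed, and should be checked against the installed version. `DEFAULTS` and `build_topic_model` are hypothetical names. The heavy imports sit inside the function so the config can be inspected without loading the libraries.

```python
# Assumed BERTopic defaults; verify against the installed version.
DEFAULTS = {
    "embedding_model": "all-MiniLM-L6-v2",
    "umap": {"n_neighbors": 15, "n_components": 5,
             "min_dist": 0.0, "metric": "cosine"},
    "hdbscan": {"min_cluster_size": 10, "metric": "euclidean",
                "cluster_selection_method": "eom", "prediction_data": True},
}


def build_topic_model(config=None):
    """Build a BERTopic model from an explicit config (hypothetical helper)."""
    # Imported lazily: these packages are slow to load and optional here.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    config = config or DEFAULTS
    return BERTopic(
        embedding_model=config["embedding_model"],
        umap_model=UMAP(**config["umap"]),
        hdbscan_model=HDBSCAN(**config["hdbscan"]),
    )
```

With the packages installed, `topics, probs = build_topic_model().fit_transform(docs)` would cluster a list of document strings `docs`. Tuning then amounts to passing a modified copy of `DEFAULTS`.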


## Ways to tune this architecture

See https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#n_components for the list of hyperparameters within BERTopic that can be tuned.

1. Sentence Transformer
    - There is a whole zoo of sentence transformers on Hugging Face; they can be imported and plugged directly into BERTopic.
    - What makes these different? What are they trained on?
    - Does this choice affect the performance of the recommender in a meaningful way?
2. UMAP
    - n_neighbors - roughly the scale at which to approximate the topology of the high-dimensional data. Perhaps we can get larger topic clusters by looking at larger-scale features.
    - n_components - number of dimensions after reducing. The lower it is, the more information is destroyed; the higher it is, the harder it is for HDBSCAN to cluster well.
3. HDBSCAN
    - min_cluster_size - **one of the most important**. By default, clusters can be as small as 10 points. Increasing this will decrease the number of clusters.
    - metric - the choice of distance used in the clustering algorithm. Something to note: UMAP does *not* preserve absolute distances between the data. Regions of tightly packed data in high dimensions are treated the same as more spread-out regions, so using Euclidean distance to detect clusters after UMAP may not capture how clustered the data is in the original embedding.
4. More nuanced ways to create the 10 best recommendations?
    - One possible idea: if clustering is effective on small sets of data (e.g. 30 papers of interest), detect small clusters of ~5 or so papers and replace each by its mean. To generate recs, take the closest papers to these means.
    - Another idea: return the "best" rec for each input paper.
    - Or return a collection of recs based on the aggregate of the inputs (something like recommendations close to the average input).
    - Or let the user set a similarity threshold; any paper scoring above it (relative to either a single input paper or the aggregate collection) is recommended.
    - Or let the user choose among these (this would require a little more coding on our end).
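
Three of the reduction ideas in item 4 can be compared side by side with a few lines of NumPy. This is a rough sketch only: the function names are hypothetical, similarity is taken to be cosine similarity, and the threshold variant scores each library paper against its best-matching single input rather than the aggregate.

```python
import numpy as np


def _unit(X):
    """Return X as a 2-d array with each row scaled to unit length."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)


def recs_near_average(input_vecs, library_vecs, k=10):
    """Top-k library papers closest to the average of the unit input vectors."""
    centroid = _unit(_unit(input_vecs).mean(axis=0))[0]
    sims = _unit(library_vecs) @ centroid
    return np.argsort(-sims)[:k]


def best_rec_per_input(input_vecs, library_vecs):
    """For each input paper, the index of its most similar library paper."""
    sims = _unit(input_vecs) @ _unit(library_vecs).T  # (n_inputs, n_library)
    return sims.argmax(axis=1)


def recs_above_threshold(input_vecs, library_vecs, threshold):
    """Library papers whose best similarity to any single input exceeds
    the threshold, sorted from most to least similar."""
    sims = _unit(input_vecs) @ _unit(library_vecs).T
    best = sims.max(axis=0)
    keep = np.flatnonzero(best > threshold)
    return keep[np.argsort(-best[keep])]


# Made-up 2-d vectors for illustration.
toy_library = np.array([[1, 0], [0, 1], [1, 1], [-1, 0]])
toy_inputs = np.array([[1, 0.1], [0.1, 1]])
near_avg = recs_near_average(toy_inputs, toy_library, k=1)
per_input = best_rec_per_input(toy_inputs, toy_library)
above = recs_above_threshold(toy_inputs, toy_library, threshold=0.9)
```

Note that `best_rec_per_input` can return the same library paper for several inputs, and none of these functions excludes input papers that already sit in the library; both would need handling in a real version.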