# WIP at the moment; use research_ver.ipynb for results

# Literature Review Assistant
Goal: this notebook provides an LLM "research assistant" that helps explore even vague questions by retrieving relevant papers from arXiv. It does not read the papers for you, but it gives a short summary of each one that can be more useful than the standard abstract when skimming several papers in one sitting.
Secondary goal: the assistant also suggests sub-topics and alternative keywords, which helps when the question is vague, lacks crucial keywords, or covers a vast topic.
 

In [1]:
### set your query 
query = "I am looking for a paper where an LLM managed to translate an unfamiliar language after being shown the vocabulary in the prompt "
### set the desired number of subtopics
number_of_subtopics = 5

### Actual Code

In [2]:
from google import genai
from google.genai import types

import json

import os
from dotenv import load_dotenv

import pandas as pd

import requests
import xml.etree.ElementTree as ET
import datetime

from chromadb import Documents, EmbeddingFunction, Embeddings
# from google.api_core import retry
import chromadb
from IPython.display import Markdown


In [3]:
load_dotenv()
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

In [4]:
def get_model_response(prompt:str) -> str:
    config = types.GenerateContentConfig(temperature=0.0)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        config=config,
        contents=prompt,
    )
    
    return response.text

In [5]:
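def generate_best_query(request: str, main_query: bool = False):
    """Return one arXiv API URL, or a list of three paged URLs (start=0/10/20) when main_query is True."""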

    prompt = f'''You are a helpful research assistant doing a literature review. The researcher says: "{request}". What would be the most accurate arXiv API call to find this information? Please provide the API call alone, no need for an explanation. 
        INSTRUCTIONS: 
        Step 1 - consider how an arXiv API call is constructed. 
        Step 2 - define the best search query to use for the researcher's request
        Step 3 - construct the arXiv API call
        Step 4 - make sure the arXiv API call is valid, and fix it if it's not'''
    
    response = get_model_response(prompt)

    best_query = response.strip().replace("\n", "")
    if main_query:
        best_query_pages = [best_query, best_query.replace("start=0", "start=10"), best_query.replace("start=0", "start=20")]
    
        return best_query_pages
    
    return best_query
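
The paging above relies on the model emitting `start=0` verbatim; if it does not, the three URLs come out identical. Below is a minimal sketch of a more deterministic alternative, assuming the model is asked only for the `search_query` string and the URL is assembled locally. `build_arxiv_pages` is a hypothetical helper and is not part of this notebook; `search_query`, `start`, and `max_results` are the arXiv API's paging parameters.

```python
from urllib.parse import urlencode

ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"

def build_arxiv_pages(search_query: str, pages: int = 3, page_size: int = 10) -> list:
    """Build one arXiv API URL per result page for the same search query."""
    urls = []
    for page in range(pages):
        params = {
            "search_query": search_query,
            "start": page * page_size,   # offset of the first result on this page
            "max_results": page_size,
        }
        # urlencode escapes spaces and colons, so the URL needs no further cleaning
        urls.append(f"{ARXIV_ENDPOINT}?{urlencode(params)}")
    return urls

for url in build_arxiv_pages("all:machine translation"):
    print(url)
```

With this approach the LLM output would only need to be a query string such as `all:machine translation`, which is easier to validate than a full URL.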
    

In [6]:
def generate_subtopic_query(request: str) -> tuple:
    
    prompt=f'''You are a helpful research assistant doing a literature review. The researcher says: "{request}". What would be {number_of_subtopics} relevant sub-topics to gain a better understanding of this matter? Please provide a list of these topics. Provide no explanation. Return a Python list. '''
    
    close_topics = get_model_response(prompt)
    json_start = close_topics.find("[")
    s = close_topics[json_start:].replace("`", "").replace("[", "").replace("]", "").replace('"', "").replace("\n", "")
    
    subtopics = [i.strip() for i in s.split(", ")]
    
    subtopic_queries = []
    
    for topic in subtopics:
        subtopic_query = f"I want to look into {topic}"
        r = generate_best_query(subtopic_query)
        subtopic_queries.append(r.strip().replace("\n", ""))
    
    return subtopics, subtopic_queries
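
Splitting the model's reply on `", "` breaks when a topic itself contains a comma. The sketch below shows one possible alternative, assuming the reply contains a Python list literal somewhere in the text: it parses the bracketed span with `ast.literal_eval` and falls back to character stripping otherwise. `parse_topic_list` is a hypothetical helper, not part of this notebook.

```python
import ast

def parse_topic_list(raw: str) -> list:
    """Extract a list of topic strings from an LLM reply."""
    start, end = raw.find("["), raw.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = ast.literal_eval(raw[start:end + 1])
            if isinstance(parsed, (list, tuple)):
                return [str(item).strip() for item in parsed]
        except (ValueError, SyntaxError):
            pass  # not a valid literal; use the fallback below
    # Fallback: strip list punctuation and split on commas
    cleaned = raw[max(start, 0):]
    for ch in '`[]"\n':
        cleaned = cleaned.replace(ch, "")
    return [part.strip() for part in cleaned.split(",") if part.strip()]

reply = '```python\n["Zero-shot translation", "Prompting, few-shot"]\n```'
print(parse_topic_list(reply))
```

The fallback keeps the notebook's current behaviour for replies that are not valid Python, so topics containing commas are still split in that case.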

In [7]:
def generate_search_queries(request: str) -> tuple:
    main_query = generate_best_query(request, main_query=True)
    subtopic_results = generate_subtopic_query(request)
    subtopic_queries = subtopic_results[1]
    subtopic_list = subtopic_results[0]
    return main_query + subtopic_queries, subtopic_list 

In [8]:
def get_arxiv_metadata(url: str, topic: str) -> dict:
    
    arxiv_entries = {}
    
    url = url.replace(" ", "").replace("`", "")
    
    arxiv_response = requests.get(url)
    if arxiv_response.status_code != 200:
        # Report the status of the arXiv response itself
        raise requests.HTTPError(f"API call failed with status code {arxiv_response.status_code}")
    else:
        root = ET.fromstring(arxiv_response.text)
        ns = {'atom': 'http://www.w3.org/2005/Atom'}
        
        for entry in root.findall('atom:entry', ns):
            title = entry.find('atom:title', ns).text.strip()
            summary = entry.find('atom:summary', ns).text.strip()
            authors = [author.find('atom:name', ns).text.strip()
                        for author in entry.findall('atom:author', ns)]
            link = entry.find('atom:link', ns).attrib['href']
            entry_id = entry.find('atom:id', ns).text.strip()  # avoid shadowing the built-in id()
            updated = entry.find('atom:updated', ns).text.strip()
            updated = datetime.datetime.strptime(updated, '%Y-%m-%dT%H:%M:%SZ').date()
            if entry_id not in arxiv_entries:
                arxiv_entries[entry_id] = {
                    'summary': summary,
                    'metadatas': {
                        'authors': ", ".join(authors),
                        'title': title,
                        'published_url': link,
                        'updated': str(updated),
                        'id_url': entry_id,
                        'original_search': url, 
                        'topic': topic
                    }
                }
                
        return arxiv_entries
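
The Atom parsing can be checked without a network call. The sketch below runs the same kind of namespace-aware lookup on a small inline feed. The feed contents (id, title, authors) are made up for illustration, and `parse_entries` is a hypothetical helper rather than the notebook's function. It also collapses the line breaks that wrapped titles may contain.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up sample feed, shaped like an Atom response
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <updated>2024-01-02T03:04:05Z</updated>
    <title>An Example
      Title</title>
    <summary>  An example abstract.  </summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
  </entry>
</feed>"""

def parse_entries(feed_xml: str) -> dict:
    """Map each entry id to its title, authors, and summary."""
    root = ET.fromstring(feed_xml)
    entries = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        authors = [
            name.text.strip()
            for name in entry.findall("atom:author/atom:name", ATOM_NS)
            if name.text
        ]
        entries[entry_id] = {
            "title": " ".join(title.split()),  # collapse wrapped whitespace
            "authors": ", ".join(authors),
            "summary": summary.strip(),
        }
    return entries

print(parse_entries(SAMPLE_FEED))
```

Using `findtext` with a default avoids an `AttributeError` when an element is missing, which the direct `.find(...).text` pattern would raise.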
        

In [9]:
# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    # @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
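
The retry decorator is commented out above, so quota errors currently propagate. As a rough stand-in, the sketch below shows a standard-library retry decorator with exponential backoff that accepts the same kind of predicate as `is_retriable`. `retry_on` and `FakeQuotaError` are hypothetical names used for illustration; this is not the `google.api_core` implementation.

```python
import time
from functools import wraps

def retry_on(predicate, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Retry the wrapped function while predicate(exception) is true."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retriable errors
                    if attempt == attempts - 1 or not predicate(exc):
                        raise
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator

class FakeQuotaError(Exception):
    """Stand-in for an API error carrying an HTTP status code."""
    code = 429

calls = {"n": 0}
delays = []

@retry_on(lambda e: getattr(e, "code", None) in {429, 503}, sleep=delays.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeQuotaError()
    return "ok"

print(flaky(), calls["n"], delays)
```

Passing `sleep` as a parameter lets the demo record the delays instead of actually waiting.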

In [10]:
DB_NAME = "arxiv_results"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)




In [11]:
def generate_db_content(call_list: list, subtopics: list): 
    counter = 0
    subtopic_idx = 0
    n = len(subtopics)
    calls = len(call_list)
    for i in range(calls):
        if i > calls - n - 1: 
            subtopic = subtopics[subtopic_idx]
            subtopic_idx += 1
        else:
            subtopic = "main"
        api_call = call_list[i]
        response = get_arxiv_metadata(api_call, subtopic)
        if response is not None:
            for article_id in response:
                db.add(documents=[response[article_id]["summary"]], metadatas=[response[article_id]["metadatas"]],  ids=[article_id])
                counter += 1
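
The index arithmetic above labels the first calls as "main" and the last `len(subtopics)` calls with their sub-topic. A minimal sketch of the same pairing, written with `zip` so the mapping is explicit; `label_calls` is a hypothetical helper, not part of this notebook.

```python
def label_calls(call_list: list, subtopics: list) -> list:
    """Pair each API call with "main" or with the sub-topic it was generated for."""
    n_main = len(call_list) - len(subtopics)
    if n_main < 0:
        raise ValueError("more subtopics than API calls")
    labels = ["main"] * n_main + list(subtopics)
    return list(zip(call_list, labels))

pairs = label_calls(["m0", "m1", "m2", "s0", "s1"], ["topic A", "topic B"])
for call, label in pairs:
    print(call, "->", label)
```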
        

In [12]:
full_result = generate_search_queries(query)
r, st = full_result[0], full_result[1]

In [13]:
generate_db_content(r, st)

In [14]:
embed_fn.document_mode = False

result = db.query(query_texts=[query], n_results= 3 + 1)

[all_passages] = result["documents"] 
all_titles = [j["title"] for i in result["metadatas"] for j in i]
all_authors = [j["authors"] for i in result["metadatas"] for j in i]
[all_ids] = result["ids"]


In [15]:
best_article = db.peek(1)

In [16]:
best_article = [best_article["metadatas"][0]["title"], best_article["metadatas"][0]["authors"], best_article["documents"], best_article["ids"]] 

In [17]:
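# best_article[3] is the id list returned by db.peek(1); compare against its first id
best_id = best_article[3][0]
exclude_best = best_id in result["ids"][0]

In [18]:
if exclude_best:
    q = all_ids.index(best_id)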
    # best_passage = all_passages[q]
    # best_title = all_titles[q]
    # best_metadata = result["metadatas"][q]
    all_titles = all_titles[:q] + all_titles[q+1:]
    all_passages = all_passages[:q] + all_passages[q+1:]
    all_ids = all_ids[:q] + all_ids[q+1:]
    all_authors = all_authors[:q] + all_authors[q+1:]
    


In [19]:
more_article_data = list(zip(all_titles, all_authors, all_passages, all_ids))
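
Note that `db.peek(1)` returns the first stored record, which is not necessarily the closest match to the query. One possible alternative, sketched below, is to treat the top-ranked query hit as the main article and the remaining hits as the additional ones. This assumes the query result keeps hits ordered from closest to farthest and uses the nested-list shape seen above (`result["ids"][0]`, and so on). `split_best` is a hypothetical helper and the sample data is made up.

```python
def split_best(result: dict) -> tuple:
    """Return (best_row, other_rows); each row is (title, authors, passage, id)."""
    ids = result["ids"][0]
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    rows = [(m["title"], m["authors"], d, i) for i, d, m in zip(ids, docs, metas)]
    if not rows:
        return None, []
    return rows[0], rows[1:]

# Made-up result in the same shape as the query output used above
fake_result = {
    "ids": [["id-1", "id-2"]],
    "documents": [["abstract one", "abstract two"]],
    "metadatas": [[
        {"title": "Paper One", "authors": "A. Author"},
        {"title": "Paper Two", "authors": "B. Author"},
    ]],
}
best, others = split_best(fake_result)
print(best)
print(others)
```

This would also remove the need for the separate exclusion step, since the best hit is never included in the remaining rows.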

In [44]:
def get_subtopic_articles(subtopics: list, ids: list) -> list:
    subtopic_contents = []

    for subtopic in subtopics:
        results = db.get(
        where={"topic": subtopic},
        limit=10) 

        idx = None
        subtopic_data = None
        for i in results["ids"]:
            if i not in ids:
                idx = results["ids"].index(i)
                break

        if idx is not None:
            article_content = results["documents"][idx]
            article_id = results["ids"][idx]
            article_title = results["metadatas"][idx]["title"]
            article_authors = results["metadatas"][idx]["authors"]
            subtopic_data = article_title, article_authors, article_content, article_id
        
        if subtopic_data:
            subtopic_contents.append(subtopic_data)
    return subtopic_contents


In [45]:
def generate_summary(request: str, subtopics: list, ids: list) -> str:
    prompt = f'''You are a helpful research assistant. The researcher says: "{request}". 
Here is the data on the main article on the topic:
 {best_article}
 Here is the data on a few more articles on the topic: 
 {more_article_data}
Please provide a short summary for each article (a couple of sentences). Please highlight the main article, providing a summary of up to 4 sentences about it. 
For each article mentioned, be sure to include the article title, authors, and id.  
Please don't explain that you got it. 
'''
    
  
    response = get_model_response(prompt)
    if not subtopics: 
        return response
    else: 
        subtopic_string = "\n".join(subtopics)
        keyword_text = (f"\n\n\n----some helpful keywords/topics may be: {subtopic_string}")
        
        
        subtopic_articles = get_subtopic_articles(subtopics, ids=ids)
        subtopics_prompt = f'''You are a helpful research assistant. The researcher says: "{request}". 
Here are some articles that another assistant suggested to further explore this matter: 
{subtopic_articles}

Please provide a short summary of how the things discussed in the articles add to the concept the researcher was talking about. 
For each article mentioned, be sure to include the article title, authors, and id.  
If no article is relevant, simply return " ".
Please don't explain that you got it. 
'''
        subtopics_response = get_model_response(subtopics_prompt)
        return response + keyword_text + "\n\n\n" + subtopics_response
        

In [46]:
Markdown(generate_summary(query, st, all_ids))



Here are summaries of the provided articles:

**Main Article:**

*   **Title:** Killing it with Zero-Shot: Adversarially Robust Novelty Detection
*   **Authors:** Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Zeinab Sadat Taghavi, Mohammad Sabokrou, Mohammad Hossein Rohban
*   **ID:** http://arxiv.org/abs/2501.15271v1

This paper addresses the vulnerability of novelty detection (ND) algorithms to adversarial attacks. It proposes a method that combines nearest-neighbor algorithms with robust features from ImageNet-pretrained models to enhance the robustness and performance of ND. The results demonstrate significant improvements over state-of-the-art methods under adversarial conditions, establishing a new standard for robust ND. The implementation is publicly available.

**Other Articles:**

*   **Title:** Augmenting Large Language Model Translators via Translation Memories
*   **Authors:** Yongyu Mu, Abudurexiti Reheman, Zhiquan Cao, Yuchun Fan, Bei Li, Yinqiao Li, Tong Xiao, Chunliang Zhang, Jingbo Zhu
*   **ID:** http://arxiv.org/abs/2305.17367v1

This paper explores using translation memories (TMs) as prompts for large language models (LLMs) to improve their translation capabilities. The study finds that LLMs can effectively utilize high-quality TM-based prompts, achieving results comparable to state-of-the-art NMT systems.

*   **Title:** Human-in-the-loop Machine Translation with Large Language Model
*   **Authors:** Xinyi Yang, Runzhe Zhan, Derek F. Wong, Junchao Wu, Lidia S. Chao
*   **ID:** http://arxiv.org/abs/2310.08908v1

This paper proposes a human-in-the-loop pipeline for machine translation using LLMs, where human feedback or automatic retrieval is used to guide the LLM's translation process. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance.

*   **Title:** Adaptive Machine Translation with Large Language Models
*   **Authors:** Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way
*   **ID:** http://arxiv.org/abs/2301.13294v3

This paper investigates the use of in-context learning with LLMs to improve real-time adaptive machine translation. The experiments show that LLMs can adapt to in-domain sentence pairs and terminology, surpassing strong encoder-decoder MT systems, especially for high-resource languages.

*   **Title:** Machine Translation for Ge'ez Language
*   **Authors:** Aman Kassahun Wassie
*   **ID:** http://arxiv.org/abs/2311.14530v3

This paper explores various methods to improve machine translation for Ge'ez, a low-resource ancient language, including transfer learning, vocabulary optimization, fine-tuning, and few-shot translation with LLMs. The study finds that GPT-3.5 achieves a reasonable BLEU score with no initial knowledge of Ge'ez, but still lower than the MNMT baseline.



----some helpful keywords/topics may be: In-context learning for low-resource languages
Zero-shot translation with large language models
Few-shot translation with large language models
LLM adaptation to novel languages
Prompt engineering for machine translation


Here's how the provided articles relate to the researcher's interest in LLMs translating unfamiliar languages after being shown vocabulary in the prompt:

*   **Iterative Translation Refinement with Large Language Models** by Pinzhen Chen, Zhicheng Guo, Barry Haddow, Kenneth Heafield (http://arxiv.org/abs/2306.03856v2): This paper explores iteratively prompting an LLM to refine translations. While it doesn't directly address translation from completely *unfamiliar* languages based solely on prompt-provided vocabulary, the concept of iterative refinement could be relevant. The LLM could potentially use the initial vocabulary to make a first-pass translation, and then iteratively refine it based on further prompting and context.

*   **Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters** by Daniil Gurgurov, Mareike Hartmann, Simon Ostermann (http://arxiv.org/abs/2407.01406v3): This paper focuses on adapting LLMs to low-resource languages using knowledge graphs. While not exactly the same as translating a completely unfamiliar language from scratch, the techniques used to adapt to low-resource languages could be relevant. The paper explores methods for incorporating external knowledge into LLMs to improve their performance on languages with limited data, which is conceptually similar to providing vocabulary in the prompt.

*   **Optimizing Machine Translation through Prompt Engineering: An Investigation into ChatGPT's Customizability** by Masaru Yamada (http://arxiv.org/abs/2308.01391v2): This paper investigates how prompt engineering can influence the quality of translations produced by ChatGPT. The idea of providing context and instructions through prompts is directly relevant to the researcher's question. By carefully crafting prompts that include the vocabulary and context of the unfamiliar language, it might be possible to guide the LLM to produce reasonable translations.
