# Overview

In this notebook, we will use building a RAG system that suggests short and easy to read ML paper titles from original ML paper titles. Our use case is that the paper tiles can be too technical for a general audience so using RAG to generate short titles based on previously created short titles can make research paper titles more accessible and used for science communication such as in the form of newsletters or blogs.

In [1]:
%%capture
!pip install transformers==4.38.2
!pip install accelerate==0.27.2
# !pip install datasets==2.18.0
!pip install peft==0.9.0
!pip install bitsandbytes==0.42.0
!pip install sentence-transformers==2.5.1
!pip install chromadb==0.4.24

In [2]:
import os
import torch
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))


os.environ["MODEL_NAME"] = "mistralai/Mistral-7B-Instruct-v0.2"
os.environ["DATASET"]="/kaggle/input/weekly-top-trending-ml-papers/ml-potw-10232023.csv"


torch.backends.cudnn.deterministic=True
# https://github.com/huggingface/transformers/issues/28731
torch.backends.cuda.enable_mem_efficient_sdp(False)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `mistralai/Mistral-7B-Instruct-v0.2` from `transformers`...
config.json: 100%|█████████████████████████████| 596/596 [00:00<00:00, 2.82MB/s]
┌──────────────────────────────────────────────────────────────────┐
│  Memory Usage for loading `mistralai/Mistral-7B-Instruct-v0.2`   │
├───────┬─────────────┬──────────┬─────────────────────────────────┤
│ dtype │Largest Layer│Total Size│       Training using Adam       │
├───────┼─────────────┼──────────┼─────────────────────────────────┤
│float32│  864.03 MB  │ 27.49 GB │            109.96 GB            │
│float16│  432.02 MB  │ 13.74 GB │             54.98 GB            │
│  int8 │  216.01 MB  │ 6.87 GB  │             27.49 GB            │
│  int4 │   108.0 MB  │ 3.44 GB  │             13.74 GB            │
└───────┴─────────────┴──────────┴─────────────────────────────────┘


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))
tokenizer

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-Instruct-v0.2', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [5]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

model.config.eos_token_id=tokenizer.eos_token_id
model.gradient_checkpointing_enable() # reducing memory usage
print(model.model.embed_tokens)

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Embedding(32000, 4096)


# Checking the Model

Tips: Please know that we do not fine-tune the model for fitting the specific tasks. So, the answer may not good. However, it is enough for us to illustrate our RAG's solution.

In [10]:
input_text = "The weather in Melbourne is "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_length=20, do_sample=True)
print(tokenizer.decode(outputs[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> The weather in Melbourne is 17C, sunny but with a slight breeze.



## Prompt with Chat Template

In [14]:
prompt = """[INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:

{
  "name": "John Doe",
  "relationship": "Bride's cousin",
  "hometown": "New York, NY",
  "fun_fact": "Climbed Mount Everest in 2020",
  "attending_with": "Sophia Smith",
  "bride_groom_name": "Tom and Mary"
}

Use only the data provided in the JSON object above.

The senders of the letter is the bride and groom, Tom and Mary.
[/INST]"""

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_length=300, do_sample=True)

decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(decoded_output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:

{
  "name": "John Doe",
  "relationship": "Bride's cousin",
  "hometown": "New York, NY",
  "fun_fact": "Climbed Mount Everest in 2020",
  "attending_with": "Sophia Smith",
  "bride_groom_name": "Tom and Mary"
}

Use only the data provided in the JSON object above.

The senders of the letter is the bride and groom, Tom and Mary.
[/INST] Dear John and Sophia,

We were thrilled to have you both at our wedding. Your presence made the day even more special.

Thank you for sharing your fun fact about climbing Mount Everest – that's truly impressive!

Best wishes,
Tom and Mary (Bride and Groom)


In [15]:
messages = [{"role": "user","content": prompt}]
encoded_input = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
outputs = model.generate(encoded_input, max_length=300, do_sample=True)
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(decoded_output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] [INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:

{
  "name": "John Doe",
  "relationship": "Bride's cousin",
  "hometown": "New York, NY",
  "fun_fact": "Climbed Mount Everest in 2020",
  "attending_with": "Sophia Smith",
  "bride_groom_name": "Tom and Mary"
}

Use only the data provided in the JSON object above.

The senders of the letter is the bride and groom, Tom and Mary.
[/INST] [/INST] Dear John,
Thank you for joining us on our special day in New York. Your presence, as the bride's courageous cousin, added to the joy and happiness of the occasion. We were thrilled to hear about your amazing achievement of climbing Mount Everest in 2020.
Warm regards,
Tom and Mary, the Bride and Groom.


# Loading Data

In [16]:
import pandas as pd

ml_papers=pd.read_csv(os.getenv("DATASET"), header=0)
ml_papers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        420 non-null    object
 1   Description  415 non-null    object
 2   PaperURL     420 non-null    object
 3   TweetURL     416 non-null    object
 4   Abstract     414 non-null    object
dtypes: object(5)
memory usage: 16.5+ KB


In [17]:
def check_df(df):
    print("############# Shape #############")
    print(df.shape)
    print("############# Types #############")
    print(df.dtypes)
    print("############# NA #############")
    print(df.isnull().sum())
    print("############# Quantiles #############")
    numeric_columns=df.select_dtypes(include=['number']).columns
    # return values at the given quantile over requested axis
    print(df[numeric_columns].quantile([0,0.05,0.50,0.95,0.99], 1).T)
    

check_df(ml_papers)

############# Shape #############
(420, 5)
############# Types #############
Title          object
Description    object
PaperURL       object
TweetURL       object
Abstract       object
dtype: object
############# NA #############
Title          0
Description    5
PaperURL       0
TweetURL       4
Abstract       6
dtype: int64
############# Quantiles #############
     0.00  0.05  0.50  0.95  0.99
0     NaN   NaN   NaN   NaN   NaN
1     NaN   NaN   NaN   NaN   NaN
2     NaN   NaN   NaN   NaN   NaN
3     NaN   NaN   NaN   NaN   NaN
4     NaN   NaN   NaN   NaN   NaN
..    ...   ...   ...   ...   ...
415   NaN   NaN   NaN   NaN   NaN
416   NaN   NaN   NaN   NaN   NaN
417   NaN   NaN   NaN   NaN   NaN
418   NaN   NaN   NaN   NaN   NaN
419   NaN   NaN   NaN   NaN   NaN

[420 rows x 5 columns]


In [18]:
# remove rows with empty titles to descriptions
ml_papers=ml_papers.dropna(subset=["Title","Description"])

In [19]:
ml_papers.head()

Unnamed: 0,Title,Description,PaperURL,TweetURL,Abstract
0,Llemma,an LLM for mathematics which is based on conti...,https://arxiv.org/abs/2310.10631,https://x.com/zhangir_azerbay/status/171409802...,"We present Llemma, a large language model for ..."
1,LLMs for Software Engineering,a comprehensive survey of LLMs for software en...,https://arxiv.org/abs/2310.03533,https://x.com/omarsar0/status/1713940983199506...,This paper provides a survey of the emerging a...
2,Self-RAG,presents a new retrieval-augmented framework t...,https://arxiv.org/abs/2310.11511,https://x.com/AkariAsai/status/171511027707796...,"Despite their remarkable capabilities, large l..."
3,Retrieval-Augmentation for Long-form Question ...,explores retrieval-augmented language models o...,https://arxiv.org/abs/2310.12150,https://x.com/omarsar0/status/1714986431859282...,We present a study of retrieval-augmented lang...
4,GenBench,presents a framework for characterizing and un...,https://www.nature.com/articles/s42256-023-007...,https://x.com/AIatMeta/status/1715041427283902...,


# Pre-processing Data

We convert dataframe to list of dicts with Title and Description columns only. Furthermore, we will use SentenceTransformer to generate embeddings for the data and store it to the vector dataset.

In [20]:
ml_papers_dict=ml_papers.to_dict(orient="records")
ml_papers_dict[0]

{'Title': 'Llemma',
 'Description': 'an LLM for mathematics which is based on continued pretraining from Code Llama on the Proof-Pile-2 dataset; the dataset involves scientific paper, web data containing mathematics, and mathematical code; Llemma outperforms open base models and the unreleased Minerva on the MATH benchmark; the model is released, including dataset and code to replicate experiments.',
 'PaperURL': 'https://arxiv.org/abs/2310.10631',
 'TweetURL': 'https://x.com/zhangir_azerbay/status/1714098025956864031?s=20',
 'Abstract': 'We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finet

# Loading SentenceTransformer

In [31]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

encoder=SentenceTransformer("all-MiniLM-L12-v2")
encoder.max_seq_length=256

class Embed(EmbeddingFunction):
    def __call__(self, input: Documents)-> Embeddings:
        batch_embeddings=encoder.encode(input, show_progress_bar=True, device="cuda")
        return batch_embeddings.tolist()

embed=Embed()

# Initialize Vector DB 

In [23]:
import chromadb

name=f"ml-papers-nov-2023"
# initialize the chromadb directory, and client
client=chromadb.PersistentClient(path="./chromadb")

collection=client.get_or_create_collection(name=name)

# Generate Embeddings

We generate embeddings and index titles in batches.

In [28]:
import random
from tqdm import tqdm

batch_size=50

# loop through batches and generated +store embeddings
for i in tqdm(range(0, len(ml_papers_dict), batch_size)):
    i_end=min(i+batch_size, len(ml_papers_dict))
    batch=ml_papers_dict[i:i+batch_size]
    
    # replace title with "No Title" if empty string
    batch_titles=[str(paper["Title"]) if str(paper["Title"]) != "" else "No Title" for paper in batch]
    batch_ids=[str(sum(ord(c)+random.randint(1, 10000) for c in paper["Title"])) for paper in batch]
    batch_metadata=[dict(url=paper["PaperURL"], abstract=paper['Abstract']) for paper in batch]
    
    # generate embeddings
    batch_embeddings=encoder.encode(batch_titles)
    
    collection.upsert(
        ids=batch_ids,
        metadatas=batch_metadata,
        documents=batch_titles,
        embeddings=batch_embeddings.tolist()
    )  

  0%|          | 0/9 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 11%|█         | 1/9 [00:00<00:01,  7.88it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 22%|██▏       | 2/9 [00:00<00:00,  8.04it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 33%|███▎      | 3/9 [00:00<00:00,  8.31it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 44%|████▍     | 4/9 [00:00<00:00,  8.28it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 56%|█████▌    | 5/9 [00:00<00:00,  7.85it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 67%|██████▋   | 6/9 [00:00<00:00,  7.62it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 78%|███████▊  | 7/9 [00:00<00:00,  7.62it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

 89%|████████▉ | 8/9 [00:01<00:00,  7.50it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 9/9 [00:01<00:00,  8.18it/s]


# Testing the Retriever

In [32]:
collection=client.get_or_create_collection(name=name, embedding_function=embed)

retriever_results=collection.query(query_texts=["Software Engineering"], n_results=2)

print(retriever_results["documents"])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[['LLMs for Software Engineering', 'Communicative Agents for Software Development']]


# Inference

In [37]:
query = "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models"

# query for user query
results=collection.query(query_texts=[prompt], n_results=10)

# concatenate titles into a single string
short_titles='\n'.join(results['documents'][0])



prompt = f'''[INST]

Your main task is to generate 5 SUGGESTED_TITLES based for the PAPER_TITLE

You should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TILE.

PAPER_TITLE: {query}

SHORT_TITLES: {short_titles}

SUGGESTED_TITLES:

[/INST]
'''

encoded_input = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**encoded_input, max_length=2000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(decoded_output)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[INST]

Your main task is to generate 5 SUGGESTED_TITLES based for the PAPER_TITLE

You should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TILE.

PAPER_TITLE: S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

SHORT_TITLES: ChemCrow: Augmenting large-language models with chemistry tools
Emergent autonomous scientific research capabilities of large language models
MusicLM: Generating Music From Text
A Survey of Large Language Models
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Eight Things to Know about Large Language Models
A Watermark for Large Language Models
Augmented Language Models: a Survey
REPLUG: Retrieval-Augmented Black-Box Language Models
Crowd Workers Widely Use Large Language Models for Text Production Tasks

SUGGESTED_TITLES:

[/INST]
1. ScaleScore: Evaluating Large Language Models with S

# Acknowledge

* https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/notebooks/pe-rag.ipynb
* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main
* https://www.sbert.net/docs/quickstart.html
* https://www.kaggle.com/code/aisuko/semantic-search-in-publications
* https://www.kaggle.com/code/aisuko/titanic-question-with-tf-decision-forests
* https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py
* https://www.kaggle.com/code/aisuko/llm-prompt-recovery-with-gemma
