# **Building a Gemma Research Assistant**

# **1. Scientific Research Assistant with Graph**

## **1.1 Data Preprocessing**

In [2]:
# https://www.kaggle.com/code/matthewmaddock/nlp-arxiv-dataset-transformers-and-umap

# This takes about 1 minute.
import json
import pandas as pd

cols = ['id', 'title', 'abstract', 'categories']
data = []
file_name = '../data/arxiv-metadata-oai-snapshot.json'


# Each line of the snapshot is one JSON record; keep only the columns we need.
with open(file_name, encoding='utf-8') as f:
    for line in f:
        doc = json.loads(line)
        data.append([doc[col] for col in cols])

df_data = pd.DataFrame(data=data, columns=cols)

print(df_data.shape)

df_data.head()

(2455227, 4)


| | id | title | abstract | categories |
|---|---|---|---|---|
| 0 | 704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph |
| 1 | 704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG |
| 2 | 704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph |
| 3 | 704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO |
| 4 | 704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA |


In [5]:
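topics = ['cs.AI', 'cs.CV', 'cs.IR', 'cs.LG', 'cs.CL']

# `isin` compares the full `categories` string, so this keeps only papers
# tagged with exactly one of these topics. `.copy()` avoids pandas'
# SettingWithCopyWarning when new columns are assigned further down.
df_data = df_data[df_data['categories'].isin(topics)].copy()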

In [6]:
len(df_data)

109131
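
Because `isin` compares the whole `categories` string, a paper listed under several categories (for example `cs.LG stat.ML`) is dropped by the filter above. The plain-Python sketch below contrasts that exact match with a looser, token-level match. `is_exact_topic` and `has_any_topic` are hypothetical helper names used only for illustration; the notebook itself keeps the exact-match behaviour.

```python
topics = {'cs.AI', 'cs.CV', 'cs.IR', 'cs.LG', 'cs.CL'}


def is_exact_topic(categories: str) -> bool:
    """Mirror of `isin`: the whole string must equal one topic."""
    return categories in topics


def has_any_topic(categories: str) -> bool:
    """Hypothetical alternative: any space-separated tag is a topic."""
    return any(tag in topics for tag in categories.split())


samples = ['cs.LG', 'cs.LG stat.ML', 'math.CO cs.CG', 'hep-ph']
exact = [c for c in samples if is_exact_topic(c)]
loose = [c for c in samples if has_any_topic(c)]

print(exact)  # ['cs.LG']
print(loose)  # ['cs.LG', 'cs.LG stat.ML']
```

If the looser behaviour were wanted in pandas, the same helper could be applied with `df_data['categories'].apply(has_any_topic)`, at the cost of a larger corpus to embed.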

In [7]:
def clean_text(x):
    
    # Replace newline characters with a space
    new_text = x.replace("\n", " ")
    # Remove leading and trailing spaces
    new_text = new_text.strip()
    
    return new_text

df_data['title'] = df_data['title'].apply(clean_text)
df_data['abstract'] = df_data['abstract'].apply(clean_text)

df_data['prepared_text'] = df_data['title'] + ' \n ' + df_data['abstract']
df_data.head()

| | id | title | abstract | categories | prepared_text |
|---|---|---|---|---|---|
| 1266 | 704.1267 | Text Line Segmentation of Historical Documents... | There is a huge amount of historical documents... | cs.CV | Text Line Segmentation of Historical Documents... |
| 1273 | 704.1274 | Parametric Learning and Monte Carlo Optimization | This paper uncovers and explores the close rel... | cs.LG | Parametric Learning and Monte Carlo Optimizati... |
| 1393 | 704.1394 | Calculating Valid Domains for BDD-Based Intera... | In these notes we formally describe the functi... | cs.AI | Calculating Valid Domains for BDD-Based Intera... |
| 2009 | 704.201 | A study of structural properties on profiles HMMs | Motivation: Profile hidden Markov Models (pHMM... | cs.AI | A study of structural properties on profiles H... |
| 2667 | 704.2668 | Supervised Feature Selection via Dependence Es... | We introduce a framework for filtering feature... | cs.LG | Supervised Feature Selection via Dependence Es... |


In [10]:
from llama_index.core import Document

# One Document per paper; the arXiv id doubles as the document id.
arxiv_documents = [
    Document(text=text, doc_id=paper_id)
    for text, paper_id in zip(df_data['prepared_text'], df_data['id'])
]

## **1.2 Creating Index**

`VectorStoreIndex` is by far the most frequently used index type in LlamaIndex. It takes your Documents and splits them into Nodes, then creates a vector embedding of each node's text. But what is a vector embedding?

Vector embeddings turn the meaning of a piece of text into a list of numbers, a kind of mathematical sketch. Imagine every idea or concept in your text getting its own numerical fingerprint. This is handy because even if two snippets of text use different words, if they share the same idea, their embeddings will be close neighbors in the vector space. The models that produce these vectors are known as embedding models.
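
To make "close neighbors" concrete, here is a minimal sketch using cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made up for illustration (real models produce vectors with hundreds of dimensions), and `cosine_similarity` is a hand-written helper rather than a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three snippets (assumption: illustrative values only).
emb_graph_learning = [0.9, 0.1, 0.0]    # "learning on graphs"
emb_gnn = [0.8, 0.2, 0.1]               # "graph neural networks"
emb_ocr = [0.0, 0.1, 0.9]               # "text line segmentation"

sim_related = cosine_similarity(emb_graph_learning, emb_gnn)
sim_unrelated = cosine_similarity(emb_graph_learning, emb_ocr)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two graph-related snippets score much higher than the unrelated pair, which is the property a vector index relies on when it retrieves the nearest nodes for a query.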

Choosing the right embedding model is crucial. It's like picking the right artist to paint your portrait; you want the one who captures you best. A great place to start is the MTEB leaderboard, where the crème de la crème of embedding models are ranked. Because our dataset is fairly large (109,131 papers after filtering), model size matters: we don't want to wait all day for the model to compute every embedding. When I last checked, `BAAI/bge-small-en-v1.5` was leading the pack for its size, so it is a solid choice if you're diving into the world of text embeddings.


In [17]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
import chromadb
import torch
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext


# Create embed model
device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_folder="../models", device=device_type)



OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 1.88 MiB is free. Process 2654 has 478.00 MiB memory in use. Process 726617 has 22.81 GiB memory in use. Including non-PyTorch memory, this process has 390.00 MiB memory in use. Of the allocated memory 174.43 MiB is allocated by PyTorch, and 13.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
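
The traceback above reports that another process was already holding 22.81 GiB of the GPU's 23.68 GiB, so `torch.cuda.is_available()` returned `True` even though almost no memory was free. One way to guard against this is to check free memory before picking a device. The sketch below is an assumption-laden helper, not part of the original notebook: `pick_device` is a hypothetical name and the 2 GiB threshold is an arbitrary choice.

```python
def pick_device(min_free_bytes: int = 2 * 1024**3) -> str:
    """Return "cuda" only if a GPU is visible and has enough free memory."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing to run on the GPU anyway.
        return "cpu"

    if not torch.cuda.is_available():
        return "cpu"

    try:
        # mem_get_info() returns (free_bytes, total_bytes) for the current device.
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
    except RuntimeError:
        # Initialising CUDA can itself fail when the GPU is completely full.
        return "cpu"

    return "cuda" if free_bytes >= min_free_bytes else "cpu"


device_type = pick_device()
print(device_type)
```

The returned string can be passed as `device=` when constructing the embedding model; on CPU the embedding pass will be slower but should still complete.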

Once the embedding model loads successfully, we need somewhere to store all of the embeddings it produces, and that's what a vector store is for. There are many to choose from; in this tutorial, I use the Chroma vector store.

In [16]:
chroma_client = chromadb.PersistentClient(path="../DB")
chroma_collection = chroma_client.get_or_create_collection("gemma_assistant")


# Create vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [None]:
index = VectorStoreIndex.from_documents(
    arxiv_documents, storage_context=storage_context, embed_model=embed_model, show_progress=True
)
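
Conceptually, `from_documents` splits each document into nodes, embeds every node, and writes the vectors to the store. The splitting step can be pictured with the toy word-window splitter below. It is a simplified stand-in, not LlamaIndex's actual node parser, and `split_into_chunks`, `chunk_size`, and `overlap` are illustrative names and values.

```python
def split_into_chunks(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = "one two three four five six seven eight nine ten eleven twelve"
chunks = split_into_chunks(text, chunk_size=5, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means neighbouring chunks share a few words, so a sentence that straddles a boundary is not lost to either chunk.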

# **2. Basic Data Science Assistant**

# **3. Python Code Assistant**

In [None]:
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.tools import QueryEngineTool

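# NOTE: `list_query_engine` and `vector_query_engine` are not defined in this
# notebook. They are assumed to be query engines built beforehand (for example,
# one from a summary-style index and one from the vector index above).
list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description="Useful for summarization questions related to the data source",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific context related to the data source",
)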

query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        list_tool,
        vector_tool,
    ],
)
query_engine.query("<query>")
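
The router's job is to read each tool's description and send the query to the tool that fits best. `RouterQueryEngine` delegates that choice to a selector; the sketch below swaps in naive keyword overlap purely to illustrate the dispatch pattern. `ToyTool`, `route`, and the lambda "engines" are hypothetical names, not LlamaIndex APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyTool:
    name: str
    description: str
    run: Callable[[str], str]


def route(query: str, tools: List[ToyTool]) -> ToyTool:
    """Pick the tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(tool: ToyTool) -> int:
        return len(query_words & set(tool.description.lower().split()))

    # On a tie, max() keeps the first tool in the list.
    return max(tools, key=overlap)


tools = [
    ToyTool("list", "summarization questions about the data source",
            lambda q: f"[summary] {q}"),
    ToyTool("vector", "retrieving specific context from the data source",
            lambda q: f"[retrieval] {q}"),
]

query = "give me a summarization of the corpus"
chosen = route(query, tools)
print(chosen.name, "->", chosen.run(query))
```

Keyword overlap is brittle, which is why a real router leans on a model to interpret the descriptions; the takeaway is only that clear, distinct tool descriptions drive the routing decision.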