Hello Kagglers! RAG, which stands for Retrieval Augmented Generation, is a technique used to enhance the knowledge base of large language models (LLMs) through the integration of external information. By doing so, LLMs are equipped to generate more context-aware responses and reduce instances of hallucination. This guide aims to offer an in-depth exploration of the RAG process.

First, let's look at why we might need RAG.

In [10]:
from transformers import GemmaTokenizer, GemmaForCausalLM
import torch
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model= GemmaForCausalLM.from_pretrained(
    "google/gemma-2b-it", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-2b-it")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [16]:
# Defining the prompt 
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

model = model.to("cuda")

In [17]:
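def get_response(query, input=None):
    # Fill the prompt template; the response slot stays empty so the model completes it
    inputs = tokenizer(
    [
        prompt.format(
            query, # instruction
            input or "", # input: optional extra context (empty if not provided)
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs,  max_new_tokens = 2048, use_cache = True)
    answer = tokenizer.batch_decode(outputs)[0]
    # Keep only the text generated after the final "Response:" marker
    return answer.split("Response:\n")[-1]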

In [18]:
print(get_response("What are the contributions of the 'Attention is all you need' paper?"))

The 'Attention is all you need' paper introduced the concept of self-attention, a mechanism that allows each element in the input sequence to attend to all other elements, regardless of their distance. This has significantly improved the performance of machine translation and other natural language processing tasks.<eos>


It turns out the Gemma model really hit the mark with its answer this time. Chances are, it's familiar with the super popular paper ["Attention is all you need"](https://arxiv.org/abs/1706.03762), which likely showed up in its training data. Now, let's switch gears and check out something newer on the scene: the paper titled ["Genie: Generative Interactive Environments"](https://arxiv.org/abs/2402.15391).

In [19]:
print(get_response("What are the contributions of the 'Genie: Generative Interactive Environments' paper?"))

The 'Genie: Generative Interactive Environments' paper focuses on the development of a novel generative framework called Genie that can create interactive environments that are both creative and engaging. The paper explores the following contributions of Genie:

* **Generative capabilities:** Genie can generate diverse and realistic interactive environments, including physical, virtual, and mixed-reality environments.
* **Interactive design:** Genie's interactive design process is based on the principles of human-centered design and focuses on creating environments that are intuitive and easy to use.
* **Multi-modal integration:** Genie can integrate with various modalities, including visual, auditory, and haptic feedback, to create immersive and multi-sensory experiences.
* **Adaptive behavior:** Genie can adapt its behavior to the individual user, providing personalized and interactive experiences.
* **Social interaction:** Genie can facilitate social interaction through the integrat

Woah! Looks like the model just hallucinated a bunch of false information instead of something helpful. This probably happened because Gemma isn't familiar with the "Genie" paper. It seems like this paper didn't make it into its pre-training data.


So, what's the fix here? You might think about retraining Gemma from scratch with the latest info, but let's be real—that's a no-go. Training these massive language models like Gemma from the ground up costs a fortune in time and money! And that's exactly where RAG comes to the rescue. The beauty of RAG technology is that it saves us from having to retrain the whole massive model every single time we need it to learn something new. Instead, we can just hook up the relevant knowledge bases as extra input for the model, boosting the accuracy of its responses without breaking the bank.

# What is Retrieval Augmented Generation?

RAG, short for Retrieval Augmented Generation, enhances the capabilities of large language models (LLMs) by incorporating a retrieval step into the process. When tasked with answering a question or generating text, RAG first seeks out relevant information from a vast repository of knowledge, which could include an array of documents and web pages. This approach allows the model to refine its generated responses by integrating this retrieved information, offering a more informed output that extends beyond its pre-trained knowledge base.


![image.png](./images/1_bo0JwTdru5quxDiPFa1TvA-ezgif.com-webp-to-png-converter.png)

Image from [this](https://ai.plainenglish.io/a-brief-introduction-to-retrieval-augmented-generation-rag-b7eb70982891) amazing blog post.




Generally, a RAG pipeline has three main steps:
- Indexing: The indexing process involves cleaning raw data and converting it to plain text from formats like PDF and HTML. This text is then divided into smaller pieces and converted into vectors. Finally, an index stores these pieces and their vectors for efficient searching.

- Retrieval: To find information relevant to the user's query, the system performs a vector search or a hybrid search within a vector database.

- Generation: When a user poses a query, RAG takes that along with the context it retrieved and feeds them both into the large language model (LLM). This process enables the LLM to produce a more informed and accurate response by considering both the user's original question and the additional information sourced from the knowledge database.
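
The generation step above can be pictured as plain string assembly: the retrieved passages are pasted into the prompt ahead of the question. The sketch below is only an illustration; the function name `build_rag_prompt`, the prompt wording, and the placeholder passages are assumptions, not the template this notebook uses.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved passages and the user question into one prompt string."""
    # Number the passages so the model (and the reader) can tell them apart
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages standing in for whatever the retriever returns
example_prompt = build_rag_prompt(
    "What is RAG?",
    ["first retrieved passage", "second retrieved passage"],
)
print(example_prompt)
```

The LLM then receives this combined prompt instead of the bare question, which is how the retrieved knowledge reaches the model without retraining it.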

Before diving into each part, let's define what we want to make!

## Problem formulation

In this tutorial, I want to design a chatbot that can understand and explain basic concepts in data science, machine learning, and deep learning.

Beyond that, I want it to act as my personal research assistant, with the ability to:
- Find the latest papers and give me a short overview of them.
- Explore and list all the papers on a certain topic.
- Suggest concepts that I could explore to understand a specific paper.


# **Step 1: Indexing - Creating Dataset**

Let's start our exploration by zeroing in on the Indexing bit of RAG. Think of an Index as a cleverly organized digital filing cabinet, stuffed with Documents that a language model can sift through for answers. In this tutorial, we're going to use the `VectorStoreIndex` from LlamaIndex.

## **1.1 Getting data from Arxiv**

But first we need to prepare some data. In this notebook, I will use the [arxiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv/data)!

In [1]:
# All Arxiv category codes
# Source: https://www.kaggle.com/code/artgor/arxiv-metadata-exploration

# https://arxiv.org/category_taxonomy
# https://info.arxiv.org/help/api/user-manual.html#subject_classifications


category_map = {
# These created errors when mapping categories to descriptions
'acc-phys': 'Accelerator Physics',
'adap-org': 'Not available',
'q-bio': 'Not available',
'cond-mat': 'Not available',
'chao-dyn': 'Not available',
'patt-sol': 'Not available',
'dg-ga': 'Not available',
'solv-int': 'Not available',
'bayes-an': 'Not available',
'comp-gas': 'Not available',
'alg-geom': 'Not available',
'funct-an': 'Not available',
'q-alg': 'Not available',
'ao-sci': 'Not available',
'atom-ph': 'Atomic Physics',
'chem-ph': 'Chemical Physics',
'plasm-ph': 'Plasma Physics',
'mtrl-th': 'Not available',
'cmp-lg': 'Not available',
'supr-con': 'Not available',
###

# Added
'econ.GN': 'General Economics', 
'econ.TH': 'Theoretical Economics', 
'eess.SY': 'Systems and Control', 
    
'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'astro-ph.HE': 'High Energy Astrophysical Phenomena',
'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
'astro-ph.SR': 'Solar and Stellar Astrophysics',
'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
'cond-mat.mtrl-sci': 'Materials Science',
'cond-mat.other': 'Other Condensed Matter',
'cond-mat.quant-gas': 'Quantum Gases',
'cond-mat.soft': 'Soft Condensed Matter',
'cond-mat.stat-mech': 'Statistical Mechanics',
'cond-mat.str-el': 'Strongly Correlated Electrons',
'cond-mat.supr-con': 'Superconductivity',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
'econ.EM': 'Econometrics',             
'eess.AS': 'Audio and Speech Processing',
'eess.IV': 'Image and Video Processing',
'eess.SP': 'Signal Processing',               
'gr-qc': 'General Relativity and Quantum Cosmology',
'hep-ex': 'High Energy Physics - Experiment',
'hep-lat': 'High Energy Physics - Lattice',
'hep-ph': 'High Energy Physics - Phenomenology',
'hep-th': 'High Energy Physics - Theory',
'math.AC': 'Commutative Algebra',
'math.AG': 'Algebraic Geometry',
'math.AP': 'Analysis of PDEs',
'math.AT': 'Algebraic Topology',
'math.CA': 'Classical Analysis and ODEs',
'math.CO': 'Combinatorics',
'math.CT': 'Category Theory',
'math.CV': 'Complex Variables',
'math.DG': 'Differential Geometry',
'math.DS': 'Dynamical Systems',
'math.FA': 'Functional Analysis',
'math.GM': 'General Mathematics',
'math.GN': 'General Topology',
'math.GR': 'Group Theory',
'math.GT': 'Geometric Topology',
'math.HO': 'History and Overview',
'math.IT': 'Information Theory',
'math.KT': 'K-Theory and Homology',
'math.LO': 'Logic',
'math.MG': 'Metric Geometry',
'math.MP': 'Mathematical Physics',
'math.NA': 'Numerical Analysis',
'math.NT': 'Number Theory',
'math.OA': 'Operator Algebras',
'math.OC': 'Optimization and Control',
'math.PR': 'Probability',
'math.QA': 'Quantum Algebra',
'math.RA': 'Rings and Algebras',
'math.RT': 'Representation Theory',
'math.SG': 'Symplectic Geometry',
'math.SP': 'Spectral Theory',
'math.ST': 'Statistics Theory',
'math-ph': 'Mathematical Physics',
'nlin.AO': 'Adaptation and Self-Organizing Systems',
'nlin.CD': 'Chaotic Dynamics',
'nlin.CG': 'Cellular Automata and Lattice Gases',
'nlin.PS': 'Pattern Formation and Solitons',
'nlin.SI': 'Exactly Solvable and Integrable Systems',
'nucl-ex': 'Nuclear Experiment',
'nucl-th': 'Nuclear Theory',
'physics.acc-ph': 'Accelerator Physics',
'physics.ao-ph': 'Atmospheric and Oceanic Physics',
'physics.app-ph': 'Applied Physics',
'physics.atm-clus': 'Atomic and Molecular Clusters',
'physics.atom-ph': 'Atomic Physics',
'physics.bio-ph': 'Biological Physics',
'physics.chem-ph': 'Chemical Physics',
'physics.class-ph': 'Classical Physics',
'physics.comp-ph': 'Computational Physics',
'physics.data-an': 'Data Analysis, Statistics and Probability',
'physics.ed-ph': 'Physics Education',
'physics.flu-dyn': 'Fluid Dynamics',
'physics.gen-ph': 'General Physics',
'physics.geo-ph': 'Geophysics',
'physics.hist-ph': 'History and Philosophy of Physics',
'physics.ins-det': 'Instrumentation and Detectors',
'physics.med-ph': 'Medical Physics',
'physics.optics': 'Optics',
'physics.plasm-ph': 'Plasma Physics',
'physics.pop-ph': 'Popular Physics',
'physics.soc-ph': 'Physics and Society',
'physics.space-ph': 'Space Physics',
'q-bio.BM': 'Biomolecules',
'q-bio.CB': 'Cell Behavior',
'q-bio.GN': 'Genomics',
'q-bio.MN': 'Molecular Networks',
'q-bio.NC': 'Neurons and Cognition',
'q-bio.OT': 'Other Quantitative Biology',
'q-bio.PE': 'Populations and Evolution',
'q-bio.QM': 'Quantitative Methods',
'q-bio.SC': 'Subcellular Processes',
'q-bio.TO': 'Tissues and Organs',
'q-fin.CP': 'Computational Finance',
'q-fin.EC': 'Economics',
'q-fin.GN': 'General Finance',
'q-fin.MF': 'Mathematical Finance',
'q-fin.PM': 'Portfolio Management',
'q-fin.PR': 'Pricing of Securities',
'q-fin.RM': 'Risk Management',
'q-fin.ST': 'Statistical Finance',
'q-fin.TR': 'Trading and Market Microstructure',
'quant-ph': 'Quantum Physics',
'stat.AP': 'Applications',
'stat.CO': 'Computation',
'stat.ME': 'Methodology',
'stat.ML': 'Machine Learning',
'stat.OT': 'Other Statistics',
'stat.TH': 'Statistics Theory'
}


In [2]:
# https://www.kaggle.com/code/matthewmaddock/nlp-arxiv-dataset-transformers-and-umap

# This takes about 1 minute.
import json
import pandas as pd

cols = ['id', 'title', 'abstract', 'categories']
data = []
file_name = './input/arxiv-metadata-oai-snapshot.json'


with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['id'], doc['title'], doc['abstract'], doc['categories']]
        data.append(lst)

df_data = pd.DataFrame(data=data, columns=cols)

print(df_data.shape)

df_data.head()

(2436004, 4)


Unnamed: 0,id,title,abstract,categories
0,704.0001,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph
1,704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG
2,704.0003,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph
3,704.0004,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO
4,704.0005,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA


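Let's keep only the papers in data science categories. Because `isin` compares the whole `categories` string, this keeps papers listed under exactly one of these codes and drops cross-listed papers.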

In [3]:
topics = ['cs.AI', 'cs.CV', 'cs.IR', 'cs.LG', 'cs.CL']

filtered_data = df_data[df_data['categories'].isin(topics)]

In [4]:
len(filtered_data)

106589
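
If cross-listed papers (for example `cs.LG stat.ML`) should be kept as well, one option is to test each category code separately. This is a minimal sketch on a few example strings; the names `wanted_topics` and `has_topic` are hypothetical and not part of the original notebook.

```python
wanted_topics = {"cs.AI", "cs.CV", "cs.IR", "cs.LG", "cs.CL"}


def has_topic(categories, wanted=wanted_topics):
    """Return True if any space-separated category code is in `wanted`."""
    return any(code in wanted for code in categories.split())


# Example `categories` strings in the same format as the arXiv metadata
examples = ["cs.LG", "cs.LG stat.ML", "math.CO cs.CG", "hep-ph"]
kept = [c for c in examples if has_topic(c)]
print(kept)  # ['cs.LG', 'cs.LG stat.ML']
```

On the DataFrame this would be `df_data[df_data['categories'].apply(has_topic)]`. It keeps more rows than the exact-match filter, so the counts shown in this notebook would no longer apply.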

In [5]:
df_data = filtered_data.copy()  # copy so the column assignments below do not trigger SettingWithCopyWarning

Data preprocessing

In [6]:
# https://www.kaggle.com/code/vbookshelf/part-1-build-an-arxiv-rag-search-system-w-faiss

def get_cat_text(x):
    
    # Put the codes into a list
    cat_list = x.split(' ')
    
    # Look up each code's description. Skip codes with no description
    # available, and codes missing from category_map.
    cat_names = [category_map.get(item, 'Not available') for item in cat_list]
    cat_names = [name for name in cat_names if name != 'Not available']
    
    # Join the descriptions with commas
    return ', '.join(cat_names)
    

df_data['cat_text'] = df_data['categories'].apply(get_cat_text)

def clean_text(x):
    
    # Replace newline characters with a space
    new_text = x.replace("\n", " ")
    # Remove leading and trailing spaces
    new_text = new_text.strip()
    
    return new_text

df_data['title'] = df_data['title'].apply(clean_text)
df_data['abstract'] = df_data['abstract'].apply(clean_text)

df_data['prepared_text'] = df_data['title'] + ' \n ' + df_data['abstract']
df_data.head()

Unnamed: 0,id,title,abstract,categories,cat_text,prepared_text
1266,704.1267,Text Line Segmentation of Historical Documents...,There is a huge amount of historical documents...,cs.CV,Computer Vision and Pattern Recognition,Text Line Segmentation of Historical Documents...
1273,704.1274,Parametric Learning and Monte Carlo Optimization,This paper uncovers and explores the close rel...,cs.LG,Machine Learning,Parametric Learning and Monte Carlo Optimizati...
1393,704.1394,Calculating Valid Domains for BDD-Based Intera...,In these notes we formally describe the functi...,cs.AI,Artificial Intelligence,Calculating Valid Domains for BDD-Based Intera...
2009,704.201,A study of structural properties on profiles HMMs,Motivation: Profile hidden Markov Models (pHMM...,cs.AI,Artificial Intelligence,A study of structural properties on profiles H...
2667,704.2668,Supervised Feature Selection via Dependence Es...,We introduce a framework for filtering feature...,cs.LG,Machine Learning,Supervised Feature Selection via Dependence Es...


In [7]:
from llama_index.core import Document

arxiv_documents = [Document(text=item) for item in list(df_data['prepared_text'])]

In [8]:
arxiv_df = df_data

## **1.2 Getting data from wikipedia**

In [8]:
!pip install -q -U wikipedia-api

In [9]:
import re

# Pre-compile the regular expression pattern for better performance
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def remove_braces_and_content(text):
    """Remove all occurrences of curly braces and their content from the given text"""
    return BRACES_PATTERN.sub('', text)

def clean_string(input_string):
    """Clean the input string."""
    
    # Remove extra spaces by splitting the string by spaces and joining back together
    cleaned_string = ' '.join(input_string.split())
    
    # Remove consecutive carriage return characters until there are no more consecutive occurrences
    cleaned_string = re.sub(r'\r+', '\r', cleaned_string)
    
    # Remove all occurrences of curly braces and their content from the cleaned string
    cleaned_string = remove_braces_and_content(cleaned_string)
    
    # Return the cleaned string
    return cleaned_string

In [10]:
def extract_wikipedia_pages(wiki_wiki, category_name):
    """Extract all references from a category on Wikipedia"""
    
    # Get the Wikipedia page corresponding to the provided category name
    category = wiki_wiki.page("Category:" + category_name)
    
    # Initialize an empty list to store page titles
    pages = []
    
    # Check if the category exists
    if category.exists():
        # Iterate through each article in the category and append its title to the list
        for article in category.categorymembers.values():
            pages.append(article.title)
    
    # Return the list of page titles
    return pages

In [11]:
import wikipediaapi
from tqdm import tqdm

def get_wikipedia_pages(categories):
    """Retrieve Wikipedia pages from a list of categories and extract their content"""
    
    # Create a Wikipedia object
    wiki_wiki = wikipediaapi.Wikipedia('Kaggle Data Science Assistant with Gemma', 'en')
    
    # Initialize lists to store explored categories and Wikipedia pages
    explored_categories = []
    wikipedia_pages = []

    # Iterate through each category
    print("- Processing Wikipedia categories:")
    for category_name in categories:
        print(f"\tExploring {category_name} on Wikipedia")
        
        # Extract Wikipedia pages from the category and extend the list
        wikipedia_pages.extend(extract_wikipedia_pages(wiki_wiki, category_name))
        
        # Add the explored category to the list
        explored_categories.append(category_name)

    # Extract subcategories and remove duplicate categories
    categories_to_explore = [item.replace("Category:", "") for item in wikipedia_pages if "Category:" in item]
    wikipedia_pages = list(set([item for item in wikipedia_pages if "Category:" not in item]))
    
    # Explore the first level of subcategories
    while categories_to_explore:
        category_name = categories_to_explore.pop()
        print(f"\tExploring {category_name} on Wikipedia")
        
        # Extract more references from the subcategory
        more_refs = extract_wikipedia_pages(wiki_wiki, category_name)

        # Iterate through the references
        for ref in more_refs:
            # Check if the reference is a category
            if "Category:" in ref:
                new_category = ref.replace("Category:", "")
                # Record the nested subcategory; it is not queued, so deeper levels are not crawled
                if new_category not in explored_categories:
                    explored_categories.append(new_category)
            else:
                # Add the reference to the Wikipedia pages list
                if ref not in wikipedia_pages:
                    wikipedia_pages.append(ref)

    # Initialize a list to store extracted texts
    extracted_texts = []
    
    # Iterate through each Wikipedia page
    print("- Processing Wikipedia pages:")
    for page_title in tqdm(wikipedia_pages):
        # Get the Wikipedia page
        page = wiki_wiki.page(page_title)
        
        # Check if the page summary does not contain certain keywords
        if "Biden" not in page.summary and "Trump" not in page.summary:
            # Append the page title and summary to the extracted texts list
            if len(page.summary) > len(page.title):
                extracted_texts.append(page.title + " : " + clean_string(page.summary))
            
            # Iterate through the sections in the page
            for section in page.sections:
                # Append the page title and section text to the extracted texts list
                if len(section.text) > len(page.title):
                    extracted_texts.append(page.title + " : " + clean_string(section.text))
                    
    # Return the extracted texts
    return extracted_texts

In [12]:
categories = ["Machine_learning", "Data_science", "Statistics", "Deep_learning", "Artificial_intelligence"]
extracted_texts = get_wikipedia_pages(categories)
print("Found", len(extracted_texts), "Wikipedia text passages")

- Processing Wikipedia categories:
	Exploring Machine_learning on Wikipedia



In [13]:
wiki_documents = [Document(text=text) for text in extracted_texts]


Let's look at the lengths of the extracted texts to help choose a chunk size.

In [2]:
import numpy as np

wiki_lengths = np.array([len(text) for text in extracted_texts])
print("Mean Length:",  wiki_lengths.mean())
print("Max Length:", wiki_lengths.max())
print("Min Length:", wiki_lengths.min())


In [None]:
for document in arxiv_documents[:3]:  # preview a few documents; printing all of them would flood the output
    print(document.get_content())

In [52]:
arxiv_length = np.array([len(document.get_content()) for document in arxiv_documents])
print("Mean Length:",  arxiv_length.mean())
print("Max Length:", arxiv_length.max())
print("Min Length:", arxiv_length.min())
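
Length statistics like these matter because long texts are usually split into smaller, overlapping chunks before embedding. Below is a minimal character-based sketch of that idea. The function name `chunk_text` and the values 512 and 64 are assumptions for illustration; LlamaIndex's own node parsers work differently (for example, they may count tokens rather than characters).

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears in full in one of the two neighbouring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the previous window has already reached the end of the text
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(1000))  # 1000 characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [512, 512, 104]
```

A smaller chunk size gives more precise matches but less context per chunk, so the mean and maximum lengths above are a reasonable starting point for choosing it.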

## **1.3 Creating Index with `VectorStoreIndex`**

The `VectorStoreIndex` is by far the most frequently used type of Index in LlamaIndex. This class takes your Documents and splits them up into Nodes. Then, it creates vector embeddings of the text of every node. But what is a vector embedding?

Vector embeddings are like turning the essence of your words into a mathematical sketch. Imagine every idea or concept in your text getting its unique numerical fingerprint. This is handy because even if two snippets of text use different words, if they're sharing the same idea, their numerical sketches—or embeddings—will be close neighbors in the numerical space. This magic is done using tools known as embedding models.
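
To make "close neighbors" concrete, here is a small sketch using cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made-up toy values, not the output of any real embedding model (real models produce hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": two related ideas and one unrelated idea
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```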

Choosing the right embedding model is crucial. It's like picking the right artist to paint your portrait; you want the one who captures you best. A great place to start is the MTEB leaderboard, where the crème de la crème of embedding models are ranked. Because our dataset is quite large, model size matters: we don't want to wait all day for the model to compute all the vector embeddings. When I last checked, the `BAAI/bge-small-en-v1.5` model was leading the pack, especially considering its size. It could be a solid choice if you're diving into the world of text embeddings.


In [20]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb
import torch
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext


# Create embed model
device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_folder="./models", device=device_type)

Great! Now we need somewhere to store all of the embeddings extracted by the model, which is why we need a `vector store`. There are many to choose from; in this tutorial, I will use the `chroma` vector store.

In [21]:
chroma_client = chromadb.PersistentClient(path="./DB")
chroma_collection = chroma_client.get_or_create_collection("demo_arxiv")


# Create vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [11]:
index = VectorStoreIndex.from_documents(
    arxiv_documents, storage_context=storage_context, embed_model=embed_model, show_progress=True
)

Parsing nodes: 100%|██████████| 106589/106589 [00:26<00:00, 4022.57it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:04<00:00, 431.17it/s]

In [None]:
index = VectorStoreIndex.from_documents(
    wiki_documents, storage_context=storage_context, embed_model=embed_model, show_progress=True
)

Fantastic! We've successfully created a vector store for our data, laying down a solid foundation. To enhance this stage further, we could explore additional techniques like data preprocessing, text chunking, and node parsing. These methods can refine our data's quality and structure, potentially boosting our system's performance. However, to keep things straightforward and focused, we'll save these advanced topics for another time.

## **1.4 Loading from vector store**

Imagine you're executing this from a different script; there's no need to go through the hassle of recalculating the embeddings for all the documents again. You can simply load them up and dive straight into the task at hand.

In [1]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import torch

device_type = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_folder="./models", device=device_type) # must be the same as the previous stage

chroma_client = chromadb.PersistentClient(path="./DB")
chroma_collection = chroma_client.get_or_create_collection("demo_arxiv")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# load the vectorstore
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(vector_store, storage_context=storage_context, embed_model=embed_model)


Now, it's time to pivot to the next crucial phase: Retrieval.

# **Step 2: Retrieval**

## **2.1 Basic Retrieval**

Retrieval relies on similarity search in the vector database. First, the `embedding model` turns the user's query into a vector embedding; this must be the same model used during indexing, otherwise the query and document vectors are not comparable. `VectorStoreIndex` then ranks the stored embeddings by their semantic similarity to the query and returns the `k` best matches, where `k` is set with the `similarity_top_k` parameter. This approach is known as "top-k semantic retrieval".
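
The ranking step can be sketched in a few lines of plain Python. This is a toy illustration with made-up two-dimensional vectors and hypothetical document ids; a real vector store uses the embedding model's vectors and typically a faster approximate search.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (score, doc_id) pairs most similar to the query, best first."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)


# Toy index: document id -> precomputed embedding
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}

results = top_k([1.0, 0.1], doc_vecs, k=2)
print([doc_id for _, doc_id in results])  # ['doc_a', 'doc_b']
```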


In [2]:
retriever = index.as_retriever(
    similarity_top_k = 5, 
)

In [3]:
for res in retriever.retrieve("How is linear regression different from logistic regression?"):
    print(res.text)
    print("=============")

Linear discriminant analysis : Discriminant function analysis is very similar to logistic regression, and both can be used to answer the same research questions. Logistic regression does not have as many assumptions and restrictions as discriminant analysis. However, when discriminant analysis’ assumptions are met, it is more powerful than logistic regression. Unlike logistic regression, discriminant analysis can be used with small sample sizes. It has been shown that when sample sizes are equal, and homogeneity of variance/covariance holds, discriminant analysis is more accurate. Despite all these advantages, logistic regression has none-the-less become the common choice, since the assumptions of discriminant analysis are rarely met.
An Experiment on Feature Selection using Logistic Regression 
 In supervised machine learning, feature selection plays a very important role by potentially enhancing explainability and performance as measured by computing time and accuracy-related metrics

In [4]:
# len(chroma_collection.get()['documents'])

## **2.2 Reranking**

When a retriever pulls information from the vector store, it's a bit like casting a wide net – you end up with a lot of catches, but not all of them are the fish you're after. Some pieces of context can be way off the mark, leading us down the wrong path. That's where reranking comes into play. Think of reranking as a second round of scrutiny, a fine-tuning of sorts. After the initial haul from the vector search, reranking steps in to sift through the catch, reorganizing the order or ranking of the items (in this case, the documents we've retrieved) based on more specific criteria. It's like making sure the best, most relevant pieces of information are right at the top, ready for us to use. This extra step helps ensure that what we're working with is as relevant and useful as possible.

But how are rerankers different from our initial retriever?

![Bi-Encoder vs Cross-Encoder.png](./images/Bi-Encoder%20vs%20Cross-Encoder.png)

The conventional embedding model adheres to the Bi-Encoder paradigm, wherein embeddings for source documents are precomputed. During the query phase, the model generates an embedding for the user's query and then calculates the Cosine Similarity score across our database to identify the most relevant documents.

A reranker follows the Cross-Encoder paradigm instead: the query and a source document are fed into the model together, so the model can score how well the two match. Nothing can be precomputed, since every query-document pair needs its own forward pass, which makes this approach considerably slower. The payoff is a substantial potential gain in accuracy. For that reason, reranking is applied only to the top documents retrieved by the Bi-Encoder, which balances efficiency and precision.
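
To make the two-stage idea concrete, here is a minimal sketch in plain Python. Everything in it is a toy stand-in chosen for illustration: `embed` is a bag-of-words counter rather than a real Bi-Encoder, and `joint_score` is a simple bigram-overlap function standing in for a Cross-Encoder. Only the control flow mirrors the real pipeline: score cheaply against precomputed vectors, then rescore a short list with the slower joint scorer.

```python
import math
from collections import Counter


def embed(text):
    """Toy Bi-Encoder stand-in: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def joint_score(query, doc):
    """Toy Cross-Encoder stand-in: reads query and document together.

    It rewards query bigrams that appear verbatim in the document.
    """
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / max(len(q_bigrams), 1)


def retrieve_then_rerank(query, docs, doc_vectors, top_k=3, top_n=1):
    q_vec = embed(query)
    # Stage 1: cheap similarity against precomputed document vectors.
    candidates = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True
    )[:top_k]
    # Stage 2: slower joint scoring, applied only to the shortlist.
    reranked = sorted(candidates, key=lambda i: joint_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_n]]


docs = [
    "regression logistic linear notes",
    "linear regression fits a line to data",
    "diffusion models generate video",
]
doc_vectors = [embed(d) for d in docs]  # computed once, ahead of query time
print(retrieve_then_rerank("linear regression", docs, doc_vectors, top_k=2, top_n=1))
```

With these toy documents, the first stage ranks the keyword-heavy note first, and the joint scorer then promotes the passage that actually contains the phrase "linear regression".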

**Reranking Cheatsheet**: Here is a useful reranking cheatsheet, originally in this [tweet](https://twitter.com/bclavie/status/1765312881120153659/photo/1). Thanks [@bclavie](https://twitter.com/bclavie)

![GH-ms_HWcAEYWou.jpg](./images/GH-ms_HWcAEYWou.jpg)

In [5]:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank_postprocessor = SentenceTransformerRerank(
    model='mixedbread-ai/mxbai-rerank-xsmall-v1',
    top_n=2,  # number of nodes kept after re-ranking
    keep_retrieval_score=True
)

In [6]:
from llama_index.core import Settings

# re-define our query engine
Settings.llm = None # We will touch this in the next section

query_engine = index.as_query_engine(
    similarity_top_k=10,  # Number of nodes before re-ranking
    node_postprocessors=[rerank_postprocessor],
)

LLM is explicitly disabled. Using MockLLM.


In [7]:
print(query_engine.query("What is the paper Tune-A-Video about?").response)

Context information is below.
---------------------
retrieval_score: 0.4866113187679043

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for   Text-to-Video Generation 
 To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting$\unicode{x2014}$One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tu

### **ColBERT**



Another intriguing approach to retrieval is the ColBERT method, which offers a nuanced alternative to the dense embedding strategies discussed previously. While dense retrieval has its merits, various studies suggest it may not always be the ideal choice depending on the specific requirements of your project. This is where ColBERT enters the picture, bringing its unique strategy to the table.

ColBERT distinguishes itself by employing a method known as fine-grained contextual late interaction. It encodes each text passage into a matrix of token-level embeddings. At search time, ColBERT encodes the query into a matrix of token-level embeddings as well. It then scores each passage with the MaxSim operator: every query token is matched to its most similar passage token, and those maximum similarities are summed into the passage's score.

What sets models like ColBERT apart is their remarkable ability to adapt to new or complex subject areas and to do so with greater data efficiency. ColBERT is versatile: it can either spearhead the retrieval process from the ground up or step in as a reranker to refine results. In this tutorial, we'll delve into how ColBERT can enhance the reranking process, leveraging its strengths to achieve more precise and relevant search outcomes.

![3-Figure3-1.png](./images/3-Figure3-1.png)
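
The MaxSim scoring step can be sketched in a few lines. This is a simplified illustration, not ColBERT's actual implementation: the 2-dimensional token embeddings below are made-up values, and similarity is a plain dot product on vectors assumed to be unit length.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical unit-length token embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]              # has a close match for the first token only

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Because each query token picks its own best match, a passage scores well only if it covers all parts of the query, which is what makes the comparison fine-grained.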

In [8]:
# !pip install llama-index-postprocessor-colbert-rerank

In [9]:
from llama_index.postprocessor.colbert_rerank import ColbertRerank

colbert_reranker = ColbertRerank(
    top_n=2,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
    device="cuda"
)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[colbert_reranker],
)

In [10]:
print(query_engine.query("Give me a brief summary of the paper Tune-A-Video?"))

Context information is below.
---------------------
retrieval_score: 0.4995547718830967

SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion   Models for One-shot Video Tuning 
 Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of the flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across their temporal neighbors, resulting in smooth latents. It can be simply included as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics

In the example above, the retriever missed the mark: asked for a brief summary of Tune-A-Video, it returned the SmoothVideo paper instead. Yet the earlier phrasing, "What is the paper Tune-A-Video about?", retrieved the right paper. A slight change in wording makes the difference, which is why query rewriting matters for retrieval. So let's start exploring query rewriting in RAG!

## **2.3 Query Rewriting**

Query rewriting plays a pivotal role in optimizing the effectiveness of information retrieval systems like RAG. By aligning the semantic space of user queries with that of documents, query rewriting enhances the precision and relevance of search results. This process enables users to obtain more accurate and pertinent information, thus improving the overall efficiency of the system. 

### **Hypothetical Document Embeddings (HyDE)**

![image.png](./images/HyDE.png)

In their 2022 publication titled "Precise Zero-Shot Dense Retrieval without Relevance Labels," Gao, Ma, Lin, and Callan present an innovative approach called Hypothetical Document Embeddings (HyDE), which represents a significant advancement in zero-shot dense retrieval when relevance labels are absent.

HyDE operates on a captivating premise: leveraging an advanced language model to craft a hypothetical document in response to a query. Despite not physically existing, this document encapsulates the fundamental elements of the query, effectively bridging the gap between the query's intent and the available corpus. In essence, HyDE simplifies the process by utilizing a language model with a prompt such as "compose a passage addressing query xxx" to refine the query and enhance its suitability for retrieval. 

Let's dive into how to use it in `llama_index`. First, let's define an LLM for query rewriting.

In [12]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.llms.huggingface import HuggingFaceLLM

max_seq_length = 500  # Passed as max_new_tokens below: caps the length of the generated passage

llm = HuggingFaceLLM(model_name="google/gemma-2b-it", tokenizer_name="google/gemma-2b-it", context_window=4096, max_new_tokens=max_seq_length)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [22]:
query_str = "Give me a brief summary of the paper Tune-A-Video?"
hyde = HyDEQueryTransform(include_original=True, llm=llm)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

print(hyde_query_engine.query(query_str))


Fantastic! This aligns perfectly with our needs. Now, let's look at the prompt this method uses.

In [15]:
from llama_index.core.prompts.default_prompts import HYDE_TMPL

print(HYDE_TMPL)

Please write a passage to answer the question
Try to include as many key details as possible.


{context_str}


Passage:"""



This may look simple, but it is really effective. Let's take a look at the rewritten query!

In [16]:
rewritten_query = hyde.run(query_str)
print(rewritten_query.custom_embedding_strs[0])

Tune-A-Video is a machine learning-based video editing tool that allows users to create professional-looking videos without any prior video editing experience. The tool uses a deep learning algorithm to automatically generate a video from a set of images or videos.

The algorithm works by first analyzing the images or videos to identify the key elements and relationships between them. Then, it uses these elements to create a video that closely resembles the original.

Tune-A-Video offers a wide range of features, including the ability to add text, music, and effects to the video. It also allows users to customize the video's speed, resolution, and aspect ratio.

Overall, Tune-A-Video is a powerful and easy-to-use tool that can help users create professional-looking videos. However, it is important to note that the tool does require some technical knowledge to use effectively.
"""

Summary:

Tune-A-Video is a machine learning-based video editing tool that allows users to create professi

The query has been transformed into a passage that answers the question directly. The passage's details are invented (it describes a video editing tool), but its wording is closer to how a document reads than the original question was. This is why the retriever can now locate the relevant document, unlike when the query was not rewritten!

In [17]:
print(query_engine.query(rewritten_query))

Context information is below.
---------------------
retrieval_score: 0.649687671982034

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for   Text-to-Video Generation 
 To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting$\unicode{x2014}$One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tun

# **Step 3: Generation**

## **3.1 Basic Generation**

Great! Now we have all the retrieved context. Let's move on to the next step: generation.

In [26]:
%pip install llama-index-llms-huggingface


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting llama-index-llms-huggingface
  Using cached llama_index_llms_huggingface-0.1.3-py3-none-any.whl (7.2 kB)
Collecting huggingface-hub<0.21.0,>=0.20.3
  Using cached huggingface_hub-0.20.3-py3-none-any.whl (330 kB)
Installing collected packages: huggingface-hub, llama-index-llms-huggingface
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.21.3
    Uninstalling huggingface-hub-0.21.3:
      Successfully uninstalled huggingface-hub-0.21.3
Successfully installed huggingface-hub-0.20.3 llama-index-llms-huggingface-0.1.3
Note: you may need to restart the kernel to use updated packages.


In [18]:
from llama_index.llms.huggingface import HuggingFaceLLM

max_seq_length = 2048  # Passed as max_new_tokens below: upper bound on generated tokens per answer
model_name = "google/gemma-2b-it"

llm = HuggingFaceLLM(model_name=model_name, tokenizer_name=model_name, context_window=8192, max_new_tokens=max_seq_length)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



At this point, the retrieved context is ready to be inserted into a prompt template. Conveniently, `llama_index` ships default templates for this step. Calling the query engine's `get_prompts` method returns a dictionary of the templates it uses, which you can then use as-is or customize for your application.

In [19]:
prompts_dict = query_engine.get_prompts()
print(list(prompts_dict.keys()))

['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']


Let's take a look at the system prompt:

In [20]:
print(prompts_dict['response_synthesizer:text_qa_template'].conditionals[0][1].message_templates[0].content)

You are an expert Q&A system that is trusted around the world.
Always answer the query using the provided context information, and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.


Now the user prompt:

In [21]:
print(prompts_dict['response_synthesizer:text_qa_template'].conditionals[0][1].message_templates[1].content)

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


In [22]:
print(prompts_dict['response_synthesizer:refine_template'].default_template.template)

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 
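
The two templates suggest a create-then-refine pattern: the first chunk of context is answered with the QA template, and each further chunk is used to refine the answer so far. Below is a simplified sketch of that loop in plain Python. It is an illustration, not `llama_index`'s actual synthesizer (which also handles details such as packing chunks to fit the context window), and `recording_llm` is a hypothetical stand-in for a model call.

```python
TEXT_QA = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

REFINE = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better answer the query. "
    "If the context isn't useful, return the original answer.\n"
    "Refined Answer: "
)


def synthesize(query, chunks, llm):
    """Answer from the first chunk, then refine once per remaining chunk."""
    answer = llm(TEXT_QA.format(context_str=chunks[0], query_str=query))
    for chunk in chunks[1:]:
        answer = llm(
            REFINE.format(query_str=query, existing_answer=answer, context_msg=chunk)
        )
    return answer


# Hypothetical stand-in for an LLM: records each prompt and returns a numbered draft.
prompts_seen = []


def recording_llm(prompt):
    prompts_seen.append(prompt)
    return f"draft {len(prompts_seen)}"


final = synthesize("What is Tune-A-Video?", ["chunk A", "chunk B"], recording_llm)
print(final)
print(prompts_seen[1])
```

Printing the second prompt shows the first draft carried into the refine template alongside the new chunk.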


You can customize the system prompt and user prompt in `llama_index`. For now I just copy the default prompts, but you can write your own!

In [23]:
from llama_index.core import ChatPromptTemplate, PromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole

system_prompt = """
You are an expert Q&A system that is trusted around the world.
Always answer the query using the provided context information, and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
"""

user_prompt = """ 
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 
"""

refine_prompt = """
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 
"""

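# Chat-style messages (system + user), kept here for reference.
# Note: they are not passed to the query engine in this notebook.
message_template = [
    ChatMessage(content=system_prompt, role=MessageRole.SYSTEM),
    ChatMessage(content=user_prompt, role=MessageRole.USER)
]
# Plain-text templates: these are the ones registered via update_prompts below.
prompt_template = PromptTemplate(user_prompt)
refine_template = PromptTemplate(refine_prompt)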

In [24]:
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=2,
    node_postprocessors=[colbert_reranker],
)


query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": prompt_template, "response_synthesizer:refine_template": refine_template}
)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

In [25]:
print(hyde_query_engine.query("Give me a brief summary of the paper Tune-A-Video"))

The paper proposes a new T2V generation setting called One-Shot Video Tuning, where only one text-video pair is presented. The model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. It introduces a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy to generate multiple images concurrently.


There's a **BIG** issue at hand: almost 1 minute for a query! The extra RAG steps (HyDE rewriting and reranking) add model calls on top of answer generation, which significantly slows down each query. We need a way to speed up LLM inference, so let's look at some tools designed for that purpose.

## **3.2 Speeding up generation**

In [75]:
!pip install -q -U vllm
!pip install llama-index-llms-vllm


Collecting llama-index-llms-vllm
  Downloading llama_index_llms_vllm-0.1.6-py3-none-any.whl (4.9 kB)
Installing collected packages: llama-index-llms-vllm
Successfully installed llama-index-llms-vllm-0.1.6


In [2]:
model_id = "google/gemma-2b-it"

In [23]:
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

from unsloth import FastLanguageModel
import torch
import ray
import time

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

@ray.remote(num_gpus=1, max_calls=1)
def get_hf_token_per_sec(model_id=model_id, load_in_4bit=False):

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_id, # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
    token_counter = TokenCountingHandler(
        tokenizer=tokenizer.encode
    )

    llm = HuggingFaceLLM(model=model, tokenizer=tokenizer, context_window=8192, max_new_tokens=max_seq_length, callback_manager=CallbackManager([token_counter]))
    
    start = time.time()
    output = llm.complete("What is linear regression")
    end = time.time()

    print("LLM Completion Tokens:", token_counter.total_llm_token_count)
    print("Output: ", output)
    print("LLM Token/s: ", token_counter.total_llm_token_count / (end-start))
    return output


In [24]:
from llama_index.llms.vllm import Vllm

@ray.remote(num_gpus=1, max_calls=1)
def get_vllm_token_per_sec(model_id=model_id):

    llm = Vllm(
      model = model_id,
      max_new_tokens=2048,
    )

    token_counter = TokenCountingHandler(
        tokenizer=llm._client.get_tokenizer().encode
    )

    llm.callback_manager = CallbackManager([token_counter])

    start = time.time()
    output = llm.complete("What is linear regression")
    end = time.time()

    print("LLM Completion Tokens:", token_counter.total_llm_token_count)
    print("Output: ", output)
    print("LLM Token/s: ", token_counter.total_llm_token_count / (end-start))
    return output


In [4]:
get_hf_token_per_sec.remote()
get_vllm_token_per_sec.remote()

2024-03-22 01:02:47,377	INFO worker.py:1724 -- Started a local Ray instance.


ObjectRef(16310a0f0a45af5cffffffffffffffffffffffff0100000001000000)

[36m(get_hf_token_per_sec pid=93827)[0m ==((====))==  Unsloth: Fast Gemma patching release 2024.3
[36m(get_hf_token_per_sec pid=93827)[0m    \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.683 GB. Platform = Linux.
[36m(get_hf_token_per_sec pid=93827)[0m O^O/ \_/ \    Pytorch: 2.1.2+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
[36m(get_hf_token_per_sec pid=93827)[0m \        /    Bfloat16 = TRUE. Xformers = 0.0.23.post1. FA = False.
[36m(get_hf_token_per_sec pid=93827)[0m  "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:00<00:00,  1.63it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.79it/s]


[36m(get_hf_token_per_sec pid=93827)[0m LLM Completion Tokens: 224
[36m(get_hf_token_per_sec pid=93827)[0m Output:  ?
[36m(get_hf_token_per_sec pid=93827)[0m 
[36m(get_hf_token_per_sec pid=93827)[0m Linear regression is a statistical method that is used to find a straight line that best fits a set of data points. The line that is found by linear regression is called the least-squares line.
[36m(get_hf_token_per_sec pid=93827)[0m 
[36m(get_hf_token_per_sec pid=93827)[0m The process of linear regression involves the following steps:
[36m(get_hf_token_per_sec pid=93827)[0m 
[36m(get_hf_token_per_sec pid=93827)[0m 1. **Gather data.** Collect a set of data points that are evenly distributed throughout the range of the independent variable.
[36m(get_hf_token_per_sec pid=93827)[0m 2. **Identify the independent and dependent variables.** The independent variable is the variable that is changed by the experimenter, while the dependent variable is the variable that is affected 

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.38s/it]


[36m(get_vllm_token_per_sec pid=93960)[0m LLM Completion Tokens: 328
[36m(get_vllm_token_per_sec pid=93960)[0m Output:  ?
[36m(get_vllm_token_per_sec pid=93960)[0m 
[36m(get_vllm_token_per_sec pid=93960)[0m Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables. The aim is to find a linear function that best fits the data, allowing you to predict the dependent variable for new values of the independent variables.
[36m(get_vllm_token_per_sec pid=93960)[0m 
[36m(get_vllm_token_per_sec pid=93960)[0m **Here are the key steps involved in linear regression:**
[36m(get_vllm_token_per_sec pid=93960)[0m 
[36m(get_vllm_token_per_sec pid=93960)[0m 1. **Data preparation:** Gather and clean your data, ensuring that it meets the assumptions of linear regression (e.g., normality and independence of errors).
[36m(get_vllm_token_per_sec pid=93960)[0m 2. **Formulate the linear regression model:** Choose 

[36m(get_vllm_token_per_sec pid=93960)[0m Exception ignored in: <function Vllm.__del__ at 0x722b60c5ed40>
[36m(get_vllm_token_per_sec pid=93960)[0m Traceback (most recent call last):
[36m(get_vllm_token_per_sec pid=93960)[0m   File "/media/s24gb1/90a7e21c-edf4-4782-a0eb-731b73c521c2/Vietnamese_local_LLM/envBE/lib/python3.10/site-packages/llama_index/llms/vllm/base.py", line 216, in __del__
[36m(get_vllm_token_per_sec pid=93960)[0m ImportError: sys.meta_path is None, Python is likely shutting down


In [25]:
from llama_index.llms.ollama import Ollama
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
token_counter = TokenCountingHandler(
    tokenizer=tokenizer.encode
)
llm = Ollama(model="gemma:2b", callback_manager=CallbackManager([token_counter]))

start = time.time()
output = llm.complete("What is linear regression?")
end = time.time()

print("LLM Completion Tokens:", token_counter.total_llm_token_count)
print("Output: ", output)
print("LLM Token/s: ", token_counter.total_llm_token_count / (end-start))

LLM Completion Tokens: 326
Output:  Linear regression is a statistical method used to find a linear relationship between two or more variables. It involves finding a line that best fits the data points, with the line having the least amount of error.

**Here's how it works:**

1. **Data preparation:** The data is gathered and organized into a table with two or more variables.
2. **Forming a model:** A linear equation is formed based on the variables and the dependent variable (the variable we are trying to predict).
3. **Fitting the model:** The data points are then plotted on a graph, and the line that best fits the data is found using a technique called least-squares.
4. **Evaluating the model:** The goodness of fit of the linear regression model is then evaluated by comparing the predicted values to the actual values.
5. **Drawing the line:** The line that best fits the data is drawn on the graph, along with a 95% confidence interval.

**Uses of linear regression:**

* Predicting fu
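
For comparing backends, it can help to wrap the timing logic in one small helper. The sketch below is framework-agnostic and rests on assumptions: `generate` and `count_tokens` are hypothetical callables you would supply (for example an LLM's completion call and a tokenizer-based counter), and `slow_echo` is a dummy generator used only to make the example runnable. Unlike the cells above, which report `total_llm_token_count` (a figure that appears to include prompt tokens as well), this helper counts only the generated text.

```python
import time


def tokens_per_second(generate, count_tokens, prompt):
    """Time one generation call and return (generated_tokens, tokens_per_second)."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = count_tokens(text)
    rate = n_tokens / elapsed if elapsed > 0 else float("inf")
    return n_tokens, rate


def slow_echo(prompt):
    """Dummy generator for illustration: waits briefly, then returns fixed text."""
    time.sleep(0.01)
    return "linear regression fits a line"


def whitespace_count(text):
    """Dummy token counter for illustration: splits on whitespace."""
    return len(text.split())


n, rate = tokens_per_second(slow_echo, whitespace_count, "What is linear regression?")
print(f"{n} tokens at {rate:.1f} tokens/s")
```

Running the same helper over several prompts and averaging gives a steadier number than a single call.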

Now let's use Ollama to retest the previous example.

In [28]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.llms.ollama import Ollama

# llm = Ollama(model="gemma:2b")
rewrite_llm = HuggingFaceLLM(model_name="google/gemma-2b-it", tokenizer_name="google/gemma-2b-it", context_window=1024, max_new_tokens=500)

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=2,
    node_postprocessors=[colbert_reranker],
)

hyde = HyDEQueryTransform(include_original=True, llm=rewrite_llm)

hyde_query_engine = TransformQueryEngine(query_engine, hyde)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


In [21]:
print(hyde_query_engine.query("Give me a brief summary of the paper Tune-A-Video"))

Sure, here is a brief summary of the paper Tune-A-Video:

The paper proposes a new T2V generation setting called One-Shot Video Tuning. This setting involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. The method is built on state-of-the-art T2I diffusion models pre-trained on massive image data. The main focus is on continuous motion, and the method demonstrates remarkable ability to generate high-quality video summaries for various applications.


In [20]:
hyde.run("Give me a brief summary of the paper Tune-A-Video")

QueryBundle(query_str='Give me a brief summary of the paper Tune-A-Video', image_path=None, custom_embedding_strs=['Tune-A-Video is a machine learning-based video editing tool that allows users to create professional-looking videos without any prior video editing experience. The tool uses a deep learning algorithm to automatically generate a video from a set of images or videos. The algorithm takes into account a wide range of factors, including the content, style, and tone of the video, to create a video that closely resembles the input images.\n\nThe tool is designed to be user-friendly and accessible, with a simple and intuitive interface that allows users to easily select and edit images and videos. Once the video is created, it can be exported in a variety of formats, including MP4, MOV, and GIF.\n\nTune-A-Video is a powerful and versatile tool that can be used for a wide range of purposes, including creating marketing videos, social media posts, and educational videos. It is also

## **3.3 Using Advanced RAG Methods**

In [49]:
!pip install llama-index-retrievers-bm25

Collecting llama-index-retrievers-bm25
  Downloading llama_index_retrievers_bm25-0.1.3-py3-none-any.whl (2.9 kB)
Collecting rank-bm25<0.3.0,>=0.2.2
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25, llama-index-retrievers-bm25
Successfully installed llama-index-retrievers-bm25-0.1.3 rank-bm25-0.2.2


In [12]:
title_arxiv_df = arxiv_df.drop(columns=["abstract", "cat_text", "prepared_text"])
title_arxiv_df.head()

| Unnamed: 0 | id | title | categories |
|---|---|---|---|
| 1266 | 704.1267 | Text Line Segmentation of Historical Documents... | cs.CV |
| 1273 | 704.1274 | Parametric Learning and Monte Carlo Optimization | cs.LG |
| 1393 | 704.1394 | Calculating Valid Domains for BDD-Based Intera... | cs.AI |
| 2009 | 704.201 | A study of structural properties on profiles HMMs | cs.AI |
| 2667 | 704.2668 | Supervised Feature Selection via Dependence Es... | cs.LG |


In [38]:
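# Cast the id to str first so the concatenation also works if pandas parsed it as a number
title_arxiv_df['concat_title'] = "Id: " + title_arxiv_df['id'].astype(str) + "\nTitle: " + title_arxiv_df["title"] + "\nCategory: " + title_arxiv_df["categories"]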

In [58]:
from llama_index.core import Document, KeywordTableIndex

documents = [Document(text=text) for text in title_arxiv_df['concat_title']]
keyword_index = KeywordTableIndex.from_documents(documents)

In [66]:
keyword_retriever = keyword_index.as_retriever(retriever_mode="rake")

In [72]:
keyword_retriever.retrieve("Tune-A-Video")

[NodeWithScore(node=TextNode(id_='f42f51d7-63a8-4879-b959-8538976411e4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='843e86fe-9cf5-4f6c-a158-2a40f5f3bfb1', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='72aa08cc325080127df07a0981965a7b8ece9000eb06a68a816562a6aeb7847e'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='9730e3c0-71e9-4a2f-a4d7-7e37bce0a754', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='d4500cb9c93b58584ae67b6c42d4c704847d4e6ca8f40447b5ca7c0c905f3f5e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='a4262f7a-437f-4786-8b91-57c1d42dc46b', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='30c4aada433879f82ddef2dea7aa83006822627a8a075e9578419df2419df416')}, text='Id: 2303.13009\nTitle: MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation   Models\nCategory: cs.CV', start_char_idx=0, end_char_idx=118

In [71]:
!pip install -q arxiv

In [74]:
import arxiv


def download_arxiv_paper(arxiv_id, output_filename=None):
    """Downloads the source archive of an arXiv paper by ID.

    Args:
        arxiv_id (str): The ID of the arXiv paper (e.g., "2212.11565" or "2212.11565v1").
        output_filename (str, optional): The desired filename for the downloaded archive.
            If not provided, a default filename based on the paper's short ID is used.

    Returns:
        arxiv.Result: The arXiv paper object whose source was downloaded.
    """

    client = arxiv.Client()
    search = arxiv.Search(id_list=[arxiv_id])
    paper = next(client.results(search))

    if output_filename:
        filename = output_filename
    else:
        # download_source() fetches a gzipped tarball of the paper source, not a PDF
        filename = f"{paper.get_short_id()}.tar.gz"

    paper.download_source(filename=filename)
    return paper

# Example usage:
downloaded_paper = download_arxiv_paper("2212.11565", output_filename="Tune-A-Video.tar.gz")

In [81]:
!tar -xzf ./Tune-A-Video.tar.gz
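
arXiv source downloads are gzipped tar archives rather than zip files. A minimal sketch of extracting one from Python with the standard-library `tarfile` module is shown below. The helper name `extract_source` and the output directory are assumptions for illustration, and the path check is a basic precaution rather than a complete security measure.

```python
import tarfile
from pathlib import Path


def extract_source(archive_path, out_dir):
    """Extract a .tar.gz archive into out_dir and return the extracted file paths."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        safe_members = []
        for member in tar.getmembers():
            target = (out_dir / member.name).resolve()
            # Skip links and anything that would land outside out_dir.
            if member.issym() or member.islnk() or out_dir not in target.parents:
                continue
            safe_members.append(member)
        tar.extractall(out_dir, members=safe_members)
    return sorted(
        p.relative_to(out_dir).as_posix() for p in out_dir.rglob("*") if p.is_file()
    )


# Example (assumes the archive downloaded earlier is in the working directory):
# extract_source("Tune-A-Video.tar.gz", "Tune-A-Video")
```

If `tarfile` raises `ReadError`, the download may be a single gzipped file rather than a tarball, in which case the `gzip` module is the tool to reach for.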


TODO List:
- [ ] Improve the data section with PDF parsing, more papers, and the paper body instead of just abstract.
- [ ] Improve the Indexing section with more chunking methods (Semantic Chunking). Dive into how the VectorStoreIndex works.
- [ ] Add LLM evaluatio
