# Docuquery 

### RAG pipeline to find relevant passages and chat with a lightweight LLM 

### Resources 

RAG paper: https://arxiv.org/abs/2005.11401 <br>
Inspiration for the project: https://github.com/mrdbourke/simple-local-rag/blob/main/video_notebooks/00-simple-local-rag-video.ipynb

## RAG (Retrieval Augmented Generation)

The main idea of RAG is to retrieve relevant information and pass it to an LLM so that it generates output grounded in that context.

Retrieval - After indexing, find the passages with the highest similarity scores to the query <br>
Augmented - Add the retrieved passages to the prompt given to the LLM <br>
Generation - The LLM generates the output based on the two steps above
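
The three stages can be sketched end to end with a toy example. This is only an illustration, not the pipeline built in this notebook: it assumes a bag-of-words cosine similarity in place of a real embedding model, and the helper names (`bow_vector`, `cosine`, `retrieve`, `build_prompt`) and the sample passages are hypothetical. The generation step is left out because it needs an actual LLM.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k passages most similar to the query."""
    q = bow_vector(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow_vector(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


# Made-up passages for illustration only
passages = [
    "Community-based research pairs academic researchers with community partners.",
    "The toolkit describes how to form a community advisory team.",
    "Bananas are rich in potassium.",
]
query = "What is community-based research?"
prompt = build_prompt(query, retrieve(query, passages, k=2))
print(prompt)  # Generation: this prompt would then be sent to the LLM
```

The unrelated passage shares no words with the query, so it is ranked last and never reaches the prompt.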

## Why RAG?

1. Identify whether large PDFs, articles, and research papers published by various scholars contribute to community-based research
2. An LLM pretrained on internet knowledge can be applied to custom data
3. RAG helps find relevant factors/keywords in the PDFs related to community-based or non-community-based research
4. The LLM prompt can be augmented with chunks of the PDFs, so questions can be asked about them

## Approach

1. Upload a directory of PDFs
2. Format the text in the PDFs so it can be passed to an embedding model
3. Embed all chunks and store them in a dataframe for later use
4. Build the prompt by incorporating the retrieved parts of the text

In [1]:
## Required libraries

import fitz
from tqdm.auto import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np 
import spacy
import os

%pip install langchain faiss-cpu tiktoken pypdf

## Formatting text

In [2]:
def text_formatter(text: str)-> str:
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

In [3]:
# def get_tfidf(text: str, vectorizer):
#     tfidf_matrix = vectorizer.transform([text])
#     return tfidf_matrix.sum(axis=1).tolist()[0]

In [4]:
# nlp = spacy.load("en_core_web_sm")

# def split_into_sentences(text: str) -> list:
#     doc = nlp(text)
#     sentences = [sent.text.strip() for sent in doc.sents]
#     return sentences

In [5]:
# Read a PDF and build per-page data.
# The page number may not be needed here.

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    texts = []
   
    # all_text = ""
    # for page in doc:
    #     all_text += page.get_text()

    
    # vectorizer.fit([all_text])

    for page in tqdm(doc, desc=f"Processing {os.path.basename(pdf_path)}"):
        text = page.get_text()
        text = text_formatter(text)

        #sentences = split_into_sentences(text)
        
        #sentence_tfidf_values = [get_tfidf(sentence, vectorizer) for sentence in sentences]

        page_data = {
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # rough estimate: ~4 characters per token
            "text": text,
           # "sentence_tfidf_values": sentence_tfidf_values,
            # Note: text_formatter has already replaced newlines, so this is always 1
            "paragraph_count": text.count("\n\n") + 1

        }

        texts.append(page_data)

    return texts


In [6]:
def process_dir(folder_path: str) -> dict:
    pdf_results = {}

    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(folder_path, filename)
            pdf_results[filename] = open_and_read_pdf(pdf_path)
    
    return pdf_results

In [7]:
folder_path = "rag_indexing"
texts_res = process_dir(folder_path = folder_path)

for pdf_name, pages in texts_res.items():
    print(f"Results for {pdf_name}:")
    for i, page in enumerate(pages):
        print(f"Page {i+1}:")
        print(page)
        print("\n---\n")

Processing Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf:   0%|          | 0/68 [00:00<?…

Processing CRC-Guidelines-May-12-2021.pdf:   0%|          | 0/28 [00:00<?, ?it/s]

Processing ea68a318-51e8-40ab-b056-55f6c99a5859_661bf397-e4a4-40db-bcf7-c07013d7e340.pdf:   0%|          | 0/1…

Processing guide_for_researchers.pdf:   0%|          | 0/16 [00:00<?, ?it/s]

Results for Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf:
Page 1:
{'page_word_count': 24, 'page_sentence_count_raw': 2, 'page_token_count': 40.25, 'text': 'Community  Engagement  Toolkit Building Purpose  and Participation U.S. Department of Housing and Urban Development  Office of Community Planning and Development', 'paragraph_count': 1}

---

Page 2:
{'page_word_count': 221, 'page_sentence_count_raw': 8, 'page_token_count': 349.75, 'text': 'This Toolkit was created in partnership  by Enterprise Community Partners. Content and design by:  The Practice of Democracy and We All Rise  Original illustrations by: Emma Silverblatt  This material is based upon work supported, in whole or in part, by  Federal award number 19FC115253 under the 2019 Community Compass  cooperative agreement and under 17FC104469 under the 2017 Community  Compass cooperative agreement awarded to Enterprise Community  Partners by the U.S. Department of Housing and Urban Development. The  subs

In [8]:
print(texts_res['Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf'])

[{'page_word_count': 24, 'page_sentence_count_raw': 2, 'page_token_count': 40.25, 'text': 'Community  Engagement  Toolkit Building Purpose  and Participation U.S. Department of Housing and Urban Development  Office of Community Planning and Development', 'paragraph_count': 1}, {'page_word_count': 221, 'page_sentence_count_raw': 8, 'page_token_count': 349.75, 'text': 'This Toolkit was created in partnership  by Enterprise Community Partners. Content and design by:  The Practice of Democracy and We All Rise  Original illustrations by: Emma Silverblatt  This material is based upon work supported, in whole or in part, by  Federal award number 19FC115253 under the 2019 Community Compass  cooperative agreement and under 17FC104469 under the 2017 Community  Compass cooperative agreement awarded to Enterprise Community  Partners by the U.S. Department of Housing and Urban Development. The  substance and findings of the work are dedicated to the public. Neither the  United States Government, no

### Why is token count important?

1. Embedding models can't handle an unlimited number of tokens
2. LLMs can't handle an unlimited number of tokens

Fewer tokens --> faster processing
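
The notebook estimates tokens as characters divided by 4. A small sketch of using that estimate to check a text against a model limit follows. The function names are hypothetical, and `max_tokens=384` is an assumed limit for illustration; the real value should be read from the embedding model or LLM in use.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: about 4 characters per token for English text."""
    return len(text) / chars_per_token


def fits_context(text: str, max_tokens: int = 384) -> bool:
    """True if the estimated token count is within the given limit.

    max_tokens=384 is an assumption for illustration, not a verified
    property of any particular model.
    """
    return estimate_tokens(text) <= max_tokens


page_text = "word " * 400  # 2000 characters
print(estimate_tokens(page_text))  # 500.0
print(fits_context(page_text))     # False
```

Text that does not fit would be truncated or rejected by the model, which is one reason the pages are split into smaller chunks later on.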

https://spacy.io/api/sentencizer

### Text to chunks

In [10]:
# Spacy (to convert passages to sentences)

from spacy.lang.en import English
nlp = English()
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x11892c74dc0>

In [11]:

# Load a spaCy model and segment every page of every PDF into sentences.
# process_text is a helper that returns the sentences as strings together with their count;
# both are stored on the page and then printed.

import spacy
from tqdm.auto import tqdm

nlp = spacy.load("en_core_web_sm")

def process_text(text):
    doc = nlp(text)
    sentences = [str(sent) for sent in doc.sents]
    return sentences, len(sentences)


for pdf_name, pages in texts_res.items():  
    print(f"Processing text from {pdf_name}:\n")
    for page in tqdm(pages, desc=f"Processing pages of {pdf_name}"): 
        page["sentences"], page["sentence_count_spacy"] = process_text(page["text"])
        
        print(f"Page sentences: {page['sentences']}")
        print(f"Sentence count (spaCy): {page['sentence_count_spacy']}")
        print("\n---\n")

Processing text from Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf:



Processing pages of Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf:   0%|          | 0/68…

Page sentences: ['Community  Engagement  Toolkit Building Purpose  and Participation U.S. Department of Housing and Urban Development  Office of Community Planning and Development']
Sentence count (spaCy): 1

---

Page sentences: ['This Toolkit was created in partnership  by Enterprise Community Partners.', 'Content and design by:  The Practice of Democracy and We All Rise  Original illustrations by: Emma Silverblatt  This material is based upon work supported, in whole or in part, by  Federal award number 19FC115253 under the 2019 Community Compass  cooperative agreement and under 17FC104469 under the 2017 Community  Compass cooperative agreement awarded to Enterprise Community  Partners by the U.S. Department of Housing and Urban Development.', 'The  substance and findings of the work are dedicated to the public.', 'Neither the  United States Government, nor any of its employees, makes any warranty,  express or implied, or assumes any legal liability or responsibility for the  accura

Processing pages of CRC-Guidelines-May-12-2021.pdf:   0%|          | 0/28 [00:00<?, ?it/s]

Page sentences: ['01 IN IT TOGETHER Community-Based Research Guidelines  for Communities and Higher Education The Community Research Collaborative Salt Lake City, UT, May 2021']
Sentence count (spaCy): 1

---

Page sentences: ['Cover photo: Claudia Loayza and a young participant contribute to a mural at Poplar Grove Park during an  Earth Day Placemaking Event with Re-Imagining Nature SLC.', 'Re-Imagining Nature SLC is a partnership with  SLC Public Lands, the College of Architecture & Urban Planning, and University Neighborhood Partners at  the University of Utah, aimed at developing a community-driven plan for the future of Salt Lake City’s natural  lands, urban forest, and city parks.', 'Photographer: Izzy Fuller.', 'Used with permission.', 'This document should be cited as: Community Research Collaborative.', '(2021).', 'In it together: Community- based research guidelines for communities and higher education.', 'Salt Lake City, UT: University of Utah.']
Sentence count (spaCy): 8

-

Processing pages of ea68a318-51e8-40ab-b056-55f6c99a5859_661bf397-e4a4-40db-bcf7-c07013d7e340.pdf:   0%|      …

Page sentences: ['© Journal of Higher Education Outreach and Engagement, Volume 27, Number 1, p. 203, (2023)', 'Copyright © 2023 by the University of Georgia.', 'eISSN 2164-8212    Sense(making) & Sensibility: Reflections on   an Interpretivist Inquiry of Critical                        Service Learning Laura Weaver, Kiesha Warren-Gordon, Susan Crisafulli,   Adam J. Kuban, Jessica E. Lee, and Cristina Santamaría Graff Abstract Critical service learning, as outlined by Mitchell (2008), highlights the  importance of shifting from the charity- and project-based model to a  social-change model of service learning.', 'Her call for greater attention  to social change, redistribution of power, the development of authentic  relationships, and, more recently with Latta (2020), futurity as the  central strategies to enacting “community-based pedagogy” has received  significant attention.', 'However, little research has occurred on how to  measure the effectiveness of these components.', 'This re

Processing pages of guide_for_researchers.pdf:   0%|          | 0/16 [00:00<?, ?it/s]

Page sentences: ['Cover COMMUNITY-ENGAGED RESEARCH A Quick-Start Guide for Researchers Community Engagement Program Clinical & Translational Science Institute at the University of California, San Francisco Contributors:  Margaret Handley, PhD, MPH Rena Pasick, DrPH Michael Potter, MD Geraldine Oliva, MD, MPH Ellen Goldstein, MA Tung Nguyen, MD Series Editor:  ', 'Paula Fleisher, MA']
Sentence count (spaCy): 2

---

Page sentences: ['About this Guide  This Quick-Start Guide is intended for academic researchers  at UCSF who are interested in community-based partnerships  for research.', 'The Guide is a product of the Community  Engagement Program of the UCSF Clinical & Translational  Science Institute (CTSI).', 'One of the Program’s primary aims  is to help academic researchers develop effective and  mutually-satisfying collaborations with community-based  organizations, clinicians or other community stakeholders.  ', 'In this Quick-Start Guide, you will find: n  reasons why community-en

In [12]:
'''
Collect the processed results for each page into a list of dictionaries,
convert the list to a DataFrame, and display it.
'''

import pandas as pd
data = []

for pdf_name, pages in texts_res.items():
    for page_number, page in enumerate(pages, start=1):
       
        data.append({
            "pdf_name": pdf_name,
            "page_number": page_number,
            "text": page["text"],
            "sentences": page["sentences"],
            "sentence_count_spacy": page["sentence_count_spacy"]
        })


df = pd.DataFrame(data)
print(df)

                                              pdf_name  page_number  \
0    Community-Engagement-Toolkit_Building_Purpose_...            1   
1    Community-Engagement-Toolkit_Building_Purpose_...            2   
2    Community-Engagement-Toolkit_Building_Purpose_...            3   
3    Community-Engagement-Toolkit_Building_Purpose_...            4   
4    Community-Engagement-Toolkit_Building_Purpose_...            5   
..                                                 ...          ...   
125                          guide_for_researchers.pdf           12   
126                          guide_for_researchers.pdf           13   
127                          guide_for_researchers.pdf           14   
128                          guide_for_researchers.pdf           15   
129                          guide_for_researchers.pdf           16   

                                                  text  \
0    Community  Engagement  Toolkit Building Purpos...   
1    This Toolkit was created i

In [13]:
df.describe(include="all").round(3)

| | pdf_name | page_number | text | sentences | sentence_count_spacy |
|---|---|---|---|---|---|
| count | 130 | 130.0 | 130 | 130 | 130.0 |
| unique | 4 | | 130 | 130 | |
| top | Community-Engagement-Toolkit_Building_Purpose_... | | Community Engagement Toolkit Building Purpos... | [Community Engagement Toolkit Building Purpo... | |
| freq | 68 | | 1 | 1 | |
| mean | | 23.531 | | | 18.985 |
| std | | 18.98 | | | 12.364 |
| min | | 1.0 | | | 1.0 |
| 25% | | 9.0 | | | 10.0 |
| 50% | | 17.0 | | | 18.0 |
| 75% | | 35.75 | | | 26.0 |


In [14]:
print(df.head())

                                            pdf_name  page_number  \
0  Community-Engagement-Toolkit_Building_Purpose_...            1   
1  Community-Engagement-Toolkit_Building_Purpose_...            2   
2  Community-Engagement-Toolkit_Building_Purpose_...            3   
3  Community-Engagement-Toolkit_Building_Purpose_...            4   
4  Community-Engagement-Toolkit_Building_Purpose_...            5   

                                                text  \
0  Community  Engagement  Toolkit Building Purpos...   
1  This Toolkit was created in partnership  by En...   
2  P14 P19 P23 P28 P31 P35 . U.S. Department of H...   
3  TOOLKIT INTRODUCTION US Department of Housing ...   
4  U.S. Department of Housing and Urban Developme...   

                                           sentences  sentence_count_spacy  
0  [Community  Engagement  Toolkit Building Purpo...                     1  
1  [This Toolkit was created in partnership  by E...                     6  
2  [P14 P19 P23 P

## Chunking sentences together

Chunking means splitting larger pieces of text into smaller ones, here by grouping every 10 sentences (the group size could also be chosen based on the size of the PDF).

LangChain is helpful for this:
https://python.langchain.com/docs/modules/data_connection/document_transformers/ 

1. Smaller groups of text are easier to inspect than large passages.

2. Text chunks can fit into the embedding model's context window (e.g., its token limit).

3. More specific context is passed to the LLM.

### Demonstration of how chunking is performed 

In [15]:
num_sentence_chunk_size = 10

def split_list(input_list: list[str], slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
res = split_list(test_list) 

print(res)    

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24]]
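
The split above produces non-overlapping groups, so a sentence at a chunk boundary is separated from its neighbours. One common variation, which this notebook does not use, is a sliding window where consecutive chunks share a few items. A sketch, with a hypothetical function name and `overlap` parameter:

```python
def split_list_with_overlap(input_list: list, slice_size: int = 10, overlap: int = 2) -> list[list]:
    """Split a list into windows of slice_size that share `overlap` items."""
    if not 0 <= overlap < slice_size:
        raise ValueError("overlap must be >= 0 and smaller than slice_size")
    step = slice_size - overlap
    chunks = []
    for start in range(0, len(input_list), step):
        chunks.append(input_list[start:start + slice_size])
        if start + slice_size >= len(input_list):
            break  # this window already reaches the end of the list
    return chunks


print(split_list_with_overlap(list(range(25)), slice_size=10, overlap=2))
```

With `overlap=0` this gives the same result as `split_list`; a larger overlap produces more chunks and therefore more embeddings to store.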


In [16]:
for pdf_name, pages in tqdm(texts_res.items(), desc="Processing PDFs"):
    for page in pages:
        page["sentence_chunks"] = split_list(input_list=page["sentences"], slice_size = num_sentence_chunk_size)

        page["num_chunks"] = len(page["sentence_chunks"])

Processing PDFs:   0%|          | 0/4 [00:00<?, ?it/s]

In [17]:
import random

random.sample(list(texts_res.items()), k=1)


[('ea68a318-51e8-40ab-b056-55f6c99a5859_661bf397-e4a4-40db-bcf7-c07013d7e340.pdf',
  [{'page_word_count': 579,
    'page_sentence_count_raw': 20,
    'page_token_count': 865.0,
    'text': '© Journal of Higher Education Outreach and Engagement, Volume 27, Number 1, p. 203, (2023) Copyright © 2023 by the University of Georgia. eISSN 2164-8212    Sense(making) & Sensibility: Reflections on   an Interpretivist Inquiry of Critical                        Service Learning Laura Weaver, Kiesha Warren-Gordon, Susan Crisafulli,   Adam J. Kuban, Jessica E. Lee, and Cristina Santamaría Graff Abstract Critical service learning, as outlined by Mitchell (2008), highlights the  importance of shifting from the charity- and project-based model to a  social-change model of service learning. Her call for greater attention  to social change, redistribution of power, the development of authentic  relationships, and, more recently with Latta (2020), futurity as the  central strategies to enacting “community

In [18]:
df = pd.DataFrame(texts_res.items())
print(df.columns)

RangeIndex(start=0, stop=2, step=1)


In [19]:
# Flatten the nested dictionary (PDF -> pages) into one row per page.
# pandas needs all columns to have values of the same length to build a
# DataFrame directly, so we build a flat list of rows first and then
# convert it to a DataFrame to see what it looks like.
import pandas as pd

rows = []
for pdf_name, pages in texts_res.items():
    for page_number, page in enumerate(pages, start=1): 
        row = {
            "pdf_name": pdf_name,
            "page_number": page_number,
            **page 
        }
        rows.append(row)

df = pd.DataFrame(rows)
print(df)

                                              pdf_name  page_number  \
0    Community-Engagement-Toolkit_Building_Purpose_...            1   
1    Community-Engagement-Toolkit_Building_Purpose_...            2   
2    Community-Engagement-Toolkit_Building_Purpose_...            3   
3    Community-Engagement-Toolkit_Building_Purpose_...            4   
4    Community-Engagement-Toolkit_Building_Purpose_...            5   
..                                                 ...          ...   
125                          guide_for_researchers.pdf           12   
126                          guide_for_researchers.pdf           13   
127                          guide_for_researchers.pdf           14   
128                          guide_for_researchers.pdf           15   
129                          guide_for_researchers.pdf           16   

     page_word_count  page_sentence_count_raw  page_token_count  \
0                 24                        2     

### Splitting each chunk into own item

1. Embed each chunk of sentences into its own numerical representation
2. Achieve finer granularity
3. Allow generated answers to be returned with references to their source
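
Keeping `pdf_name` and `page_number` on every chunk is what later allows an answer to cite its source. A sketch of how retrieved chunks might be formatted into a prompt context with references follows. `format_context` is a hypothetical helper and the sample records are made up for illustration; they only reuse the key names that this notebook stores for each chunk.

```python
def format_context(chunks: list[dict]) -> str:
    """Number each chunk and prefix it with its source file and page."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        source = f"{chunk['pdf_name']}, p. {chunk['page_number']}"
        lines.append(f"[{i}] ({source}) {chunk['sentence_chunk']}")
    return "\n".join(lines)


# Made-up records for illustration only
sample_chunks = [
    {"pdf_name": "guide_for_researchers.pdf", "page_number": 3,
     "sentence_chunk": "Example chunk text A."},
    {"pdf_name": "CRC-Guidelines-May-12-2021.pdf", "page_number": 7,
     "sentence_chunk": "Example chunk text B."},
]
print(format_context(sample_chunks))
```

The numbered markers give the LLM something to refer to, so a generated answer can point back to a file and page.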

In [20]:
'''
The purpose of splitting the text into chunks is to prepare the data for efficient processing by LLMs (Large Language Models) or other NLP tools,
which often have input size limitations (e.g., token limits).
'''

import re
from tqdm.auto import tqdm

pages_and_chunks = []

# For each PDF and each page (numbered from 1), turn every sentence chunk
# into its own item: join the sentences into a paragraph-like string
# and record basic stats for the chunk.

for pdf_name, pages in tqdm(texts_res.items(), desc="Processing PDFs and chunks"):
    for page_number, page in enumerate(pages, start=1):  
        for sentence_chunk in page["sentence_chunks"]:  
            chunk_dict = {
                "pdf_name": pdf_name,  
                "page_number": page_number,  
            }

            joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
            joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)  

            chunk_dict["sentence_chunk"] = joined_sentence_chunk

            
            chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
            chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
            chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # Approximation: 1 token -> 4 chars
            pages_and_chunks.append(chunk_dict)

print(f"Total chunks processed: {len(pages_and_chunks)}")


Processing PDFs and chunks:   0%|          | 0/4 [00:00<?, ?it/s]

Total chunks processed: 309


### Updated df

In [21]:
import pandas as pd
rows = []

for chunk in pages_and_chunks:
    row = {
        "pdf_name": chunk["pdf_name"],
        "page_number": chunk["page_number"],
        "sentence_chunk": chunk["sentence_chunk"],
        "chunk_char_count": chunk["chunk_char_count"],
        "chunk_word_count": chunk["chunk_word_count"],
        "chunk_token_count": chunk["chunk_token_count"]
    }
    rows.append(row)

df = pd.DataFrame(rows)
print(df.head())

                                            pdf_name  page_number  \
0  Community-Engagement-Toolkit_Building_Purpose_...            1   
1  Community-Engagement-Toolkit_Building_Purpose_...            2   
2  Community-Engagement-Toolkit_Building_Purpose_...            3   
3  Community-Engagement-Toolkit_Building_Purpose_...            4   
4  Community-Engagement-Toolkit_Building_Purpose_...            5   

                                      sentence_chunk  chunk_char_count  \
0  Community Engagement Toolkit Building Purpose ...               158   
1  This Toolkit was created in partnership by Ent...              1381   
2  P14 P19 P23 P28 P31 P35 . U. S. Department of ...              1237   
3  TOOLKIT INTRODUCTION US Department of Housing ...              1571   
4  U. S. Department of Housing and Urban Developm...              1676   

   chunk_word_count  chunk_token_count  
0                21              39.50  
1               203             345.25  
2               2

In [22]:
import pandas as pd
min_token_length = 30

filtered_chunks = df[df["chunk_token_count"] <= min_token_length]

for _, row in filtered_chunks.sample(5).iterrows(): 
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 14.5 | Text: What caused those changes?How has the community responded?
Chunk token count: 6.5 | Text: © University of Utah, 2021
Chunk token count: 4.75 | Text: PART 1 ACTION ITEMS
Chunk token count: 7.0 | Text: org/10.1057/9781137315984_20
Chunk token count: 27.0 | Text: “Solidarity is something that is made and remade and never just is.”– Ruth Wilson Gilmore community research


In [23]:
# Filter out rows with 30 tokens or fewer

pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'pdf_name': 'Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf',
  'page_number': 1,
  'sentence_chunk': 'Community Engagement Toolkit Building Purpose and Participation U. S. Department of Housing and Urban Development Office of Community Planning and Development',
  'chunk_char_count': 158,
  'chunk_word_count': 21,
  'chunk_token_count': 39.5},
 {'pdf_name': 'Community-Engagement-Toolkit_Building_Purpose_and_Participation.pdf',
  'page_number': 2,
  'sentence_chunk': 'This Toolkit was created in partnership by Enterprise Community Partners. Content and design by: The Practice of Democracy and We All Rise Original illustrations by: Emma Silverblatt This material is based upon work supported, in whole or in part, by Federal award number 19FC115253 under the 2019 Community Compass cooperative agreement and under 17FC104469 under the 2017 Community Compass cooperative agreement awarded to Enterprise Community Partners by the U. S. Department of Housing and Urban Deve

### Embedding the chunks

Sample code showing how embedding works

In [24]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu")

sentences = ["The Sentence Transformer library provides an easy way for embeddings",
             "It can be embedded one by one or a list",
             "It also useful retrieval and generation"]

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")




Sentence: The Sentence Transformer library provides an easy way for embeddings
Embedding: [-2.64196470e-02  5.11652902e-02 -1.90862268e-02  5.39608598e-02
 -1.24395741e-02  1.05894147e-03  7.31898146e-03 -6.29600435e-02
 -2.59763026e-03 -2.16463730e-02  3.31635997e-02  4.45705988e-02
 -3.20829563e-02  6.88580144e-03  3.46016511e-02 -6.21919073e-02
  4.56626937e-02 -3.34361312e-03 -1.56595241e-02  1.71901975e-02
  2.32384242e-02  1.96844582e-02  1.62895191e-02  4.46745828e-02
 -1.02375690e-02 -2.92083211e-02  1.20528685e-02 -2.93937717e-02
  5.77370673e-02 -1.83068018e-03 -4.29832228e-02 -4.43677045e-03
  3.75034474e-02 -6.27005065e-04  9.70025440e-07  3.14456318e-03
 -3.55627164e-02 -1.14210565e-02  1.81245822e-02  1.24332365e-02
  5.40259890e-02 -6.69255704e-02  1.67771559e-02  4.03443687e-02
 -4.56294306e-02 -2.72983219e-02  4.89208214e-02  2.21724231e-02
  6.84331730e-02  5.08508123e-02 -1.38701461e-02 -4.28568721e-02
  2.82898574e-04 -1.49486400e-02 -1.37374271e-02  1.65402908e-02


In [25]:
%%time

embedding_model.to("cuda")

for i in tqdm(pages_and_chunks_over_min_token_len):
    i["embedding"] = embedding_model.encode(i["sentence_chunk"])

  0%|          | 0/295 [00:00<?, ?it/s]

CPU times: total: 25.5 s
Wall time: 10.8 s


In [None]:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124


In [26]:
%%time

text_chunks = [i["sentence_chunk"] for i in pages_and_chunks_over_min_token_len]

CPU times: total: 0 ns
Wall time: 0 ns


In [27]:
text_chunks[50]

'U. S. Department of Housing and Urban Development How do I ensure that traditionally underrepresented voices are included?  •When recruiting members for the team, it’s important that each person understands the needs of the stakeholders they represent and can advocate for their interests. • Not every person or organization will have the capacity or resources to participate in a community advisory team. In order to have a truly representative team, it’s important to address any barriers that might prevent individuals from joining - such as lack of financial flexibility, time to participate in meetings, access to technology, and trust, including cultural or learned preconceptions around development, or negative past experiences.• Developing relationships with community leaders is an essential first step to create local interest in joining the team. PART 1 - Section 3 Page 25 Community Engagement ToolkitWho should be included in the community advisory team?  • The purpose of an advisory 

In [28]:
len(text_chunks)

295

## Saving the embeddings to file

In [30]:
pwd

'c:\\Users\\kirth\\OneDrive - Indiana University\\IUB\\3rd semester\\Learning\\ML\\RA\\Code\\RAGS'

In [31]:
text_chunk_embeddings = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_save_path = "text_chunk_embeddings_df.csv"
text_chunk_embeddings.to_csv(embeddings_save_path, index=False)

In [32]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_save_path)
text_chunks_and_embedding_df_load.head()

| | pdf_name | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|---|
| 0 | Community-Engagement-Toolkit_Building_Purpose_... | 1 | Community Engagement Toolkit Building Purpose ... | 158 | 21 | 39.5 | [-2.65781600e-02 2.88857743e-02 -2.10008025e-... |
| 1 | Community-Engagement-Toolkit_Building_Purpose_... | 2 | This Toolkit was created in partnership by Ent... | 1381 | 203 | 345.25 | [-3.48760672e-02 1.18880965e-01 -5.03364652e-... |
| 2 | Community-Engagement-Toolkit_Building_Purpose_... | 3 | P14 P19 P23 P28 P31 P35 . U. S. Department of ... | 1237 | 202 | 309.25 | [-2.13978160e-02 2.99834404e-02 -2.97792666e-... |
| 3 | Community-Engagement-Toolkit_Building_Purpose_... | 4 | TOOLKIT INTRODUCTION US Department of Housing ... | 1571 | 230 | 392.75 | [-2.29254626e-02 6.30211383e-02 -1.68476477e-... |
| 4 | Community-Engagement-Toolkit_Building_Purpose_... | 5 | U. S. Department of Housing and Urban Developm... | 1676 | 257 | 419.0 | [-4.10287678e-02 6.03567548e-02 -2.53214575e-... |



We can use a vector database if the number of embeddings is large,

for example Pinecone or Neo4j, since both offer free options.
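
One caveat with the CSV approach above: `to_csv` writes each embedding array as its string representation, so after `read_csv` the `embedding` column holds text such as `[-2.65e-02 2.88e-02 ...]` rather than numbers. A minimal sketch of converting such a string back into floats, using only the standard library; `parse_embedding` is a hypothetical helper and the sample string is a shortened, made-up row:

```python
def parse_embedding(text: str) -> list[float]:
    """Turn '[v1 v2 ...]' (whitespace-separated, possibly multi-line) into floats."""
    return [float(value) for value in text.strip().strip("[]").split()]


# Shortened example of what a loaded CSV cell might look like
row = "[-2.65781600e-02  2.88857743e-02 -2.10008025e-02\n  8.79778154e-03]"
vector = parse_embedding(row)
print(len(vector), vector)
```

With pandas, the helper could be applied to the whole column via `df["embedding"].apply(parse_embedding)` before stacking the vectors into an array or tensor.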




In [8]:
embeddings.shape

torch.Size([295, 768])

In [9]:
embeddings = embeddings_df["embedding"].tolist()
embeddings

[array([-2.65781600e-02,  2.88857743e-02, -2.10008025e-02, -1.94012057e-02,
         8.79778154e-03,  4.70395312e-02,  3.40749808e-02, -2.19353065e-02,
         2.24597715e-02,  3.88242267e-02,  2.47355439e-02,  1.66259520e-02,
         2.11001001e-02,  2.13402119e-02, -6.22774661e-02, -1.68733113e-02,
        -1.57537945e-02,  4.63023707e-02, -7.45775923e-02,  3.88828665e-02,
         1.66905075e-02,  3.38276327e-02, -4.03729035e-03,  4.26370800e-02,
        -6.58789203e-02,  4.37134458e-03, -2.58870255e-02,  1.65710300e-02,
         3.31382565e-02, -3.86113338e-02,  7.16766417e-02,  1.51744569e-02,
        -4.16849256e-02,  6.30697096e-03,  2.09601853e-06, -3.08243856e-02,
         1.11437859e-02,  1.60794724e-02, -8.63101333e-02, -4.91894782e-02,
         7.00778812e-02, -3.45991105e-02,  7.99492374e-03,  1.84970908e-02,
         1.17193274e-02, -3.59803028e-02,  3.98098864e-02, -1.22310044e-02,
        -3.17217112e-02, -2.57870704e-02,  1.16351871e-02, -5.08566247e-03,
        -5.9

# Please refer to RAG_LLM for further work 

### The material below is for reference

## video reference 
The main concerns are which vector database to use and what threshold to set.

Even if we replicate the embeddings 100x or 1000x, the search is still fast on a GPU.
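
Exhaustive search here means scoring the query against every stored embedding and keeping the best k. A pure-Python sketch of that idea follows. The function name `top_k_dot` and the tiny 3-dimensional vectors are made up for illustration; in this project the same computation would run as a single matrix product on the GPU rather than a Python loop.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_dot(query: list[float], embeddings: list[list[float]], k: int = 5) -> list[tuple[float, int]]:
    """Score the query against every embedding and return the k best (score, index) pairs."""
    scores = [(dot(query, emb), idx) for idx, emb in enumerate(embeddings)]
    return heapq.nlargest(k, scores)


# Made-up toy vectors; real embeddings in this notebook have 768 dimensions
embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

for score, idx in top_k_dot(query, embeddings, k=2):
    print(f"Score: {score:.4f} | index: {idx}")
```

The cost grows linearly with the number of embeddings, which is why an index only becomes worthwhile once the collection is very large.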

In [15]:
# larger_embeds = torch.randn(100*embeddings.shape[0], 768).to(device) # or even 10000
## Exhaustive search is still very fast at this scale

### With more than about 10M embeddings, an index such as FAISS can be used instead
### FAISS performs vector search; a related approach is k-nearest-neighbor search


# start_time = timer()
# dot_scores = util.dot_score(a=query_embedding, b=larger_embeds)[0]
# end_time = timer()


# print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds")


In [16]:
## Helper to wrap long text so it prints readably

import textwrap

def print_wraps(text, wrap_length=100):
    wrap_text = textwrap.fill(text, wrap_length)
    print(wrap_text)

In [17]:
## now wrapping multiple iterables (eg: lists, tuples or strings)


query = "Community engaged research"
print(f"Query: '{query}'\n")
print("Results: ")


for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score: .4f}")
    print("Text:")
    
    print_wraps(pages_and_chunks[idx]["sentence_chunk"])
    print("\n")


Query: 'Community engaged research'

Results: 
Score:  0.8392
Text:
2 Community-based research (CBR) is in high demand. More and more, communities and academic
researchers are partnering in order to learn about and address real-world issues. CBR is being used
to: • Translate scientific knowledge into practice • Support organizing and movement building •
Impact policy • Guide community and economic development • Foster learning and personal
transformation • Build trust with communities harmed by past research •Improve organizations •
Strengthen communities •Enrich our understanding of the world  Today, we are wrestling with deep-
rooted inequities and global challenges that defy simple answers. CBR can be a powerful way to
address these challenges by harness-ing our collective knowledge and resources. Unfortunate-ly, not
everything that goes under the name “community-based research” lives up to the promise. More support
is needed to help this work flourish.


Score:  0.8302
Text:
Commun

For more effective ranking, we can add an efficient re-ranking step that takes the top 25 semantic search results and re-orders them.
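
A minimal sketch of such a two-stage setup, under stated assumptions: `keyword_overlap` is a hypothetical stand-in for a real re-ranking model, `rerank` is an illustrative helper, and the candidate passages and first-stage scores are toy values rather than results from this notebook.

```python
def keyword_overlap(query: str, passage: str) -> float:
    """Stand-in second-stage scorer: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, candidates, scorer=keyword_overlap, top_n=3):
    """Re-score (first_stage_score, text) candidates and return the top_n by the new score.

    Ties on the new score are broken by the first-stage score.
    """
    rescored = [(scorer(query, text), first_score, text) for first_score, text in candidates]
    rescored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return rescored[:top_n]


# Toy first-stage results (assumption): (similarity score, passage text)
candidates = [
    (0.84, "Community-based research is in high demand."),
    (0.83, "Funding cycles for academic projects vary widely."),
    (0.80, "Communities and researchers are partnering on community engaged research."),
]

query = "community engaged research"
for new_score, first_score, text in rerank(query, candidates, top_n=2):
    print(f"rerank={new_score:.2f} first_stage={first_score:.2f} | {text}")
```

In practice the first stage would return the top 25 results from the embedding search, and the scorer would be a model that reads the query and passage together.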

https://www.mixedbread.ai/
https://pymupdf.readthedocs.io/en/latest/pixmap.html

## Another section:

We can also show the PDF page that best matches the query. This is another way of presenting the results.

Approach
1. Pass in the PDF so it can be read
2. Use PyMuPDF's document-loading and pixmap functions (or other related functions)
3. Load the PDF and the page of interest
4. Render the page as an image -> get_pixmap(dpi=300)
5. Convert the image to img_array -> np.frombuffer, reshaped to (h, w, n)
6. Display the array with matplotlib pyplot

In [20]:
# import fitz

# pdf_path = "load.pdf"

# doc = fitz.open(pdf_path)
# page = doc.load_page()

# img = page.get_pixmap(dpi=300)

# ##save image
# img.save("filename.png")
# doc.close()

# img_array = np.frombuffer(img.samples_mv, dtype=np.uint8).reshape((img.h, img.w, img.n))



# ## displaying the image
# import matplotlib.pyplot as plt 
# plt.figure(figsize=(13,10))
# plt.imshow(img_array)
# plt.title(f"Query: '{query}' | Most relevant page: ")
# plt.axis("off")
# plt.show()






## Similarity measures: dot product and cosine (example showcase)

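Two common similarity measures between vectors are the dot product and cosine similarity.

Vectors that are closer together get higher scores; vectors that are further apart get lower scores.

A vector has a direction (which way it points in the embedding space) and a magnitude (how long it is). The dot product depends on both, while cosine similarity divides by the magnitudes and so depends only on direction.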


In [23]:
## understanding the concepts

import torch

def dot_product(vec1, vec2):
    return torch.dot(vec1, vec2)

def cosine_sim(vec1, vec2):
    dot_prdt = torch.dot(vec1, vec2)

    norm_vec1 = torch.sqrt(torch.sum(vec1**2))
    norm_vec2 = torch.sqrt(torch.sum(vec2**2))
    
    return dot_prdt / (norm_vec1 * norm_vec2)


# example evaluation

vec1= torch.tensor([1,2,3], dtype=torch.float32)
vec2= torch.tensor([1,2,3], dtype=torch.float32)
vec3= torch.tensor([7,8,9], dtype=torch.float32) # larger values, so the magnitude is larger too
vec4= torch.tensor([-1,-2,-4], dtype=torch.float32)



print("Dot product between vector1 and vector2:", dot_product(vec1, vec2))
print("Dot product between vector1 and vector3:", dot_product(vec1, vec3))
print("Dot product between vector1 and vector4:", dot_product(vec1, vec4))

# Cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_sim(vec1, vec2))
print("Cosine similarity between vector1 and vector3:", cosine_sim(vec1, vec3))
print("Cosine similarity between vector1 and vector4:", cosine_sim(vec1, vec4))



Dot product between vector1 and vector2: tensor(14.)
Dot product between vector1 and vector3: tensor(50.)
Dot product between vector1 and vector4: tensor(-17.)
Cosine similarity between vector1 and vector2: tensor(1.0000)
Cosine similarity between vector1 and vector3: tensor(0.9594)
Cosine similarity between vector1 and vector4: tensor(-0.9915)
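
As a check on the printed values, here is the arithmetic for vec1 and vec3 worked out by hand, using the vectors defined in the cell above:

$$
\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i,
\qquad
\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
$$

$$
\text{vec1}\cdot\text{vec3} = 1\cdot 7 + 2\cdot 8 + 3\cdot 9 = 50,
\qquad
\lVert\text{vec1}\rVert = \sqrt{1+4+9} = \sqrt{14},
\qquad
\lVert\text{vec3}\rVert = \sqrt{49+64+81} = \sqrt{194}
$$

$$
\cos(\text{vec1},\text{vec3}) = \frac{50}{\sqrt{14}\,\sqrt{194}} \approx 0.9594
$$

The dot product of vec1 with vec3 (50) is larger than with itself (14) only because vec3 is longer; cosine similarity removes that effect, so the identical pair scores 1.0.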


##### Wrapping the search in functions so it can be called and reused easily

In [18]:

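def retrieve_relevant_resources(query: str, 
                                embeddings: torch.Tensor,
                                model: SentenceTransformer = embedding_model,
                                n_resources_return: int=5,
                                print_time: bool=True):
    """
    Embeds the query with the model and returns the top-k dot-product scores and indices from embeddings.
    """
    # Embed the query with the same model that was used for the passages
    query_embedding = model.encode(query, convert_to_tensor=True)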

    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()
    
    if print_time:
        print(f"[INFO] Time taken to get scores on ({len(embeddings)}) embeddings: {end_time-start_time: .5f} seconds.")

    scores, indices = torch.topk(input= dot_scores, k=n_resources_return)

    return scores, indices


def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_return: int=5):
    """
    Finds relevant passages given a query and prints them out along with their scores.
    """
    scores, ind = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_return=n_resources_return)

    # Loop through zipped together scores and indices from torch.topk
    for score, idx in zip(scores, ind):
        print(f"Score: {score:.4f}")
        print("Text:")
        print_wraps(pages_and_chunks[idx]["sentence_chunk"])
        print("\n")



In [19]:
query = "community engaged research"

print_top_results_and_scores(query=query, embeddings=embeddings)

[INFO] Time taken to get scores on (295) embeddings:  0.00009 seconds.
Score: 0.2065
Text:
What internalized biases or assumptions have you come up against in this work? •What abilities or
capacities did you discover in yourself?


Score: 0.1975
Text:
Community Engagement Toolkit PART 1 - Introduction Page 13 U. S. Department of Housing and Urban
Development Picture this: A city holds a public hearing to invite feedback on a proposal for a large
mixed- use development in a larger neighborhood. The City informs the community about the meeting
through neighborhood online forums on Facebook. In theory, all community stakeholders are invited to
participate. In reality, not everyone has access to the information. Lower-income community members
may not have access to the Internet or forums in which the city is posting. New community members or
immigrant community members may not have been invited to participate in certain social groups, and
senior community members may not have Facebook at a

## Local LLM

We have to decide which model to load and which parameters it requires. 
https://github.com/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb 

The notebook linked above includes a table comparing LLMs of different sizes at different levels of numerical precision.

<b>Quantization reduces the size of the model by storing its weights at lower precision</b>

According to the tutorial, the model to use would be Gemma 7B.

The amount of GPU memory (VRAM) determines which model and precision will fit.
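
A rough back-of-the-envelope sketch of why precision matters for VRAM. The parameter counts (2e9 and 7e9) are nominal assumptions, not exact Gemma sizes; `estimate_weight_memory_gb` is a hypothetical helper; and the estimate covers the weights only, not activations or the attention cache.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def estimate_weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)


# Nominal parameter counts (assumption): "2B" and "7B" taken literally
for name, num_params in [("2B model", 2e9), ("7B model", 7e9)]:
    for precision in BYTES_PER_PARAM:
        gb = estimate_weight_memory_gb(num_params, precision)
        print(f"{name} in {precision}: ~{gb:.1f} GB")
```

Halving the bytes per parameter halves the weight memory, which is why a 4-bit model can fit on a GPU that cannot hold the same model in float16.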



In [2]:
## checking GPU

import torch

gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes/ (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")


Available GPU memory: 6 GB


In [1]:
!nvidia-smi

Sat Jan 11 14:26:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17                 Driver Version: 561.17         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1660 Ti   WDDM  |   00000000:01:00.0  On |                  N/A |
| N/A   48C    P0             24W /   80W |     536MiB /   6144MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

To use Gemma AI we have to accept terms and conditions in: 
https://huggingface.co/google/gemma-7b

In [46]:
## small snippet from https://github.com/mrdbourke/simple-local-rag/blob/main/video_notebooks/00-simple-local-rag-video.ipynb

# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb >= 19.0:  # >= so that exactly 19 GB is covered too
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 6 | Recommended model: Gemma 2B in 4-bit precision.
use_quantization_config set to: True
model_id set to: google/gemma-2b-it


Quantization techniques:
1. Reduces memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). 

2. This enables loading larger models you normally wouldn’t be able to fit into memory, and speeding up inference.

3. Transformers supports the AWQ and GPTQ quantization algorithms and it supports 8-bit and 4-bit quantization with bitsandbytes.
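
To make the idea of lower-precision weights concrete, here is a toy sketch of absmax int8 quantization. It only illustrates the principle; it is not how bitsandbytes, AWQ or GPTQ work internally, and the function names and weight values are made up for the example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale for the whole list."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.03, 0.9, -0.27]  # toy values (assumption)
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

print("quantized:", quantized)
print("scale:", scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f} (error {abs(original - approx):.4f})")
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per value.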

To get a model running locally, we need a few things:

1. A quantization config (optional) -> a config specifying what precision to load the model in (e.g. 8-bit, 4-bit)

2. A model ID -> tells transformers which model/tokenizer to load
3. A tokenizer -> turns text into numbers ready for the LLM (a tokenizer is different from an embedding model)

4. An LLM -> the model we will use to generate text based on the input


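Flash Attention 2 (the flash_attn package) can make the LLM faster, but the check below only enables it on GPUs with compute capability 8 or higher. Since I have an older GPU, I can only use PyTorch's scaled dot product attention (sdpa).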
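
For reference, scaled dot product attention itself is a small computation: softmax(QK^T / sqrt(d_k)) V. The sketch below is a plain-Python illustration with toy inputs and hypothetical function names, not the optimized kernel that PyTorch or flash_attn runs; Flash Attention computes the same result in a more memory-efficient way.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy inputs (assumption): one query, two keys, two values
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(scaled_dot_product_attention(queries, keys, values))
```

The query matches the first key more closely, so the output leans toward the first value vector.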

In [56]:
pip install transformers accelerate bitsandbytes


Note: you may need to restart the kernel to use updated packages.





In [None]:
pip install bitsandbytes

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig

# Define quantization configuration
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

# Enable or disable quantization
use_quantization_config = True

# Check for Flash Attention 2 compatibility
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"Using attention implementation: {attn_implementation}")


model_id = "google/gemma-2-2b-it"  # Ensure this is a valid model ID or local path

# Instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_id)

# Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config if use_quantization_config else None,
    low_cpu_mem_usage=True,
    attn_implementation=attn_implementation,
    device_map="auto"
)

# Move model to GPU if not using quantization
if not use_quantization_config:
    llm_model.to("cuda")



Using attention implementation: sdpa



To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")


In [None]:
# import torch
# from transformers import AutoTokenizer
# from transformers.utils import is_flash_attn_2_available
# from transformers import BitsAndBytesConfig
# from langchain_community.llms import Ollama

# # 1. Create a quantization config (optional)
# quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# # 2. Check for Flash Attention 2 support
# if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
#     attn_implementation = "flash_attention_2"
# else:
#     attn_implementation = "sdpa"
# print(f"Using attention implementation: {attn_implementation}")

# # 3. Load the tokenizer (use a Hugging Face tokenizer if you need it for preprocessing)
# model_id = "meta-llama/Llama-2-7b-chat-hf"  # Replace with your local Ollama equivalent
# tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# # 4. Load the model via Ollama
# ollama = Ollama(base_url="http://localhost:11434", model="llama2")

# # 5. Define a function to generate text using Ollama
# def generate_text(prompt: str):
#     # invoke() returns the generated text as a plain string
#     return ollama.invoke(prompt)

# # 6. Test the generation
# query = "What are the key benefits of community-based research?"
# output = generate_text(query)
# print(f"Response: {output}")


# # Use a pipeline as a high-level helper
# from transformers import pipeline

# messages = [
#     {"role": "user", "content": "Who are you?"},
# ]
# pipe = pipeline("text-generation", model="google/gemma-2-2b-it")
# pipe(messages)

In [None]:
from langchain_community.llms import Ollama

# Define the base URL and model to use in Ollama
ollama = Ollama(base_url="http://localhost:11434", model="llama2")

# Prompt to send to the model
prompt = "What are examples of community-based research?"

# Generate response from the model (invoke() returns the generated text as a plain string)
response = ollama.invoke(prompt)
print(f"Response:\n{response}")


In [5]:
def get_model_mem_size(model: torch.nn.Module):
    """
    Get how much memory a PyTorch model takes up.

    See: https://discuss.pytorch.org/t/gpu-memory-that-model-uses/56822
    """
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate various model sizes
    model_mem_bytes = mem_params + mem_buffers # in bytes
    model_mem_mb = model_mem_bytes / (1024**2) # in megabytes
    model_mem_gb = model_mem_bytes / (1024**3) # in gigabytes

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 2192283136, 'model_mem_mb': 2090.72, 'model_mem_gb': 2.04}

In [6]:
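def get_model_mem_size(model: torch.nn.Module, precision="float32"):
    """
    Get a rough estimate of how much memory a PyTorch model takes up in different precisions.

    Note: the scaling assumes the loaded weights are float32. If the model was already
    loaded in float16 or 4-bit, the scaled numbers underestimate the real size.
    """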
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])
    
    # Calculate memory in bytes
    total_mem = mem_params + mem_buffers
    
    if precision == "float16":
        total_mem = total_mem / 2
    elif precision == "4bit":
        total_mem = total_mem / 8

    # Convert to MB and GB
    mem_mb = total_mem / (1024 ** 2)
    mem_gb = total_mem / (1024 ** 3)

    return {"model_mem_bytes": total_mem, "model_mem_mb": round(mem_mb, 2), "model_mem_gb": round(mem_gb, 2)}

# Test the model in different precisions
print(get_model_mem_size(llm_model, precision="float32"))
print(get_model_mem_size(llm_model, precision="float16"))
print(get_model_mem_size(llm_model, precision="4bit"))


{'model_mem_bytes': 2192283136, 'model_mem_mb': 2090.72, 'model_mem_gb': 2.04}
{'model_mem_bytes': 1096141568.0, 'model_mem_mb': 1045.36, 'model_mem_gb': 1.02}
{'model_mem_bytes': 274035392.0, 'model_mem_mb': 261.34, 'model_mem_gb': 0.26}


### Generating text with local LLM

1. Generate text with the LLM instance by calling its generate() method on tokenized input

2. The tokenized input comes from passing a string of text to our tokenizer

3. The tokenizer must be the one paired with the LLM being used

4. Some LLMs, especially instruction-tuned ones, expect a specific prompt template to give good outputs

5. The tokenizer's apply_chat_template() method formats a conversation into that template (and can also tokenize it)
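
To show what a chat template does to the text, the sketch below rebuilds by hand the layout of the formatted Gemma prompt that this notebook prints. `format_gemma_style_prompt` is a hypothetical helper that handles simple user turns only; the real template ships with the tokenizer and should be applied through apply_chat_template().

```python
def format_gemma_style_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper mimicking the prompt layout printed in this notebook."""
    parts = ["<bos>"]
    for message in messages:
        # Each turn is wrapped in start/end markers with the role on the first line
        parts.append(f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so the LLM continues with its answer
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


dialogue_template = [
    {"role": "user", "content": "What are examples community based research?"}
]
prompt_text = format_gemma_style_prompt(dialogue_template)
print(prompt_text)
```

The template only adds markers around the text; turning the resulting string into token IDs is a separate tokenizer step.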

In [15]:
input_text = "What are examples community based research?"
print(f"Input text: \n {input_text}")

# prompt template for instruction tuned model 
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

prompt = tokenizer.apply_chat_template(conversation=dialogue_template, 
                                       tokenize = False, # not tokenized content
                                       add_generation_prompt = True)

print(f"\n Prompt (formatted) :\n {prompt}")

Input text: 
 What are examples community based research?

 Prompt (formatted) :
 <bos><start_of_turn>user
What are examples community based research?<end_of_turn>
<start_of_turn>model



In [1]:
%%time

# # Tokenize the input text (turn it into numbers) and send it to GPU
# input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
# print(f"Model input (tokenized):\n{input_ids}\n")

# # Generate outputs based on the tokenized input
# # See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig 
# outputs = llm_model.generate(**input_ids,
#                              max_new_tokens=256) # define the maximum number of new tokens to create
# print(f"Model output (tokens):\n{outputs[0]}\n")


from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it").to("cuda")

# Define your prompt
prompt = "<start_of_turn>user\nWhat are examples of community-based research?<end_of_turn>\n<start_of_turn>model\n"

# Tokenize the input and move to GPU
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate the output
outputs = model.generate(
    **input_ids,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    no_repeat_ngram_size=3,
    do_sample=True,  # temperature, top_k and top_p only take effect when sampling is enabled
    temperature=0.7,
    top_k=50,
    top_p=0.9
)

# Decode the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Model output (decoded):\n{decoded_output}")


RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 2359296000 bytes.

In [19]:
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What are examples community based research?<end_of_turn>
<start_of_turn>model
CommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunityCommunity Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community Community 