# Local Retrieval Augmented Generation (RAG) from Scratch (step by step tutorial)
- [video link](https://www.youtube.com/watch?v=qN_2fnOPY-M&t=513s)
- [source code](https://github.com/mrdbourke/simple-local-rag)

## Requirements and setup
- Check whether you have a GPU
- Environment setup
- Data source (e.g. PDF)
- Internet connection
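
A quick way to tick off the first requirement is sketched below. It assumes PyTorch may or may not be installed (the list above doesn't name a framework, so this is only an assumption) and falls back to looking for NVIDIA's `nvidia-smi` tool; `gpu_available` is a hypothetical helper name.

```python
import shutil

def gpu_available() -> bool:
    """Best-effort check for a CUDA-capable GPU."""
    try:
        import torch  # optional dependency, only used if it is installed
        return torch.cuda.is_available()
    except ImportError:
        # Fallback heuristic: NVIDIA drivers normally ship the nvidia-smi tool
        return shutil.which("nvidia-smi") is not None

print(f"GPU available: {gpu_available()}")
```

The fallback is only a heuristic: finding `nvidia-smi` on the path does not guarantee that a deep learning framework can actually use the GPU.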

## Import PDF Document
Download the PDF file if it is not already available locally.

In [35]:
import os
import urllib.request

# Directory that holds the data
data_dir = 'data'

# Path of the PDF document
pdf_path = os.path.join(data_dir, 'Jamie Ward - The Student’s Guide to Cognitive Neuroscience-Routledge (2020).pdf')

if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Make sure the target directory exists before downloading into it
    os.makedirs(data_dir, exist_ok=True)

    # URL of the PDF
    url = "https://download.library.lol/main/3042000/8fa1d36b0def1145a47a1542b8c29e7e/Jamie%20Ward%20-%20The%20Student%E2%80%99s%20Guide%20to%20Cognitive%20Neuroscience-Routledge%20%282020%29.pdf"

    urllib.request.urlretrieve(url, pdf_path)
    print('[INFO] File is downloaded')
else:
    print(f'File {pdf_path} exists.')



File data/Jamie Ward - The Student’s Guide to Cognitive Neuroscience-Routledge (2020).pdf exists.


## Processing PDF File
Use [PyMuPDF](https://github.com/pymupdf/pymupdf) to open PDFs.

In [3]:
import fitz # requires !pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # replace newlines with spaces and strip leading/trailing whitespace

    # Potentially more text formatting functions can go here
    return cleaned_text

# This only focuses on text, rather than images/figures etc.
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
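    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text() # get plain text encoded as UTF-8
        text = text_formatter(text=text)
        pages_and_texts.append({
            # Shift by 13 so numbering follows the book rather than the PDF index; the page headers in
            # the sampled outputs (e.g. '70' on page_number 69) suggest printed page = page_number + 1
            "page_number": page_number - 13,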
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_setence_count_raw": len(text.split(". ")),
            "page_token_count": len(text)/4, # 1 token ~= 4 characters
            "text": text,
        })

    return pages_and_texts


pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

539it [00:01, 472.31it/s]


[{'page_number': -13,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_setence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': -12,
  'page_char_count': 2472,
  'page_word_count': 387,
  'page_setence_count_raw': 11,
  'page_token_count': 618.0,
  'text': 'The Student’s Guide to   Cognitive Neuroscience Reflecting recent changes in the way cognition and the brain are studied, this  thoroughly updated fourth edition of this bestselling textbook provides a  comprehensive and student-friendly guide to cognitive neuroscience. Jamie  Ward provides an easy-to-follow introduction to neural structure and function,  as well as all the key methods and procedures of cognitive neuroscience, with  a view to helping students understand how they can be used to shed light on  the neural basis of cognition. The book presents a comprehensive overview of the latest theories and  findings in all the key topics in cognitive neuroscience, including vision,  hearing, attentio

## Preview sample

In [5]:
import random

random.sample(pages_and_texts, k=1)

[{'page_number': 69,
  'page_char_count': 2234,
  'page_word_count': 373,
  'page_setence_count_raw': 15,
  'page_token_count': 558.5,
  'text': '70\u2003 THE STUDENT’S GUIDE TO COGNITIVE NEUROSCIENCE either had or had not been diagnosed as schizophrenic). Although both groups  showed a number of similar frontal and temporal lobe activities, there was a  strong correlation between activity in these regions in controls and a striking  absence of correlation in the schizophrenics. Friston and Frith (1995) argued  that schizophrenia is best characterized in terms of a failure of communication  between distant brain regions (i.e., a functional disconnection). One commonly used procedure for measuring functional integration  does not use any task at all. These are known as resting state paradigms.  Participants are merely asked to lie back and rest. In the absence of a  task, the fluctuations in brain activity are little more than noise. However,  in brain regions that are functionally conn

## Preview RAG data

In [6]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_setence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -13 | 0 | 1 | 1 | 0.0 | |
| 1 | -12 | 2472 | 387 | 11 | 618.0 | The Student’s Guide to Cognitive Neuroscienc... |
| 2 | -11 | 647 | 99 | 2 | 161.75 | The Student’s Guide to Cognitive Neuroscienc... |
| 3 | -10 | 75 | 13 | 1 | 18.75 | THE STUDENT’S GUIDE TO COGNITIVE NEUROSCIE... |
| 4 | -9 | 1461 | 238 | 5 | 365.25 | Fourth edition published 2020 by Routledge 2 P... |


## Statistics

In [7]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_setence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 |
| mean | 256.0 | 3082.05 | 514.65 | 35.69 | 770.51 |
| std | 155.74 | 1115.53 | 181.24 | 47.41 | 278.88 |
| min | -13.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 121.5 | 2382.5 | 406.0 | 16.0 | 595.62 |
| 50% | 256.0 | 3121.0 | 525.0 | 21.0 | 780.25 |
| 75% | 390.5 | 3709.5 | 624.0 | 25.0 | 927.38 |
| max | 525.0 | 5251.0 | 871.0 | 194.0 | 1312.75 |


## Token count
Why would we care about token count?

Token count is important to think about because:
1. Embedding models don't deal with infinite tokens.
2. LLMs don't deal with infinite tokens.

For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: [pretrained_model](https://www.sbert.net/docs/pretrained_models.html)).

As for LLMs, they can't accept infinite tokens in their context window.
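
As a rough illustration of why this matters here, the sketch below applies the notebook's rule of thumb (1 token ~= 4 characters) to a stand-in for an average page and compares it with a 384-token limit. `estimate_tokens` and `fits_in_window` are hypothetical helper names, and the estimate is not a real tokenizer.

```python
# Rule of thumb used in this notebook: 1 token ~= 4 characters
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> float:
    """Rough token estimate from the character count (not a real tokenizer)."""
    return len(text) / CHARS_PER_TOKEN

def fits_in_window(text: str, max_tokens: int = 384) -> bool:
    """True if the estimated token count is within max_tokens."""
    return estimate_tokens(text) <= max_tokens

# Hypothetical stand-in for an average page (about 3082 characters, the mean in the statistics)
average_page = "x" * 3082
print(estimate_tokens(average_page))  # 770.5
print(fits_in_window(average_page))   # False
```

Under this estimate an average page is roughly twice the size of a 384-token window, which is one reason to break pages into smaller pieces before embedding them.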

## Further text processing (splitting pages into sentences)
There are two ways to do this:
1. Split on `". "`, as we did above for the raw sentence count.
2. Use an NLP library such as [spaCy](https://spacy.io/usage) or [nltk](https://www.nltk.org/).
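
To see where the first approach can go wrong, here is a small sketch on an illustrative caption-style string (chosen for this example, in the style of the book's text). Splitting on a full stop followed by a space also breaks at abbreviations such as "et al.".

```python
# Illustrative string in the style of the book's figure captions
text = "From Barraclough et al. (2005). Reproduced with permission."

# Naive approach: split wherever a full stop is followed by a space
naive_sentences = text.split(". ")

print(naive_sentences)
# ['From Barraclough et al', '(2005)', 'Reproduced with permission.']
print(len(naive_sentences))  # 3, although a reader would count 2 sentences
```

This kind of overcounting is why a dedicated sentence splitter is worth trying, although no rule-based splitter should be assumed to be perfect either.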


In [8]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create document instance
doc = nlp("Cognitive Neuroscience Reflecting recent changes in the way.")
assert len(list(doc.sents)) == 1

# Print out our sentences split
list(doc.sents)

[Cognitive Neuroscience Reflecting recent changes in the way.]

In [9]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (the default type is a spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


100%|██████████| 539/539 [00:02<00:00, 224.46it/s]


### Select random sample

In [10]:
random.sample(pages_and_texts, k=1)

[{'page_number': 487,
  'page_char_count': 5046,
  'page_word_count': 803,
  'page_setence_count_raw': 159,
  'page_token_count': 1261.5,
  'text': '488\u2003 References Meaney, M. J. (2001). Maternal care, gene expression,  and the transmission of individual differences in  stress reactivity across generations. Annual Review of  Neuroscience, 24, 1161–1192. doi: 10.1146/annurev. neuro.24.1.1161. Mechelli, A., Gorno-Tempini, M. L., & Price, C.  J. (2003). Neuroimaging studies of word and  pseudoword reading: Consistencies, inconsistencies,  and limitations. Journal of Cognitive Neuroscience,  15, 260–271. Mechelli, A., Josephs, O., Ralph, M. A. L., McClelland, J. L.,   & Price, C. J. (2007). Dissociating stimulus-driven  semantic and phonological effect during reading and  naming. Human Brain Mapping, 28(3), 205–217. Medina, J., & Fischer-Baum, S. (2017). Single-case  cognitive neuropsychology in the age of big data.  Cognitive Neuropsychology, 34(7–8), 440–448. doi:  10.1080/02643294.

In [11]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_setence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 |
| mean | 256.0 | 3082.05 | 514.65 | 35.69 | 770.51 | 29.35 |
| std | 155.74 | 1115.53 | 181.24 | 47.41 | 278.88 | 23.33 |
| min | -13.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 121.5 | 2382.5 | 406.0 | 16.0 | 595.62 | 18.0 |
| 50% | 256.0 | 3121.0 | 525.0 | 21.0 | 780.25 | 23.0 |
| 75% | 390.5 | 3709.5 | 624.0 | 25.0 | 927.38 | 29.0 |
| max | 525.0 | 5251.0 | 871.0 | 194.0 | 1312.75 | 102.0 |


### Chunking our sentences together
The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no 100% correct way to do this.

We'll keep it simple and split into groups of 10 sentences (however, you could also try 5, 7, 8, whatever you like).

There are frameworks such as LangChain which can help with this, however, we'll stick with Python for now.
See: https://python.langchain.com/docs/modules/data_connection/document_transformers


Why we do this:
1. So our texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).
2. So our text chunks can fit into our embedding model context window (e.g. 384 tokens as a limit).
3. So our contexts passed to an LLM can be more specific and focused.

In [20]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size (plain slicing, no recursion)
def split_list(input_list: list[str], slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"], slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])





100%|██████████| 539/539 [00:00<00:00, 180209.63it/s]
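
To sanity-check the slicing logic on its own, the sketch below repeats the helper from the cell above and runs it on placeholder sentences (the sentence strings are made up for illustration).

```python
import math

def split_list(input_list: list, slice_size: int) -> list[list]:
    """Same slicing helper as in the cell above, repeated so this runs on its own."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Hypothetical page with 17 placeholder "sentences"
sentences = [f"Sentence {n}." for n in range(1, 18)]
chunks = split_list(sentences, slice_size=10)

print([len(chunk) for chunk in chunks])  # [10, 7]

# The number of chunks is the sentence count divided by the chunk size, rounded up
print(len(chunks) == math.ceil(len(sentences) / 10))  # True

# A page without sentences produces no chunks at all
print(split_list([], slice_size=10))  # []
```

The last chunk of a page is simply whatever is left over, so chunk sizes vary, and a page with no detected sentences contributes zero chunks.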


In [21]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 349,
  'page_char_count': 3446,
  'page_word_count': 547,
  'page_setence_count_raw': 20,
  'page_token_count': 861.5,
  'text': '350\u2003 THE STUDENT’S GUIDE TO COGNITIVE NEUROSCIENCE (e.g., Chinese)? The evidence suggests that the same reading system is  indeed used across other languages (Rueckl, et al., 2015), but the different  routes and components may be weighted differently according to the  culture-specific demands. Functional imaging suggests that reading uses similar brain regions across  different languages, albeit to varying degrees. Italian speakers activate more  strongly areas involved in phonemic processing when reading words, whereas  English speakers activate more strongly regions implicated in lexical retrieval  (Paulesu et al., 2000). Studies of Chinese speakers also support a common  network for reading Chinese logographs and reading Roman-alphabetic  transcriptions of Chinese (the latter being a system, called pinyin, used to  help in teaching C

In [40]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_setence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 | 539.0 |
| mean | 256.0 | 3082.05 | 514.65 | 35.69 | 770.51 | 29.35 | 3.38 |
| std | 155.74 | 1115.53 | 181.24 | 47.41 | 278.88 | 23.33 | 2.35 |
| min | -13.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 121.5 | 2382.5 | 406.0 | 16.0 | 595.62 | 18.0 | 2.0 |
| 50% | 256.0 | 3121.0 | 525.0 | 21.0 | 780.25 | 23.0 | 3.0 |
| 75% | 390.5 | 3709.5 | 624.0 | 25.0 | 927.38 | 29.0 | 3.0 |
| max | 525.0 | 5251.0 | 871.0 | 194.0 | 1312.75 | 102.0 | 11.0 |


### Splitting each chunk into its own item
1. We'd like to embed each chunk of sentences into its own numerical representation.
2. To prepare for that, we create a new list of dictionaries, each containing a single chunk of sentences with relevant information such as the page number, as well as statistics about the chunk.

In [33]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        # Insert the missing space after a full stop that is directly followed by a capital letter
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

100%|██████████| 539/539 [00:00<00:00, 12276.97it/s]


1821
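
The join-and-repair step can be tried in isolation. The sketch below uses made-up sentence strings (an assumption about what the split sentences might look like) to show what the regular expression from the cell above does and does not fix.

```python
import re

# Hypothetical sentence strings, standing in for one chunk of split sentences
sentence_chunk = [
    "Dysgraphia is a difficulty in spelling and writing.",
    "It can follow brain damage.",
    "Is it common?",
    "Cases vary.",
]

# Joining without a separator glues the sentences together
joined = "".join(sentence_chunk)

# Same pattern as in the cell above: add a space between a full stop and a capital letter
fixed = re.sub(r'\.([A-Z])', r'. \1', joined)

print(fixed)
# Dysgraphia is a difficulty in spelling and writing. It can follow brain damage. Is it common?Cases vary.
```

Note that the pattern only looks at full stops, so the boundary after the question mark stays glued together. It would also insert spaces inside strings such as "U.S.A", which is a trade-off of keeping the rule this simple.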

In [44]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 113,
  'sentence_chunk': 'Are these assumptions plausible? • •Critically evaluate the role of group studies in neuropsychological research. • •What are the advantages and disadvantages of using single cases to draw inferences about normal cognitive functioning? • •How have TMS and tDCS studies contributed to our knowledge of brain plasticity? • •Compare and contrast lesion methods arising from organic brain damage with TMS and tES. ONLINE RESOURCES Visit the companion website at www.routledge.com/cw/ward for: • • References to key papers and readings • • Video lectures and interviews on key topics with leading psychologist Elizabeth Warrington and author Jamie Ward, as well as demonstrations of and lectures on brain stimulation • • Multiple-choice questions and interactive flashcards to test your knowledge • • Downloadable glossary',
  'chunk_char_count': 817,
  'chunk_word_count': 122,
  'chunk_token_count': 204.25}]

In [46]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1821.0 | 1821.0 | 1821.0 | 1821.0 |
| mean | 310.96 | 897.43 | 138.21 | 224.36 |
| std | 161.79 | 498.37 | 76.85 | 124.59 |
| min | -12.0 | 5.0 | 1.0 | 1.25 |
| 25% | 169.0 | 523.0 | 77.0 | 130.75 |
| 50% | 338.0 | 829.0 | 130.0 | 207.25 |
| 75% | 466.0 | 1270.0 | 200.0 | 317.5 |
| max | 524.0 | 4439.0 | 587.0 | 1109.75 |


### Filter chunks of text for short chunks
Remove chunks that are very short, since these may not contain much useful information.

In [47]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 23.25 | Text: For example, the superior temporal sulcus lies between the superior and medial temporal gyri.
Chunk token count: 28.0 | Text: From Barraclough et al. (2005). © 2005 by the Massachusetts Institute of Technology. Reproduced with permission.
Chunk token count: 12.0 | Text: Dysgraphia Difficulties in spelling and writing.
Chunk token count: 11.25 | Text: A comprehensive selection of advanced topics.
Chunk token count: 18.25 | Text: THE STUDENT’S GUIDE  TO COGNITIVE NEUROSCIENCE JAMIE WARD Fourth Edition


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.

In [48]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -12,
  'sentence_chunk': 'The Student’s Guide to  Cognitive Neuroscience Reflecting recent changes in the way cognition and the brain are studied, this thoroughly updated fourth edition of this bestselling textbook provides a comprehensive and student-friendly guide to cognitive neuroscience. Jamie Ward provides an easy-to-follow introduction to neural structure and function, as well as all the key methods and procedures of cognitive neuroscience, with a view to helping students understand how they can be used to shed light on the neural basis of cognition. The book presents a comprehensive overview of the latest theories and findings in all the key topics in cognitive neuroscience, including vision, hearing, attention, memory, speech and language, numeracy, executive function, social and emotional behavior and developmental neuroscience. Throughout, case studies, newspaper reports, everyday examples and student- friendly pedagogy are used to help students understand t
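
For readers who prefer to see the filter without pandas, here is a minimal sketch on a few hypothetical records (the token counts mirror values seen in the outputs above, plus one made-up record that sits exactly on the threshold).

```python
min_token_length = 30

# Hypothetical records in the same shape as the chunk dictionaries
chunks = [
    {"sentence_chunk": "Dysgraphia Difficulties in spelling and writing.", "chunk_token_count": 12.0},
    {"sentence_chunk": "A made-up chunk sitting exactly on the threshold.", "chunk_token_count": 30.0},
    {"sentence_chunk": "A longer chunk with real content ...", "chunk_token_count": 204.25},
]

# Keep only chunks strictly above the threshold, matching the `>` comparison used above
kept = [chunk for chunk in chunks if chunk["chunk_token_count"] > min_token_length]

print([chunk["chunk_token_count"] for chunk in kept])  # [204.25]
```

Because the comparison is strict, a chunk with exactly 30 estimated tokens is dropped along with the shorter ones.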

### Embedding our text chunks
- text -> numbers
- texts with similar meanings have similar numerical representations
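
To make "similar numerical representation" concrete, here is a toy sketch using hand-made 3-dimensional vectors. These values are invented for illustration and are not the output of any embedding model; cosine similarity is used here as one common way to compare vectors.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors standing in for embeddings (hypothetical values)
toy_embeddings = {
    "reading words": [0.9, 0.1, 0.0],
    "reading text": [0.8, 0.2, 0.1],
    "brain stimulation": [0.1, 0.1, 0.9],
}

query = toy_embeddings["reading words"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

In this toy setup the two reading-related vectors point in nearly the same direction and score close to 1, while the unrelated one scores much lower. Real embedding vectors have hundreds of dimensions, but they are compared in the same way.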