### RAG (Retrieval Augmented Generation) practice
- Retrieval - Find relevant information given a query.
- Augmentation - Take the relevant information from retrieval and add it to the input (prompt) given to an LLM.
- Generation - Pass the augmented prompt to an LLM to produce the generated output.
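
The three steps can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the function names, the word-overlap scoring, and the two sample sentences are made up for this example, and `generate` is a placeholder rather than a real LLM call.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy retrieval: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real system would send `prompt` to a model."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


# Made-up sample passages, for illustration only.
documents = [
    "The advisory committee must include an out-of-area member.",
    "Parking permits are sold at the start of each semester.",
]
query = "Who must be on the advisory committee?"

context = retrieve(query, documents)   # Retrieval
prompt = augment(query, context)       # Augmentation
answer = generate(prompt)              # Generation
print(prompt)
```

A real system would typically replace the word-overlap score with embedding similarity and the placeholder with an actual model call.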

### Why RAG?
- The main goal of RAG is to improve the quality of the outputs an LLM generates.
- Most of what I have practiced so far was based on fine-tuning. RAG can give the model up-to-date information with less effort than fine-tuning.
### Steps
- Import PDF
- Process text for embedding
- Embed text chunks with an embedding model
- Save embeddings to a file for later use
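
A minimal sketch of the last three steps, assuming the page text has already been extracted. `toy_embed` is a hypothetical stand-in that only keeps the example dependency-free; it is not a real embedding model, and the sample text, chunk size, and file name are likewise made up for illustration.

```python
import json
import os
import tempfile
import zlib


def split_into_chunks(sentences: list[str], chunk_size: int) -> list[list[str]]:
    """Group consecutive sentences into chunks of at most `chunk_size`."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words counts.

    NOT a real embedding; swap in an actual embedding model in practice.
    """
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


# Process text for embedding: split into sentences, then group into chunks.
page_text = "First sentence. Second sentence. Third sentence."  # made-up sample
sentences = [s.strip() for s in page_text.split(". ") if s.strip()]
chunks = [" ".join(group) for group in split_into_chunks(sentences, chunk_size=2)]

# Embed each chunk.
records = [{"chunk": chunk, "embedding": toy_embed(chunk)} for chunk in chunks]

# Save to a file so the embeddings can be reloaded later without recomputing.
save_path = os.path.join(tempfile.gettempdir(), "chunk_embeddings_example.json")
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(records, f)

with open(save_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```
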
### Practiced with 
- Daniel Bourke - Local Retrieval Augmented Generation (RAG) from Scratch (step by step tutorial)
- https://www.youtube.com/watch?v=qN_2fnOPY-M&t=513s
### The project is about
- Using RAG to augment a local LLM with information from a PDF document.
- Building an LLM that is knowledgeable about the up-to-date Virginia Tech ECE graduate student policy.

### Import PDF Document

In [5]:
import os 
import requests

# Path to the PDF document
pdf_path = "ECE Graduate Policy Manual AY2023-2024.pdf"

# Download the PDF only if it is not already on disk
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # URL of the PDF. The VT ECE graduate policy link is intentionally left out;
    # replace this placeholder if the file ever needs to be downloaded.
    url = "example_url"

    filename = pdf_path

    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")


File ECE Graduate Policy Manual AY2023-2024.pdf exists.


In [37]:
# Use PyMuPDF (imported as fitz) to open the PDF instead of pypdf
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # More text formatting steps can be added here if needed.
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and basic stats per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # page_number is the zero-based page index shifted by 4 (offset chosen for this PDF)
        pages_and_texts.append({"page_number": page_number - 4,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:15]

[{'page_number': -4,
  'page_char_count': 67,
  'page_word_count': 11,
  'page_sentence_count_raw': 1,
  'page_token_count': 16.75,
  'text': 'ECE Graduate Student Policy Manual  For the 2023-2024 Academic Year'},
 {'page_number': -3,
  'page_char_count': 5340,
  'page_word_count': 389,
  'page_sentence_count_raw': 50,
  'page_token_count': 1335.0,
  'text': '1  Table of Contents  1  General Information ............................................................................................................................. 5  1.1  ECE Graduate Advising Offices ..................................................................................................... 5  1.1.1  Graduate Academic Advisors ................................................................................................ 5  1.1.2  Graduate Program Director.................................................................................................... 6  1.1.3  Assistant Graduate Program Directors ..........
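
The `page_token_count` field uses the rule of thumb that one token is roughly four characters. A small sketch of how that estimate could be used to decide whether a page needs to be split before embedding is given here. The `MAX_TOKENS` value is an assumption for illustration only (check the limit of whichever embedding model is used), and `estimate_tokens` is a hypothetical helper, not part of the notebook.

```python
def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> float:
    """Rough token estimate from a character count (~4 characters per token)."""
    return char_count / chars_per_token


# Character counts copied from the first two pages in the output above.
page_char_counts = {-4: 67, -3: 5340}

MAX_TOKENS = 384  # assumed limit for illustration; depends on the embedding model

page_status = {}
for page_number, char_count in page_char_counts.items():
    tokens = estimate_tokens(char_count)
    page_status[page_number] = "needs chunking" if tokens > MAX_TOKENS else "fits"
    print(f"page {page_number}: ~{tokens:.2f} tokens -> {page_status[page_number]}")
```

Under that assumed limit, the title page fits as-is, while the table-of-contents page would have to be split into smaller chunks.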

In [38]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 37,
  'page_char_count': 3050,
  'page_word_count': 519,
  'page_sentence_count_raw': 35,
  'page_token_count': 762.5,
  'text': '41 the ECE Graduate Program Director. Other Virginia Tech Institute faculty with ECE courtesy  appointments or ECE collegiate and research professors may serve as co-Chair with approval  of the ECE Graduate Program Director.    A diverse set of research experience should be included on the Advisory Committee. At least  one of the committee members shall be an ECE tenured/tenure-track faculty with a different  primary technical area from the Ph.D. student (an “out-of-area” member). In addition, one  committee member must be a tenured/tenure-track faculty member from another academic  department within Virginia Tech.    ECE research expertise must be represented in the Advisory Committee. For an ECE PhD  committee, at least 3 of the committee members shall be tenured/tenure- track/collegiate/research professors of any rank with at most one of 