<a href="https://colab.research.google.com/github/JYL480/RAGResume/blob/main/ResumeRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install packages

In [1]:
# Import the necessary packages
# Perform Google Colab installs (if running in Google Colab)
import os


print("[INFO] Running in Google Colab, installing requirements.")
!pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
!pip install PyMuPDF # for reading PDFs with Python
!pip install tqdm # for progress bars
!pip install sentence-transformers # for embedding models, helps to give dense vectors formed from the words
!pip install accelerate # for quantization model loading
!pip install bitsandbytes # for quantizing models (less storage space), reduce the size of the weights!
!pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.
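
The install cell above prints the Colab message unconditionally. A minimal sketch of gating that step on the runtime is shown below. It assumes the common idiom that the `google.colab` module is already loaded inside a Colab kernel; `running_in_colab` is a hypothetical helper name, not part of this notebook.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed Colab idiom)."""
    return "google.colab" in sys.modules


if running_in_colab():
    print("[INFO] Running in Google Colab, installing requirements.")
    # The `!pip install ...` lines from the install cell would go here.
else:
    print("[INFO] Not running in Google Colab, skipping installs.")
```

Outside Colab the installs are skipped, so a local environment is not modified by accident.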


### Let's read my resume first!

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')



In [3]:

# import shutil
# import os

# # Define the source path in Google Drive
# source_path = '/content/drive/My Drive/Colab Notebooks/Lee Jun Yang_Resume.pdf'

# # Define the destination path in the local Colab environment
# destination_path = 'Lee Jun Yang_Resume.pdf'

# if os.path.exists(destination_path):
#   print("File already exists...")
# else:
#     # Copy the file from Google Drive to the local Colab environment
#   print("Downloading file...")
#   shutil.copyfile(source_path, destination_path)

#   print(f"File copied to {destination_path}")




In [4]:
import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
JunYang_csv = pd.read_csv("about_JunYang.csv")
JunYang_csv

Unnamed: 0,question,answer
0,what is his educational background| tell me ab...,I am currently pursing a bacholer degree in Co...
1,How can I contact Lee Jun Yang? | What is Dev'...,Lee Jun Yang's contact information: email - c2...
2,What motivates Lee Jun Yang?,Lee Jun Yang is deeply motivated by the transf...
3,Where does he see himself in 5 years?,"In 5 years, Lee Jun Yang aims to be in a posit..."
4,Tell me about his technical skills ; What tech...,Lee Jun Yang has demonstrated versatile techni...
5,"what can you do, hey, hello","Hello, I'm a Resume Bot. I can assist you with..."
6,What programming languages or technologies is ...,"Python, Retrieval-Augmented Generation (RAG), ..."
7,what languages does Lee Jun Yang speak,Lee Jun Yang speaks English and Chinese
8,Where can I see Lee Jun Yang's Github profile?,You can see his resume on his LinkedIn @ https...
9,"Tell me about Lee Jun Yang's hobbies , any hob...",Lee Jun Yang loves to Music and Sports.


In [5]:
# Convert the DataFrame into a list of dictionaries

about_jy = []
for index, row in JunYang_csv.iterrows():
  about_jy.append({ "question" :row["question"],
                    "answer": row["answer"],
                   "char_count": len(row["answer"]) + len(row["question"]),
                    "token_count": (len(row["answer"]) + len(row["question"])) / 4,})

about_jy

[{'question': 'what is his educational background| tell me about his education| where did he go to school',
  'answer': 'I am currently pursing a bacholer degree in Computer Engineering, I dabble in both software and hardware, Studying in Nanyang Technological University Singapore.',
  'char_count': 251,
  'token_count': 62.75},
 {'question': "How can I contact Lee Jun Yang? | What is Dev's email? | How to contact Lee Jun Yang?",
  'answer': "Lee Jun Yang's contact information: email - c220096@e.ntu.edu.sg, phone - +65 92484864, LinkedIn - https://www.linkedin.com/in/lee-jun-yang-0b0337215/",
  'char_count': 235,
  'token_count': 58.75},
 {'question': 'What motivates Lee Jun Yang?',
  'answer': 'Lee Jun Yang is deeply motivated by the transformative potential of AI and data science, He sees these fields as catalysts for positive change, offering endless possibilities to solve complex problems, improve lives, and drive innovation. This sense of purpose and the opportunity to make a mean
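
The `token_count` field above is only an estimate based on the rule of thumb of roughly 4 characters per token. The same heuristic can be sketched as a small helper. The function name `estimate_token_count` and the `chars_per_token` parameter are illustrative assumptions; a real tokenizer would give different numbers.

```python
def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: character count divided by an assumed chars-per-token ratio."""
    return len(text) / chars_per_token


question = "what languages does Lee Jun Yang speak"
answer = "Lee Jun Yang speaks English and Chinese"

# Same arithmetic as the notebook cell: (len(answer) + len(question)) / 4
print(estimate_token_count(question + answer))
```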

In [6]:

# Here we preprocess the data.
# Note: many packages are available for reading PDFs, but PyMuPDF worked best for the data used here.
def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

In [7]:
# Now we deal with the reading part!

import fitz # PyMuPDF (fitz is the import name for PyMuPDF)
from tqdm.auto import tqdm

# Note that this accepts the path to the PDF as input
# and returns a list of dictionaries of cleaned text (one per page)
def open_and_read_pdf(pdf_path) -> list[dict]:
  doc = fitz.open(pdf_path)
  # Initialise a list so that the function returns a list of dictionaries,
  # a convenient format for later data manipulation (e.g. with pandas)
  pages_and_texts = []
  for page_num, page in tqdm(enumerate(doc)):
    # we get the text!
    text = page.get_text()
    # we will clean it with text_formatter()
    text = text_formatter(text)
    # We append the text to the list
    pages_and_texts.append({  "char_count": len(text),
                              "word_count": len(text.split(" ")),
                              "sentence_count_raw": len(text.split(". ")),
                              "token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                              "text": text})
  return pages_and_texts

pdf_path = "Lee Jun Yang_Resume.pdf"
resume_texts = open_and_read_pdf(pdf_path=pdf_path)
resume_texts

FileNotFoundError: no such file: 'Lee Jun Yang_Resume.pdf'
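
The `FileNotFoundError` above most likely occurs because the cell that copies the PDF from Google Drive is commented out, so the file is not in the working directory. A minimal sketch of checking for the file before opening it follows. `resolve_pdf_path` is a hypothetical helper, not part of PyMuPDF or of this notebook.

```python
from pathlib import Path


def resolve_pdf_path(pdf_path: str) -> Path:
    """Return pdf_path as a Path, raising a descriptive error if the file is missing."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find '{path}' in {Path.cwd()}. "
            "Upload the PDF or copy it from Google Drive first."
        )
    return path


try:
    resolve_pdf_path("Lee Jun Yang_Resume.pdf")
except FileNotFoundError as error:
    print(f"[WARN] {error}")
```

Failing early with the working directory in the message makes it clearer why the later cells that depend on `resume_texts` cannot run.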

### Read the about_JunYang.csv as well!

In [None]:
# combined_list = about_jy + resume_texts
# combined_list

In [None]:
import pandas as pd

combined_list = resume_texts

df = pd.DataFrame(combined_list)
df.describe().round(2)

In [8]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x79f334f29300>

In [9]:
# Use the spaCy sentencizer to split each page's text into a list of sentences
for item in tqdm(combined_list):
  # Add a new "sentences" key to each dictionary in the list
  item["sentences"] = list(nlp(item["text"]).sents) # .sents yields the sentences found by the sentencizer

  # Convert each sentence from a spaCy Span to a plain string
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]

  # Add another key that counts the number of sentences on each page
  item["page_sentence_count_spacy"] = len(item["sentences"])

NameError: name 'combined_list' is not defined

In [10]:
# Use the spaCy sentencizer to split each answer into a list of sentences
# This is for the other CSV about Lee Jun Yang
for item in tqdm(about_jy):
  # Add a new "sentences" key to each dictionary in the list
  item["sentences"] = list(nlp(item["answer"]).sents) # .sents yields the sentences found by the sentencizer

  # Convert each sentence from a spaCy Span to a plain string
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]

  # Add another key that counts the number of sentences in each answer
  item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/20 [00:00<?, ?it/s]

In [None]:
about_jy

In [None]:
combined_list

In [11]:
# We need to chunk the sentences!
# Define split size to turn groups of sentences into chunks

## THIS CHUNKING is for the about Jun Yang file!!!

num_sentence_chunk_size = 3

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last one may be shorter).

    For example, with slice_size=10, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(about_jy):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/20 [00:00<?, ?it/s]
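
To make the chunking behaviour concrete, here is a small self-contained sketch of the same slicing logic applied to made-up sentences. The sample sentences are illustrative only.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


sentences = [f"Sentence {n}." for n in range(1, 8)]  # 7 made-up sentences
chunks = split_list(sentences, slice_size=3)
print([len(chunk) for chunk in chunks])  # prints [3, 3, 1]
```

The last chunk simply holds whatever is left over, so no sentence is dropped.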

In [None]:
# We need to chunk the sentences!
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last one may be shorter).

    For example, with slice_size=10, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(combined_list):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

In [None]:
about_jy

In [14]:
import re

# Split each chunk into its own item
texts_and_chunks = []

for item in tqdm(about_jy):
    question = item["question"]
    answer = item["answer"]
    combined_text = f"Question: {question}\nAnswer: {answer}"

    joined_sentence_chunk = re.sub(r'\s+', ' ', combined_text).strip()
    joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

    chunk_dict = {
        "sentence_chunk": joined_sentence_chunk,
        "chunk_char_count": len(joined_sentence_chunk),
        "chunk_word_count": len(joined_sentence_chunk.split(" ")),
        "chunk_token_count": len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
    }

    texts_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(texts_and_chunks)

  0%|          | 0/20 [00:00<?, ?it/s]

20
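
The cell above builds one chunk per question/answer pair rather than using the sentence chunks. A sketch of the same text normalisation as a standalone function is shown below. `build_qa_chunk` is a hypothetical name and the sample strings are made up.

```python
import re


def build_qa_chunk(question: str, answer: str) -> str:
    """Join a question/answer pair into one whitespace-normalised chunk string."""
    combined_text = f"Question: {question}\nAnswer: {answer}"
    chunk = re.sub(r"\s+", " ", combined_text).strip()  # collapse newlines and repeated spaces
    chunk = re.sub(r"\.([A-Z])", r". \1", chunk)  # "end.Next" -> "end. Next"
    return chunk


print(build_qa_chunk("Any  hobbies?", "He likes music.He also plays sports."))
```

Note that the second substitution inserts a space after any full stop that is directly followed by a capital letter, so it can also alter abbreviations or URLs that contain that pattern.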

In [15]:
texts_and_chunks

[{'sentence_chunk': 'Question: what is his educational background| tell me about his education| where did he go to school Answer: I am currently pursing a bacholer degree in Computer Engineering, I dabble in both software and hardware, Studying in Nanyang Technological University Singapore.',
  'chunk_char_count': 270,
  'chunk_word_count': 41,
  'chunk_token_count': 67.5},
 {'sentence_chunk': "Question: How can I contact Lee Jun Yang? | What is Dev's email? | How to contact Lee Jun Yang? Answer: Lee Jun Yang's contact information: email - c220096@e.ntu.edu.sg, phone - +65 92484864, LinkedIn - https://www.linkedin.com/in/lee-jun-yang-0b0337215/",
  'chunk_char_count': 254,
  'chunk_word_count': 36,
  'chunk_token_count': 63.5},
 {'sentence_chunk': 'Question: What motivates Lee Jun Yang? Answer: Lee Jun Yang is deeply motivated by the transformative potential of AI and data science, He sees these fields as catalysts for positive change, offering endless possibilities to solve complex pr

In [None]:
df_chunks= pd.DataFrame(texts_and_chunks)
df_chunks.describe().round(2)

### I won't filter by a minimum chunk length because all the information on the resume is important!

In [None]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
# !pip uninstall torch
# !pip install torch

In [None]:
!pip install --upgrade torchvision


In [12]:
from sentence_transformers import SentenceTransformer

In [13]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device) # choose the device to load the model to (note: GPU will often be *much* faster than CPU)



In [16]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in texts_and_chunks]
# This is optional: a flat list allows encoding in batches, but the embeddings would not be attached to their respective dictionaries, so it is not used below

In [17]:
text_chunks

['Question: what is his educational background| tell me about his education| where did he go to school Answer: I am currently pursing a bacholer degree in Computer Engineering, I dabble in both software and hardware, Studying in Nanyang Technological University Singapore.',
 "Question: How can I contact Lee Jun Yang? | What is Dev's email? | How to contact Lee Jun Yang? Answer: Lee Jun Yang's contact information: email - c220096@e.ntu.edu.sg, phone - +65 92484864, LinkedIn - https://www.linkedin.com/in/lee-jun-yang-0b0337215/",
 'Question: What motivates Lee Jun Yang? Answer: Lee Jun Yang is deeply motivated by the transformative potential of AI and data science, He sees these fields as catalysts for positive change, offering endless possibilities to solve complex problems, improve lives, and drive innovation. This sense of purpose and the opportunity to make a meaningful impact fuels his passion and drives him to constantly learn, explore, and push boundaries in the field.',
 "Questio

In [18]:
# embedding_model.to("cuda") # requires a GPU installed, for reference on my local machine, I'm using a NVIDIA RTX 4090

# Create embeddings one by one (on the GPU if one is available, otherwise on the CPU)
for item in tqdm(texts_and_chunks):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/20 [00:00<?, ?it/s]

In [19]:
df_encoded = pd.DataFrame(texts_and_chunks)
df_encoded.head()

Unnamed: 0,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,Question: what is his educational background| ...,270,41,67.5,"[0.03420846, 0.027942171, -0.027132034, 0.0176..."
1,Question: How can I contact Lee Jun Yang? | Wh...,254,36,63.5,"[0.068790354, 0.0007197743, -0.06437515, 0.070..."
2,Question: What motivates Lee Jun Yang? Answer:...,452,71,113.0,"[0.023538038, 0.0976473, -0.007967283, -0.0043..."
3,Question: Where does he see himself in 5 years...,482,80,120.5,"[0.03986174, 0.08381377, -0.03277232, -0.01365..."
4,Question: Tell me about his technical skills ;...,532,77,133.0,"[0.056806814, 0.011950685, -0.050960135, 0.005..."


- Now save the list of dictionaries with embeddings as a CSV file!

In [23]:
# save the embedding file!
text_chunks_and_embedding_df = pd.DataFrame(texts_and_chunks)
embeddings_df_save_path = "embeddings_aboutMe.csv"
text_chunks_and_embedding_df.to_csv(embeddings_df_save_path, index=False)

In [24]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,Question: what is his educational background| ...,270,41,67.5,[ 3.42084616e-02 2.79421713e-02 -2.71320343e-...
1,Question: How can I contact Lee Jun Yang? | Wh...,254,36,63.5,[ 6.87903538e-02 7.19774282e-04 -6.43751472e-...
2,Question: What motivates Lee Jun Yang? Answer:...,452,71,113.0,[ 2.35380381e-02 9.76473019e-02 -7.96728302e-...
3,Question: Where does he see himself in 5 years...,482,80,120.5,[ 3.98617387e-02 8.38137716e-02 -3.27723213e-...
4,Question: Tell me about his technical skills ;...,532,77,133.0,[ 5.68068139e-02 1.19506847e-02 -5.09601347e-...


In [25]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("embeddings_aboutMe.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([20, 768])
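
Parsing the embedding column back with `np.fromstring` relies on how the array was rendered as text when the CSV was written. One alternative is to serialise each embedding as a JSON list before saving. This is shown only as a sketch and is not the approach this notebook uses. The file name `embeddings_demo.csv` and the tiny three-value vectors are made up for illustration.

```python
import csv
import json
import os
import tempfile

rows = [
    {"sentence_chunk": "Question: A Answer: B", "embedding": [0.25, -0.5, 0.125]},
    {"sentence_chunk": "Question: C Answer: D", "embedding": [1.0, 0.0, -1.0]},
]

csv_path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.csv")

# Write: store each embedding as a JSON string in its own column.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence_chunk", "embedding"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "embedding": json.dumps(row["embedding"])})

# Read: json.loads turns the stored string back into a list of floats.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = [{**row, "embedding": json.loads(row["embedding"])} for row in csv.DictReader(f)]

print(loaded[0]["embedding"])
```

For larger collections a binary format would be more compact, but for 20 chunks a CSV stays easy to inspect.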

In [26]:
query = "School"
model = embedding_model
from time import perf_counter as timer
from sentence_transformers import SentenceTransformer, util

query_embedding = model.encode(query, convert_to_tensor=True)
print(query_embedding.shape)
# Time the semantic search, which compares the query embedding to our stored chunk embeddings
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0] # index 0 removes the outer (batch) dimension
end_time = timer()
top_results_dot_product = torch.topk(dot_scores, k=3)
print(top_results_dot_product)


torch.Size([768])
torch.return_types.topk(
values=tensor([0.2327, 0.1540, 0.1281]),
indices=tensor([ 0,  9, 18]))
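
The search above scores every chunk by its dot product with the query and keeps the highest values. A dependency-free sketch of that idea follows, using made-up two-dimensional vectors. The helper names are illustrative and are not the `sentence_transformers` or `torch` implementations. When all vectors have unit length, the dot product equals cosine similarity.

```python
import math


def dot_score(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_score(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot_score(a, b) / (math.sqrt(dot_score(a, a)) * math.sqrt(dot_score(b, b)))


def top_k(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


query_vec = [1.0, 0.0]
chunk_vecs = [[0.6, 0.8], [0.0, 1.0], [1.0, 0.0]]  # all unit length
scores = [dot_score(query_vec, vec) for vec in chunk_vecs]
print(top_k(scores, k=2))  # prints [(2, 1.0), (0, 0.6)]
```

The returned indices play the same role as the `indices` tensor above: they point back into the list of chunks.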


In [27]:
import torch
from sentence_transformers import SentenceTransformer, util
from time import perf_counter as timer

def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                print_time: bool = True):
  """Embeds the query and returns the scores and indices of the 4 most similar chunks."""
  query_embedding = model.encode(query, convert_to_tensor=True)
  print(query_embedding.shape)
  # Time the semantic search, which compares the query embedding to our stored chunk embeddings
  start_time = timer()
  dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0] # index 0 removes the outer (batch) dimension
  end_time = timer()


  if print_time:
    print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

  # Get the top k scores of the semantic search
  score, indices = torch.topk(dot_scores, k=4, largest=True)
  return score, indices

In [28]:
query = "what languages does Lee Jun Yang speak "


score, indices = retrieve_relevant_resources(query, embeddings)
print(indices)

torch.Size([768])
[INFO] Time taken to get scores on 20 embeddings: 0.00123 seconds.
tensor([ 7,  6, 16, 10])


In [30]:
pages_and_chunks[7]

{'sentence_chunk': 'Question: what languages does Lee Jun Yang speak Answer: Lee Jun Yang speaks English and Chinese',
 'chunk_char_count': 96,
 'chunk_word_count': 16,
 'chunk_token_count': 24.0,
 'embedding': array([ 7.51518384e-02,  4.51845564e-02, -1.28374342e-02,  5.55227026e-02,
        -2.16057096e-02, -6.71264017e-03,  2.66289413e-02,  3.12163569e-02,
         4.31197286e-02,  1.01535711e-02, -7.91229531e-02, -7.94991665e-03,
         3.71767543e-02, -4.07420173e-02,  3.43333371e-02, -1.44123323e-02,
        -1.35094824e-03, -2.71343496e-02, -1.83021519e-02, -4.24703658e-02,
        -2.91371741e-03,  1.00673856e-02,  3.04876771e-02,  5.14196232e-03,
        -4.68940195e-03,  1.47710685e-02, -7.53920572e-03, -1.74753796e-02,
        -1.39201749e-02,  3.58336158e-02, -2.08167126e-03, -4.62253625e-03,
        -5.03846146e-02,  1.93334185e-02,  1.81045993e-06,  1.89903509e-02,
         1.21273529e-02, -2.81468406e-03,  4.51334845e-03, -1.26878107e-02,
         1.53201846e-02,  1.77

In [31]:
def prompt_formatter(query: str, context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list (one "- " line per chunk)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What is your major?
Answer: I am currently studying Computer Engineering.
\nExample 2:
Query: What are your hobbies?
Answer: I like sports and music. More specifically, basketball and playing the French Horn!
\nExample 3:
Query: How do I contact you?
Answer: These are details I can provide you, email: [email@example.com], phone: [123-456-7890].
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    prompt = base_prompt.format(context=context, query=query)

    return prompt
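
To show how the context block ends up inside the prompt, here is a reduced sketch of the same formatting step. The `template` string is a shortened stand-in for `base_prompt`, and both `format_context` and the sample chunks are made up for illustration.

```python
def format_context(context_items: list[dict]) -> str:
    """Render each retrieved chunk as one '- ' bullet line."""
    return "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)


# Shortened stand-in for the notebook's base_prompt (hypothetical wording).
template = "Context:\n{context}\nUser query: {query}\nAnswer:"

context_items = [
    {"sentence_chunk": "Question: What are his hobbies? Answer: Music and sports."},
    {"sentence_chunk": "Question: What languages does he speak? Answer: English and Chinese."},
]
prompt = template.format(context=format_context(context_items), query="Any hobbies?")
print(prompt)
```

Because `str.format` only parses the template, braces inside the retrieved chunks are safe. Literal braces in the template itself would need to be doubled.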


In [None]:
pages_and_chunks

In [32]:
query =   "Aspirations in 5 years"
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
print(indices)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# print(context_items)
# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: Aspirations in 5 years
torch.Size([768])
[INFO] Time taken to get scores on 20 embeddings: 0.00068 seconds.
tensor([ 3, 14,  2, 12])
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What is your major?
Answer: I am currently studying Computer Engineering.

Example 2:
Query: What are your hobbies?
Answer: I like sports and music. More specifically, basketball and playing the French Horn!

Example 3:
Query: How do I contact you?
Answer: These are details I can provide you, email: [email@example.com], phone: [123-456-7890].

Now use the following context items to answer the user query:
- Question: Where does he see himself in 5 years? Answer: In 5 years, Lee Jun Yang aims to 

In [33]:
# Try using Gemini instead?
!pip install -q -U google-generativeai

In [34]:
import google.generativeai as genai

genai.configure(api_key="") # supply your own Gemini API key here (left blank in this notebook)

model = genai.GenerativeModel('gemini-1.5-flash')
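
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `GOOGLE_API_KEY`. Both that name and the `load_api_key` helper are illustrative choices, not something the notebook defines.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from an environment variable; fail loudly if it is unset."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return api_key


# The configure call could then become: genai.configure(api_key=load_api_key())
```

This keeps the key out of the saved notebook and out of version control.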

In [35]:
# prompt = "The quick brown fox jumps over the lazy dog."

# Call `count_tokens` to get the input token count (`total_tokens`).
# print("total_tokens: ", model.count_tokens(prompt))
# ( total_tokens: 10 )

response = model.generate_content(prompt)
print(response.text)



In 5 years, Lee Jun Yang aims to be in a position where he's leveraging AI to make a tangible positive impact on industries. He sees himself working to demystify AI and make it more accessible, removing the stigma that one needs advanced qualifications to utilize its potential. Through practical applications and advocacy, he hopes to empower individuals and organizations to embrace AI as a tool for innovation and progress. 

