### 1: Dependencies

In [1]:
# Langchain dependencies
from langchain.document_loaders.pdf import PyPDFDirectoryLoader  # Importing PDF loader from Langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Importing text splitter from Langchain
from langchain.embeddings import OpenAIEmbeddings  # Importing OpenAI embeddings from Langchain
from langchain.schema import Document  # Importing Document schema from Langchain
from langchain.vectorstores.chroma import Chroma  # Importing Chroma vector store from Langchain

import os  # Importing os module for operating system functionalities
import shutil  # Importing shutil module for high-level file operations

### 2: Read PDF

In [2]:
# Directory to your pdf files:
DATA_PATH = r"C:\Users\callu\repos\job_application\data"

def load_documents():
    """
    Load PDF documents from the specified directory using PyPDFDirectoryLoader.

    Returns:
        List of Document objects: Loaded PDF documents represented as Langchain Document objects.
    """
    document_loader = PyPDFDirectoryLoader(DATA_PATH)  # Initialize PDF loader with specified directory
    return document_loader.load()  # Load PDF documents and return them as a list of Document objects

In [3]:
documents = load_documents()
print(documents[0])

page_content='Callum Macpherson Email : callumjmac@outlook.com\nGithub Mobile : +447928252465\nExperience\n•Freelance Sydney, Australia\nAI Engineer & Consultant Jun. 2023 - Jan. 2024\n◦LLMs : Developing LLM systems, including data collection, curation, fine-tuning, prompt engineering, evaluation\nand deployment using OpenAI API.\n◦Prototyping : Integrated LLM system into proof-of-concept applications using Streamlit.\n◦Continual Learning : Delivered presentations and reports on Distribution Drift Metrics for Deep Learning\nmodels.\n•Self Elected Career Break Around the world\nTravelled to 16 countries Oct. 2022 - Mar. 2024\n◦Highlights : Everest Base Camp Three Passes Trek, Salkantay Trek to Machu Picchu, Ha Giang Motorbike Loop,\nDeep Water Free Solo Climbing in Ha Long Bay, Surfing, Scuba Diving.\n•Advai London, UK\nMachine Learning Researcher Sep. 2021 - Oct. 2022\n◦Adversarial Attacks : Implemented adversarial (poisoning and evasion) attacks against Computer Vision (YOLO,\nFaster 

In [4]:
type(documents)

list

### 3: Split into chunks of text

Is this step necessary or useful for my application?
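One way to reason about this question is to look at what `chunk_size` and `chunk_overlap` actually do to a piece of text. The sketch below is a deliberately simplified fixed-window splitter, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as newlines). The function name `sliding_window_chunks` and the sample text are assumptions made for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list[tuple[int, str]]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Returns (start_index, chunk) pairs, similar in spirit to add_start_index=True.
    This is an illustrative stand-in, not LangChain's splitting logic.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 1000-character input
sample = "x" * 1000
windows = sliding_window_chunks(sample)
print([(start, len(chunk)) for start, chunk in windows])
```

For a single short CV, small chunks mainly help retrieval return only the relevant sections; if the whole document fits comfortably in the model's context window, passing it unsplit may be a reasonable alternative.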

In [5]:
def split_text(documents: list[Document]):
    """
    Split the text content of the given list of Document objects into smaller chunks.

    Args:
        documents (list[Document]): List of Document objects containing text content to split.

    Returns:
        list[Document]: List of Document objects representing the split text chunks.
    """
    # Initialize text splitter with specified parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,  # Size of each chunk in characters
        chunk_overlap=100,  # Overlap between consecutive chunks
        length_function=len,  # Function to compute the length of the text
        add_start_index=True,  # Flag to add start index to each chunk
    )
    # Split documents into smaller chunks using text splitter
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Print example of page content and metadata for a chunk
    # (guarded so inputs that yield fewer than 11 chunks do not raise IndexError)
    if len(chunks) > 10:
        document = chunks[10]
        print(document.page_content)
        print(document.metadata)

    return chunks  # Return the list of split text chunks

In [6]:
chunks = split_text(documents)

Split 1 documents into 12 chunks.
custom environment built with PyGame.
•2020 Presidential Election Tweets : Sentiment Analysis using NLTK and Voting Correlation Analysis.
Skills
•Programming Languages : Python, R, SQL, C++, Matlab, JavaScript, HTML, CSS
•Key Frameworks : PyTorch, Lightning, TensorFlow, Keras, JAX, Flax, Hugging Face, Timm, Django, Sci-kit Learn,
{'source': 'C:\\Users\\callu\\repos\\job_application\\data\\24.04.23 - Callum Macpherson - CV.pdf', 'page': 0, 'start_index': 2929}


In [7]:
for chunk in chunks:
    print(chunk)
    print("\n")

page_content='Callum Macpherson Email : callumjmac@outlook.com\nGithub Mobile : +447928252465\nExperience\n•Freelance Sydney, Australia\nAI Engineer & Consultant Jun. 2023 - Jan. 2024\n◦LLMs : Developing LLM systems, including data collection, curation, fine-tuning, prompt engineering, evaluation\nand deployment using OpenAI API.\n◦Prototyping : Integrated LLM system into proof-of-concept applications using Streamlit.' metadata={'source': 'C:\\Users\\callu\\repos\\job_application\\data\\24.04.23 - Callum Macpherson - CV.pdf', 'page': 0, 'start_index': 0}


page_content='◦Prototyping : Integrated LLM system into proof-of-concept applications using Streamlit.\n◦Continual Learning : Delivered presentations and reports on Distribution Drift Metrics for Deep Learning\nmodels.\n•Self Elected Career Break Around the world\nTravelled to 16 countries Oct. 2022 - Mar. 2024' metadata={'source': 'C:\\Users\\callu\\repos\\job_application\\data\\24.04.23 - Callum Macpherson - CV.pdf', 'page': 0, 'st

### 4: Save to a vector database using Chroma

In [8]:
CHROMA_PATH = "chroma"

In [9]:
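def save_to_chroma(chunks: list[Document]):
    """
    Embed the given chunks with OpenAIEmbeddings and persist them to a Chroma database.

    Args:
        chunks (list[Document]): List of Document objects to embed and store.
    """
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)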

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

### 5: Create a Chroma Database

In [10]:
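def generate_data_store():
    """
    Run the full pipeline: load the PDFs, split them into chunks, and save the chunks to Chroma.
    """
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)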

In [11]:
generate_data_store()

Split 1 documents into 12 chunks.
custom environment built with PyGame.
•2020 Presidential Election Tweets : Sentiment Analysis using NLTK and Voting Correlation Analysis.
Skills
•Programming Languages : Python, R, SQL, C++, Matlab, JavaScript, HTML, CSS
•Key Frameworks : PyTorch, Lightning, TensorFlow, Keras, JAX, Flax, Hugging Face, Timm, Django, Sci-kit Learn,
{'source': 'C:\\Users\\callu\\repos\\job_application\\data\\24.04.23 - Callum Macpherson - CV.pdf', 'page': 0, 'start_index': 2929}


Saved 12 chunks to chroma.


#### Embedding example

In [12]:
ex = "apple"
ex_1 = "orange"
ex_2 = "iphone"

In [13]:
embedding_function = OpenAIEmbeddings()
vector = embedding_function.embed_query(ex)
vector_1 = embedding_function.embed_query(ex_1)
vector_2 = embedding_function.embed_query(ex_2)

In [14]:
vector, len(vector)

([0.007788693935724774,
  -0.023086208665530836,
  -0.007563429358468463,
  -0.027796285004661327,
  -0.004546249248985117,
  0.013031215302746993,
  -0.022075930433763543,
  -0.008491792648179218,
  0.01889492183715097,
  -0.029625708000381175,
  -0.002952331420639518,
  0.020123638051757233,
  -0.004467747604137248,
  0.009058367388528664,
  -0.02172096801315829,
  0.002046153398631913,
  0.030663290747172698,
  9.96731824885703e-05,
  0.0020973498299636266,
  -0.025502683390884448,
  -0.02110660990585516,
  -0.008130003633156626,
  0.02122948115478676,
  -0.012410031532349089,
  0.0011160836931474653,
  0.005030909512939308,
  0.010095949203997619,
  -1.3579071383542974e-05,
  0.015877740330683773,
  -0.012921996311327509,
  0.020642427562507858,
  -0.016082526987333194,
  -0.01847169719766258,
  0.005382458636335888,
  -0.019290840098969995,
  -0.009222196341319175,
  -0.012089200221185408,
  -0.008778492849901327,
  -0.005652093283337213,
  -0.006092383477546389,
  0.0104782170709

In [15]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_embedding_distance")

In [16]:
# run an evaluation

x = evaluator.evaluate_string_pairs(prediction=ex, prediction_b=ex_1)

In [17]:
x

{'score': 0.13536251856245263}

In [18]:
evaluator.evaluate_string_pairs(prediction=ex, prediction_b=ex_2)

{'score': 0.09712033240791274}

In [19]:
evaluator.evaluate_string_pairs(prediction=ex, prediction_b=ex)

{'score': 2.220446049250313e-16}

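A larger distance means the two strings are less similar in embedding space. Identical strings score effectively zero (the 2.2e-16 above is floating-point rounding error). Note that in this run "apple" is closer to "iphone" (0.097) than to "orange" (0.135).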
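To make the score more concrete, here is a minimal sketch of cosine distance, which I believe is the default metric for the `pairwise_embedding_distance` evaluator (treat that as an assumption and check the LangChain documentation). The toy 3-dimensional vectors are made up for illustration; real OpenAI embeddings have many more dimensions.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 for vectors pointing the same way, larger for less similar ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"
v_a = [1.0, 0.0, 0.0]
v_b = [0.0, 1.0, 0.0]   # orthogonal to v_a
v_c = [2.0, 0.0, 0.0]   # same direction as v_a, different length

print(cosine_distance(v_a, v_b))  # orthogonal vectors: distance 1.0
print(cosine_distance(v_a, v_c))  # same direction: distance 0.0
```

Because only the angle between the vectors matters, scaling a vector does not change the distance, which is why `v_a` and `v_c` come out as identical.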

### 6: Query vector database for relevant data

In [20]:
query_text = "Your Academic and Professional Background: Tell us about yourself, including your educational and work background in detail. If you have any special skills that may be relevant, please let us know about those too. The more detail, the better! Answer in first person"

In [21]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


In [22]:
# Use same embedding function as before
embedding_function = OpenAIEmbeddings()
 
# Prepare the database
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

# Search the DB.
results = db.similarity_search_with_relevance_scores(query_text, k=3)
if len(results) == 0 or results[0][1] < 0.7:
    print("Unable to find matching results.")

Unable to find matching results.
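The check above only prints a message, so the cells that follow still build a prompt from results that failed the threshold. A minimal sketch of one way to make the check actionable is shown below. The helper name `select_context`, the plain `(text, score)` tuples standing in for `(Document, score)` pairs, and the example scores are all assumptions for illustration.

```python
def select_context(results, min_score=0.7):
    """Return the result texts, or None when there are no results or the best match is below min_score.

    `results` is assumed to be sorted best-first, as (text, score) pairs.
    """
    if not results or results[0][1] < min_score:
        return None
    return [text for text, _score in results]


# Hypothetical relevance scores
weak = [("chunk about travel", 0.62), ("chunk about education", 0.60)]
strong = [("chunk about education", 0.81), ("chunk about work", 0.74)]

print(select_context(weak))    # None
print(select_context(strong))  # ['chunk about education', 'chunk about work']
```

A caller can then skip the LLM call when the helper returns `None`, or lower `min_score` if a long, multi-part query like the one here rarely clears 0.7 against short chunks.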


In [23]:
from langchain.prompts import ChatPromptTemplate

In [24]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)
print(prompt)

Human: 
Answer the question based only on the following context:

models.
•Self Elected Career Break Around the world
Travelled to 16 countries Oct. 2022 - Mar. 2024
◦Highlights : Everest Base Camp Three Passes Trek, Salkantay Trek to Machu Picchu, Ha Giang Motorbike Loop,
Deep Water Free Solo Climbing in Ha Long Bay, Surfing, Scuba Diving.
•Advai London, UK
Machine Learning Researcher Sep. 2021 - Oct. 2022

---

resulting in additional business from returning customers for the company.
Education
•University of Leeds West Yorkshire, UK
Master of Science in Data Science and Analytics; Grade: 88.2% . Sep. 2020 – Sep. 2021
◦Relevant Courses : Machine Learning, Data Science, Artificial Intelligence, Python Programming, Data Mining
and Text Analytics, Statistical Theory and Methods, Statistical Learning.

---

◦Prototyping : Integrated LLM system into proof-of-concept applications using Streamlit.
◦Continual Learning : Delivered presentations and reports on Distribution Drift Metrics for De

In [25]:
from langchain.chat_models import ChatOpenAI

In [26]:
model = ChatOpenAI()
response_text = model.predict(prompt)

sources = [doc.metadata.get("source", None) for doc, _score in results]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)


Response: I completed my Master of Science in Data Science and Analytics at the University of Leeds in West Yorkshire, UK, achieving a grade of 88.2%. During my studies, I focused on courses such as Machine Learning, Data Science, Artificial Intelligence, Python Programming, Data Mining, and Text Analytics, as well as Statistical Theory and Methods.

After graduation, I worked as a Machine Learning Researcher at Advai in London, UK, from September 2021 to October 2022. During my time there, I focused on prototyping and continual learning, integrating a LLM system into proof-of-concept applications using Streamlit and delivering presentations and reports on Distribution Drift Metrics for Deep Learning models.

In addition to my academic and professional background, I also took a self-elected career break to travel around the world from October 2022 to March 2024. During this time, I visited 16 countries and engaged in various adventurous activities such as the Everest Base Camp Three Pa

In [27]:
response_text

'I completed my Master of Science in Data Science and Analytics at the University of Leeds in West Yorkshire, UK, achieving a grade of 88.2%. During my studies, I focused on courses such as Machine Learning, Data Science, Artificial Intelligence, Python Programming, Data Mining, and Text Analytics, as well as Statistical Theory and Methods.\n\nAfter graduation, I worked as a Machine Learning Researcher at Advai in London, UK, from September 2021 to October 2022. During my time there, I focused on prototyping and continual learning, integrating a LLM system into proof-of-concept applications using Streamlit and delivering presentations and reports on Distribution Drift Metrics for Deep Learning models.\n\nIn addition to my academic and professional background, I also took a self-elected career break to travel around the world from October 2022 to March 2024. During this time, I visited 16 countries and engaged in various adventurous activities such as the Everest Base Camp Three Passes 