# Imports

In [1]:
import os
import sys
import numpy as np

sys.path.append("..")  # Add parent directory to Python path

from services.rag import RagService, RagConfig, Document
from services.chat import ChatService, ChatConfig

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/richardcollins/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# Constants

In [2]:
cwd = os.getcwd()
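# Notebooks do not define __file__, so the literal string "__file__" is resolved
# against the working directory; its dirname is therefore the notebook's folder.
abs_base_path = os.path.dirname(os.path.abspath("__file__"))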

# Code

## RagService

Let's run each step of the RAG service in turn and check its output.

In [3]:
rag_service = RagService()

### Loading Markdown

In [4]:
print(rag_service.load_markdown(abs_base_path + "/../thoughts/recent-thought.md"))

title: New thought to test out my portfolio website date: 2024-11-11 00:00:00+00:00 draft: False

So, this is where I'm going to write my thoughts. I'm going to write about my experiences and what I've learned. I'm going to write about my projects and what I'm working on. I'm going to write about my life and what I'm thinking. This is a 2nd-level header I can write some more stuff here. And this is a 3rd-level header. I can write even more stuff here! I'll write a little bit more in this section. What about some bullet points? I can write bullet points I can write more bullet points I can write even more bullet points With nesting? I can write bullet points I can write more bullet points I can write even more bullet points And what about a code block? python
print("Hello, world!") Let's white a more complicated one: python
def print_hello():
    print("Hello, world!") Now we're cooking with gas!


### Loading Docs

In [5]:
print(rag_service.load_docs())
rag_service.documents

/Users/richardcollins/portfolio-v1/services/../cv
/Users/richardcollins/portfolio-v1/services/../thoughts
{'/Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md': 'name: Richard Collins title: Senior Data Scientist email: placeholder@email.com location: Tokyo, Japan\n\nRichard Collins Senior Data Scientist Professional Summary With over eight years of experience as a data scientist, I specialize in developing sophisticated forecasting solutions using machine learning and artificial intelligence. My expertise lies in transforming complex data into actionable insights that drive business decisions and operational efficiency. I have particular expertise in weather forecasting and risk mitigation for transport and logistics sectors. Core Competencies Machine Learning & AI Statistical Analysis Data Visualization Predictive Modeling Business Intelligence Weather Forecasting Risk Analysis Python Programming Big Data Technologies Professional Experience Senior Data Scientist | 

{'/Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md': 'name: Richard Collins title: Senior Data Scientist email: placeholder@email.com location: Tokyo, Japan\n\nRichard Collins Senior Data Scientist Professional Summary With over eight years of experience as a data scientist, I specialize in developing sophisticated forecasting solutions using machine learning and artificial intelligence. My expertise lies in transforming complex data into actionable insights that drive business decisions and operational efficiency. I have particular expertise in weather forecasting and risk mitigation for transport and logistics sectors. Core Competencies Machine Learning & AI Statistical Analysis Data Visualization Predictive Modeling Business Intelligence Weather Forecasting Risk Analysis Python Programming Big Data Technologies Professional Experience Senior Data Scientist | Current Company 2021 - Present - Lead development of weather-related disruption prediction models for tran

### Create chunks

In [15]:
file_path = "/Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md"
content = rag_service.documents[file_path]
metadata = {
    "source": str(file_path),
    "type": "markdown" if file_path.endswith(".md") else "txt",
}
doc_chunks = rag_service.create_chunks(content, metadata)
doc_chunks


[Document(content='name: Richard Collins title: Senior Data Scientist email: placeholder@email.com location: Tokyo, Japan\n\nRichard Collins Senior Data Scientist Professional Summary With over eight years of experience as a data scientist, I specialize in developing sophisticated forecasting solutions using machine learning and artificial intelligence. My expertise lies in transforming complex data into actionable insights that drive business decisions and operational efficiency.', metadata={'source': '/Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md', 'type': 'markdown'}, embeddings=None, id=None),
 Document(content='I have particular expertise in weather forecasting and risk mitigation for transport and logistics sectors.', metadata={'source': '/Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md', 'type': 'markdown'}, embeddings=None, id=None),
 Document(content='Core Competencies Machine Learning & AI Statistical Analysis Data Visualization Pred
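The chunks above break on sentence boundaries. As a rough illustration only (not the actual `create_chunks` implementation, which this notebook does not show), here is a minimal sketch of a chunker that groups sentences up to a size limit. The `Chunk` class, the regex-based `split_sentences` helper, and the `max_chars` parameter are all hypothetical names introduced for this example; the real service may well rely on NLTK's tokenizer instead.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    # Hypothetical stand-in for the notebook's Document class.
    content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[int] = None


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def create_chunks(text: str, metadata: dict, max_chars: int = 200) -> List[Chunk]:
    # Greedily pack whole sentences into a chunk until adding one more
    # would exceed max_chars; a single long sentence becomes its own chunk.
    chunks: List[Chunk] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, dict(metadata)))
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, dict(metadata)))
    return chunks
```

Every chunk gets its own copy of the metadata, so later edits to one chunk's metadata do not leak into the others.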

All seems to work fine. The next step is to parse the docs, which combines the methods above with computing and saving embeddings. I don't need to do this if the rag_service is initialised with the vector store already created, so I'll skip it for now.
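The "skip if the store already exists" logic can be sketched as a small load-or-build helper. This is an assumption about the general pattern, not the service's actual code: `toy_embed` is a hypothetical stand-in for a real embedding model, and `load_or_build_embeddings` and the `.npy` cache file are invented for the example.

```python
import os
import tempfile
from typing import Callable, List, Tuple

import numpy as np


def toy_embed(texts: List[str], dim: int = 8) -> np.ndarray:
    # Hypothetical embedding: deterministic character-count vectors.
    vectors = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for ch in text.lower():
            vectors[row, ord(ch) % dim] += 1.0
    return vectors


def load_or_build_embeddings(
    path: str,
    texts: List[str],
    embed: Callable[[List[str]], np.ndarray] = toy_embed,
) -> Tuple[np.ndarray, bool]:
    # Reuse the saved embeddings when the cache file exists;
    # otherwise compute them once and save them for next time.
    # The path should end in ".npy", since np.save appends it otherwise.
    if os.path.exists(path):
        return np.load(path), False
    embeddings = embed(texts)
    np.save(path, embeddings)
    return embeddings, True


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.npy")
    _, built = load_or_build_embeddings(cache, ["first chunk", "second chunk"])
    _, rebuilt = load_or_build_embeddings(cache, ["first chunk", "second chunk"])
    print(built, rebuilt)  # True False
```

A real cache would also need invalidating when the source documents change; this sketch only checks that the file exists.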

### Query embeddings

In [8]:
query = "Tell me the details of the Manimflow project."
query_embedding = rag_service.embedding_model.encode([query])[0]
distances, indices = rag_service.index.search(
    np.array([query_embedding], dtype=np.float32), rag_service.config.top_k
)
print(distances)
print(indices)

relevant_chunks = []
for dist, idx in zip(distances[0], indices[0]):
    # Convert L2 distance to similarity score (inverse relationship)
    similarity = 1 / (1 + dist)
    print(similarity)
    if similarity >= rag_service.config.similarity_threshold:
        chunk = next((c for c in rag_service.chunks if c.id == idx), None)
        if chunk:
            relevant_chunks.append((chunk, similarity))
print(relevant_chunks)

context_parts = []
for chunk, similarity in relevant_chunks:
    source = chunk.metadata.get("source", "unknown")
    context_parts.append(
        f"[Source: {source}] (Similarity: {similarity:.2f})\n{chunk.content}"
    )

"\n\n".join(context_parts)

[[1.3557386 1.4403323 1.5598363]]
[[2 8 1]]
0.4244953082195524
0.4097802592947259
0.39064998505364507
[(Document(content='Core Competencies Machine Learning & AI Statistical Analysis Data Visualization Predictive Modeling Business Intelligence Weather Forecasting Risk Analysis Python Programming Big Data Technologies Professional Experience Senior Data Scientist | Current Company 2021 - Present - Lead development of weather-related disruption prediction models for transport and logistics\n- Design and implement routing optimization algorithms incorporating weather risk factors\n- Manage end-to-end machine learning projects from conception to deployment\n- Collaborate with cross-functional teams to integrate predictive solutions into operations Data Scientist | Previous Company 2018 - 2021 - Developed and maintained sales forecasting models\n- Created business intelligence dashboards for executive decision-making\n- Led a team of 3 junior data scientists\n- Reduced forecast error by 35%

'[Source: /Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md] (Similarity: 0.42)\nCore Competencies Machine Learning & AI Statistical Analysis Data Visualization Predictive Modeling Business Intelligence Weather Forecasting Risk Analysis Python Programming Big Data Technologies Professional Experience Senior Data Scientist | Current Company 2021 - Present - Lead development of weather-related disruption prediction models for transport and logistics\n- Design and implement routing optimization algorithms incorporating weather risk factors\n- Manage end-to-end machine learning projects from conception to deployment\n- Collaborate with cross-functional teams to integrate predictive solutions into operations Data Scientist | Previous Company 2018 - 2021 - Developed and maintained sales forecasting models\n- Created business intelligence dashboards for executive decision-making\n- Led a team of 3 junior data scientists\n- Reduced forecast error by 35% through advanced mode
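To make the search and scoring step easier to reason about in isolation, here is a minimal NumPy sketch of the same flow: a brute-force L2 search that returns `(distances, indices)` in the same `(1, top_k)` shape as above, followed by the `1 / (1 + dist)` conversion and threshold filter. The `l2_search`, `to_similarity`, and `filter_hits` names are hypothetical and stand in for the real index. One caveat: if `rag_service.index` is a FAISS `IndexFlatL2` (an assumption, since the notebook does not show how the index is built), it returns squared L2 distances, whereas this sketch uses plain Euclidean distance.

```python
from typing import List, Tuple

import numpy as np


def l2_search(vectors: np.ndarray, query: np.ndarray, top_k: int):
    # Brute-force nearest neighbours: Euclidean distance from the query to
    # every stored vector, sorted ascending, truncated to top_k.
    dists = np.sqrt(((vectors - query) ** 2).sum(axis=1))
    order = np.argsort(dists)[:top_k]
    # Add a leading batch axis so the shapes are (1, top_k).
    return dists[order][None, :], order[None, :]


def to_similarity(distance: float) -> float:
    # Maps distance 0 to similarity 1, and large distances towards 0.
    return 1.0 / (1.0 + distance)


def filter_hits(distances, indices, threshold: float) -> List[Tuple[int, float]]:
    # Keep (index, similarity) pairs whose similarity meets the threshold.
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        similarity = to_similarity(float(dist))
        if similarity >= threshold:
            hits.append((int(idx), similarity))
    return hits


vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
query = np.array([0.0, 0.0], dtype=np.float32)
distances, indices = l2_search(vectors, query, top_k=2)
print(filter_hits(distances, indices, threshold=0.6))  # [(0, 1.0)]
```

Because the mapping is monotonic, the ranking is unchanged; only the scale differs. With the first distance printed above, `to_similarity(1.3557386)` gives roughly 0.42, matching the first similarity in the output.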

In [9]:
context = rag_service.get_relevant_context(query)
context


'[Source: /Users/richardcollins/portfolio-v1/services/../cv/richard_collins.md] (Similarity: 0.42)\nCore Competencies Machine Learning & AI Statistical Analysis Data Visualization Predictive Modeling Business Intelligence Weather Forecasting Risk Analysis Python Programming Big Data Technologies Professional Experience Senior Data Scientist | Current Company 2021 - Present - Lead development of weather-related disruption prediction models for transport and logistics\n- Design and implement routing optimization algorithms incorporating weather risk factors\n- Manage end-to-end machine learning projects from conception to deployment\n- Collaborate with cross-functional teams to integrate predictive solutions into operations Data Scientist | Previous Company 2018 - 2021 - Developed and maintained sales forecasting models\n- Created business intelligence dashboards for executive decision-making\n- Led a team of 3 junior data scientists\n- Reduced forecast error by 35% through advanced mode