## Repurposing previous local RAG playground, integrating Pinecone.

#### UTILS

In [38]:
def flatten(container):
    for i in container:
        if isinstance(i, (list,tuple)):
            yield from flatten(i)
        else:
            yield i


#### Environment variables

In [39]:
# importing os module for environment variables
import os
# import pandas as pd
# importing necessary functions from dotenv library
from dotenv import load_dotenv, dotenv_values

load_dotenv()

True
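
`os.getenv` returns `None` when a variable is missing, and that tends to surface later as a confusing error inside the loader or the Pinecone client. A minimal sketch of a fail-fast lookup follows. The helper name `require_env` and the demo variable are assumptions for illustration; neither is part of `dotenv` or of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demo with a throwaway variable so the snippet runs on its own.
os.environ["DEMO_MD_FILES_DIRECTORY"] = "data/notes"
demo_directory = require_env("DEMO_MD_FILES_DIRECTORY")
print(demo_directory)
```

In the notebook, the same helper could wrap the lookups for `MD_FILES_DIRECTORY` and `PINECONE_API_KEY`.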

#### Data Loading, Splitting
Load our Obsidian markdown notes. We'll split them first by headings to maintain a sense of structure that could be used in our metadata.

In [None]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_core.documents.base import Document


# MarkdownHeaderTextSplitter works on raw text rather than on Document objects,
# so this helper splits a document's content and copies its metadata onto each split
def split_md_by_headings(doc: Document, markdown_splitter: MarkdownHeaderTextSplitter) -> list[Document]:
    """Split the content of a document into smaller docs using a markdown header splitter."""
    initial_metadata : dict = doc.metadata
    header_splits : list[Document] = markdown_splitter.split_text(doc.page_content)
    # use a distinct loop variable so the `doc` parameter is not shadowed
    for split in header_splits:
        split.metadata.update(initial_metadata)
    return header_splits

# get directory with md files
MD_FILES_DIRECTORY : str = os.getenv("MD_FILES_DIRECTORY")
print(f"Markdown files directory: {MD_FILES_DIRECTORY}")

# Markdown splitter
headers_to_split_on : list[tuple] = [("#", "Header 1"),("##", "Header 2"),]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)

# we use a raw text loader instead of a markdown loader because the markdown loader strips the doc of all its headings
loader = DirectoryLoader(path=MD_FILES_DIRECTORY, glob="*.md", loader_cls=TextLoader, show_progress=True)
markdowns : list[Document] = loader.load()

# split our docs by heading
markdowns_split_headers : list[list[Document]] = [split_md_by_headings(doc, md_splitter) for doc in markdowns]
markdowns_split_headers : list[Document] = list(flatten(markdowns_split_headers))

for doc in markdowns_split_headers:
    print(doc.dict()['metadata'], len(doc.page_content))

In [41]:
len(markdowns_split_headers)

79
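
To make the idea of heading-based splitting concrete, here is a simplified pure-Python sketch of what such a splitter does: cut the text at each header line and attach the active headers as metadata. It is not LangChain's implementation (it ignores code fences, for example), and `split_by_headers` is a hypothetical name used only for this illustration.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Simplified header splitter returning dicts with 'metadata' and 'text'."""
    level_of = {name: len(marker) for marker, name in headers}
    # try longer markers first so that "##" is not mistaken for "#"
    ordered = sorted(headers, key=lambda header: len(header[0]), reverse=True)
    sections: list[dict] = []
    metadata: dict[str, str] = {}
    lines: list[str] = []

    def flush() -> None:
        # store the section collected so far, then start a new one
        text = "\n".join(lines).strip()
        if text:
            sections.append({"metadata": dict(metadata), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        for marker, name in ordered:
            if line.startswith(marker + " "):
                flush()
                # a new header closes every header of the same or deeper level
                for key in [k for k in metadata if level_of[k] >= len(marker)]:
                    del metadata[key]
                metadata[name] = line[len(marker):].strip()
                break
        lines.append(line)  # header lines stay in the text, as with strip_headers=False
    flush()
    return sections


toy_note = "# A\nintro\n## B\nbody b\n## C\nbody c"
toy_sections = split_by_headers(toy_note, [("#", "Header 1"), ("##", "Header 2")])
for section in toy_sections:
    print(section["metadata"], len(section["text"]))
```

Each section keeps the headers it sits under, which is the structure we later carry into the Pinecone metadata.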

We haven't dealt with token limits yet. We'll experiment with different chunk size limits, again going through a helper function so that the heading metadata is preserved (just in case). For now the limit is expressed in characters rather than tokens.

In [42]:
from langchain_text_splitters import CharacterTextSplitter

def split_md_by_chunk(doc: Document, text_splitter: CharacterTextSplitter) -> list[Document] :
    """Split the content of a document into smaller docs using a text splitter."""
    initial_metadata : dict = doc.metadata
    chunk_splits : list[Document] = text_splitter.create_documents([doc.page_content])
    # use a distinct loop variable so the `doc` parameter is not shadowed
    for chunk in chunk_splits:
        chunk.metadata.update(initial_metadata)
    return chunk_splits

chunk_size : int = 512

character_text_splitter = CharacterTextSplitter(
    separator = ".",
    chunk_size = chunk_size,
    chunk_overlap  = 20
)

# split our docs by chunk size
markdowns_split_chunks : list[list[Document]] = [split_md_by_chunk(doc, character_text_splitter) for doc in markdowns_split_headers]
markdowns_split_chunks : list[Document] = list(flatten(markdowns_split_chunks))
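# count chunks over the limit (the original `[_ for doc in ...]` only worked because `_` exists in a notebook session)
oversized_count : int = sum(len(doc.page_content) > chunk_size for doc in markdowns_split_chunks)
print(f"Final number of chunks: {len(markdowns_split_chunks)}\n"
      f"Number of chunks bigger than our {chunk_size} character limit: {oversized_count}")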




Final number of chunks: 290
Number of chunks bigger than our 512 character limit: 5
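
A few chunks end up over the limit because a separator-based splitter can only cut where the separator occurs: a single sentence longer than `chunk_size` has nowhere to be cut. The sketch below illustrates that behaviour with a simplified greedy merge. It is not LangChain's implementation, it leaves out `chunk_overlap`, and `merge_pieces` is a hypothetical helper name.

```python
def merge_pieces(text: str, separator: str = ".", chunk_size: int = 512) -> list[str]:
    """Greedy separator-based chunking (simplified, no overlap handling)."""
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else f"{current}{separator} {piece}"
        if current and len(candidate) > chunk_size:
            # adding this piece would exceed the limit: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Short one. " + "x" * 40 + ". Another short one."
toy_chunks = merge_pieces(sample, chunk_size=30)
oversized = [chunk for chunk in toy_chunks if len(chunk) > 30]
print(len(toy_chunks), len(oversized))
```

The 40-character run has no separator inside it, so it stays as one oversized chunk. A fallback on a second separator, or a recursive splitter, would be one way to deal with such cases.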


#### Generate embeddings, store in dataframe
We'll use Nomic Embed, an open-source embedding model that shows good results on the Massive Text Embedding Benchmark (MTEB) compared to other small embedding models like `text-embedding-3-small`. We log in using `nomic login` in our terminal and set `NOMIC_API_KEY` as an environment variable. In the future, we'll try to use our own local LLaMa 3 as an embedding model, using "[LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)" as a basis.

In [None]:
from langchain_nomic.embeddings import NomicEmbeddings

dimensionality : int = 768
embed_model = NomicEmbeddings(model="nomic-embed-text-v1.5", dimensionality=dimensionality) # not quite clear yet what specific dimensionality I have to work with
embed_model.embed_query("My query to look up")

In [44]:
# generate embedding from a text
def get_embeddings(text: str) -> list[float]:
    """ take a string and return embeddings in the form of a vector of floats"""
    return embed_model.embed_query(text)

We streamline the generation of a dataframe based on our list of langchain documents.

In [45]:
import pandas as pd
def document_list_to_dataframe(docs: list[Document]) -> pd.DataFrame :
    """ extract data from list of documents and store in dataframe ('embeddings', 'metadata'(text inside here)) """
    columns = ['embeddings', 'metadata']
    rows : list[dict] = []

    for doc in docs:
        # get document data
        text : str = doc.page_content
        # copy the metadata so the original document is left untouched
        doc_metadata : dict = dict(doc.metadata)
        # generate embeddings
        embeds : list[float] = get_embeddings(text)
        # insert text in metadata dict
        doc_metadata['text'] = text
        rows.append({'embeddings': embeds, 'metadata': doc_metadata})

    # build the dataframe once instead of concatenating row by row;
    # ids are derived later from the row position, in get_pinecone_dicts
    return pd.DataFrame(rows, columns=columns)

In [46]:
def get_pinecone_dicts(df: pd.DataFrame) -> list[dict]:
    """ dataframe data into list of dicts to upsert in Pinecone index """
    dicts = df.to_dict(orient='records')

    pinecone_dicts = []
    for i, df_dict in enumerate(dicts):
        pc_dict = {
            'id': str(i),
            'values': df_dict['embeddings'],
            'metadata': df_dict['metadata']
        }
        pinecone_dicts.append(pc_dict)
    return pinecone_dicts

In [47]:
df = document_list_to_dataframe(markdowns_split_chunks[:int(len(markdowns_split_chunks)*0.7)])
df.head()

| | embeddings | metadata |
|---|---|---|
| 0 | [0.002105713, 0.01939392, -0.18383789, -0.0800... | {'source': 'data\computer_science_notes\Advanc... |
| 1 | [0.016067505, 0.020019531, -0.171875, -0.01194... | {'Header 1': 'Components of modern object dete... |
| 2 | [0.026245117, 0.10760498, -0.17236328, -0.0895... | {'Header 1': 'Components of modern object dete... |
| 3 | [0.018203735, 0.09442139, -0.1652832, -0.06201... | {'Header 1': 'Components of modern object dete... |
| 4 | [0.029800415, 0.09063721, -0.18261719, -0.0787... | {'Header 1': 'Components of modern object dete... |


In [48]:
vector_dicts = get_pinecone_dicts(df)

### Getting the Pinecone index instance and executing the data ingestion

In [49]:
from pinecone import Pinecone, ServerlessSpec

PINECONE_API_KEY : str = os.getenv("PINECONE_API_KEY")
pc = Pinecone(api_key=PINECONE_API_KEY)

In [None]:
index_name : str = "markdown-notes"

# only create the index if it does not exist yet, so the cell can be re-run safely
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dimensionality, # must match the dimensionality of the embedding model
        metric="cosine", # similarity metric used to score matches
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

Connect to the index.

In [51]:
index = pc.Index(index_name)  # `Pinecone.Index` is a method, not a type, so no annotation here
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1}},
 'total_vector_count': 1}

Upsert the data to Pinecone.

In [52]:
index.upsert(vectors=vector_dicts)

{'upserted_count': 203}
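
A single call is fine for our 203 vectors, but upsert requests are size-limited, so batching is the safer habit for larger note collections. Below is a minimal sketch of a batching helper. The name `batched` and the batch size of 100 are assumptions chosen for illustration, and the toy records stand in for `vector_dicts` so the snippet runs without a Pinecone connection.

```python
from typing import Iterator


def batched(items: list[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of `items` containing at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy records with the same shape as the dicts built by get_pinecone_dicts.
toy_records = [{"id": str(i), "values": [0.0], "metadata": {}} for i in range(203)]
batch_sizes = [len(batch) for batch in batched(toy_records)]
print(batch_sizes)  # [100, 100, 3]

# With the real index, each batch would be sent separately:
# for batch in batched(vector_dicts):
#     index.upsert(vectors=batch)
```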

Check our vector count.

In [54]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 204}},
 'total_vector_count': 204}

Testing index with a query.

In [55]:
query = "batch size"
embedded_query = embed_model.embed_query(query)
index.query(
    vector=embedded_query,
    top_k=3,
    include_values=False,
    include_metadata=True
)

{'matches': [{'id': '45',
              'metadata': {'source': 'data\\computer_science_notes\\Batch '
                                     'size.md',
                           'text': 'Advantages of using a batch size < number '
                                   'of all samples:  \n'
                                   '- It requires less memory. Since you train '
                                   'the network using fewer samples, the '
                                   'overall training procedure requires less '
                                   "memory. That's especially important if you "
                                   'are not able to fit the whole dataset in '
                                   "your machine's memory.  \n"
                                   '- Typically networks train faster with '
                                   "mini-batches. That's because we update the "
                                   'weights after each propagation'},
              'score': 0.7
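
Since the index was created with `metric="cosine"`, the `score` of each match is the cosine similarity between the query embedding and the stored vector. The brute-force sketch below shows that computation and the `top_k` selection on toy two-dimensional vectors. It only illustrates the scoring; a real index relies on far more efficient search structures, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(vector_id, cosine_similarity(query, vector)) for vector_id, vector in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_matches = top_k([1.0, 0.0], toy_vectors, k=2)
print(toy_matches)
```

Vector `a` points in the same direction as the query and gets the highest score, `c` comes second, and the orthogonal `b` is left out.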

### Create chain

We'll use our local LLaMa 3.

In [58]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# MODEL = "gpt-3.5-turbo"
MODEL = "llama3:8b"

In [61]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

# test openai and llama message response
if MODEL.startswith('gpt'):
    model = ChatOpenAI(api_key=OPENAI_API_KEY, model=MODEL)
    embeddings = OpenAIEmbeddings()
else:
    model = Ollama(model=MODEL,
                   keep_alive=-1, # a negative value keeps the model loaded; a plain number is read as seconds
                   temperature=0,
                   )
    embeddings = OllamaEmbeddings(model="llama3:8b")

model.invoke("Why is the sky blue ?")

"What a great question!\n\nThe short answer: The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. Here's what happens:\n\n1. **Sunlight**: When sunlight enters Earth's atmosphere, it contains all the colors of the visible spectrum (red, orange, yellow, green, blue, indigo, and violet).\n2. **Molecules**: The atmosphere is made up of tiny molecules of gases like nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light.\n3. **Scattering**: When sunlight hits these molecules, it scatters in all directions. This scattering effect is more pronounced for shorter wavelengths (like blue and violet) than longer wavelengths (like red and orange).\n4. **Blue dominance**: As a result of this scattering, the shorter wavelengths (blue and violet) are distributed throughout the sky, making it appear blue to our eyes.\n\nThe longer wavelengths (red and orange) continue to travel in a more direct pa

We'll keep the parsing module and everything the same.

In [62]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
# pipe the output of our model into the input of our parser
chain = model | parser
# from now on we invoke the chain rather than the model
chain.invoke("Why is the sky blue ?.")

"The color of the sky can appear different depending on the time of day, atmospheric conditions, and the observer's location. However, under normal conditions, the sky typically appears blue to our eyes because of a phenomenon called Rayleigh scattering.\n\nRayleigh scattering is the scattering of light by small particles or molecules in the atmosphere, such as nitrogen (N2) and oxygen (O2). These gases are much smaller than the wavelength of visible light, so they scatter shorter (blue) wavelengths more efficiently than longer (red) wavelengths. This means that when sunlight enters Earth's atmosphere, the blue and violet colors are scattered in all directions by these tiny molecules, reaching our eyes from all parts of the sky.\n\nHere's a simplified explanation:\n\n1. Sunlight contains all the colors of the visible spectrum (ROY G BIV: red, orange, yellow, green, blue, indigo, and violet).\n2. When sunlight enters the atmosphere, it encounters tiny molecules of gases like N2 and O2.\

We create a prompt template encompassing the context to give the model.

In [74]:
from langchain.prompts import PromptTemplate

template="""
You are a virtual assistant strictly designed to provide knowledge based on provided context from a database of documents.
Answer the question based on the context below.
If you can't answer the question, reply 'I don't know'.

Context: {context}
Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


You are a virtual assistant strictly designed to provide knowledge based on provided context from a database of documents.
Answer the question based on the context below.
If you can't answer the question, reply 'I don't know'.

Context: Here is some context
Question: Here is a question



To pass this prompt to our model, we expand upon our chain.

In [64]:
chain = prompt | model | parser

In [65]:
chain.input_schema.schema()

{'title': 'PromptInput',
 'type': 'object',
 'properties': {'context': {'title': 'Context', 'type': 'string'},
  'question': {'title': 'Question', 'type': 'string'}}}

In [66]:
# to invoke a chain we need to understand the structure of our prompt template, which can be seen on the input schema above
chain.invoke(
    {
        "context": "In the 1960s, the rock scene was an effervescent field of talented and eccentric musicians brought up through the hippie movement. Jimi Hendrix, Eric Clapton, Jimmy Page are a few guitarists of that era who experienced a lot of success. Jimi Hendrix is still considered to be the best guitarist of all time.",
        "question": "Who is the greatest guitarist ever ?",
    }
)

'According to the context, Jimi Hendrix is still considered to be the best guitarist of all time.'
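
The `|` operator simply feeds the output of one step into the next. The toy sketch below mimics that composition in plain Python to show the data flow from prompt to model to parser. The `Step` class and the `toy_*` names are assumptions for illustration and are not LangChain's `Runnable` implementation; the "model" is a stand-in that echoes its input.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable step."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then passes its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_model = Step(lambda text: {"content": f"echo -> {text}"})  # pretend LLM
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_prompt | toy_model | toy_parser
toy_answer = toy_chain.invoke({"context": "some context", "question": "a question"})
print(toy_answer)
```

The dict passed to `invoke` has to carry the keys the first step expects, which is what the input schema above describes for the real chain.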

To feed the relevant information from our documents to the chain as context, we'll use our Pinecone index. LangChain provides a Pinecone vector store integration, `PineconeVectorStore`.

In [70]:
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(
    index_name=index_name,
    embedding=embed_model,
)

A retriever is a LangChain component that fetches the documents relevant to a query, here from a vector store (retrievers can also be backed by other sources). Calling `invoke` returns the top k documents closest to the input query.

In [71]:
retriever = vectorstore.as_retriever()
retriever.invoke("statistical test")

[Document(page_content='#### Statistical test\n- Statistics is to help make decisions based on quantifiable uncertainties\n- A hypothesis test contains a **null hypothesis** (no difference between data) and an **alternative hypothesis** (difference between data), difference based on a **critical value**, a benchmark\n- A hypothesis test can test the following:\n- One variable against another (such as in a t-test)\n- Multiple variables against one variable (for example, linear regression)\n- Multiple variables against multiple variables (for example, MANOVA)  \n### Sampling strategies  \nTwo types of sampling methods:  \n#### Probability sampling  \n- A sample is chosen based on a theory of probability, or randomly with random selection (every member has the same chance)  \n4 types of probability sampling', metadata={'Header 1': 'Part 1, An introduction to statistics', 'Header 2': 'Chapter 1, Sampling and Generalization', 'source': 'data\\computer_science_notes\\Building Statistical Mod

Our Pinecone vector store is working as intended: the documents relevant to our query are correctly retrieved.

In [73]:
# itemgetter('question') creates a callable that returns the value stored under the 'question' key of the object it is called with,
# here the dict passed through invoke
from operator import itemgetter

# dict given to the prompt is a Runnable that generates a map with context and question
# context comes from our retriever, which receives a 'question' item
chain = ({
    "context": itemgetter("question") | retriever,
    "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
# the 'question' item is supplied through invoke when calling the chain
chain.invoke({"question": "What are the different types of probability sampling ?"})

'Based on the provided context, there are four types of probability sampling mentioned:\n\n1. **Simple random sampling**: every member has an equal chance.\n2. **Systematic sampling**: based on a fixed interval, choose a random numbered data point and select the rest of the data along the interval.\n\nThese two types of probability sampling are mentioned in the context as part of the "Probability sampling" section.'
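
To see what the dict step at the head of the chain produces, here is a plain-Python sketch of the same mapping: both keys are computed from the one input dict, with `context` going through a retriever first. `fake_retriever` and its two toy notes are hypothetical stand-ins for the Pinecone retriever.

```python
from operator import itemgetter


def fake_retriever(query: str) -> list[str]:
    """Return the toy notes whose keyword appears in the query."""
    notes = {
        "sampling": "Probability sampling picks members at random.",
        "batch": "Smaller batches need less memory.",
    }
    return [text for keyword, text in notes.items() if keyword in query.lower()]


get_question = itemgetter("question")
chain_input = {"question": "What is probability sampling?"}

# Equivalent of the dict step: each key is computed from the same input.
prompt_variables = {
    "context": fake_retriever(get_question(chain_input)),
    "question": get_question(chain_input),
}
print(prompt_variables)
```

The resulting dict has exactly the `context` and `question` keys the prompt template expects.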

We'll fire off a few questions and evaluate our RAG setup.

In [75]:
questions = [
    "How does learning rate affect training ?",
    "What are some of the different object detection techniques/architectures ?",
    "What does ANN stand for ?",
    "Explain cross-entropy and its different forms.",
    "What is the use of a statistical test ?",
    "Are there notes relevant to a book in our documents ? Who wrote it ?",
]

for question in questions:
    print(f"Question: \n\n{question}")
    print(f"Answer : \n{chain.invoke({'question': question})}")

Question: 
How does learning rate affect training ?
Answer : 
Based on the provided context, I can answer that varying the learning rate can impact the training process in the following ways:

* If the learning rate is too high, the model may overfit by a larger amount (as seen in the example with two hidden layers).
* If the learning rate is too low, the model may not learn as well as when there were no hidden layers.

Additionally, it's mentioned that small input values can lead to drastic weight changes, which highlights the importance of choosing an appropriate learning rate.
Question: 
What are some of the different object detection techniques/architectures ?
Answer : 
Based on the provided context, I can identify the following object detection techniques/architectures mentioned:

1. R-CNN (Region-based Convolutional Neural Networks)
2. Fast R-CNN
3. SSD (Single Shot Detector)
4. YOLO (You Only Look Once)
5. U-Net
6. Mask-RCNN
7. Detectron2

These are some of the modern object det

We can still notice the lack of sophistication in our local LLaMa 3's responses. The model tends to rely too much on direct quoting, which may be why it references image paths without recognising that they can't be used in this context. As always, model quality is one side of the equation, the other being prompt quality.

In [76]:
for s in chain.stream({'question': 'What is the project JumperCV ?'}):
    print(s, end="", flush=True)

Based on the provided context, I can answer that:

The project JumperCV is a computer vision project focused on detecting and tracking players in basketball videos. The project involves Multiple Object Tracking (MOT) as its central objective.

Pleasantly surprised by this one.

We have successfully integrated a Pinecone vector store into our RAG setup. This gives us a faster and less expensive retrieval method, now that we don't have to reload and split our documents every time. The index is also easy to expand and to reuse in other tools and projects if needed.

Next time we'll build upon this base by looking at LangGraph. This tool lets us orchestrate an agentic workflow around our LLM chain by describing it as a state graph. We'll try to implement a router that decides whether the provided context is valid and helpful and, if not, uses function calling to complement its knowledge with the internet. We'll monitor all of this through LangSmith, and finally test our RAG on an actual frontend through the LangServe Chat Playground (which also requires conversational memory). In general, we'll keep versioning these notebooks with different experiments in agentic workflows. I will also turn these into actual scripts and use them as a module.
