The given RAG setup is best suited to Q&A, and possibly to summarization.

Post creation would need some training data.

# Import Libraries

In [26]:
from dataclasses import dataclass
import ollama
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.chroma import Chroma
import os
import shutil

# Creating a Vector Database with Chroma

In [27]:
def load_documents(data_path):
    loader = DirectoryLoader(data_path, glob="*.txt")
    documents = loader.load()
    return documents

def save_to_chroma(chunks: list[Document], chroma_path):
    # Clear out the database first.
    if os.path.exists(chroma_path):
        shutil.rmtree(chroma_path)

    # Create a new DB from the documents.
    # Alternative using OpenAI embeddings (requires an OpenAI API key):
    # db = Chroma.from_documents(
    #     chunks, OpenAIEmbeddings(), persist_directory=chroma_path
    # )
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = Chroma.from_documents(chunks, embeddings, persist_directory=chroma_path)
    db.persist()
    print(f"Saved {len(chunks)} chunks to {chroma_path}.")

In [28]:
# Each vector is one whole document (one text file)
documents = load_documents("data/gen-ai-topic1-corpus")
save_to_chroma(documents, "chroma/document-level")

In [4]:
for document in documents:
    print("CONTENT:")
    print(document.page_content)
    print("META DATA:")
    print(document.metadata)
    print("----------")

CONTENT:
Journal Entry: March 5, 2021 Researcher: Dr. Elena M. Voss Location: Global Ecology Research Center, Geneva

Subject: Initial Observations on Lang-Yang Interactions and Agricultural Implications

Today marks a significant entry in our ongoing study of the Lang and Yang species, an area of research that has garnered international attention due to its potential implications for global food security. Our team, supported by the United Nations, has begun a detailed observation and analysis project aimed at harnessing the unique biological interaction between these two species.

During preliminary observations, we confirmed the long-suspected toxic effect of Yang flesh on Langs, a poignant reminder of nature’s complexities. Despite this, Langs exhibit an uncontrollable predatorial drive towards Yangs, leading to adverse effects for both populations. Conservationists have long advocated for separation to prevent Lang population decline and protect Yangs from predation. However, this 

In [5]:
def split_documents_into_chunks_text_splitter(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=300,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    return chunks
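
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch in plain Python. It is a simplification, not what RecursiveCharacterTextSplitter does internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical and exists only for this illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=300):
    """Return (start_index, chunk) pairs from a fixed-size window with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 2400  # stand-in for a document of 2400 characters
windows = sliding_window_chunks(sample)
starts = [start for start, _ in windows]
lengths = [len(chunk) for _, chunk in windows]
print(starts, lengths)
```

With a window of 1000 and an overlap of 300, each new chunk starts 700 characters after the previous one, so neighbouring chunks share 300 characters of context.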

In [6]:
# Each vector is a chunk of a document
chunks = split_documents_into_chunks_text_splitter(documents)
save_to_chroma(chunks, "chroma/text-splitter-chunks")



Saved 103 chunks to chroma/text-splitter-chunks.


In [7]:
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}:")
    print("Text:", chunk.page_content)
    print("Metadata:", chunk.metadata)
    print("----------")

Chunk 1:
Text: Journal Entry: March 5, 2021 Researcher: Dr. Elena M. Voss Location: Global Ecology Research Center, Geneva

Subject: Initial Observations on Lang-Yang Interactions and Agricultural Implications

Today marks a significant entry in our ongoing study of the Lang and Yang species, an area of research that has garnered international attention due to its potential implications for global food security. Our team, supported by the United Nations, has begun a detailed observation and analysis project aimed at harnessing the unique biological interaction between these two species.
Metadata: {'source': 'data\\gen-ai-topic1-corpus\\a.txt', 'start_index': 0}
----------
Chunk 2:
Text: During preliminary observations, we confirmed the long-suspected toxic effect of Yang flesh on Langs, a poignant reminder of nature’s complexities. Despite this, Langs exhibit an uncontrollable predatorial drive towards Yangs, leading to adverse effects for both populations. Conservationists have long a

Each vector is one line (paragraph) of a document.
Lines are separated by line breaks.

Ex. 
"Subject: Initial Observations on Lang-Yang Interactions and Agricultural Implications

Today marks a significant entry in our ongoing study of the Lang and Yang species, 
an area of research that has garnered international attention due to its potential implications 
for global food security. Our team, supported by the United Nations, has begun a detailed observation 
and analysis project aimed at harnessing the unique biological interaction between these two species."

From "Subject" to "Implications" is one vector.

From "Today" to "species" is another vector.
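
The example above can be reproduced with plain strings. This is a small sketch only: the second paragraph is shortened here, and the variable names are made up for the illustration.

```python
example = (
    "Subject: Initial Observations on Lang-Yang Interactions and Agricultural Implications\n"
    "\n"
    "Today marks a significant entry in our ongoing study of the Lang and Yang species."
)

# Keep (line_number, text) for every non-empty line. Blank lines are skipped
# but still counted, which is why the line numbers can jump.
line_chunks = [
    (line_number, line)
    for line_number, line in enumerate(example.split("\n"))
    if line.strip()
]

for line_number, line in line_chunks:
    print(line_number, line[:40])
```

The two non-empty lines become two separate chunks, and the blank line between them is dropped.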

In [8]:
def split_documents_into_chunks_line_breaks(documents):
    chunks = []
    for document in documents:
        content = document.page_content.strip()  # Remove leading and trailing whitespace
        document_lines = content.split("\n")  # Split the content into lines
        start_index = document.metadata.get("start_index", 0)  # Starting offset from metadata; defaults to 0
        for index, line in enumerate(document_lines, start=start_index):
            if line.strip():  # Skip empty lines
                # Create a new document for each line, keeping the original metadata
                chunk_metadata = document.metadata.copy()  # Copy original metadata
                chunk_metadata["start_index"] = index  # Note: this is a line number, not a character offset
                chunk = Document(page_content=line, metadata=chunk_metadata)
                chunks.append(chunk)
    return chunks

In [9]:
# Each vector is one line; lines are separated by line breaks
chunks2 = split_documents_into_chunks_line_breaks(documents)
save_to_chroma(chunks2, "chroma/line-breaks-chunks")

Saved 285 chunks to chroma/line-breaks-chunks.


In [10]:
for i, chunk in enumerate(chunks2, start=1):
    print(f"Chunk {i}:")
    print("Text:", chunk.page_content)
    print("Metadata:", chunk.metadata)
    print("----------")

Chunk 1:
Text: Journal Entry: March 5, 2021 Researcher: Dr. Elena M. Voss Location: Global Ecology Research Center, Geneva
Metadata: {'source': 'data\\gen-ai-topic1-corpus\\a.txt', 'start_index': 0}
----------
Chunk 2:
Text: Subject: Initial Observations on Lang-Yang Interactions and Agricultural Implications
Metadata: {'source': 'data\\gen-ai-topic1-corpus\\a.txt', 'start_index': 2}
----------
Chunk 3:
Text: Today marks a significant entry in our ongoing study of the Lang and Yang species, an area of research that has garnered international attention due to its potential implications for global food security. Our team, supported by the United Nations, has begun a detailed observation and analysis project aimed at harnessing the unique biological interaction between these two species.
Metadata: {'source': 'data\\gen-ai-topic1-corpus\\a.txt', 'start_index': 4}
----------
Chunk 4:
Text: During preliminary observations, we confirmed the long-suspected toxic effect of Yang flesh on Langs, 

Topic 2

In [29]:
documents = load_documents("data/gen-ai-topic2-corpus")
save_to_chroma(documents, "chroma/topic2")

FileNotFoundError: Directory not found: 'data/gen-ai-topic2-corpus'

# Retrieval/Search Function

In [11]:
# Load the embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Load the Chroma database

# Try each database in turn
# CHROMA_PATH = "chroma/document-level" 
# CHROMA_PATH = "chroma/text-splitter-chunks" 
# CHROMA_PATH = "chroma/line-breaks-chunks" 

db_document = Chroma(persist_directory="chroma/document-level", embedding_function=embeddings)
db_text_split = Chroma(persist_directory="chroma/text-splitter-chunks", embedding_function=embeddings)
db_line_break = Chroma(persist_directory="chroma/line-breaks-chunks", embedding_function=embeddings)

query_text = "What is a Miracle Bloom?"  # alternative: "How do Yangs help Langs"

# Embed the query text
query_embedding = embeddings.embed_query(query_text)

print(query_text)

print(query_embedding)

What is a Miracle Bloom?
[-0.03950143977999687, 0.050984323024749756, 0.05271872133016586, 0.06438369303941727, 0.001560885226354003, 0.048373524099588394, 0.03456668183207512, 0.013166631571948528, 0.019304821267724037, -0.030096152797341347, 0.025243759155273438, 0.0020562452264130116, -0.0029756559524685144, -0.02562563866376877, -0.056218452751636505, 0.006749733351171017, -0.00487288273870945, 0.014873439446091652, -0.061436932533979416, -0.014177974313497543, 0.05215786024928093, 0.054758600890636444, -0.023889552801847458, 0.0766400545835495, 0.008014826104044914, -0.018364302814006805, -0.05636786296963692, -0.033319685608148575, -0.0038634527008980513, -0.0708850547671318, 0.04671166464686394, 0.04347731173038483, 0.02499168924987316, -0.04539630562067032, -0.024490198120474815, 0.04228159785270691, 0.009297734126448631, 0.00042022165143862367, 0.1224050521850586, -0.017475923523306847, 0.048305340111255646, -0.08795861154794693, -0.03837502375245094, 0.004887367598712444, -0.
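
The query embedding is what gets compared against the stored chunk embeddings. As a rough illustration of that comparison, here is a sketch that ranks three toy vectors by cosine similarity. The vectors and names are invented, and the metric is an assumption for the example: the distance metric the vector store actually uses, and the way it is turned into a relevance score, depend on its configuration and are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


toy_query = [1.0, 0.0, 1.0]
toy_chunks = {
    "chunk_a": [1.0, 0.0, 1.0],  # same direction as the query
    "chunk_b": [1.0, 1.0, 0.0],  # partially aligned with the query
    "chunk_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Most similar chunk first
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(toy_query, toy_chunks[name]),
    reverse=True,
)
print(ranked)
```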

In [12]:
#document level

# Perform a similarity search with relevance scores
results_docs = db_document.similarity_search_with_relevance_scores(query_text, k=5)

# Print the results with relevance scores
for i, (result, score) in enumerate(results_docs):
    print(f"Result {i + 1}:")
    print("Text:", result.page_content)
    print("Metadata:", result.metadata)
    print("Relevance Score:", score)
    print("----------")

Result 1:
Text: Lamu (Floribunda miraculum)

Classification:

Kingdom: Plantae

Phylum: Angiosperms

Class: Eudicots

Order: Lamiales

Family: Lamaceae

Genus: Floribunda

Species: F. miraculum

Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.

Growth and Development: Lamu plants are hardy and can thrive in a range of soil types, though they prefer well-drained, fertile soil and partial shade conditions. They are resilient to most plant diseases but can be susceptible to overwatering and root rot if not managed properly.

Ecological Role: The Lamu plant plays a cr



In [13]:
# text_split

# Perform a similarity search with relevance scores
results_text_splits = db_text_split.similarity_search_with_relevance_scores(query_text, k=5)

# Print the results with relevance scores
for i, (result, score) in enumerate(results_text_splits):
    print(f"Result {i + 1}:")
    print("Text:", result.page_content)
    print("Metadata:", result.metadata)
    print("Relevance Score:", score)
    print("----------")

Result 1:
Text: Lamu (Floribunda miraculum)

Classification:

Kingdom: Plantae

Phylum: Angiosperms

Class: Eudicots

Order: Lamiales

Family: Lamaceae

Genus: Floribunda

Species: F. miraculum

Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.

Growth and Development: Lamu plants are hardy and can thrive in a range of soil types, though they prefer well-drained, fertile soil and partial shade conditions. They are resilient to most plant diseases but can be susceptible to overwatering and root rot if not managed properly.
Metadata: {'source': 'data\\gen-ai-topic1-c



In [14]:
# line_break

# Perform a similarity search with relevance scores
results_line_breaks = db_line_break.similarity_search_with_relevance_scores(query_text, k=5)

# Print the results with relevance scores
for i, (result, score) in enumerate(results_line_breaks):
    print(f"Result {i + 1}:")
    print("Text:", result.page_content)
    print("Metadata:", result.metadata)
    print("Relevance Score:", score)
    print("----------")

Result 1:
Text: Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.
Metadata: {'source': 'data\\gen-ai-topic1-corpus\\g.txt', 'start_index': 18}
Relevance Score: 0.4287678306231161
----------
Result 2:
Text: The intricate relationship between Langs (Canis mythicus), Yangs (Ovis mystica), and the Lamu plant (Floribunda miraculum) can be seen through various religious and spiritual lenses. This symbiotic interaction, where the urine of Langs enhances the nutritional value of the Lamu plant, which in turn is consumed by Yangs to produce a powerful natural fertilizer, re

# Augment

In [15]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

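# Note: the results above were retrieved for "What is a Miracle Bloom?".
# Re-run the similarity searches with the new query_text so the context matches this question.
query_text = "How do Yangs help Langs"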
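
The augment step is plain string assembly: the retrieved chunks are joined with a separator and substituted into the template together with the question. Here is a dependency-free sketch of that idea. `SKETCH_TEMPLATE` and `build_prompt` are hypothetical names, and unlike ChatPromptTemplate this version does not prepend a "Human:" role label.

```python
SKETCH_TEMPLATE = (
    "Answer the question based only on the following context:\n\n"
    "{context}\n\n---\n\n"
    "Answer the question based on the above context: {question}"
)


def build_prompt(chunk_texts, question, separator="\n\n---\n\n"):
    """Join retrieved chunk texts and fill them into the template."""
    context = separator.join(chunk_texts)
    return SKETCH_TEMPLATE.format(context=context, question=question)


sketch_prompt = build_prompt(
    ["first retrieved chunk", "second retrieved chunk"],
    "How do Yangs help Langs",
)
print(sketch_prompt)
```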

In [16]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results_docs])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt_docs = prompt_template.format(context=context_text, question=query_text)
print(prompt_docs)

Human: 
Answer the question based only on the following context:

Lamu (Floribunda miraculum)

Classification:

Kingdom: Plantae

Phylum: Angiosperms

Class: Eudicots

Order: Lamiales

Family: Lamaceae

Genus: Floribunda

Species: F. miraculum

Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.

Growth and Development: Lamu plants are hardy and can thrive in a range of soil types, though they prefer well-drained, fertile soil and partial shade conditions. They are resilient to most plant diseases but can be susceptible to overwatering and root rot if not managed pro

In [17]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results_text_splits])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt_text_splits = prompt_template.format(context=context_text, question=query_text)
print(prompt_text_splits)

Human: 
Answer the question based only on the following context:

Lamu (Floribunda miraculum)

Classification:

Kingdom: Plantae

Phylum: Angiosperms

Class: Eudicots

Order: Lamiales

Family: Lamaceae

Genus: Floribunda

Species: F. miraculum

Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.

Growth and Development: Lamu plants are hardy and can thrive in a range of soil types, though they prefer well-drained, fertile soil and partial shade conditions. They are resilient to most plant diseases but can be susceptible to overwatering and root rot if not managed pro

In [18]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results_line_breaks])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt_line_breaks = prompt_template.format(context=context_text, question=query_text)
print(prompt_line_breaks)

Human: 
Answer the question based only on the following context:

Physical Characteristics: The Lamu plant, also known as the Miracle Bloom, is characterized by its lush, verdant foliage and vibrant blue flowers that bloom twice annually. It typically reaches a height of 0.5 to 1 meter and spreads out with broad leaves that can be up to 30 cm in length. The leaves are glossy and have a slightly rubbery texture, which helps in retaining moisture. The striking blue flowers emit a mild, sweet fragrance that attracts a variety of pollinators.

---

The intricate relationship between Langs (Canis mythicus), Yangs (Ovis mystica), and the Lamu plant (Floribunda miraculum) can be seen through various religious and spiritual lenses. This symbiotic interaction, where the urine of Langs enhances the nutritional value of the Lamu plant, which in turn is consumed by Yangs to produce a powerful natural fertilizer, resonates deeply with themes of interconnectedness, stewardship, and divine balance fo

# Generate

In [20]:
response = ollama.chat(model='mistral', messages=[
        {
            'role': 'user',
            'content': prompt_docs,
        },
    ])

response_text = response['message']['content']

sources = [doc.metadata.get("source", None) for doc, _score in results_docs]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)

Response:  Yangs, in the context provided, help Langs by consuming the enhanced Lamu plant, which has been affected by the urine of Langs. The digestion process by Yangs results in the production of a highly effective natural fertilizer that enriches the soil, thereby benefiting the growth of the Lamu plant, which is essential for the survival and nutrition of Langs. Thus, Yangs indirectly help Langs by facilitating the growth of their food source.
Sources: ['data\\gen-ai-topic1-corpus\\g.txt', 'data\\gen-ai-topic1-corpus\\s.txt', 'data\\gen-ai-topic1-corpus\\l.txt', 'data\\gen-ai-topic1-corpus\\m.txt', 'data\\gen-ai-topic1-corpus\\u.txt']


In [21]:
response = ollama.chat(model='mistral', messages=[
        {
            'role': 'user',
            'content': prompt_text_splits,
        },
    ])

response_text = response['message']['content']

sources = [doc.metadata.get("source", None) for doc, _score in results_text_splits]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)

Response:  In the given context, it is not explicitly stated how Yangs help Langs directly. However, the text does describe a symbiotic relationship between Langs (Canis mythicus), Yangs (Ovis mystica), and the Lamu plant (Floribunda miraculum). The interaction among these three entities can be seen as indirectly benefiting Langs through the growth and development of the Lamu plant.

The Lamu plant, which is consumed by Yangs, has its nutritional value enhanced by the urine of Langs. This improved nutritional content then benefits Yangs when they consume the plant, as it provides them with essential nutrients required for their growth and survival. The resulting excretions from the Yangs can then serve as a natural fertilizer, enriching the soil and helping the Lamu plants to grow even stronger, which in turn may indirectly provide benefits for Langs by ensuring an ample food source.

The specific ways Langs benefit from this relationship are not detailed in the given context, but the 

In [22]:
response = ollama.chat(model='mistral', messages=[
        {
            'role': 'user',
            'content': prompt_line_breaks,
        },
    ])

response_text = response['message']['content']

sources = [doc.metadata.get("source", None) for doc, _score in results_line_breaks]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)

Response:  The text does not provide information about how Yangs help Langs directly. However, it mentions that the symbiotic relationship between Langs (Canis mythicus), Yangs (Ovis mystica), and the Lamu plant (Floribunda miraculum) exists. Specifically, the urine of Langs enhances the nutritional value of the Lamu plant, which is then consumed by Yangs to produce a powerful natural fertilizer. This interaction could indirectly benefit Langs by providing them with food and shelter near areas where the Lamu plant grows, as the improved growth of the plant might attract more pollinators and potentially create a richer environment for both species.
Sources: ['data\\gen-ai-topic1-corpus\\g.txt', 'data\\gen-ai-topic1-corpus\\s.txt', 'data\\gen-ai-topic1-corpus\\f.txt', 'data\\gen-ai-topic1-corpus\\j.txt', 'data\\gen-ai-topic1-corpus\\g.txt']
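
The three generate cells above differ only in the prompt and the result list, so they could be folded into one helper. The sketch below is one possible shape, not part of the original notebook: `answer_with_sources` is a hypothetical name, and `chat_fn` is assumed to accept `model=` and `messages=` and return a mapping shaped like the `ollama.chat` responses used above. A stub stands in for the model so the example runs without Ollama.

```python
from types import SimpleNamespace


def answer_with_sources(prompt, results, chat_fn, model="mistral"):
    """Send the prompt to the chat function and append the source list."""
    response = chat_fn(model=model, messages=[{"role": "user", "content": prompt}])
    response_text = response["message"]["content"]
    sources = [doc.metadata.get("source") for doc, _score in results]
    return f"Response: {response_text}\nSources: {sources}"


def fake_chat(model, messages):
    """Stub standing in for a real chat call (assumption for this demo)."""
    return {"message": {"content": f"[{model}] saw {len(messages)} message(s)"}}


# One fake (document, score) pair shaped like the similarity search results
fake_results = [(SimpleNamespace(metadata={"source": "a.txt"}), 0.9)]
demo_output = answer_with_sources("prompt text", fake_results, chat_fn=fake_chat)
print(demo_output)
```

In the notebook this would be called as, for example, `answer_with_sources(prompt_docs, results_docs, chat_fn=ollama.chat)`.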


# Things to look into


Different LLMs

Different embedding models

Different chunk sizes
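
One way to organise these comparisons is to enumerate the combinations up front. This is a sketch only. "mistral", "sentence-transformers/all-MiniLM-L6-v2" and a chunk size of 1000 come from this notebook; the other values are illustrative placeholders, not recommendations.

```python
from itertools import product

llm_names = ["mistral", "llama3"]  # second entry is a placeholder
embedding_models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",  # placeholder
]
chunk_sizes = [500, 1000, 2000]  # 500 and 2000 are placeholders

# One dict per configuration to try
experiments = [
    {"llm": llm, "embedding_model": emb, "chunk_size": size}
    for llm, emb, size in product(llm_names, embedding_models, chunk_sizes)
]
print(len(experiments))
print(experiments[0])
```

Each entry could then drive one rebuild of the Chroma database and one run of the same query, so the answers can be compared side by side.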

Helpful Resources:
https://www.youtube.com/watch?v=tcqEUSNCn8I
https://github.com/RamiKrispin/ollama-poc