In [1]:
# parameter cell, do not remove!!
# nb_parm='datalake|raw/pdf|Birddiversityanddistribution|pdf||300|150'
# nb_parm='datalake|raw/text-csv|PFW_spp_translation_table_May2024|csv||300|150'
nb_parm='llmnok'
question = 'how many bird species are in migratory?'
embed_model = "mxbai-embed-large" 
gen_model = "deepseek-r1:7b"
collection = "Bridknowledge"

In [2]:
import sys
import os

sys.path.append("/home/jovyan/notebooks")
from Framework.module import Utility

## Do the task after this

In [3]:
print("nb_parm:", nb_parm)
print("question:", question)
print("embed_model:", embed_model)
print("gen_model:", gen_model)
print("collection:", collection)

nb_parm: llmnok
question: how many bird species are in migratory?
embed_model: mxbai-embed-large
gen_model: deepseek-r1:7b
collection: Bridknowledge


## Import modules

In [2]:
import bs4
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama.llms import OllamaLLM
from langchain_community.vectorstores import Weaviate
from weaviate.classes.query import MetadataQuery, Filter
import weaviate
import pdfplumber
import io
import pandas as pd

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Retrieval

In [13]:
client = Utility.registerClient()
retriever = client.collections.get(collection)
response = retriever.query.bm25(  # keyword (BM25) search; no embedding model involved
    # query="Conclusion",
    query="migratory",
    limit=3,
    query_properties=["content"],
    return_metadata=MetadataQuery(score=True),
)
client.close()  # the response is already in memory, so the connection can be closed
docs = [obj.properties["content"] for obj in response.objects]
for i, doc in enumerate(docs, start=1):
    print(f"context {i}: {doc}\n\n")

context 1: INTRODUCTION According to Choudhury (2007), Bhutan has been fairly well covered by ornithological surveys and the entire country forms a part of Eastern Himalaya Endemic Bird area (Stattersfield et al.,1998). As per United Nations Environment Program [UNEP] and Convention on Conservation of Migratory Species [CCMS](UNEP2009), 9856 bird species are recorded worldwide. Out of which 1855 species are migratory, 262 species are seabirds, 343 species are altitudinal migrants, 181 species are nomadic and 1593 species are migratory land and waterbirds. Over 800 species of birds are estimated to be found in Bhutan of which frequent numbers of winter visitors, such as migrant thrushes are found in addition to 450 species of resident birds (Sherpa2000). Forest is the most significant habitat for birds by supporting around 75% of all bird species while only 45% of all bird species have adapted to humans modified habitats (Birdlife International2008).Human activities such as farming, set
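
The `bm25` query above ranks passages by keyword overlap rather than by embeddings. As a rough illustration of that scoring idea, here is a minimal pure-Python sketch of the classic BM25 formula. It is not Weaviate's implementation, which may tokenize and weight properties differently. The function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.2` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula.

    Assumes `documents` is a non-empty list of non-empty strings and
    uses naive lowercase whitespace tokenization (an illustration only).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


sample_docs = [
    "migratory birds cross the himalaya every winter",
    "resident birds stay in the valley all year",
    "forest is the most significant habitat",
]
scores = bm25_scores("migratory", sample_docs)
print(scores)
```

Only the first sample passage contains the query term, so it is the only one with a non-zero score. This is also why a single-keyword query such as `"migratory"` can miss passages that discuss migration in other words.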

## Prompt

In [9]:
prompt = PromptTemplate.from_template(
    """
    Answer the question based only on the context.
    Context:{context}
    
    Question: {question}
    
    Answer:
    """
)

## LLM

In [10]:
llm = OllamaLLM(
    model=gen_model,  # "deepseek-r1:7b", set in the parameter cell
    temperature=0,
    base_url="http://host.docker.internal:11434",
)

## Post-processing and chain and question

In [11]:
# Post-processing
def format_docs(docs):
    """Join the retrieved passages (plain strings) into one context string."""
    # return "\n\n".join(doc.page_content for doc in docs)
    return "\n\n".join(docs)

# Chain
rag_chain = (
    {
        # The context is fixed by the retrieval step above; the chain input is ignored here.
        "context": RunnableLambda(lambda _: format_docs(docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

# Question
result = rag_chain.invoke(question)
print(f"\nFinal Answer: {result}")


Final Answer: <think>
Alright, so I need to figure out how many bird species are considered migratory based on the provided context. Let me read through the text carefully.

The introduction mentions that Bhutan has been well-covered by ornithological surveys and is part of the Eastern Himalaya Endemic Bird Area. It also talks about various categories like endemics, seabirds, altitudinal migrants, nomadic species, and migratory land and waterbirds. 

Looking further down, under "MATERIALS AND METHODS," there's a study area in Bhutan called Punatshangchhu, which is part of Bavi National Park. The text also lists several references discussing bird populations.

In the context provided, I see that UNEP (2009) states there are 1855 migratory species worldwide out of 9856 total bird species. Additionally, the Bhutan-specific data mentions over 800 species with frequent winter visitors like migrant thrushes and resident birds totaling 450.

So putting this together, the number of migratory 
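
The answer above begins with the model's `<think>` reasoning block. One way to keep only the final answer is to strip that block after generation instead of relying on the prompt. The following is a minimal sketch under that assumption. The helper name `strip_think` is hypothetical, and it treats an unclosed `<think>` (for example a truncated response) as reasoning text to be dropped.

```python
import re

# Matches "<think> ... </think>", or "<think> ..." up to the end of the text
# when the closing tag is missing (for example a truncated response).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Return the text with any <think> reasoning block removed."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe context says 1855 species are migratory.\n</think>\n\n1855"
print(strip_think(raw))
```

In a chain like the one above, such a helper could be appended after `StrOutputParser()` by wrapping it as `RunnableLambda(strip_think)`.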

In [5]:
prompt = PromptTemplate.from_template(
"""
Answer the question with **only the final output**. Do not include any explanations or <think> tag. Return just answer.

Question: {question}

Final Answer:
"""
)

llm = OllamaLLM(
    model="deepseek-r1:7b",
    temperature=0,
    base_url="http://host.docker.internal:11434" 
)


rag_chain = (
    {"question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
result = rag_chain.invoke("""
age		cust_age
48		10
49		12
36		11
46		5
35		8
37	    5

can you tell different between these two column
""")
print(f"\nFinal Answer: {result}")


Final Answer: <think>
Alright, so I've got this table here with two columns: "age" and "cust_age". The values under "age" are 48, 49, 36, 46, 35, and 37. Under "cust_age", the numbers are 10, 12, 11, 5, 8, and 5 respectively.

First off, I notice that both columns have numerical values, but they don't seem to be directly related in an obvious way at first glance. The "age" column has higher numbers compared to the "cust_age" column for most entries except one where cust_age is also high (12). So maybe there's a pattern or relationship between them.

I wonder if "cust_age" represents something like customer age, but that doesn't make much sense because typically, a customer's age would be similar to their own age. Maybe it's an error in the data entry? Or perhaps it's a different metric altogether, like the age of a product or another related attribute.

Looking at each pair:

- 48 vs. 10: A significant difference here.
- 49 vs. 12: Also quite a jump from the first one.
- 36 vs. 11: Lo
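
For a question like this one, a deterministic baseline can help when checking the model's answer. The sketch below uses only the six rows from the prompt above and the standard library. The variable names are illustrative, and it makes no claim about what `cust_age` actually means.

```python
from statistics import mean

# The six rows passed to the model in the prompt above.
age = [48, 49, 36, 46, 35, 37]
cust_age = [10, 12, 11, 5, 8, 5]

# Basic per-column summary.
summary = {
    name: {"min": min(values), "max": max(values), "mean": mean(values)}
    for name, values in (("age", age), ("cust_age", cust_age))
}

# Row-wise difference between the two columns.
row_diff = [a - c for a, c in zip(age, cust_age)]

for name, stats in summary.items():
    print(name, stats)
print("age - cust_age per row:", row_diff)
```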