<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/LLM_Chatbot_for_Beekeeping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **RAG-based Beekeeping Chatbot for Indian Beekeepers**

**Installing Dependencies:**

In [1]:
!pip install -qU \
     transformers \
     datasets \
     langchain \
     huggingface_hub \
     tiktoken \
     faiss-gpu \
     python-dotenv \
     accelerate \
     einops \
     bitsandbytes \
     unstructured unstructured[pdf] \
     pypdf \
     rapidocr-onnxruntime
!pip install -qU sentence_transformers
!pip install -qU numpy

**Imports:**

In [2]:
import os
import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import notebook_login
pd.set_option("display.max_colwidth", None)
notebook_login()

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Document Loading:**

In [4]:
from langchain_community.document_loaders import PDFPlumberLoader

In [5]:
loader = PDFPlumberLoader("/content/drive/MyDrive/Floral Calendar.pdf")

In [6]:
document = loader.load()

In [7]:
print(len(document))

420


In [8]:
document[70]

Document(page_content='Guntur District\nJan Feb Mar Apr May June July Aug Sep Oct Nov Dec\nPlant Name\nP N P N P N P N P N P N P N P N P N P N P N P N\nAnacardium occidentale(Geedimamidi)\nBrassica(Avalu)\nCajanus cajan (Kandulu)\nCicer arietinum (Shanagalu)\nCitrus aurantifolia(Nimma)\nCitrus sinensis (Batavia)\nCocos nucifera (Kobbarichettu)\nCucumis sativus(Dosakaya)\nDelonix regia(Sankesulu)\nDolichos lab-lab(Anumulu)\nEmblica officinalis (Vusiri)\nEucalyptus spp.(Nilgiri)\nHygrophila auriculata()\nLuffa spp.(Beera)\nMoringa oleifera(Mulaga)\nOryza sativa (Vari)\nPeltophorum pterocarpum(Konda chinta)\nPennesetum typhoides (Sajja)\nPhoenix sylvestris(Eetha)\nRecinus communis(Amudalu)\nSapindus emarginatus(Kunkudu)\nSorghum vulgare (Jonna)\nTamarindus indica (Chintachettu)\nVigna mungo (Minumulu)\nVigna radiata(Pesalu)\nVigna trilobata(Pillipesara)\nZea mays (Mokkajhona)\nSesamum indicum (Nuvvulu)\nBorassus flabellifer(Tadi)\nPsidium guajava (Jama)\nZizyphus spp. (Reghu, pariki,Gotti

In [9]:
document[70].page_content

'Guntur District\nJan Feb Mar Apr May June July Aug Sep Oct Nov Dec\nPlant Name\nP N P N P N P N P N P N P N P N P N P N P N P N\nAnacardium occidentale(Geedimamidi)\nBrassica(Avalu)\nCajanus cajan (Kandulu)\nCicer arietinum (Shanagalu)\nCitrus aurantifolia(Nimma)\nCitrus sinensis (Batavia)\nCocos nucifera (Kobbarichettu)\nCucumis sativus(Dosakaya)\nDelonix regia(Sankesulu)\nDolichos lab-lab(Anumulu)\nEmblica officinalis (Vusiri)\nEucalyptus spp.(Nilgiri)\nHygrophila auriculata()\nLuffa spp.(Beera)\nMoringa oleifera(Mulaga)\nOryza sativa (Vari)\nPeltophorum pterocarpum(Konda chinta)\nPennesetum typhoides (Sajja)\nPhoenix sylvestris(Eetha)\nRecinus communis(Amudalu)\nSapindus emarginatus(Kunkudu)\nSorghum vulgare (Jonna)\nTamarindus indica (Chintachettu)\nVigna mungo (Minumulu)\nVigna radiata(Pesalu)\nVigna trilobata(Pillipesara)\nZea mays (Mokkajhona)\nSesamum indicum (Nuvvulu)\nBorassus flabellifer(Tadi)\nPsidium guajava (Jama)\nZizyphus spp. (Reghu, pariki,Gotti)\nMajor Pollen Minor 

**Document Preprocessing:**

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
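# Separators are tried in order, from coarse to fine. The first five are written as
# regex patterns, but RecursiveCharacterTextSplitter matches separators literally
# unless is_separator_regex=True is passed, so here they only match verbatim text.
MARKDOWN_SEPARATORS = [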
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=MARKDOWN_SEPARATORS,
)

In [12]:
splits = text_splitter.split_documents(document)
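
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window chunker. It is an illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on separators first, so its chunk boundaries will generally differ.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=50):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    Returns a list of (start_index, chunk) pairs.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_chunks("x" * 2500)
print([(start, len(chunk)) for start, chunk in demo])
# [(0, 1000), (950, 1000), (1900, 600)]
```

The start offsets play the same role as the `start_index` metadata that `add_start_index=True` attaches to each split.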

In [13]:
splits[30].page_content

'Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec\nDistrict\nP N P N P N P N P N P N P N P N P N P N P N P N\nLATUR\nतुवर\nCajanus cajan ( )\n1 1 1 1\nकुसुम\nCarthamus tinctorious ( )\n1 1 1 1\nचना\nCicer arietinum ( )\n1 1 1 1\nका(cid:426)ी िनबू\nCitrus aurantifolia ( )\n1 2 1 2 1 2 1 2\nस(cid:361)ा\nCitrus reticulata ( )\n1 2\nमोसबी\nCitrus sinensis ( )\n1 2\nआवला\nEmblica officinalis ( )\n2 1 2 1\nिनलिगरी\nEucalyptus spp. ( )\n2 2 2 2\nरामतील\nGuizotia abyssinica ( )\n2 2 2 2\nसुयफुल\nHelianthus annuus ( )\n2 2 2 2 2 2\nशेवगा\nMoringa oleifera ( )\n2 2 2 2\nकेला\nMusa paradisica ( )\n1 1 1 1\nचावल\nOryza sativa ( )\n1 0 1 0\nअद(cid:304)(cid:566)\nParkinsonia aculeata ( )\n2 1 2 1 2 1\nबाजरा\nPennesetum typhoides ( )\n1 0 1 0\nिशदी\nPhoenix sylvestris ( )\n2 0 2 0\nवेडी बाभूळ\nProsopis juliflora ( )\n1 2 1 2 1 2\nअम(cid:349)द\nPsidium guvajava ( )\n2 1 2 1\nकडधा(cid:586)\nPulses ( )\n1 1 1 1\nतील\nSesamum indicum ( )\n2 2 2 2\n(cid:476)ार\nSorghum vulgare ( )\n2 1 2 1 2 1 2 1\nमुग\n

**Embedding Generation and Storing:**

In [14]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embedding_model = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [15]:
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

vectorstore = FAISS.from_documents(
    splits, embedding_model, distance_strategy=DistanceStrategy.COSINE
)
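
Because the embeddings are created with `normalize_embeddings=True`, every vector has unit length. The sketch below uses made-up 2-D vectors (not real embeddings) to show two consequences: the dot product of unit vectors equals their cosine similarity, and their squared L2 distance equals `2 - 2 * cosine`. Whether a given score is a similarity or a distance depends on how the vector store is configured, so treat the second identity as an aid for reading scores, not as a statement about what FAISS returns here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors chosen for easy arithmetic (hypothetical, not model output)
a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(cosine(a, b))                       # cosine similarity of the raw vectors
print(dot(ua, ub))                        # same value from the unit vectors
print(squared_l2(ua, ub), 2 - 2 * cosine(a, b))  # the two distance forms agree
```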

**Similarity Search:**

In [16]:
def get_similar_docs(query, k=3, score=True):
    """Return the top-k chunks for `query`, paired with scores when `score` is True."""
    if score:
        return vectorstore.similarity_search_with_score(query, k=k)
    return vectorstore.similarity_search(query, k=k)

query = "When can I take my bees to sunflower fields?"
similar_docs = get_similar_docs(query)

print("\n==================================Top document==================================")
print(similar_docs[0])


(Document(page_content='Floral Calendar of Sundargarh District\nJan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\nBotanical Name\nMangifera indica\n(Mango)\nAnoguises latifolia\n(Axle wood)\nLagerstroemia\nparviflorea\nShorea robusta\n(Sal)\nBombax ceiba\n(Silk cotton)\nEmblica officinale\n(Amla)\nCareya arborea\n(Kumba)\nDillenia indica\n(\nMadhuca latifolia\n(Mahua)\nTerminalia sp.\nPongamia pinnata\n(Karanj)\nZea mays\n(Makka)\nTamarindus indica\n(Tamarind)\nPhaseolus sp.\n(Mung/Udid)\nGuizotia abyssinica\n(Niger)\nNectar: Minor Medium Major\nPollen: Minor Medium Major\n©Copyright of CBRTI, KVIC', metadata={'source': '/content/drive/MyDrive/Floral Calendar.pdf', 'file_path': '/content/drive/MyDrive/Floral Calendar.pdf', 'page': 252, 'total_pages': 420, 'Author': 'Chandra, Gyan', 'CreationDate': "D:20200206105404+05'30'", 'Creator': 'Adobe Acrobat Pro 11.0.15', 'ModDate': "D:20200206105404+05'30'", 'Producer': 'Adobe Acrobat Pro 11.0.15', 'Title': '', 'start_index': 0}), 0.37775224)
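
The raw tuple above is hard to scan. A small helper can reduce each hit to its page number, score, and first line, as sketched below. `FakeDoc` and `summarize_hits` are hypothetical names introduced for illustration; `FakeDoc` only mimics the `page_content` and `metadata` attributes seen in the output above so that the sketch runs without a vector store.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_hits(hits, preview_chars=40):
    """Turn (document, score) pairs into compact dictionaries."""
    rows = []
    for doc, score in hits:
        first_line = doc.page_content.split("\n", 1)[0][:preview_chars]
        rows.append(
            {
                "page": doc.metadata.get("page"),
                "score": round(float(score), 3),
                "preview": first_line,
            }
        )
    return rows


# One hit, abbreviated from the output above
hits = [
    (FakeDoc("Floral Calendar of Sundargarh District\nJan Feb Mar", {"page": 252}), 0.37775224)
]
for row in summarize_hits(hits):
    print(row)
```

Under these assumptions, calling `summarize_hits(similar_docs)` on the real results should show at a glance which district each retrieved page covers.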


In [17]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

**Model Loading and Quantization:**

In [18]:
from torch import cuda, bfloat16
import transformers
model_id = 'meta-llama/Llama-2-7b-chat-hf'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

In [19]:
# BnB Configuration
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
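
As a rough guide to why 4-bit loading helps, the arithmetic below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, and the KV cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal parameter count (assumption)
for label, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gb(n_params, bits):.1f} GB")
# fp16: ~14.0 GB
# int8: ~7.0 GB
# nf4: ~3.5 GB
```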

In [20]:
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)

In [21]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)

In [22]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
)

In [37]:
# Set up text generation pipeline
generate_text = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=1024,
)
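
The pipeline samples with `temperature=0.2`. The sketch below illustrates, on made-up logits, how dividing logits by a temperature below 1 concentrates probability on the most likely token. It is a simplified picture of one sampling step and leaves out `repetition_penalty` and any other logit processing.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy values (assumption), one per candidate token
for temperature in (1.0, 0.2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```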

In [38]:
generate_text("What is 21 * 3 + 4 ? Answer:")

[{'generated_text': ' 21 * 3 = 63, and then adding 4 gives us 67. everybody loves math problems!'}]

In [39]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)
# Baseline without retrieval: the model answers from its own knowledge only
llm.invoke("When can I take my bees to sunflower fields?")

"\n Hinweis: This article is for informational purposes only and should not be considered as professional advice. Always consult a professional before making any decisions regarding your beekeeping operations.\nSunflowers are one of the most popular crops in North America, providing both food and oil for human consumption. As a beekeeper, you may wonder when it's best to take your bees to sunflower fields. The answer depends on several factors, including the time of year, the location, and the type of sunflower crop. Here are some general guidelines to consider:\n\n1. Timing: Sunflowers typically bloom from mid-summer to early fall, depending on the variety and weather conditions. Bees are most active during these times, so it's best to take them to sunflower fields during this period. In warmer climates, sunflowers may bloom earlier, while in cooler climates, they may bloom later.\n2. Location: Sunflowers are grown throughout North America, but different varieties may have different b

**Prompting for Better Answer:**

In [40]:
# Define the chat-style prompt template
prompt_in_chat_format = [
    {
        "role": "system",
        "content": "Using the information contained in the context, provide a comprehensive answer to the question. Respond only to the question asked, ensuring your response is concise and relevant. When applicable, provide the source document number.",
    },
    {
        "role": "user",
        "content": "Context: {context}\n---\nQuestion: {question}",
    },
    # Only system, user, and assistant turns appear in the rendered Llama-2 prompt;
    # a message with another role (such as "chatbot") does not show up in it, so an
    # off-topic fallback instruction belongs in the system message.
]
# Render the messages into one prompt string (tokenize=False returns text, not token IDs)
RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
    prompt_in_chat_format, tokenize=False, add_generation_prompt=True
)
print(RAG_PROMPT_TEMPLATE)

<s>[INST] <<SYS>>
Using the information contained in the context, provide a comprehensive answer to the question. Respond only to the question asked, ensuring your response is concise and relevant. When applicable, provide the source document number.
<</SYS>>

Context: {context}
---
Question: {question} [/INST]
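
The printed template shows the layout the tokenizer produced for a single system plus user turn. The sketch below rebuilds that layout by hand so the role of each marker is visible. `build_llama2_prompt` is a hypothetical helper that only mirrors the printed output; `tokenizer.apply_chat_template` remains the more reliable way to build prompts, since it follows the template shipped with the model.

```python
def build_llama2_prompt(system, user):
    """Hand-built approximation of a single-turn Llama-2 chat prompt."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


template = build_llama2_prompt(
    system="Answer using only the context.",  # placeholder system message
    user="Context: {context}\n---\nQuestion: {question}",
)

# The {context} and {question} placeholders survive until .format() fills them
filled = template.format(
    context="Document 0:::\n...",
    question="Do bees feed on coriander plants?",
)
print(filled)
```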


**Inference:**

In [41]:
# Concatenate multiple queries
queries = [
    "When can I take my bees to sunflower fields?",
    "Which flowers will provide nectar to my bees in the upcoming month?",
    "What are the benefits of sunflower pollination for bees?",
    "Do bees feed on coriander plants?",
    "Which flowers/plants/crops will provide nectar and pollen to my bees in Bengaluru Urban during March?",
    "Can you identify the prominent pollen sources for bees in Bengaluru Urban during March?"
]
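combined_query = " ".join(queries)
# Note: similar_docs was retrieved in the Similarity Search step for the first
# query only, so this context does not cover the other questions.
retrieved_docs_text = [doc.page_content for doc, _score in similar_docs]
context = "\nExtracted documents:\n"
context += "".join([f"Document {i}:::\n{doc}\n" for i, doc in enumerate(retrieved_docs_text)])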

# Combine queries and context in the final prompt
final_prompt = RAG_PROMPT_TEMPLATE.format(question=combined_query, context=context)
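
The cell above sends all six questions in one prompt, with context that was retrieved for a single query. One alternative, sketched below, is to retrieve separately for each question and build one prompt per question. The helper names, the toy corpus, and the word-overlap retriever are all hypothetical stand-ins so that the example runs without a vector store or model.

```python
def build_context(docs):
    """Number the retrieved chunks in the same 'Document i:::' style."""
    parts = [f"Document {i}:::\n{doc}\n" for i, doc in enumerate(docs)]
    return "\nExtracted documents:\n" + "".join(parts)


def build_prompts(questions, retrieve, template, k=3):
    """Retrieve context per question and fill one prompt for each."""
    prompts = []
    for question in questions:
        docs = retrieve(question, k)
        prompts.append(template.format(question=question, context=build_context(docs)))
    return prompts


# Toy corpus (made-up text, not taken from the PDF)
toy_corpus = [
    "sunflower nectar is listed for some months",
    "coriander pollen is listed for some months",
]


def toy_retrieve(query, k):
    """Rank toy_corpus by the number of words shared with the query."""
    words = set(query.lower().replace("?", "").split())
    ranked = sorted(toy_corpus, key=lambda text: len(words & set(text.split())), reverse=True)
    return ranked[:k]


toy_template = "Context: {context}\n---\nQuestion: {question}"
prompts = build_prompts(
    ["When can I take my bees to sunflower fields?", "Do bees feed on coriander plants?"],
    toy_retrieve,
    toy_template,
    k=1,
)
print(prompts[1])
```

With the notebook's objects, `retrieve` could be a small wrapper that calls `vectorstore.similarity_search(question, k=k)` and returns each result's `page_content`, and `template` would be `RAG_PROMPT_TEMPLATE`.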

In [42]:
response = llm.invoke(final_prompt)
print(response)

  Based on the provided documents, here are the answers to your questions:

1. When can I take my bees to sunflower fields?
According to Document 2, sunflower fields are expected to bloom in March in Rayagad district. However, please note that the blooming period may vary depending on the location and weather conditions.
2. Which flowers/plants/crops will provide nectar to my bees in the upcoming month?
In March, the following flowers/plants/crops are expected to provide nectar to bees in the areas mentioned in the documents:
	* Sunflower (Helianthus annuus) - Document 2
	* Coriander (Coriandrum sativum) - None of the documents mention coriander as a nectar source for bees.
3. What are the benefits of sunflower pollination for bees?
Sunflower pollination provides several benefits for bees, including:
	* Nectar: Sunflowers produce a rich source of nectar that bees can collect for energy and nutrients.
	* Pollen: Sunflowers also produce abundant pollen that bees can gather for protein an