<a href="https://colab.research.google.com/github/PawanKrGunjan/Natural-Language-Processing/blob/main/RAG/Building_RAG_with_Custom_Unstructured_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install the required packages

In [1]:
!pip install -q torch transformers accelerate bitsandbytes sentence-transformers "unstructured[all-docs]" langchain chromadb langchain_community

# Download the mixed documents

Suppose I want to build a RAG system to help me manage pests in my garden. For this purpose, I’ll use a diverse set of documents covering IPM (integrated pest management):

- PDF: https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf
- PowerPoint: https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx
- EPUB: https://www.gutenberg.org/ebooks/45957
- HTML: https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html

Feel free to use your own documents on a topic of your choice, in any of the document types supported by Unstructured: .eml, .html, .md, .msg, .rst, .rtf, .txt, .xml, .png, .jpg, .jpeg, .tiff, .bmp, .heic, .csv, .doc, .docx, .epub, .odt, .pdf, .ppt, .pptx, .tsv, .xlsx.
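If you bring your own files, it can help to check their extensions against the list above before running the pipeline. The following is a minimal sketch; the helper name `split_by_support` and the sample file names are made up for illustration, and the extension set is simply copied from the list above.

```python
from pathlib import Path

# Extensions copied from the list of supported document types above.
SUPPORTED_EXTENSIONS = {
    ".eml", ".html", ".md", ".msg", ".rst", ".rtf", ".txt", ".xml",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic", ".csv", ".doc",
    ".docx", ".epub", ".odt", ".pdf", ".ppt", ".pptx", ".tsv", ".xlsx",
}


def split_by_support(filenames):
    """Return (supported, unsupported) lists, matching extensions case-insensitively."""
    supported, unsupported = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


# Hypothetical file names, used only to exercise the helper.
ok, skipped = split_by_support(["guide.PDF", "slides.pptx", "notes.pages"])
print("supported:", ok)
print("skipped:", skipped)
```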

In [2]:
!mkdir -p "./documents"
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf"
!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O "./documents/Citrus_IPM_090913.pptx"
!wget https://www.gutenberg.org/ebooks/45957.epub3.images -O "./documents/45957.epub"
!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html -O "./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html"

--2024-07-02 07:58:13--  https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf
Resolving www.gov.nl.ca (www.gov.nl.ca)... 98.143.128.70
Connecting to www.gov.nl.ca (www.gov.nl.ca)|98.143.128.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1914250 (1.8M) [application/pdf]
Saving to: ‘./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf’


2024-07-02 07:58:14 (4.81 MB/s) - ‘./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf’ saved [1914250/1914250]

--2024-07-02 07:58:14--  https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx
Resolving ipm.ifas.ufl.edu (ipm.ifas.ufl.edu)... 128.227.68.231
Connecting to ipm.ifas.ufl.edu (ipm.ifas.ufl.edu)|128.227.68.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4248570 (4.1M) [application/vnd.openxmlformats-officedocument.presentationml.presentation]
Saving to: ‘./documents/Citrus_IPM_090913.pptx’


2024-07-02
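Where `wget` is not available, the same downloads could be sketched with Python's standard library. This is only an illustration: the helper names `local_path_for` and `download` are hypothetical, and some servers may reject the default `urllib` user agent, in which case `wget` or another client is still needed.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_path_for(url, dest_dir="./documents", filename=None):
    """Build the destination path; by default reuse the last URL path segment."""
    name = filename or os.path.basename(urlparse(url).path)
    return os.path.join(dest_dir, name)


def download(url, dest_dir="./documents", filename=None):
    """Download url into dest_dir and return the local path (needs network access)."""
    os.makedirs(dest_dir, exist_ok=True)
    path = local_path_for(url, dest_dir, filename)
    urllib.request.urlretrieve(url, path)
    return path


# Only the path logic runs here; call download(...) to actually fetch a file.
pptx_path = local_path_for("https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx")
epub_path = local_path_for(
    "https://www.gutenberg.org/ebooks/45957.epub3.images", filename="45957.epub"
)
print(pptx_path)
print(epub_path)
```

The explicit `filename` argument mirrors the `-O` option in the cell above, which is what gives the EPUB a proper `.epub` extension.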

# Unstructured data preprocessing

In [3]:
# Optional cell to reduce the amount of logs

import logging

# Remove the root logger's first handler, if there is one, to quiet the ingest logs
root_logger = logging.getLogger()
if root_logger.handlers:
    root_logger.removeHandler(root_logger.handlers[0])

In [4]:
import os

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

output_path = "./local-ingest-output"

runner = LocalRunner(
    processor_config=ProcessorConfig(
        # logs verbosity
        verbose=True,
        # the local directory to store outputs
        output_dir=output_path,
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=True,
        # replace the placeholder with your own Unstructured API key
        api_key="<YOUR_UNSTRUCTURED_API_KEY>",
    ),
    connector_config=SimpleLocalConfig(
        input_path="./documents",
        # whether to get the documents recursively from given directory
        recursive=False,
    ),
)
runner.run()

2024-07-02 07:58:19,678 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "./local-ingest-output", "num_processes": 2, "raise_on_error": false}
2024-07-02 07:58:19,857 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "./local-ingest-output", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"input_path": "./documents", "recursive": false, "file_glob": null}}
2024-07-02 07:58:19,883 MainProcess INFO     processing 4 docs via 2 processes
2024-07-02 07:58:19,887 MainProcess INFO     Calling Reader with 4 docs
2024-07-02 07:58:19

In [5]:
#!brew install poppler
#!brew install tesseract

## Import element objects from the JSON files

In [6]:
from unstructured.staging.base import elements_from_json

elements = []

for filename in os.listdir(output_path):
    filepath = os.path.join(output_path, filename)
    elements.extend(elements_from_json(filepath))
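Before chunking, it can be useful to see which kinds of elements the partitioning step produced. The sketch below assumes each output file is a JSON array of objects carrying a `type` field (alongside `text` and `metadata`); the sample records are invented stand-ins for a real output file, and `count_element_types` is a hypothetical helper.

```python
import json
from collections import Counter


def count_element_types(json_text):
    """Count elements per type in one JSON document from the ingest output."""
    records = json.loads(json_text)
    return Counter(record.get("type", "unknown") for record in records)


# Hypothetical two-element sample standing in for a real output file.
sample = json.dumps([
    {"type": "Title", "text": "Integrated Pest Management",
     "metadata": {"filename": "a.pdf"}},
    {"type": "NarrativeText", "text": "IPM is a decision-making process.",
     "metadata": {"filename": "a.pdf"}},
])

counts = count_element_types(sample)
print(counts)
```

To apply this to real output, read each file under `output_path` and pass its contents to the helper.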

# Chunking

In [7]:
from unstructured.chunking.title import chunk_by_title

chunked_elements = chunk_by_title(
    elements,
    # maximum chunk size, in characters
    max_characters=512,
    # You can choose to combine consecutive elements that are too small
    # e.g. individual list items
    combine_text_under_n_chars=200,
)
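To make the two parameters concrete, here is a deliberately simplified, pure-Python sketch of title-based chunking. It is not Unstructured's implementation: elements are modelled as hypothetical `(kind, text)` pairs, a new chunk starts at a title only once the current chunk has reached `combine_text_under_n_chars`, and a single oversized element is kept whole rather than split.

```python
def chunk_by_title_sketch(elements, max_characters=512, combine_text_under_n_chars=200):
    """Group (kind, text) pairs into chunks; a simplified illustration only."""
    chunks, current = [], ""
    for kind, text in elements:
        candidate = f"{current}\n\n{text}" if current else text
        # A title opens a new section unless the current chunk is still small.
        starts_section = kind == "Title" and len(current) >= combine_text_under_n_chars
        if current and (starts_section or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up elements: one long section followed by two short ones.
elements_demo = [
    ("Title", "Prevention"),
    ("NarrativeText", "A" * 250),
    ("Title", "Monitoring"),
    ("NarrativeText", "B" * 100),
    ("Title", "Treatment"),
    ("NarrativeText", "C" * 100),
]

chunks_demo = chunk_by_title_sketch(elements_demo)
print([len(chunk) for chunk in chunks_demo])
```

In this toy input the two short sections end up in one chunk, which is the effect `combine_text_under_n_chars` is meant to have on small fragments such as individual list items.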

## Convert Unstructured elements to LangChain documents

In [8]:
from langchain_core.documents import Document

documents = []
for chunked_element in chunked_elements:
    metadata = chunked_element.metadata.to_dict()
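    # Use the file name as the document source
    metadata["source"] = metadata["filename"]
    # Drop the list-valued "languages" field if it is present
    metadata.pop("languages", None)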
    documents.append(Document(page_content=chunked_element.text, metadata=metadata))
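The cell above removes the `languages` entry by name. A more general way to think about it is to keep only scalar metadata values, which is roughly what LangChain's `filter_complex_metadata` does in the retriever cell. The function below is a hand-written sketch of that idea, not the library code, and the example metadata is invented.

```python
def keep_simple_metadata(metadata, allowed_types=(str, int, float, bool)):
    """Return a copy of metadata containing only values of the allowed scalar types."""
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, allowed_types)
    }


# Hypothetical metadata resembling what a chunked element might carry.
example = {
    "filename": "chapter7.pdf",
    "page_number": 3,
    "languages": ["eng"],            # list: dropped
    "coordinates": {"points": []},   # dict: dropped
}

clean = keep_simple_metadata(example)
print(clean)
```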

# Setting up the retriever

In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils

# ChromaDB doesn't support complex metadata (e.g. lists), so we drop it here.
# If you're using a different vector store, you may not need to do this.
docs = chromautils.filter_complex_metadata(documents)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
# Index the filtered documents, not the unfiltered originals
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
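Conceptually, a similarity retriever with `k=3` embeds the query, scores it against every stored chunk, and returns the three best matches. The toy sketch below uses cosine similarity on hand-made 2-D vectors purely for illustration; the distance metric Chroma actually applies depends on how the collection is configured, and `top_k` is a hypothetical helper rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Made-up 2-D "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.0], doc_vecs, k=3)
print(best)
```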


In [10]:
from huggingface_hub import notebook_login

notebook_login()


# RAG with LangChain

In [11]:
import torch
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

In [12]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)


qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={"prompt": prompt})

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
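With its default settings, `RetrievalQA.from_chain_type` builds a "stuff" chain: the retrieved chunks are concatenated and substituted for `{context}`, and the user's query fills `{question}`. The sketch below imitates that step with plain string formatting on a shortened template. The separator and the helper name `build_prompt` are assumptions for illustration, not the library's code.

```python
# Shortened stand-in for the prompt template defined in the cell above.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Context: {context}"
)


def build_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join the retrieved chunks and fill both template placeholders."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


# Invented chunks standing in for the retriever's top results.
prompt_demo = build_prompt(
    "Are aphids a pest?",
    ["Aphids feed on plant sap.", "Ants are attracted to honeydew."],
)
print(prompt_demo)
```

Because every retrieved chunk is pasted into a single prompt, the chunk size and `k` together determine how much of the model's context window the retrieval step consumes.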


# Question answering

In [13]:
# Define the question
question = "Are aphids a pest?"

# Invoke the QA chain
result = qa_chain.invoke(question)["result"]
print(result)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Yes, aphids are considered a pest because they feed on the nutrient-rich liquids within plants, causing harm to the plant's health. In fact, they can multiply quickly and need to be controlled immediately to prevent further damage. As mentioned in the text, ants are often attracted to aphids because they secrete a sweet, sticky liquid called honeydew, which aphids excrete as they feed on the plant sap. So, if you notice ants on your plants, it could be a sign that aphids are present.
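To check which chunks an answer like this was grounded in, `RetrievalQA.from_chain_type` accepts `return_source_documents=True`, after which the result dictionary also carries a `source_documents` list. The formatter below is a small sketch for printing such a list; `format_sources` is a hypothetical helper, and the fake documents only mimic the `page_content` and `metadata` attributes of LangChain documents so the example runs without a model.

```python
from types import SimpleNamespace


def format_sources(source_documents, preview_chars=80):
    """Render one numbered line per document: its source and a short text preview."""
    lines = []
    for rank, doc in enumerate(source_documents, start=1):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. {source}: {preview}")
    return "\n".join(lines)


# Stand-ins for retrieved documents (invented content).
fake_docs = [
    SimpleNamespace(page_content="Aphids feed on plant sap.",
                    metadata={"source": "pests.html"}),
    SimpleNamespace(page_content="Monitoring\nis used to discover pests.",
                    metadata={"source": "chapter7.pdf"}),
]

report = format_sources(fake_docs)
print(report)
```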


In [14]:
question = "What is integrated Pest Management"

qa_chain.invoke(question)["result"]

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


'Based on the provided context, Integrated Pest Management (IPM) is a method of managing pests, such as insects, weeds, plant diseases, slugs, birds, and mammals, in a way that reduces their numbers below a damaging level, while also being economical and safe. The goal of IPM is to achieve effective pest control without necessarily eliminating all pests. This approach was initially developed for agricultural pests but has since been successfully applied to other areas as well.'

In [15]:
question = "What is IPM program?"

qa_chain.invoke(question)["result"]

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


"Based on the provided context, I can tell you that IPM stands for Integrated Pest Management. According to the text, IPM is a decision-making process that helps to prevent pest problems by considering all available information and treatment methods. It's a coordinated approach that aims to prevent unacceptable levels of pest damage while minimizing risks to people, property, and the environment. The goal of IPM is to keep pest numbers at acceptable levels, rather than eliminating them entirely, which can lead to more sustainable and cost-effective pest management practices."

In [16]:
question = "What are the key elements of an IPM Program? brief each elements pointwise."

print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I'd be happy to help you with your question!

The key elements of an IPM (Integrated Pest Management) Program are:

**Prevention**: This involves planning and managing ecosystems to keep organisms from becoming problems. It's often more effective and cost-efficient to prevent pest issues rather than waiting until they arise.

**Identification**: In this step, pests and beneficial organisms are identified. This is crucial in understanding the problem and developing a suitable solution.

These two elements form the foundation of an effective IPM Program, which aims to manage pests while minimizing environmental impact and ensuring sustainability.

Let me know if you have any further questions!


In [17]:
question = "Is Monitoring,  Injury and Action Thresholds not the elements of an IPM Program?"

print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I'd be happy to help!

To answer your question: Yes, Monitoring, Injury, and Action Thresholds are indeed elements of an Integrated Pest Management (IPM) Program.

Let me know if you have any further questions!


In [18]:
question = "How to plan an IPM Program? Provide stepwise guide"

print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, here's a step-by-step guide on how to plan an IPM program:

**Step 1: Identify When Pests Will Be Present**
Determine when pests are likely to appear based on factors such as weather conditions, crop growth stage, and environmental changes.

**Step 2: Understand What Pests Eat and Where They Hide**
Research the feeding habits and habitats of target pests to better understand their behavior and ecology.

**Step 3: Identify Stages of Life That Are Easiest to Control**
Focus on controlling specific life stages of pests, such as eggs, larvae, or adults, depending on which stage is most vulnerable to management.

**Step 4: Identify Natural Enemies and Beneficial Organisms**
Recognize the presence of natural predators, parasites, or other beneficial organisms that can help regulate pest populations.

**Step 5: Monitor Pest Populations**
Regularly monitor pest populations to detect early signs of infestation, track population trends, and


In [19]:
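# The prompt must be assigned to `question`, the variable passed to the chain.
# Note: the output recorded below came from the previous prompt, because this
# cell originally assigned the text to `questions`.
question = """
What are these in IPM program?

1. When pests will be present
2. What they eat
3. Where they hide
4. The stages of life that are easiest to control
5. What natural enemies exist
"""
print(qa_chain.invoke(question)["result"])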

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, here's a stepwise guide on how to plan an IPM program:

**Step 1: Identify When Pests Will Be Present**
Determine the timing of pest presence, which can help you anticipate and prepare for potential issues.

**Step 2: Understand What Pests Eat and Where They Hide**
Learn about the food sources and habitats of pests to better understand their behavior and potential entry points.

**Step 3: Identify Stages of Life That Are Easiest to Control**
Focus on controlling specific stages of a pest's life cycle, such as eggs, larvae, or adults, depending on your goals and resources.

**Step 4: Identify Natural Enemies and Beneficial Organisms**
Recognize the presence of natural predators, parasites, or other beneficial organisms that can help regulate pest populations.

**Step 5: Monitor for Pests and Beneficial Organisms**
Regularly monitor fields, crops, or ecosystems to detect pest presence, assess damage,


## Self-test Questions

In [20]:
question = """
What are the advantages of using an IPM program?
provide atleast 5 points
"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, here are five advantages of using an IPM program:

1. **Long-term answers to pest problems**: An IPM program provides a sustainable solution to pest management, addressing the root causes of the problem rather than just treating symptoms.

2. **Protecting environmental and human health**: By reducing pesticide use, IPM minimizes the risk of environmental contamination and exposure to toxic chemicals, ensuring a healthier ecosystem and safer food supply.

3. **Reducing harm to beneficial organisms**: IPM considers the impact on non-target species, such as bees and butterflies, which are essential for pollination and ecosystem balance.

4. **Preventing creation of pesticide-resistant pests**: By rotating pesticides and using alternative control methods, IPM slows down the development of pesticide resistance, making it more challenging for pests to adapt and evolve.

5. **Providing a way to manage pests when pesticides cannot be used**: In situations where p

In [21]:
question = """
Why is prevention key to an IPM program?
"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


According to the context, prevention is key to an IPM program because it allows us to avoid problems altogether. By changing the way we manage crops, ornamentals, buildings, or other sites, we can prevent pest issues from arising in the first place. This approach is often more cost-effective and yields better results in the long run compared to waiting until problems occur and then relying on treatments. Additionally, preventing pest problems means there's no need for costly treatments, which also benefits the environment.


In [22]:
question = """ Choose the correct options from a,b,c,d & e

Monitoring is used in an IPM program to:
a. Discover if pests are present and in what numbers
b. Find pest damage or symptoms of disease
c. Determine if beneficial organisms are present
d. All of above
e. a and b only
"""

print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I would choose option d) All of above as the correct answer.

According to the text, monitoring is used in an IPM program to discover if pests are present and in what numbers (option a), find pest damage or symptoms of disease (option b), and determine if beneficial organisms are present (option c). Therefore, all three options are correct, making option d the best choice.


In [23]:
question = """Explain the difference between injury threshold and action threshold."""

print(qa_chain.invoke(question)["result"])

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I can explain the difference between injury threshold and action threshold.

According to the text, the injury threshold refers to the maximum tolerable pest population, which is the point where pest numbers become unacceptable due to the damage they cause. On the other hand, the action threshold is the point at which treatment should be taken to prevent the pest population from reaching the injury threshold. In other words, the action threshold is the trigger point for taking action against the pest population before it causes significant harm.

To illustrate this concept, let's consider an example. Imagine a farmer monitoring a field for aphids, a common pest that can damage crops. If the aphid population reaches the injury threshold, the damage would already be significant, and the farmer might need to apply pesticides or other control measures to mitigate the impact. However, if the farmer detects the aphid population approaching the action threshold,

In [24]:
question = """List five treatments used for pest control. Give one example of each."""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, here are five treatments used for pest control with one example each:

1. Cultural Controls - Adjusting growing conditions to prevent pest buildup. Example: Pruning trees to improve air circulation and reduce humidity, making it harder for fungal diseases to develop.
2. Biological Controls - Using natural predators or parasites to control pests. Example: Introducing ladybugs to a garden to feed on aphids, reducing their population.
3. Chemical Controls - Using pesticides to kill pests. Example: Applying insecticides to a field to control aphid infestations.
4. Physical Controls - Removing pests from the environment. Example: Hand-picking caterpillars from crops to prevent damage.
5. Genetic Controls - Altering crop varieties to resist pests. Example: Developing corn varieties that are resistant to certain types of corn borers.

Please note that these are just a few examples, and there may be many more treatments used in pest control depending on the speci

In [25]:
question = """Why is communication important in an IPM program?"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I'd be happy to help answer your question!

Why is communication important in an IPM program?

According to the text, communication is crucial in an IPM program because it allows different stakeholders to share information and work together towards solving a pest problem. This includes gathering local experience and information from government agencies, local trade associations, and employees, as well as communicating the details of the IPM program and its goals to workers, employers, and customers. Effective communication helps ensure that everyone understands their role and can contribute to the success of the program.

Let me know if you have any further questions!


In [26]:
question = """A monitoring program should take enough samples to get the most
accurate estimate of the pest population.
True or False?"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I would answer your question as follows:

That's true! According to the text, "The greater the number of samples counted, the more likely it is that the results will give a good estimate." It also states that "Ten to fifty samples are often required" and that "Enough random samples must be taken to get a good estimate of the pest population." This suggests that taking a sufficient number of samples is crucial for getting an accurate estimate of the pest population.


In [27]:
question = """A monitoring program should take enough samples to get the most
accurate estimate of the pest population.
True or False?"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I would answer your question as follows:

The statement "A monitoring program should take enough samples to get the most accurate estimate of the pest population" is TRUE. According to the text, "The greater the number of samples counted, the more likely it is that the results will give a good estimate. Ten to fifty samples are often required." This suggests that taking a sufficient number of samples is crucial for getting an accurate estimate of the pest population.


In [28]:
question = """Visual inspections include counting the number of pests on plants.
True or False?"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I would answer the question as follows:

TRUE. According to the text, visual inspections include counting the number of pests on plants. Specifically, it mentions that one part of a visual inspection is "counting the number of pests on plants".


In [29]:
question = """In an IPM program, only one treatment is used for a given pest problem.
True or False?
"""
print(qa_chain.invoke(question)["result"])

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Based on the provided context, I would answer your question as follows:

False. According to the text, "In an IPM program, it is common to use several treatments together to control pests. Combined treatment methods are often more effective than using only one method." This suggests that multiple treatments are typically used in an IPM program, rather than just one.
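The self-test cells above all follow the same pattern, so they could also be run as a batch. The sketch below assumes only that the chain exposes `invoke(query)` and returns a dictionary with a `"result"` key, as `qa_chain` does in this notebook. `EchoChain` is a made-up stand-in so the example runs without loading the model.

```python
def ask_all(chain, questions):
    """Run each question through chain.invoke and collect (question, answer) pairs."""
    return [(q, chain.invoke(q)["result"]) for q in questions]


class EchoChain:
    """Hypothetical stand-in for qa_chain; it just echoes the query back."""

    def invoke(self, query):
        return {"query": query, "result": f"(answer to: {query})"}


answers = ask_all(
    EchoChain(),
    [
        "Why is prevention key to an IPM program?",
        "Why is communication important in an IPM program?",
    ],
)
for asked, answer in answers:
    print(asked, "->", answer)
```

Passing `qa_chain` instead of `EchoChain()` would send the same list of questions to the real pipeline.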
