## 1. Data Ingestion

#### 1.1 Data import for the year of 2006

In [96]:
import pandas as pd
import re
import uuid

In [97]:
df = pd.read_csv(r"D:\Data Science Projects\datasets\Year_2024_dataset.csv")
df.head(2)

Unnamed: 0,BlockName,Category,Year,Month,Day,Crop,DistrictName,QueryType,Season,Sector,StateName,QueryText,KccAns,latitude,longitude
0,PONDURU,Pulses,2024,1,1,Green Gram Moong Bean Moong,SRIKAKULAM,Nutrient Management,,AGRICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON NUTRIENT MANAGEMENT IN g...,RECOMMENDED TO SPRAY BORAX 3 GRAMS 1 LITR OF ...,18.2949,83.8939
1,GARA,Millets,2024,1,1,Maize Makka,SRIKAKULAM,Plant Protection,,AGRICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON USAGE OF NEEM OIL IN MAIZE,RECOMMENDED TO SPRAY AZADIRHACHTIN NEEM OIL 1...,18.2949,83.8939


#### 1.2 Clean, normalize, and split into logical “document” chunks or Q&A pairs.

####  EDA

In [98]:
print("🔹 Dataset shape:", df.shape)

🔹 Dataset shape: (3234061, 15)


In [99]:
print("\n🔹 Null values per column:")
print(df.isnull().sum())


🔹 Null values per column:
BlockName            44
Category              0
Year                  0
Month                 0
Day                   0
Crop                  0
DistrictName          0
QueryType             0
Season          3234061
Sector                0
StateName             0
QueryText            21
KccAns              750
latitude              0
longitude             0
dtype: int64


In [100]:
# Unique values in each column
print("\n🔹 Unique value counts:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")


🔹 Unique value counts:
BlockName: 6126
Category: 22
Year: 1
Month: 12
Day: 31
Crop: 297
DistrictName: 668
QueryType: 65
Season: 0
Sector: 4
StateName: 32
QueryText: 698092
KccAns: 1292215
latitude: 649
longitude: 647


In [101]:
 ## Distribution of some key categorical columns
print("\n🔹 Top 10 States:")
print(df['StateName'].value_counts().head(10))


🔹 Top 10 States:
StateName
UTTAR PRADESH     536021
RAJASTHAN         431179
MADHYA PRADESH    345544
HARYANA           236464
MAHARASHTRA       220050
BIHAR             195627
GUJARAT           183659
WEST BENGAL       166475
TAMILNADU         157176
PUNJAB            147633
Name: count, dtype: int64


In [102]:
#print("\n🔹 Top 10 Crops:")
print(df['Crop'].value_counts().head(10))

Crop
Others                          1300332
Paddy Dhan                       408910
Wheat                            224229
Cotton Kapas                      91734
Groundnut pea nutmung phalli      80240
Soybean bhat                      74237
Maize Makka                       66959
Green Gram Moong Bean Moong       59394
Mustard                           51492
Potato                            47092
Name: count, dtype: int64


#### Table cleaning 

In [103]:
# Drop the 'Season' column due to high missingness
df.drop(columns=['Season'], inplace=True)

print("✅ Dropped 'Season' column.")

✅ Dropped 'Season' column.


In [104]:
print("🔹 Dataset shape:", df.shape)

🔹 Dataset shape: (3234061, 14)


In [105]:
print("🧪 Any NaNs?", df.isnull().values.any())


🧪 Any NaNs? True


In [106]:
# Drop rows that contain any NaN values
df.dropna(inplace=True)

print(f"✅ Dropped rows with NaN values. New shape: {df.shape}")


✅ Dropped rows with NaN values. New shape: (3233249, 14)


In [107]:
print("🧪 Any NaNs left?", df.isnull().values.any())


🧪 Any NaNs left? False


In [108]:
df.head()

Unnamed: 0,BlockName,Category,Year,Month,Day,Crop,DistrictName,QueryType,Sector,StateName,QueryText,KccAns,latitude,longitude
0,PONDURU,Pulses,2024,1,1,Green Gram Moong Bean Moong,SRIKAKULAM,Nutrient Management,AGRICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON NUTRIENT MANAGEMENT IN g...,RECOMMENDED TO SPRAY BORAX 3 GRAMS 1 LITR OF ...,18.2949,83.8939
1,GARA,Millets,2024,1,1,Maize Makka,SRIKAKULAM,Plant Protection,AGRICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON USAGE OF NEEM OIL IN MAIZE,RECOMMENDED TO SPRAY AZADIRHACHTIN NEEM OIL 1...,18.2949,83.8939
2,PONDURU,Pulses,2024,1,1,Green Gram Moong Bean Moong,SRIKAKULAM,Plant Protection,AGRICULTURE,ANDHRA PRADESH,GREEN GRAM LEAF EATING CATERPILLAR MANAGEMENT,500 200,18.2949,83.8939
3,MANDASA,Vegetables,2024,1,1,Tomato,SRIKAKULAM,Fertilizer Use and Availability,HORTICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON FERTILIZER MANAGEMENT IN...,RECOMMENDED TO FERTILISERS: UREA 30KGDAP- 50 K...,18.2949,83.8939
4,MANDASA,Vegetables,2024,1,1,Tomato,SRIKAKULAM,Cultural Practices,HORTICULTURE,ANDHRA PRADESH,FARMER ASKED QUERY ON WEED MANAGEMENT IN TOM...,RECOMMENDED TO SPRAY ATRAZINE ATRATOPSOLARO 1 ...,18.2949,83.8939


In [109]:
# Normalize text fields (lowercase, strip, basic cleaning)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.lower().strip()
    text = re.sub(r'\s+', ' ', text)  # remove extra whitespace
    text = re.sub(r'[^a-z0-9\s.,]', '', text)  # basic character filtering
    return text

df['QueryText'] = df['QueryText'].apply(clean_text)
df['KccAns'] = df['KccAns'].apply(clean_text) 


#  normalize categorical fields (strip & lowercase)
for col in ['Crop', 'DistrictName', 'Category', 'QueryType', 'Sector', 'StateName']:
    df[col] = df[col].astype(str).str.lower().str.strip()

In [110]:
df.head()

Unnamed: 0,BlockName,Category,Year,Month,Day,Crop,DistrictName,QueryType,Sector,StateName,QueryText,KccAns,latitude,longitude
0,PONDURU,pulses,2024,1,1,green gram moong bean moong,srikakulam,nutrient management,agriculture,andhra pradesh,farmer asked query on nutrient management in g...,recommended to spray borax 3 grams 1 litr of w...,18.2949,83.8939
1,GARA,millets,2024,1,1,maize makka,srikakulam,plant protection,agriculture,andhra pradesh,farmer asked query on usage of neem oil in maize,recommended to spray azadirhachtin neem oil 1 ...,18.2949,83.8939
2,PONDURU,pulses,2024,1,1,green gram moong bean moong,srikakulam,plant protection,agriculture,andhra pradesh,green gram leaf eating caterpillar management,500 200,18.2949,83.8939
3,MANDASA,vegetables,2024,1,1,tomato,srikakulam,fertilizer use and availability,horticulture,andhra pradesh,farmer asked query on fertilizer management in...,recommended to fertilisers urea 30kgdap 50 kg ...,18.2949,83.8939
4,MANDASA,vegetables,2024,1,1,tomato,srikakulam,cultural practices,horticulture,andhra pradesh,farmer asked query on weed management in tomato,recommended to spray atrazine atratopsolaro 1 ...,18.2949,83.8939


In [111]:
df.iloc[193:200]

Unnamed: 0,BlockName,Category,Year,Month,Day,Crop,DistrictName,QueryType,Sector,StateName,QueryText,KccAns,latitude,longitude
193,SANTHAKAVATI,pulses,2024,1,23,green gram moong bean moong,srikakulam,water management,agriculture,andhra pradesh,green gram water management,2530 4550,18.2949,83.8939
194,PONDURU,others,2024,1,23,others,srikakulam,weather,agriculture,andhra pradesh,farmer asked query on weather,cloudy weather no chance of showers in your area,18.2949,83.8939
195,GANGUVARISIGADAM,others,2024,1,23,others,srikakulam,government schemes,agriculture,andhra pradesh,pm kisan samman nidhi yojana scheme,fto processed nopayment status rft signed by ...,18.2949,83.8939
196,BHAMINI,millets,2024,1,23,maize makka,srikakulam,fertilizer use and availability,agriculture,andhra pradesh,farmer asked query on fertilizer management in...,13 13 13 12 12,18.2949,83.8939
197,BHAMINI,millets,2024,1,23,maize makka,srikakulam,fertilizer use and availability,agriculture,andhra pradesh,farmer asked query on fertilizer management in...,175 150 33 13 13 3035 13 5055 13 5055,18.2949,83.8939
198,BHAMINI,millets,2024,1,23,maize makka,srikakulam,plant protection,agriculture,andhra pradesh,farmer asked query on fall armyworm management...,2530 3 4,18.2949,83.8939
199,AMADALAVALASA,others,2024,1,23,green gram moong bean moong,srikakulam,weather,agriculture,andhra pradesh,farmer asked query on weather,cloudy weather,18.2949,83.8939


In [112]:
df['KccAns'].map(type).value_counts()

KccAns
<class 'str'>    3233249
Name: count, dtype: int64

In [113]:
# Noticed that few KccAns are meaningless or providng no insights, so removing off such rows.

# Define what counts as a placeholder "no answer"
placeholders = {'', 'no answer', 'none', 'n/a', 'na', 'not available', 'nan', 'test call'}

# Function to check for mostly numeric values
def is_mostly_numbers(text):
    if pd.isna(text):  # Check for NaN
        return True
    text = str(text).strip().lower()
    if text in placeholders:
        return True
    tokens = re.split(r'[\s:;,-]+', text)
    num_count = sum(token.replace('.', '', 1).isdigit() for token in tokens if token)
    return len(tokens) > 0 and num_count / len(tokens) > 0.7

# Apply cleaning filter
df_cleaned = df[~df['KccAns'].apply(is_mostly_numbers)].copy()

In [114]:
df.shape, df_cleaned.shape

((3233249, 14), (996618, 14))

In [115]:
df_cleaned.iloc[100:200]

Unnamed: 0,BlockName,Category,Year,Month,Day,Crop,DistrictName,QueryType,Sector,StateName,QueryText,KccAns,latitude,longitude
115,BHAMINI,vegetables,2024,1,13,beans,srikakulam,plant protection,horticulture,andhra pradesh,farmer asked query on pest management in beans,recommended to spray bifenthrin marker 400 ml ...,18.2949,83.8939
116,HIRAMANDALAM,others,2024,1,13,others,srikakulam,training and exposure visits,agriculture,andhra pradesh,call is disconnected by farmer,call is disconnected by farmer,18.2949,83.8939
119,AMADALAVALASA,oilseeds,2024,1,15,sesame gingellytilsesamum,srikakulam,sowing time and weather,agriculture,andhra pradesh,sesamum ylm66 sowing season and its characteri...,recommended to sow ylm66 kharifrabisummer 80 9...,18.2949,83.8939
123,LAVERU,cereals,2024,1,17,paddy dhan,srikakulam,weather,agriculture,andhra pradesh,farmer asked query on weather,cloudy weather and chance of shower in your area,18.2949,83.8939
124,AMADALAVALASA,millets,2024,1,17,fingermillet ragimandika,srikakulam,fertilizer use and availability,agriculture,andhra pradesh,farmer asked query on fertilizer management in...,recommended to urea30kgdap50 kgmop15 kgacre,18.2949,83.8939
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
238,BURJA,oilseeds,2024,1,27,groundnut pea nutmung phalli,srikakulam,plant protection,agriculture,andhra pradesh,farmer asked query on root grub management in ...,200kgsacre 3g 10kgacre 10g 56 kgacre 25 mllt,18.2949,83.8939
239,SOMPETA,oilseeds,2024,1,27,groundnut pea nutmung phalli,srikakulam,weed management,agriculture,andhra pradesh,farmer asked query about weed management in gr...,recommended to spray imazitapyr pursuit 250 ml...,18.2949,83.8939
240,BURJA,millets,2024,1,27,maize makka,srikakulam,plant protection,agriculture,andhra pradesh,farmer asked query about stem borer management...,recommended to spray novaluran remon 300 ml 20...,18.2949,83.8939
242,BHAMINI,flowers,2024,1,27,marigold,srikakulam,plant protection,horticulture,andhra pradesh,farmer asked query on leaf miner management in...,recommended to spray fipronil regent 400 ml 20...,18.2949,83.8939


In [117]:
metadata_cols = ['StateName', 'DistrictName', 'Crop', 'Category', 'QueryType', 'Sector', 'Year', 'Month', 'Day']

In [None]:
# # as the dataset is too big, sampling to 20000 instances for ease of this project demonstration
# sampled_df = df_cleaned.sample(n=20000, random_state=42)

In [118]:
# Function to create a single document
def make_qa_doc(row):
    return {
        "doc_id": str(uuid.uuid4()),  # unique ID
        "query": row['QueryText'],
        "answer": row['KccAns'],
        "metadata": {col: row[col] for col in metadata_cols}
    }
    
# Apply transformation
qa_docs = df_cleaned.apply(make_qa_doc, axis=1).tolist()

In [119]:
qa_df = pd.DataFrame(qa_docs)
qa_df.head()

Unnamed: 0,doc_id,query,answer,metadata
0,9aae1e28-8982-40d5-8b10-4e066d6653a3,farmer asked query on nutrient management in g...,recommended to spray borax 3 grams 1 litr of w...,"{'StateName': 'andhra pradesh', 'DistrictName'..."
1,0cf4e43d-7ae8-4495-bb12-758b6f22e418,farmer asked query on usage of neem oil in maize,recommended to spray azadirhachtin neem oil 1 ...,"{'StateName': 'andhra pradesh', 'DistrictName'..."
2,8b3a1a3e-d46f-446a-9edb-ec66b918399c,farmer asked query on fertilizer management in...,recommended to fertilisers urea 30kgdap 50 kg ...,"{'StateName': 'andhra pradesh', 'DistrictName'..."
3,15e2c70c-2275-4dd3-9407-4a3c393752c8,farmer asked query on weed management in tomato,recommended to spray atrazine atratopsolaro 1 ...,"{'StateName': 'andhra pradesh', 'DistrictName'..."
4,3c452d6e-a00a-4e0b-b5c9-c880cce86a0c,farmer asked query on weed management in sunfl...,recommended to spray quizalofoppethyl dhanuka ...,"{'StateName': 'andhra pradesh', 'DistrictName'..."


In [122]:
# Save preprocessed Q&A pairs to file
qa_df.to_json("kcc_qa_clean.json", orient="records", lines=True)
print("✅ Q&A chunks saved to 'kcc_qa_clean.json'")

✅ Q&A chunks saved to 'kcc_qa_clean.json'


#### 1.4 Export both raw and preprocessed formats, preserving metadata fields.

In [None]:
# Save the cleaned version of the original dataset (no Q&A restructuring)
df.to_csv("kcc_cleaned_raw.csv", index=False)
print("✅ Raw cleaned data saved as 'kcc_cleaned_raw.csv'")


In [None]:
# Also save as CSV for readability
qa_df.to_csv("kcc_qa_clean.csv", index=False)
print("✅ Preprocessed Q&A data also saved as 'kcc_qa_clean.csv'")


## Task 02: Local LLM Deployment

#### 2.1 Use an open-source model via the Ollama API (e.g., Gemma 3, Deepseek).

In [123]:
import requests

In [124]:
import requests
import json

# Set up the base URL for the local Ollama API
url = "http://localhost:11434/api/chat"

# Define the payload (your input prompt)
payload = {
    "model": "gemma3:1b",  # Replace with the model name you're using
    "messages": [{"role": "user", "content": "what is the capital of Andhra Pradesh?"}]
}

# Send the HTTP POST request with streaming enabled
response = requests.post(url, json=payload, stream=True)

# Check the response status
if response.status_code == 200:
    print("Streaming response from Ollama:")
    for line in response.iter_lines(decode_unicode=True):
        if line:  # Ignore empty lines
            try:
                # Parse each line as a JSON object
                json_data = json.loads(line)
                # Extract and print the assistant's message content
                if "message" in json_data and "content" in json_data["message"]:
                    print(json_data["message"]["content"], end="")
            except json.JSONDecodeError:
                print(f"\nFailed to parse line: {line}")
    print()  # Ensure the final output ends with a newline
else:
    print(f"Error: {response.status_code}")
    print(response.text)

Streaming response from Ollama:
The capital of Andhra Pradesh is **Amaravati**.

However, it’s important to note that **Amaravati is largely considered a historical city and a UNESCO World Heritage site.** The current capital is **Visakhapatnam**. 

So, while Amaravati was the capital, Visakhapatnam is the official capital today.

Do you want to know more about either Amaravati or Visakhapatnam?


#### trying out ollama python package

In [125]:
import ollama

In [126]:
client = ollama.Client()
model = "gemma3:1b"

In [127]:
response = client.generate(model, prompt = "what are the two crop seasons in Andhra Pradesh?")

In [128]:
response.response

"Andhra Pradesh has two major crop seasons – the **Mango Season** and the **Paddy Season**. Here’s a breakdown:\n\n**1. Mango Season (Typically June - September):**\n\n* **Why it’s significant:** This is the most celebrated and economically important season for mangoes in Andhra Pradesh. It’s deeply intertwined with the state’s agricultural heritage and culture.\n* **What’s grown:** Primarily, it’s the cultivation of the *Mangifera indica* mango – a hybrid of the Indian mango and the Chinese mango.\n* **Economic impact:** This season brings significant revenue to farmers and the state’s economy.\n* **Cultural Significance:** Mangoes are a major part of Andhra Pradesh’s cuisine, and the season is celebrated with festivals and fairs.\n\n\n**2. Paddy Season (Typically August - November):**\n\n* **Why it’s significant:** This is the cornerstone of Andhra Pradesh’s agricultural economy – the cultivation and harvesting of rice.\n* **What’s grown:**  It's primarily the cultivation of *Rice* (

## 3. Retrieval-Augmented Generation (RAG)

#### 3.1 Generate Embeddings from Document Chunks

In [None]:
# !pip install sentence-transformers

In [129]:
from sentence_transformers import SentenceTransformer
import numpy as np
import tqdm as notebook_tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [133]:
# Sample 100 entries
sample_df = qa_df.sample(n=20000, random_state=42).reset_index(drop=True)

# Combine question and answer into one chunk
sample_df['text'] = sample_df['query'] + " " + sample_df['answer']


In [134]:
sample_df.head()

Unnamed: 0,doc_id,query,answer,metadata,text
0,5c0e6840-5ec2-42fb-af3b-d57552e15061,leaf blight in betel vine,recommended to spray nativo tebuconazole 50 tr...,"{'StateName': 'odisha', 'DistrictName': 'balas...",leaf blight in betel vine recommended to spray...
1,29f9b600-ee59-4ba0-ba02-c9a969957f06,farmer asked query on weather,hanumangarh hanumangarh 37 26 10,"{'StateName': 'rajasthan', 'DistrictName': 'ha...",farmer asked query on weather hanumangarh hanu...
2,16bfe262-c4a3-48d2-bd01-cee8c54da442,information about post emergence weeds control...,bispyribacsodium 10 sc 80 200,"{'StateName': 'uttar pradesh', 'DistrictName':...",information about post emergence weeds control...
3,8c712611-6515-4176-92f4-d0ed5a196129,asking about details of per drop more crop scheme,pmksy,"{'StateName': 'west bengal', 'DistrictName': '...",asking about details of per drop more crop sch...
4,940e2adf-1c7b-4cf7-8a61-a0bab63a7833,information about control of post emergence na...,bispyribacsodium 10 sc 80100 200,"{'StateName': 'uttar pradesh', 'DistrictName':...",information about control of post emergence na...


In [135]:
sample_df.shape

(20000, 5)

In [None]:
## generate embbedings

# Load a small, fast embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate dense vector embeddings
embeddings = model.encode(sample_df['text'].tolist(), show_progress_bar=True)


# Convert to NumPy for compatibility with vector DBs
embeddings = np.array(embeddings)

# Confirm shape
print("✅ Embeddings shape:", embeddings.shape)


Batches: 100%|██████████| 625/625 [03:13<00:00,  3.23it/s]


✅ Embeddings shape: (20000, 384)


#### 3.2 Store embeddings in a lightweight vector database (ChromaDB, FAISS, or MongoDB).

In [None]:
# !pip install chromadb

In [137]:
import chromadb
from chromadb.config import Settings
import chromadb.utils.embedding_functions as embedding_functions

In [138]:
chroma_client = chromadb.Client()

chroma_client = chromadb.PersistentClient(path="./chroma_store")


In [139]:
# Optional: remove existing collection (if rerunning)
try:
    chroma_client.delete_collection("kcc_embeddings")
except:
    pass

In [140]:
# Create embedding function wrapper
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Create new collection
collection = chroma_client.create_collection(
    name="kcc_embeddings",
    embedding_function=embedding_fn
)

In [143]:
# Prepare documents, metadatas, and IDs
documents = sample_df['text'].tolist()
metadatas = sample_df['metadata'].tolist()  # ← Fixed: removed eval
ids = sample_df['doc_id'].tolist()

In [144]:
# ChromaDB has a maximum batch size limit of 5461 per .add() call.

def batch_add_to_chroma(collection, documents, metadatas, ids, batch_size=5000):
    for i in range(0, len(documents), batch_size):
        collection.add(
            documents=documents[i:i+batch_size],
            metadatas=metadatas[i:i+batch_size],
            ids=ids[i:i+batch_size]
        )
        print(f"✅ Added batch {i} to {min(i+batch_size, len(documents))}")
        
batch_add_to_chroma(collection, documents, metadatas, ids)


✅ Added batch 0 to 5000
✅ Added batch 5000 to 10000
✅ Added batch 10000 to 15000
✅ Added batch 15000 to 20000


In [147]:
## Output: Sample Query
results = collection.query(
    query_texts=["how to control fruit borer in brinjal"],
    n_results=3
)

print("🔎 Top results:")
for doc in results['documents'][0]:
    print("→", doc)

🔎 Top results:
→ farmer want to know information about how to control fruit borer in brinjal 11 ec35
→ farmer want to know information about how to control fruit shoot borer in brinjal crop 10 ec 10 ml
→ want to know about how to control fruit borer in brinjal spray chlorpyriphos 20 ec 25 ml lit of water


#### 3.3: Semantic Search using ChromaDB

Encode the incoming query using the same SentenceTransformer model.

Use ChromaDB’s query() method to retrieve the most relevant documents.

Return both the matched text and the associated metadata.

In [158]:
# Load the same embedding model used before
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define your query
user_query = "How to handle leaf blight in betel vine in Odhisa?"

# Generate query embedding
query_embedding = model.encode([user_query])

# Perform semantic search
results = collection.query(
    query_embeddings=query_embedding,
    n_results=5,  # top-k
    include=['documents', 'metadatas']
)

# Show results
for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0]), 1):
    print(f"\n🔹 Match {i}")
    print(f"Text: {doc}")
    print("Metadata:")
    for k, v in meta.items():
        print(f"  {k}: {v}")


🔹 Match 1
Text: leaf blight in betel vine recommended to spray nativo tebuconazole 50 trifloxystrobin 25 wg 100 g in 200 litre water 8 g in 15 litre water per acre to control leaf blight in betel vine
Metadata:
  DistrictName: balasore
  Year: 2024
  Sector: horticulture
  Day: 15
  Month: 7
  QueryType: plant protection
  Category: medicinal and aromatic plants
  Crop: betel vine
  StateName: odisha

🔹 Match 2
Text: information about control of blight in papaya plants  11 183sc 250ml 200
Metadata:
  StateName: uttar pradesh
  Category: fruits
  Month: 8
  Year: 2024
  QueryType: plant protection
  Day: 4
  Crop: papaya
  DistrictName: faizabad
  Sector: horticulture

🔹 Match 3
Text: asking about control of leaf blight in potato at pre emergence spray propineb 70 wp 3 gm lit of water antracoldevi rackproximaneauditpropinex
Metadata:
  Sector: horticulture
  DistrictName: west medinipur
  QueryType: plant protection
  Month: 1
  Category: vegetables
  Day: 13
  Year: 2024
  Crop: potat

#### 3.4 If no context meets a relevance threshold, invoke a live Internet search and clearly notify the user.

In [None]:
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer
import requests

def semantic_search_with_fallback(query, model, collection, threshold=0.75, top_k=5):
    # Step 1: Embed the query
    query_embedding = model.encode([query])[0]

    # Step 2: Retrieve top-k results
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=['documents', 'metadatas', 'embeddings']
    )

    docs = results['documents'][0]
    metas = results['metadatas'][0]
    doc_embeddings = results['embeddings'][0]

    # Step 3: Compute cosine similarity
    similarities = [1 - cosine(query_embedding, emb) for emb in doc_embeddings]

    # Step 4: Filter based on threshold
    relevant_results = [
        (doc, meta, sim)
        for doc, meta, sim in zip(docs, metas, similarities)
        if sim >= threshold
    ]

    if relevant_results:
        print(f"\n✅ Found {len(relevant_results)} relevant results (threshold ≥ {threshold})")
        for i, (doc, meta, sim) in enumerate(relevant_results, 1):
            print(f"\n🔹 Match {i} (Similarity: {sim:.3f})")
            print(f"Text: {doc}")
            print("Metadata:")
            for k, v in meta.items():
                print(f"  {k}: {v}")
    else:
        print("\n⚠️ No sufficiently relevant local results.")
        print("🌐 Performing a live internet search...")

        # Step 5: Perform live internet search using SerpAPI
        serpapi_api_key = 'YOUR_SERPAPI_API_KEY'  # Replace with your actual SerpAPI key
        params = {
            'engine': 'google',
            'q': query,
            'api_key': "6fb5953aa0005416f5307922637a89b395a05e7208c7b66ce2171b30d3df4e80"
        }
        response = requests.get('https://serpapi.com/search', params=params)

        if response.status_code == 200:
            search_results = response.json()
            organic_results = search_results.get('organic_results', [])
            if organic_results:
                print("\n🔎 Top Internet Search Results:")
                for i, result in enumerate(organic_results[:5], 1):
                    print(f"\n🔹 Result {i}")
                    print(f"Title: {result.get('title')}")
                    print(f"Link: {result.get('link')}")
                    print(f"Snippet: {result.get('snippet')}")
            else:
                print("No search results found.")
        else:
            print(f"Error fetching search results: {response.status_code}")


In [151]:
# Initialize your model and collection as before
model = SentenceTransformer('all-MiniLM-L6-v2')  # or your preferred model
# collection = your ChromaDB collection

# Perform semantic search with fallback
semantic_search_with_fallback("What is the typical harvest times for Potatoes in Germany?", model, collection)



⚠️ No sufficiently relevant local results.
🌐 Performing a live internet search...

🔎 Top Internet Search Results:

🔹 Result 1
Title: Regional Differences in Potato Yields in Germany
Link: https://www.potatopro.com/news/2023/regional-differences-potato-yields-germany?amp
Snippet: A total harvest of 10.9 million tons is expected nationwide, which is more than 2 percent above the multi-year average. Regional Differences in ...

🔹 Result 2
Title: Germany: Record heights for potato crop
Link: https://www.freshplaza.com/article/2102606/germany-record-heights-for-potato-crop/
Snippet: Last year, the total was 43,160 kg, while the average is 36,000 kg. The range has decreased since last year with 250 ha, leading to an ...

🔹 Result 3
Title: How and when do I plant potatoes correctly?
Link: https://www.lurch.de/en/Guide/Planting-potatos/?srsltid=AfmBOorIQbGu1L9ygLeS_jgJH2Xv7VVfflPtfIT-iGHk7sC9qYPoTJCe
Snippet: Most potato varieties take about three months to mature. Once the plants have died b

User query →
  → Semantic Search on ChromaDB →
    IF context above threshold:
        → Compose a prompt: [context] + [query]
        → Send to Ollama (local LLM)
        → Show LLM-generated answer
    ELSE:
        → Notify: No local context found
        → Use SerpAPI to fetch Internet results
        → Show fallback results (not passed to LLM unless desired)


In [None]:
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer
import requests
import json

def answer_with_local_llm_rag(query, model, collection, threshold=0.6, top_k=5):
    # Step 1: Embed query
    query_embedding = model.encode([query])[0]

    # Step 2: Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=['documents', 'metadatas', 'embeddings']
    )

    docs = results['documents'][0]
    metas = results['metadatas'][0]
    embeddings = results['embeddings'][0]

    # Step 3: Similarity filtering
    similarities = [1 - cosine(query_embedding, emb) for emb in embeddings]
    relevant_chunks = [
        doc for doc, sim in zip(docs, similarities) if sim >= threshold
    ]

    # Step 4: Prepare input for LLM
    if relevant_chunks:
        print(f"\n✅ Found {len(relevant_chunks)} relevant chunks from DB.")
        context = "\n\n".join(relevant_chunks[:top_k])
        full_prompt = f"""You are a helpful assistant. Use the context below to answer the question.

Context:
{context}

Question:
{query}

Answer:"""
    else:
        print("\n⚠️ No relevant local chunks found.")
        print("🌐 Performing a live Internet search...")

        # Step 5: SerpAPI fallback
        serpapi_api_key = "6fb5953aa0005416f5307922637a89b395a05e7208c7b66ce2171b30d3df4e80"
        params = {
            'engine': 'google',
            'q': query,
            'api_key': serpapi_api_key
        }
        serp_response = requests.get('https://serpapi.com/search', params=params)

        if serp_response.status_code == 200:
            organic_results = serp_response.json().get('organic_results', [])
            if organic_results:
                snippets = [r.get('snippet', '') for r in organic_results[:top_k]]
                context = "\n\n".join(snippets)
                full_prompt = f"""You are a helpful assistant. Use the following search snippets to answer the user's question.

Search Snippets:
{context}

Question:
{query}

Answer:"""
            else:
                print("No search results found.")
                return
        else:
            print(f"Error fetching from SerpAPI: {serp_response.status_code}")
            return

    # Step 6: Query Local LLM via Ollama
    print("\n🧠 Sending prompt to local LLM...")
    ollama_url = "http://localhost:11434/api/chat"
    payload = {
        "model": "gemma3:1b",  # or whatever model you're using
        "messages": [{"role": "user", "content": full_prompt}]
    }

    response = requests.post(ollama_url, json=payload, stream=True)

    if response.status_code == 200:
        print("\n🤖 LLM Response:\n")
        for line in response.iter_lines(decode_unicode=True):
            if line:
                try:
                    data = json.loads(line)
                    if "message" in data and "content" in data["message"]:
                        print(data["message"]["content"], end="")
                except json.JSONDecodeError:
                    continue
        print("\n")
    else:
        print(f"❌ Error querying local LLM: {response.status_code}")


In [156]:
model = SentenceTransformer('all-MiniLM-L6-v2')
answer_with_local_llm_rag("How to manage drought stress in groundnut cultivation?", model, collection)



⚠️ No relevant local chunks found.
🌐 Performing a live Internet search...

🧠 Sending prompt to local LLM...

🤖 LLM Response:

Based on the provided search snippets, here’s how to manage drought stress in groundnut cultivation:

**To develop a water stress response function in groundnut, research works have been done to improve the performance under varying degrees of stress at various.**

**Therefore, the best approach to manage drought stress in groundnut cultivation is to make water saving during periods other than the flowering and pod formation stages of growth.**

Additionally, research suggests that **drought-tolerant peanut cultivars can cope with water scarcity by closing the stomata faster during water stress.**  And, **foliar application of nitric oxide (NO) donors can have positive effects on the induction of tolerance to biotic and abiotic stress on...**

Essentially, focus on optimizing water management during the crucial growth phases of the plant, and consider exploring

In [157]:
answer_with_local_llm_rag("How to handle leaf blight in betel vine in Odhisa?", model, collection)


⚠️ No relevant local chunks found.
🌐 Performing a live Internet search...

🧠 Sending prompt to local LLM...

🤖 LLM Response:

Here’s how to handle leaf blight in betel vine in Odisha, based on the provided information:

Based on the provided information, here’s a recommended approach to dealing with leaf blight in betel vine:

1.  **Spray the affected areas:** After plucking the diseased plants, spray the affected areas with 0.2% Ziram or 0.5% Bordeaux mixture.
2.  **Control measures include spraying insecticides like malathion, neem oil, and dichlorvos.**
3.  **Control measures include spraying insecticides like malathion, neem oil, and dichlorvos.** 
4.  **Implement disease-free betel vine stalks.**
5.  **Boroj (Special structure for betel vine cultivated area) should be in a well-drained area.**
6.  **As a major disease discussed are foot rot, root rot, collar rot, leaf rot, ...**

**Important Note:**  The provided snippets primarily focus on controlling fungal diseases.  While lea

In [161]:
# answer_with_local_llm_rag("How to manage drought stress in groundnut cultivation?", model, collection, threshold=0.6)
answer_with_local_llm_rag("How to handle leaf blight in betel vine in Odhisa?", model, collection, threshold=0.7)


⚠️ No relevant local chunks found.
🌐 Performing a live Internet search...

🧠 Sending prompt to local LLM...

🤖 LLM Response:

Based on the provided snippets, here’s how to handle leaf blight in betel vine in Odisha:

**Control measures include spraying insecticides like malathion, neem oil, and dichlorvos.**

The snippets suggest that you should focus on controlling the disease by spraying these insecticides.  Specifically, you should spray these insecticides after plucking the diseased leaves.

