<a href="https://colab.research.google.com/github/Rickmwasofficial/crop-disease/blob/master/training/Crop_disease.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Crop disease classification

**Problem Definition**

The task here is to train a model capable of identifying crop diseases from images of plant leaves. This is a supervised task, since we have labelled data, and it is a multi-class classification problem.

1. Data
> We are using the [PlantVillage dataset](https://www.kaggle.com/api/v1/datasets/download/abdallahalidev/plantvillage-dataset)
>
 The data is unstructured, since we are working with images, so we naturally resort to a deep learning (DL) model for training.

 The dataset has images for different types of plants, but for our initial stage we will use only the corn/maize subset.

2. Evaluation

  Success for us would mean achieving accuracy scores above 80%.

3. Features
  
  The maize dataset has four categories, i.e. healthy, Cercospora leaf spot (gray leaf spot), northern leaf blight and common rust.

4. Modelling
  
  Based on our problem and data, we are going to train four different models experimentally.

  

*   *Model 1*:
>   The first step is to train a feature extraction model (without unfreezing layers) on the maize dataset without data augmentation.
*   *Model 2*:
>   The second step is to train a feature extraction model on the maize dataset but this time implementing data augmentation
*   *Model 3*:
>   The third step is fine tuning the feature extraction model by training it on the maize dataset without data augmentation
*   *Model 4*:
>   The last step is to fine tune the feature extraction model by training it on the maize dataset, but this time with data augmentation

  Then we shall pick the best performing model
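
The four experiments above are the cross product of two switches (fine-tuning on/off, augmentation on/off). A minimal sketch of that grid and of the "pick the best" step follows. The names `experiments` and `pick_best` are hypothetical helpers, not part of the notebook, and the 0.80 threshold comes from the evaluation goal stated above.

```python
from itertools import product

# Cross product of the two switches, in the same order as Model 1..4 above.
experiments = [
    {"name": f"model_{i}", "fine_tune": fine_tune, "augment": augment}
    for i, (fine_tune, augment) in enumerate(product([False, True], repeat=2), start=1)
]

def pick_best(results, threshold=0.80):
    """Return the best-scoring run and whether it clears the accuracy threshold.

    `results` maps a run name to its test accuracy (values supplied by the caller).
    """
    best = max(results, key=results.get)
    return best, results[best] >= threshold

for exp in experiments:
    print(exp)
```

Once each model has been evaluated, the measured accuracies can be passed to `pick_best` as a dictionary keyed by model name.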



In [None]:
%pip install Flask
%pip install protobuf==5.29.3
%pip install python-dotenv==1.0.1
%pip install Requests==2.32.3
%pip install tensorflow==2.18.0
%pip install Werkzeug==3.1.3
%pip install google-generativeai==0.8.4

In [None]:
%pip install matplotlib

In [None]:
# Do matplotlib settings here in this cell

In [None]:
# Downloading dataset

import os
import requests
import zipfile
from pathlib import Path

def download_dataset(url, save_dir):
    """
    Download and extract dataset
    """
    # Create directory if it doesn't exist
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    # Download file (fail early on HTTP errors instead of saving an error page)
    response = requests.get(url, stream=True)
    response.raise_for_status()
    zip_path = save_dir / "dataset.zip"

    print("Download started!!!")
    # Save downloaded file
    with open(zip_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("completed!!!")
    # Extract files
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(save_dir)

    # Remove zip file
    # zip_path.unlink()

# Example usage
dataset_url = "https://www.kaggle.com/api/v1/datasets/download/abdallahalidev/plantvillage-dataset"  # Replace with actual dataset URL
save_directory = "plant_disease"
download_dataset(dataset_url, save_directory)
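
If the download fails (for example because the endpoint returns an error page), the saved file will not be a valid archive and extraction will raise a confusing error. The sketch below checks the file before extracting. `safe_extract` is a hypothetical helper and only uses the standard library.

```python
import zipfile
from pathlib import Path

def safe_extract(zip_path, save_dir):
    """Extract zip_path into save_dir, or raise if the file is not a ZIP archive."""
    zip_path = Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a ZIP archive; the download may have failed")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(save_dir)
        # Return the archive members so the caller can inspect what was extracted
        return zip_ref.namelist()
```

It could replace the plain `extractall` call inside `download_dataset`, assuming the same `zip_path` and `save_dir` variables.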

In [None]:
# Visualizing images
import random
import matplotlib.pyplot as plt

og_path = "/content/plant_disease/plantvillage dataset/"

def visualize_images(og_path):
  """
  The function receives a path to the directory where images are stored and visualizes them randomly
  """
  plt.figure(figsize=(13, 10))
  for i in range(9):
    plt.subplot(3, 3, i+1)
    img_path = os.path.join(og_path, random.choice(os.listdir(og_path)))
    img_type = random.choice(os.listdir(img_path))
    img = plt.imread(os.path.join(img_path, img_type, random.choice(os.listdir(os.path.join(img_path, img_type)))))
    plt.imshow(img)
    plt.axis('off')
    details = img_type.split('___')
    plt.title(f"Plant Name: {details[0]} \n Disease: {' '.join(details[1].split('_'))}")

visualize_images(og_path)
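
The plot titles above are derived from the class folder names, which in PlantVillage follow a `Plant___Disease_name` pattern. A small sketch of that parsing, kept separate from the plotting code, is shown here. `parse_class_dir` is a hypothetical helper.

```python
def _clean(part):
    """Replace underscores with spaces and collapse repeated whitespace."""
    return " ".join(part.replace("_", " ").split())

def parse_class_dir(dir_name):
    """Split a class folder name like 'Plant___Disease_name' into readable parts."""
    plant, _, disease = dir_name.partition("___")
    return _clean(plant), _clean(disease)

plant, disease = parse_class_dir("Corn_(maize)___Common_rust_")
print(f"Plant Name: {plant} \n Disease: {disease}")
```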

In [None]:
import os
import shutil
import random
from pathlib import Path

def merge_and_split_folders(source_path, plant_name, train_ratio=0.8, destination_path=None):
    """
    Merges folders from different image types and splits into train/test sets.

    Args:
        source_path (str): Path to the plant_village_dataset directory
        plant_name (str): Name of the plant to process (e.g., 'Tomato', 'Potato')
        train_ratio (float): Ratio of images to use for training (default: 0.8)
        destination_path (str, optional): Path where the merged folders will be created.

    Returns:
        str: Path to the created merged directory
    """
    # Convert to Path objects
    source_path = Path(source_path)
    if destination_path is None:
        destination_path = Path(plant_name)
    else:
        destination_path = Path(destination_path)

    # Create train and test directories
    train_path = destination_path / 'train'
    test_path = destination_path / 'test'
    train_path.mkdir(parents=True, exist_ok=True)
    test_path.mkdir(parents=True, exist_ok=True)

    # Image type directories to process
    image_types = ['color', 'grayscale', 'segmented']

    # Dictionary to collect all files per condition before splitting
    condition_files = {}

    # First pass: Collect all files per condition
    for img_type in image_types:
        img_type_path = source_path / img_type

        if not img_type_path.exists():
            print(f"Warning: {img_type} directory not found in {source_path}")
            continue

        # Find all directories for the specified plant
        plant_dirs = [d for d in img_type_path.iterdir() if d.is_dir() and plant_name.lower() in d.name.lower()]

        for plant_dir in plant_dirs:
            # Extract condition from directory name
            condition = plant_dir.name.lower().replace(plant_name.lower(), '').strip()
            if not condition:
                condition = 'healthy'

            # Initialize condition in dictionary if not exists
            if condition not in condition_files:
                condition_files[condition] = []

            # Collect all image files
            for img_file in plant_dir.glob('*'):
                if img_file.is_file() and img_file.suffix.lower() in ['.jpg', '.jpeg', '.png']:
                    new_filename = f"{img_type}_{img_file.name}"
                    condition_files[condition].append((img_file, new_filename))

    # Second pass: Split and copy files
    total_files = 0
    for condition, files in condition_files.items():
        # Create condition directories in both train and test
        train_condition_path = train_path / condition
        test_condition_path = test_path / condition
        train_condition_path.mkdir(exist_ok=True)
        test_condition_path.mkdir(exist_ok=True)

        # Shuffle files for random split
        random.shuffle(files)

        # Calculate split point
        split_idx = int(len(files) * train_ratio)
        train_files = files[:split_idx]
        test_files = files[split_idx:]

        # Copy train files
        for src_file, new_filename in train_files:
            shutil.copy2(
                src_file,
                train_condition_path / new_filename
            )

        # Copy test files
        for src_file, new_filename in test_files:
            shutil.copy2(
                src_file,
                test_condition_path / new_filename
            )

        print(f"\nCondition: {condition}")
        print(f"Training files: {len(train_files)}")
        print(f"Testing files: {len(test_files)}")
        total_files += len(files)

    print(f"\nMerge and split complete!")
    print(f"Total files processed: {total_files}")
    print(f"Output directory: {destination_path.absolute()}")

    return str(destination_path.absolute())

# Example usage
if __name__ == "__main__":
    # Example paths - adjust these to match your actual directory structure
    dataset_path = "plant_disease/plantvillage dataset"
    plant_name = "Corn"
    merged_path = merge_and_split_folders(dataset_path, plant_name)
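
One thing to keep in mind with the function above: the color, grayscale and segmented variants of the same leaf are shuffled as independent files, so variants of one leaf can end up on both sides of the split. A possible alternative is to split by leaf rather than by file. The sketch below assumes the variants of a leaf share a file stem, which may not hold for the real dataset; `grouped_split` and the file names are hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PurePath

def grouped_split(file_names, train_ratio=0.8, seed=42):
    """Split files so that all variants sharing a file stem land on the same side."""
    groups = defaultdict(list)
    for name in file_names:
        groups[PurePath(name).stem].append(name)

    # Shuffle the groups (not the files) with a fixed seed for reproducibility
    stems = sorted(groups)
    random.Random(seed).shuffle(stems)

    split_idx = int(len(stems) * train_ratio)
    train = [f for stem in stems[:split_idx] for f in groups[stem]]
    test = [f for stem in stems[split_idx:] for f in groups[stem]]
    return train, test

# Hypothetical file names: 10 leaves, each present in three variants
demo_files = [f"{variant}/leaf_{i:03d}.jpg"
              for i in range(10)
              for variant in ("color", "grayscale", "segmented")]
train_files, test_files = grouped_split(demo_files)
print(len(train_files), len(test_files))
```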

In [None]:
os.listdir('/content/Corn')

In [None]:
# Visualizing our corn dataset
og_path = "/content/Corn/train/"

def visualize_images(og_path):
  """
  The function receives a path to the directory where images are stored and visualizes them randomly
  """
  plt.figure(figsize=(13, 10))
  for i in range(4):
    plt.subplot(2, 2, i+1)
    disease = random.choice(os.listdir(og_path))
    img_path = os.path.join(og_path, disease)
    img_type = random.choice(os.listdir(img_path))
    img = plt.imread(os.path.join(img_path, img_type))
    plt.imshow(img)
    plt.axis('off')
    plt.tight_layout()
    plt.title(f"Plant Name: Corn / Maize \n Disease: {disease}")

visualize_images(og_path)

## Load the dataset into tensors

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np

train_dir = '/content/Corn/train/'
test_dir = '/content/Corn/test/'

train_data = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

test_data = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)
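
With `label_mode = 'categorical'`, each label is a one-hot vector whose position follows the sorted order of the class folder names (the same order as `train_data.class_names`). The following sketch mimics that mapping in plain Python; `one_hot_labels` is a hypothetical helper, not a Keras function.

```python
def one_hot_labels(class_dirs):
    """Map each class folder to a one-hot vector, using sorted folder order as index."""
    class_names = sorted(class_dirs)
    return {
        name: [1.0 if i == idx else 0.0 for i in range(len(class_names))]
        for idx, name in enumerate(class_names)
    }

# Placeholder folder names for illustration
labels = one_hot_labels(["rust", "blight", "healthy"])
print(labels["blight"])
```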

In [None]:
train_data.class_names

In [None]:
for image_batch, label_batch in train_data.take(1):
  img = image_batch[1]/255.0
  plt.imshow(np.squeeze(img))
  plt.axis('off')
  plt.title(train_data.class_names[np.argmax(label_batch[1].numpy())])

In [None]:
from tensorflow.keras.applications import EfficientNetV2B0
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy, SparseCategoricalCrossentropy

base_model = EfficientNetV2B0(
    input_shape = (224, 224, 3),
    include_top = False
)

base_model.trainable = False

inputs = layers.Input(shape=(224, 224, 3), name='input_layer')

x = base_model(inputs)

x = layers.GlobalAveragePooling2D(name='GAP_layer')(x)

outputs = layers.Dense(len(train_data.class_names), activation='softmax', name='output_layer')(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    loss = CategoricalCrossentropy(),
    optimizer = Adam(),
    metrics = ['accuracy']
)
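
Since the base model is frozen, the only trainable part is the small head: global average pooling collapses each feature map to a single number per channel, and the dense layer maps those numbers to class probabilities. The sketch below illustrates both steps in plain Python. It assumes the base model emits 1280 channels and that there are 4 classes; `model.summary()` shows the actual values. Both helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average a (height, width, channels) nested list over its spatial axes."""
    pixels = [pixel for row in feature_map for pixel in row]
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]

def dense_param_count(in_features, units):
    """Weights plus biases of a fully connected layer."""
    return in_features * units + units

# A tiny 2x2 feature map with 2 channels
pooled = global_average_pool([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(pooled)

# Trainable parameters of the output layer under the stated assumptions
print(dense_param_count(1280, 4))
```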

In [None]:
model.summary()

In [None]:
history = model.fit(
    train_data,
    epochs = 5,
    # steps_per_epoch = len(train_data) - 1,
    validation_data = test_data,
    validation_steps = int(0.25 * len(test_data))
)
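
`len(test_data)` is the number of batches, not the number of images, so `validation_steps` above means "validate on roughly a quarter of the test batches after each epoch". A short sketch of the arithmetic follows; the image count is a made-up example and the helper names are hypothetical.

```python
import math

def batches_per_epoch(num_images, batch_size=32):
    """Number of batches needed to cover num_images (the last batch may be partial)."""
    return math.ceil(num_images / batch_size)

def validation_steps(num_test_images, fraction=0.25, batch_size=32):
    """Number of validation batches used per epoch for the given fraction."""
    return int(fraction * batches_per_epoch(num_test_images, batch_size))

# Example with a hypothetical test set of 1000 images
print(batches_per_epoch(1000), validation_steps(1000))
```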

In [None]:
!wget https://raw.githubusercontent.com/Rickmwasofficial/tensorflow-deep-learning/main/extras/helper_functions.py
from helper_functions import plot_loss_curves, compare_historys

In [None]:
model.evaluate(test_data)

In [None]:
plot_loss_curves(history)

In [None]:
img_path = '/content/Corn/test/'
img_type = random.choice(os.listdir(img_path))
new_img = os.path.join(img_path, img_type)
img = os.path.join(new_img, random.choice(os.listdir(new_img)))
img = tf.io.read_file(img)

# Decode the image into a tensor
img = tf.image.decode_image(img, channels=3)  # Ensure 3 channels (RGB)

# Resize the image to the expected input shape
img = tf.image.resize(img, size=[224, 224])

# Rescale the image to [0, 1]
img_1 = img / 255.0

# Expand dimensions to fit model input
img = tf.expand_dims(img, axis=0)

# Make prediction
preds = model.predict(img, verbose=0)

# Print predictions for debugging
# print("Predictions:", preds)

# Get the class index with the highest probability
predicted_class_index = tf.argmax(preds[0])

plt.title(f"Prediction: {' '.join(train_data.class_names[predicted_class_index].split('___')[1].split('_'))} \n True: {' '.join(img_type.split('___')[1].split('_'))} \n Proba: {preds[0][tf.argmax(preds[0])]}")
plt.imshow(img_1)
plt.axis('off')
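
Note that the model receives the unscaled image, while the `/ 255.0` copy is only used for plotting; Keras EfficientNetV2 models include their own input rescaling by default. The label decoding in the title can also be pulled out into a small function. The sketch below assumes class names shaped like the folder names produced by the merge step (e.g. `_(maize)___healthy`); `decode_prediction` is a hypothetical helper.

```python
def decode_prediction(probs, class_names):
    """Return (readable label, probability) for the highest-scoring class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    # Keep only the part after the '___' separator and tidy the underscores
    label = class_names[best].split("___")[-1].replace("_", " ").strip()
    return label, probs[best]

demo_names = ["_(maize)___common_rust_", "_(maize)___healthy", "_(maize)___northern_leaf_blight"]
print(decode_prediction([0.1, 0.7, 0.2], demo_names))
```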

In [None]:
save_pth = '/content/drive/MyDrive/Crop Disease'
model.save(f'{save_pth}/crop_disease_efficientnet_V1.keras')

#### We have finished the first training run. The model is performing well, but it seems to be overfitting, so we will now try data augmentation to address this

In [None]:
# Load dataset into tensors

import tensorflow as tf
import pandas as pd
import numpy as np

train_dir = '/content/Corn/train/'
test_dir = '/content/Corn/test/'

train_data = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

test_data = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

In [None]:
data_augmentation = tf.keras.Sequential([
    layers.RandomZoom(0.2),
    layers.RandomHeight(0.2),
    layers.RandomWidth(0.2),
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.2)
])
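
The factors passed to the augmentation layers are fractions rather than absolute units: for `RandomRotation`, the factor is a fraction of a full turn, and for `RandomZoom` it is a fraction of the image size. The sketch below converts the 0.2 factors into more familiar units. `factor_to_range` is a hypothetical helper and does not call Keras.

```python
def factor_to_range(factor, full_scale):
    """Expand a single symmetric factor into a (low, high) range in the given units."""
    return (-factor * full_scale, factor * full_scale)

rotation_deg = factor_to_range(0.2, 360.0)  # rotation expressed in degrees
zoom_pct = factor_to_range(0.2, 100.0)      # zoom expressed in percent
print(rotation_deg, zoom_pct)
```

A rotation of up to about 72 degrees in either direction is fairly aggressive, so it may be worth experimenting with smaller factors.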

In [None]:
base_model = EfficientNetV2B0(
    include_top = False,
    input_shape = (224, 224, 3)
)

base_model.trainable = False

inputs = layers.Input(shape=(224, 224, 3), name='input_layer')

x = data_augmentation(inputs)

# Feed the augmented tensor (not the raw inputs) into the frozen base model
x = base_model(x, training=False)

x = layers.GlobalAveragePooling2D(name='GAP_layer')(x)

outputs = layers.Dense(len(train_data.class_names), activation='softmax', name='output_layer')(x)

model_2 = tf.keras.Model(inputs, outputs)

model_2.compile(
    loss = CategoricalCrossentropy(),
    optimizer = Adam(),
    metrics = ['accuracy']

)

model_2.summary()

In [None]:
hist_2 = model_2.fit(
    train_data,
    epochs = 5,
    # steps_per_epoch = len(train_data) - 1,
    validation_data = test_data,
    validation_steps = int(0.25 * len(test_data))
)

In [None]:
model_2.evaluate(test_data)

In [None]:
plot_loss_curves(hist_2)

#### The model seems to be learning much better with data augmentation

In [None]:
img_path = '/content/Corn/test/'
img_type = random.choice(os.listdir(img_path))
new_img = os.path.join(img_path, img_type)
img = os.path.join(new_img, random.choice(os.listdir(new_img)))
img = tf.io.read_file(img)

# Decode the image into a tensor
img = tf.image.decode_image(img, channels=3)  # Ensure 3 channels (RGB)

# Resize the image to the expected input shape
img = tf.image.resize(img, size=[224, 224])

# Rescale the image to [0, 1]
img_1 = img / 255.0

# Expand dimensions to fit model input
img = tf.expand_dims(img, axis=0)

# Make prediction
preds = model_2.predict(img, verbose=0)

# Print predictions for debugging
# print("Predictions:", preds)

# Get the class index with the highest probability
predicted_class_index = tf.argmax(preds[0])

plt.title(f"Prediction: {' '.join(train_data.class_names[predicted_class_index].split('___')[1].split('_'))} \n True: {' '.join(img_type.split('___')[1].split('_'))} \n Proba: {preds[0][tf.argmax(preds[0])]}")
plt.imshow(img_1)
plt.axis('off')

In [None]:
model_2.save(f'{save_pth}/crop_disease_efficientnet_V2.keras')

## Now we try fine-tuning the two feature extraction models to see if we can get just a little higher accuracy from them

1. Fine tuning without data augmentation

In [None]:
# Load dataset into tensors

import tensorflow as tf
import pandas as pd
import numpy as np

train_dir = '/content/Corn/train/'
test_dir = '/content/Corn/test/'

train_data = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

test_data = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

In [None]:
base_model = EfficientNetV2B0(
    include_top = False,
    input_shape = (224, 224, 3)
)

base_model.trainable = True

# Freeze every layer except the last ten, which stay trainable
for layer in base_model.layers[:-10]:
  layer.trainable = False

inputs = layers.Input(shape=(224, 224, 3), name='input_layer')

x = base_model(inputs)

x = layers.GlobalAveragePooling2D(name='GAP_layer')(x)

outputs = layers.Dense(len(train_data.class_names), activation='softmax', name='output_layer')(x)

model_3 = tf.keras.Model(inputs, outputs)

model_3.compile(
    loss = CategoricalCrossentropy(),
    optimizer = Adam(learning_rate=0.0001),
    metrics = ['accuracy']
)
model_3.summary()
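
The freezing logic relies on a negative slice: `layers[:-10]` selects everything except the last ten layers. The sketch below reproduces it with stand-in objects so the effect can be checked without building the model. `DummyLayer` and `freeze_all_but_last` are hypothetical names.

```python
class DummyLayer:
    """Stand-in for a Keras layer; only the attributes used here are modelled."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_all_but_last(layers, n_trainable):
    """Freeze every layer except the final n_trainable ones; return trainable names."""
    cutoff = max(len(layers) - n_trainable, 0)
    for layer in layers[:cutoff]:
        layer.trainable = False
    return [layer.name for layer in layers if layer.trainable]

layers_demo = [DummyLayer(f"layer_{i}") for i in range(15)]
trainable_names = freeze_all_but_last(layers_demo, 10)
print(trainable_names)
```

Using an explicit cutoff instead of `layers[:-n]` avoids a pitfall when `n` is 0: `layers[:-0]` is an empty slice, so nothing would be frozen.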

In [None]:
for idx, layer in enumerate(base_model.layers):
  print(f'{idx}: {layer.name}, {layer.trainable}')

In [None]:
hist_3 = model_3.fit(
    train_data,
    epochs = 20,
    validation_data = test_data,
    validation_steps = int(0.25 * len(test_data)),
    initial_epoch = len(history.history['loss'])- 1
)
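
In Keras, `epochs` is the index of the final epoch rather than the number of additional epochs, so combining it with `initial_epoch` shortens the run. The sketch below works through the numbers, assuming the first history holds 5 epochs as configured earlier; `epochs_actually_run` is a hypothetical helper.

```python
def epochs_actually_run(epochs, initial_epoch):
    """Number of epochs Keras trains for when starting at initial_epoch."""
    return max(epochs - initial_epoch, 0)

first_run_epochs = 5                   # epochs recorded in the earlier history
initial_epoch = first_run_epochs - 1   # same expression as in the fit call
print(epochs_actually_run(20, initial_epoch))
```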

In [None]:
plot_loss_curves(hist_3)

In [None]:
compare_historys(history, hist_3)

In [None]:
model_3.evaluate(test_data)

In [None]:
model_3.save(f'{save_pth}/crop_disease_efficientnet_finetuned_V1.keras')

2. Fine tuning with data augmentation

As you can see, the fine-tuned model has a higher accuracy and lower loss. Let's now try adding data augmentation.

In [None]:
# Load dataset into tensors

import tensorflow as tf
import pandas as pd
import numpy as np

train_dir = '/content/Corn/train/'
test_dir = '/content/Corn/test/'

train_data = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

test_data = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    label_mode = 'categorical',
    image_size = (224, 224),
    batch_size = 32,
    shuffle = True
)

In [None]:
data_augmentation = tf.keras.Sequential([
    layers.RandomZoom(0.2),
    layers.RandomHeight(0.2),
    layers.RandomWidth(0.2),
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.2)
])

In [None]:
base_model = EfficientNetV2B0(
    include_top = False,
    input_shape = (224, 224, 3)
)

base_model.trainable = True

# Freeze every layer except the last ten, which stay trainable
for layer in base_model.layers[:-10]:
  layer.trainable = False

inputs = layers.Input(shape=(224, 224, 3), name='input_layer')

x = data_augmentation(inputs)

x = base_model(x, training=False)

x = layers.GlobalAveragePooling2D(name='GAP_layer')(x)

outputs = layers.Dense(len(train_data.class_names), activation='softmax', name='output_layer')(x)

model_4 = tf.keras.Model(inputs, outputs)

model_4.compile(
    loss = CategoricalCrossentropy(),
    optimizer = Adam(learning_rate=0.0001),
    metrics = ['accuracy']
)
model_4.summary()

In [None]:
hist_4 = model_4.fit(
    train_data,
    epochs = 13,
    validation_data = test_data,
    validation_steps = int(0.25 * len(test_data)),
    initial_epoch = len(hist_2.history['loss'])- 1
)

In [None]:
plot_loss_curves(hist_4)

In [None]:
compare_historys(hist_2, hist_4)

In [None]:
model_4.evaluate(test_data)

In [None]:
model_4.save(f'{save_pth}/crop_disease_efficientnet_finetuned_V2.keras')

## Result

From the experiments, the fine-tuned feature extractor model without data augmentation performs best. It will be our new base model, which we will try to improve on in the future.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import load_model
import os
import random  # needed for picking a random test image below
import matplotlib.pyplot as plt

model_3 = load_model('/content/drive/MyDrive/Crop Disease/crop_disease_efficientnet_finetuned_V1.keras')
def make_prediction(model, path):

    img = tf.io.read_file(path)

    # Decode the image into a tensor
    img = tf.image.decode_image(img, channels=3)  # Ensure 3 channels (RGB)

    # Resize the image to the expected input shape
    img = tf.image.resize(img, size=[224, 224])

    # Rescale the image to [0, 1]
    img_1 = img / 255.0

    # Expand dimensions to fit model input
    img = tf.expand_dims(img, axis=0)

    # Make prediction
    preds = model.predict(img, verbose=0)

    # Print predictions for debugging
    # print("Predictions:", preds)

    # Get the class index with the highest probability
    predicted_class_index = tf.argmax(preds[0])

    plt.title(f"Prediction: {' '.join(train_data.class_names[predicted_class_index].split('___')[1].split('_'))} \n Proba: {preds[0][tf.argmax(preds[0])]}")
    plt.imshow(img_1)
    plt.axis('off')

img_path = '/content/Corn/test/'
img_type = random.choice(os.listdir(img_path))
new_img = os.path.join(img_path, img_type)
img = os.path.join(new_img, random.choice(os.listdir(new_img)))
# img = tf.io.read_file(img)

make_prediction(model_3, img)
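
`make_prediction` reads `train_data.class_names`, which only exists if the dataset cell was run in the same session. When the model is loaded in a fresh session, one option is to store the class names next to the model and load them back. The sketch below uses a JSON file; the helper names and the file name are hypothetical.

```python
import json
from pathlib import Path

def save_class_names(class_names, path):
    """Write the ordered class names to a JSON file."""
    Path(path).write_text(json.dumps(list(class_names)))

def load_class_names(path):
    """Read the ordered class names back from a JSON file."""
    return json.loads(Path(path).read_text())

# Example usage with a hypothetical file name:
# save_class_names(train_data.class_names, "class_names.json")
# class_names = load_class_names("class_names.json")
```

The order matters, because the position of each name must match the index of the corresponding model output.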

## Generative AI part


#### Install Dependencies

In [3]:
%pip install chromadb
%pip install tiktoken
%pip install langchain


## Import Dependencies

In [6]:
%pip install langchain_google_genai


In [7]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import (
    HuggingFaceInferenceAPIEmbeddings,
)
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

## Text processing

In [8]:
%pip install pypdf


In [18]:
loader = PyPDFDirectoryLoader('rag_data')

In [19]:
data=loader.load()

In [20]:
data[0]

Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20231022103646', 'source': 'rag_data\\DiseasesofFieldCropsandTheirManagement.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='')

### Text Splitting

In [21]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=900,chunk_overlap=20)
text_chunks=text_splitter.split_documents(data)
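
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; it only shows how consecutive chunks share their boundary characters. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=900, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```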

In [22]:
text_chunks

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20231022103646', 'source': 'rag_data\\DiseasesofFieldCropsandTheirManagement.pdf', 'total_pages': 15, 'page': 1, 'page_label': '2'}, page_content="Diseases Of Field Crops\nAnd Their Management\n7KH\x03ERRN\x03HQWLWOHG\x03³'LVHDVHV\x03RI\x03¿HOG\x03FURSV\x03DQG\x03WKHLU\x03PDQDJHPHQW´\x03SURYLGHV\x03PRVW\x03UHFHQW\x03LQIRUPDWLRQ\x03 \nDERXW\x03PDMRU\x03GLVHDVHV\x03RI\x03FXOWLYDWLRQ\x03¿HOG\x03FURSV\x0f\x03WKHLU\x03V\\PSWRPV\x0f\x03SDWKRJHQ\x03FKDUDFWHUV\x0f\x03HSLGHPLRORJ\\\x0f\x03 \nDQG\x03PDQDJHPHQW\x11\x03,Q\x03RUGHU\x03WR\x03PDNH\x03WKH\x03ERRN\x03DOO\x03LQ\x03RQH\x0f\x03WKH\x03LPSRUWDQFH\x03RI\x03PDMRU\x03GLVHDVHV\x03KDV\x03DOVR\x03 \nEHHQ\x03GHDOW\x03ZLWK\x03LQ\x03EULHI\x11 \n'U\x11\x036\x11\x033DUWKDVDUDWK\\\x03LV\x03$VVLVWDQW\x033URIHVVRU\x03\x0b3ODQW\x033DWKRORJ\\\x0c\x0f\x03&ROOHJH\x03RI\x03$JULFXOWXUDO\x037HFKQRORJ\\\x0f\x03 \n7KHQL\x11\x03+H\x03FRPSOHWHG\x03KLV\x033K\x11'\x11\x03IURP\x037DPLO\x

In [24]:
len(text_chunks)

619

In [25]:
print(text_chunks[102].page_content)

Senegal, Sudan, Ethiopia and Angola as well as some parts of West Africa. It can live at altitudes up to 1000 m above 
sea level. Stem borers are mainly distributed from country to country or region to region by diapausing (dormant) 
larvae in the stems and other crop residues. Millet stems are often used for roofs, fences and other building uses; it 
has been reported that attacks are more severe near villages where the stems are used for this purpose.  
FURTHER READING 
Youm, O., Harris, K.M., and Nwan’ze,K.F . 1996. Coniesta ignefusalis, the Millet stem borer: a handbook of information. 
Information Bulletin no. 46. Patancheru 502 324, Andhra Pradesh, India: International Crops Research Institute for the 
Semi-Arid Tropics. http://pdf.usaid.gov/pdf_docs/pnaby140.pdf 
ICRISAT . Pheromone-based monitoring system to manage the millet stem borer Coniesta ignefusalis (Lepidoptera:


### Load environment Variables

In [26]:
from dotenv import load_dotenv
load_dotenv()
gemini_api_key=os.getenv('GEMINI_API_KEY')
huggingface_api_key=os.getenv('HUGGINGFACE_API_KEY')
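
`os.getenv` returns `None` when a variable is missing, and the failure then only shows up later as an authentication error. A small guard can make the problem visible immediately. `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value

# Example usage, after load_dotenv() has run:
# gemini_api_key = require_env('GEMINI_API_KEY')
```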

In [None]:
huggingface_api_key

In [28]:
hf_embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=huggingface_api_key,
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

## Create Chroma Db

In [29]:
persist_directory='db'

In [31]:
vectordb=Chroma.from_documents(documents=text_chunks,
                               embedding=hf_embeddings,
                               persist_directory=persist_directory,
                               )

In [32]:
vectordb=Chroma(persist_directory=persist_directory,embedding_function=hf_embeddings)


In [33]:
vectordb

<langchain_community.vectorstores.chroma.Chroma at 0x21d31558680>

### Add a Retriever

In [34]:
retriever=vectordb.as_retriever()

In [35]:
docs=retriever.invoke("What is blight disease?")

In [36]:
retriever=vectordb.as_retriever(search_kwargs={"k":2})
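
With `k` set to 2, the retriever returns only the two chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with cosine similarity on toy vectors. It is an illustration only: Chroma's actual index and distance metric may differ, and both function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-dimensional "embeddings" for three chunks
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_docs, k=2))
```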

In [37]:
docs2=retriever.invoke("What is blight disease?")

In [38]:
docs2

[Document(metadata={'creationdate': '2015-09-04T15:53:35+01:00', 'creator': 'Adobe InDesign CC 2014 (Macintosh)', 'moddate': '2015-09-04T15:55:26+01:00', 'page': 110, 'page_label': '111', 'producer': 'Adobe PDF Library 11.0', 'rgid': 'PB:282278884_AS:278936278847499@1443514999965', 'source': 'rag_data\\PestanddiseasemanualallPRAMandASHC.pdf', 'total_pages': 136, 'trapped': '/False'}, page_content='disease similar to leaf petiole and stem blight was mostly associated with another fungus, from the Phoma genus. An \nAlternaria was found occasionally, but it was not typical of A. bataticola. Phoma is a common soil-borne fungus that \ncauses a pink rot of the storage roots, but had not previously been reported on vines in South Africa. \nLeaf petiole and stem blight of sweet potato is also known as Alternaria anthracnose. Anthracnose means ‘coal disease’; \nit is a word used to describe diseases caused by fungi that produce dark spots on leaves, petioles, stems and fruits. The \ndisease on 

## Make a Chain

In [39]:
from langchain.chains import RetrievalQA

In [46]:
llm=ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    api_key=gemini_api_key,
    temperature=0.5,
    max_tokens=100,
    
)

In [47]:
qa_chain=RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=retriever,
                                     return_source_documents=True)
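
The `'stuff'` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording is made up for illustration and is not LangChain's actual template; `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, documents):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

demo_prompt = build_stuff_prompt(
    "What is blight disease?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(demo_prompt)
```

Because every chunk goes into the same prompt, the number of retrieved chunks (`k`) and the chunk size together determine how much context the model receives.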

In [48]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSource:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [51]:
query='What is blight disease?'
llm_response=qa_chain.invoke({"query": query})
process_llm_response(llm_response=llm_response)

Blight is a disease caused by fungi that produce dark spots on leaves, petioles, stems, and fruits.


Source:
rag_data\PestanddiseasemanualallPRAMandASHC.pdf
rag_data\PestanddiseasemanualallPRAMandASHC.pdf
