<a href="https://colab.research.google.com/github/langat-che/Crop_disease_prediction_and_recommendation_system/blob/bev%2Fupdates/CDRR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## 1.1 Problem Statement
Agriculture is the backbone of many African economies, including Kenya. However, crop losses due to pests and diseases pose a major threat to food security and farmers' livelihoods. These losses often go undetected until it's too late, especially among smallholder farmers who may lack access to timely agronomic advice or diagnostics.

**Problem:**
Smallholder farmers often lack:
- Timely diagnosis of plant diseases
- Access to expert advice or affordable treatment
- Scalable, easy-to-use tools

---

## 1.2 Objectives
**Main Goal**
Build an AI-powered system that detects crop diseases and provides relevant treatment recommendations.

**Specific Objectives:**
1. Collect and clean crop images and metadata (healthy and diseased)
2. Train an ML/DL model to identify crop diseases
3. Develop a recommendation engine for treatment and prevention
4. Build a user-friendly platform for farmers to interact with the system
5. Demonstrate potential impact on yield, cost savings, and farmer empowerment

---

## 1.3 Metrics of Success
To evaluate the effectiveness and impact of the system:

### Model Performance
- **Accuracy, Precision, Recall, and F1-score** of the disease prediction model
- **Confusion matrix** to assess misclassification of disease types
- **Model latency** (how fast predictions are returned)
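
As a rough illustration of how these metrics relate to one another, the sketch below derives a confusion matrix, per-class precision, recall, and F1, and overall accuracy from two label lists in plain Python. The labels and the function names `confusion_matrix` and `per_class_metrics` are invented for this example; an actual evaluation would more likely rely on an established library.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_metrics(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, overall_accuracy)."""
    cm = confusion_matrix(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(cm[(t, label)] for t in labels if t != label)
        fn = sum(cm[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    accuracy = sum(cm[(label, label)] for label in labels) / len(y_true)
    return metrics, accuracy

# Hypothetical labels, for illustration only
y_true = ["rust_d", "rust_d", "healthy", "healthy", "blight_d", "blight_d"]
y_pred = ["rust_d", "healthy", "healthy", "healthy", "blight_d", "rust_d"]

metrics, accuracy = per_class_metrics(y_true, y_pred)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting per-class values alongside accuracy matters here because a model can score well overall while still missing a rare disease class.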

### Usability & Adoption
- **User satisfaction scores** (via surveys)
- **Adoption rate** among smallholder farmers
- **Task success rate** (successful image uploads and predictions)

---

## 1.4 External Relevance & Risk Factors
Key external elements that could affect success:

### Data Challenges
- Limited or low-quality labeled datasets
- Variations in disease appearance across regions

### Infrastructure & Access
- Poor rural internet or electricity
- Limited access to recommended agrovet products

### Policy & Ethics
- Data privacy concerns (image uploads)
- Government or regulatory compliance

### Environmental Factors
- Seasonal disease variations
- Climate change influencing disease patterns

# 2. Data Understanding

The success of the crop disease prediction system relies heavily on the quality and diversity of the data used to train and evaluate the models. This section outlines the dataset being used—**TOM2024**—and its relevance, structure, and characteristics.

## 2.1 Dataset Overview

The **TOM2024 dataset** is a comprehensive agricultural dataset that contains:

- **Total images**: 25,844 raw images  
- **Labeled images**: 12,227 expert-labeled images  
- **Crop types**: Tomato, Maize, Onion  
- **Classes**: 30 classes covering pests and diseases  
- **Image resolution**: High-quality images captured under varied environmental conditions

This dataset is especially valuable for building machine learning and deep learning models for plant health diagnostics, enabling early detection and informed responses to crop threats.

## 2.2 Purpose and Value

- Facilitates **early and accurate pest/disease identification**
- Supports **precision agriculture** and **reduced pesticide use**
- Enables **regionally adaptable AI models**
- Usable for **education**, **extension services**, and **digital agriculture tools**

## 2.3 Data Structure

| Field/Aspect         | Description |
|----------------------|-------------|
| `image`              | High-resolution image of plant leaf or crop symptom |
| `crop_type`          | Categorical: tomato, maize, onion |
| `issue_type`         | Categorical: pest or disease |
| `class_label`        | Specific pest/disease (30 classes) |
| `environment_context`| Field conditions, lighting, background elements |
| `expert_verified`    | Boolean: whether image has been validated by an expert |

## 2.4 Data Collection Methodology

The dataset was curated using a systematic approach:

1. **Site Selection** – Representative agricultural areas chosen
2. **Image Acquisition** – Raw images collected in-field using mobile devices
3. **Image Grouping** – By crop type and visible issue (pest/disease)
4. **Issue Identification** – Initial tagging by field experts
5. **Image Cropping/Resizing** – For consistency and model-readiness
6. **Expert Validation** – Cropped images verified by trained agronomists
7. **Storage and Structuring** – Organized for easy access by researchers

## 2.5 Data Quality & Variability

- Images captured in **diverse lighting, backgrounds, and leaf stages**
- **Label imbalance** may exist across certain pests/diseases
- High **label accuracy** due to expert validation
- **Realistic variation** in field conditions enhances model robustness
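
One way to check whether label imbalance is actually present is to count the images in each class folder. The sketch below assumes a directory whose immediate subfolders are classes (for example a crop's `train` folder in Category B); the helper names `count_images_per_class` and `imbalance_ratio` are hypothetical and not part of the dataset tooling.

```python
import os
from collections import Counter

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def count_images_per_class(split_dir):
    """Count image files in each immediate subfolder of split_dir."""
    counts = Counter()
    for class_name in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, class_name)
        if os.path.isdir(class_path):
            counts[class_name] = sum(
                1 for f in os.listdir(class_path)
                if f.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts

def imbalance_ratio(counts):
    """Largest class size divided by the smallest non-empty class size."""
    sizes = [n for n in counts.values() if n > 0]
    return max(sizes) / min(sizes) if sizes else float("nan")
```

A ratio close to 1 suggests balanced classes; a large ratio is a hint that class weighting or targeted augmentation may be worth considering.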

## 2.6 Dataset Directory Structure

The TOM2024 dataset is divided into two main categories, each with a unique structure that supports different stages of model development.

---

### 📁 Category A – Raw Labeled Collections

This category consists of image folders grouped by **crop type** and **issue type** (pests, diseases, or pest activities). It serves as the initial labeled dataset for exploration, augmentation, or additional curation.

---

Each subfolder contains images representing a specific pest, disease, or activity affecting the respective crop. These images are labeled and may vary in lighting, background, and clarity to simulate real-world field conditions.

---

### 📁 Category B – Train/Test Data for Model Training

Category B is structured to support supervised machine learning and deep learning model training. Each crop folder contains `train` and `test` subfolders, and within those are class folders labeled according to the condition:

- **`_d`** → Disease  
- **`_p`** → Pest  
- **`_a`** → Pest Activity (maize only)  
- A separate folder for **healthy** plants is included in each case.
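
The suffix convention above can be turned into a small lookup, sketched here. The function name `issue_type_from_folder` is hypothetical, and the sketch assumes that healthy folders contain the word "healthy" in their name, which the dataset description does not spell out.

```python
def issue_type_from_folder(folder_name):
    """Map a class folder name to its issue type using the suffix convention."""
    name = folder_name.lower()
    if "healthy" in name:          # assumption: healthy folders are named this way
        return "healthy"
    suffix_to_issue = {"_d": "disease", "_p": "pest", "_a": "pest_activity"}
    for suffix, issue in suffix_to_issue.items():
        if name.endswith(suffix):
            return issue
    return "unknown"

for folder in ["virosis_d", "Spodoptera_frugiperda_p", "healthy"]:
    print(folder, "->", issue_type_from_folder(folder))
```

Lowercasing before the comparison means folders that still carry an uppercase `_D` or `_P` suffix are classified the same way.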



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
zip_path = '/content/drive/My Drive/Data.zip'

In [3]:
import zipfile
import os

extract_path = '/content/dataset'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

### Loading and exploring the Dataset

In [4]:
dataset_path = '/content/dataset'

for root, dirs, files in os.walk(dataset_path):
    print("Current folder:", root)
    for f in files:
        print("  File:", f)


Streaming output truncated to the last 5000 lines.
  File: 1568862208_1659780866614.jpg
  File: 11651630_1661181134312.jpg
  File: 66348894_1660040446021.jpg
  File: 297712499_1659169683728.jpg
  File: 1471622473_1660205466777.jpg
  File: 1981140678_1659339472449.jpg
  File: 20467544_1659781014983.jpg
  File: 734306931_1661085293836.jpg
  File: 1766497932_1660898149324.jpg
  File: 842278875_1659169893698.jpg
  File: 478178275_1661871915093.jpg
  File: 601350043_1660202771062.jpg
  File: 497018844_1662116999422.jpg
  File: 1288584240_1661355799139.jpg
  File: 923814837_1659433506335.jpg
  File: 686918655_1659165335570.jpg
  File: 298335239_1660381170126.jpg
  File: 170540221_1661356417707.jpg
  File: 1115343460_1661182428006.jpg
  File: 2015650838_1660556833702.jpg
  File: 1259307558_1661272741594.jpg
  File: 1496740273_1660205712139.jpg
  File: 1218755420_1660383751724.jpg
  File: 922358325_1661167669157.jpg
  File: 275761169_1660911008812.jpg
  File: 1180541318_166020313

# 3. Data Preparation
Before training the model, the dataset underwent several steps of cleaning, labeling, augmentation, and formatting to ensure it is suitable for machine learning tasks.

---

## 3.1 Data Cleaning

- Replaced hyphens and whitespace in folder and file names with underscores.
- Changed trailing `D` and `P` label suffixes to lowercase.
- Removed duplicate or corrupted image files.
- Ensured consistent image formats (e.g., `.jpg`, `.png`) and resized images to a standard size (e.g., 224x224 pixels).
- Verified class labels for each folder based on metadata and visual inspection.
- Excluded low-quality or highly blurred images that could introduce noise into the model.


In [5]:
# importing necessary libraries
import re
import hashlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import random
import cv2
from PIL import Image, UnidentifiedImageError

### 3.1.1 Handling the folder names

In [6]:
# Step 1: Replace hyphens with spaces
def replace_hyphens_with_spaces(root_dir):
    for root, dirs, files in os.walk(root_dir, topdown=False):
        for name in files:
            if '-' in name:
                old_path = os.path.join(root, name)
                new_name = name.replace('-', ' ')
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)
        for name in dirs:
            if '-' in name:
                old_path = os.path.join(root, name)
                new_name = name.replace('-', ' ')
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)

# Step 2: Replace spaces with underscores
def rename_all(root_dir):
    for root, dirs, files in os.walk(root_dir, topdown=False):
        for name in files:
            if ' ' in name:
                old_path = os.path.join(root, name)
                new_name = name.replace(' ', '_')
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)
        for name in dirs:
            if ' ' in name:
                old_path = os.path.join(root, name)
                new_name = name.replace(' ', '_')
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)

# Step 3: Collapse multiple underscores
def clean_names(root_dir):
    for root, dirs, files in os.walk(root_dir, topdown=False):
        for name in files:
            new_name = re.sub(r'_+', '_', name)
            if new_name != name:
                old_path = os.path.join(root, name)
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)
        for name in dirs:
            new_name = re.sub(r'_+', '_', name)
            if new_name != name:
                old_path = os.path.join(root, name)
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)

# Step 4: Fix diseases ending with D and pests ending with P (folder & file names)
def fix_disease_pest_labels(root_dir):
    for root, dirs, files in os.walk(root_dir, topdown=False):
        for name in files:
            name_wo_ext, ext = os.path.splitext(name)
            new_name_wo_ext = re.sub(r'(?<=\w)D$', 'd', name_wo_ext)
            new_name_wo_ext = re.sub(r'(?<=\w)P$', 'p', new_name_wo_ext)
            if new_name_wo_ext != name_wo_ext:
                old_path = os.path.join(root, name)
                new_path = os.path.join(root, new_name_wo_ext + ext)
                os.rename(old_path, new_path)
        for name in dirs:
            new_name = re.sub(r'(?<=\w)D$', 'd', name)
            new_name = re.sub(r'(?<=\w)P$', 'p', new_name)
            if new_name != name:
                old_path = os.path.join(root, name)
                new_path = os.path.join(root, new_name)
                os.rename(old_path, new_path)

# Step 5: Lowercase all folders and subfolders under CATEGORY_A and CATEGORY_B
def lowercase_category_folders(base_path, category_folders=('CATEGORY_A', 'CATEGORY_B')):
    for category in category_folders:
        category_path = os.path.join(base_path, category)
        if not os.path.exists(category_path):
            print(f"⚠️ Category folder not found: {category_path}")
            continue
        for root, dirs, _ in os.walk(category_path, topdown=False):
            for name in dirs:
                old_path = os.path.join(root, name)
                new_name = name.lower()
                new_path = os.path.join(root, new_name)
                if name != new_name and not os.path.exists(new_path):
                    os.rename(old_path, new_path)

# Run all steps
replace_hyphens_with_spaces(dataset_path)
rename_all(dataset_path)
clean_names(dataset_path)
fix_disease_pest_labels(dataset_path)
lowercase_category_folders(os.path.join(dataset_path, 'Data'))

print("✅ All folder names have been cleaned.")

✅ All folder names have been cleaned.
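
Steps 1 to 4 above all transform a single name, so they can also be expressed as one pure function and previewed before anything is renamed. The sketch below is an alternative arrangement, not the notebook's method: `normalize_name` and `plan_renames` are hypothetical names, and Step 5 (lowercasing folders under `CATEGORY_A` and `CATEGORY_B`) is deliberately left out.

```python
import os
import re

def normalize_name(name, is_dir=False):
    """Apply the hyphen, space, underscore, and D/P suffix rules to one name."""
    name = name.replace("-", " ").replace(" ", "_")   # Steps 1 and 2
    name = re.sub(r"_+", "_", name)                    # Step 3
    stem, ext = (name, "") if is_dir else os.path.splitext(name)
    stem = re.sub(r"(?<=\w)D$", "d", stem)             # Step 4: disease suffix
    stem = re.sub(r"(?<=\w)P$", "p", stem)             # Step 4: pest suffix
    return stem + ext

def plan_renames(root_dir):
    """List (old_path, new_path) pairs without touching the filesystem."""
    planned = []
    for root, dirs, files in os.walk(root_dir, topdown=False):
        entries = [(f, False) for f in files] + [(d, True) for d in dirs]
        for name, is_dir in entries:
            new_name = normalize_name(name, is_dir)
            if new_name != name:
                planned.append((os.path.join(root, name),
                                os.path.join(root, new_name)))
    return planned
```

Printing the result of `plan_renames(dataset_path)` gives a dry run that can be reviewed before calling `os.rename` on each pair.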


### 3.1.2 Making sure the images are not corrupted

In [7]:
# def remove_duplicates_and_corrupt_images(root_dir):
#   hashes = set()
#   removed_duplicates = 0
#   removed_corrupt = 0

#   for folder, _, files in os.walk(root_dir):
#       for file in files:
#           file_path = os.path.join(folder, file)

#           try:
#               with Image.open(file_path) as img:
#                   img.verify()  # Check for corruption
#                   img_hash = hashlib.md5(open(file_path, 'rb').read()).hexdigest()

#                   if img_hash in hashes:
#                       os.remove(file_path)
#                       removed_duplicates += 1
#                       print(f"🗑️ Duplicate removed: {file_path}")
#                   else:
#                       hashes.add(img_hash)

#           except (UnidentifiedImageError, IOError, OSError) as e:
#               os.remove(file_path)
#               removed_corrupt += 1
#               print(f"🚫 Corrupt image removed: {file_path}")

#   print(f"\n✅ Cleanup complete.")
#   print(f"➕ Total duplicates removed: {removed_duplicates}")
#   print(f"❌ Total corrupt images removed: {removed_corrupt}")

# # Run it on your dataset
# remove_duplicates_and_corrupt_images(dataset_path)
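
The commented-out cell above deletes files as it scans. A more cautious variant, sketched here with only the standard library, groups byte-identical files by hash and reports them so that deletion can be decided separately. The names `file_md5` and `find_duplicate_groups` are hypothetical, and the corruption check with `Image.verify()` is not reproduced in this sketch.

```python
import hashlib
import os
from collections import defaultdict

def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_groups(root_dir):
    """Return lists of paths whose contents are byte-identical."""
    by_hash = defaultdict(list)
    for folder, _, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            by_hash[file_md5(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Reviewing the groups first is useful when the same image might legitimately appear in more than one part of the dataset; whether that is the case for TOM2024 would need to be checked.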

### 3.1.3 Converting images to a consistent format (`.jpg`) and a standard size (224x224 pixels)

In [8]:
def convert_and_resize_images(root_dir, size=(224, 224), format='JPEG'):
    converted = 0
    resized = 0
    skipped = 0

    for folder, _, files in os.walk(root_dir):
        for file in files:
            file_path = os.path.join(folder, file)

            try:
                with Image.open(file_path) as img:
                    img = img.convert('RGB')          # Ensure RGB format
                    img = img.resize(size)            # Resize to standard size

                    new_file_name = os.path.splitext(file)[0] + '.jpg'
                    new_path = os.path.join(folder, new_file_name)

                    # Save first so a failed write never costs us the original file
                    img.save(new_path, format)
                    resized += 1

                    if file_path != new_path:
                        os.remove(file_path)          # Delete original only after the .jpg copy exists
                        converted += 1

                    print(f"🖼️ Processed: {file_path} → {new_path}")

            except Exception as e:
                skipped += 1
                print(f"⚠️ Skipped (not an image or failed to process): {file_path}")

    print(f"\n✅ Completed image conversion and resizing:")
    print(f"🔁 Converted to .jpg: {converted}")
    print(f"📏 Resized: {resized}")
    print(f"⛔ Skipped files: {skipped}")

# Run on your dataset path
convert_and_resize_images(dataset_path)

Streaming output truncated to the last 5000 lines.
🖼️ Processed: /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1419430073_1660630959124.jpg → /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1419430073_1660630959124.jpg
🖼️ Processed: /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1306162624_1661441580759.jpg → /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1306162624_1661441580759.jpg
🖼️ Processed: /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/248694976_1660909174570.jpg → /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/248694976_1660909174570.jpg
🖼️ Processed: /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1414676279_1660458031924.jpg → /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/1414676279_1660458031924.jpg
🖼️ Processed: /content/dataset/Data/CATEGORY_A/maize_pests/Spodoptera_frugiperda_p/679520

### 3.1.4 Verifying class labels

In [38]:
def list_class_folders(dataset_path):
    class_folders = []
    for folder in os.listdir(dataset_path):
        folder_path = os.path.join(dataset_path, folder)
        if os.path.isdir(folder_path):
            class_folders.append(folder)
    print("📂 Found class folders:")
    for folder in sorted(class_folders):
        print(f" - {folder}")
    return class_folders

# Run this on the cleaned dataset
class_folders = list_class_folders('/content/dataset/Data')

📂 Found class folders:
 - CATEGORY_A
 - CATEGORY_B


In [42]:
def preview_images_per_class(base_path, num_classes=3, num_images=3):
    class_folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))]
    sampled_classes = random.sample(class_folders, min(num_classes, len(class_folders)))

    for class_folder in sampled_classes:
        class_path = os.path.join(base_path, class_folder)
        image_files = [f for f in os.listdir(class_path) if f.lower().endswith('.jpg')]
        sampled_images = random.sample(image_files, min(num_images, len(image_files)))

        print(f"\n📸 Class: {class_folder} — {len(image_files)} image(s) total")
        plt.figure(figsize=(12, 3))
        for idx, image_name in enumerate(sampled_images):
            image_path = os.path.join(class_path, image_name)
            img = mpimg.imread(image_path)
            plt.subplot(1, num_images, idx + 1)
            plt.imshow(img)
            plt.title(image_name)
            plt.axis('off')
        plt.show()

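# Point base_path at a folder whose subfolders are classes; a single class folder
# such as .../maize_diseases/virosis_d has no subfolders, so nothing would be shown
preview_images_per_class('/content/dataset/Data/CATEGORY_A/maize_diseases', num_classes=5, num_images=3)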

### 3.1.5 Removing blurry images

In [45]:
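def is_blurry(image_path, threshold=0.5):
    """Return True if the image's Laplacian variance falls below `threshold`."""
    try:
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        if image is None:
            return True  # Treat unreadable images as blurry or corrupted
        # Few sharp edges give a low variance of the Laplacian response
        laplacian_var = cv2.Laplacian(image, cv2.CV_64F).var()
        return laplacian_var < threshold
    except Exception:
        return True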

def remove_blurry_images(dataset_path, threshold=100.0):
    removed_count = 0
    for root, _, files in os.walk(dataset_path):
        for file in files:
            if file.lower().endswith('.jpg'):
                image_path = os.path.join(root, file)
                if is_blurry(image_path, threshold):
                    os.remove(image_path)
                    print(f"🗑️ Removed blurry image: {image_path}")
                    removed_count += 1
    print(f"\n✅ Done. Removed {removed_count} blurry or unreadable images.")

# Run this on your dataset
remove_blurry_images(dataset_path, threshold=0.5)


✅ Done. Removed 0 blurry or unreadable images.
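
To make the blur score more concrete, the dependency-free sketch below computes the variance of a 4-neighbour Laplacian over the interior pixels of a tiny grayscale grid. It is an approximation for illustration only: it ignores image borders, the function name `laplacian_variance` is hypothetical, and the kernel is the one OpenCV documents for its default aperture size, which is assumed here rather than verified against the installed version.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D list of pixel intensities (rows of numbers).
    """
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

# A uniform patch has no edges; a checkerboard has an edge at every pixel
flat = [[128] * 6 for _ in range(6)]
checker = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]

print("flat:", laplacian_variance(flat))
print("checkerboard:", laplacian_variance(checker))
```

Because only images with almost no edge content score near zero, a threshold of 0.5 is very permissive, which is consistent with the run above removing no images. Inspecting the distribution of scores across the dataset before fixing a threshold may give a more useful cut-off.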
