I am using 2 Data Sources:
Plant Village: Accessed via TensorFlow Datasets (plant_village), with a local fallback in case of TFDS access issues. (https://www.tensorflow.org/datasets/catalog/plant_village)
Crop Recommendation: Sourced from Kaggle and stored locally as ./dataset/Crop_recommendation.csv.
(https://www.kaggle.com/datasets/siddharthss/crop-recommendation-dataset)

This code processes two datasets to provide agricultural recommendations. The Crop Recommendation Dataset contains static soil and environmental data, including:

N: Soil Nitrogen content (kg/ha), vital for plant growth and leaf development.
P: Soil Phosphorus content (kg/ha), crucial for root development and energy transfer.
K: Soil Potassium content (kg/ha), important for water regulation and disease resistance.


The Plant Village Dataset contains images of plant disease states. 
Neither dataset includes timestamp data since they focus on static conditions or single-point-in-time observations.

The primary objective is to preprocess and cluster these datasets. This will help identify common disease patterns from Plant Village and group similar soil conditions from the Crop Recommendation data. The insights gained will then be used by a large language model (LLM) to generate practical advisory messages, such as pest control recommendations for specific crops like apples.

Step 1:
This section imports necessary Python libraries for data processing, machine learning, and visualization. It also configures TensorFlow logging to debug mode to help diagnose potential issues with dataset loading.

In [44]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import cv2
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import logging

logging.getLogger('tensorflow').setLevel(logging.DEBUG)

TFDS (TensorFlow Datasets): A library providing easy access to datasets like Plant Village, enabling online streaming without manual downloads, though it may fail due to network issues, requiring a mock data fallback.

Step 2: Define Severity Mapping for Plant Village
This section defines a dictionary (severity_map) that maps Plant Village disease labels to heuristic severity scores (0 to 1), where 0 represents no disease (healthy) and higher values indicate more severe diseases (e.g., scab=0.7, late_blight=0.8). This mapping is used to assign realistic severity scores to images, replacing the previous random approach, to improve clustering accuracy.

In [45]:
severity_map = {
    'healthy': 0.0,
    'early_blight': 0.3,
    'common_rust': 0.3,
    'leaf_spot': 0.4,
    'spider_mites': 0.4,
    'powdery_mildew': 0.4,
    'gray_leaf_spot': 0.4,
    'northern_leaf_blight': 0.4,
    'late_blight': 0.8,
    'scab': 0.7,
    'black_rot': 0.7,
    'bacterial_spot': 0.6,
    'target_spot': 0.6,
    'mosaic_virus': 0.7,
    'yellow_leaf_curl_virus': 0.7,
    'leaf_scorch': 0.6,
    'leaf_mold': 0.5,
    'septoria_leaf_spot': 0.5,
    'esca_(black_measles)': 0.6,
    'isariopsis_leaf_spot': 0.5,
    'cedar_apple_rust': 0.5,
    'apple_rust': 0.5,
    'blight': 0.8,
    'rust': 0.5,
    'anthracnose': 0.6,
    'verticillium_wilt': 0.6,
    'brown_spot': 0.5,
    'downy_mildew': 0.6,
    'phytophthora_infestans': 0.8,
}

Step 3: Configure TFDS Directory for Plant Village
This section sets up a custom directory (F:/Personal/PrecisionAgriculture/tfds_data) to store TensorFlow Datasets (TFDS) data for the Plant Village Dataset, ensuring write permissions on Windows. It also disables Google Cloud Storage to enforce local storage, which helps manage potential network issues.

In [46]:
tfds_dir = 'F:/Personal/PrecisionAgriculture/tfds_data'
os.makedirs(tfds_dir, exist_ok=True)
tfds.core.utils.gcs_utils._is_gcs_disabled = True

Step 4: Load Crop Recommendation Dataset
This section loads the Crop Recommendation Dataset from a CSV file (./dataset/Crop_recommendation.csv) into a pandas DataFrame. It checks if the file exists, raising an error if not, to ensure the dataset is available for processing.

In [47]:
csv_path = './dataset/Crop_recommendation.csv'
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"{csv_path} not found. Download from Kaggle.")
df_crop = pd.read_csv(csv_path)

Step 5: Verify Features for Crop Recommendation Dataset
This section ensures the Crop Recommendation Dataset contains the expected columns (N, P, K, temperature, humidity, ph, rainfall, label). If any are missing, it raises an error. It also adds a plot_id column (1 to 2200) for unique identification of each data entry.

In [48]:
expected_columns = ['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall', 'label']
if not all(col in df_crop.columns for col in expected_columns):
    raise ValueError("CSV missing expected columns.")
df_crop = df_crop[expected_columns]
df_crop['plot_id'] = range(1, len(df_crop) + 1)

Step 6: Handle Missing Values for Crop Recommendation Dataset
This section handles missing values in the numerical columns (N, P, K, ph, temperature, humidity, rainfall) of the Crop Recommendation Dataset by imputing them with the median value of each column, ensuring data integrity for clustering.

In [49]:
numerical_cols = ['N', 'P', 'K', 'ph', 'temperature', 'humidity', 'rainfall']
df_crop[numerical_cols] = df_crop[numerical_cols].fillna(df_crop[numerical_cols].median())

Step 7: Encode and Normalize Crop Recommendation Dataset
This section preprocesses the Crop Recommendation Dataset by:


One-hot encoding the label column (crop types, e.g., apple, maize) into binary columns (e.g., crop_type_apple, crop_type_maize) to enable numerical analysis.
Concatenating the encoded columns with the original DataFrame and dropping the label column.
Normalizing the numerical columns (N, P, K, ph, temperature, humidity, rainfall) to a [0,1] range using MinMaxScaler to prepare for DBSCAN clustering.

In [50]:
encoder = OneHotEncoder(sparse_output=False)
crop_encoded = encoder.fit_transform(df_crop[['label']])
df_crop_encoded = pd.DataFrame(crop_encoded, columns=encoder.get_feature_names_out())
df_crop = pd.concat([df_crop.drop('label', axis=1), df_crop_encoded], axis=1)

scaler = MinMaxScaler()
df_crop[numerical_cols] = scaler.fit_transform(df_crop[numerical_cols])

Step 8: Save Crop Recommendation Processed Data
This section saves the preprocessed Crop Recommendation DataFrame to a CSV file (crop_recommendation_processed.csv) and prints a sample of the processed data (first 5 rows) to verify the preprocessing steps, ensuring the data is ready for clustering.

In [51]:
df_crop.to_csv('crop_recommendation_processed.csv', index=False)
print("Crop Recommendation Processed Sample:")
print(df_crop.head())

Crop Recommendation Processed Sample:
          N         P      K  temperature  humidity        ph  rainfall  \
0  0.642857  0.264286  0.190     0.345886  0.790267  0.466264  0.656458   
1  0.607143  0.378571  0.180     0.371445  0.770633  0.549480  0.741675   
2  0.428571  0.357143  0.195     0.406854  0.793977  0.674219  0.875710   
3  0.528571  0.214286  0.175     0.506901  0.768751  0.540508  0.799905   
4  0.557143  0.264286  0.185     0.324378  0.785626  0.641291  0.871231   

   plot_id  label_apple  label_banana  ...  label_mango  label_mothbeans  \
0        1          0.0           0.0  ...          0.0              0.0   
1        2          0.0           0.0  ...          0.0              0.0   
2        3          0.0           0.0  ...          0.0              0.0   
3        4          0.0           0.0  ...          0.0              0.0   
4        5          0.0           0.0  ...          0.0              0.0   

   label_mungbean  label_muskmelon  label_orange  labe

Plant Village Dataset Preprocessing

The Plant Village Dataset is streamed online via TensorFlow Datasets (tfds.load), so images are not stored locally but accessed during runtime. If TFDS fails, mock data is used instead, simulating image features (e.g., HSV values, severity) without actual images.

Step 9: Define Function to Load Plant Village Dataset
This section defines a function (load_plant_village) to load the Plant Village Dataset via TensorFlow Datasets (TFDS). It attempts to stream the dataset online with 3 retries, using a timeout of 120 seconds. If TFDS fails, it generates mock data (1000 entries) with random crop types, disease labels, heuristic severity scores (via severity_map), and mock HSV values, ensuring data availability for preprocessing.

In [52]:
def load_plant_village():
    import cv2
    data_dir = 'F:/Personal/PrecisionAgriculture/tfds_data/plant_village'
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"{data_dir} not found. Ensure Plant Village images are downloaded and stored in this directory.")
    
    plant_data = []
    image_count = 0
    for folder in os.listdir(data_dir):
        if image_count >= 1000:
            break
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        crop_type = folder.split('___')[0]
        disease_label = folder.split('___')[1]
        for img_file in os.listdir(folder_path):
            if image_count >= 1000:
                break
            img_path = os.path.join(folder_path, img_file)
            img = cv2.imread(img_path)  # Loads image as uint8 (values in [0, 255])
            if img is None:
                continue
            img = cv2.resize(img, (224, 224))  # Still uint8
            img = img / 255.0  # Converts to float64 (values in [0, 1])
            img = img.astype(np.float32)
            hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # Fails because img is float64
            hsv_mean = np.mean(hsv_img, axis=(0, 1))
            severity = severity_map.get(disease_label, 0.5)
            plant_data.append({
                'image_id': f'img_{image_count+1}',
                'crop_type': crop_type,
                'disease_label': disease_label,
                'severity': severity,
                'hsv_h': hsv_mean[0],
                'hsv_s': hsv_mean[1],
                'hsv_v': hsv_mean[2]
            })
            image_count += 1
    df_plant = pd.DataFrame(plant_data)
    return df_plant, None

Step 10: Load and Process Plant Village Dataset
This section loads the Plant Village Dataset using the load_plant_village function. If TFDS data is available, it processes 1000 images by:

Resizing images to 224x224 pixels and normalizing pixel values to [0,1].
Computing heuristic severity using severity_map based on the disease label.
Calculating mean HSV color metrics (hsv_h, hsv_s, hsv_v).
Extracting crop type and disease label from metadata (e.g., Apple___Apple_scab).
Creating a DataFrame with the extracted features.
If mock data is used, the DataFrame is directly provided by the load_plant_village function.

In [53]:
ds, info = load_plant_village()

if isinstance(ds, pd.DataFrame):
    df_plant = ds
    label_map = None
else:
    label_map = info.features['label'].names
    def process_image(image, label):
        image = tf.image.resize(image, [224, 224])
        image = image / 255.0
        
        crop_type = label_map[label].split('___')[0]
        disease_label = label_map[label].split('___')[1]

        severity = severity_map.get(disease_label, 0.5)
        hsv_image = tf.image.rgb_to_hsv(image)
        hsv_mean = tf.reduce_mean(hsv_image, axis=[0, 1]).numpy()
        return {
            'severity': severity,
            'hsv_h': hsv_mean[0],
            'hsv_s': hsv_mean[1],
            'hsv_v': hsv_mean[2],
            'crop_type': crop_type,
            'disease_label': disease_label
        }

    plant_data = []
    for i, (image, label) in enumerate(ds.take(1000)):
        features = process_image(image, label)
        features['image_id'] = f'img_{i+1}'
        plant_data.append(features)
    df_plant = pd.DataFrame(plant_data)

Step 11: Preprocess Plant Village Dataset
This section completes the preprocessing of the Plant Village Dataset by:

Removing rows with missing values to ensure data quality.
Normalizing the numerical features (severity, hsv_h, hsv_s, hsv_v) to a [0,1] range using MinMaxScaler, preparing the data for DBSCAN clustering.

In [54]:
df_plant.dropna(inplace=True)
df_plant[['severity', 'hsv_h', 'hsv_s', 'hsv_v']] = scaler.fit_transform(df_plant[['severity', 'hsv_h', 'hsv_s', 'hsv_v']])

Step 12: Save Plant Village Processed Data
This section saves the preprocessed Plant Village DataFrame to a CSV file (plant_village_processed.csv) and prints a sample of the processed data (first 5 rows) to verify the preprocessing steps, ensuring the data is ready for clustering.

In [55]:
df_plant.to_csv('plant_village_processed.csv', index=False)
print("Plant Village Processed Sample:")
print(df_plant.head())

Plant Village Processed Sample:
  image_id     crop_type   disease_label  severity     hsv_h     hsv_s  \
0    img_1  Pepper__bell  Bacterial_spot       1.0  0.698441  0.160373   
1    img_2  Pepper__bell  Bacterial_spot       1.0  0.679488  0.215264   
2    img_3  Pepper__bell  Bacterial_spot       1.0  0.719451  0.054814   
3    img_4  Pepper__bell  Bacterial_spot       1.0  0.661150  0.447317   
4    img_5  Pepper__bell  Bacterial_spot       1.0  0.564769  0.444664   

      hsv_v  
0  0.340281  
1  0.166608  
2  0.412886  
3  0.379964  
4  0.115355  


Clustering identifies patterns in the data (e.g., high-severity disease clusters in Plant Village, nutrient-deficient plots in Crop Recommendation), enabling targeted LLM advisory messages (e.g., pest control for specific crops).

Step 13: Cluster Plant Village Dataset
This section applies DBSCAN clustering to the Plant Village Dataset:

Generate K-Distance Plot: Uses NearestNeighbors (k=5) to compute the distance to the 5th nearest neighbor for each point in the feature set (severity, hsv_h, hsv_s, hsv_v). The distances are sorted and plotted to identify the elbow point, which helps tune the eps parameter (set to 0.15). The plot is saved as plant_village_k_distance.png.
Apply DBSCAN Clustering: Clusters the data using DBSCAN with eps=0.15 and min_samples=5, assigning cluster labels (e.g., 0, 1, -1 for noise) to each point.
Evaluate Clusters: Prints the count of points in each cluster to assess the clustering outcome.

In [56]:
X_plant = df_plant[['severity', 'hsv_h', 'hsv_s', 'hsv_v']].values
neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(X_plant)
distances, _ = neigh.kneighbors(X_plant)
distances = np.sort(distances[:, 4])
plt.plot(distances)
plt.title('K-Distance Plot for Plant Village')
plt.savefig('plant_village_k_distance.png')
plt.close()

dbscan_plant = DBSCAN(eps=0.15, min_samples=5)
df_plant['cluster'] = dbscan_plant.fit_predict(X_plant)
print("Plant Village Cluster Counts:")
print(df_plant['cluster'].value_counts())

Plant Village Cluster Counts:
cluster
 0    988
-1     12
Name: count, dtype: int64


Step 14: Cluster Crop Recommendation Dataset
This section applies DBSCAN clustering to the Crop Recommendation Dataset:

Generate K-Distance Plot: Uses NearestNeighbors (k=5) to compute the distance to the 5th nearest neighbor for each point in the feature set (N, P, K, ph, temperature, humidity, rainfall). The distances are sorted and plotted to identify the elbow point, tuning the eps parameter (set to 0.3). The plot is saved as crop_recommendation_k_distance.png.
Apply DBSCAN Clustering: Clusters the data using DBSCAN with eps=0.3 and min_samples=5, assigning cluster labels to each point.
Evaluate Clusters: Prints the count of points in each cluster to assess the clustering outcome.

In [57]:
X_crop = df_crop[numerical_cols].values
neigh.fit(X_crop)
distances, _ = neigh.kneighbors(X_crop)
distances = np.sort(distances[:, 4])
plt.plot(distances)
plt.title('K-Distance Plot for Crop Recommendation')
plt.savefig('crop_recommendation_k_distance.png')
plt.close()

dbscan_crop = DBSCAN(eps=0.3, min_samples=5)
df_crop['cluster'] = dbscan_crop.fit_predict(X_crop)
print("Crop Recommendation Cluster Counts:")
print(df_crop['cluster'].value_counts())

Crop Recommendation Cluster Counts:
cluster
0    1900
2     200
1     100
Name: count, dtype: int64


Summarize clusters by computing the mean of key features (e.g., severity for Plant Village, N, ph, humidity for Crop Recommendation) per cluster, formatting as text for LLM prompts (e.g., "Apple cluster 0: severity 0.7").

Step 15: Prepare Summaries for LLM Prompt
This section prepares summaries of the clusters for LLM input:

Plant Village: Groups the data by crop_type and cluster, computes the mean severity for each group, and formats a summary string (e.g., "Apple cluster 0: severity 0.7").
Crop Recommendation: Derives the crop_type column from one-hot encoded columns, groups by crop_type and cluster, computes the mean of N, ph, and humidity, and formats a summary string (e.g., "apple cluster 1: N 0.42, pH 0.45, humidity 0.71").

In [58]:
plant_summary = df_plant.groupby(['crop_type', 'cluster']).agg({'severity': 'mean'}).reset_index()
plant_summary['summary'] = plant_summary.apply(
    lambda x: f"{x['crop_type']} cluster {x['cluster']}: severity {x['severity']:.2f}", axis=1
)

df_crop['crop_type'] = df_crop[encoder.get_feature_names_out()].idxmax(axis=1).str.replace('crop_type_', '')
crop_summary = df_crop.groupby(['crop_type', 'cluster']).agg({
    'N': 'mean', 'ph': 'mean', 'humidity': 'mean'
}).reset_index()
crop_summary['summary'] = crop_summary.apply(
    lambda x: f"{x['crop_type']} cluster {x['cluster']}: N {x['N']:.2f}, ph {x['ph']:.2f}, humidity {x['humidity']:.2f}", axis=1
)

Step 16: Generate LLM Prompt
This section generates a sample LLM prompt by combining the cluster summaries from both datasets, requesting a pest control recommendation for apple crops. The prompt is printed to verify the formatted output, which can be used with an actual LLM (e.g., OpenAI, xAI) for generating recommendations.

In [59]:
llm_prompt = f"""
Analyze plant disease and soil data for advisory messages:
- Plant Disease: {plant_summary['summary'].tolist()}
- Soil Data: {crop_summary['summary'].tolist()}
Generate a pest control recommendation for apple crops.
"""
print("Sample LLM Prompt:")
print(llm_prompt)

Sample LLM Prompt:

Analyze plant disease and soil data for advisory messages:
- Plant Disease: ['Pepper__bell cluster -1: severity 0.75', 'Pepper__bell cluster 0: severity 1.00']
- Soil Data: ['label_apple cluster 2: N 0.15, ph 0.38, humidity 0.91', 'label_banana cluster 0: N 0.72, ph 0.39, humidity 0.77', 'label_blackgram cluster 0: N 0.29, ph 0.56, humidity 0.59', 'label_chickpea cluster 1: N 0.29, ph 0.60, humidity 0.03', 'label_coconut cluster 0: N 0.16, ph 0.38, humidity 0.94', 'label_coffee cluster 0: N 0.72, ph 0.51, humidity 0.52', 'label_cotton cluster 0: N 0.84, ph 0.53, humidity 0.77', 'label_grapes cluster 2: N 0.17, ph 0.39, humidity 0.79', 'label_jute cluster 0: N 0.56, ph 0.50, humidity 0.76', 'label_kidneybeans cluster 0: N 0.15, ph 0.35, humidity 0.09', 'label_lentil cluster 0: N 0.13, ph 0.53, humidity 0.59', 'label_maize cluster 0: N 0.56, ph 0.43, humidity 0.59', 'label_mango cluster 0: N 0.14, ph 0.35, humidity 0.42', 'label_mothbeans cluster 0: N 0.15, ph 0.52, h