<a href="https://colab.research.google.com/github/EmmanuelKnows/ML-AppleDiseaseDetector/blob/main/ML_Apple_Detect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning - Apple Disease Detection
### Model development

This project implements a deep learning model using transfer learning with MobileNetV2 to detect diseases in apple fruits. The model is trained on a dataset sourced from Kaggle and utilizes MLflow for experiment tracking and model management.

### Project Overview

The goal of this project is to build a robust image classification model that can accurately identify different diseases affecting apples based on images. This can be valuable for agricultural applications, enabling early detection and management of diseases.


### Features

- **Transfer Learning:** Leverages the power of a pre-trained convolutional neural network (MobileNetV2) to benefit from learned features on a large dataset (ImageNet).
- **Data Augmentation:** Random shearing, zooming, and horizontal flipping are applied to the training images to increase the diversity of the dataset, improving the model's ability to generalize.
- **MLflow Integration:** Tracks various aspects of the machine learning lifecycle, including parameters, metrics, and artifacts (model weights, plots, reports). This facilitates experiment comparison and reproducibility.
- **Callbacks:** Utilizes Keras callbacks such as ModelCheckpoint (to save the best model), EarlyStopping (to prevent overfitting), and ReduceLROnPlateau (to adjust the learning rate during training).

### Dataset from Kaggle
#### Download the dataset with `kagglehub` (the code is commented out; uncomment it to run the download)

In [None]:
#import kagglehub
# link: https://www.kaggle.com/datasets/ateebnoone/fruits-dataset-for-fruit-disease-classification

# Download latest version
#path = kagglehub.dataset_download("ateebnoone/fruits-dataset-for-fruit-disease-classification")

#print("Path to dataset files:", path)
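
After downloading, it can help to confirm how many images each class folder contains before building the generators. The following is a minimal sketch using only the standard library. The helper name `count_images_per_class`, the extension list, and the idea that the dataset has one subfolder per class are assumptions for illustration, not something the dataset page guarantees.

```python
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}

def count_images_per_class(split_dir):
    """Return {class_name: image_count} for each class subfolder of split_dir."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts

# Hypothetical usage; the folder layout under the download path may differ:
# print(count_images_per_class("/content/drive/MyDrive/Colab Notebooks/Datasets/apple-fruit-ds/Train"))
```

A strongly imbalanced result would be a reason to read the per-class metrics later in the notebook more carefully than the overall accuracy.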

### Install the `mlflow` library.

In [5]:
pip install mlflow


### Import necessary libraries and dependencies

Import the libraries and dependencies used to build and train the model: TensorFlow, Keras, Matplotlib, NumPy, scikit-learn, seaborn, and MLflow.

In [6]:
# Import Libraries and Dependencies
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.applications import MobileNetV2 # Or VGG16, ResNet50, etc.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import mlflow
import mlflow.keras # Import mlflow.keras for autologging

### Mount Google Drive to access the dataset stored there.

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Image preprocessing and data augmentation

This cell uses `ImageDataGenerator` to set up training, validation, and test data generators from the dataset directories. It also prints the class names and the number of classes detected.

In [None]:
# 1. Data Acquisition (Getting my data images in folders: 'train/healthy', 'train/scab', etc.)


# 2. Image Preprocessing and Augmentation (using Keras ImageDataGenerator)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define image parameters
IMG_HEIGHT = 224
IMG_WIDTH = 224
BATCH_SIZE = 32

# Create ImageDataGenerators
# For training, include data augmentation and rescale pixel values
train_datagen = ImageDataGenerator(
    rescale=1./255,             # Normalize pixel values to [0, 1]
    shear_range=0.2,            # Shear transformations
    zoom_range=0.2,             # Random zoom
    horizontal_flip=True,       # Random horizontal flips
    validation_split=0.2        # Split a portion for validation
)

# For validation and testing, only rescale pixel values (no augmentation)
# The validation generator keeps the same validation_split so that it selects
# the held-out portion of the Train directory that the training subset excludes
val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
test_datagen = ImageDataGenerator(rescale=1./255)

# Load data from directories using flow_from_directory
train_generator = train_datagen.flow_from_directory(
    '/content/drive/MyDrive/Colab Notebooks/Datasets/apple-fruit-ds/Train',        # Path to the training directory
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='categorical',   # For multi-class classification (one-hot encoding)
    subset='training',          # Specify this is the training subset
    seed=42
)

validation_generator = val_datagen.flow_from_directory(
    '/content/drive/MyDrive/Colab Notebooks/Datasets/apple-fruit-ds/Train',        # Same path as training, but for validation subset
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    subset='validation',        # Specify this is the validation subset
    seed=42
)

test_generator = test_datagen.flow_from_directory(
    '/content/drive/MyDrive/Colab Notebooks/Datasets/apple-fruit-ds/Test',         # Path to the test directory
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False               # Important: Do not shuffle test data for consistent evaluation
)


# Get class names and number of classes
class_names = list(train_generator.class_indices.keys())
num_classes = len(class_names)
print(f"Class names: {class_names}")
print(f"Number of classes: {num_classes}")

# You can get the first batch and inspect shapes
# images, labels = next(train_generator)
# print(f"Image batch shape: {images.shape}")
# print(f"Labels batch shape: {labels.shape}")

# Now `train_generator`, `validation_generator`, `test_generator` are ready for model.fit()
# e.g., model.fit(train_generator, validation_data=validation_generator, epochs=..., steps_per_epoch=train_generator.samples // BATCH_SIZE)
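
The `steps_per_epoch=train_generator.samples // BATCH_SIZE` pattern in the comment above uses floor division, so any final partial batch is skipped in each epoch. The sketch below shows the arithmetic; the sample count of 1000 is a hypothetical value, not the size of this dataset, and `batches_per_epoch` is an illustrative helper name.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Return (full_batches, total_batches, leftover_samples)."""
    full_batches = num_samples // batch_size              # what floor division yields
    total_batches = math.ceil(num_samples / batch_size)   # includes a final partial batch
    leftover = num_samples - full_batches * batch_size    # samples in that partial batch
    return full_batches, total_batches, leftover

# Hypothetical example: 1000 images with the notebook's batch size of 32
full, total, leftover = batches_per_epoch(1000, 32)
print(f"full batches: {full}, total batches: {total}, leftover samples: {leftover}")
```

For training, skipping a few samples per epoch is usually harmless because the generator reshuffles between epochs. For evaluation, every sample should be covered, so the ceiling value (or no explicit step count at all) is the safer choice.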

### Setting up MLflow Tracking Server in Colab

**Start the Tracking Server:** Run the following command in a code cell. It starts the MLflow UI server in the background; by default the server listens on port 5000.


In [None]:
!mlflow ui &>/dev/null &
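
Because the server is started in the background with its output discarded, a failed start goes unnoticed. One way to check, sketched here with the standard library only, is to test whether anything accepts TCP connections on the port. The helper name `is_port_open` is an illustrative assumption.

```python
import socket

def is_port_open(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False

print("MLflow server reachable:", is_port_open())
```

The server may need a few seconds to start, so a `False` result immediately after launching it is not conclusive; rerunning the check shortly afterwards is reasonable.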

### Install the `ngrok` library for creating a tunnel to access the MLflow UI.

**Get the Ngrok Tunnel URL:** Since the server is running on `localhost` inside the Colab environment, you'll need a tunnel to access the UI from your browser. We'll use `ngrok` for this. First, install `ngrok`.

In [None]:
!pip install ngrok

### Authenticate ngrok using a token stored in Colab secrets.

**Authenticate ngrok (Optional but Recommended):** If you have an ngrok account, you can add your authtoken for more reliable connections. Store the token in the Colab Secrets Manager under the name `NGROK_AUTHTOKEN`; the cell below reads it from there.

In [None]:
# @title Store your ngrok authtoken securely
from google.colab import userdata

# Input your ngrok authtoken
# To get an authtoken, go to https://dashboard.ngrok.com/get-started/your-authtoken
# Use the Colab Secrets Manager to store your token securely
# the secret is named 'NGROK_AUTHTOKEN'
ngrok_authtoken = userdata.get('NGROK_AUTHTOKEN') # get the secret key from colab secret

# Or uncomment the line below and paste your token directly (less secure)
# ngrok_authtoken = "Your Authtoken"

# for authtoken stored in Colab secrets, use the line below
!ngrok config add-authtoken $ngrok_authtoken
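
As an alternative to pasting the token into the notebook, the lookup can fall back to an environment variable when the code runs outside Colab. This is a minimal sketch; the helper name `get_ngrok_token` and the use of an environment variable with the same name as the secret are assumptions, not part of the original notebook.

```python
import os

def get_ngrok_token(secret_name="NGROK_AUTHTOKEN"):
    """Return the token from Colab Secrets if available, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        pass
    return os.environ.get(secret_name)

token = get_ngrok_token()
print("Token found" if token else "No token found")
```

Printing only whether a token was found, rather than the token itself, keeps it out of the saved notebook output.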

**Install `pyngrok`:** The tunnel in the next section is opened from Python with `pyngrok`, a Python wrapper around ngrok, so install it first.

In [None]:
pip install pyngrok

### Create an ngrok tunnel to the MLflow server
This cell opens a tunnel to the MLflow server running on port 5000 and prints the public URL for accessing the MLflow UI.

In [None]:
from google.colab import userdata  # Imported here so this cell also runs on its own
from pyngrok import ngrok

# Terminate any existing ngrok tunnels
# ngrok.kill()

# Get the authtoken from Colab secrets
NGROK_AUTHTOKEN = userdata.get('NGROK_AUTHTOKEN')
if NGROK_AUTHTOKEN:
  ngrok.set_auth_token(NGROK_AUTHTOKEN)
else:
  print("Ngrok authtoken not found in Colab secrets. You may experience connection issues.")
  print("Add NGROK_AUTHTOKEN to Colab secrets (left panel, '🔑')")

# Open a ngrok tunnel to the MLflow server (port 5000)
print("Opening ngrok tunnel...")
public_url = ngrok.connect(5000).public_url
print(f"MLflow UI Tunnel URL: {public_url}")

# You can now access the MLflow UI at the URL printed above.
# Use this URL in your browser to view tracking information.

### Set the MLflow tracking URI to the public URL provided by ngrok and enable Keras autologging

**Set the MLflow Tracking URI:** Now that you have the public URL, set the MLflow tracking URI in your code to point to this URL.

In [None]:
# You can use the `public_url` variable from the previous cell
# Or replace with the actual URL printed by ngrok
mlflow.set_tracking_uri(public_url)
mlflow.set_experiment("Apple_Disease_Detection")

# Enable Keras autologging
mlflow.keras.autolog()

### Define and compile the deep learning model

This cell builds the model with transfer learning, using MobileNetV2 as the frozen base model and adding custom classification layers on top. It also prints the model summary.


In [None]:
# Assume these generators are already defined from your previous step:
# train_generator
# validation_generator
# test_generator

# Get the number of classes from your generators
num_classes = train_generator.num_classes
print(f"Number of classes detected: {num_classes}")
print(f"Class indices: {train_generator.class_indices}")

# Define input image dimensions (should match target_size in ImageDataGenerator)
IMG_HEIGHT = 224
IMG_WIDTH = 224
INPUT_SHAPE = (IMG_HEIGHT, IMG_WIDTH, 3) # 3 for RGB channels

# Load the pre-trained MobileNetV2 model
# include_top=False means we don't include the classification layers of MobileNetV2
# weights='imagenet' means it's pre-trained on the ImageNet dataset
base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=INPUT_SHAPE)

# Freeze the convolutional layers of the base model
# This prevents their weights from being updated during initial training
for layer in base_model.layers:
    layer.trainable = False

# Add custom classification layers on top of the base model
x = base_model.output
x = GlobalAveragePooling2D()(x) # Converts feature maps to a single vector per image
x = Dense(256, activation='relu')(x) # A fully connected layer
x = Dropout(0.5)(x) # Dropout for regularization to prevent overfitting
predictions = Dense(num_classes, activation='softmax')(x) # Output layer with softmax for multi-class classification

# Create the full model
model = Model(inputs=base_model.input, outputs=predictions)

# Compile the model
# Using Adam optimizer, categorical_crossentropy for multi-class classification, and accuracy as metric
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
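
Since the base model is frozen, the only trainable parameters reported by `model.summary()` come from the two `Dense` layers; `GlobalAveragePooling2D` and `Dropout` have none. The arithmetic can be sketched as follows. The value 1280 is the channel count of MobileNetV2's final feature map at the default width multiplier, and the class count of 4 is a hypothetical value used only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

FEATURE_CHANNELS = 1280   # MobileNetV2 output channels after global average pooling
HIDDEN_UNITS = 256        # matches Dense(256) in the cell above
num_classes_example = 4   # hypothetical; the notebook reads the real value from the generator

hidden_layer = dense_params(FEATURE_CHANNELS, HIDDEN_UNITS)
output_layer = dense_params(HIDDEN_UNITS, num_classes_example)

print(f"Dense(256) parameters:   {hidden_layer:,}")
print(f"Output layer parameters: {output_layer:,}")
print(f"Trainable parameters:    {hidden_layer + output_layer:,}")
```

Comparing this total with the "Trainable params" line of the summary is a quick way to confirm that the base model's layers are actually frozen.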

### Train the model

Training runs inside an MLflow run that logs parameters and uses callbacks for model checkpointing, early stopping, and learning rate reduction.

In [None]:
# --- Train the model within an MLflow run ---
EPOCHS = 50  # Upper bound; EarlyStopping may end training sooner

with mlflow.start_run(run_name="MobileNetV2_TransferLearning") as run:
    # Log custom parameters (optional, autologging handles many)
    mlflow.log_param("epochs", EPOCHS)
    mlflow.log_param("batch_size", BATCH_SIZE)
    mlflow.log_param("optimizer", "Adam")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("base_model", "MobileNetV2")

    # Define callbacks (ModelCheckpoint is good for saving best model locally)
    checkpoint_filepath = 'best_model.keras'
    model_checkpoint_callback = ModelCheckpoint(
        filepath=checkpoint_filepath, save_weights_only=False, monitor='val_accuracy',
        mode='max', save_best_only=True, verbose=1
    )
    early_stopping_callback = EarlyStopping(
        monitor='val_loss', patience=10, mode='min', verbose=1, restore_best_weights=True
    )
    reduce_lr_callback = ReduceLROnPlateau(
        monitor='val_loss', factor=0.2, patience=5, min_lr=0.00001, verbose=1
    )

    history = model.fit(
        train_generator,
        steps_per_epoch=train_generator.samples // train_generator.batch_size,
        epochs=EPOCHS,
        validation_data=validation_generator,
        validation_steps=validation_generator.samples // validation_generator.batch_size,
        callbacks=[model_checkpoint_callback, early_stopping_callback, reduce_lr_callback]
    )

    print("\nModel training complete.")
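
To see how the two patience settings above interact, the following pure-Python sketch replays a made-up sequence of validation losses. It is a simplified model of the behavior, not the Keras implementation: it ignores `min_delta` and cooldown, and the loss values are hypothetical.

```python
def simulate_callbacks(val_losses, lr=0.001, factor=0.2, lr_patience=5,
                       min_lr=0.00001, stop_patience=10):
    """Return (stopped_epoch or None, [(epoch, learning_rate), ...])."""
    best = float("inf")
    stop_wait = 0   # epochs without improvement, as counted for early stopping
    lr_wait = 0     # epochs without improvement since the last learning rate change
    schedule = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                lr_wait = 0
        schedule.append((epoch, lr))
        if stop_wait >= stop_patience:
            return epoch, schedule
    return None, schedule

# Hypothetical losses: three improving epochs, then a plateau
losses = [0.9, 0.7, 0.6] + [0.65] * 12
stopped_epoch, schedule = simulate_callbacks(losses)
print("Stopped at epoch:", stopped_epoch)
for epoch, lr in schedule:
    print(f"epoch {epoch:2d}  lr={lr:.6f}")
```

With these settings the learning rate is cut twice during the plateau before early stopping ends training, which is why the learning rate patience (5) is set lower than the early stopping patience (10).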



### Log the trained model and evaluate it on the test set

This cell logs the best model and the test metrics, generates the classification report and confusion matrix and logs them as artifacts, and logs plots of the training history to MLflow.

In [None]:
# Resume the training run so the model, test metrics, and plots are stored with the training metrics
with mlflow.start_run(run_id=run.info.run_id) as run:
    # Load the best model (saved by the checkpoint) once and reuse it for logging and evaluation
    loaded_best_model = tf.keras.models.load_model(checkpoint_filepath)
    mlflow.keras.log_model(
        loaded_best_model,
        artifact_path="apple_disease_model",
        registered_model_name="AppleDiseaseDetector" # This registers it in the Model Registry
    )

    # Log evaluation metrics from the test set
    print("\nEvaluating the model on the test set for MLflow logging...")
    # Omitting `steps` lets Keras iterate over every batch, including a final partial one
    eval_results = loaded_best_model.evaluate(test_generator, verbose=0)
    mlflow.log_metric("test_loss", eval_results[0])
    mlflow.log_metric("test_accuracy", eval_results[1])

    print(f"Logged Test Loss: {eval_results[0]:.4f}")
    print(f"Logged Test Accuracy: {eval_results[1]:.4f}")

    # Generate and log classification report and confusion matrix as artifacts
    test_generator.reset()
    class_labels = list(test_generator.class_indices.keys())
    # One prediction per test image; the order matches test_generator.classes because shuffle=False
    predictions = loaded_best_model.predict(test_generator)
    predicted_classes = np.argmax(predictions, axis=1)
    true_classes = test_generator.classes

    # Get the unique true labels and sort them to use as labels for confusion matrix and classification report
    unique_true_labels = sorted(np.unique(true_classes))
    # target_names must have exactly one entry per label passed in `labels`
    present_class_names = [class_labels[i] for i in unique_true_labels]

    report = classification_report(true_classes, predicted_classes, target_names=present_class_names, output_dict=True, labels=unique_true_labels)
    mlflow.log_dict(report, "classification_report.json")

    cm = confusion_matrix(true_classes, predicted_classes, labels=unique_true_labels)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[class_labels[i] for i in unique_true_labels], yticklabels=[class_labels[i] for i in unique_true_labels])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")
    plt.close() # Close plot to free memory

    # Log plots of training history as artifacts
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.tight_layout()
    plt.savefig("training_history.png")
    mlflow.log_artifact("training_history.png")
    plt.close() # Close plot

    print(f"MLflow Run ID: {run.info.run_id}")
    print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")
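
The per-class numbers in the classification report can be derived directly from the confusion matrix, which helps when reading the two artifacts side by side. The sketch below assumes the scikit-learn convention used above, where rows are true classes and columns are predicted classes. The 2x2 matrix is a hypothetical example, not a result from this model.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i that were predicted as class j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

# Hypothetical confusion matrix for two classes
example_cm = [
    [8, 2],   # true class 0: 8 correct, 2 predicted as class 1
    [1, 9],   # true class 1: 1 predicted as class 0, 9 correct
]
for index, m in enumerate(per_class_metrics(example_cm)):
    print(f"class {index}: precision={m['precision']:.3f} recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

For disease detection, a low recall on a diseased class means infected fruit is being missed, which may matter more than overall accuracy.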