# **Setup and Library Imports**

### **Connect to Google Drive**

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### **Logging into Hugging Face Hub**

In [5]:
# Uncomment to log in to the Hugging Face account (needed for push_to_hub=True)
# from huggingface_hub import notebook_login
# notebook_login()

### **Installing Required Packages**

In [6]:
! pip install --quiet "transformers[torch]"
! pip install --quiet evaluate
! pip install --quiet tabulate
! pip install --quiet ipywidgets
! pip install --quiet datasets
! pip install --quiet pillow
! pip install --quiet scikit-learn
! pip install --quiet tensorboard
! pip install --quiet openpyxl

### **Importing Libraries**

In [3]:
# PyTorch for tensor operations
import torch

# Hugging Face libraries for training and transformer models
from transformers import Trainer, TrainingArguments, TrainerCallback
from transformers import AutoImageProcessor, AutoModelForImageClassification
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup
# AdamW from PyTorch (the AdamW exported by transformers is deprecated)
from torch.optim import AdamW

# Evaluation metrics and utilities
import os
import evaluate
import numpy as np
from datetime import datetime

# Loading datasets for training and evaluation
from datasets import load_dataset

# Data manipulation and display utilities
import pandas as pd
from tabulate import tabulate
from collections import Counter

### **Defining Model, Dataset Paths, and Output Directories**

List of Models


```
microsoft/resnet-152
facebook/convnext-base-224
google/vit-base-patch16-224
google/vit-hybrid-base-bit-384
microsoft/swin-base-patch4-window7-224
facebook/deit-base-patch16-224
microsoft/beit-base-patch16-224
facebook/dinov2-base
```



List of Datasets


```
cvmil/rice-leaf-disease-augmented-v3
cvmil/rice-leaf-disease-augmented-v2
cvmil/rice-leaf-disease-augmented
cvmil/rice-leaf-disease-augmented-test
cvmil/rice-disease-02
```

Define paths for saving model training outputs and logs. The run name is built from the model name and the dataset name.

In [7]:
# Define model and dataset paths
model_path = "microsoft/resnet-152"
dataset_path = "cvmil/rice-leaf-disease-augmented-v3"
train_epochs = 20
resume_from_checkpoint = False

base_model_name = model_path.split("/")[-1]
dataset_name = dataset_path.split("/")[-1]

model_name = f"{base_model_name}_{dataset_name}_fft"
output_dir = f"./drive/Shareddrives/CS198-Drones/[v3] Training Output/{model_name}"

# Define directory for storing training logs
logging_dir = f"{output_dir}/logs"
metrics_dir = f"{output_dir}/training_metrics.xlsx"
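The run name above contains no date, so two runs with the same model and dataset write to the same directory. If runs should be distinguishable by date, a suffix can be appended. The following is a minimal sketch; `build_run_paths`, its `root` argument, and the `run_date` parameter are hypothetical names that do not appear in the notebook.

```python
from datetime import date

def build_run_paths(model_path, dataset_path, root, run_date=None):
    """Hypothetical helper: derive the run name and output paths, with an optional date suffix."""
    base_model_name = model_path.split("/")[-1]
    dataset_name = dataset_path.split("/")[-1]
    model_name = f"{base_model_name}_{dataset_name}_fft"
    if run_date is not None:
        # ISO format (YYYY-MM-DD) sorts chronologically in a file listing
        model_name = f"{model_name}_{run_date.isoformat()}"
    output_dir = f"{root}/{model_name}"
    return {
        "model_name": model_name,
        "output_dir": output_dir,
        "logging_dir": f"{output_dir}/logs",
        "metrics_file": f"{output_dir}/training_metrics.xlsx",
    }

# Example with an arbitrary root directory and date
paths = build_run_paths(
    "microsoft/resnet-152",
    "cvmil/rice-leaf-disease-augmented-v3",
    root="./runs",
    run_date=date(2025, 1, 2),
)
print(paths["model_name"])  # resnet-152_rice-leaf-disease-augmented-v3_fft_2025-01-02
```

Note that the run name is also used for the model card, so a date suffix would show up there as well.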

# **Data Preparation and Processing Pipeline**

This section handles the dataset loading, label extraction, image processing setup, and defines necessary functions for data transformation, batching, and metric computation to prepare the data for model training and evaluation.

### **Load Dataset and Extract Labels**

Load the dataset from the Hugging Face Hub and extract the class labels from the training split.

In [8]:
# Load the dataset
dataset = load_dataset(dataset_path)

# Extract class labels from the training set
labels = dataset['train'].features['label'].names

README.md:   0%|          | 0.00/813 [00:00<?, ?B/s]

train-00000-of-00004.parquet:   0%|          | 0.00/354M [00:00<?, ?B/s]

train-00001-of-00004.parquet:   0%|          | 0.00/558M [00:00<?, ?B/s]

train-00002-of-00004.parquet:   0%|          | 0.00/363M [00:00<?, ?B/s]

train-00003-of-00004.parquet:   0%|          | 0.00/328M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/55.4M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/54.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8192 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/307 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/303 [00:00<?, ? examples/s]

Generate and display a table showing class distribution across training and validation splits.

In [9]:
label_mapping = dataset['train'].features['label'].int2str

# Count the number of samples per class in each split
train_counts = Counter(dataset['train']['label'])
validation_counts = Counter(dataset['validation']['label'])

# Create a DataFrame for the class distribution
data = {
    'ID': list(range(len(labels))),
    'Label': labels,
    'Training': [train_counts[i] if i in train_counts else 0 for i in range(len(labels))],
    'Validation': [validation_counts[i] if i in validation_counts else 0 for i in range(len(labels))],
}

# Display the class distribution in a table format
df = pd.DataFrame(data)
print(tabulate(df, headers='keys', tablefmt='grid', showindex=False))

+------+------------------------+------------+--------------+
|   ID | Label                  |   Training |   Validation |
+======+========================+============+==============+
|    0 | Bacterial Leaf Blight  |       1024 |           34 |
+------+------------------------+------------+--------------+
|    1 | Brown Spot             |       1024 |           50 |
+------+------------------------+------------+--------------+
|    2 | Healthy Rice Leaf      |       1024 |           30 |
+------+------------------------+------------+--------------+
|    3 | Leaf Blast             |       1024 |           42 |
+------+------------------------+------------+--------------+
|    4 | Leaf scald             |       1024 |           35 |
+------+------------------------+------------+--------------+
|    5 | Narrow Brown Leaf Spot |       1024 |           22 |
+------+------------------------+------------+--------------+
|    6 | Rice Hispa             |       1024 |           41 |
+------+------------------------+------------+--------------+
|    7 |
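The counting step can be sketched in isolation. A `Counter` returns 0 for keys it has never seen, so the explicit membership check in the cell above is optional. The helper name `count_per_class` and the toy label list are illustrative and are not taken from the dataset.

```python
from collections import Counter

def count_per_class(label_ids, num_classes):
    """Return the number of samples for each class ID, including classes with no samples."""
    counts = Counter(label_ids)
    # Missing keys yield 0, so classes absent from a split are handled automatically
    return [counts[i] for i in range(num_classes)]

toy_labels = [0, 0, 2, 2, 2]  # toy label IDs for illustration only
print(count_per_class(toy_labels, num_classes=4))  # [2, 0, 3, 0]
```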

### **Initialize Image Processor**

Load and initialize the image processor from the pre-trained model.

In [10]:
# Load the image processor from the pre-trained model
processor = AutoImageProcessor.from_pretrained(model_path)
print(processor)

preprocessor_config.json:   0%|          | 0.00/266 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


ConvNextImageProcessor {
  "crop_pct": 0.875,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_processor_type": "ConvNextImageProcessor",
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 224
  }
}



### **Define Label Mappings and Preprocessing Functions**

Create mappings for label-to-ID and ID-to-label.

In [11]:
label2id = {c: idx for idx, c in enumerate(labels)}
id2label = {idx: c for idx, c in enumerate(labels)}

Define the transformation function to process the image batch.

In [12]:
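def transforms(batch):
    # Force three channels; the processor's mean/std assume RGB input
    batch['image'] = [x.convert('RGB') for x in batch['image']]
    # Resize, rescale, and normalize according to the processor configuration
    inputs = processor(batch['image'], return_tensors='pt')
    # Trainer expects the target column to be named 'labels'
    inputs['labels'] = batch['label']
    return inputs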

Define the custom collation function for batching pixel values and labels.

In [13]:
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }


Define the function to compute accuracy during evaluation.

In [14]:
accuracy = evaluate.load('accuracy')

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
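To make explicit what the metric function computes, here is a dependency-free sketch of the same argmax-then-compare logic. It is a re-implementation for illustration, not the code inside `evaluate`; the names `argmax` and `accuracy_from_logits` and the toy inputs are assumptions.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, references):
    """Fraction of samples whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# Three toy samples with three classes each; predictions are 0, 2, 1
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.1]]
toy_refs = [0, 2, 0]
print(accuracy_from_logits(toy_logits, toy_refs))  # {'accuracy': 0.6666666666666666}
```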

### **Apply Data Transformations to Dataset**

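Register the transformation function on the dataset. `with_transform` applies it lazily, each time samples are accessed, rather than preprocessing the whole dataset up front.

Note: This assumes that data augmentation was already handled in the earlier dataset-preparation pipeline, so the data is ready for fine-tuning. The image processor still applies its own resizing, rescaling, and normalization, as shown in the processor configuration printed earlier.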

In [15]:
processed_dataset = dataset.with_transform(transforms)

# **Model Initialization and Trainer Setup**

This section handles the initialization of the model, configuration of training parameters, and setting up the Trainer for fine-tuning, including the datasets, data processing, and evaluation metrics.

### **Initialize Pre-trained Model for Fine-tuning**

Load a pre-trained image classification model, configuring it with the correct label mappings and number of labels for the fine-tuning task.

In [16]:
# Load pre-trained model and configure it for fine-tuning
model = AutoModelForImageClassification.from_pretrained(
    model_path,                  # Path to the pre-trained model
    num_labels=len(labels),      # Set the number of labels for classification
    id2label=id2label,           # Map from ID to label
    label2id=label2id,           # Map from label to ID
    ignore_mismatched_sizes=True # Ignore size mismatches in weights
)

config.json:   0%|          | 0.00/69.6k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-152 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([8]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([8, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **Check Model Parameters for Fine-tuning**

Unfreeze all layers of the model for full fine-tuning.

In [17]:
for param in model.parameters():
    param.requires_grad = True

Count the total number of parameters in the model and how many of them will be trained.

In [18]:
num_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {num_params:,} | Trainable parameters: {trainable_params:,}")

Total parameters: 58,160,200 | Trainable parameters: 58,160,200
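The reported total can be split into the new classification head and everything else using only the shapes printed in the loading warning (`classifier.1.weight` is `[8, 2048]`, `classifier.1.bias` is `[8]`). This is plain arithmetic on the numbers shown in the notebook output; the variable names are illustrative.

```python
num_classes = 8            # from the shape of classifier.1.bias
hidden_size = 2048         # from the shape of classifier.1.weight
total_params = 58_160_200  # total reported by the cell above

# A linear layer has one weight per (class, feature) pair plus one bias per class
head_params = num_classes * hidden_size + num_classes
backbone_params = total_params - head_params

print(f"{head_params:,}")      # 16,392
print(f"{backbone_params:,}")  # 58,143,808
```

The newly initialized head is therefore a very small fraction of the trainable weights, which is one reason it is given its own, higher learning rate.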


### **Define Training Arguments**

Set separate learning rates for the model layers: a lower learning rate for fine-tuning the pre-trained weights, and a higher learning rate for the newly initialized classification layer. First, list the classifier parameters by name.

In [19]:
for param in model.named_parameters():
    if "classifier" in param[0]:
        print(param[0])

classifier.1.weight
classifier.1.bias


### **Create Optimizer with Per-Group Learning Rates**

In [20]:
# Define different learning rates
base_lr = 3e-5
classifier_lr = 3e-4
weight_decay = 0.1
warmup_ratio = 0.1

# Separate model parameters
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if "classifier" not in n],
        "lr": base_lr,
        "weight_decay": weight_decay
    },
    {
        "params": [p for n, p in model.named_parameters() if "classifier" in n],
        "lr": classifier_lr,
        "weight_decay": weight_decay
    },
]

# Define optimizer with different learning rates
optimizer = AdamW(optimizer_grouped_parameters)
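The name-based split can be checked without loading the model. The sketch below partitions a list of parameter names with the same `"classifier" in name` test and confirms that every name lands in exactly one group. The helper `split_by_name` is hypothetical, and the first two names in the toy list are made up for illustration; only the two `classifier.1.*` names come from the notebook output.

```python
def split_by_name(names, keyword="classifier"):
    """Partition parameter names into (backbone, head) by substring match."""
    backbone = [n for n in names if keyword not in n]
    head = [n for n in names if keyword in n]
    return backbone, head

toy_names = [
    "backbone.layer0.weight",  # illustrative name, not from the model
    "backbone.layer0.bias",    # illustrative name, not from the model
    "classifier.1.weight",
    "classifier.1.bias",
]

backbone, head = split_by_name(toy_names)
# The two groups must be disjoint and together cover every parameter
assert len(backbone) + len(head) == len(toy_names)
print(head)  # ['classifier.1.weight', 'classifier.1.bias']
```

A substring test like this would also catch any backbone parameter whose name happens to contain the keyword, so printing the matched names, as done in the earlier cell, is a useful check.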



Set up the training configuration with parameters such as batch size, number of epochs, learning rate, and logging strategies for the fine-tuning process.

In [21]:
training_args = TrainingArguments(
    num_train_epochs=train_epochs,              # Number of training epochs
    per_device_train_batch_size=64,             # Batch size for training
    per_device_eval_batch_size=64,              # Batch size for evaluation

    fp16=True,                                  # Use mixed precision training
    warmup_ratio=warmup_ratio,                  # Warmup ratio for learning rate scheduler
    weight_decay=weight_decay,                  # Weight decay for regularization
    lr_scheduler_type='cosine_with_restarts',   # Learning rate scheduler type
    lr_scheduler_kwargs = { "num_cycles": 4 },  # Number of cycles for learning rate scheduler

    save_total_limit=3,                         # Limit the number of saved models
    report_to=['tensorboard'],                  # Log to TensorBoard
    save_strategy="epoch",                      # Save strategy
    eval_strategy="epoch",                      # Evaluation strategy
    logging_strategy="epoch",                   # Logging strategy
    logging_dir=logging_dir,                    # Directory for logging
    output_dir=output_dir,                      # Directory for saving outputs

    remove_unused_columns=False,                # Retain unused columns in the dataset
    load_best_model_at_end=True,                # Load best model at the end of training
    metric_for_best_model="eval_loss",          # Specify the metric to track
    greater_is_better=False,                    # For loss, lower is better
    push_to_hub=True,                           # Push model to Hugging Face Hub
)
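To get a feel for what `warmup_ratio=0.1` and `num_cycles=4` mean in steps, the schedule can be sketched in plain Python. This is a re-implementation for illustration, written to mirror the cosine-with-hard-restarts multiplier as I understand it. It assumes a single device and no gradient accumulation, so that one optimizer step equals one batch of 64. The function name `cosine_hard_restarts_multiplier` is hypothetical.

```python
import math

def cosine_hard_restarts_multiplier(step, warmup_steps, total_steps, num_cycles):
    """Learning-rate multiplier in [0, 1]: linear warmup, then cosine decay with hard restarts."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # The modulo restarts the cosine from its peak at the start of every cycle
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

train_samples, batch_size, epochs = 8192, 64, 20
steps_per_epoch = math.ceil(train_samples / batch_size)  # 128
total_steps = steps_per_epoch * epochs                    # 2560
warmup_steps = math.ceil(total_steps * 0.1)               # 256

base_lr, classifier_lr = 3e-5, 3e-4
for step in (0, 256, 544, 832, 2560):
    m = cosine_hard_restarts_multiplier(step, warmup_steps, total_steps, num_cycles=4)
    # Both parameter groups are scaled by the same multiplier
    print(f"step {step:4d}: backbone lr {base_lr * m:.2e}, classifier lr {classifier_lr * m:.2e}")
```

Under these assumptions, warmup covers the first two epochs and each of the four cosine cycles spans 576 steps (4.5 epochs), with the learning rate jumping back to its peak at the start of each cycle.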

### **Trainer Callback**

In [22]:
class CustomSaveCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        previous_logs = state.log_history[-2:]
        new_logs = {k: v for log in previous_logs for k, v in log.items()}

        new_logs["timestamp"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        # Add GPU VRAM usage details (in MB)
        if torch.cuda.is_available():
            new_logs["gpu_vram_allocated_mb"] = torch.cuda.memory_allocated() / (1024 ** 2)
            new_logs["gpu_vram_reserved_mb"] = torch.cuda.memory_reserved() / (1024 ** 2)
        else:
            new_logs["gpu_vram_allocated_mb"] = None
            new_logs["gpu_vram_reserved_mb"] = None

        # Read the existing Excel file, if it exists
        if os.path.exists(metrics_dir):
            try:
                df_existing = pd.read_excel(metrics_dir)
            except Exception as e:
                print(f"Error reading {metrics_dir}: {e}")
                df_existing = pd.DataFrame()
        else:
            df_existing = pd.DataFrame()

        # Check if this epoch's record already exists; if yes, update it; otherwise, append.
        if not df_existing.empty and (df_existing["epoch"] == new_logs["epoch"]).any():
            df_existing.loc[df_existing["epoch"] == new_logs["epoch"], list(new_logs.keys())] = list(new_logs.values())
            df_to_save = df_existing
        else:
            df_new = pd.DataFrame([new_logs])
            df_to_save = pd.concat([df_existing, df_new], ignore_index=True)

        # Save the updated DataFrame back to Excel
        df_to_save.to_excel(metrics_dir, index=False)
        return control
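The callback does two things: it merges the last two log entries (the training log and the evaluation log for the same epoch) into one record, and it updates or appends that record by epoch. The sketch below shows that logic on plain dictionaries, without pandas or Excel. `merge_last_logs` and `upsert_by_epoch` are hypothetical helper names, and the toy log entries only imitate the shape of `state.log_history`, using the epoch 1 values from the training table.

```python
def merge_last_logs(log_history, n=2):
    """Combine the last n log entries into one record; later entries win on key clashes."""
    merged = {}
    for entry in log_history[-n:]:
        merged.update(entry)
    return merged

def upsert_by_epoch(rows, new_row):
    """Replace the row with the same epoch if present; otherwise append."""
    for i, row in enumerate(rows):
        if row.get("epoch") == new_row["epoch"]:
            rows[i] = {**row, **new_row}
            return rows
    rows.append(new_row)
    return rows

toy_history = [
    {"loss": 2.051, "epoch": 1.0, "step": 128},
    {"eval_loss": 1.977593, "eval_accuracy": 0.456026, "epoch": 1.0, "step": 128},
]

record = merge_last_logs(toy_history)
rows = upsert_by_epoch([], record)
rows = upsert_by_epoch(rows, record)  # a second evaluation of the same epoch updates in place
print(len(rows), sorted(rows[0]))
```

Keying on the epoch keeps the metrics file at one row per epoch even when a run is resumed and an epoch is evaluated again.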

### **Initialize Trainer**

Initialize the Trainer object with the model, training arguments, data collator, metrics computation, and datasets for training and evaluation.

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    optimizers=(optimizer, None),
    compute_metrics=compute_metrics,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    processing_class=processor,  # replaces the deprecated `tokenizer` argument
    callbacks=[CustomSaveCallback()]
)


### **Create Model Card**

In [24]:
trainer.create_model_card(
    language="en",
    license="MIT",
    tags=["image-classification", "fine-tuning"],
    model_name=model_name,
    finetuned_from=base_model_name,
    tasks=["image-classification"],
    dataset_tags=["image", "rice-leaf_disease"],
    dataset=dataset_name,
    dataset_args=["size: 224x224", "augmentation: true"],
)

# **Model Training and Evaluation**

### **Start Fine-tuning Process**

Initiates the fine-tuning of the model using the Trainer, applying the specified training configurations, such as the batch size, learning rate, and number of epochs. During training, the model will be evaluated at the end of each epoch on the validation dataset using the compute_metrics function, which calculates accuracy.

The model will undergo the following process during fine-tuning:
- **Training**: The model will be trained on the training dataset for the specified number of epochs.
- **Evaluation**: After each epoch, the model will be evaluated on the validation dataset, and accuracy will be computed using the compute_metrics function.
- **Metrics Logging**: The training progress and evaluation results will be logged to TensorBoard and can be monitored during training.

In [25]:
print(f"Training {model_name} on {dataset_name} dataset...")
train_results = trainer.train(resume_from_checkpoint=resume_from_checkpoint)

Training resnet-152_rice-leaf-disease-augmented-v3_fft on rice-leaf-disease-augmented-v3 dataset...


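| Epoch | Training Loss | Validation Loss | Accuracy |
|------:|--------------:|----------------:|---------:|
| 1 | 2.051 | 1.977593 | 0.456026 |
| 2 | 1.8042 | 1.485783 | 0.599349 |
| 3 | 1.129 | 0.813229 | 0.745928 |
| 4 | 0.6016 | 0.612886 | 0.811075 |
| 5 | 0.3757 | 0.548821 | 0.827362 |
| 6 | 0.2893 | 0.54626 | 0.846906 |
| 7 | 0.2625 | 0.531409 | 0.846906 |
| 8 | 0.1559 | 0.461918 | 0.86645 |
| 9 | 0.0772 | 0.505535 | 0.856678 |
| 10 | 0.0471 | 0.56594 | 0.856678 |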
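Because the training arguments select the best checkpoint by lowest `eval_loss`, it can help to see which of the logged epochs that would be. The sketch below uses only the ten epochs shown in the results above; the run is configured for 20 epochs, so later epochs (not shown here) could change the outcome.

```python
# (epoch, training loss, validation loss, accuracy) copied from the results above
rows = [
    (1, 2.051, 1.977593, 0.456026),
    (2, 1.8042, 1.485783, 0.599349),
    (3, 1.129, 0.813229, 0.745928),
    (4, 0.6016, 0.612886, 0.811075),
    (5, 0.3757, 0.548821, 0.827362),
    (6, 0.2893, 0.54626, 0.846906),
    (7, 0.2625, 0.531409, 0.846906),
    (8, 0.1559, 0.461918, 0.86645),
    (9, 0.0772, 0.505535, 0.856678),
    (10, 0.0471, 0.56594, 0.856678),
]

# Lowest validation loss wins, matching metric_for_best_model="eval_loss"
best = min(rows, key=lambda r: r[2])
print(best[0], best[2])  # 8 0.461918
```

After epoch 8 the training loss keeps falling while the validation loss rises, which is the usual sign that the model is starting to overfit.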


### **Save Model and Training State**

After the training process, the model and relevant training state are saved. This includes saving the model weights, training metrics, and the state of the trainer, ensuring that training progress can be restored if needed.

In [26]:
# Save the trained model
trainer.save_model()

# Log and save training metrics for later reference
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)

# Save the trainer state (step counters, log history, best-checkpoint info) to trainer_state.json
trainer.save_state()

events.out.tfevents.1741720359.5ed668b8bdc1.357.0:   0%|          | 0.00/16.8k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.50k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/233M [00:00<?, ?B/s]

***** train metrics *****
  epoch                    =         20.0
  total_flos               = 8015202562GF
  train_loss               =       0.3469
  train_runtime            =   2:19:16.14
  train_samples_per_second =       19.607
  train_steps_per_second   =        0.306
