# FarmFederate: Colab Notebook

This notebook walks you through running the FarmFederate pipeline in Colab: a safe preview (dry-run), dataset acquisition, dataset checks, and training. **No cell is executed automatically**; run the cells in order and provide credentials where needed.

## 1) Colab setup & dependencies ⚙️

This cell installs needed packages and shows common setup steps. Run it once after creating a new Colab runtime.


In [None]:
```bash
# Install core packages (use a single cell in Colab)
python -m pip install --upgrade pip
pip install -r requirements.txt || true
pip install kaggle datasets transformers accelerate --upgrade
# Optional: pillow-simd for faster image processing
# pip install -U pillow-simd
```
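
After the install cell finishes, it can help to confirm which versions actually landed in the runtime. The sketch below is one way to do that with the standard library's `importlib.metadata`. The helper name `installed_versions` is made up for illustration, and the package list simply mirrors the `pip install` line above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip install line in the setup cell
checked = installed_versions(["kaggle", "datasets", "transformers", "accelerate"])
for name, version in checked.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` suggests the install cell failed silently, which the `|| true` on the requirements line can hide.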


## 2) Credentials (Kaggle & HuggingFace)

- Upload your `kaggle.json` (from https://www.kaggle.com/me/account) to Colab and move it to `~/.kaggle/kaggle.json` with `chmod 600`.
- Set your HuggingFace token in the `HF_TOKEN` environment variable (recommended), or use the login helper below.


In [None]:
# Upload kaggle.json and set it up (one-time)
import os
from getpass import getpass
from google.colab import files

print('Upload kaggle.json when prompted (https://www.kaggle.com/me/account)')
uploaded = files.upload()
if 'kaggle.json' in uploaded:
    os.makedirs('/root/.kaggle', exist_ok=True)
    # Write the uploaded bytes, then restrict the file to the current user
    with open('/root/.kaggle/kaggle.json', 'wb') as f:
        f.write(uploaded['kaggle.json'])
    os.chmod('/root/.kaggle/kaggle.json', 0o600)
    print('kaggle.json saved to /root/.kaggle/kaggle.json')
else:
    print('kaggle.json not uploaded; you can still use HTTP/GitHub fallbacks')

# Set HF token interactively (recommended); getpass keeps it out of the cell output
try:
    hf = getpass('Paste your HuggingFace token (or leave blank to skip): ').strip()
    if hf:
        os.environ['HF_TOKEN'] = hf
        print('HF_TOKEN set in environment for this session')
except Exception as e:
    print('Skipping HF token setup:', e)
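
A quick sanity check on the saved file can catch the most common credential problems before any download is attempted. The following is a minimal sketch, not part of the FarmFederate scripts: `check_kaggle_json` is a hypothetical helper that assumes the usual `kaggle.json` layout (a JSON object with `username` and `key` fields) and the `0o600` permissions set above.

```python
import json
import os
import stat


def check_kaggle_json(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as f:
            creds = json.load(f)
    except (OSError, ValueError) as exc:
        return [f"could not read {path}: {exc}"]
    if not isinstance(creds, dict):
        return ["file does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    # Flag any access granted to group or other users
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    return problems


issues = check_kaggle_json(os.path.expanduser("~/.kaggle/kaggle.json"))
print(issues or "kaggle.json looks OK")
```

An empty list means the file parsed, both fields are present, and only the owner can read it. It does not prove the key is still valid on Kaggle's side.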


## 3) Verify tests & local utilities ✅

Quick smoke tests to ensure helper modules are importable and local tests pass.


In [None]:
# Run the unit tests with pytest (only the short unit tests in the repo, no long-running tests)
!pytest -q tests/test_dataset_utils.py || true

# Try importing dataset utilities to check for obvious import errors
try:
    from dataset_utils import build_text_label_dfs, count_images_in_dir
    print('dataset_utils imported successfully')
except Exception as e:
    print('Import failed:', e)


## 4) Dry-run (preview datasets) 🔍

Run the script with dry-run & report-only flags to inspect what would be downloaded/discovered. This is the recommended first step.


In [1]:
# Preview (no network actions if --dry-run is implemented in script)
!python FarmFederate_Kaggle_Complete.py --dry-run --report-only

# After running, inspect generated reports
import json, os
if os.path.exists('datasets_report.json'):
    print('datasets_report.json found; preview:')
    print(json.dumps(json.load(open('datasets_report.json')), indent=2)[:1000])
else:
    print('datasets_report.json not found; check script output or logs')

[*] Using device: cpu

CONFIGURATION
Labels: ['water_stress', 'nutrient_def', 'pest_risk', 'disease_risk', 'heat_stress']
Config: {
  "max_samples": 600,
  "train_split": 0.8,
  "batch_size": 8,
  "text_embed_dim": 256,
  "vision_embed_dim": 256,
  "hidden_dim": 256,
  "num_labels": 5,
  "epochs": 5,
  "learning_rate": 0.0002,
  "weight_decay": 0.01,
  "num_clients": 3,
  "fed_rounds": 3,
  "local_epochs": 2,
  "dirichlet_alpha": 0.5,
  "participation_rate": 0.8,
  "fusion_types": [
    "concat",
    "attention",
    "gated",
    "clip",
    "flamingo",
    "blip2",
    "coca",
    "unified_io"
  ],
  "seed": 42
}

DATA GENERATION
== Checking candidates for label: water_stress ==
Trying HTTP download: https://github.com/ashokpant/drought-detection
[DRY_RUN] Would download https://github.com/ashokpant/drought-detection -> drought
Trying HTTP download: https://github.com/ashokpant/drought-detection/archive/refs/heads/main.zip
[DRY_RUN] Would download https://github.com/ashokpant/drought-

## 5) Download datasets (Kaggle & HTTP fallbacks) ⬇️

If the dry-run preview looks good, run the download-only step. Make sure `~/.kaggle/kaggle.json` is present and that you have accepted the competition rules on Kaggle where required.


In [None]:
# Download (may take a long time depending on datasets and your connection)
!python FarmFederate_Kaggle_Complete.py --download-only

# After download, run the local scanner to verify
!python scripts/check_datasets.py || true

# Preview local report if created
import json, os
if os.path.exists('local_datasets_report.json'):
    print(json.dumps(json.load(open('local_datasets_report.json')), indent=2)[:1000])
else:
    print('local_datasets_report.json not found; check the check_datasets.py output')

## 6) Quick smoke training (small run) 🏃‍♂️

To test the end-to-end training quickly, set small sample/epoch parameters in the script or use environment overrides if available. This avoids long runs during initial validation.

**Example:** set an environment variable and run training (the script must honor this variable; otherwise, edit the config in the script directly):


In [None]:
# Example of running a quick training (modify script to accept QUICK_TRAIN or set config vars inside it)
# This example assumes the script respects QUICK_TRAIN env var to limit data and epochs.
import os
os.environ['QUICK_TRAIN'] = '1'
# Run a short test training; expect it to finish quickly (e.g., 1 epoch, few samples)
!python FarmFederate_Kaggle_Complete.py --train-quick || true
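
For reference, here is one way a script could honor `QUICK_TRAIN`. This is a hypothetical sketch, not the actual logic in `FarmFederate_Kaggle_Complete.py`: the function name `apply_quick_train` and the reduced values are assumptions, while the config keys are taken from the configuration printed by the dry-run.

```python
import os


def apply_quick_train(config, env=None):
    """Return a copy of config, shrunk for a smoke test when QUICK_TRAIN=1."""
    env = os.environ if env is None else env
    quick = dict(config)  # never mutate the caller's config
    if env.get("QUICK_TRAIN") == "1":
        # Illustrative values only; choose whatever keeps the run to a few minutes
        quick.update({"max_samples": 32, "epochs": 1,
                      "fed_rounds": 1, "local_epochs": 1})
    return quick


config = {"max_samples": 600, "epochs": 5, "fed_rounds": 3,
          "local_epochs": 2, "batch_size": 8}
print(apply_quick_train(config, env={"QUICK_TRAIN": "1"}))
```

Returning a copy keeps the full-size configuration intact, so the same process can fall back to a normal run without reloading anything.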

## 7) Inspect dataset report & visualize samples 📊

Load the `datasets_report.json` and show label counts and a few sample images per label.


In [None]:
import os, json
from pathlib import Path
import matplotlib.pyplot as plt
from PIL import Image

if not Path('datasets_report.json').exists():
    print('datasets_report.json not found; run the dry-run/report-only cell first')
else:
    with open('datasets_report.json') as f:
        report = json.load(f)
    print('Label summary:')
    for label, info in report.get('labels', {}).items():
        print(f"- {label}: {info.get('image_count','?')} images, texts: {info.get('text_count','?')}")
    # Show sample images for first label
    sample_label = next(iter(report.get('labels', {})), None)
    if sample_label:
        roots = report['labels'][sample_label].get('image_roots', [])
        if roots:
            img_files = list(Path(roots[0]).rglob('*.jpg'))[:9]
            if img_files:
                # squeeze=False keeps axs a 2-D array, so a single image also works
                fig, axs = plt.subplots(1, len(img_files), figsize=(3*len(img_files),3), squeeze=False)
                for ax, p in zip(axs[0], img_files):
                    ax.imshow(Image.open(p).convert('RGB'))
                    ax.axis('off')
                plt.show()
            else:
                print('No images found under', roots[0])


## 8) Troubleshooting & common fixes ⚠️

- Kaggle 401/403: ensure `kaggle.json` is correct and competitions are accepted via `https://www.kaggle.com/` (some datasets require acceptance).
- If a dataset fails on Kaggle, the script tries HTTP/GitHub fallbacks and will synthesize small datasets when absolutely necessary.
- Check `datasets_report.json` and `local_datasets_report.json` for detailed diagnostics.
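
When scanning the reports for diagnostics, labels that ended up with no images are usually the first thing to look for. Below is a minimal sketch that assumes the report layout used by the visualization cell (a top-level `labels` mapping whose entries carry `image_count`). Whether `local_datasets_report.json` follows the same layout is an assumption, and `labels_without_images` is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path


def labels_without_images(report):
    """Return labels whose entry has a missing or zero image_count."""
    labels = report.get("labels", {}) if isinstance(report, dict) else {}
    return [label for label, info in labels.items()
            if not (isinstance(info, dict) and info.get("image_count"))]


for name in ("datasets_report.json", "local_datasets_report.json"):
    path = Path(name)
    if path.exists():
        report = json.loads(path.read_text())
        print(name, "-> labels without images:", labels_without_images(report))
    else:
        print(name, "not found")
```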

If you hit a persistent issue, paste the relevant log output (the error lines) into a GitHub issue for the project.

## Final notes & next steps ✅

1. Start with the **dry-run** cell and inspect `datasets_report.json`.
2. If the dataset preview looks good, run **download-only**. Expect Kaggle prompts/acceptance for some datasets.
3. Run quick smoke training before committing to longer training runs. Adjust script config to small epochs / samples.

Good luck. Run the cells step by step, and if something fails, keep the error output so it can be used for debugging. 🔧

## Before you run (Important) ⚠️

This notebook will *only* run commands you execute manually. Follow these steps:

1. Run the **Credentials** cell and upload `kaggle.json` when prompted (or set `KAGGLE_USERNAME` / `KAGGLE_KEY` env vars). ✅
2. Paste your HuggingFace token when asked (optional but recommended). ✅
3. Run the **Dry-run** cell to preview dataset acquisition and inspect `datasets_report.json`. ✅
4. If the dry-run looks OK and you want to fetch *real* datasets, run the **Download datasets** cell. This will ask you to confirm `--allow-synthesis` usage (I recommend `NO` for now so only real datasets are used). ⬇️

_Note: nothing runs automatically; you must execute each cell manually._
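
If you take the environment-variable route from step 1 instead of uploading `kaggle.json`, something like the following would work. It is a sketch that assumes the Kaggle client picks up `KAGGLE_USERNAME` and `KAGGLE_KEY` from the environment. `set_kaggle_env` is a hypothetical helper, and the interactive prompts are left commented out so nothing asks for input until you enable them.

```python
import os


def set_kaggle_env(username, key, env=None):
    """Store Kaggle credentials in env vars; return True only if both were set."""
    env = os.environ if env is None else env
    username, key = username.strip(), key.strip()
    if not (username and key):
        return False
    env["KAGGLE_USERNAME"] = username
    env["KAGGLE_KEY"] = key
    return True


# Uncomment in Colab; getpass keeps the key out of the cell output
# from getpass import getpass
# ok = set_kaggle_env(input("Kaggle username: "), getpass("Kaggle API key: "))
# print("Kaggle env vars set" if ok else "Skipped: username or key was empty")
```

Values set this way last only for the current runtime session, so they need to be entered again after a restart.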

In [None]:
# Set whether synthesis is allowed for missing labels (default: NO)
allow = input('Allow synthesis for missing image labels? (y/N): ').strip().lower()
import os
if allow in ('y','yes'):
    os.environ['ALLOW_SYNTHESIS'] = '1'
    print('ALLOW_SYNTHESIS set to 1')
else:
    os.environ.pop('ALLOW_SYNTHESIS', None)
    print('ALLOW_SYNTHESIS unset (will require real datasets)')

## Dry-run (preview datasets) 🔍

Run this cell to preview what will be downloaded / found. It will create `datasets_report.json` but will not download large files.

In [None]:
# Dry-run: preview (no network actions if your script honors --dry-run)
!python FarmFederate_Kaggle_Complete.py --dry-run --report-only

# After running, inspect generated reports
import json, os
if os.path.exists('datasets_report.json'):
    print('\n=== datasets_report.json preview ===')
    print(json.dumps(json.load(open('datasets_report.json')), indent=2)[:1000])
else:
    print('\nNo datasets_report.json found; check the script output for errors')

## Download datasets (requires `kaggle.json` / accepted competitions) ⬇️

In [None]:
# Run this only AFTER you uploaded kaggle.json and reviewed the dry-run.
import json, os

confirm = input('Proceed with dataset download now? This will attempt Kaggle downloads. (yes/NO): ').strip().lower()
if confirm in ('y','yes'):
    # ALLOW_SYNTHESIS is set (or left unset) by the synthesis cell above
    allow = os.environ.get('ALLOW_SYNTHESIS','0')
    cmd = 'python FarmFederate_Kaggle_Complete.py --download-only'
    if allow == '1':
        cmd += ' --allow-synthesis'
    print('Running:', cmd)
    !{cmd}
    print('\nRunning the local scanner:')
    !python scripts/check_datasets.py || true
    if os.path.exists('local_datasets_report.json'):
        print(json.dumps(json.load(open('local_datasets_report.json')), indent=2)[:1000])
else:
    print('Download cancelled by user')

## Quick smoke training (short run) 🏃‍♂️

In [None]:
# Use minimal settings for a fast sanity check; edit parameters below as needed
import os

q = input('Run quick training now? (yes/NO): ').strip().lower()
if q in ('y','yes'):
    os.environ['QUICK_TRAIN'] = '1'
    print('Starting quick train (1 epoch, small sample sizes); this will take a few minutes')
    !python FarmFederate_Kaggle_Complete.py --train-quick || true
else:
    print('Quick training skipped')