# Deforestation Detection for SDG 15 — Functional Jupyter Notebook

This notebook is **ready to run** once you supply Kaggle credentials or place a dataset in `data/`. It downloads a Kaggle dataset through the Kaggle API, preprocesses satellite image patches, trains an unsupervised convolutional autoencoder, clusters the encoded features with KMeans, and trains supervised baselines (Decision Tree, Logistic Regression, CNN). It reports evaluation metrics (accuracy, F1, MAE), plots visualizations, and saves models and predictions.

---

## How this notebook uses APIs to get the latest possible data
1. **Kaggle API**: if you supply your Kaggle credentials, the notebook downloads the chosen Kaggle dataset automatically, so each run fetches the files currently published for that dataset.

> **Important**: you must provide Kaggle credentials. Either:
> - Create a `~/.kaggle/kaggle.json` file with your `username` and `key` (recommended), or
> - Set environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY` in the runtime.

If you cannot or do not want to use the Kaggle API, you can manually upload / mount a dataset into a `data/` folder; the notebook will detect it and proceed.
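
The credential lookup described above can be checked before any download is attempted. The sketch below is a minimal, stdlib-only illustration: the helper name `kaggle_credentials_available` is hypothetical (it is not part of the Kaggle API), and it only tests whether one of the two credential sources is present, without validating it against Kaggle.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if either credential source appears to be present.

    Hypothetical helper: it checks presence only and never contacts Kaggle.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Source 1: both environment variables are set and non-empty
    has_env = bool(env.get('KAGGLE_USERNAME')) and bool(env.get('KAGGLE_KEY'))
    # Source 2: the kaggle.json file exists in the user's .kaggle folder
    has_file = (home / '.kaggle' / 'kaggle.json').is_file()
    return has_env or has_file
```

If the helper returns `False`, setting `KAGGLE_DATASET_SLUG = None` skips the download and lets the notebook use whatever is already in `data/`.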

---

In [None]:
# 0) Install & import dependencies
import os
import sys
import subprocess

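# Install libraries only if they are missing (pip runs inside the current interpreter)
import importlib
import importlib.util

def ensure_installed(module_name, pip_name=None):
    # find_spec looks for the package without importing it; importing kaggle
    # can fail for other reasons (for example absent credentials), which
    # should not trigger a reinstall
    if importlib.util.find_spec(module_name) is None:
        print(f'Installing {pip_name or module_name}...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pip_name or module_name])
        importlib.invalidate_caches()

ensure_installed('kaggle')
ensure_installed('tensorflow')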

# Essential imports
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, mean_absolute_error, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
import zipfile
import glob
import pandas as pd

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
import tensorflow as tf
tf.random.set_seed(SEED)

# 1) Download dataset from Kaggle (optional) — choose a dataset slug below

# The notebook includes a small curated list of Kaggle deforestation / landcover datasets discovered via web search.
# Pick one slug from `DATASET_CHOICES` or set `KAGGLE_DATASET_SLUG` to your preferred Kaggle dataset id (owner/dataset-name).

# Example dataset choices (you may need to accept dataset rules on Kaggle or be logged in):
# - 'akhilchibber/deforestation-detection-dataset'
# - 'konradb/deforestation-dataset'
# - 'zafish/deforestation'
# - 'mcagriaksoy/trees-in-satellite-imagery'

# If these change or you want another dataset, replace the slug with the exact Kaggle dataset slug.

In [None]:
DATA_DIR = 'data'
os.makedirs(DATA_DIR, exist_ok=True)

DATASET_CHOICES = [
    'akhilchibber/deforestation-detection-dataset',
    'konradb/deforestation-dataset',
    'zafish/deforestation',
    'mcagriaksoy/trees-in-satellite-imagery'
]

# Choose dataset slug to attempt to download. Set to None to skip download and use local `data/` folder.
KAGGLE_DATASET_SLUG = DATASET_CHOICES[0]

In [None]:
def download_kaggle_dataset(slug, dest='data'):
    """Download and unzip a Kaggle dataset using the Kaggle API.

    Requires ~/.kaggle/kaggle.json or the environment variables KAGGLE_USERNAME and KAGGLE_KEY.
    """
    try:
        from kaggle.api.kaggle_api_extended import KaggleApi
        api = KaggleApi()
        api.authenticate()
    except Exception as e:
        # Chain the original exception so the underlying cause stays visible
        raise RuntimeError('Kaggle API not available or authentication failed. Make sure you set ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY.') from e

    print(f'Downloading {slug} to {dest} (this may take a while)')
    api.dataset_download_files(slug, path=dest, unzip=True, quiet=False)
    print('Download complete')

# Attempt download
if KAGGLE_DATASET_SLUG:
    try:
        download_kaggle_dataset(KAGGLE_DATASET_SLUG, dest=DATA_DIR)
    except Exception as e:
        print('Failed to download Kaggle dataset:', e)
        print('Proceeding to look for any data already in data/ folder.')

# List files found
print('\nFiles under data/:')
for root, dirs, files in os.walk(DATA_DIR):
    print(root, len(files), 'files')
    # show only top-level directories
    if root == DATA_DIR:
        print('Subdirs:', dirs)

# 2) Dataset structure expectations & helper functions

# The notebook expects image files inside subfolders of `data/` such as data/forest/ and data/deforested/ or a CSV mapping file.
# We'll attempt to auto-detect common patterns. If your dataset uses different structure, adapt the loader accordingly.

In [None]:
IMG_SIZE = (128, 128)
BATCH_SIZE = 32
EPOCHS_AE = 20
EPOCHS_CLS = 15
N_CLUSTERS = 2

def find_class_dirs(data_dir):
    # find subdirectories that contain at least 10 image files (same threshold as the loader in step 3)
    classes = []
    for d in sorted(os.listdir(data_dir)):
        full = os.path.join(data_dir, d)
        if os.path.isdir(full):
            imgs = [f for f in os.listdir(full) if f.lower().endswith(('.png','.jpg','.jpeg','.tif','.tiff'))]
            if len(imgs) >= 10:
                classes.append(d)
    return classes

# Try to detect classes
detected = find_class_dirs(DATA_DIR)
print('Detected class directories:', detected)

# If fewer than 2 class folders were detected, look for CSV label files instead
if len(detected) < 2:
    print('Not enough class subfolders detected. Looking for CSV label files or nested folders...')
    # Search for CSV label files
    csvs = glob.glob(os.path.join(DATA_DIR, '**', '*.csv'), recursive=True)
    print('CSV files found:', csvs[:5])

# 3) Load image paths and labels (flexible loader)

# We'll support two simple cases:
# - Case A: `data/<class_name>/*.jpg` folders — straightforward binary/multi-class classification
# - Case B: a CSV file with columns `filepath,label` referencing images inside the data/ folder

# The loader below tries Case A first, then Case B.

In [None]:
from pathlib import Path

def load_image_paths_from_data_dir(data_dir):
    # Case A: directory-per-class structure
    classes = sorted([d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))])
    paths = []
    labels = []
    used_classes = []
    for cls in classes:
        cls_path = os.path.join(data_dir, cls)
        img_files = [os.path.join(cls_path, f) for f in os.listdir(cls_path) if f.lower().endswith(('.png','.jpg','.jpeg','.tif','.tiff'))]
        if len(img_files) >= 10:
            # Index by position in used_classes so labels stay contiguous (0..K-1)
            # even when folders with too few images are skipped
            label_idx = len(used_classes)
            used_classes.append(cls)
            paths.extend(img_files)
            labels.extend([label_idx] * len(img_files))
    if len(used_classes) >= 2:
        return paths, np.array(labels), used_classes
    # Case B: CSV mapping
    csv_candidates = glob.glob(os.path.join(data_dir, '**', '*.csv'), recursive=True)
    for csvfile in csv_candidates:
        try:
            df = pd.read_csv(csvfile)
            if 'label' in df.columns and 'filepath' in df.columns:
                # assume filepaths are relative to data_dir
                paths = [os.path.join(data_dir, str(p)) if not os.path.isabs(str(p)) else str(p) for p in df['filepath']]
                labels = df['label'].values
                classes = sorted(list(pd.unique(labels)))
                # map labels to integers
                label_to_idx = {lab:i for i,lab in enumerate(classes)}
                labels_idx = np.array([label_to_idx[l] for l in labels])
                return paths, labels_idx, classes
        except Exception:
            continue
    # Case C: if none found, search recursively for images and create dummy labels (unknown)
    img_files = glob.glob(os.path.join(data_dir, '**', '*.*'), recursive=True)
    img_files = [f for f in img_files if f.lower().endswith(('.png','.jpg','.jpeg','.tif','.tiff'))]
    if len(img_files) > 0:
        print('Found images but not class folders or CSV — placing them into a single class')
        paths = img_files
        labels = np.zeros(len(img_files), dtype=int)
        classes = ['images']
        return paths, labels, classes
    return [], np.array([]), []

paths, labels, classes = load_image_paths_from_data_dir(DATA_DIR)
print('Loaded classes:', classes)
print('Total images found:', len(paths))

if len(paths) == 0:
    raise RuntimeError('No images found under data/. Please provide a dataset or disable Kaggle download and upload images.')
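
# Sketch of the Case B layout described above, using only the standard library
# so it runs without any image files. The helper name `parse_label_csv` and the
# example rows are hypothetical; the notebook itself reads the CSV with pandas.

```python
import csv
import io
import os


def parse_label_csv(csv_text, data_dir='data'):
    """Turn `filepath,label` rows into (paths, integer labels, class names)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort class names so the label-to-index mapping is deterministic
    class_names = sorted({row['label'] for row in rows})
    label_to_idx = {name: i for i, name in enumerate(class_names)}
    # File paths in the CSV are treated as relative to data_dir
    paths = [os.path.join(data_dir, row['filepath']) for row in rows]
    labels = [label_to_idx[row['label']] for row in rows]
    return paths, labels, class_names


example_csv = """filepath,label
tiles/a.png,forest
tiles/b.png,deforested
tiles/c.png,forest
"""
paths_demo, labels_demo, classes_demo = parse_label_csv(example_csv)
print(classes_demo)  # ['deforested', 'forest']
print(labels_demo)   # [1, 0, 1]
```

# Because class names are sorted, 'deforested' becomes class 0 and 'forest' class 1
# here; check the printed class list before interpreting any metric.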

# 4) Preprocess: load images into arrays (careful with memory)

# For large datasets, the code below can be replaced with a generator or tf.data pipeline. For clarity and reproducibility we load into memory when size allows.

In [None]:
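MAX_LOAD = None  # set to limit images for quick runs (e.g., 2000)
if MAX_LOAD and len(paths) > MAX_LOAD:
    # paths are grouped by class, so sample at random instead of truncating;
    # taking the first MAX_LOAD items could keep only one class
    keep = np.random.RandomState(SEED).permutation(len(paths))[:MAX_LOAD]
    paths = [paths[i] for i in keep]
    labels = labels[keep]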

def load_and_preprocess(img_path, target_size=IMG_SIZE):
    try:
        img = image.load_img(img_path, target_size=target_size)
    except Exception:
        # sometimes images are corrupted; return a black image instead
        arr = np.zeros((*target_size, 3), dtype=np.float32)
        return arr
    arr = image.img_to_array(img).astype('float32') / 255.0
    # Ensure 3 channels
    if arr.shape[-1] == 1:
        arr = np.repeat(arr, 3, axis=-1)
    if arr.shape[-1] > 3:
        arr = arr[..., :3]
    return arr

print('Loading images into memory (this may take a while)')
X = np.array([load_and_preprocess(p) for p in paths])
y = labels
print('X shape:', X.shape, 'y shape:', y.shape)
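
# Sketch of the generator alternative mentioned at the top of this step, for
# datasets that do not fit in memory. `batch_generator` and `fake_loader` are
# hypothetical names; in the notebook the loader role would be played by
# load_and_preprocess. This is a plain Python generator, not a tf.data pipeline.

```python
import numpy as np


def batch_generator(paths, labels, load_fn, batch_size=32):
    """Yield (X_batch, y_batch) pairs, loading one batch of images at a time."""
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        X_batch = np.stack([load_fn(p) for p in batch_paths])
        y_batch = np.asarray(labels[start:start + batch_size])
        yield X_batch, y_batch


def fake_loader(path, target_size=(128, 128)):
    """Stand-in loader so the sketch runs without image files."""
    return np.zeros((*target_size, 3), dtype=np.float32)


demo_paths = [f'img_{i}.png' for i in range(70)]
demo_labels = [i % 2 for i in range(70)]
batch_shapes = [xb.shape for xb, _ in batch_generator(demo_paths, demo_labels, fake_loader)]
print(batch_shapes)  # [(32, 128, 128, 3), (32, 128, 128, 3), (6, 128, 128, 3)]
```

# Only one batch is held in memory at a time, at the cost of re-reading files on
# every pass over the data.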

# 5) Split dataset

In [None]:
if len(np.unique(y)) > 1:
    X_train, X_test, y_train, y_test, paths_train, paths_test = train_test_split(X, y, paths, test_size=0.2, random_state=SEED, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=SEED, stratify=y_train)
else:
    # unsupervised only scenario
    X_train, X_test, y_train, y_test, paths_train, paths_test = train_test_split(X, y, paths, test_size=0.2, random_state=SEED)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=SEED)

print('Train:', X_train.shape, 'Val:', X_val.shape, 'Test:', X_test.shape)
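
# Worked arithmetic for the two-stage split above: 20% is held out for testing,
# then 15% of the remaining 80% becomes validation, giving roughly 68% / 12% / 20%.
# `split_sizes` is a hypothetical helper; it assumes scikit-learn rounds each
# held-out part up when test_size is a fraction.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.15):
    """Return (train, val, test) counts for the two-stage split."""
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)
    return n_remaining - n_val, n_val, n_test


print(split_sizes(1000))  # (680, 120, 200)
```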

# 6) Augmentation and generators

In [None]:
train_datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True, zoom_range=0.1)
val_datagen = ImageDataGenerator()
train_generator = train_datagen.flow(X_train, y_train, batch_size=BATCH_SIZE, shuffle=True, seed=SEED)
val_generator = val_datagen.flow(X_val, y_val, batch_size=BATCH_SIZE, shuffle=False)

# 7) Unsupervised: Convolutional Autoencoder to learn representations

In [None]:
input_shape = IMG_SIZE + (3,)

def build_conv_autoencoder(input_shape, latent_dim=128):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(inputs)
    x = layers.MaxPooling2D(2, padding='same')(x)
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2, padding='same')(x)
    x = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2, padding='same')(x)
    shape_before_flatten = tuple(x.shape[1:])  # feature-map shape after three poolings
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, name='latent')(x)

    # Decoder
    x = layers.Dense(int(np.prod(shape_before_flatten)))(latent)  # plain int for the unit count
    x = layers.Reshape(shape_before_flatten)(x)
    x = layers.Conv2DTranspose(128, 3, strides=1, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2DTranspose(64, 3, strides=1, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2DTranspose(32, 3, strides=1, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    outputs = layers.Conv2D(3, 3, activation='sigmoid', padding='same')(x)

    autoencoder = models.Model(inputs, outputs, name='conv_autoencoder')
    encoder = models.Model(inputs, latent, name='encoder')
    return autoencoder, encoder

print('Building autoencoder...')
autoencoder, encoder = build_conv_autoencoder(input_shape, latent_dim=128)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

# Train autoencoder (unsupervised) — use X_train images (labels not required)
history_ae = autoencoder.fit(X_train, X_train, epochs=EPOCHS_AE, batch_size=BATCH_SIZE, validation_data=(X_val, X_val))
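
# Sketch of the shape arithmetic behind the encoder above, in plain Python (no
# TensorFlow needed). `encoder_shapes` is a hypothetical helper that mirrors the
# three Conv2D + MaxPooling2D stages: 'same' convolutions keep the spatial size,
# and each pooling with padding='same' halves it, rounding up.

```python
import math


def encoder_shapes(img_size=(128, 128), filters=(32, 64, 128), pool=2):
    """Trace the feature-map shape after each conv + pooling stage."""
    h, w = img_size
    shapes = []
    for f in filters:
        h, w = math.ceil(h / pool), math.ceil(w / pool)
        shapes.append((h, w, f))
    return shapes


demo_shapes = encoder_shapes()
flat_units = math.prod(demo_shapes[-1])   # units entering the latent Dense layer
input_values = 128 * 128 * 3              # values in one input patch
print(demo_shapes)         # [(64, 64, 32), (32, 32, 64), (16, 16, 128)]
print(flat_units)          # 32768
print(input_values / 128)  # 384.0 input values per latent unit
```

# The decoder upsamples by 2 three times, so the reconstruction matches the input
# only when both sides of IMG_SIZE are divisible by 8. For example, 100 pixels
# would shrink to 13 and come back as 104.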

# Visualize some reconstructions

In [None]:
n = min(5, X_test.shape[0])
recon = autoencoder.predict(X_test[:n])
plt.figure(figsize=(12, 4))
for i in range(n):
    plt.subplot(2, n, i+1)
    plt.imshow(X_test[i])
    plt.axis('off')
    plt.subplot(2, n, n+i+1)
    plt.imshow(recon[i])
    plt.axis('off')
plt.suptitle('Top: original, Bottom: reconstruction')
plt.show()

# 8) Clustering encoded features (KMeans)

In [None]:
features_train = encoder.predict(X_train)
features_test = encoder.predict(X_test)

kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=SEED)  # pin n_init; its default differs across scikit-learn versions
cluster_labels_train = kmeans.fit_predict(features_train)
cluster_labels_test = kmeans.predict(features_test)

# Map clusters to labels if labeled data available
if len(np.unique(y)) > 1:
    from collections import Counter
    cluster_to_label = {}
    for c in range(N_CLUSTERS):
        idxs = np.where(cluster_labels_train == c)[0]
        if len(idxs) == 0:
            cluster_to_label[c] = 0
        else:
            most_common = Counter(y_train[idxs]).most_common(1)[0][0]
            cluster_to_label[c] = most_common
    pred_from_cluster = np.array([cluster_to_label[c] for c in cluster_labels_test])
    print('Clustering accuracy (mapped):', accuracy_score(y_test, pred_from_cluster))
    print('Clustering F1 (mapped):', f1_score(y_test, pred_from_cluster, average='weighted'))
else:
    print('Unsupervised-only dataset — clusters created but no labels to compare.')
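
# Toy illustration of the majority-vote mapping used above, with made-up cluster
# ids and labels so it runs without a trained encoder. `map_clusters_to_labels`
# is a hypothetical helper that follows the same logic as the loop in this cell.

```python
from collections import Counter

import numpy as np


def map_clusters_to_labels(cluster_ids, true_labels, n_clusters, default=0):
    """Assign each cluster the most common true label among its members."""
    mapping = {}
    for c in range(n_clusters):
        members = true_labels[cluster_ids == c]
        if len(members) == 0:
            mapping[c] = default  # empty cluster: fall back to a default label
        else:
            mapping[c] = Counter(members.tolist()).most_common(1)[0][0]
    return mapping


toy_clusters = np.array([0, 0, 0, 1, 1, 1, 1])
toy_labels = np.array([1, 1, 0, 0, 0, 0, 1])
toy_mapping = map_clusters_to_labels(toy_clusters, toy_labels, n_clusters=2)
toy_pred = np.array([toy_mapping[c] for c in toy_clusters])
print(toy_mapping)                           # {0: 1, 1: 0}
print(int((toy_pred == toy_labels).sum()))   # 5 of 7 samples match
```

# If one class dominates every cluster, all clusters map to that class and the
# mapped accuracy simply equals the majority-class share.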

# 9) Supervised baseline: Decision Tree using encoded features (if labels exist)

In [None]:
if len(np.unique(y)) > 1:
    clf_dt = DecisionTreeClassifier(random_state=SEED)
    clf_dt.fit(features_train, y_train)
    pred_dt = clf_dt.predict(features_test)
    print('Decision Tree accuracy:', accuracy_score(y_test, pred_dt))
    print('Decision Tree F1:', f1_score(y_test, pred_dt, average='weighted'))
    print(classification_report(y_test, pred_dt, target_names=[str(c) for c in classes]))
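    # MAE on class indices equals the error rate for binary labels; with more than
    # two classes it depends on the arbitrary class ordering, so read it with care
    mae_dt = mean_absolute_error(y_test, pred_dt)
    print('Decision Tree MAE:', mae_dt)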
else:
    print('Skipping supervised Decision Tree — no labels available.')
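
# Small numeric check of how the MAE printed above relates to accuracy. The
# arrays are made up for illustration. For 0/1 labels every error contributes
# exactly 1, so MAE = 1 - accuracy; with three or more classes the value changes
# with the class numbering.

```python
import numpy as np

y_true_demo = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 1, 1])

accuracy_demo = float(np.mean(y_true_demo == y_pred_demo))
mae_demo = float(np.mean(np.abs(y_true_demo - y_pred_demo)))
print(accuracy_demo, mae_demo)  # 0.75 0.25

# Multi-class case: both predictions are equally wrong as classifications,
# yet MAE differs because class 2 is numerically further from class 0
y_true_multi = np.array([0, 0])
mae_to_class1 = float(np.mean(np.abs(y_true_multi - np.array([1, 1]))))
mae_to_class2 = float(np.mean(np.abs(y_true_multi - np.array([2, 2]))))
print(mae_to_class1, mae_to_class2)  # 1.0 2.0
```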

# 10) Supervised CNN classifier (end-to-end) — if labels exist

In [None]:
if len(np.unique(y)) > 1:
    def build_simple_cnn(input_shape, n_classes):
        inputs = layers.Input(shape=input_shape)
        x = layers.Conv2D(32, 3, activation='relu', padding='same')(inputs)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
        x = layers.GlobalAveragePooling2D()(x)
        x = layers.Dense(128, activation='relu')(x)
        outputs = layers.Dense(n_classes, activation='softmax')(x)
        model = models.Model(inputs, outputs, name='simple_cnn')
        return model

    n_classes = len(np.unique(y))
    cnn = build_simple_cnn(input_shape, n_classes)
    cnn.compile(optimizer=optimizers.Adam(1e-4), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    cnn.summary()
    history_cnn = cnn.fit(train_generator, epochs=EPOCHS_CLS, validation_data=val_generator)

    # Evaluate CNN
    pred_probs = cnn.predict(X_test)
    preds = np.argmax(pred_probs, axis=1)
    print('CNN accuracy:', accuracy_score(y_test, preds))
    print('CNN F1:', f1_score(y_test, preds, average='weighted'))
    print(classification_report(y_test, preds, target_names=[str(c) for c in classes]))
    print('CNN MAE (labels vs preds numeric):', mean_absolute_error(y_test, preds))

    cm = confusion_matrix(y_test, preds)
    plt.figure(figsize=(6,5))
    plt.imshow(cm, interpolation='nearest')
    plt.title('Confusion matrix')
    plt.colorbar()
    plt.xticks(range(n_classes), [str(c) for c in classes], rotation=45)
    plt.yticks(range(n_classes), [str(c) for c in classes])
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
else:
    print('Skipping supervised CNN — no labels available.')
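
# Sketch of how the weighted F1 reported above relates to the confusion matrix,
# written with NumPy only. `weighted_f1_from_confusion` is a hypothetical helper
# and the 2x2 matrix is made up; the notebook itself uses scikit-learn's f1_score
# with average='weighted'.

```python
import numpy as np


def weighted_f1_from_confusion(cm):
    """Support-weighted F1 from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)      # true samples per class
    predicted = cm.sum(axis=0)    # predictions per class
    # Guard each division so classes with no samples or predictions get 0
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(np.sum(f1 * support) / support.sum())


cm_demo = [[8, 2],
           [1, 9]]
print(round(weighted_f1_from_confusion(cm_demo), 4))  # 0.8496
```

# Per-class F1 here is 16/19 and 18/21; both classes have 10 true samples, so the
# weighted score is their plain average.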

# 11) Logistic regression on encoded features (shows usefulness of unsupervised features)

In [None]:
if len(np.unique(y)) > 1:
    features_train = encoder.predict(X_train)
    features_test = encoder.predict(X_test)
    logreg = LogisticRegression(max_iter=500, random_state=SEED)
    logreg.fit(features_train, y_train)
    logreg_pred = logreg.predict(features_test)
    print('Logistic regression on encoded features acc:', accuracy_score(y_test, logreg_pred))
    print('F1:', f1_score(y_test, logreg_pred, average='weighted'))
else:
    print('Skipping logistic regression — no labels available.')

# 12) Save models & predictions

In [None]:
MODEL_DIR = 'models'
os.makedirs(MODEL_DIR, exist_ok=True)
print('Saving encoder and autoencoder...')
autoencoder.save(os.path.join(MODEL_DIR, 'autoencoder.h5'))
encoder.save(os.path.join(MODEL_DIR, 'encoder.h5'))
if 'cnn' in globals():
    cnn.save(os.path.join(MODEL_DIR, 'cnn_classifier.h5'))

# Save test-set predictions to CSV (columns depend on which steps ran)
out = {'path': paths_test}
if 'cluster_labels_test' in globals():
    # cluster ids need no ground truth, so keep them for unlabeled datasets too
    out['kmeans_cluster'] = list(cluster_labels_test)
if len(np.unique(y)) > 1:
    out['true_label'] = list(y_test)
    if 'preds' in globals():
        out['cnn_pred'] = list(preds)
pd.DataFrame(out).to_csv('prediction_results.csv', index=False)
print('Saved prediction_results.csv')

# 13) Reporting & 5-minute presentation (saved files)

In [None]:
report_text = '''
SDG 15 — Life on Land: Detecting deforestation with satellite imagery

Problem: Detecting tree loss and land-cover change is crucial for biodiversity and carbon monitoring. This notebook demonstrates a pipeline for automated deforestation detection using publicly available satellite imagery datasets.

ML Approach: Convolutional Autoencoder (unsupervised) -> KMeans clustering for unsupervised detection; Decision Tree & Logistic Regression on encoded features as interpretable baselines; CNN classifier trained end-to-end for supervised detection when labels are available.

Results: The notebook prints accuracy, F1-score, MAE (where appropriate) after training. Exact performance depends on dataset, class balance, image resolution and preprocessing.

Ethical considerations: Data bias, false positives and false negatives, human-in-the-loop validation, privacy when using very high resolution imagery, and the compute footprint of frequent retraining.
'''
with open('one_page_report.txt', 'w') as f:
    f.write(report_text)
print('Saved one_page_report.txt')

presentation_text = '''
Slide 1: Title & Problem — Deforestation detection for SDG 15
Slide 2: Data & Preprocessing — Satellite images, normalization, augmentation
Slide 3: ML Approach — Autoencoder + KMeans; Decision Tree; CNN
Slide 4: Results — metrics printed by the notebook and confusion matrix
Slide 5: Ethics & Next Steps — human-in-loop verification, time-series analysis, deployment
'''
with open('presentation_notes.txt', 'w') as f:
    f.write(presentation_text)
print('Saved presentation_notes.txt')

# End — Next steps & tips
# - If you want more up-to-date imagery beyond Kaggle, consider programmatic access to:
#   * Google Earth Engine (requires account and authentication)
#   * Planet API (commercial; requires API key)
#   * Copernicus Data Space Ecosystem (successor to the Copernicus Open Access Hub) or AWS Open Data (Landsat / Sentinel), combined with rasterio and rio-tiler for tiling
# - For production monitoring: combine this per-patch detector with geospatial tiling, timestamps, and a backend to serve alerts.
#
# Customize the dataset slug, image size, and training epochs to fit your compute budget.