# Phase 3: Scaling CV Prototype for TriagePal
This notebook extends the Phase 2 eye classifier to cover **rashes** (Fitzpatrick17k) and **wounds** (Wound Classification dataset from Kaggle). We fine-tune **MobileNetV2** via **transfer learning** for each classifier, apply **data augmentation**, and use **class weights** to handle class imbalance. Finally, we test an **integrated pipeline** that processes images from multiple categories.
  
  ## Key Features:
  - **Eyes**: Binary (healthy vs. infected) from Phase 2.
  - **Rashes**: Multi-class (e.g., psoriasis, dermatitis) from Fitzpatrick17k.
  - **Wounds**: Multi-class severity (e.g., abrasions, burns) from Kaggle wound dataset.
  - **Fine-tuning**: MobileNetV2 base + custom head.
  - **Augmentation & Weights**: Keras `ImageDataGenerator` augmentation; balanced class weights from scikit-learn.
  - **Integration**: Unified prediction function (select model by image type).
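
The unified prediction idea in the last bullet can be sketched as a small registry that maps an image type to a model and its class names. This is only an illustration: `predict_image`, `MODELS`, `make_stub`, and the class names are hypothetical placeholders, and the stub models stand in for trained Keras models that would return class probabilities.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a trained model: maps an image to class probabilities.
Predictor = Callable[[object], Sequence[float]]


def make_stub(probs: Sequence[float]) -> Predictor:
    """Return a fake model that ignores its input and yields fixed probabilities."""
    return lambda image: list(probs)


# Registry: image type -> (model, class names). All entries are illustrative.
MODELS: Dict[str, Tuple[Predictor, List[str]]] = {
    "eyes": (make_stub([0.2, 0.8]), ["healthy", "infected"]),
    "rashes": (make_stub([0.7, 0.3]), ["psoriasis", "urticaria"]),
}


def predict_image(image: object, image_type: str) -> Tuple[str, float]:
    """Select the model for `image_type` and return (top label, confidence)."""
    if image_type not in MODELS:
        raise ValueError(f"Unknown image type: {image_type!r}")
    model, class_names = MODELS[image_type]
    probs = model(image)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], float(probs[best])


print(predict_image(None, "eyes"))
```

Keeping the class names next to each model avoids mixing up label orderings when several classifiers share one entry point.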
  
  **Setup**: Upload your Kaggle API key (`kaggle.json`) in the Kaggle setup cell. The datasets then download automatically.
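
The Kaggle CLI reads the `username` and `key` fields from `~/.kaggle/kaggle.json`. A small check along these lines can catch a missing or incomplete credentials file before any download is attempted. It is a sketch only: `check_kaggle_credentials` is a hypothetical helper, not part of the Kaggle API.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path: Path) -> list:
    """Return a list of problems found in a kaggle.json file (empty if it looks usable)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        config = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return [f"{path} does not contain a JSON object"]
    # Both fields must be present and non-empty.
    return [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not config.get(field)
    ]


problems = check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")
print(problems or "kaggle.json looks complete")
```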

  ## Next: NLP Integration
  After the CV models, add BERT for symptom text (see the comments at the end).

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("alisofiya/conjunctivitis")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/alisofiya/conjunctivitis?dataset_version_number=1...


100%|██████████| 32.5M/32.5M [00:00<00:00, 149MB/s] 

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/alisofiya/conjunctivitis/versions/1


In [None]:
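# Set up the Kaggle API (run once per session)
import os
from google.colab import files

uploaded = files.upload()  # Upload kaggle.json
# Colab may save a re-upload under another name (e.g. "kaggle (1).json"),
# so write the uploaded bytes to the expected path instead of copying by name.
kaggle_dir = os.path.expanduser('~/.kaggle')
os.makedirs(kaggle_dir, exist_ok=True)
kaggle_json = os.path.join(kaggle_dir, 'kaggle.json')
with open(kaggle_json, 'wb') as f:
    f.write(next(iter(uploaded.values())))
os.chmod(kaggle_json, 0o600)
!kaggle datasets list  # Test API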

Saving kaggle (1).json to kaggle (1) (1).json
cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.12/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 434, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/


## Cell 1: Installs & Imports

In [None]:
!pip install -q kaggle tensorflow matplotlib scikit-learn pandas pillow

import os, glob, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications import MobileNetV2
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.19.0


## Load Datasets
        
### 1. Eyes (Binary: Healthy vs. Infected)



In [None]:
# Download & Load Eyes Dataset
!kaggle datasets download -d alisofiya/conjunctivitis -p /content/eyes --unzip
eyes_root = Path('/content/eyes/conjunctivitis')
img_files_eyes = list(eyes_root.rglob("*.jpg")) + list(eyes_root.rglob("*.jpeg"))
rows_eyes = [[str(p.resolve()), p.parent.name] for p in img_files_eyes]
df_eyes = pd.DataFrame(rows_eyes, columns=["image_path", "label"])
df_eyes = df_eyes.sample(frac=1, random_state=SEED).reset_index(drop=True)
print("Eyes Dataset:")
print(df_eyes['label'].value_counts())
print("\nSample:")
display(df_eyes.head())

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
Eyes Dataset:
Series([], Name: count, dtype: int64)

Sample:


Unnamed: 0,image_path,label
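
The eyes loader above takes each label from the image's parent folder name. The sketch below shows the same idea with a case-insensitive extension check, so files such as `.JPG` or `.png` are not silently skipped, plus an early check that turns an empty result into a clear error. The extension set and the helpers `collect_labeled_images` and `require_images` are assumptions for illustration; the real dataset layout may differ.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image formats


def collect_labeled_images(root: Path) -> list:
    """Return sorted (image_path, label) pairs, labelling by parent folder name."""
    rows = []
    for p in root.rglob("*"):
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS:
            rows.append((str(p), p.parent.name))
    return sorted(rows)


def require_images(rows: list, name: str) -> None:
    """Fail early with a clear message instead of a later empty-split error."""
    if not rows:
        raise FileNotFoundError(f"No images found for {name}; check the download step.")


rows = collect_labeled_images(Path("/content/eyes"))
print(len(rows), "images found")
```

Calling `require_images(rows, "eyes")` right after loading would stop the notebook at the point of failure rather than at the train/test split.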


### 2. Rashes (Multi-Class: Filtered Fitzpatrick17k)

In [None]:
# Download & Load Rashes (Fitzpatrick17k)
!kaggle datasets download -d nazmussadat013/fitzpatrick17k -p /content/rashes --unzip

rashes_root = Path('/content/rashes')  # Matches the -p download target above
csv_path = rashes_root / 'fitzpatrick17k.csv'  # Adjust if the archive unpacks into a subfolder
images_dir = rashes_root / 'images_part_1'  # Adjust if multi-part; merge if needed

if not csv_path.exists():
    print(f"Error: CSV file not found at {csv_path}. Please ensure the Kaggle dataset is downloaded and unzipped correctly.")
else:
    df_rashes = pd.read_csv(csv_path)
    # Build each path explicitly rather than relying on Path / Series arithmetic
    df_rashes['image_path'] = df_rashes['md5hash'].apply(lambda h: str(images_dir / f'{h}.jpg'))
    df_rashes = df_rashes[df_rashes['image_path'].apply(os.path.exists)]  # Filter existing

    # Filter for common rash types (customize based on your CSV sample)
    rash_types = ['psoriasis', 'acne vulgaris', 'allergic contact dermatitis', 'urticaria', 'lichen planus']
    df_rashes = df_rashes[df_rashes['label'].str.contains('|'.join(rash_types), na=False, case=False)]

    # Sample for speed (200 per class)
    df_rashes = df_rashes.groupby('label').apply(lambda x: x.sample(min(200, len(x)), random_state=SEED)).reset_index(drop=True)

    print("Rashes Dataset (Sampled):")
    print(df_rashes['label'].value_counts())
    print("\nSample:")
    display(df_rashes.head())

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
Error: CSV file not found at /content/fitzpatrick17k.csv/fitzpatrick17k.csv. Please ensure the Kaggle dataset is downloaded and unzipped correctly.
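
The rash filter above joins the target names with `|` into one regular expression and matches it case-insensitively as a substring. The standalone sketch below shows the same matching with Python's `re` module, with `re.escape` added as a precaution against names containing regex metacharacters. The sample labels are made up for illustration and `is_target_rash` is a hypothetical helper.

```python
import re

rash_types = ["psoriasis", "acne vulgaris", "allergic contact dermatitis",
              "urticaria", "lichen planus"]

# Escape each name so characters such as "(" or "." are matched literally.
pattern = re.compile("|".join(re.escape(t) for t in rash_types), flags=re.IGNORECASE)


def is_target_rash(label) -> bool:
    """True if any target name occurs inside the label; missing labels count as False."""
    return isinstance(label, str) and pattern.search(label) is not None


sample_labels = ["Psoriasis", "pustular psoriasis", "melanoma", None, "lichen planus"]
kept = [lab for lab in sample_labels if is_target_rash(lab)]
print(kept)
```

Because this is a substring match, a label that merely contains a target name is also kept, which can yield more classes than the five listed. An exact comparison against the target names would avoid that if it is not intended.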


### 3. Wounds (Multi-Class: Severity from Kaggle)

In [None]:
# Download & Load Wounds Dataset
!kaggle datasets download -d ibrahimfateen/wound-classification -p /content/wounds --unzip

wounds_root = Path('/content/wounds/Wound_dataset')
rows_wounds = []

if not wounds_root.exists():
    print(f"Error: Dataset directory not found at {wounds_root}. Please ensure the Kaggle dataset is downloaded and unzipped correctly.")
else:
    for label in wounds_root.iterdir():
        if label.is_dir():
            for img in label.glob('*.jpg'):
                rows_wounds.append([str(img.resolve()), label.name])

    df_wounds = pd.DataFrame(rows_wounds, columns=["image_path", "label"])
    df_wounds = df_wounds.sample(frac=1, random_state=SEED).reset_index(drop=True)

    # Sample for speed (200 per class)
    df_wounds = df_wounds.groupby('label').apply(lambda x: x.sample(min(200, len(x)), random_state=SEED)).reset_index(drop=True)

    print("Wounds Dataset (Sampled):")
    print(df_wounds['label'].value_counts())
    print("\nSample:")
    display(df_wounds.head())

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
Error: Dataset directory not found at /content/wounds/Wound_dataset. Please ensure the Kaggle dataset is downloaded and unzipped correctly.
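
The per-class cap used for rashes and wounds (at most 200 images per label) can be illustrated without pandas. The sketch below is a plain-Python equivalent under stated assumptions: `sample_per_class` is a hypothetical helper, and the file names and labels are invented for the example.

```python
import random
from collections import defaultdict


def sample_per_class(rows, cap, seed=42):
    """Keep at most `cap` randomly chosen rows per label; rows are (path, label) pairs."""
    by_label = defaultdict(list)
    for path, label in rows:
        by_label[label].append((path, label))
    rng = random.Random(seed)  # local RNG so the result is repeatable
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Small classes are kept whole; large ones are cut down to `cap`.
        sampled.extend(rng.sample(group, min(cap, len(group))))
    return sampled


rows = [(f"burn_{i}.jpg", "burns") for i in range(5)] + [("cut_0.jpg", "cuts")]
subset = sample_per_class(rows, cap=3)
print(len(subset))
```

Capping reduces training time but does not remove imbalance: classes smaller than the cap stay small, which is why class weights are still computed afterwards.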


## Prepare Data: Splits, Generators, Weights

In [None]:
IMG_SIZE = (224, 224)  # MobileNetV2 default
BATCH_SIZE = 32
# Common Augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
val_datagen = ImageDataGenerator(rescale=1./255)

# Function to prepare splits & generators
def prepare_data(df, class_mode='binary', val_size=0.2, test_size=0.1):
    train_df, temp_df = train_test_split(df, test_size=val_size + test_size, stratify=df['label'], random_state=SEED)
    val_df, test_df = train_test_split(temp_df, test_size=test_size/(val_size + test_size), stratify=temp_df['label'], random_state=SEED)

    train_gen = train_datagen.flow_from_dataframe(train_df, x_col='image_path', y_col='label', target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode=class_mode, shuffle=True)
    val_gen = val_datagen.flow_from_dataframe(val_df, x_col='image_path', y_col='label', target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode=class_mode, shuffle=False)
    test_gen = val_datagen.flow_from_dataframe(test_df, x_col='image_path', y_col='label', target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode=class_mode, shuffle=False)

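    # Class weights for imbalance, computed on the training split only.
    # Keras expects class_weight keyed by integer class index, not label name.
    classes = np.unique(train_df['label'])
    class_weights = compute_class_weight('balanced', classes=classes, y=train_df['label'])
    class_weight_dict = {train_gen.class_indices[c]: w for c, w in zip(classes, class_weights)}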

    return train_gen, val_gen, test_gen, class_weight_dict, test_df

# Prepare for each
train_gen_eyes, val_gen_eyes, test_gen_eyes, weights_eyes, test_df_eyes = prepare_data(df_eyes, 'binary')
train_gen_rashes, val_gen_rashes, test_gen_rashes, weights_rashes, test_df_rashes = prepare_data(df_rashes, 'categorical')
train_gen_wounds, val_gen_wounds, test_gen_wounds, weights_wounds, test_df_wounds = prepare_data(df_wounds, 'categorical')

ValueError: With n_samples=0, test_size=0.30000000000000004 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
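
For reference, scikit-learn's `'balanced'` mode, which `prepare_data` relies on, gives each class the weight `n_samples / (n_classes * class_count)`. Keras `fit` expects `class_weight` to be keyed by integer class index. The sketch below reproduces the formula in plain Python and keys the result by index. It assumes indices follow sorted label order, which should be checked against `train_gen.class_indices` in practice. The helper name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_index: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Assumption: class indices follow sorted label order.
    class_indices = {label: i for i, label in enumerate(sorted(counts))}
    return {
        class_indices[label]: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


labels = ["healthy"] * 6 + ["infected"] * 2
print(balanced_class_weights(labels))
```

With six healthy and two infected samples, the minority class receives the larger weight (8 / (2 * 2) = 2.0 versus 8 / (2 * 6), about 0.67), so its errors count more in the loss.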

### Fine-Tune Models with MobileNetV2