<a href="https://colab.research.google.com/github/Romeo-the-rebel/COS711-Assignment-3/blob/RomeoBranch/COS711_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ✅ COS711 Assignment 3 – CNN + Transfer Learning Checklist

### 🗂️ Setup & Environment
- [ ] Mount Google Drive in Colab (`drive.mount('/content/drive')`)
- [ ] Upload files to `COS711_Assignment3/` folder
- [ ] Unzip data (`typ.zip`, `exo.zip`, `unl.zip`)
- [ ] Install dependencies (`torch`, `torchvision`, `pandas`, `astropy`, etc.)
- [ ] Enable GPU runtime

### 🧭 Data Preparation
- [ ] Inspect `labels.csv` and `test.csv`
- [ ] Parse coordinates from filenames
- [ ] Match images to nearest label (Astropy / Euclidean)
- [ ] Combine into dataframe `[path, ra, dec, label]`
- [ ] Split data into training/validation
- [ ] Visualize some labeled images

### 🧠 Model Definition (CNN + Transfer Learning)
- [ ] Choose pretrained CNN backbone (ResNet50 / EfficientNet)
- [ ] Replace final layer for multi-label output
- [ ] Add sigmoid activation, use `BCEWithLogitsLoss`
- [ ] Define optimizer (Adam/SGD) and LR scheduler
- [ ] Apply augmentations (rotation, flip, normalize)

### ⚙️ Training
- [ ] Load data with `DataLoader`
- [ ] Train baseline CNN (few epochs)
- [ ] Plot training/validation loss
- [ ] Evaluate precision, recall, F1-score
- [ ] Save baseline model (`cnn_baseline.pt`)

### 🧩 Pseudo-Labelling
- [ ] Predict labels on unlabeled dataset
- [ ] Filter with high-confidence threshold (≥ 0.9)
- [ ] Create `generated_labels.csv`
- [ ] Merge pseudo-labeled data into training set
- [ ] Retrain CNN and evaluate improvement

### 📊 Testing & Results
- [ ] Predict on `test.csv`
- [ ] Create `test_labels.csv`
- [ ] Save final model weights
- [ ] Generate evaluation plots (confusion matrix)

### 🧾 Reporting & Deliverables
- [ ] Prepare `test_labels.csv`
- [ ] Prepare `generated_labels.csv`
- [ ] Write `README.md`
- [ ] Record 12-minute video presentation (≤ 50 MB)
- [ ] Add team contribution breakdown

### 🌟 Optional (for extra marks)
- [ ] Add early stopping / learning-rate scheduler
- [ ] Compare ResNet vs EfficientNet
- [ ] Add Grad-CAM visualizations
- [ ] Tune pseudo-labelling threshold
- [ ] Ensemble multiple CNNs


In [6]:

import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
labeled_images = '/content/drive/MyDrive/COS711/labels.csv'
test_images = '/content/drive/MyDrive/COS711/test.csv'
exoitc_images = '/content/drive/MyDrive/COS711/typ_PNG'
typical_images='/content/drive/MyDrive/COS711/typ_PNG'
unlabeled_images='/content/drive/MyDrive/COS711/unl_PNG'

#put all the information into dataframes
labeled_df = pd.read_csv(labeled_images)
test_df = pd.read_csv(test_images)
print(labeled_df.head())
print("="*50)
print(test_df.head())


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
   10.32822108  -20.47635725         FR II Unnamed: 3 Unnamed: 4
0    92.109802    -49.431413       typical        NaN        NaN
1    88.916825    -59.431868  Point Source        NaN        NaN
2     5.457981    -25.589637         FR II        NaN        NaN
3   119.417608    -53.396711         FR II        NaN        NaN
4   144.100249    -76.517079          Bent        NaN        NaN
   201.7436567  -31.32163727
0   234.261286    -46.590846
1    66.793081    -62.375058
2   108.760518    -59.958776
3   202.148240    -31.432391
4    57.025208    -73.861150


In [7]:
#list of all the streamed images
import os
image_files = sorted([f for f in os.listdir(exoitc_images) if f.endswith(('.png', '.jpg', '.jpeg', '.gif'))])
image_files2 = sorted([f for f in os.listdir(typical_images) if f.endswith(('.png', '.jpg', '.jpeg', '.gif'))])
image_files3 = sorted([f for f in os.listdir(unlabeled_images) if f.endswith(('.png', '.jpg', '.jpeg', '.gif'))])
print(image_files[:5])
print("="*50)
print(image_files2[:5])
print("="*50)
print(image_files3[:5])

['0.523 -24.568_[0.04242756 0.04242756] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '1.028 -25.064_[0.07124072 0.07124072] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '1.076 -24.429_[0.03492131 0.03492131] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '1.448 -24.492_[0.04284089 0.04284089] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '108.614 -60.373_[0.11265306 0.11265306] deg_(J0712.0-6030.Fix.1pln-forPyBDSF.FITS).fits.png']
['0.250 -25.084_[0.02238656 0.02238656] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '0.316 -24.707_[0.0166409 0.0166409] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '0.327 -24.571_[0.02705031 0.02705031] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '0.371 -24.554_[0.009 0.009] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png', '0.425 -25.211_[0.02334787 0.02334787] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png']
['0.280 -84.980_[0.0173389 0.0173389] deg_(J2340.1-8510.Fix.1pln-forPyBDSF.FITS).fits.png', '0.315 -85.347_[0.009 0.009] deg_(J2340.1-8510.Fi

In [20]:
#maps an image(exotic and typical) using it's title to a label(in label.csv)
import pandas as pd
import os
import numpy as np
import glob
from pathlib import Path

try:
    from astropy.coordinates import SkyCoord
    import astropy.units as u
    ASTROPY_AVAILABLE = True
except Exception:
    ASTROPY_AVAILABLE = False


labels_csv = "/content/drive/MyDrive/COS711/labels.csv"
typ_dir = "/content/drive/MyDrive/COS711/typ_PNG"
exo_dir = "/content/drive/MyDrive/COS711/exo_PNG"


labels_df = pd.read_csv(labels_csv, header=1)
labels_df.rename(columns={labels_df.columns[0]: 'ra', labels_df.columns[1]: 'dec', labels_df.columns[2]: 'label'}, inplace=True) # Rename columns

def parse_coords_from_filename(filename):
    """Extract RA and DEC (as floats) from an image filename."""
    base = Path(filename).stem
    parts = base.replace(",", "_").replace("-", "_-").split("_")
    floats = []
    for p in parts:
        try:
            val = float(p)
            floats.append(val)
            if len(floats) == 2:
                return floats[0], floats[1]
        except:
            continue
    return None, None

def find_nearest_label(ra_img, dec_img, labels_df):
    """Find nearest label in labels_df for given RA/DEC."""
    if ASTROPY_AVAILABLE:
        c_img = SkyCoord(ra=ra_img*u.deg, dec=dec_img*u.deg)
        c_labels = SkyCoord(ra=labels_df['ra'].values*u.deg, dec=labels_df['dec'].values*u.deg)
        sep = c_img.separation(c_labels).arcsec
        idx = np.argmin(sep)
        return labels_df.iloc[idx]['label']
    else:
        coords = labels_df[['ra', 'dec']].values
        dists = np.sqrt((coords[:,0]-ra_img)**2 + (coords[:,1]-dec_img)**2)
        idx = np.argmin(dists)
        return labels_df.iloc[idx]['label']

def map_images_to_labels(image_dir, labels_df):
    """Return DataFrame with filename, RA, DEC, and matched label."""
    image_files = sorted(glob.glob(os.path.join(image_dir, "*")))
    mapped_data = []

    for f in image_files:
        ra, dec = parse_coords_from_filename(f)
        if ra is not None and dec is not None:
            label = find_nearest_label(ra, dec, labels_df)
            mapped_data.append([f, ra, dec, label])

    mapped_df = pd.DataFrame(mapped_data, columns=["filename", "ra", "dec", "label"])
    return mapped_df
typ_mapped_df = map_images_to_labels(typ_dir, labels_df)
exo_mapped_df = map_images_to_labels(exo_dir, labels_df)
print(" Typical mapped:", typ_mapped_df.shape)
print(" Exotic mapped:", exo_mapped_df.shape)
all_mapped_df = pd.concat([typ_mapped_df, exo_mapped_df], ignore_index=True)
display(all_mapped_df.head())
all_mapped_df.to_csv("/content/drive/MyDrive/COS711/mapped_labels.csv", index=False)
print("💾 Mapped labels saved to /content/drive/MyDrive/COS711/mapped_labels.csv")

✅ Typical mapped: (2049, 4)
✅ Exotic mapped: (58, 4)


Unnamed: 0,filename,ra,dec,label
0,/content/drive/MyDrive/COS711/typ_PNG/0.250 -2...,0.25,-25.084,FR II
1,/content/drive/MyDrive/COS711/typ_PNG/0.316 -2...,0.316,-24.707,FR II
2,/content/drive/MyDrive/COS711/typ_PNG/0.327 -2...,0.327,-24.571,Bent
3,/content/drive/MyDrive/COS711/typ_PNG/0.371 -2...,0.371,-24.554,Bent
4,/content/drive/MyDrive/COS711/typ_PNG/0.425 -2...,0.425,-25.211,Bent


💾 Mapped labels saved to /content/drive/MyDrive/COS711/mapped_labels.csv


In [9]:
# cell 1: essentials
import os, glob, math
import pandas as pd
import numpy as np
from pathlib import Path

# optional: use astropy for accurate sky distances (recommended)
try:
    from astropy.coordinates import SkyCoord
    import astropy.units as u
    ASTROPY_AVAILABLE = True
except Exception:
    ASTROPY_AVAILABLE = False

# paths (adjust if needed)
labels_csv = "/content/drive/MyDrive/COS711/labels.csv"    # provided labels file (from assignment zip)
test_csv   = "/content/drive/MyDrive/COS711/test.csv"
typ_dir    = "/content/drive/MyDrive/COS711/typ_PNG"
exo_dir    = "/content/drive/MyDrive/COS711/typ_PNG"
unl_dir    = "/content/drive/MyDrive/COS711/unl_PNG"

# helper to parse filenames like "RA_DEC_...png" (the actual separator may vary)
def parse_coords_from_filename(fname):
    # adapt to your filename pattern; this is a robust attempt
    base = Path(fname).stem
    parts = base.replace(",", "_").split("_")
    # find two tokens that parse as floats
    floats = []
    for p in parts:
        try:
            floats.append(float(p))
            if len(floats)==2:
                return floats[0], floats[1]
        except:
            continue
    return None, None

# load labels
labels_df = pd.read_csv(labels_csv, header=1) # Load with header=1
labels_df.rename(columns={labels_df.columns[0]: 'ra', labels_df.columns[1]: 'dec', labels_df.columns[2]: 'label'}, inplace=True) # Rename columns
# inspect
labels_df.head()

Unnamed: 0,10.32822108,-20.47635725,FR II,Unnamed: 3,Unnamed: 4
0,92.109802,-49.431413,typical,,
1,88.916825,-59.431868,Point Source,,
2,5.457981,-25.589637,FR II,,
3,119.417608,-53.396711,FR II,,
4,144.100249,-76.517079,Bent,,


In [10]:
# cell 2: map images -> nearest label (using astropy if available)
def nearest_label_for_image(img_path, labels_df):
    ra_img, dec_img = parse_coords_from_filename(img_path)
    if ra_img is None:
        return None
    if ASTROPY_AVAILABLE:
        c_img = SkyCoord(ra=ra_img*u.deg, dec=dec_img*u.deg, unit=(u.deg,u.deg))
        c_labels = SkyCoord(ra=labels_df['ra'].values*u.deg, dec=labels_df['dec'].values*u.deg, unit=(u.deg,u.deg))
        sep = c_img.separation(c_labels).arcsec  # angular separation in arcsec
        idx = np.argmin(sep)
        return labels_df.iloc[idx]
    else:
        # fallback: Euclidean distance in coordinate space (works roughly if coords in degrees)
        coords = labels_df[['ra','dec']].values
        dists = np.sqrt((coords[:,0]-ra_img)**2 + (coords[:,1]-dec_img)**2)
        idx = np.argmin(dists)
        return labels_df.iloc[idx]

# example: map typical dir
typ_files = glob.glob(os.path.join(typ_dir, "*"))
mapped = []
for f in typ_files[:200]:   # don't iterate all now if large -- just test
    lab = nearest_label_for_image(f, labels_df)
    mapped.append((f, lab['label'] if lab is not None else None))

mapped[:10]


[('/content/drive/MyDrive/COS711/typ_PNG/355.180 -8.899_[0.009 0.009] deg_(Abell_2645_1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/202.541 -31.778_[0.009 0.009] deg_(Abell_3562.APSC.1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/86.994 -21.782_[0.01591703 0.01591703] deg_(Abell_3365.1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/53.824 -40.614_[0.00904137 0.00904137] deg_(J0336.3-4037.Fix.1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/342.523 -16.926_[0.01175578 0.01175578] deg_(Abell_2485_1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/2.138 -24.612_[0.03064421 0.03064421] deg_(Abell_141_1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/149.526 -76.136_[0.009 0.009] deg_(J0943.4-7619.Fix.1pln-forPyBDSF.FITS).fits.png',
  None),
 ('/content/drive/MyDrive/COS711/typ_PNG/191.215 -48.

In [11]:
# cell: PyTorch dataloader skeleton
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import torchvision.transforms as T

# define the set of classes you'll predict (example; adapt to real labels)
CLASSES = ["Point", "FRI", "FRII", "Bent", "XRG", "ZRG", "ShouldBeDiscarded", "Other", "Exotic"]
class_to_idx = {c:i for i,c in enumerate(CLASSES)}

def labels_to_multihot(label_list):
    mh = np.zeros(len(CLASSES), dtype=np.float32)
    for lab in label_list:
        if lab in class_to_idx:
            mh[class_to_idx[lab]] = 1.0
    return mh

train_transforms = T.Compose([
    T.Resize((224,224)),
    T.RandomRotation(30),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ToTensor(),
    # do normalization if using a pretrained model expecting ImageNet stats
    T.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])

class RadioDataset(Dataset):
    def __init__(self, rows, transform=None):
        # rows: list of (img_path, [list_of_labels])
        self.rows = rows
        self.transform = transform
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, idx):
        p, labs = self.rows[idx]
        img = Image.open(p).convert("RGB")
        if self.transform:
            img = self.transform(img)
        label = torch.tensor(labels_to_multihot(labs))
        return img, label

# Example instantiation:
# train_rows = [(path, ["FRI","Bent"]), ...]
# ds = RadioDataset(train_rows, transform=train_transforms)
# dl = DataLoader(ds, batch_size=16, shuffle=True, num_workers=4)


In [12]:
import torchvision.models as models
import torch.nn as nn
import torch.optim as optim

def get_model(num_classes=len(CLASSES), backbone="resnet50", pretrained=True):
    if backbone=="resnet50":
        m = models.resnet50(pretrained=pretrained)
        nfeats = m.fc.in_features
        m.fc = nn.Linear(nfeats, num_classes)
    else:
        # swap to EfficientNet, etc., as needed
        m = models.resnet18(pretrained=pretrained)
        nfeats = m.fc.in_features
        m.fc = nn.Linear(nfeats, num_classes)
    return m

model = get_model()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.BCEWithLogitsLoss()  # for multi-label
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)




Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth


100%|██████████| 97.8M/97.8M [00:01<00:00, 94.4MB/s]


In [13]:
def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0.0
    for imgs, labels in loader:
        imgs = imgs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        out = model(imgs)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * imgs.size(0)
    return total_loss / len(loader.dataset)


In [14]:
# Load model if you trained it earlier and saved weights
model_path = "/content/drive/MyDrive/COS711/cnn_baseline.pt"  # change if needed
if os.path.exists(model_path):
    model.load_state_dict(torch.load(model_path, map_location=device))
    print("✅ Model weights loaded successfully.")
else:
    print("⚠️ Model checkpoint not found — using current model.")


⚠️ Model checkpoint not found — using current model.


In [15]:
from tqdm import tqdm

model.eval()

T_HIGH = 0.90
unlabeled_predictions = []

unl_files = sorted(glob.glob(os.path.join(unl_dir, "*")))

for path in tqdm(unl_files, desc="Predicting unlabeled images"):
    try:
        img = Image.open(path).convert("RGB")
        img_t = train_transforms(img).unsqueeze(0).to(device)
        with torch.no_grad():
            out = model(img_t)
            probs = torch.sigmoid(out).cpu().numpy()[0]

        labels_pred = [CLASSES[i] for i, p in enumerate(probs) if p >= T_HIGH]
        if not labels_pred:
            labels_pred = [CLASSES[int(np.argmax(probs))]]
        ra, dec = parse_coords_from_filename(path)
        unlabeled_predictions.append([ra, dec, ";".join(labels_pred)])
    except Exception as e:
        print(f"Error processing {path}: {e}")

#storing image name and predicted label to csv
generated_labels_df = pd.DataFrame(unlabeled_predictions, columns=["ID", "predicted_labels"])
generated_labels_path = "/content/drive/MyDrive/COS711/generated_labels.csv"
generated_labels_df.to_csv(generated_labels_path, index=False)
print(f"✅ generated_labels.csv saved to: {generated_labels_path}")


Predicting unlabeled images: 100%|██████████| 13821/13821 [1:05:26<00:00,  3.52it/s]

✅ generated_labels.csv saved to: /content/drive/MyDrive/COS711/generated_labels.csv





In [None]:
test_predictions = []

for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Predicting test set"):
    ra, dec = row["ra"], row["dec"]
    all_images = glob.glob(os.path.join(typ_dir, "*")) + glob.glob(os.path.join(exo_dir, "*"))
    min_dist, best_path = float("inf"), None
    for img_path in all_images:
        ra_img, dec_img = parse_coords_from_filename(img_path)
        if ra_img is None:
            continue
        dist = math.sqrt((ra - ra_img)**2 + (dec - dec_img)**2)
        if dist < min_dist:
            min_dist, best_path = dist, img_path

    if best_path:
        try:
            img = Image.open(best_path).convert("RGB")
            img_t = train_transforms(img).unsqueeze(0).to(device)
            with torch.no_grad():
                out = model(img_t)
                probs = torch.sigmoid(out).cpu().numpy()[0]
            labels_pred = [CLASSES[i] for i, p in enumerate(probs) if p >= 0.5]
            if not labels_pred:
                labels_pred = [CLASSES[int(np.argmax(probs))]]
        except Exception:
            labels_pred = ["Unknown"]
    else:
        labels_pred = ["Unknown"]

    test_predictions.append([ra, dec, ";".join(labels_pred)])

# Save results
test_labels_df = pd.DataFrame(test_predictions, columns=["ra", "dec", "predicted_labels"])
test_labels_path = "/content/drive/MyDrive/COS711/test_labels.csv"
test_labels_df.to_csv(test_labels_path, index=False)
print(f"✅ test_labels.csv saved to: {test_labels_path}")
