<a href="https://colab.research.google.com/github/Gemlala/AI-Projekt/blob/main/Image_Captioning_f%C3%BCr_Kreiselbilder_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🚀 Notebook scaffold: Image captioning for roundabout images (PyTorch)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# ================================================
# 0. SETUP
# ================================================
!pip install torch torchvision pillow pandas numpy tqdm transformers

import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm import tqdm
from transformers import AutoTokenizer, VisionEncoderDecoderModel, AutoFeatureExtractor

device = "cuda" if torch.cuda.is_available() else "cpu"
device




'cpu'

In [None]:
import pandas as pd

csv_path = "/content/drive/MyDrive/Colab Notebooks/Bsc/gdf_u_filtered_subset.csv"

# Read the CSV
df = pd.read_csv(csv_path)

# Show the first 5 rows
print(df.head())


   Unfall-Nr (UAP) Unfalldatum (UAP)  \
0          1145168        2021-03-16   
1           243027        2023-07-19   
2              663        2023-06-09   
3     620201000197        2020-10-29   
4     202307001851        2023-07-09   

                                     Unfalltyp (UAP)  \
0  Kollision beim Rechtseinbiegen mit von links k...   
1                       Anderer Unfall beim Abbiegen   
2  Kollision beim Rechtseinbiegen mit von links k...   
3  Kollision beim Rechtseinbiegen mit von links k...   
4                                     Ohne Kollision   

               Unfalltyp Gruppe  Anzahl beteiligte Fahrräder (UAP-Objekt)  \
0                Einbiegeunfall                                         0   
1                 Abbiegeunfall                                         1   
2                Einbiegeunfall                                         0   
3                Einbiegeunfall                                         0   
4  Schleuder- oder Selbstunfall     

1️⃣ Prepare the data

Assumptions:

your images are stored here:
/content/drive/MyDrive/Colab Notebooks/Bsc/SwissImageTiles_Kreisel_JPG

the image files are named 0.jpg, 1.jpg, ..., 3369.jpg

you have a CSV with descriptions (caption text)

Example CSV:

id	caption
0	"Ein grosser Kreisel mit vier Zufahrten..."

If you still need to generate the caption texts first → let me know.
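
The assumptions above can be checked before training starts. The following is a minimal sketch, assuming a comma-separated caption file with a header row containing "id" and "caption", and images named <id>.jpg. The helper name find_missing_images and its ext parameter are illustrative and not part of the notebook.

```python
import csv
from pathlib import Path


def find_missing_images(caption_csv, img_dir, ext=".jpg"):
    """Return the ids listed in the caption CSV that have no image file in img_dir."""
    img_dir = Path(img_dir)
    missing = []
    with open(caption_csv, newline="", encoding="utf-8") as f:
        # Assumption: header row with at least the columns "id" and "caption".
        for row in csv.DictReader(f):
            if not (img_dir / f"{row['id']}{ext}").is_file():
                missing.append(row["id"])
    return missing
```

Calling find_missing_images(CAPTION_FILE, IMG_DIR) once the paths are defined should return an empty list if every caption has a matching image. The ids come back as strings because the csv module does not convert types.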

In [None]:
# ================================================
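# 1. LOAD DATA
# ================================================

# Folder with the image tiles; the dataset below expects files named <id>.jpg in it
IMG_DIR = "/content/drive/MyDrive/Colab Notebooks/Bsc/SwissImageTiles_Kreisel_PNG_A"
# CSV with the columns "id" and "caption"
CAPTION_FILE = "/content/drive/MyDrive/Colab Notebooks/Bsc/captions.csv"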

df = pd.read_csv(CAPTION_FILE)
df.head()


2️⃣ Dataset & DataLoader

In [None]:
# ================================================
# 2. DEFINE DATASET
# ================================================

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

class RoundaboutCaptionDataset(Dataset):
    def __init__(self, df, img_dir, feature_extractor, tokenizer):
        self.df = df
        self.img_dir = img_dir
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_id = row["id"]
        # Append EOS so the decoder can learn where a caption ends
        caption = str(row["caption"]) + self.tokenizer.eos_token

        img = Image.open(os.path.join(self.img_dir, f"{img_id}.jpg")).convert("RGB")

        # squeeze(0) removes only the batch dimension added by return_tensors="pt"
        pixel_values = self.feature_extractor(images=img, return_tensors="pt")["pixel_values"].squeeze(0)

        # Use the tokenizer passed to the dataset instead of the global variable
        encoded = self.tokenizer(
            caption,
            padding="max_length",
            max_length=64,
            truncation=True,
            return_tensors="pt"
        )

        return {
            "pixel_values": pixel_values,
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0)
        }

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

dataset = RoundaboutCaptionDataset(df, IMG_DIR, feature_extractor, tokenizer)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)


3️⃣ Define the model

We use a VisionEncoderDecoder model:

Encoder: ViT (Vision Transformer)

Decoder: GPT-2
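
During training, an encoder-decoder captioner of this kind is usually fed the caption shifted one position to the right, so that at each step the decoder sees the previous tokens and predicts the next one. The following is a simplified sketch of that idea in plain Python. The token ids 11, 22 and 33 are made up, and shift_right is an illustrative helper rather than the library's own function.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def shift_right(labels, start_id, pad_id):
    """Build decoder inputs from labels: prepend start_id and drop the last label.

    Ignored label positions (-100) are replaced by pad_id, because -100 is not
    a valid token id for an embedding lookup.
    """
    shifted = [start_id] + list(labels[:-1])
    return [pad_id if tok == IGNORE_INDEX else tok for tok in shifted]


# GPT-2's <|endoftext|> id, used in this notebook as start, end and pad token.
EOS = 50256
# Hypothetical caption of three tokens, followed by EOS and two padded positions.
labels = [11, 22, 33, EOS, IGNORE_INDEX, IGNORE_INDEX]
decoder_inputs = shift_right(labels, start_id=EOS, pad_id=EOS)

# Each row shows: step, token fed to the decoder, token it should predict.
for step, (inp, target) in enumerate(zip(decoder_inputs, labels)):
    print(step, inp, target)
```

Positions whose target is -100 do not contribute to the loss, which is why padded label positions are typically set to that value.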

In [None]:
# ================================================
# 3. MODEL
# ================================================

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "gpt2"
)

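# GPT-2 uses <|endoftext|> as both BOS and EOS: generation starts from it and stops at it
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Recent transformers versions read generation settings from generation_config, so mirror them there
model.generation_config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.eos_token_id = tokenizer.eos_token_id
model.generation_config.pad_token_id = tokenizer.pad_token_id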

model.to(device)


4️⃣ Training loop

In [None]:
# ================================================
# 4. TRAINING LOOP
# ================================================

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

EPOCHS = 3

for epoch in range(EPOCHS):
    model.train()
    pbar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}")

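    for batch in pbar:
        pixel_values = batch["pixel_values"].to(device)
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Exclude padded positions from the loss (-100 is the ignore index of cross-entropy)
        labels = input_ids.masked_fill(attention_mask == 0, -100)

        # The text mask must not be passed as attention_mask: it would be routed to the
        # ViT encoder. With right padding, a causal decoder and masked labels it is not needed.
        outputs = model(
            pixel_values=pixel_values,
            labels=labels
        )

        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        pbar.set_postfix({"loss": loss.item()})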
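
The loop above trains on every row of df, so an image inspected afterwards has most likely been seen during training. Below is a minimal sketch of a reproducible hold-out split over the image ids, assuming the ids are unique. The names split_ids, val_fraction and seed are illustrative and not part of the notebook.

```python
import random


def split_ids(ids, val_fraction=0.1, seed=0):
    """Shuffle ids with a fixed seed and return (train_ids, val_ids) as sorted lists."""
    ids = list(ids)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    rng.shuffle(ids)
    # Keep at least one validation id, even for very small datasets.
    n_val = max(1, int(len(ids) * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])
```

One possible use is train_ids, val_ids = split_ids(df["id"]), followed by df[df["id"].isin(train_ids)] to build the training dataset and the remaining rows for evaluation.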


5️⃣ Generate a caption

In [None]:
# ================================================
# 5. INFERENCE
# ================================================

def generate_caption(img_path):
    img = Image.open(img_path).convert("RGB")
    pixel_values = feature_extractor(images=img, return_tensors="pt")["pixel_values"].to(device)

    # Switch to evaluation mode so dropout is disabled during generation
    model.eval()
    output_ids = model.generate(
        pixel_values,
        max_length=64,
        num_beams=4
    )[0]

    caption = tokenizer.decode(output_ids, skip_special_tokens=True)
    return caption

# Example
generate_caption(os.path.join(IMG_DIR, "10.jpg"))


6️⃣ Optional: qualitative evaluation

In [None]:
import matplotlib.pyplot as plt

img_id = 42
img = Image.open(os.path.join(IMG_DIR, f"{img_id}.jpg"))
plt.imshow(img)
plt.axis("off")
print("Predicted:", generate_caption(os.path.join(IMG_DIR, f"{img_id}.jpg")))
print("GT:", df[df.id == img_id].caption.values[0])
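
Comparing the two printed captions by eye can be complemented with a rough number. The sketch below computes a whitespace-token F1 overlap between a predicted and a reference caption. It is a simple stand-in for illustration, not an established captioning metric such as BLEU or CIDEr, and the name token_f1 is illustrative.

```python
from collections import Counter


def token_f1(predicted, reference):
    """F1 overlap of lowercased whitespace tokens between two captions (0.0 to 1.0)."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

It could be applied to the output of generate_caption and the matching caption from df. Punctuation is not stripped, so "Kreisel" and "Kreisel." count as different tokens.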


🎉 Done!

You now have the complete scaffold for an image captioning project with roundabout images:

ViT + GPT-2 model

Training loop

Caption inference

Image display

What would you like to do next:

✅ automatically generate caption texts from the accident data?
✅ extend the captions with risk classes (low/medium/high)?
✅ use a LLaMA decoder instead of GPT-2?