📝 Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

# **COCO Dataset** 🖼️
https://cocodataset.org/

- more than 200,000 real-world images from a wide variety of scenes
- organized into 80 object categories (people, vehicles, animals, ...)
- every image annotated in detail, including captions
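
To make the caption annotations concrete, here is a minimal sketch that groups captions by image. The three records are toy data made up for illustration; only the keys `image_id`, `id`, and `caption` mirror the annotation records in the caption file, and `group_captions` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy records (made up); real caption records use the same three keys.
toy_annotations = [
    {"image_id": 1, "id": 10, "caption": "A dog runs across a field."},
    {"image_id": 1, "id": 11, "caption": "A brown dog playing outside."},
    {"image_id": 2, "id": 12, "caption": "A bus parked at a stop."},
]

def group_captions(annotations):
    """Return a dict that maps each image_id to the list of its captions."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

captions_by_image = group_captions(toy_annotations)
print(captions_by_image[1])
```

Grouping by `image_id` is useful because one image typically carries several captions, so the annotation list is longer than the image list.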


### **(1) Download and extract the dataset**

In [1]:
import os
import requests
import zipfile

# URLs for the COCO dataset
base_url = "http://images.cocodataset.org/"
files = {
    "train_images": "zips/train2017.zip",
    "val_images": "zips/val2017.zip",
    "annotations": "annotations/annotations_trainval2017.zip"
}

# Target directory
dataset_dir = "coco_dataset"

def download_and_extract(url, dest_dir):
    # Derive the local file name from the URL
    file_name = url.split("/")[-1]
    file_path = os.path.join(dest_dir, file_name)

    # Stream the download so the archive is never held in memory as a whole
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early on HTTP errors
    total_size = int(response.headers.get('content-length', 0))
    print(f"Size of {file_name}: {total_size / (1024 * 1024):.2f} MB")

    block_size = 1024 * 1024  # 1 MB
    with open(file_path, 'wb') as file:
        for data in response.iter_content(block_size):
            file.write(data)

    # Extract the archive
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)

    # Remove the ZIP file
    os.remove(file_path)

# Download and extract only if the directory does not exist yet.
# The check must run before the directory is created; otherwise the
# download would always be skipped.
if not os.path.exists(dataset_dir):
    os.makedirs(dataset_dir)
    print(f"Created directory: {dataset_dir}")
    for key, relative_url in files.items():
        url = base_url + relative_url
        print(f"Downloading and extracting: {url}")
        download_and_extract(url, dataset_dir)
else:
    print(f"Directory already exists: {dataset_dir}")
    print("Skipping download because the directory already exists.")

Directory already exists: coco_dataset
Skipping download because the directory already exists.
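
Because the download is skipped whenever the target directory exists, an interrupted run can leave an incomplete dataset behind unnoticed. The following sketch checks for the subfolders the three archives are assumed to unpack into (`train2017`, `val2017`, `annotations`); `missing_subdirs` and `EXPECTED_SUBDIRS` are hypothetical names introduced here for illustration.

```python
import os

# Folder names assumed to result from extracting the three archives
EXPECTED_SUBDIRS = ("train2017", "val2017", "annotations")

def missing_subdirs(dataset_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subfolders that are absent from dataset_dir."""
    return [
        name for name in expected
        if not os.path.isdir(os.path.join(dataset_dir, name))
    ]

missing = missing_subdirs("coco_dataset")
if missing:
    print(f"Incomplete dataset, missing: {missing}")
else:
    print("All expected folders are present.")
```

If the check reports missing folders, deleting `coco_dataset` and re-running the download cell is the simplest recovery.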


### **(2) Extract a subset (dogs)**

In [9]:
img_path = 'coco_dataset/train2017'  # relative path, consistent with dataset_dir in step (1)
save_path = '/content/'

In [6]:
import json

# Load the caption annotations for the training split
with open('coco_dataset/annotations/captions_train2017.json') as jf:
    captions = json.load(jf)

# Show an example annotation
print(captions['annotations'][10])

# Total number of annotations
print(len(captions['annotations']))

# Mapping from image ID to its index in captions['images']
mapping = {}
for idx, img in enumerate(captions['images']):
    mapping[img['id']] = idx

# Print every caption that contains the substring "face"
# (this also matches words such as "surface" or "faces").
# Iterating over the list directly avoids running past its end.
for annotation in captions['annotations']:
    caption = annotation['caption']
    if "face" in caption:
        print(caption)


{'image_id': 106140, 'id': 221, 'caption': 'An airplane that is, either, landing or just taking off.'}
591753
A young person has his face close to the toilet bowl.
A bathroom with a poster of an ugly face above the toilette.
A jar filled with liquid sits on a wood surface.
A large stone building with a clock face on it.
A teddy bear with it's face in the trashcan. 
A tan building with windows and a clock face on top. 
An older woman in the street has a bus with a face advertisement behind her.
A cat has his face buried in the bowl of a white toilet.
An older man carrying a plate of food and making a silly face. 
A motorcycle driver faces forward with a covered passenger riding side saddle.
A girl is holding a paper up over her face as a man is shown behind in a mirror talking on a phone.
a woman holding onto a piece of paper and covering her face with it
A herding dog faces a group of sheep grazing in a field.
A person sits at a white table holding a white piece of paper in front of he

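
The cell above only prints matching captions. One possible way to turn matches into an image subset is sketched below. It assumes that each entry in `captions['images']` carries a `file_name` field next to `id`, as in the COCO annotation format; `find_image_files` is a hypothetical helper, and the `toy` dictionary is made-up data in the same shape as the caption file. Unlike the plain substring test, it matches the keyword as a whole word, so "surface" would not count as a hit for "face".

```python
import re

def find_image_files(captions, keyword):
    """Return sorted file names of images with a caption containing keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    # Map image ID -> file name
    id_to_file = {img["id"]: img["file_name"] for img in captions["images"]}
    # A set removes duplicates when several captions of one image match
    matches = {
        id_to_file[ann["image_id"]]
        for ann in captions["annotations"]
        if pattern.search(ann["caption"])
    }
    return sorted(matches)

# Toy data (made up) in the same shape as the caption annotation file
toy = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg"},
        {"id": 2, "file_name": "000000000002.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog sits on a couch."},
        {"image_id": 1, "id": 11, "caption": "A small Dog resting indoors."},
        {"image_id": 2, "id": 12, "caption": "A hotdog on a paper plate."},
    ],
}

print(find_image_files(toy, "dog"))  # → ['000000000001.jpg']
```

With the real data, the returned file names could be joined with `img_path` and copied to `save_path`, for example with `shutil.copy`. Whole-word matching also skips plurals such as "dogs", so each word form would need its own search.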