### Splitting Flickr8k into train, validation and test sets

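Requires:
- The Flickr8k dataset in its original folder structure
- Text file containing all image names and captions (Flickr8k.token.txt)
- Text files listing the image names of each split (Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, Flickr_8k.testImages.txt)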

You can download the Flickr8k dataset here:
https://www.kaggle.com/datasets/adityajn105/flickr8k

In [9]:
import pandas as pd

def convert_labels_to_dataframe(labels_path: str):
    """
    Convert the captions text file to a DataFrame.

    Args:
        labels_path (str): path to the Flickr8k captions text file

    Returns:
        data: DataFrame with image_filename and image_caption columns
    """
    image_names = []
    captions = []

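    with open(labels_path, 'r') as file:
        for line in file:
            # Split at the first space. The image ID and the caption are separated by a tab,
            # so the first caption word stays attached to the ID and is discarded below;
            # one-word captions contain no space and are reported as invalid.
            line_details = line.strip().split(' ', 1)
            if len(line_details) == 2:
                image_name = line_details[0].split('#')  # keep only the filename before '#caption_num'
                image_names.append(image_name[0].strip())
                captions.append(line_details[1].strip())
            else:
                print("Invalid line:", line)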

    # Create a DataFrame
    data = pd.DataFrame({'image_filename': image_names, 'image_caption': captions})

    return data

caption_data = convert_labels_to_dataframe('../input/Flicker8k/Flickr8k.token.txt')
caption_data

Invalid line: 2428275562_4bde2bc5ea.jpg#0	A

Invalid line: 3640443200_b8066f37f6.jpg#0	a



| | image_filename | image_caption |
|---|---|---|
| 0 | 1000268201_693b08cb0e.jpg | child in a pink dress is climbing up a set of ... |
| 1 | 1000268201_693b08cb0e.jpg | girl going into a wooden building . |
| 2 | 1000268201_693b08cb0e.jpg | little girl climbing into a wooden playhouse . |
| 3 | 1000268201_693b08cb0e.jpg | little girl climbing the stairs to her playhou... |
| 4 | 1000268201_693b08cb0e.jpg | little girl in a pink dress going into a woode... |
| ... | ... | ... |
| 40453 | 997722733_0cb5439472.jpg | man in a pink shirt climbs a rock face |
| 40454 | 997722733_0cb5439472.jpg | man is rock climbing high in the air . |
| 40455 | 997722733_0cb5439472.jpg | person in a red shirt climbing up a rock face ... |
| 40456 | 997722733_0cb5439472.jpg | rock climber in a red shirt . |
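
The two "Invalid line" messages show a tab between the image ID and a one-word caption, and the captions in the table start without their first word. Both suggest that the token file separates ID and caption with a tab rather than a space. Below is a minimal sketch of a tab-based parser under that assumption. The helper name `parse_token_lines` and the sample lines are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def parse_token_lines(lines):
    """Hypothetical helper: parse 'image.jpg#N<TAB>caption' lines into a DataFrame."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first tab only, so the whole caption stays intact
        image_id, sep, caption = line.partition("\t")
        if not sep or not caption.strip():
            print("Invalid line:", line)
            continue
        # Drop the '#N' caption index that follows the filename
        image_filename = image_id.split("#", 1)[0].strip()
        rows.append((image_filename, caption.strip()))
    return pd.DataFrame(rows, columns=["image_filename", "image_caption"])


# Made-up sample lines in the assumed format (not taken from the dataset)
sample_lines = [
    "example_1.jpg#0\tA dog runs .\n",
    "example_1.jpg#1\tDog\n",
]
parsed = parse_token_lines(sample_lines)
print(parsed)
```

With this split, a one-word caption is kept instead of being rejected, and the first word of every caption is preserved.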


In [30]:
import shutil
import os
from tqdm import tqdm


def create_flicker_dataset(image_txt_files: str, output_folder: str, captions_data):
    """
    Create one split of the Flickr8k dataset from a split text file.

    This assumes that the train, test and validation image lists are already in separate text files
    (the Flickr8k download already ships these files).

    The split text files contain only image names, so the captions of each image are looked up
    in captions_data, which holds every caption in Flickr8k.

    Args:
        image_txt_files (str): path to the text file listing the images of one split
        output_folder (str): folder the images and captions are moved to, e.g. Train, Validation or Test
        captions_data (DataFrame): all image filenames and captions in Flickr8k

    Returns:
        captions_dataset: DataFrame of the captions for the images in this split
    """
    image_path = f'../input/Flicker8k/{output_folder}/Images/'
    label_path = f'../input/Flicker8k/{output_folder}/Labels/'

    source_folder = '../input/Flicker8k/Flicker8k_Dataset/'

    # Create the output folders if they do not exist yet
    os.makedirs(image_path, exist_ok=True)
    os.makedirs(label_path, exist_ok=True)

    with open(image_txt_files, 'r') as file:
        image_names = [line.strip() for line in file]

    # Collect the caption rows per image and concatenate once after the loop;
    # calling pd.concat inside the loop copies the growing DataFrame on every iteration
    caption_frames = []

    for image_name in tqdm(image_names):
        source_path = os.path.join(source_folder, image_name)
        destination_path = os.path.join(image_path, image_name)

        # Keep every caption row that belongs to this image
        caption_frames.append(captions_data[captions_data['image_filename'] == image_name])

        try:
            shutil.move(source_path, destination_path)
        except FileNotFoundError:
            print(f"File '{image_name}' not found in '{source_folder}'")
        except shutil.Error as e:
            print(f"Error occurred while moving '{image_name}': {e}")

    if caption_frames:
        captions_dataset = pd.concat(caption_frames, ignore_index=True)
    else:
        captions_dataset = pd.DataFrame(columns=['image_filename', 'image_caption'])

    captions_dataset.to_csv(f"{label_path}Label.csv")
    print(f"Completed, {len(image_names)} images moved and {len(captions_dataset)} captions moved")
    return captions_dataset
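
The caption lookup inside the function compares the whole caption table against one image name per iteration. As an alternative, the same filtering can be sketched with a single `isin` mask. The helper name `filter_captions` and the toy data are assumptions for illustration only. Unlike the loop, this variant returns rows in the order of the caption table, not in the order of the split file.

```python
import pandas as pd


def filter_captions(captions_data, image_names):
    """Hypothetical helper: keep the caption rows whose image is listed in image_names."""
    mask = captions_data["image_filename"].isin(image_names)
    filtered = captions_data[mask].reset_index(drop=True)
    # Report listed images that have no caption rows at all
    missing = sorted(set(image_names) - set(filtered["image_filename"]))
    return filtered, missing


# Toy caption table standing in for caption_data
toy_captions = pd.DataFrame({
    "image_filename": ["a.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "image_caption": ["cap 1", "cap 2", "cap 3", "cap 4"],
})
subset, missing = filter_captions(toy_captions, ["c.jpg", "a.jpg", "z.jpg"])
print(subset)
print("Images without captions:", missing)
```

The `missing` list makes it visible when a split file names an image that has no captions, which the loop version silently ignores.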

In [31]:
create_flicker_dataset('../input/Flicker8k/Flickr_8k.trainImages.txt', 'Train', caption_data)

100%|██████████| 6000/6000 [03:00<00:00, 33.30it/s]


Completed, 6000 images moved and 30000 captions moved


In [32]:
create_flicker_dataset('../input/Flicker8k/Flickr_8k.testImages.txt', 'Test', caption_data)

100%|██████████| 1000/1000 [00:14<00:00, 70.42it/s]

Completed, 1000 images moved and 5000 captions moved





| | image_filename | image_caption |
|---|---|---|
| 0 | 3385593926_d3e9c21170.jpg | dogs are in the snow in front of a fence . |
| 1 | 3385593926_d3e9c21170.jpg | dogs play on the snow . |
| 2 | 3385593926_d3e9c21170.jpg | brown dogs playfully fight in the snow . |
| 3 | 3385593926_d3e9c21170.jpg | brown dogs wrestle in the snow . |
| 4 | 3385593926_d3e9c21170.jpg | dogs playing in the snow . |
| ... | ... | ... |
| 4995 | 3490736665_38710f4b91.jpg | big dog stands on his hand leg as tennis balls... |
| 4996 | 3490736665_38710f4b91.jpg | brown and white dog in front of a shed overwhe... |
| 4997 | 3490736665_38710f4b91.jpg | brown and white dogs stands in front of a wood... |
| 4998 | 3490736665_38710f4b91.jpg | dog jumps for several tennis balls thrown at h... |
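
`create_flicker_dataset` calls `to_csv` without `index=False`, so each Label.csv stores the row index as an unnamed first column. The sketch below shows, on a made-up two-row table and an in-memory buffer instead of a real file, how that column can be read back as the index with `index_col=0`.

```python
import io

import pandas as pd

# Made-up captions standing in for one split
captions = pd.DataFrame({
    "image_filename": ["a.jpg", "a.jpg"],
    "image_caption": ["cap 1", "cap 2"],
})

# Write to an in-memory buffer the same way the function writes Label.csv
buffer = io.StringIO()
captions.to_csv(buffer)
csv_text = buffer.getvalue()

# Default read: the saved index comes back as an extra data column
default_read = pd.read_csv(io.StringIO(csv_text))
# index_col=0: the first column is used as the index again
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading with the defaults yields an additional column that pandas names "Unnamed: 0"; with `index_col=0` only `image_filename` and `image_caption` remain as columns.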


In [33]:
create_flicker_dataset('../input/Flicker8k/Flickr_8k.devImages.txt', 'Validation', caption_data)

100%|██████████| 1000/1000 [00:10<00:00, 93.45it/s]

Completed, 1000 images moved and 5000 captions moved





| | image_filename | image_caption |
|---|---|---|
| 0 | 2090545563_a4e66ec76b.jpg | boy laying face down on a skateboard is being ... |
| 1 | 2090545563_a4e66ec76b.jpg | girls play on a skateboard in a courtyard . |
| 2 | 2090545563_a4e66ec76b.jpg | people play on a long skateboard . |
| 3 | 2090545563_a4e66ec76b.jpg | small children in red shirts playing on a skat... |
| 4 | 2090545563_a4e66ec76b.jpg | young children on a skateboard going across a ... |
| ... | ... | ... |
| 4995 | 522652105_a89f1cf260.jpg | girl playing is a pile of colorful balls . |
| 4996 | 522652105_a89f1cf260.jpg | little girl plays in a ball pit . |
| 4997 | 522652105_a89f1cf260.jpg | little girl plays in a pit of colorful balls . |
| 4998 | 522652105_a89f1cf260.jpg | small girl is playing in a ball pit |
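
The three runs report 6000, 1000 and 1000 images with 30000, 5000 and 5000 captions, which works out to five captions per image in every split. A sanity check along these lines can confirm that no image appears in two splits and that every image has the expected number of captions. The function name `check_splits` and the toy frames are hypothetical. In the notebook, the frames returned by `create_flicker_dataset` would be passed instead.

```python
import pandas as pd


def check_splits(splits, captions_per_image=5):
    """Hypothetical check: splits maps a split name to its captions DataFrame."""
    problems = []
    names = list(splits)
    image_sets = {name: set(df["image_filename"]) for name, df in splits.items()}

    # An image must not appear in more than one split
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = image_sets[first] & image_sets[second]
            if shared:
                problems.append(f"{first}/{second} share {len(shared)} image(s)")

    # Every image is expected to have the same number of captions
    for name, df in splits.items():
        counts = df["image_filename"].value_counts()
        odd = counts[counts != captions_per_image]
        if len(odd):
            problems.append(f"{name}: {len(odd)} image(s) without {captions_per_image} captions")

    return problems


# Toy splits with two captions per image, one overlap and one incomplete image
toy_splits = {
    "Train": pd.DataFrame({
        "image_filename": ["a.jpg", "a.jpg", "b.jpg", "b.jpg"],
        "image_caption": ["c1", "c2", "c3", "c4"],
    }),
    "Test": pd.DataFrame({
        "image_filename": ["b.jpg", "b.jpg", "c.jpg"],
        "image_caption": ["c3", "c4", "c5"],
    }),
}
problems = check_splits(toy_splits, captions_per_image=2)
print(problems)
```

An empty list means the splits are disjoint and every image has the expected caption count.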
