# Preprocessing

Before we train the model, the dataset must be prepared. Because downloading the dataset takes a long time, the resulting file will be made available so that it can be accessed locally instead of being downloaded again on every run.

## Import libraries

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle

## Download the dataset

For this project, we'll be using the COCO 2014 training dataset. To avoid overfitting the model to the data, a smaller random sample will be used for training.
This dataset comes with images and their respective captions, but these will be separated so that the model can be tested for accuracy.
Note: downloading the dataset can take up to a few hours, so don't run the cell below unless you have to.

In [3]:
# Download training captions
annotation_folder = '/annotations/'
if not os.path.exists(os.path.abspath('.') + annotation_folder):
  annotation_zip = tf.keras.utils.get_file('captions.zip',
                                          cache_subdir=os.path.abspath('.'),
                                          origin = 'http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                          extract = True)
  annotation_file = os.path.dirname(annotation_zip)+'/annotations/captions_train2014.json'
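  os.remove(annotation_zip)
else:
  # Annotations were already downloaded, so point at the existing file
  annotation_file = os.path.abspath('.') + annotation_folder + 'captions_train2014.json'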

# Download training images
image_folder = '/train2014/'
if not os.path.exists(os.path.abspath('.') + image_folder):
  image_zip = tf.keras.utils.get_file('train2014.zip',
                                      cache_subdir=os.path.abspath('.'),
                                      origin = 'http://images.cocodataset.org/zips/train2014.zip',
                                      extract = True)
  PATH = os.path.dirname(image_zip) + image_folder
  os.remove(image_zip)
else:
  PATH = os.path.abspath('.') + image_folder

Downloading data from http://images.cocodataset.org/zips/train2014.zip


In [13]:
# Read the file so the captions can be extracted (as well as associated images)
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

all_captions = []
all_img_names = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)
    
    all_img_names.append(full_coco_image_path)
    all_captions.append(caption)
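
The file-name pattern used in the loop above can be checked in isolation. The following is a minimal sketch, assuming the train2014 images follow the `COCO_train2014_` prefix plus a 12-digit zero-padded image ID, as the cell above does. The helper name `coco_train_image_path`, the directory `'train2014'`, and the sample ID `9` are hypothetical and chosen only for illustration.

```python
import os

def coco_train_image_path(image_dir, image_id):
    """Build the path of a train2014 image from its numeric ID (hypothetical helper)."""
    # The file name embeds the image ID zero-padded to 12 digits.
    file_name = 'COCO_train2014_%012d.jpg' % image_id
    # os.path.join inserts the separator, so image_dir needs no trailing slash.
    return os.path.join(image_dir, file_name)

# Illustrative values only; any integer image ID is formatted the same way.
example_path = coco_train_image_path('train2014', 9)
print(example_path)
```

Using `os.path.join` rather than string concatenation is a design choice here: it avoids depending on whether the folder name ends with a slash.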

Create several reduced sets of increasing size, so that the runtime can be measured on the smallest set first before working up to the largest one.

In [14]:
print("Number of total captions:", len(all_captions))
print("Number of 1% of captions:", round(len(all_captions)*.01))
print("Number of 5% of captions:", round(len(all_captions)*.05))
print("Number of 10% of captions:", round(len(all_captions)*.1))

Number of total captions: 414113
Number of 1% of captions: 4141
Number of 5% of captions: 20706
Number of 10% of captions: 41411
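
The subset sizes printed above could also be computed once and reused, rather than typed in by hand later. This is a minimal sketch under that assumption; the helper name `subset_sizes` is hypothetical, and the total of 414113 is taken from the output above.

```python
def subset_sizes(total, fractions):
    """Map each fraction to the rounded number of captions it represents (hypothetical helper)."""
    # round() mirrors the rounding used in the print statements above.
    return {fraction: round(total * fraction) for fraction in fractions}

# Total caption count as reported in the cell output above.
sizes = subset_sizes(414113, [0.01, 0.05, 0.1])
for fraction, size in sizes.items():
    print("Fraction", fraction, "->", size, "captions")
```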


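Resample the data accordingly: shuffle the captions and image names together so that each caption stays paired with its image, then take the first 1%, 5%, and 10% of the shuffled data. Because every subset is sliced from the same shuffled order (fixed `random_state = 1`), each smaller set is contained in the larger ones. To draw samples that do not share a common prefix, a different `random_state` would be needed for each subset.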

In [15]:
# Shuffle the order of the captions and the image names
# Each respective caption should stay linked with its image
# A single shuffle is enough: the same random_state always gives the same order
train_captions, img_name = shuffle(all_captions, all_img_names, random_state = 1)

# 1% subset
train_captions_01 = train_captions[:4141]
img_name_01 = img_name[:4141]

# 5% subset
train_captions_05 = train_captions[:20706]
img_name_05 = img_name[:20706]

# 10% subset
train_captions_1 = train_captions[:41411]
img_name_1 = img_name[:41411]
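
The pairing and slicing behaviour of the cell above can be illustrated on toy data. The sketch below is not the notebook's method: it uses the standard library's `random.Random` as a stand-in for `sklearn.utils.shuffle` so that it runs without extra dependencies, and the helper name `paired_shuffle` and the toy captions and file names are hypothetical.

```python
import random

def paired_shuffle(captions, image_names, seed):
    """Shuffle two parallel lists with the same permutation (hypothetical helper)."""
    # Zip the lists so each caption travels together with its image name.
    pairs = list(zip(captions, image_names))
    # A seeded generator makes the shuffled order reproducible.
    random.Random(seed).shuffle(pairs)
    shuffled_captions = [caption for caption, _ in pairs]
    shuffled_names = [name for _, name in pairs]
    return shuffled_captions, shuffled_names

# Toy data: caption i belongs to image i.
captions = ['caption %d' % i for i in range(100)]
image_names = ['image_%d.jpg' % i for i in range(100)]

shuffled_captions, shuffled_names = paired_shuffle(captions, image_names, seed=1)

# Slicing the same shuffled order yields nested subsets.
small_captions = shuffled_captions[:5]
large_captions = shuffled_captions[:10]
print(set(small_captions) <= set(large_captions))
```

Since both slices come from one shuffled order, the smaller subset is always a prefix of the larger one, which is the same relationship the 1%, 5%, and 10% sets have when a single `random_state` is used.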