[View in Colaboratory](https://colab.research.google.com/github/SwapnilSParkhe/Image-Caption-Generation/blob/master/PreprocessingData_Image_Text.ipynb)

# Preparing Data for Image and Text

## Image Data

**Getting data**

In [17]:
#Getting the data from the web (-P saves the archive under drive/app)
!wget http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip -P drive/app

#Unzipping the data into the working directory
import zipfile
with zipfile.ZipFile('drive/app/Flickr8k_Dataset.zip', 'r') as zip_ref:   #path must match the wget -P target
    zip_ref.extractall()

--2018-04-14 23:34:16--  http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip
Resolving nlp.cs.illinois.edu (nlp.cs.illinois.edu)... 192.17.58.132
Connecting to nlp.cs.illinois.edu (nlp.cs.illinois.edu)|192.17.58.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115419746 (1.0G) [application/zip]
Saving to: ‘drive/app/Flickr8k_Dataset.zip’


2018-04-14 23:34:32 (69.0 MB/s) - ‘drive/app/Flickr8k_Dataset.zip’ saved [1115419746/1115419746]
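
Before running the feature extraction, it can help to confirm that the archive unpacked as expected. The snippet below is a minimal sketch that counts the image files in a directory; the helper name `count_images` is hypothetical, and the folder name `Flicker8k_Dataset` is taken from the extraction call later in this notebook.

```python
import os

def count_images(directory, extension=".jpg"):
    """Count files in `directory` whose name ends with `extension` (case-insensitive)."""
    return sum(1 for name in os.listdir(directory) if name.lower().endswith(extension))

# Only runs if the dataset folder is present in the working directory
if os.path.isdir('Flicker8k_Dataset'):
    print('Images found:', count_images('Flicker8k_Dataset'))
```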



**Importing relevant libraries**

In [0]:
from os import listdir   #Lists all names in a given directory's path
from keras.applications.vgg16 import VGG16   #VGG16 model's weights to extract features of images from last layer
from keras.preprocessing.image import load_img   #Load an image to a target PIL (Python Imaging Library) image
from keras.preprocessing.image import img_to_array   #Converts PIL image to an array
from keras.applications.vgg16 import preprocess_input   #Applies the preprocessing VGG16 expects to an image array
from keras.models import Model   #Used to rebuild VGG16 without its classification layer
from pickle import dump   #Save outputs to a pickle file

**Extracting features of image(s) in our directory and saving to a file**

In [28]:
#A function to extract features of images in a directory (Using VGG16 here)
def ImageFeature_Extractor(ImageDirectory):    
    #Loading the model
    model=VGG16(weights="imagenet")

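    #Restructuring the model (dropping the last softmax classification layer so as to retain the penultimate FC-4096 layer)
    #layers[-2] is used rather than model.layers.pop(), which can leave the model unchanged in newer Keras versions
    model=Model(inputs=model.inputs, outputs=model.layers[-2].output)

    #Summarizing our re-structured model
    model.summary()   #summary() prints the table itself; wrapping it in print() would also print None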

    #Extracting features from each image (jpg) in our image directory (located in the working directory)
    features=dict()
    for ImageName in listdir(ImageDirectory):   #iterating over the file names in ImageDirectory
        image_file_name=ImageDirectory + '/' + ImageName   #building the path to the image file
        image=load_img(image_file_name,target_size=(224,224))   #loading and resizing each image to VGG16's 224x224 input
        image=img_to_array(image)   #converting PIL image to array
        image=image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))   #adding a batch dimension: (1, 224, 224, 3)
        image=preprocess_input(image)   #final preprocessing of image for our model
        image_feature=model.predict(image,verbose=0)   #getting image features
        image_id=ImageName.split('.')[0]   #getting the image_id as name of image before '.'
        features[image_id]=image_feature   #mapping and storing image_features to image_id
        if len(features)%1000==0:
            print("No. of images processed",len(features))
    return features

#Calling the above created function for our ImageDirectory
features = ImageFeature_Extractor('Flicker8k_Dataset')
print('Extracted Features: %d' % len(features))

#Saving the extracted features to a file (for future use)
with open('features.pkl', 'wb') as features_file:   #the file is closed automatically on exit
    dump(features, features_file)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________
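
Since the features are saved for future use, a later session would need to read them back. The following is a minimal sketch of that step, assuming the file is the `features.pkl` dictionary written above; the helper name `load_features` is hypothetical. Pickle files should only be loaded from sources you trust.

```python
import os
from pickle import load

def load_features(path):
    """Load the {image_id: feature_array} dictionary saved with pickle.dump."""
    with open(path, 'rb') as f:
        return load(f)

# Only runs if the features file exists in the working directory
if os.path.exists('features.pkl'):
    features = load_features('features.pkl')
    print('Loaded features for %d images' % len(features))
    if features:
        some_id = next(iter(features))   # pick any one image ID
        print(some_id, features[some_id].shape)   # inspect the stored array's shape
```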

## Text Data

**Getting data**

In [16]:
#Getting the data from the web (-P saves the archive under drive/app)
!wget http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_text.zip -P drive/app

#Unzipping the data into the working directory
import zipfile
with zipfile.ZipFile('drive/app/Flickr8k_text.zip', 'r') as zip_ref:   #path must match the wget -P target
    zip_ref.extractall()

--2018-04-14 23:33:34--  http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_text.zip
Resolving nlp.cs.illinois.edu (nlp.cs.illinois.edu)... 192.17.58.132
Connecting to nlp.cs.illinois.edu (nlp.cs.illinois.edu)|192.17.58.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2340801 (2.2M) [application/zip]
Saving to: ‘drive/app/Flickr8k_text.zip’


2018-04-14 23:33:35 (5.04 MB/s) - ‘drive/app/Flickr8k_text.zip’ saved [2340801/2340801]
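
To see which files an archive contains before extracting it, the standard-library `zipfile` module can list the member names. This is a small sketch; the helper name `list_archive` is hypothetical, and the archive path assumes the `drive/app` download location shown in the log above.

```python
import os
import zipfile

def list_archive(path):
    """Return the names of all members stored in a zip archive."""
    with zipfile.ZipFile(path, 'r') as zf:
        return zf.namelist()

# Only runs if the archive has been downloaded to the assumed location
if os.path.exists('drive/app/Flickr8k_text.zip'):
    for name in list_archive('drive/app/Flickr8k_text.zip'):
        print(name)
```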



**Preparing text data (Cleaning and manipulating)**

In [0]:
#Importing: loading and reading the text file (containing photo identifiers and their multiple descriptions)
def import_file(input_file):
    with open(input_file,'r') as text_file:   #opening the file for reading; it is closed automatically on exit
        text=text_file.read()   #reading the whole file into a string
    return text

imported_file=import_file('Flickr8k.token.txt')

#Organising imported_file so as to map each image's ID to its descriptions
def organise_file(file):
    mapping=dict()
    for line in file.split('\n'):   #accessing each line in doc separated by '\n'
        tokens=line.split()   #tokenizing on whitespace into the image's ID followed by the description tokens
        if len(tokens)<2:   #skipping blank lines and lines that have no description
            continue
        image_id, image_desc=tokens[0],tokens[1:]   #getting ID and desc for images
        image_id=image_id.split('.')[0]   #removing the ".jpg#N" suffix from the image's filename
        image_desc=' '.join(image_desc)   #creating desc string from tokens
        if image_id not in mapping:   #creating an empty list the first time an imageID is seen
            mapping[image_id]=list()
        mapping[image_id].append(image_desc)   #appending each desc to the list of its imageID
    return mapping

organised_file=organise_file(imported_file)

#Cleaning data (lowercasing, removing punctuation, len(word)>1, retaining alphabetical words)
import string
def clean_desc(file):
    table=str.maketrans('','',string.punctuation)   #translation table that deletes punctuation characters
    for key, desc_list in file.items():
        for i in range(len(desc_list)):
            desc=desc_list[i]
            desc=desc.split()   #tokenizing
            desc=[word.lower() for word in desc]   #lowercasing
            desc=[w.translate(table) for w in desc]   #removing punctuation (before the isalpha filter, which would otherwise drop words like "dog's" entirely)
            desc=[word for word in desc if len(word)>1]   #removing single letters like 's' or 'a'
            desc=[word for word in desc if word.isalpha()]   #retaining alphabetical words
            desc_list[i]=' '.join(desc)   #replacing the original with the cleaned version of the text
    return file   #the descriptions are cleaned in place, so this is the same dict that was passed in

cln_orgnse_text=clean_desc(organised_file)

#Exporting: save desc to file, one per line
def export_file(output_file, filename):
    lines = list()
    for key, desc_list in output_file.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)   #one line per description, prefixed with its image ID
    data = '\n'.join(lines)   #joining lines separated by '\n'
    with open(filename, 'w') as file:   #opening the output file for writing; it is closed automatically on exit
        file.write(data)

export_file(cln_orgnse_text, 'cln_orgnse_text.txt')
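
To make the parsing and cleaning steps concrete, the sketch below applies them to a single made-up line laid out like an entry in `Flickr8k.token.txt` (image file name, `#`, caption index, then the caption, assumed to be whitespace-separated). The sample line and the helper names `parse_line` and `clean_caption` are illustrative and not part of the notebook. Note that this sketch strips punctuation before filtering for alphabetic words, so a token such as `dog's` becomes `dogs` rather than being dropped.

```python
import string

def parse_line(line):
    """Split one token-file line into (image_id, description)."""
    tokens = line.split()
    image_id = tokens[0].split('.')[0]   # drop the '.jpg#N' suffix
    return image_id, ' '.join(tokens[1:])

def clean_caption(desc):
    """Lowercase, strip punctuation, drop one-character and non-alphabetic words."""
    table = str.maketrans('', '', string.punctuation)
    words = [w.lower().translate(table) for w in desc.split()]
    return ' '.join(w for w in words if len(w) > 1 and w.isalpha())

# Made-up sample line (not taken from the dataset)
sample = "example_image_01.jpg#0\tA dog's ball rolls past 2 children ."
image_id, desc = parse_line(sample)
print(image_id)              # example_image_01
print(clean_caption(desc))   # dogs ball rolls past children
```

The exported file then holds one such cleaned caption per line, prefixed with its image ID.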