[View in Colaboratory](https://colab.research.google.com/github/SwapnilSParkhe/Project-Image_Caption_Generation/blob/master/Building_ModelArchitecture.ipynb)

# Building Model Architecture

## Analytical Data - Train & Validation (merged with preprocessed data)

**Image ID or Identifiers**

In [4]:
#Library for Uploading data from local to cloud
from google.colab import files

#Importing file: reading the content into a Python string
def import_file(input_file):
    with open(input_file,'r') as file:   #context manager closes the file, even on error
        content=file.read()   #reading the whole file at once
    return content

files.upload()   #upload files from local to cloud (using google.colab lib)
imported_train=import_file('Flickr_8k.trainImages.txt')    

#Creating a set of image-IDs
def create_img_set(file):
    imgID_set=set()
    for item in file.split('\n'):   #accessing line by line
        if len(item)<1:   #skipping empty lines
            continue
        imgID=item.split('.')[0]   #only taking imgID (dropping 'jpg')
        imgID_set.add(imgID)   #adding imgID to imgID_set
    return imgID_set

imgID_trainset=create_img_set(imported_train)

Saving Flickr_8k.trainImages.txt to Flickr_8k.trainImages.txt
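
As a quick illustration of what `create_img_set` returns, the sketch below runs the same line-by-line parsing on a small in-memory string instead of the uploaded file. The file names in `sample` are made up for the example and are not taken from the Flickr8k list.

```python
def create_img_set(file):
    """Return the set of image IDs (file names without their extension)."""
    img_ids = set()
    for item in file.split('\n'):
        if len(item) < 1:   # skip empty lines
            continue
        img_ids.add(item.split('.')[0])   # keep the part before '.jpg'
    return img_ids

# Hypothetical file contents: a duplicate and an empty line are included on purpose
sample = "img_0001.jpg\nimg_0002.jpg\n\nimg_0001.jpg\n"
sample_ids = create_img_set(sample)
print(sorted(sample_ids))   # ['img_0001', 'img_0002']
```

Because the result is a set, duplicates collapse and the membership checks used for the later "inner join" steps stay cheap.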


**Importing previously created files (from PreprocessingData NoteBook): Img desc and Img features**

In [5]:
#Importing image descriptions for the images in the training set
def import_prepro_desc(prepro_file, trainset):
    file=import_file(prepro_file)
    desc=dict()
    for item in file.split('\n'):
        tokens=item.split()   #splitting by whitespace
        if not tokens:   #skipping empty lines (avoids IndexError on tokens[0])
            continue
        image_ID,image_desc=tokens[0],tokens[1:]   #separating ID, desc
        if image_ID in trainset:   #inner join imgID & training imgID
            if image_ID not in desc:   #new list for new image_ID key
                desc[image_ID]=list()
            desc_='start ' + ' '.join(image_desc)+' end'   #wrap in start/end tokens
            desc[image_ID].append(desc_)
    return desc

files.upload()   #upload files from local to cloud (using google.colab lib)
desc_train=import_prepro_desc('cln_orgnse_text.txt',imgID_trainset)

#Importing image features for the images in the training set
from pickle import load
def import_features(feature_file, trainset):
    with open(feature_file, 'rb') as f:   #close the file once loaded
        all_features = load(f)  #load all features
    features = {k: all_features[k] for k in trainset} #inner join (KeyError if an ID has no features)
    return features
  
files.upload()   #upload files from local to cloud (using google.colab lib)
feature_train=import_features('features.pkl',imgID_trainset)

Saving cln_orgnse_text.txt to cln_orgnse_text (1).txt


Saving features.pkl to features.pkl
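
To make the "inner join" and the start/end wrapping concrete, here is a minimal sketch that applies the same logic to an in-memory string. The helper name `wrap_descriptions`, the image IDs, and the captions are all hypothetical; the real input is the preprocessed `cln_orgnse_text.txt` file.

```python
def wrap_descriptions(text, trainset):
    """Group captions by image ID, keeping only IDs present in trainset."""
    desc = {}
    for line in text.split('\n'):
        tokens = line.split()
        if not tokens:   # skip empty lines
            continue
        image_id, words = tokens[0], tokens[1:]
        if image_id in trainset:   # inner join on the training IDs
            wrapped = 'start ' + ' '.join(words) + ' end'
            desc.setdefault(image_id, []).append(wrapped)
    return desc

# Hypothetical preprocessed captions: "<image_id> <caption words...>"
sample_text = (
    "img_0001 dog runs on grass\n"
    "img_0001 brown dog is running\n"
    "img_0003 man rides bike\n"
)
sample_desc = wrap_descriptions(sample_text, {'img_0001', 'img_0002'})
print(sample_desc)
# {'img_0001': ['start dog runs on grass end', 'start brown dog is running end']}
```

`img_0003` is dropped because it is not in the training set, and `img_0002` does not appear because it has no caption in the sample text.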


**Creating a custom Tokenizer function: Tokenizing descriptions**

In [6]:
#Creating a simple list of desc from dict of desc
def dict2list(input_dict):
    desc_list=list()
    for key in input_dict.keys():
        desc_list.extend(input_dict[key])   #add all descriptions of this image
    return desc_list

desc_train_list=dict2list(desc_train)

#Tokenizing (could be improved by filtering English stopwords later)
#Note: turning each text into sequence of integers (integer: token ID)
from keras.preprocessing.text import Tokenizer
def tokenize(input_list):
    tokenizer=Tokenizer()
    tokenizer.fit_on_texts(input_list)
    return tokenizer

tokenizer=tokenize(desc_train_list)
vocab_size=len(tokenizer.word_index)+1
print("Vocab Size:",vocab_size)

Using TensorFlow backend.


Vocab Size: 7264
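
The `+1` in `vocab_size` accounts for index 0, which the Keras `Tokenizer` never assigns to a word and which is later used for padding. The sketch below is a simplified pure-Python stand-in for that word-to-integer mapping, not the Keras implementation: it only lowercases and splits on whitespace, and the function name `build_word_index` and the sample captions are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer ID, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

sample_texts = ['start dog runs end', 'start dog sleeps end']
word_index = build_word_index(sample_texts)
sample_vocab_size = len(word_index) + 1   # +1 because index 0 is reserved

encoded = [word_index[w] for w in 'start dog sleeps end'.split()]
print(word_index)          # {'start': 1, 'dog': 2, 'end': 3, 'runs': 4, 'sleeps': 5}
print(sample_vocab_size)   # 6
print(encoded)             # [1, 2, 5, 3]
```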


**LSTM's Analytical Dataset: Input(ImageID and Seq_item)-Output(SeqWord) data**

In [7]:
#Creating ADS for LSTM: Input(Image_ID and Seq_item)-Output(SeqWord)
from numpy import array   #needed for the array(...) calls in create_ADS
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
def create_ADS(tokenizer, max_length, desc_dict, img):
    X_img_ID, X_desc_item, y=list(), list(), list()
    for key, desc_list in desc_dict.items(): #key to access desc, img
        for desc in desc_list:
            seq=tokenizer.texts_to_sequences([desc])[0] #encoding seq
            for i in range(1,len(seq)):#split seq into multi X,y pairs
                in_seq, out_seq=seq[:i], seq[i] #desc input-output pair
                in_seq=pad_sequences([in_seq], maxlen=max_length)[0]
                out_seq=to_categorical([out_seq], num_classes=vocab_size)[0]
                X_img_ID.append(img[key][0]) #same image features repeated for every X-y pair of this image
                X_desc_item.append(in_seq)  #multi X-y pairs encoding
                y.append(out_seq)   #oneHot encoded version of output word
    return array(X_img_ID), array(X_desc_item), array(y)

#Longest desc check
def longest_desc(desc_list):
    max_len=max([len(item.split()) for item in desc_list])
    print("Max_len:",max_len)
    print("Desc:", [item for item in desc_list if len(item.split())==max_len])

longest_desc(desc_train_list)

Max_len: 33
Desc: ['start an man wearing green sweatshirt and blue vest is holding up dollar bills in front of his face while standing on busy sidewalk in front of group of men playing instruments end']
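
The way `create_ADS` expands one encoded description into several input-output pairs can be sketched without Keras. The example below is an illustration under assumptions, not the notebook's code: `make_pairs` is a hypothetical helper that mimics the default behavior of `pad_sequences` (zeros added on the left, truncation from the left) and of `to_categorical`, and the integer sequence is made up. In the notebook, `max_length` would presumably be a value such as the 33 printed above.

```python
def make_pairs(seq, max_length, vocab_size):
    """Split one encoded caption into (padded prefix, one-hot next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        in_seq, out_word = seq[:i], seq[i]
        in_seq = in_seq[-max_length:]                          # keep the last max_length tokens
        padded = [0] * (max_length - len(in_seq)) + in_seq     # left-pad with zeros
        one_hot = [0] * vocab_size
        one_hot[out_word] = 1                                  # one-hot encode the next word
        pairs.append((padded, one_hot))
    return pairs

# Hypothetical encoded caption, e.g. 'start dog sleeps end' -> [1, 2, 5, 3]
pairs = make_pairs([1, 2, 5, 3], max_length=5, vocab_size=6)
for padded, one_hot in pairs:
    print(padded, '->', one_hot.index(1))
# [0, 0, 0, 0, 1] -> 2
# [0, 0, 0, 1, 2] -> 5
# [0, 0, 1, 2, 5] -> 3
```

A caption of n tokens therefore yields n-1 training pairs, each of which is paired with the same image features in `create_ADS`.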
