<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Image_Captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to Use The Pre-Trained VGG Model to Classify Objects in Photographs

[link](https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/)

Convolutional neural networks are now capable of outperforming humans on some computer vision tasks, such as classifying images.

That is, given a photograph of an object, answer the question as to which of 1,000 specific objects the photograph shows.

A competition-winning model for this task is the VGG model by researchers at Oxford. What is important about this model, besides its capability of classifying objects in photographs, is that the model weights are freely available and can be loaded and used in your own models and applications.

### ImageNet

ImageNet is a research project to develop a large database of images with annotations, e.g. images and their descriptions.

The images and their annotations have been the basis for an image classification challenge called the ImageNet Large Scale Visual Recognition Challenge or ILSVRC since 2010. The result is that research organizations battle it out on pre-defined datasets to see who has the best model for classifying the objects in images.

For the classification task, images must be classified into one of 1,000 different categories.

For the last few years very deep convolutional neural network models have been used to win these challenges and results on the tasks have exceeded human performance.

### The Oxford VGG Models

Researchers from the Oxford Visual Geometry Group, or VGG for short, participate in the ILSVRC challenge.

In 2014, convolutional neural network (CNN) models developed by the VGG took first place in the ILSVRC localization task and second place in the classification task.

VGG released two different CNN models, specifically a 16-layer model and a 19-layer model.

The VGG models are no longer state-of-the-art, although they trail the best models by only a few percentage points. Nevertheless, they are very powerful models and useful both as image classifiers and as the basis for new models that use image inputs.

### Load the VGG Model in Keras

The VGG model can be loaded and used in the Keras deep learning library.

Keras provides an Applications interface for loading and using pre-trained models.

Using this interface, you can create a VGG model using the pre-trained weights provided by the Oxford group and use it as a starting point in your own model, or use it as a model directly for classifying images.

In this tutorial, we will focus on the use case of classifying new images using the VGG model.

Keras provides both the 16-layer and 19-layer version via the VGG16 and VGG19 classes. Let’s focus on the VGG16 model.

The model can be created as follows:



In [0]:
from keras.applications.vgg16 import VGG16

model = VGG16()

That’s it.

The first time you run this example, Keras will download the weight files from the Internet and store them in the ~/.keras/models directory.

Note that the weights are about 528 megabytes, so the download may take a few minutes depending on the speed of your Internet connection.

The weights are only downloaded once. The next time you run the example, the weights are loaded locally and the model should be ready to use in seconds.

We can use the standard Keras tools for inspecting the model structure.

For example, you can print a summary of the network layers as follows:



In [8]:
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

You can see that the model is huge.

You can also see that, by default, the model expects images as input with the size 224 x 224 pixels with 3 channels (e.g. color).
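
The Param # column and the overall size can be checked by hand. The sketch below is plain Python arithmetic, assuming the standard VGG16 layout: 3×3 convolutions with one bias per filter, five convolutional blocks each ending in 2×2 max pooling, and three fully connected layers. The layers beyond block2 do not appear in the truncated summary above and are taken from the published architecture; `conv_params` and `dense_params` are illustrative helper names, not Keras functions.

```python
def conv_params(in_channels, out_channels, kernel=3):
    # one kernel x kernel x in_channels filter plus one bias per output channel
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(in_units, out_units):
    # one weight per input-output pair plus one bias per output unit
    return (in_units + 1) * out_units

# Filters per conv layer, grouped by block (assumed standard VGG16 layout).
blocks = [[64, 64], [128, 128], [256, 256, 256], [512, 512, 512], [512, 512, 512]]

total = 0
channels, size = 3, 224
for block in blocks:
    for filters in block:
        total += conv_params(channels, filters)
        channels = filters
    size //= 2  # each block ends in 2x2 max pooling

flat = size * size * channels  # 7 * 7 * 512 = 25088 inputs to the first dense layer
for units in (4096, 4096, 1000):
    total += dense_params(flat, units)
    flat = units

print(conv_params(3, 64))        # 1792, matches block1_conv1
print(conv_params(64, 64))       # 36928, matches block1_conv2
print(total)                     # 138357544
print(round(total * 4 / 2**20))  # 528
```

At 4 bytes per float32 weight, the roughly 138 million parameters come to about 528 MiB, which is in line with the download size quoted earlier.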

![](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Plot-of-Layers-in-the-VGG-Model.png)

The VGG16() class takes a few arguments that may only interest you if you are looking to use the model in your own project, e.g. for transfer learning.


For example:

- include_top (True): Whether or not to include the output layers for the model. You don’t need these if you are fitting the model on your own problem.
- weights ('imagenet'): What weights to load. You can specify None to not load pre-trained weights if you are interested in training the model yourself from scratch.
- input_tensor (None): A new input layer if you intend to fit the model on new data of a different size.
- input_shape (None): The size of images that the model is expected to take if you change the input layer.
- pooling (None): The type of pooling to use when you are training a new set of output layers.
- classes (1000): The number of classes (e.g. size of output vector) for the model.


Next, let’s look at using the loaded VGG model to classify ad hoc photographs.



### Develop a Simple Photo Classifier

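First, we need an image to classify; the example below uses a photograph saved in the working directory as 4994221690_d070e8a355_z.jpg. We then load the image as pixel data and prepare it to be presented to the network.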

Keras provides some tools to help with this step.

First, we can use the load_img() function to load the image and resize it to the required size of 224×224 pixels.

In [11]:
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
# load img from file
image = load_img(path='./4994221690_d070e8a355_z.jpg', target_size=(224, 224))

# Next, we can convert the pixels to a NumPy array so that we can work with it in Keras.
# We can use the img_to_array() function for this.

image = img_to_array(img=image)

print (image.shape)

(224, 224, 3)


The network expects one or more images as input; that means the input array will need to be 4-dimensional: `[samples, rows, columns, channels]`.

We only have one sample (one image). We can reshape the array by calling reshape() and adding the extra dimension.
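
As a small illustration of adding the sample dimension, the sketch below uses a zero-filled NumPy array as a stand-in for the loaded photo (an assumption made so the snippet runs without an image file). It shows that `reshape()` and `np.expand_dims()` produce the same batch shape.

```python
import numpy as np

# Stand-in for the array returned by img_to_array(): rows, columns, channels
image = np.zeros((224, 224, 3), dtype=np.float32)

# Option 1: reshape, prepending a sample dimension of size 1
batch_a = image.reshape((1,) + image.shape)

# Option 2: expand_dims inserts the new axis at position 0
batch_b = np.expand_dims(image, axis=0)

print(batch_a.shape)  # (1, 224, 224, 3)
print(batch_b.shape)  # (1, 224, 224, 3)
```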


Next, the image pixels need to be prepared in the same way as the ImageNet training data was prepared. Specifically, from the paper:

> The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.

Keras provides a function called preprocess_input() to prepare new input for the network.
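
To make the mean-subtraction step concrete, here is a minimal NumPy sketch of what the VGG16 preprocessing does in its default mode, as far as I understand it: the channels are reordered from RGB to BGR and a fixed per-channel ImageNet mean is subtracted, with no further scaling. `vgg_style_preprocess` is a hypothetical helper written for illustration; in practice you would call preprocess_input() itself.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (blue, green, red)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_style_preprocess(batch_rgb):
    """Illustrative stand-in for preprocess_input(); not the Keras function."""
    batch = np.asarray(batch_rgb, dtype=np.float32)
    batch_bgr = batch[..., ::-1]          # reorder channels: RGB -> BGR
    return batch_bgr - IMAGENET_MEAN_BGR  # zero-center each channel

# A uniform mid-gray batch stands in for a real photo
batch = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)
prepared = vgg_style_preprocess(batch)

print(prepared.shape)     # (1, 224, 224, 3)
print(prepared[0, 0, 0])  # roughly [24.061 11.221 4.32]
```

The shape is unchanged; only the pixel values shift, which is why the shape printed after preprocessing matches the shape before it.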





In [13]:
# reshape data for the model
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

print (image.shape)

(1, 224, 224, 3)


In [14]:
from keras.applications.vgg16 import preprocess_input

# prepare the image for the VGG model
image = preprocess_input(image)

print (image.shape)

(1, 224, 224, 3)


We are now ready to make a prediction for our loaded and prepared image.

We can call the predict() function on the model in order to get a prediction of the probability of the image belonging to each of the 1000 known object types.



In [16]:
# predict the probability across all output classes
yhat = model.predict(image)

print (yhat.shape)

(1, 1000)


Keras provides a function to interpret the probabilities called decode_predictions().

It can return a list of classes and their probabilities in case you would like to present the top 3 objects that may be in the photo.

We will just report the first most likely object.
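
decode_predictions() accepts a `top` argument (5 by default) that controls how many classes are returned per image. The idea behind it can be sketched in NumPy as a sort on the probability vector; the four class names and probabilities below are toy values invented for the example, and `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(probabilities, class_names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    order = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    return [(class_names[i], float(probabilities[i])) for i in order]

# Toy stand-ins for the 1,000 ImageNet classes and one row of yhat
class_names = ['teapot', 'coffee_mug', 'espresso', 'cup']
probs = np.array([0.04, 0.73, 0.08, 0.15])

ranked = top_k(probs, class_names, k=3)
for name, p in ranked:
    print('%s (%.2f%%)' % (name, p * 100))
```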

In [24]:
from keras.applications.vgg16 import decode_predictions

# convert the probabilities to class labels
label = decode_predictions(yhat)

print (len(label[0]))

# retrieve the most likely result, e.g. highest probability

label = label[0][0]

print (label)

print('%s (%.2f%%)' % (label[1], label[2]*100))

5
('n03063599', 'coffee_mug', 0.7336321)
coffee_mug (73.36%)


## How to Develop a Deep Learning Photo Caption Generator from Scratch

[link](https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/)

### Download and extract the dataset

In [0]:
import shutil
from urllib.request import urlopen
from zipfile import ZipFile

zipurl = 'https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip'

# Stream the download straight to a file on disk so the whole archive
# is never held in memory; both handles are closed when the block exits
with urlopen(zipurl) as zipresp, open("/tmp/Flickr8k_Dataset.zip", "wb") as tempzip:
  shutil.copyfileobj(zipresp, tempzip)


In [0]:
# Re-open the downloaded archive and extract its contents;
# extractall() creates the target directory automatically
with ZipFile("/tmp/Flickr8k_Dataset.zip") as zf:
  zf.extractall(path='./Flickr8k_Dataset')

In [0]:
# Extract the caption archive; this assumes Flickr8k_text.zip
# has already been placed in the working directory
with ZipFile("./Flickr8k_text.zip") as zf:
  zf.extractall(path='./Flickr8k_text')

The dataset is present in the following locations:

1. Flickr8k_Dataset
2. Flickr8k_text

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, ballpark BLEU scores for skillful models evaluated on the test dataset are:

- BLEU-1: 0.401 to 0.578.
- BLEU-2: 0.176 to 0.390.
- BLEU-3: 0.099 to 0.260.
- BLEU-4: 0.059 to 0.170.

We describe the BLEU metric in more detail later, when we work on evaluating our model.
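
To give a rough feel for what a BLEU-1 score measures, here is a minimal pure-Python sketch: clipped unigram precision of a generated caption against its reference captions, multiplied by a brevity penalty. It is a simplified illustration, not the evaluation code used later; the function name `bleu_1` and the example captions are made up for this sketch, and a library implementation such as NLTK's would normally be used instead.

```python
import math
from collections import Counter

def bleu_1(references, candidate):
    """Unigram BLEU for one tokenized candidate against tokenized references."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # A candidate word is credited at most as often as it occurs in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty, using the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    penalty = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return penalty * precision

references = [
    'child in pink dress climbing stairs'.split(),
    'girl going into wooden building'.split(),
]
candidate = 'girl in pink dress on stairs'.split()

score = bleu_1(references, candidate)
print('BLEU-1: %.3f' % score)  # BLEU-1: 0.833
```

Five of the six generated words appear in a reference and the lengths match, so the score is 5/6. BLEU-2 to BLEU-4 extend the same idea to longer n-grams, which is why their typical values are lower.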

Next, let’s look at how to load the images.

### Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos.

There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that placed first in localization and second in classification in the ImageNet competition in 2014. Learn more about the model here:

[Very Deep Convolutional Networks for Large-Scale Visual Recognition](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)

Keras provides this pre-trained model directly. Note, the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.


We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the “photo features” using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different to running the photo through the full VGG model; it is just we will have done it once in advance.

This is an optimization that will make training our models faster and consume less memory.
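
The compute-once, reuse-later pattern can be sketched independently of Keras. In the snippet below, `expensive_features` is a hypothetical stand-in for running a photo through VGG16, and `load_or_compute` and the cache file name are likewise invented for illustration; the point is only that the second call reads the pickle instead of recomputing.

```python
import os
import pickle
import tempfile

calls = []  # records which photos were actually processed

def expensive_features(photo_id):
    # Hypothetical stand-in for running one photo through VGG16.
    calls.append(photo_id)
    return [float(len(photo_id)), 0.0, 1.0]

def load_or_compute(photo_ids, cache_path):
    # Reuse the saved features when the cache file already exists.
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    features = {pid: expensive_features(pid) for pid in photo_ids}
    with open(cache_path, 'wb') as f:
        pickle.dump(features, f)
    return features

cache_path = os.path.join(tempfile.mkdtemp(), 'features_demo.pkl')
first = load_or_compute(['photo_a', 'photo_b'], cache_path)   # computes and saves
second = load_or_compute(['photo_a', 'photo_b'], cache_path)  # loads from disk

print(first == second, len(calls))  # True 2
```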

We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo into the preferred size for the model (e.g. 3 channel 224 x 224 pixel image).

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.

In [0]:
from os import listdir
from pickle import dump

from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named 'features.pkl'.

In [6]:
model=VGG16()

model.summary()



Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [28]:
print (model.layers[-1])

<keras.layers.core.Dense object at 0x7f0d203f2c50>


In [34]:
print ("No of images:", len(listdir(path='./Flickr8k_Dataset/Flicker8k_Dataset/')))

No of images: 8091


In [0]:
def extract_features(directory):
  """
  extract features from each photo in the directory
  """
  
  # load the model
  model = VGG16()
  
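  # restructure the model: skip the final classification layer and expose
  # the output of the last hidden layer (a 4,096-element vector) instead;
  # indexing layers[-2] avoids relying on model.layers.pop(), which does
  # not change the model in newer Keras versions
  model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
  
  # summarize (summary() prints the table itself and returns None)
  model.summary()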
  
  
  # extract features from each photo
  features = dict()
  
  # Return a list containing the names of the files in the directory.
  for name in listdir(path=directory):
    
    # load an image from file
    filename = directory + '/' + name
    image = load_img(path=filename, target_size=(224,224))
    
    # convert the image pixels to a numpy array
    image = img_to_array(img=image)
    
    # reshape data for the model
    # The network expects one or more images as input; 
    # that means the input array will need to be 4-dimensional: 
    # [samples, rows, columns, and channels]
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    
    # prepare the image for the VGG model
    image = preprocess_input(image)
    
    # get features
    feature = model.predict(x=image, verbose=0)
    
    # get image id
    image_id = name.split('.')[0]
    
    # store feature in the dict
    features[image_id] = feature
    print('>%s' % name)
    
    
  return features
  
directory = './Flickr8k_Dataset/Flicker8k_Dataset/'

features = extract_features(directory)
print('Extracted Features: %d' % len(features))

In [0]:
# save to file; the with-block makes sure the file is flushed and closed
with open('features.pkl', 'wb') as f:
  dump(features, f)

### Prepare Text Data

The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.


First, we will load the file containing all of the descriptions.



In [0]:
# load the doc into memory
def load_doc(filename):
  # open the file as read only; the with-block closes it automatically
  with open(file=filename, mode='r') as file:
    # read all text
    text = file.read()
  return text

filename = './Flickr8k_text/Flickr8k.token.txt'

doc = load_doc(filename)

In [44]:
print(doc[:200])

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A lit


Each photo has a unique identifier. This identifier is used on the photo filename and in the text file of descriptions.

Next, we will step through the list of photo descriptions. The load_descriptions() function below takes the loaded document text and returns a dictionary that maps each photo identifier to a list of one or more textual descriptions.

In [49]:
doc.split("\n")[:10][0].split()[1:]

['A',
 'child',
 'in',
 'a',
 'pink',
 'dress',
 'is',
 'climbing',
 'up',
 'a',
 'set',
 'of',
 'stairs',
 'in',
 'an',
 'entry',
 'way',
 '.']

In [0]:
# extract descriptions for images
def load_descriptions(doc):
  mapping=dict()

  # process line by line
  for line in doc.split("\n"):
    # split line by white space
    tokens = line.split()
    # skip blank or malformed lines that lack an id plus at least one word
    if len(tokens) < 2:
      continue
    # take the first token as the image id, the rest as the description
    image_id, image_desc = tokens[0], tokens[1:]

    # remove the file extension and caption index (e.g. '.jpg#0') from the image id
    image_id = image_id.split('.')[0]

    # convert description tokens back to string
    image_desc = ' '.join(image_desc)

    # create an empty list for a new image_id
    if image_id not in mapping:
      mapping[image_id] = list()

    # append the description for the corresponding image_id
    mapping[image_id].append(image_desc)

  return mapping

# parse descriptions
descriptions = load_descriptions(doc)
  

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

- Convert all words to lowercase.
- Remove all punctuation.
- Remove all words that are one character or less in length (e.g. ‘a’).
- Remove all words with numbers in them.

The clean_descriptions() function below takes the dictionary of image identifiers to descriptions, steps through each description, and cleans the text in place.



In [0]:
import string

def clean_descriptions(descriptions):
  # prepare translation table for removing punctuation
  table = str.maketrans('', '', string.punctuation)
  
  for key, desc_list in descriptions.items():
    # for each desc of an image:
    for i in range(len(desc_list)):
      desc = desc_list[i]

      # tokenize
      desc = desc.split()

      # convert to lowercase
      desc = [word.lower() for word in desc]

      # remove punctuation from each token
      desc = [w.translate(table) for w in desc]

      # remove hanging 's' and 'a'
      desc = [word for word in desc if len(word)>1]

      # remove tokens with numbers in them
      desc = [word for word in desc if word.isalpha()]

      # replace it in that index position
      desc_list[i] = ' '.join(desc)


# clean descriptions
clean_descriptions(descriptions)
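
As a quick check of the cleaning rules, the sketch below applies the same four steps to a single caption taken from the descriptions file. `clean_caption` is a hypothetical single-string helper written for this illustration; it is not part of the notebook's pipeline.

```python
import string

def clean_caption(caption):
    """Apply the four cleaning rules to one caption (illustrative helper)."""
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().split()               # lowercase and tokenize
    words = [w.translate(table) for w in words]   # strip punctuation
    # drop one-character (or empty) tokens and tokens containing digits
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return ' '.join(words)

raw = 'A child in a pink dress is climbing up a set of stairs in an entry way .'
cleaned = clean_caption(raw)
print(cleaned)
# child in pink dress is climbing up set of stairs in an entry way
```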
      

Once cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that will train faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.

In [66]:
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all distinct words across every description
	all_desc = set()
	for key in descriptions.keys():
		for d in descriptions[key]:
			all_desc.update(d.split())
	return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)

print(vocabulary)

print('Vocabulary Size: %d' % len(vocabulary))

Vocabulary Size: 8763


Finally, we can save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.

The save_descriptions() function below takes a dictionary containing the mapping of identifiers to descriptions and a filename, and saves the mapping to that file.

In [0]:
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
  lines = list()
  
  for key, desc_list in descriptions.items():
    for desc in desc_list:
      lines.append(key + ' ' + desc)
  
  data = '\n'.join(lines)
  # the with-block closes the file even if the write fails
  with open(file=filename, mode='w') as file:
    file.write(data)
  
  
save_descriptions(descriptions, 'descriptions.txt')