# Image Embeddings

**Authors:** Itamar Zaltsman<br>
**Date created:** 2021/06/12<br>
**Description:** Creating image embeddings using a Siamese network model.

## Introduction

The goal of our project is to find similar products in large datasets. This can serve a company that wants to make sure it offers the best prices, or a customer who wants to find alternatives.

In both cases, the results must be returned quickly.
Assuming we already know the main retailers we will be working with, we can significantly reduce the runtime by preparing image embeddings in advance.
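
To illustrate why preparing embeddings in advance helps, here is a minimal sketch, assuming the embeddings are stored as rows of a NumPy array. The helper `top_k_similar` and the toy vectors are hypothetical and not part of the project code. At query time only a matrix product over the stored vectors is needed; no stored image has to pass through the network again.

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Return indices of the k stored embeddings most similar to `query` (cosine similarity)."""
    # Normalize rows so that a dot product equals cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = emb_norm @ q_norm
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-scores)[:k]

# Toy "precomputed" embeddings (hypothetical values).
stored = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

nearest = top_k_similar(np.array([1.0, 0.0]), stored, k=2)
print(nearest)
```

For very large catalogues a brute-force product like this may become the bottleneck, in which case an approximate nearest-neighbour index is a common alternative.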


## Setup

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import tensorflow as tf
from pathlib import Path
from tensorflow.keras import applications
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import optimizers
from tensorflow.keras import metrics
from tensorflow.keras import Model
from tensorflow.keras.applications import resnet
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import pandas as pd

target_shape = (200, 200)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load datasets

We load the zip file containing all images, `train_images.zip`, and 3 csv files:

  * `X_train.csv` contains the paths to the images that the model was trained on.
  * `X_val.csv` contains the paths to the images that we will use to evaluate our model.
  * `X_test.csv` contains the paths to the images that we will use as a test set.

In [None]:
! unzip /content/drive/MyDrive/ITC/final_project/Shopee/data/train_images.zip

! mkdir train_images
! mv *.jpg train_images

In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/ITC/final_project/Shopee/data/X_train.csv')
X_val = pd.read_csv('/content/drive/MyDrive/ITC/final_project/Shopee/data/X_val.csv')
X_test = pd.read_csv('/content/drive/MyDrive/ITC/final_project/Shopee/data/X_test.csv')

X_train.shape, X_val.shape, X_test.shape

((27444, 6), (3357, 6), (3449, 6))


## Preparing the data

We are going to use a `tf.data` pipeline to load the images we want to create embeddings for.

We'll set up the pipeline from a list of image paths. The pipeline will load and preprocess the corresponding images.

In [None]:
def preprocess_image(filename):
    """
    Load the specified file as a JPEG image, preprocess it and
    resize it to the target shape.
    """

    image_string = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image_string, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, target_shape)
    return image

In [None]:
def build_images_dataset(X, path_to_dir):
  """
  Return a batched `tf.data.Dataset` of preprocessed images.
  `X` must have an 'image' column with file names relative to `path_to_dir`.
  """

  images = X['image'].apply(lambda x: path_to_dir + x).tolist()

  images_dataset = tf.data.Dataset.from_tensor_slices(images)

  images_dataset = images_dataset.map(preprocess_image)

  images_dataset = images_dataset.batch(32, drop_remainder=False)
  images_dataset = images_dataset.prefetch(8)

  return images_dataset
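
Because `drop_remainder=False`, the final partial batch is kept, so the dataset yields exactly one preprocessed image per row of `X`. The arithmetic can be sketched in plain Python; the helper `batch_sizes` is hypothetical and not part of the notebook. It is applied here to the 3357 rows of `X_val` reported above.

```python
def batch_sizes(num_images, batch_size=32):
    """Sizes of the batches produced when the last partial batch is kept."""
    full, remainder = divmod(num_images, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes

sizes = batch_sizes(3357)  # number of rows in X_val
print(len(sizes), sizes[-1], sum(sizes))  # 105 29 3357
```

With `drop_remainder=True` the last 29 images would be dropped, and the embeddings would no longer line up row by row with `X`.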

## Setting up the embedding model

In [None]:
# ResNet50 based embedding model

base_cnn = resnet.ResNet50(
    weights="imagenet", input_shape=target_shape + (3,), include_top=False
)

flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)
# Optional: L2-normalize the output, as the VGG-based model below does
# output = layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(output)

embedding = Model(base_cnn.input, output, name="Embedding")

# Freeze every layer before "conv5_block1_out"; that layer and all later ones stay trainable
trainable = False
for layer in base_cnn.layers:
    if layer.name == "conv5_block1_out":
        trainable = True
    layer.trainable = trainable

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
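
The freezing loop in the ResNet50 cell relies on a flag that flips once: every layer visited before `conv5_block1_out` is frozen, while that layer and everything after it remains trainable. A minimal sketch of the same pattern, using a hypothetical `FakeLayer` stand-in and a shortened, illustrative list of layer names instead of the real model:

```python
class FakeLayer:
    """Stand-in for a Keras layer: only a name and a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

# Hypothetical, shortened list of layer names in model order.
names = ["conv1_conv", "conv4_block6_out", "conv5_block1_out", "conv5_block3_out"]
layers_in_order = [FakeLayer(n) for n in names]

trainable = False
for layer in layers_in_order:
    if layer.name == "conv5_block1_out":
        trainable = True  # flips once and stays True for all later layers
    layer.trainable = trainable

flags = [(layer.name, layer.trainable) for layer in layers_in_order]
print(flags)
```

Note that the pattern depends on the iteration order of `base_cnn.layers`; if the named layer is never encountered, every layer ends up frozen.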


In [None]:
# vgg based embedding model

base_vgg = tf.keras.applications.vgg16.VGG16(
    include_top=False,
    input_shape=target_shape + (3,)
)

vgg_embedding = tf.keras.Sequential([
    base_vgg,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation=None),  # No activation on final dense layer
    tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),  # L2 normalize embeddings
])

base_vgg.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
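
The final `Lambda` layer of the VGG-based model L2-normalizes each embedding. One consequence is that the dot product of two embeddings equals their cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks this with a NumPy analogue of row-wise L2 normalization; the function `l2_normalize` and the input values are illustrative assumptions, not project code.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """NumPy analogue of row-wise L2 normalization."""
    squared_norm = np.sum(x * x, axis=axis, keepdims=True)
    return x / np.sqrt(np.maximum(squared_norm, eps))

raw = np.array([[3.0, 4.0], [1.0, 0.0]])  # hypothetical raw dense-layer outputs
emb = l2_normalize(raw)

cosine = emb[0] @ emb[1]
sq_dist = np.sum((emb[0] - emb[1]) ** 2)

print(np.linalg.norm(emb, axis=1))  # every row has unit length
print(cosine, sq_dist)              # sq_dist equals 2 - 2 * cosine
```

This means that ranking neighbours by Euclidean distance or by cosine similarity gives the same order for normalized embeddings.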


In [None]:
# Use the VGG-based model and load its saved weights.
# Comment out this cell to use the ResNet50-based model instead.

embedding = vgg_embedding

checkpoint_filepath = '/content/drive/MyDrive/ITC/final_project/Shopee/siamese_model/vgg_embedding_checkpoint'
embedding.load_weights(checkpoint_filepath)

In [None]:
# Alternatively, load the entire saved model instead of only the weights

model_filepath = '/content/drive/MyDrive/ITC/final_project/Shopee/siamese_model/vgg_embedding_model'
embedding = tf.keras.models.load_model(model_filepath)

## Embedding

In [None]:
path_to_dir = '/content/train_images/'
X = X_val.copy()


# build image data generator
images_dataset = build_images_dataset(X, path_to_dir)

# embedding
image_embeddings = embedding.predict(images_dataset)

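# saving embeddings and data to csv
# NOTE: `file_path` is not defined anywhere in this notebook; the suffix below is an assumed placeholder
file_path = '_val.csv'
image_embeddings_dir = '/content/drive/MyDrive/ITC/final_project/Shopee/data/siamese_image_embedding' + file_path
X_emb_dir = '/content/drive/MyDrive/ITC/final_project/Shopee/data/siamese_data' + file_path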

np.savetxt(image_embeddings_dir, image_embeddings, delimiter=",")
X.to_csv(X_emb_dir)
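
The saved embeddings can later be read back with `np.loadtxt`. Since the pipeline does not shuffle, row `i` of the embedding file should correspond to row `i` of the saved data frame. The round trip below is a minimal sketch using a temporary file and a small hypothetical array in place of `image_embeddings`:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for `image_embeddings`: 4 images, 3 dimensions.
fake_embeddings = np.arange(12, dtype=np.float32).reshape(4, 3) / 7.0

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.csv")
    np.savetxt(path, fake_embeddings, delimiter=",")
    # ndmin=2 keeps a 2-D shape even if only one embedding was saved.
    loaded = np.loadtxt(path, delimiter=",", ndmin=2)

print(loaded.shape)
print(np.allclose(loaded, fake_embeddings))
```

Text CSV is convenient to inspect but grows large for many high-dimensional embeddings; a binary format such as `np.save` may be a more compact option.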