# Image similarity based recommendation system for E-commerce platforms

This project uses the image dataset provided by [Amazon-Berkeley](https://amazon-berkeley-objects.s3.amazonaws.com/index.html), specifically the small version of the dataset. The first step is to import the necessary packages.

In [4]:
import numpy as np
from tensorflow import keras
from glob import glob
from PIL import Image
import os
from annoy import AnnoyIndex
from tqdm import tqdm
import pickle as pkl
import albumentations as albu
from pyspark.sql import SparkSession


### Data preprocessing with Spark
The dataset contains some images with an extreme aspect ratio (much wider than tall, or the reverse). These are essentially text labels of the products and are of no use for our application. We remove them with Spark, which lets us filter out such images quickly and with little code. Start by initializing a Spark session.

In [3]:
spark = SparkSession.builder.getOrCreate()



Load all the images in the nested directory into a spark dataframe.

In [4]:
image_df = spark.read.format("image").load("./small/*/*")

                                                                                



Show the dataframe schema

In [5]:
image_df.printSchema()

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)



Display the first few entries

In [8]:
image_df.select("image.origin", "image.width", "image.height").show(truncate=False)


+-------------------------------------------------------------------------------+-----+------+
|origin                                                                         |width|height|
+-------------------------------------------------------------------------------+-----+------+
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/f5/f570c185.jpg|244  |256   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/dc/dc7e130d.png|256  |144   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/81/81614547.jpg|241  |256   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/51/51b6b436.jpg|256  |256   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/09/098bc917.jpg|256  |256   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/a4/a42faf60.jpg|256  |256   |
|file:/home/raj/Projects/Image_RecommendationSystem/images/small/e2/e21f8a61.jpg|256  |256   |
|file:/home/raj/Projects/Image_RecommendationSyste

                                                                                

Next, we select the images whose width and height differ by at least 200 pixels and collect their paths into a list.

In [8]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in abs
size_gap = F.abs(image_df.image.width - image_df.image.height)
paths = image_df.filter(size_gap >= 200).select("image.origin").rdd.flatMap(lambda x: x).collect()
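
The filter rule can be checked in plain Python before running it on the whole dataset. The sketch below is only an illustration: `is_label_like` is a hypothetical helper, and the last two sizes are made up. In the rows shown in the table the longer side is 256, in which case the rule selects only images whose shorter side is at most 56 pixels.

```python
def is_label_like(width: int, height: int, threshold: int = 200) -> bool:
    """True when the two sides differ by at least `threshold` pixels."""
    return abs(width - height) >= threshold


# (width, height) pairs: the first two come from the table above,
# the last two are made up for illustration.
sizes = [(244, 256), (256, 144), (256, 40), (30, 256)]
selected = [s for s in sizes if is_label_like(*s)]
print(selected)  # [(256, 40), (30, 256)]
```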

                                                                                

Remove all the filtered images

In [None]:
for path in paths:
    # lstrip('file:') strips a set of characters, not a prefix; removeprefix needs Python 3.9+
    os.remove(path.removeprefix('file:'))
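
Turning an origin string into a local path deserves some care, because `str.lstrip` removes any leading characters from the given set rather than a literal prefix. It happens to work for absolute paths only because the `/` after `file:` stops the stripping. A minimal sketch of a prefix-based alternative follows; `strip_file_scheme` is a hypothetical helper and the example strings are made up.

```python
def strip_file_scheme(origin: str) -> str:
    """Drop a leading 'file:' scheme from an origin string, leaving the path."""
    prefix = "file:"
    return origin[len(prefix):] if origin.startswith(prefix) else origin


# Made-up origin value in the same shape as Spark's image.origin column
print(strip_file_scheme("file:/data/small/f5/example.jpg"))  # /data/small/f5/example.jpg

# lstrip treats its argument as a character set: the second "file" is eaten too
print("file:file_copy.jpg".lstrip("file:"))  # _copy.jpg
```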

### Feature extraction with pre-trained VGG16 model

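Import the pre-trained VGG16 model from Keras and attach a convolution layer with 128 filters on top of it, followed by global average pooling, to reduce the output to 128 features. This stack acts as our feature extractor. Note that the added convolution layer is randomly initialized and is never trained here, so the saved model must be reused at query time for the features to be comparable.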

In [None]:
vgg = keras.applications.VGG16(input_shape = (256, 256, 3), include_top = False, weights = 'imagenet')
x = keras.layers.Conv2D(128, 5, activation='relu')(vgg.output)
x = keras.layers.GlobalAveragePooling2D()(x)
model = keras.Model(inputs=vgg.input, outputs=x)
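
The feature length of 128 follows from the layer shapes, which can be worked out without loading the model. The sketch below is plain arithmetic, not a Keras call. It assumes the standard VGG16 layout (five blocks of 'same'-padded convolutions, each ending in a 2x2 max-pool with stride 2) and Keras's default 'valid' padding for the added `Conv2D`; `conv_valid` is a hypothetical helper.

```python
def conv_valid(size: int, kernel: int, stride: int = 1) -> int:
    """Output size along one axis for a 'valid'-padded convolution or pooling."""
    return (size - kernel) // stride + 1


size = 256
# 'same'-padded convolutions keep the spatial size; only the five pools shrink it.
for _ in range(5):
    size = conv_valid(size, kernel=2, stride=2)
vgg_out = size
print(vgg_out)   # 8, so vgg.output is 8 x 8 x 512

conv_out = conv_valid(vgg_out, kernel=5)
print(conv_out)  # 4, so Conv2D(128, 5) yields 4 x 4 x 128

# GlobalAveragePooling2D averages each 4 x 4 map to one value per filter.
feature_length = 128
```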

Since our method is unsupervised, it is likely to be biased towards objects of the same colour, and the object of interest may also sit in a cluttered environment. We therefore apply a pre-processing step that enhances the contrast of each image and then converts it to grayscale, using the albumentations library.

In [None]:
preprocess = albu.Compose([
    albu.CLAHE(p=1),
    albu.ToGray(p=1),
])

Get all the image paths in our dataset

In [5]:
image_paths = glob("./small/*/*")

The dataset has a very large number of images, so it is wise to use a data generator for the feature extraction. Let's define one first.

In [None]:
class DataGenerator(keras.utils.Sequence):
    'Generates batches of preprocessed images for Keras'
    def __init__(self, image_paths, batch_size):
        'Initialization'
        super().__init__()
        self.image_paths = image_paths
        self.batch_size = batch_size

    def __len__(self):
        'Denotes the number of batches per epoch'
        # ceil keeps the final partial batch; floor would silently drop it
        return int(np.ceil(len(self.image_paths) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Select the paths that belong to this batch
        paths = self.image_paths[index*self.batch_size:(index+1)*self.batch_size]

        # Generate data
        X, y = self.__data_generation(paths)

        return X, y

    def __data_generation(self, paths):
        'Generates up to batch_size samples' # X : (n_samples, 256, 256, 3)
        # Size the array by the actual number of paths so a partial batch has no uninitialized rows
        X = np.empty((len(paths), 256, 256, 3), dtype=np.float32)
        y = list()

        # Generate data
        for i, path in enumerate(paths):
            image = Image.open(path)
            image = image.convert('RGB')
            image = image.resize((256, 256))
            image = np.array(image)
            image = preprocess(image = image)
            # Scale pixel values to [0, 1]
            X[i,] = image['image']/255
            y.append(path)

        return X, y
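
How the rounding in `__len__` interacts with the slicing in `__getitem__` can be checked in isolation. The sketch below uses a hypothetical helper, `batch_slices`, and a made-up count of 250 images. With floor rounding the trailing images are never visited, while ceil rounding yields a shorter final batch.

```python
import math


def batch_slices(n_items: int, batch_size: int, keep_partial: bool = True):
    """Return the (start, stop) index pairs that a batch-by-index generator would slice."""
    round_fn = math.ceil if keep_partial else math.floor
    n_batches = round_fn(n_items / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]


print(batch_slices(250, 100))                      # [(0, 100), (100, 200), (200, 250)]
print(batch_slices(250, 100, keep_partial=False))  # [(0, 100), (100, 200)]
```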

Instantiate the data generator with a batch size of 100

In [None]:
generator = DataGenerator(image_paths, 100)

Extract features from the image batches and save each batch to a file, together with the corresponding file paths, in two separate folders: features and labels.

In [None]:
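os.makedirs("./features", exist_ok=True)  # open() does not create missing directories
os.makedirs("./labels", exist_ok=True)
for i, (im, p) in enumerate(tqdm(generator)):
    with open("./features/batch{}.npy".format(i), "wb") as f:
        # Convert the output tensor to a NumPy array before saving
        np.save(f, model(im).numpy())
    with open("./labels/batch{}.npy".format(i), "wb") as g:
        np.save(g, np.array(p))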

### Building an Approximate Nearest Neighbor search model 

Given the size of the database, an exact nearest neighbor search would be very slow. To mitigate this, we use the approximate nearest neighbor search implemented in Spotify's Annoy library. We load the features from each of the files generated earlier, normalize each feature vector, and add it to an Annoy index initialized with a feature length of 128. We also build the list of image files corresponding to each feature index.

In [None]:
feature_files = os.listdir("./features")
t = AnnoyIndex(128, 'euclidean')
files_list = list()
c = 0
for file in feature_files:
    fea = np.load(os.path.join("./features", file))
    file_names = np.load(os.path.join("./labels", file))
    for i in range(fea.shape[0]):
        t.add_item(c, fea[i]/np.linalg.norm(fea[i]))
        c += 1
        files_list.append(file_names[i])
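
Normalizing each vector before indexing has a useful consequence: for unit vectors the squared Euclidean distance equals 2 - 2·cos, so ranking by Euclidean distance gives the same order as ranking by cosine similarity. The sketch below checks the identity in plain Python on two made-up vectors; the helper names are hypothetical and this is not part of Annoy.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]  # made-up feature vectors
d = euclidean(normalize(a), normalize(b))

# For unit vectors: d^2 = 2 - 2 * cos(a, b)
identity_holds = abs(d * d - (2 - 2 * cosine(a, b))) < 1e-12
print(identity_holds)
```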

Build the Annoy index with 10 trees and save it to a file that an actual application can load.

In [None]:
t.build(10)
t.save("nearest_neighbor.ann")

Save the feature extraction model along with the file list.

In [None]:
model.save('model')
# The with block closes the file, so the pickle is fully flushed to disk
with open("files.pkl", "wb") as f:
    pkl.dump(files_list, f)
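
A minimal sketch of how an application might query the saved artifacts follows. `recommend` is a hypothetical helper, not part of the project. It assumes the query feature is a 1-D vector produced by the same model and preprocessing as above, and it applies the same normalization used when the index was built. The function accepts any object exposing Annoy's `get_nns_by_vector` method.

```python
import math


def recommend(index, files, feature, k=5):
    """Return (path, distance) pairs for the k nearest indexed images."""
    norm = math.sqrt(sum(x * x for x in feature))
    unit = [x / norm for x in feature]  # same normalization as at indexing time
    ids, dists = index.get_nns_by_vector(unit, k, include_distances=True)
    return [(files[i], d) for i, d in zip(ids, dists)]


# Intended use with the files saved above (not executed here):
#   from annoy import AnnoyIndex
#   import pickle as pkl
#   index = AnnoyIndex(128, 'euclidean')
#   index.load("nearest_neighbor.ann")
#   with open("files.pkl", "rb") as f:
#       files = pkl.load(f)
#   pairs = recommend(index, files, feature_vector)  # feature_vector: assumed 128 values
```

The index stores only integer ids, so the pickled file list is what maps each result back to an image path.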