<a href="https://colab.research.google.com/github/Andre-Williams22/fashion-recommendation-system/blob/master/Recommender_From_Scratch_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## User and Product Vector Recommendation Model

In [1]:
!pip install keras_resnet

Collecting keras_resnet
  Downloading keras-resnet-0.2.0.tar.gz (9.3 kB)
Building wheels for collected packages: keras-resnet
  Building wheel for keras-resnet (setup.py) ... done
  Created wheel for keras-resnet: filename=keras_resnet-0.2.0-py2.py3-none-any.whl size=20486 sha256=3228766c46bc3955daae5ef1cdd54a909814dc18252a0659bdaafa331f182b76
  Stored in directory: /root/.cache/pip/wheels/bd/ef/06/5d65f696360436c3a423020c4b7fd8c558c09ef264a0e6c575
Successfully built keras-resnet
Installing collected packages: keras-resnet
Successfully installed keras-resnet-0.2.0


## Steps for building our Model 

1. Pick random numbers for each customer and each product 
2. Find a score for each customer and product 
3. Rank according to these scores 
4. Tweak the customer and product vectors to get better rankings
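
As a rough illustration of steps 1 to 3, here is a minimal NumPy sketch. The sizes (`n_users`, `n_products`, `dim`) and the random seed are made up for the example and are not taken from the dataset; step 4 is the training loop and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_products, dim = 4, 6, 3  # toy sizes, chosen for illustration only

# Step 1: random vectors for every user and every product
user_vecs = rng.normal(scale=0.05, size=(n_users, dim))
product_vecs = rng.normal(scale=0.05, size=(n_products, dim))

# Step 2: score every (user, product) pair with a dot product
scores = user_vecs @ product_vecs.T  # shape (n_users, n_products)

# Step 3: rank products for each user, best score first
ranking = np.argsort(-scores, axis=1)

print(scores.shape)
print(ranking[0])
```

With random vectors the ranking is meaningless; it only becomes useful once the vectors have been adjusted against real purchases.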

## Goals 
We start with random numbers, but over time two things happen:
1. The user vector comes to capture the tastes and preferences of the user.
2. The product vector comes to capture the product's style and features.

This works because purchases are shared across users. If I purchase three items and you purchase two of those three, training pushes both of our user vectors toward high scores on the two products we have in common, so our vectors end up similar. Your vector then naturally gives a high score to the third product, the one I bought and you didn't, and that product is recommended to you.


## Potential Success Metrics 

We know our model is good if the items a user actually buys are ranked at the top of that user's list of ranked products.
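
One way to turn this into a number is to look at where the purchased item lands in the ranking. The sketch below is an assumption about how such a metric could be computed, not the metric used later in this notebook; `purchased_rank` and the example scores are hypothetical.

```python
import numpy as np

def purchased_rank(scores, purchased_index):
    """Return the 1-based rank of the purchased item; 1 means it scored highest."""
    # Count how many candidates scored strictly higher than the purchased one
    return int(np.sum(scores > scores[purchased_index])) + 1

# Hypothetical scores for five candidate products; index 2 is the purchase
scores = np.array([0.1, 0.4, 0.9, -0.2, 0.3])
rank = purchased_rank(scores, purchased_index=2)
hit_at_3 = rank <= 3  # did the purchase make the top 3?
print(rank, hit_at_3)
```

Averaging `hit_at_3` over many purchases would give a hit rate, where higher is better.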

# Imports

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os

In [11]:
import tensorflow as tf
import keras
from keras import Model
tf.__version__

'2.7.0'

# Import training data

In [12]:
train = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_train_with_alphanumeric_dummy_ids.parquet")
valid = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_valid_with_alphanumeric_dummy_ids.parquet")
dummy_users = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_dummy_users_with_alphanumeric_dummy_ids.csv", header=None).values.flatten().astype(str)
products = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_productIds.csv", header=None).values.flatten().astype(int)

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
# train.to_csv('train_with_alphanumeric_dummy_and_product_ids.csv')
# valid.to_csv('valid_with_alphanumeric_dummy_and_product_ids.csv')

In [15]:
train.shape, valid.shape, dummy_users.shape, products.shape

((165042, 2), (35567, 2), (43607,), (29696,))

In [16]:
train.shape

(165042, 2)

In [17]:
valid.shape

(35567, 2)

In [18]:
train

Unnamed: 0,dummyUserId,productId
0,b'PIXcm7Ru5KmntCy0yA1K',10524048
1,b'd0RILFB1hUzNSINMY4Ow',9137713
2,b'Ebax7lyhnKRm4xeRlWW2',5808602
3,b'vtigDw2h2vxKt0sJpEeU',10548272
4,b'r4GfiEaUGxziyjX0PyU6',10988173
...,...,...
165037,b'7Eom5Ancozj01ozGxAMK',9071435
165038,b'zi9vZETHqSIZK0TM2nZc',10413104
165039,b'fVCveec9P946asY5wqGm',9859881
165040,b'VJtfpw602SZHh2qwarK4',10809487


In [19]:
valid

Unnamed: 0,dummyUserId,productId
0,b'I4Yc5Ztur3UNwY5SdvDh',10093853
1,b'nhWgcxEVY7jQ3MvvNxWL',12306408
2,b'3vriQXKwG095rvR1MSrz',11858310
3,b'MA8KmOxkGd1JQ42GXDGO',10072124
4,b'vax7VgJnswdiC8iHZSCi',10596405
...,...,...
35562,b'A5uRhbiMu4vlKCB9A0rc',8496251
35563,b'VVyZSPXhX62iE9AtPDen',11935204
35564,b'8ACEWhSBG4eyhzHmwf4C',10494419
35565,b'bSTkDJMjlco6hheq9lTQ',11270014


In [20]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [21]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

# Define a Recommender Model

The embedding layer gives a list of random numbers for each user and each product.

In [22]:
embed1= tf.keras.layers.Embedding(5, 8)

In [23]:
embed1(2)

<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([ 0.01134228, -0.00071181, -0.04613136,  0.01131717,  0.04415751,
        0.03265735,  0.03003417,  0.01619543], dtype=float32)>

In [24]:
embed1.get_weights()

[array([[ 0.00553378, -0.03432848, -0.00446044, -0.03160959,  0.04659954,
          0.00285913,  0.02077935, -0.01533008],
        [ 0.00661918,  0.04777214, -0.01108087, -0.02085702,  0.01950866,
          0.02604799,  0.04205657, -0.02668432],
        [ 0.01134228, -0.00071181, -0.04613136,  0.01131717,  0.04415751,
          0.03265735,  0.03003417,  0.01619543],
        [-0.01576966, -0.01180685, -0.04220375, -0.00179936,  0.0349973 ,
         -0.02003375, -0.00542346, -0.01737714],
        [ 0.01708038,  0.04596308,  0.03740125, -0.04706345,  0.04394009,
         -0.0206771 ,  0.04210819,  0.04616683]], dtype=float32)]
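
Comparing the two outputs above, `embed1(2)` is the third row of the weight matrix: an embedding lookup is row indexing into a table. A minimal NumPy sketch of the same idea follows; the `table` and `embed` names are illustrative, and the value range is only chosen to resemble the numbers shown above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 5 x 8 table of small random numbers, playing the role of Embedding(5, 8)
table = rng.uniform(-0.05, 0.05, size=(5, 8))

def embed(index):
    """Look up one row, or several rows when given a list of indices."""
    return table[index]

print(embed(2).shape)          # (8,)
print(embed([1, 3, 4]).shape)  # (3, 8)
```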

Scores can be found using the dot product.

In [25]:
# create an embedding for users and products 

# pass in list length of dummy users
dummy_user_embedding = tf.keras.layers.Embedding(len(dummy_users), 6)
product_embedding = tf.keras.layers.Embedding(len(products), 6)

In [26]:
# look up the embedding at index 1 (the second user, since indices start at 0)
dummy_user_embedding(1)


<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.0077244 ,  0.03596382, -0.04919672,  0.04469961, -0.01508667,
       -0.00804823], dtype=float32)>

In [27]:
# grab embeddings of a specific product 
product_embedding(20)

<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([-0.04205021,  0.01066536, -0.02175217, -0.01927186, -0.02585499,
       -0.04035393], dtype=float32)>

In [28]:
# take the dot product of the user embedding and the product embedding to get a score
tf.tensordot(dummy_user_embedding(1), product_embedding(20), axes=[[0], [0]])

<tf.Tensor: shape=(), dtype=float32, numpy=0.000982288>

We can score multiple products at the same time, which is what we need to create a ranking.

In [29]:
example_products = tf.constant([1, 77, 104, 2043])
product_embedding(example_products)

<tf.Tensor: shape=(4, 6), dtype=float32, numpy=
array([[ 0.04414732, -0.02179394,  0.00709669, -0.00265307,  0.01048913,
         0.04217174],
       [-0.031643  ,  0.02694793, -0.0142641 ,  0.02123118,  0.0341704 ,
         0.03573424],
       [-0.02723848, -0.04511378, -0.04691534,  0.03485561,  0.00365376,
        -0.02157381],
       [-0.02272536,  0.00721973,  0.02097309, -0.00396943, -0.02678778,
        -0.04204319]], dtype=float32)>

We can also score one user against several products in a single call; doing the same for many users at once is what we will need in order to train quickly.

In [30]:
tf.tensordot(dummy_user_embedding(1), product_embedding(example_products), axes=[[0], [1]])


<tf.Tensor: shape=(4,), dtype=float32, numpy=array([-0.00140816,  0.00157238,  0.00215176, -0.00038262], dtype=float32)>
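
The cell above scores one user against four products. Extending this to a batch of users, each with its own list of candidates, can be sketched in NumPy with `einsum`. The batch size, candidate count, and random vectors here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 6

# Hypothetical batch: 3 users, each with 4 candidate products
user_vecs = rng.normal(size=(3, dim))          # (batch, dim)
candidate_vecs = rng.normal(size=(3, 4, dim))  # (batch, candidates, dim)

# Dot each user's vector with that user's own candidates
batch_scores = np.einsum("bd,bcd->bc", user_vecs, candidate_vecs)

print(batch_scores.shape)  # (3, 4)
```

Each row of `batch_scores` holds one user's scores, so a ranking per user is an `argsort` along the last axis.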

But we need to map product ids to embedding ids.

In [31]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [32]:
# build a hash table that maps each product ID to its row index in the embedding
product_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(products, dtype=tf.int32), 
                                        range(len(products))), -1)

In [33]:
product_table.lookup(tf.constant([8650774]))

<tf.Tensor: shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>
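
The table above was created with a default value of -1, which is what a lookup returns for an ID it has never seen. The plain-Python sketch below mimics that behaviour with a dictionary; `id_to_row` and `lookup` are illustrative names, and 123 is a made-up ID.

```python
product_ids = [8650774, 9306139, 9961521]  # first three IDs from the products array

# Map each product ID to its row in the embedding matrix
id_to_row = {pid: row for row, pid in enumerate(product_ids)}

def lookup(pid, default=-1):
    """Return the embedding row for pid, or the default for unknown IDs."""
    return id_to_row.get(pid, default)

print(lookup(8650774))  # 0
print(lookup(123))      # -1 (123 is a made-up ID that is not in the list)
```

Since -1 is not a valid embedding row, it is probably worth filtering out unknown IDs before they reach the embedding layer.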

Let's put those two things together.

In [55]:
class RecommenderModel(tf.keras.Model):
    def __init__(self, dummy_users, products, length_of_embedding):
        super(RecommenderModel, self).__init__()
        self.products = tf.constant(products, dtype=tf.int32)
        self.dummy_users = tf.constant(dummy_users, dtype=tf.string)
        self.dummy_user_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.dummy_users, range(len(dummy_users))), -1)
        self.product_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.products, range(len(products))), -1)
        
        self.user_embedding = tf.keras.layers.Embedding(len(dummy_users), length_of_embedding)
        self.product_embedding = tf.keras.layers.Embedding(len(products), length_of_embedding)
        
        # used to calculate the dot product
        self.dot = tf.keras.layers.Dot(axes=-1)


    def call(self, inputs):
        user = inputs[0]
        products = inputs[1]
        
        # lookup in the table 
        user_embedding_index = self.dummy_user_table.lookup(user)
        product_embedding_index = self.product_table.lookup(products)

        user_embedding_values = self.user_embedding(user_embedding_index)
        product_embedding_values = self.product_embedding(product_embedding_index)

        # squeeze only the singleton user axis so a batch of size 1 keeps its batch dimension
        return tf.squeeze(self.dot([user_embedding_values, product_embedding_values]), axis=1)

    
    @tf.function
    def call_item_item(self, product):
        '''Find similar products by applying item-item similarity.'''
        # map the product ID to its embedding row
        product_x = self.product_table.lookup(product)
        # look up the embedding of the query product
        pe = tf.expand_dims(self.product_embedding(product_x), 0)

        # note: this only works if the layer has been built!
        all_pe = tf.expand_dims(self.product_embedding.embeddings, 0)
        # dot the query product's embedding with every product's embedding
        scores = tf.reshape(self.dot([pe, all_pe]), [-1])
        # keep the 100 highest-scoring products
        top_scores, top_indices = tf.math.top_k(scores, k=100)
        top_ids = tf.gather(self.products, top_indices)
        return top_ids, top_scores

In [56]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [57]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [58]:
rec1 = RecommenderModel(dummy_users, products, length_of_embedding=15)

# score three candidate products for each of two users (one row of scores per user)
rec1([tf.constant([['pmfkU4BNZhmtLgJQwJ7x'], ['UDRRwOlzlWVbu7H8YCCi']]), 
      tf.constant([[8650774, 9306139,9961521],[12058614, 12058615, 11927550] ])])

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 0.0051709 , -0.0010311 ,  0.0079342 ],
       [-0.00168053,  0.00144482,  0.00316149]], dtype=float32)>

# Creating a dataset

First create a tf.data.Dataset from the user purchase pairs.

In [59]:
dummy_user_tensor = tf.constant(train[["dummyUserId"]].values, dtype=tf.string)
product_tensor = tf.constant(train[["productId"]].values, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))
for x, y in dataset:
    print(x)
    print(y)
    break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
tf.Tensor([10524048], shape=(1,), dtype=int32)


In [60]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [61]:
random_negative_indexs = tf.random.uniform((7, ), minval=0, maxval=len(products), dtype=tf.int32) 

random_negative_indexs

<tf.Tensor: shape=(7,), dtype=int32, numpy=array([13331, 21150,  9752, 16416, 24937,  8214, 18054], dtype=int32)>

In [62]:
tf.gather(products, random_negative_indexs)

<tf.Tensor: shape=(7,), dtype=int64, numpy=
array([ 8579194,  9531312,  8169846,  8723890,  9155417,  5279307,
       10697667])>

In [63]:
products[18218]

10698347

For each purchase let's sample a number of products that the user did not purchase. Then the model can score each of the products and we will know we are doing a good job if the product with the highest score is the product that the user actually purchased.

We can do this using dataset.map
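
Before wiring this into `tf.data`, the idea can be sketched in NumPy for a single purchase. The `make_example` helper and the toy `catalog` are assumptions for illustration. Negatives are drawn uniformly from the whole catalog, so a draw can occasionally coincide with the purchased product; with a large catalog this is rare.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_example(purchased_id, all_product_ids, num_negatives):
    """Build (candidates, label) with the purchased product in position 0."""
    # Uniform random draws stand in for "products the user did not buy"
    negatives = rng.choice(all_product_ids, size=num_negatives, replace=True)
    candidates = np.concatenate([[purchased_id], negatives])
    # One-hot label: position 0 is the true purchase
    label = np.zeros(num_negatives + 1, dtype=np.float32)
    label[0] = 1.0
    return candidates, label

catalog = np.arange(100, 120)  # hypothetical product IDs
candidates, label = make_example(105, catalog, num_negatives=4)
print(candidates, label)
```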

In [64]:
tf.one_hot(0, depth=11)

<tf.Tensor: shape=(11,), dtype=float32, numpy=array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

In [65]:
class Mapper():
    
    def __init__(self, possible_products, num_negative_products):
        self.num_possible_products = len(possible_products)
        self.possible_products_tensor = tf.constant(possible_products, dtype=tf.int32)
        
        self.num_negative_products = num_negative_products
        self.y = tf.one_hot(0, num_negative_products+1)
    
    def __call__(self, user, product):
        # draw random product indices to use as negatives; sampling is uniform over
        # all products, so a draw can occasionally match the purchased product
        random_negative_indexs = tf.random.uniform((self.num_negative_products, ), minval=0, maxval=self.num_possible_products, dtype=tf.int32)

        negatives = tf.gather(self.possible_products_tensor, random_negative_indexs)

        # the purchased product goes first, followed by the sampled negatives
        candidates = tf.concat([product, negatives], axis=0)

        return (user, candidates), self.y


In [66]:
# build a dataset where each example pairs the purchased product with 10 randomly sampled negatives
dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor)).map(Mapper(products, 10))

dataset

<MapDataset shapes: (((1,), (11,)), (11,)), types: ((tf.string, tf.int32), tf.float32)>

In [67]:
for (u, c), y in dataset:
  print(u)
  print("1 product they bought and 10 they didn't buy",c)
  print("one hot encoded values for what they bought",y)
  break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
1 product they bought and 10 they didn't buy tf.Tensor(
[10524048 10277069 10559376  9637845 12361270 10529177 11266426 11574727
 10929466 11419079 12386555], shape=(11,), dtype=int32)
one hot encoded values for what they bought tf.Tensor([1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(11,), dtype=float32)


In [68]:
def get_dataset(df, products, num_negative_products):
    dummy_user_tensor = tf.constant(df[["dummyUserId"]].values, dtype=tf.string)
    product_tensor = tf.constant(df[["productId"]].values, dtype=tf.int32)

    dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))

    dataset = dataset.map(Mapper(products, num_negative_products))

    # learn from multiple users at a time instead of one
    dataset = dataset.batch(1024)

    return dataset



In [69]:
for (u, c), y in get_dataset(train, products, 4):
  print(u)
  print(c)
  print(y)
  break

tf.Tensor(
[[b'PIXcm7Ru5KmntCy0yA1K']
 [b'd0RILFB1hUzNSINMY4Ow']
 [b'Ebax7lyhnKRm4xeRlWW2']
 ...
 [b'xuX9n8PHfSR0AP3UZ8ar']
 [b'iNnxsPFfOa9884fMjVPJ']
 [b'aD8Mn12im8lFPzXAY41P']], shape=(1024, 1), dtype=string)
tf.Tensor(
[[10524048  9920895 10002601 11518783 11934272]
 [ 9137713 10331694 12712816 11973717 10374110]
 [ 5808602 12787213 11404589 10811863 10105094]
 ...
 [11541336 11907906 10062628 11266690  9103553]
 [ 7779232  9099410 12973786 11794140 10742469]
 [ 4941259 12812240 11923383 12568105 12284417]], shape=(1024, 5), dtype=int32)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 ...
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]], shape=(1024, 5), dtype=float32)


# Train a model

We need to compile a model, set the loss and create an evaluation metric. Then we need to train the model.
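
Because the purchased product always sits in position 0 of the candidate list, the loss is a softmax cross-entropy over the candidate scores with the target at index 0. The NumPy sketch below shows that computation for one example; `softmax_cross_entropy` is an illustrative helper, not the Keras implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, label_index=0):
    """Cross-entropy of one example, computed from raw scores (logits)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label_index]

# All-equal scores, as with an untrained model: loss is log(number of candidates)
untrained = softmax_cross_entropy(np.zeros(101))
# Purchased product (position 0) scored well above the rest: loss is near zero
confident = softmax_cross_entropy(np.array([10.0, 0.0, 0.0]))
print(untrained, confident)
```

With 100 negatives per purchase there are 101 candidates, so a loss near log(101) would suggest the model has not yet learned anything.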

In [70]:
model = RecommenderModel(dummy_users, products, 15)
# Treat ranking as classification: among the candidates, predict which one was purchased
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),  # the model outputs raw scores, not probabilities
              optimizer=tf.keras.optimizers.SGD(learning_rate=100),  # plain stochastic gradient descent
              metrics=[tf.keras.metrics.CategoricalAccuracy()])

model.fit(get_dataset(train, products, 100), validation_data = get_dataset(valid, products, 100), epochs=1)



<keras.callbacks.History at 0x7f5b5c56fe10>

Let's do a manual check on whether the model is any good.
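
The item-to-item lookup used for this check scores one product's embedding against every product's embedding and keeps the highest scores. A minimal NumPy sketch of that idea follows; `top_k_similar` and the 2-d vectors are hypothetical.

```python
import numpy as np

def top_k_similar(product_vecs, query_row, k):
    """Return the rows and scores of the k products with the highest dot product."""
    scores = product_vecs @ product_vecs[query_row]
    top_rows = np.argsort(-scores)[:k]
    return top_rows, scores[top_rows]

# Hypothetical 2-d embeddings for four products
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
rows, sims = top_k_similar(vecs, query_row=0, k=2)
print(rows, sims)
```

With raw dot products the query item usually ranks first, but that is not guaranteed, because a product with a much larger vector norm can outscore it.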

In [71]:
test_product = 11698965

In [72]:
print("Recs for item {}: {}".format(test_product, model.call_item_item(tf.constant(test_product, dtype=tf.int32))))

Recs for item 11698965: (<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([11698965,  6538011, 12111280, 10350386, 11378722,  9518140,
       11599017, 10183377, 12271958, 10245790, 12761976,  5026837,
       10614508,  9790722,  8320798, 11887832,  8615341, 12537101,
       11712888, 11389407,  9918646, 12207186, 11292580,  8316444,
       12689143, 12318972, 11723543, 12694212,  5120714,  9229560,
       11497880, 11531679,  9639361,  8745089, 11663540, 12356083,
       10377022,  8925167, 12125614, 11698846, 10624663, 10715863,
       10309914, 10351396,  5377071,  9819136, 10360318,  9108375,
       11210565, 10412439, 10943563,  7913858, 10686823, 10439335,
       10623500, 10183845, 11960361, 12779924, 10111374, 10171589,
       10239696,  8701479, 11368542, 12183213,  9143728, 11037862,
        9681455, 10735532, 11542348, 10253015, 12896059, 10783801,
       10100368,  7298323, 12032507, 10301200, 11395154, 10550480,
       12276093, 10493218, 12125335, 12770781,  8725667, 1