<a href="https://colab.research.google.com/github/Andre-Williams22/fashion-recommendation-system/blob/master/Recommender_From_Scratch_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## User and Product Vector Recommendation Model

In [1]:
!pip install keras_resnet

Collecting keras_resnet
  Downloading keras-resnet-0.2.0.tar.gz (9.3 kB)
Building wheels for collected packages: keras-resnet
  Building wheel for keras-resnet (setup.py) ... done
  Created wheel for keras-resnet: filename=keras_resnet-0.2.0-py2.py3-none-any.whl size=20486 sha256=4588118a794a34543786eafbb53b59ce468b171478e74a312dff806b19d54c03
  Stored in directory: /root/.cache/pip/wheels/bd/ef/06/5d65f696360436c3a423020c4b7fd8c558c09ef264a0e6c575
Successfully built keras-resnet
Installing collected packages: keras-resnet
Successfully installed keras-resnet-0.2.0


## Steps for building our Model 

1. Pick random numbers for each customer and each product 
2. Find a score for each customer and product 
3. Rank according to these scores 
4. Tweak the customer and product vectors to get better rankings 
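
The four steps above can be sketched in a few lines of NumPy. This is only an illustration under toy assumptions: the sizes, the seed, the `step` value and the hand-written update in step 4 are all made up here, and the notebook itself leaves the tweaking to a TensorFlow optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

num_users, num_products, dim = 3, 5, 4  # toy sizes, chosen for illustration

# Step 1: random vectors for each user and each product
user_vectors = rng.normal(scale=0.05, size=(num_users, dim))
product_vectors = rng.normal(scale=0.05, size=(num_products, dim))

# Step 2: score every (user, product) pair with a dot product
scores = user_vectors @ product_vectors.T  # shape (num_users, num_products)

# Step 3: rank products for each user, best score first
ranking = np.argsort(-scores, axis=1)

# Step 4 (one hand-written update): nudge user 0 and product 2 towards each
# other, as if user 0 had bought product 2
step = 0.5  # hypothetical step size
old_user = user_vectors[0].copy()
user_vectors[0] += step * product_vectors[2]
product_vectors[2] += step * old_user
new_score = user_vectors[0] @ product_vectors[2]

print("score before:", scores[0, 2], "score after:", new_score)
```

After the update the score for that pair goes up, which is the sense in which tweaking the vectors improves the ranking of a purchased product.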

## Goals 
We start with random numbers, but as training proceeds two things happen:
1. The user vector comes to capture the user's tastes and preferences.
2. The product vector comes to capture the product's style and features.


This works because of shared purchases. If I buy three items and you buy two of those three, our user vectors are probably similar, since both have to give high scores to the two products we have in common. Your vector would then naturally give a high score to the third product, the one I bought and you did not, so that product would be recommended to you.


## Potential Success Metrics 

We know our model is good if the items a user buys are ranked at the top of that user's ranked list.
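
One way to turn this into a number is to look at the rank of the purchased item and at how often it lands in the top k. The sketch below is an assumption about how such a metric could be computed, not something the notebook defines; the helper names `purchased_rank` and `hit_rate_at_k` and the toy scores are hypothetical.

```python
import numpy as np

def purchased_rank(scores, purchased_index):
    """Return the 1-based rank of the purchased item (1 means top of the list)."""
    # count how many items score strictly higher than the purchased one
    return int(np.sum(scores > scores[purchased_index])) + 1

def hit_rate_at_k(all_scores, purchased_indices, k):
    """Fraction of users whose purchased item is ranked in the top k."""
    ranks = [purchased_rank(s, p) for s, p in zip(all_scores, purchased_indices)]
    return sum(r <= k for r in ranks) / len(ranks)

# toy scores for two users over four products (made-up numbers)
toy_scores = np.array([[0.1, 0.9, 0.3, 0.2],
                       [0.8, 0.1, 0.4, 0.7]])
toy_purchases = [1, 2]  # user 0 bought product 1, user 1 bought product 2

print(hit_rate_at_k(toy_scores, toy_purchases, k=1))
print(hit_rate_at_k(toy_scores, toy_purchases, k=3))
```

Here user 0's purchase is ranked first and user 1's is ranked third, so the hit rate is 0.5 at k=1 and 1.0 at k=3.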

# Imports

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os

In [37]:
import tensorflow as tf
import keras
from keras import Model
from keras.layers import MaxPooling2D  # import the layer; the bare name on its own raised NameError
tf.__version__

# Import training data

In [50]:
train = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_train_with_alphanumeric_dummy_ids.parquet")
valid = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_valid_with_alphanumeric_dummy_ids.parquet")
dummy_users = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_dummy_users_with_alphanumeric_dummy_ids.csv", header=None).values.flatten().astype(str)
products = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_productIds.csv", header=None).values.flatten().astype(int)

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# train.to_csv('train_with_alphanumeric_dummy_and_product_ids.csv')
# valid.to_csv('valid_with_alphanumeric_dummy_and_product_ids.csv')

In [7]:
train.shape, valid.shape, dummy_users.shape, products.shape

((165042, 2), (35567, 2), (43607, 1), (29696, 1))

In [8]:
train.shape

(165042, 2)

In [9]:
valid.shape

(35567, 2)

In [10]:
train

| | dummyUserId | productId |
| --- | --- | --- |
| 0 | b'PIXcm7Ru5KmntCy0yA1K' | 10524048 |
| 1 | b'd0RILFB1hUzNSINMY4Ow' | 9137713 |
| 2 | b'Ebax7lyhnKRm4xeRlWW2' | 5808602 |
| 3 | b'vtigDw2h2vxKt0sJpEeU' | 10548272 |
| 4 | b'r4GfiEaUGxziyjX0PyU6' | 10988173 |
| ... | ... | ... |
| 165037 | b'7Eom5Ancozj01ozGxAMK' | 9071435 |
| 165038 | b'zi9vZETHqSIZK0TM2nZc' | 10413104 |
| 165039 | b'fVCveec9P946asY5wqGm' | 9859881 |
| 165040 | b'VJtfpw602SZHh2qwarK4' | 10809487 |


In [11]:
valid

| | dummyUserId | productId |
| --- | --- | --- |
| 0 | b'I4Yc5Ztur3UNwY5SdvDh' | 10093853 |
| 1 | b'nhWgcxEVY7jQ3MvvNxWL' | 12306408 |
| 2 | b'3vriQXKwG095rvR1MSrz' | 11858310 |
| 3 | b'MA8KmOxkGd1JQ42GXDGO' | 10072124 |
| 4 | b'vax7VgJnswdiC8iHZSCi' | 10596405 |
| ... | ... | ... |
| 35562 | b'A5uRhbiMu4vlKCB9A0rc' | 8496251 |
| 35563 | b'VVyZSPXhX62iE9AtPDen' | 11935204 |
| 35564 | b'8ACEWhSBG4eyhzHmwf4C' | 10494419 |
| 35565 | b'bSTkDJMjlco6hheq9lTQ' | 11270014 |


In [12]:
dummy_users

| | 0 |
| --- | --- |
| 0 | pmfkU4BNZhmtLgJQwJ7x |
| 1 | UDRRwOlzlWVbu7H8YCCi |
| 2 | QHGAef0TI6dhn0wTogvW |
| 3 | xkDvstQDkA6uJlOfslX7 |
| 4 | 44dM2SXR9BWX5e0ozkF8 |
| ... | ... |
| 43602 | 1hsyohz0i37hinx6KX8x |
| 43603 | oGSJHmWWvRq8vSbMq2XA |
| 43604 | lcORJ5hemOZc1iGo9z7k |
| 43605 | 5CqDquDAszqJp27P7AL8 |


In [13]:
products

| | 0 |
| --- | --- |
| 0 | 8650774 |
| 1 | 9306139 |
| 2 | 9961521 |
| 3 | 13238328 |
| 4 | 10485819 |
| ... | ... |
| 29691 | 11927533 |
| 29692 | 11272181 |
| 29693 | 12058614 |
| 29694 | 12058615 |


# Define a Recommender Model

The embedding layer gives a list of random numbers for each user and each product.

In [32]:
embed1 = tf.keras.layers.Embedding(5, 8)

In [34]:
embed1(2)

<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([-0.02866203,  0.01962494,  0.0169918 , -0.02567202,  0.00652628,
       -0.02465883, -0.01568688, -0.02088585], dtype=float32)>

In [36]:
embed1.get_weights()

[array([[-0.02958336, -0.03473998,  0.0145921 ,  0.03351617, -0.01130372,
         -0.03837059, -0.03909403, -0.00592911],
        [-0.04885773, -0.02455195, -0.04686287,  0.00506725, -0.03612177,
         -0.00311587, -0.01773872, -0.0283118 ],
        [-0.02866203,  0.01962494,  0.0169918 , -0.02567202,  0.00652628,
         -0.02465883, -0.01568688, -0.02088585],
        [-0.00779068, -0.04018898, -0.02437047, -0.01427352,  0.0081787 ,
         -0.03662685,  0.01933464, -0.01496754],
        [ 0.03784574,  0.0075291 ,  0.00362164, -0.0294101 ,  0.01330021,
         -0.02153105,  0.00383062, -0.00942637]], dtype=float32)]

Scores can be found using the dot product.

In [38]:
# create one embedding table for users and one for products;
# the first argument is the number of rows (one per user or product), the second is the vector length
dummy_user_embedding = tf.keras.layers.Embedding(len(dummy_users), 6)
product_embedding = tf.keras.layers.Embedding(len(products), 6)

In [41]:
# look up the embedding at index 1 (the second user, since indexing starts at 0)
dummy_user_embedding(1)


<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.02040986, -0.03331436,  0.01280453, -0.04748262,  0.02327034,
        0.00638724], dtype=float32)>

In [42]:
# look up the embedding of the product at index 20
product_embedding(20)

<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([-0.01519806,  0.00334023,  0.01265842, -0.01361679, -0.03619184,
       -0.03142122], dtype=float32)>

In [45]:
# the dot product of the user embedding and the product embedding is the score
tf.tensordot(dummy_user_embedding(1), product_embedding(20), axes=[[0], [0]])

<tf.Tensor: shape=(), dtype=float32, numpy=-0.00065571367>

We can score multiple products at the same time, which is what we need to create a ranking.

In [46]:
example_products = tf.constant([1, 77, 104, 2043])
product_embedding(example_products)

<tf.Tensor: shape=(4, 6), dtype=float32, numpy=
array([[ 0.04267508,  0.03982818,  0.0268993 , -0.04055069, -0.02315345,
        -0.04049768],
       [-0.04584204, -0.02910911, -0.04349979, -0.00588558,  0.00326096,
        -0.04877746],
       [ 0.03768641, -0.03590404,  0.04858345, -0.03359304, -0.02642949,
        -0.0246422 ],
       [-0.04525249,  0.03269703, -0.04453097,  0.0077301 ,  0.04832972,
         0.00418402]], dtype=float32)>

And we can score a user against several products in one call. The same batching extends to multiple users, which we will need if we are to train quickly.

In [47]:
tf.tensordot(dummy_user_embedding(1), product_embedding(example_products), axes=[[0], [1]])


<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0.00101657, -0.00047908,  0.00341005, -0.00179875], dtype=float32)>

But we need to map product ids to embedding ids.
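
In plain Python the mapping is just a dictionary from product id to row number. The sketch below only illustrates the idea; `product_to_index` and `lookup` are hypothetical names, and the `-1` default for unknown ids is chosen here to mirror the default used in the hash-table cells of this notebook.

```python
# a few product ids copied from the array shown in this notebook
product_ids = [8650774, 9306139, 9961521]

# map each product id to its row in the embedding matrix
product_to_index = {pid: row for row, pid in enumerate(product_ids)}

def lookup(pid, default=-1):
    """Return the embedding row for a product id, or `default` if unknown."""
    return product_to_index.get(pid, default)

print(lookup(8650774))   # first id in the list
print(lookup(123))       # an id that is not in the catalogue
```

A dictionary like this cannot be used inside a TensorFlow graph, which is presumably why the notebook builds the same mapping with `tf.lookup.StaticHashTable`.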

In [51]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [52]:
# build a hash table that maps each product id to its row index in the embedding
product_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(products, dtype=tf.int32), 
                                        range(len(products))), -1)

In [53]:
product_table.lookup(tf.constant([8650774]))

<tf.Tensor: shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>

Let's put those two things, the embeddings and the id lookup tables, together in one model.

In [113]:
class RecommenderModel(tf.keras.Model):
    def __init__(self, dummy_users, products, length_of_embedding):
        super(RecommenderModel, self).__init__()
        self.products = tf.constant(products, dtype=tf.int32)
        self.dummy_users = tf.constant(dummy_users, dtype=tf.string)
        self.dummy_user_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.dummy_users, range(len(dummy_users))), -1)
        self.product_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.products, range(len(products))), -1)
        
        self.user_embedding = tf.keras.layers.Embedding(len(dummy_users), length_of_embedding)
        self.product_embedding = tf.keras.layers.Embedding(len(products), length_of_embedding)
        
        # used to calculate the dot product
        self.dot = tf.keras.layers.Dot(axes=-1)


    def call(self, inputs):
        user = inputs[0]
        products = inputs[1]
        
        # lookup in the table 
        user_embedding_index = self.dummy_user_table.lookup(user)
        product_embedding_index = self.product_table.lookup(products)

        user_embedding_values = self.user_embedding(user_embedding_index)
        product_embedding_values = self.product_embedding(product_embedding_index)

        return tf.squeeze(self.dot([user_embedding_values, product_embedding_values]))

    
    @tf.function
    def call_item_item(self, product):
        product_x = self.product_table.lookup(product)
        pe = tf.expand_dims(self.product_embedding(product_x), 0)
        
        # note: this only works if the layer has been built!
        all_pe = tf.expand_dims(self.product_embedding.embeddings, 0)
        scores = tf.reshape(self.dot([pe, all_pe]), [-1])
        
        top_scores, top_indices = tf.math.top_k(scores, k=100)
        top_ids = tf.gather(self.products, top_indices)
        return top_ids, top_scores

In [114]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [115]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [116]:
rec1 = RecommenderModel(dummy_users, products, length_of_embedding=15)

# score three candidate products for each of two users (the model is untrained, so the scores are near zero)
rec1([tf.constant([['pmfkU4BNZhmtLgJQwJ7x'], ['UDRRwOlzlWVbu7H8YCCi']]), 
      tf.constant([[8650774, 9306139,9961521],[12058614, 12058615, 11927550] ])])

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-0.0037846 ,  0.00558789, -0.01052617],
       [-0.00210104,  0.00572162, -0.00126659]], dtype=float32)>

# Creating a dataset

First create a tf.data.Dataset from the user purchase pairs.

In [117]:
dummy_user_tensor = tf.constant(train[["dummyUserId"]].values, dtype=tf.string)
product_tensor = tf.constant(train[["productId"]].values, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))
for x, y in dataset:
    print(x)
    print(y)
    break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
tf.Tensor([10524048], shape=(1,), dtype=int32)


In [118]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [119]:
random_negative_indexs = tf.random.uniform((7, ), minval=0, maxval=len(products), dtype=tf.int32) 

random_negative_indexs

<tf.Tensor: shape=(7,), dtype=int32, numpy=array([20374, 10080, 14368, 13489, 19487, 26080, 22378], dtype=int32)>

In [120]:
tf.gather(products, random_negative_indexs)

<tf.Tensor: shape=(7,), dtype=int64, numpy=
array([10183845, 12759006,  9239049, 11725466, 10179766, 12175103,
       12420888])>

In [121]:
products[18218]

10698347

For each purchase let's sample a number of products that the user did not purchase. Then the model can score each of the products and we will know we are doing a good job if the product with the highest score is the product that the user actually purchased.

We can do this using dataset.map
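
The sampling idea can be sketched outside TensorFlow first. This NumPy version is an illustrative variant, not the notebook's method: it removes the purchased product from the pool before drawing, whereas plain uniform sampling over the whole catalogue can occasionally draw it. It does not exclude the rest of the user's purchase history. The function name `sample_candidates` and the small `catalogue` are assumptions made for the example.

```python
import numpy as np

def sample_candidates(purchased_id, all_product_ids, num_negatives, rng):
    """Return [purchased_id, negative_1, ..., negative_n] and a one-hot label."""
    all_product_ids = np.asarray(all_product_ids)
    # exclude the purchased product so it cannot be drawn as a negative
    pool = all_product_ids[all_product_ids != purchased_id]
    negatives = rng.choice(pool, size=num_negatives, replace=False)
    candidates = np.concatenate([[purchased_id], negatives])
    label = np.zeros(num_negatives + 1, dtype=np.float32)
    label[0] = 1.0  # the purchased product always sits in position 0
    return candidates, label

rng = np.random.default_rng(0)
# toy catalogue built from product ids that appear in this notebook
catalogue = [8650774, 9306139, 9961521, 12058614, 12058615, 11927550]
candidates, label = sample_candidates(9306139, catalogue, num_negatives=3, rng=rng)
print(candidates, label)
```

Because the purchased product is always placed first, the label is the same one-hot vector for every example.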

In [122]:
tf.one_hot(0, depth=11)

<tf.Tensor: shape=(11,), dtype=float32, numpy=array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

In [123]:
class Mapper():
    
    def __init__(self, possible_products, num_negative_products):
        self.num_possible_products = len(possible_products)
        self.possible_products_tensor = tf.constant(possible_products, dtype=tf.int32)
        
        self.num_negative_products = num_negative_products
        self.y = tf.one_hot(0, num_negative_products+1)
    
    def __call__(self, user, product):
        # sample random product indexes to serve as negatives; sampling is uniform,
        # so a product the user did buy can occasionally be drawn
        random_negative_indexs = tf.random.uniform((self.num_negative_products, ), minval=0, maxval=self.num_possible_products, dtype=tf.int32) 

        negatives = tf.gather(self.possible_products_tensor, random_negative_indexs)

        # put the purchased product first, followed by the negatives
        candidates = tf.concat([product, negatives], axis=0)

        return (user, candidates), self.y


In [124]:
# each element pairs one purchased product with 10 randomly sampled negatives
dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor)).map(Mapper(products, 10))

dataset

<MapDataset shapes: (((1,), (11,)), (11,)), types: ((tf.string, tf.int32), tf.float32)>

In [125]:
for (u, c), y in dataset:
  print(u)
  print("1 product they bought and 10 they didn't buy",c)
  print("one hot encoded values for what they bought",y)
  break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
1 product they bought and 10 they didn't buy tf.Tensor(
[10524048 11669878 10323312  8379290 11676304 10487341 11407918  9990157
 10714419 10591471 11434149], shape=(11,), dtype=int32)
one hot encoded values for what they bought tf.Tensor([1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(11,), dtype=float32)


In [126]:
def get_dataset(df, products, num_negative_products):
    dummy_user_tensor = tf.constant(df[["dummyUserId"]].values, dtype=tf.string)
    product_tensor = tf.constant(df[["productId"]].values, dtype=tf.int32)

    dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))

    dataset = dataset.map(Mapper(products, num_negative_products))

    # learn from multiple users at a time instead of one
    dataset = dataset.batch(1024)

    return dataset



In [127]:
for (u, c), y in get_dataset(train, products, 4):
  print(u)
  print(c)
  print(y)
  break

tf.Tensor(
[[b'PIXcm7Ru5KmntCy0yA1K']
 [b'd0RILFB1hUzNSINMY4Ow']
 [b'Ebax7lyhnKRm4xeRlWW2']
 ...
 [b'xuX9n8PHfSR0AP3UZ8ar']
 [b'iNnxsPFfOa9884fMjVPJ']
 [b'aD8Mn12im8lFPzXAY41P']], shape=(1024, 1), dtype=string)
tf.Tensor(
[[10524048 10702429 11394611 12963332 10647908]
 [ 9137713 11242424 10997347 10220706 10741283]
 [ 5808602 12733534 10693892 11446616 12049340]
 ...
 [11541336 10696742  9533975  8725708 11055311]
 [ 7779232 13097911 10854998 10279506  9135671]
 [ 4941259 10421356 10195683 10474018 12224463]], shape=(1024, 5), dtype=int32)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 ...
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]], shape=(1024, 5), dtype=float32)


# Train a model

We need to compile a model, set the loss and create an evaluation metric. Then we need to train the model.
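
To see what a categorical cross-entropy loss on raw scores rewards, here is a small NumPy sketch. It is a hand-rolled illustration with made-up scores, not the Keras implementation; `softmax_cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(one_hot * log_probs))

label = np.array([1.0, 0.0, 0.0])          # purchased product is in position 0
good_scores = np.array([3.0, 0.0, -1.0])   # purchased product scored highest
bad_scores = np.array([-1.0, 0.0, 3.0])    # purchased product scored lowest
uniform_scores = np.zeros(3)               # all candidates scored equally

good_loss = softmax_cross_entropy(good_scores, label)
bad_loss = softmax_cross_entropy(bad_scores, label)
uniform_loss = softmax_cross_entropy(uniform_scores, label)

print(good_loss, uniform_loss, bad_loss)
```

The loss is smallest when the purchased product gets the highest score and largest when it gets the lowest, so minimising it pushes purchased products up the ranking.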

In [136]:
model = RecommenderModel(dummy_users, products, 15)
# treat ranking as classification: among the candidates, predict which product was purchased
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), # the model outputs raw scores (logits), not probabilities
              optimizer=tf.keras.optimizers.SGD(learning_rate=100), # note: an unusually large learning rate for plain SGD
              metrics = [tf.keras.metrics.CategoricalAccuracy()])

model.fit(get_dataset(train, products, 100), validation_data = get_dataset(valid, products, 100), epochs=1)



<keras.callbacks.History at 0x7ff48485ac10>

Let's do a manual check on whether the model is any good.

In [137]:
test_product = 11698965

In [138]:
print("Recs for item {}: {}".format(test_product, model.call_item_item(tf.constant(test_product, dtype=tf.int32))))

Recs for item 11698965: (<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([10352144, 11465462, 11276465,  7413512, 11698965,  4275756,
       10717559, 11178987,  9968967, 10338206, 10486747, 10319340,
        5801178, 10424334, 10219206, 11512016,  8548776, 10024101,
       10292598, 13566620, 10737036, 10824306, 10933032,  9526493,
        8084932, 10552208, 10994996, 10437097, 10477551, 12794478,
        9561677,  8752501,  9790133, 12947966, 11574737,  9439693,
       10984318,  9000169, 11473317, 10484392, 12747757, 10158192,
        9184017, 10970833, 11439705, 11880446, 11610155,  9168531,
        9229793, 10998890,  8770580, 10922273,  9177594, 13226228,
       11406238,  9554798, 11462825,  8884242,  8290010, 11209785,
       12940879,  8730081,  9904770, 11985464, 10478656, 13038678,
       11156089, 10308533, 11363430, 11255099, 11659469,  8646710,
       12966809, 12875989, 11187918,  9459817,  9349294, 10808998,
        8782953, 13773314,  9375812,  8579194, 12746074,  