# Recommendation System

Our recommendation system uses collaborative filtering, which is generally used in successful recommendation system nowadays. In simple term, the approach checks if a user has some similar preferences and dislikes of certain items with other user. If it does, then it assumes that the user will have similar opinion on other items with the other user. Through this approach, the system can predict the opinions of user on the items that user hasn’t bought.  

Given that we have the data of users, their ratings on the items they bought (rating depends on their ratings on feedback section or amount of purchases) and the geolocation of the items, our recommendation system will predict the preferences level of users on each of the items. 

We are using Pytorch to build the deep learning neural network model, which helps us infer the similarities of preferences between users and predict the preferences of all items for every users. The input of our existing data will be modified to replace strings with numbers, so it can be read easily by the neural network model.  

Firstly, we import the neccessary libraries.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import io
import os
import math
import copy
import pickle
import zipfile
from textwrap import wrap
from pathlib import Path
from itertools import zip_longest
from collections import defaultdict
from urllib.error import URLError
from urllib.request import urlopen

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import torch
from torch import nn
from torch import optim
from torch.nn import functional as F 
from torch.optim.lr_scheduler import _LRScheduler

# Get Datasets

We get the two datasets, which we created from the Data Preparation code. 

In [3]:
data_rating = pd.read_csv("user_rating.csv")

Our dataset may consists of strings, which have to be converted integers to be easily processed by the neural network model. 

In [4]:
def create_dataset(ratings, top=None):
    if top is not None:
        ratings.groupby('UserID')['Rating'].count()
    
    unique_users = ratings.UserID.unique()
    user_to_index = {old: new for new, old in enumerate(unique_users)}
    new_users = ratings.UserID.map(user_to_index)
    
    unique_items = ratings.ItemID.unique()
    item_to_index = {old: new for new, old in enumerate(unique_items)}
    new_items = ratings.ItemID.map(item_to_index)
    
    unique_postcode = ratings.Postcode.unique()
    postcode_to_index = {old: new for new, old in enumerate(unique_postcode)}
    new_postcode = ratings.Postcode.map(postcode_to_index)

    n_users = unique_users.shape[0]
    n_items = unique_items.shape[0]
    n_postcode = unique_postcode.shape[0]
    
    X = pd.DataFrame({'UserID': new_users, 'ItemID': new_items, 'Postcode': new_postcode})
    y = ratings['Rating'].astype(np.float32)
    return (n_users, n_items, n_postcode), (X, y), (user_to_index, item_to_index, postcode_to_index)

In [5]:
(n, m, l), (X, y), _ = create_dataset(data_rating)
print(f'Embeddings: {n} users, {m} items, {l} postcode')
print(f'Dataset shape: {X.shape}')
print(f'Target shape: {y.shape}')

Embeddings: 500 users, 500 items, 10 postcode
Dataset shape: (124738, 3)
Target shape: (124738,)


In [6]:
class ReviewsIterator:
    
    def __init__(self, X, y, batch_size=32, shuffle=True):
        X, y = np.asarray(X), np.asarray(y)
        
        if shuffle:
            index = np.random.permutation(X.shape[0])
            X, y = X[index], y[index]
            
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.n_batches = int(math.ceil(X.shape[0] // batch_size))
        self._current = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        return self.next()
    
    def next(self):
        if self._current >= self.n_batches:
            raise StopIteration()
        k = self._current
        self._current += 1
        bs = self.batch_size
        return self.X[k*bs:(k + 1)*bs], self.y[k*bs:(k + 1)*bs]

In [7]:
def batches(X, y, bs=32, shuffle=True):
    for xb, yb in ReviewsIterator(X, y, bs, shuffle):
        xb = torch.LongTensor(xb)
        yb = torch.FloatTensor(yb)
        yield xb, yb.view(-1, 1)

In [8]:
for x_batch, y_batch in batches(X, y, bs=4):
    print(x_batch)
    print(y_batch)
    break

tensor([[201, 410,   6],
        [357,  37,   0],
        [147, 131,   1],
        [380, 242,   0]])
tensor([[2.],
        [0.],
        [4.],
        [4.]])


# Training Data

Split the data into training and testing data.

In [9]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)
datasets = {'train': (X_train, y_train), 'val': (X_valid, y_valid)}
dataset_sizes = {'train': len(X_train), 'val': len(X_valid)}

In [10]:
minmax = [data_rating.Rating.min(), data_rating.Rating.max()]
minmax = torch.Tensor(minmax)

Create a neural network with embedding layers, hidden layers and dropout layers.

In [11]:
class EmbeddingNet(nn.Module):
    """
    Creates a dense network with embedding layers.
    
    Args:
    
        n_users:            
            Number of unique users in the dataset.

        n_items: 
            Number of unique items in the dataset.
        
        n_postcodes: 
            Number of unique postcodes in the dataset.

        n_factors: 
            Number of columns in the embeddings matrix.

        embedding_dropout: 
            Dropout rate to apply right after embeddings layer.

        hidden:
            A single integer or a list of integers defining the number of 
            units in hidden layer(s).

        dropouts: 
            A single integer or a list of integers defining the dropout 
            layers rates applyied right after each of hidden layers.
            
    """
    def __init__(self, n_users, n_items, n_postcodes,
                 n_factors=50, embedding_dropout=0.02, 
                 hidden=10, dropouts=0.2):
        
        super().__init__()
        hidden = get_list(hidden)
        dropouts = get_list(dropouts)
        n_last = hidden[-1]
        
        def gen_layers(n_in):
            """
            A generator that yields a sequence of hidden layers and 
            their activations/dropouts.
            
            Note that the function captures `hidden` and `dropouts` 
            values from the outer scope.
            """
            nonlocal hidden, dropouts
            assert len(dropouts) <= len(hidden)
            
            for n_out, rate in zip_longest(hidden, dropouts):
                yield nn.Linear(n_in, n_out)
                yield nn.ReLU()
                if rate is not None and rate > 0.:
                    yield nn.Dropout(rate)
                n_in = n_out
            
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_items, n_factors)
        self.p = nn.Embedding(n_postcodes, n_factors)
        self.drop = nn.Dropout(embedding_dropout)
        self.hidden = nn.Sequential(*list(gen_layers(n_factors * 3)))
        self.fc = nn.Linear(n_last, 1)
        self._init()
        
    def forward(self, users, items, postcodes, minmax=None):
        features = torch.cat([self.u(users), self.m(items), self.p(postcodes)], dim=1)
        x = self.drop(features)
        x = self.hidden(x)
        out = torch.sigmoid(self.fc(x))
        if minmax is not None:
            min_rating, max_rating = minmax
            out = out*(max_rating - min_rating + 1) + min_rating - 0.5
        return out
    
    def _init(self):
        """
        Setup embeddings and hidden layers with reasonable initial values.
        """
        
        def init(m):
            if type(m) == nn.Linear:
                torch.nn.init.xavier_uniform_(m.weight)
                m.bias.data.fill_(0.01)
                
        self.u.weight.data.uniform_(-0.05, 0.05)
        self.m.weight.data.uniform_(-0.05, 0.05)
        self.p.weight.data.uniform_(-0.05, 0.05)
        self.hidden.apply(init)
        init(self.fc)
    
    
def get_list(n):
    if isinstance(n, (int, float)):
        return [n]
    elif hasattr(n, '__iter__'):
        return list(n)
    raise TypeError('layers configuraiton should be a single number or a list of numbers')

In [12]:
net = EmbeddingNet(
    n_users=n, n_items=m, n_postcodes= l,
    n_factors=150, hidden=[500, 500, 500], 
    embedding_dropout=0.05, dropouts=[0.5, 0.5, 0.25])

In [13]:
lr = 1e-3
wd = 1e-5
bs = 2000
n_epochs = 100
patience = 10
no_improvements = 0
best_loss = np.inf
best_weights = None
history = []
lr_history = []

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

net.to(device)
criterion = nn.MSELoss(reduction='sum')
optimizer = optim.Adam(net.parameters(), lr=lr, weight_decay=wd)
iterations_per_epoch = int(math.ceil(dataset_sizes['train'] // bs))

for epoch in range(n_epochs):
    stats = {'epoch': epoch + 1, 'total': n_epochs}
    
    for phase in ('train', 'val'):
        training = phase == 'train'
        running_loss = 0.0
        n_batches = 0
        
        for batch in batches(*datasets[phase], shuffle=training, bs=bs):
            x_batch, y_batch = [b.to(device) for b in batch]
            optimizer.zero_grad()
        
            # compute gradients only during 'train' phase
            with torch.set_grad_enabled(training):
                outputs = net(x_batch[:, 0], x_batch[:, 1], x_batch[:, 2], minmax)
                loss = criterion(outputs, y_batch)
                
                # don't update weights and rates when in 'val' phase
                if training:
                    loss.backward()
                    optimizer.step()
                    
            running_loss += loss.item()
            
        epoch_loss = running_loss / dataset_sizes[phase]
        stats[phase] = epoch_loss
        
        # early stopping: save weights of the best model so far
        if phase == 'val':
            if epoch_loss < best_loss:
                print('loss improvement on epoch: %d' % (epoch + 1))
                best_loss = epoch_loss
                best_weights = copy.deepcopy(net.state_dict())
                no_improvements = 0
            else:
                no_improvements += 1
                
    history.append(stats)
    print('[{epoch:03d}/{total:03d}] train: {train:.4f} - val: {val:.4f}'.format(**stats))
    if no_improvements >= patience:
        print('early stopping after epoch {epoch:03d}'.format(**stats))
        break

loss improvement on epoch: 1
[001/100] train: 9.2756 - val: 8.9915
[002/100] train: 8.9774 - val: 9.0117
[003/100] train: 8.9346 - val: 9.0254
[004/100] train: 8.9165 - val: 9.0412
[005/100] train: 8.8748 - val: 9.1082
[006/100] train: 8.8046 - val: 9.1341
[007/100] train: 8.7121 - val: 9.1993
[008/100] train: 8.5703 - val: 9.3576
[009/100] train: 8.3385 - val: 9.5715
[010/100] train: 8.1186 - val: 9.8119
[011/100] train: 7.8230 - val: 9.9129
early stopping after epoch 011


In [14]:
net.load_state_dict(best_weights)

<All keys matched successfully>

Now we can calculate the loss function to see the overall performances of the training of model.

In [15]:
groud_truth, predictions = [], []

with torch.no_grad():
    for batch in batches(*datasets['val'], shuffle=False, bs=bs):
        x_batch, y_batch = [b.to(device) for b in batch]
        outputs = net(x_batch[:, 0], x_batch[:, 1], x_batch[:, 2], minmax)
        groud_truth.extend(y_batch.tolist())
        predictions.extend(outputs.tolist())

groud_truth = np.asarray(groud_truth).ravel()
predictions = np.asarray(predictions).ravel()

In [16]:
final_loss = np.sqrt(np.mean((predictions - groud_truth)**2))
print(f'Final RMSE: {final_loss:.4f}')

Final RMSE: 3.0565


In [17]:
with open('best.weights', 'wb') as file:
    pickle.dump(best_weights, file)

# Make Predictions

We are going to predict the preferences of User 1 for all items. Our input will be User 1 (list of 1s), list of items and list of items' corresponding postcode. For each element in the output tensor, the higher the value, the higher the preference value for its corresponding item.  

In [21]:
net.eval()
user = torch.tensor([1] * 500)
data_items = pd.read_csv("list_items.csv")

item_list = []
postcode_list = []
price_list = []

for data_val in data_items.values.tolist():
    item_list.append(data_val[0]-1)
    postcode_list.append(data_val[1])
    price_list.append(data_val[2])

item = torch.tensor(item_list)

unique_postcode = set(postcode_list)
postcode_to_index = {old: new for new, old in enumerate(unique_postcode)}
new_postcode = list(map(lambda x: postcode_to_index[x],postcode_list))

postcode = torch.tensor(new_postcode)

with torch.no_grad():
    output = net(user,item, postcode)

We have the output now. We create a dict where we map each items to its output score, which is the rating. Our rating is 0-1 instead of 1-10 because we implemented sigmoid function in our neural network.

In [22]:
output_list = list(output.numpy().flatten())
item_rating = {}

for i in range(0,499):
    item_rating[i] = output_list[i]

Finally, we sorted the dict by its rating in reverse order. In the user preference list, the first item has the highest preferences from User 1 and the last item has the lowest preferences from User 1.

In [23]:
user_preference = sorted(item_rating.items(), key=lambda kv: kv[1], reverse=True)
print(user_preference)

[(151, 0.32776675), (46, 0.32776558), (400, 0.32734692), (127, 0.32731152), (308, 0.32690844), (209, 0.32636848), (222, 0.32605797), (270, 0.3259815), (3, 0.32568696), (347, 0.3256444), (379, 0.32547078), (47, 0.3253154), (361, 0.32456896), (489, 0.32437065), (273, 0.3242113), (52, 0.32391176), (101, 0.32371026), (199, 0.32366723), (369, 0.32362142), (43, 0.32357666), (99, 0.3230511), (97, 0.32275176), (442, 0.322703), (15, 0.3226415), (145, 0.3225865), (319, 0.3223825), (85, 0.32210967), (125, 0.32206538), (7, 0.32196733), (255, 0.32157212), (386, 0.3215071), (444, 0.32145277), (131, 0.3214325), (251, 0.3213563), (404, 0.3209261), (141, 0.32084996), (166, 0.3208201), (5, 0.3207341), (402, 0.3207108), (437, 0.32046977), (248, 0.32040095), (252, 0.32020247), (443, 0.3201325), (205, 0.32011893), (304, 0.32002583), (474, 0.32000145), (491, 0.31999835), (419, 0.31995717), (28, 0.3197675), (375, 0.31975064), (453, 0.31968692), (144, 0.31965125), (213, 0.31948248), (228, 0.31947926), (429, 0

# Spending Capability

Our recommendation system does not only recommend the items that user likely to give the top rating, but also filter out items out of spending capability. If the price of item is higher than the user's current account balance, the system knows that the user is not going to purchase the item. So the system will not recommend the item. 

In [26]:
#Assume user has £70 remaining on account balance.
account_balance = 70

for i in range(0, len(price_list)):
    if(i > account_balance):
        item_to_remove = i
        for item in user_preference:
            if(item[0]==item_to_remove):
                user_preference.remove(item)
                break

print(user_preference)

[(46, 0.32776558), (3, 0.32568696), (47, 0.3253154), (52, 0.32391176), (43, 0.32357666), (15, 0.3226415), (7, 0.32196733), (5, 0.3207341), (28, 0.3197675), (12, 0.31863254), (66, 0.3180563), (21, 0.31711587), (26, 0.3166844), (61, 0.31665984), (57, 0.31662738), (49, 0.31639755), (17, 0.31618688), (35, 0.31539813), (18, 0.31537667), (34, 0.31493673), (50, 0.31474987), (38, 0.3142212), (39, 0.31377807), (10, 0.3134792), (70, 0.31311825), (23, 0.3130918), (62, 0.31293303), (63, 0.31211182), (16, 0.31192723), (24, 0.3118162), (68, 0.31166893), (67, 0.31122702), (1, 0.3110486), (37, 0.31047112), (2, 0.3101485), (20, 0.31011736), (40, 0.31008905), (36, 0.30976924), (60, 0.30970547), (30, 0.30953026), (64, 0.3094504), (51, 0.30912548), (41, 0.30911735), (4, 0.3089027), (45, 0.3079975), (55, 0.30678046), (9, 0.30656752), (14, 0.30585185), (22, 0.30565327), (44, 0.30531704), (33, 0.3051563), (53, 0.30510417), (42, 0.30508128), (27, 0.30486318), (69, 0.30427197), (11, 0.30409837), (31, 0.303967)