# Restricted Boltzmann Machine 

Here we will be creating a recommender engine to predict/classify the segment of whether a user will "like" or "not like" a movie based on the MovieLens Dataset with ratings of the movies. This is a real world dataset with serveral records of ratings by users.

We will be using the 100k dataset because the training and test datasets are already split up in the dataset for our purposes here.

The [dataset](https://grouplens.org/datasets/movielens/1m/) can be downloaded and checked out from the hyperlink


Lets start with Data Preprocessing

In [35]:
# Importing the libraries
import numpy as np
import pandas as pd
import torch

# For neural network
import torch.nn as nn

# For Parallel Computing
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

#### Importing all the dataset

1. movies is a dataframe of all the movies. here we are importing the 1M dataset
    * This dataset is separated by :: so using the appropriate separator
    * We have to add header=none becacuse the dataset does not have a dataset
    * Engine is to make sure that teh dataset is imported correctly
    * Some of the movie characters cocntain special characters. so we use latin-1 as preferred encoding
    
2. users is a dataframe of all the users

3. ratings is a dataframe for all the ratings provided for the movies by the users

We are doing this because we want to look at these datasets. We will be using a 100k dataset for training and testing

In [36]:
# Importing the dataset
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')

In [37]:
movies.columns=['movie_id','movie_name','genres']
movies.head()

Unnamed: 0,movie_id,movie_name,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [38]:
users.columns=['user_id','gender','age','job_id','zipcode']
users.head()

Unnamed: 0,user_id,gender,age,job_id,zipcode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [39]:
ratings.columns=['user_id','movie_id','ratings','timestamp']
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


### Training and Test Dataset

* Here the files we will be using are tab delimited
* In the folder are there are 5 training and test splits. 
* This is to allow K fold cross validations using 5 folds. In this we are not going to do kfold cross validation but build a straightforward RBM model

    1. Using u1.base as training set and u1.test as test set.
    2. Training Dataset has 80% of the 100k dataset
    3. Test Dataset has the remaining 20000 records
    4. Converting both the training set and test set as np arrays

In [40]:
# Preparing the training set and the test set
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype = 'int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int')

In [41]:
training_set

array([[        1,         2,         3, 876893171],
       [        1,         3,         4, 878542960],
       [        1,         4,         3, 876893119],
       ..., 
       [      943,      1188,         3, 888640250],
       [      943,      1228,         3, 888640275],
       [      943,      1330,         3, 888692465]])

We need to create two variables with total number of users and movies. THis is because can use all combination of users and movies as input to the restricted boltzmann machine. If a certain combination of user and movie does not have a rating, we can substitute it with a 0

In [42]:
# Getting the number of users and movies
nb_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
nb_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))

### Data Processing

Now we need to make a specific structure of data that RBM expect.
RBM are a special type of Neural Network which have some input nodes that are the features and there are some observation going through the nn one by one.

Here were creating a list of lists with each list corresponds to the each user and in the nested list will have all the ratings for that customer

For ex: lets create a list of all the ratings by user 1,2 etc

User 1 ratings --> [3.0,1.0,.....,4.0]
User 2 ratings --> [2.0,3.0,.....,4.0]
.
.
.
User 943 ratings --> [5.0,2.0,.....0.0]

Finally put each of this list into another list

So we end up with a list of list where all the nested lists correspond to ratings by the corresponding user


In [43]:

# Converting the data into an array with users in lines and movies in columns
def convert(data):
    new_data = []
    for id_users in range(1, nb_users + 1):
        id_movies = data[:,1][data[:,0] == id_users]
        id_ratings = data[:,2][data[:,0] == id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)


Lets look at the length of the training set. It should be the same as the number of users in our training set=943

In [44]:
len(training_set)

943

Now lets look at the length of each element in the training set
This should be equal to number of unique movies - 1682

So training set is a list of 943 lists - corresponding to 943 users
Each of these 943 lists have a length of 1682 movie ratings

In [45]:
len(training_set[0])

1682

#### Building PyTorch Tensors

Since RBM is an unsupervised model, the input that it does not need a special label column. What it needs is a speacial list of list vectors that are fed into the input nodes for the model.

We implement this using PyTorch Tensors since they are computationally less expensive and converge really fast. Similar to np array, we can convert the whole array into torch arrach using the command as shown below:


In [46]:

# Converting the data into Torch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

For a Boltzmann Machine, the output and inputs must be consistent. 

1. In our case, the inputs are ratings from 0 - 5 but we want the ratings in the output to be 0 or 1.
2. So to make everything consistent, we have to recode the input parameters based on a judgement to rate a certain threshold and above as "liked" =1 and "not liked"=0 

Here we are using any rating less than 3 as not liked by user and any rating >=3 as liked by user

In [47]:

# Converting the ratings into binary ratings 1 (Liked) or 0 (Not Liked)
training_set[training_set == 0] = -1
training_set[training_set == 1] = 0
training_set[training_set == 2] = 0
training_set[training_set >= 3] = 1
test_set[test_set == 0] = -1
test_set[test_set == 1] = 0
test_set[test_set == 2] = 0
test_set[test_set >= 3] = 1


Here we have to build a class to perform the probabilistic graphical model.

Lets build this architecture using the following:

1. Number of hidden and visible nodes
2. Weight
3. bias b for visible Node and a for hidden node
4. Activation function


#### Building a class
 Initialize -- Initialize all the hidden nodes, varaibles and activation function
 sample_h --> Probability of Hidden Node given the Visible nodes
 sample_v --> Probability of Visible Node given the hidden nodes
 
#### Bulding the methods
In a RBM, the hidden nodes are randomly sampled "turned on" 
* This sampling is based on Gibbs Sampling. 
* SO we need to calculate teh probability of a given hidden node to be active given a set of connected visible nodes
* All the above is calculated by sample_h
* X represents the connected visible nodes to that particular hidden node

At the same time the same also happens in the Visible Nodes:
* So we calculate the sample_v similar to sample_h

#### Contrastive Divergence

* Approximating the LogLikelihood Model. 
* Energy Based Model
* Minimizing the energy state of the model
* The goal is to minimize the energy by mazimizing the log likelihood. 
* So instead of calculating the log likelihood, we are appromimating the same
* The above is done through the train function
    * To do this we pull k iterations of gibbs sample and update the weights, and bias a & b for each iteration
    * Here in the first line we update the weights with weight + pv_given_h(0) + pv_given_h(k)
    * In the second line we update the bias for visible nodes b
    * in the third line we update the bias for hidden nodes a

In [52]:

# Creating the architecture of the Neural Network
class RBM():
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)
        self.a = torch.randn(1, nh)
        self.b = torch.randn(1, nv)
    def sample_h(self, x):
        wx = torch.mm(x, self.W.t())
        activation = wx + self.a.expand_as(wx)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)
    def sample_v(self, y):
        wy = torch.mm(y, self.W)
        activation = wy + self.b.expand_as(wy)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)
    def train(self, v0, vk, ph0, phk):
        self.W += torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)
nv = len(training_set[0])
nh = 1000
batch_size = 100
rbm = RBM(nv, nh)


Now that the class is created, lets start the training process.

In [53]:

# Training the RBM
nb_epoch = 20
for epoch in range(1, nb_epoch + 1):
    train_loss = 0 
    s = 0.
    for id_user in range(0, nb_users - batch_size, batch_size):
        vk = training_set[id_user:id_user+batch_size]
        v0 = training_set[id_user:id_user+batch_size]
        ph0,_ = rbm.sample_h(v0)
        for k in range(10):
            _,hk = rbm.sample_h(vk)
            _,vk = rbm.sample_v(hk)
            vk[v0<0] = v0[v0<0]
        phk,_ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0>=0] - vk[v0>=0]))
        s += 1.
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))


  return self.add_(other)


epoch: 1 loss: 0.404788148712677
epoch: 2 loss: 0.2608543483557453
epoch: 3 loss: 0.2560392441543844
epoch: 4 loss: 0.25326235084890136
epoch: 5 loss: 0.25685413780237154
epoch: 6 loss: 0.2527687054788823
epoch: 7 loss: 0.25348444181146257
epoch: 8 loss: 0.24978572319769554
epoch: 9 loss: 0.25546859648328357
epoch: 10 loss: 0.25897645110409667
epoch: 11 loss: 0.2567810300443097
epoch: 12 loss: 0.26258069476225804
epoch: 13 loss: 0.2573744231337753
epoch: 14 loss: 0.25118818633180784
epoch: 15 loss: 0.25262101612186405
epoch: 16 loss: 0.2629822792814076
epoch: 17 loss: 0.2709902809684975
epoch: 18 loss: 0.27206881934121346
epoch: 19 loss: 0.2703391533954741
epoch: 20 loss: 0.26652384638754395


In [10]:

# Testing the RBM
test_loss = 0
s = 0.
for id_user in range(nb_users):
    v = training_set[id_user:id_user+1]
    vt = test_set[id_user:id_user+1]
    if len(vt[vt>=0]) > 0:
        _,h = rbm.sample_h(v)
        _,v = rbm.sample_v(h)
        test_loss += torch.mean(torch.abs(vt[vt>=0] - v[vt>=0]))
        s += 1.
print('test loss: '+str(test_loss/s))


test loss: 0.2441692528394889
