# Movie Recommendation System with Stacked Autoencoders

### Project Overview:

This project showcases the development of a sophisticated movie recommendation system using deep learning techniques. Stacked Autoencoders (SAEs) play a central role in this project, enabling the system to understand user preferences and provide personalized movie recommendations.

#### Key Components and Steps:

1. Data Preparation:

- The project begins by importing and preparing data from the MovieLens dataset. While the MovieLens dataset is rich in movie-related data, it is not used in this project. Instead, the focus is on building a recommendation system based on user-movie interactions.

2. Data Conversion and Preprocessing:

- The user-movie interaction data is converted into Torch tensors, making it compatible with PyTorch, a popular deep learning framework.
- Ratings are processed to represent binary ratings: 1 (liked) or 0 (not liked).

3. Stacked Autoencoder Architecture:

- A Stacked Autoencoder neural network is designed, comprising multiple layers: an input layer (movies) and several hidden layers.
- The SAE's architecture is specifically tailored for capturing complex patterns in user-movie interactions.

4. Training the SAE:

- The SAE is trained using a combination of Mean Squared Error (MSE) loss and backpropagation.
- During training, the SAE learns the underlying structure of user preferences, reducing the reconstruction error in representing user-movie interactions.

5. Testing and Evaluation:

- The trained SAE is tested using a separate test dataset to assess its performance.
- The test loss is calculated to measure the accuracy of the recommendations, providing insights into the model's ability to predict user preferences.

#### Key Insights:

- Stacked Autoencoders are powerful tools for modeling and understanding user preferences in recommendation systems.
- The project demonstrates how deep learning techniques can be used to provide personalized recommendations, enhancing user experience.

#### Dataset Source:

The MovieLens dataset serves as the source for this project. Although the dataset contains extensive movie-related data, this project focuses on the user-movie interaction aspect. The MovieLens dataset is widely used in the field of recommendation systems.

On the whole, this project highlights the application of deep learning in the development of recommendation systems and showcases the capabilities of Stacked Autoencoders in understanding and predicting user preferences.

## Importing the libraries

In [40]:
import numpy as np
import pandas as pd

In [41]:
import torch
import torch.nn as nn                # the module to implement neural network
import torch.nn.parallel             # for parallel computation
import torch.optim as optim          # the optimizer
import torch.utils.data
from torch.autograd import Variable  # for stochastic gradient discent 

## Importing the dataset

 the separator is '::'
 
 We use **encoding = latin-1** because some movie names have special characters.
 
 since our columns have no names, we add header = None.
 
 We will use just the first column which is the movie id.

In [42]:
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')

In [43]:
print(movies.head())
print(movies.shape)

   0                                   1                             2
0  1                    Toy Story (1995)   Animation|Children's|Comedy
1  2                      Jumanji (1995)  Adventure|Children's|Fantasy
2  3             Grumpier Old Men (1995)                Comedy|Romance
3  4            Waiting to Exhale (1995)                  Comedy|Drama
4  5  Father of the Bride Part II (1995)                        Comedy
(3883, 3)


In [44]:
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
# The first column is the user ID.
# The second column corresponds to the movie ID.
# The third column corresponds to the user's rating of that movie (1 to 5).

In [45]:
print(ratings.head())
print(ratings.shape)

   0     1  2          3
0  1  1193  5  978300760
1  1   661  3  978302109
2  1   914  3  978301968
3  1  3408  4  978300275
4  1  2355  5  978824291
(1000209, 4)


## Preparing the training set and the test set

In [46]:
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
print(training_set.head())
training_set = np.array(training_set, dtype = 'int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int')

# In u1.base, the elements of each row are separated by a tab; hence delimiter = '\t'
# The 1st column is the user ID
# The 2nd column is the movie ID
# The 3rd column is the rating

   1  1.1  5  874965758
0  1    2  3  876893171
1  1    3  4  878542960
2  1    4  3  876893119
3  1    5  3  889751712
4  1    7  4  875071561


In [47]:
print(training_set.shape)
print(test_set.shape)
# 80/20

(79999, 4)
(19999, 4)


## Getting the number of users and movies

In [48]:
nb_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
nb_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))
print('number of users:',nb_users)
print('number of movies:', nb_movies)

number of users: 943
number of movies: 1682


## Converting the data into an array with users in lines and movies in columns

In [49]:
def convert(data):
    new_data = []
    for id_users in range(1, nb_users + 1):
        id_movies = data[:,1][data[:,0] == id_users]
        id_ratings = data[:,2][data[:,0] == id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)

## Converting the data into Torch tensors

In [50]:
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [56]:
test_set.shape

torch.Size([943, 1682])

In [57]:
training_set.shape

torch.Size([943, 1682])

## Creating the architecture of the Neural Network

In [53]:
class SAE(nn.Module): # SAE class is an inheritance (child class) of an existing class named Module (parent class).
    def __init__(self, ):
        super(SAE, self).__init__()   # The super function is used to have access to the methods and functions of nn.Module
        self.fc1 = nn.Linear(nb_movies, 20)
        self.fc2 = nn.Linear(20, 10)
        self.fc3 = nn.Linear(10, 20)
        self.fc4 = nn.Linear(20, nb_movies)   # (num_movies)=f1=>>(20 nodes)=f2=>>(10 nodes)=f3=>>(20 nodes)=f4=>>(num_movies)
        self.activation = nn.Sigmoid()
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.fc4(x)
        return x
sae = SAE()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(sae.parameters(), lr = 0.01, weight_decay = 0.5) # stochastic gradiant discent

## Training the SAE

In [54]:
nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.   # It will count the number of users who rated at least one movie. [to extract those who did not rate any movie]
    for id_user in range(nb_users):
        input = Variable(training_set[id_user]).unsqueeze(0) # keras (pytorch) cannot accept a single vector of one dim. A batch of vec must be introduced.
        target = input.clone()  # input will be modified later so a copy (clone) of it is made.
        if torch.sum(target.data > 0) > 0:   # only those who rated at least one movie
            output = sae(input)
            target.require_grad = False   # We do not compute the grad-dis with respect to the target. 
            output[target == 0] = 0       # We do not want to predict the rates of the movies that the user did not rate. So when the rating is 0, it will remain 0 (not predicted)
            loss = criterion(output, target)
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10) 
            loss.backward()
            train_loss += np.sqrt(loss.data*mean_corrector)
            s += 1.
            optimizer.step()
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))

epoch: 1 loss: tensor(1.7720)
epoch: 2 loss: tensor(1.0967)
epoch: 3 loss: tensor(1.0535)
epoch: 4 loss: tensor(1.0383)
epoch: 5 loss: tensor(1.0311)
epoch: 6 loss: tensor(1.0265)
epoch: 7 loss: tensor(1.0238)
epoch: 8 loss: tensor(1.0220)
epoch: 9 loss: tensor(1.0205)
epoch: 10 loss: tensor(1.0197)


## Testing the SAE

In [55]:
test_loss = 0
s = 0.
for id_user in range(nb_users):
  input = Variable(training_set[id_user]).unsqueeze(0)
  target = Variable(test_set[id_user]).unsqueeze(0)
  if torch.sum(target.data > 0) > 0:
    output = sae(input)
    target.require_grad = False
    output[target == 0] = 0
    loss = criterion(output, target)
    mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
    test_loss += np.sqrt(loss.data*mean_corrector)
    s += 1.
print('test loss: '+str(test_loss/s))

test loss: tensor(1.0270)
