# Movie recommendation 
### CPSC 585 Project 6
#### Erick Juarez 
This project will implement a [Hopfield Network](https://en.wikipedia.org/wiki/Hopfield_network) and use it for recommending movies. 
The network will be trained on the small [MovieLens Dataset](https://grouplens.org/datasets/movielens/latest/). 

### Experiment 1
Before building the network, we need to prepare our data in a couple of steps. First, we will use the contents of the file `ratings.csv` to create a dataset with one feature vector per user, where each feature corresponds to a movie. A movie the user rated `3.0` or above is encoded as `+1`; every other movie, including movies the user never rated, is encoded as `-1`.

In [2]:
# import libraries 
import numpy as np
import pandas as pd

In [3]:
# read csv file using pandas 
df = pd.read_csv('ratings.csv')
# count the unique users and movies, then look at the imported dataframe
user_count = len(df.userId.unique())
movie_count = len(df.movieId.unique())
print(df)
print("number of users:", user_count)
print("number of movies:", movie_count)

        userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]
number of users: 610
number of movies: 9724


The important columns of the dataframe are the first three: `userId`, `movieId`, and `rating`. Our new numpy array, `dataset`, will have 610 rows and 9724 columns, representing users and movies respectively. The value stored at `dataset[user][movie]` will be `+1` if that `user` gave the movie a `rating` of 3.0 or higher and `-1` otherwise.

In [4]:
# create array filled with -1 (i.e. disliked or never seen movie)
dataset = np.full((user_count, movie_count), -1)
print(dataset)
print(dataset.shape)

[[-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 ...
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]]
(610, 9724)


Using `loc[]`, we can extract the rows for each user and determine which movies they have watched. Each user's `movieId` column contains the movies that user has seen. Each of these movies also has a `rating`, which we must check in order to set the values in `dataset` to `+1` for movies rated 3.0 or higher.

In [5]:
# index the dataframe by userId so each user's rows can be selected with loc[]
df.set_index('userId', inplace=True)
features = df.movieId.unique() # each movie corresponds to a feature
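
As a small illustration of the `loc[]` lookup described above, here is a sketch on a made-up three-row frame (the `toy` values are hypothetical, not taken from `ratings.csv`). It assumes `userId` has already been made the index, and it shows why the double brackets can matter: `loc[[label]]` returns a DataFrame even when a user has only one rating.

```python
import pandas as pd

# Hypothetical toy ratings, indexed by userId like the real dataframe
toy = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating":  [4.0, 2.5, 3.0],
}).set_index("userId")

# Double brackets: always a DataFrame, one row per rating by that user
user1_ids = toy.loc[[1]].movieId.to_numpy()
user2_rows = toy.loc[[2]]

# Single brackets on a label that occurs once collapse the row into a Series
user2_single = toy.loc[2]

print(user1_ids)
print(type(user2_rows).__name__, type(user2_single).__name__)
```

With double brackets, `.movieId.to_numpy()` gives an array for every user, which is what the per-user loop relies on.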

In [32]:
# define a helper function that returns a unique index of a given movieId 
def id_to_index(movieId):
    for movie in range(len(features)):
        if movieId == features[movie]:
            return movie
    return None
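
The helper above scans `features` from the start on every call, so each lookup costs time proportional to the number of movies. One possible alternative, sketched here on a short made-up `features` array, is to build a dictionary once and look indices up in constant time. The names `index_of` and `id_to_index_fast` are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for `features` (unique movieIds in order of first appearance)
features = np.array([1, 3, 6, 47, 50])

# Build the movieId -> column index mapping once
index_of = {int(movie_id): column for column, movie_id in enumerate(features)}

def id_to_index_fast(movie_id):
    """Return the column index for movie_id, or None if the id is unknown."""
    return index_of.get(int(movie_id))

print(id_to_index_fast(47))
print(id_to_index_fast(2))
```

The return values match those of `id_to_index`: a column index for a known `movieId` and `None` otherwise.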

In [46]:
# go through each user and find out what movies they have rated 3.0 or higher 
for user in range(user_count):
    ids = df.loc[[user+1]].movieId.to_numpy()
    ratings = df.loc[[user+1]].rating.to_numpy() 
    for movie in range(len(ids)):
        index = id_to_index(ids[movie])
        # is this movieId a known feature? (always true here, because features comes from df)
        if index is not None:
            # is the rating 3.0 or higher
            if ratings[movie] >= 3.0:
                dataset[user][index] = +1
print(dataset.shape)

(610, 9724)


Now we have a new dataset representing 610 users and 9724 movies.
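
For comparison, the same `+1`/`-1` encoding can be built without Python loops. The sketch below uses a made-up five-row frame (`toy`, with hypothetical values) and `pd.factorize`, which numbers users and movies in order of first appearance. It assumes, as the loop does, that row order should follow first appearance in the ratings file; it is one possible approach rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings standing in for ratings.csv
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.5, 3.0, 5.0, 1.0],
})

# Integer codes (row / column positions) in order of first appearance
user_idx, users = pd.factorize(toy["userId"])
movie_idx, movies = pd.factorize(toy["movieId"])

# Start from -1 everywhere, then mark ratings of 3.0 or higher with +1
toy_dataset = np.full((len(users), len(movies)), -1)
liked = (toy["rating"] >= 3.0).to_numpy()
toy_dataset[user_idx[liked], movie_idx[liked]] = 1

print(toy_dataset)
```

User 3 rated only one movie, below 3.0, so that row stays entirely `-1`, just as unrated movies do.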

### Experiment 2
The next step in preprocessing our data is to set aside 10% of this new dataset for testing our network.

The new `train_set` will contain 549 users and the `test_set` will contain 61. We can also estimate the storage capacity of our network: a Hopfield network with *d* units can store about 0.15 × *d* training examples. Our network has 9724 units, so it should be able to store close to 1459 examples. That is far more than the 549 examples in `train_set`, so this network should be able to store our dataset.
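
The arithmetic above, together with one possible way to choose the held-out rows, can be sketched as follows. The random shuffle, the seed, and the names `train_rows` and `test_rows` are assumptions made for illustration; the notebook does not show how the split is actually performed.

```python
import numpy as np

n_users, n_movies = 610, 9724

# Rule-of-thumb capacity quoted above: about 0.15 * d patterns for d units
capacity = round(0.15 * n_movies)

# Hypothetical split: shuffle the user rows and hold out 10% for testing
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
order = rng.permutation(n_users)
n_test = round(0.10 * n_users)
test_rows, train_rows = order[:n_test], order[n_test:]

print(capacity, len(train_rows), len(test_rows))
```

Under these assumptions, `dataset[train_rows]` and `dataset[test_rows]` would give the two subsets, with no user appearing in both.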

In [None]:
train_set = 