### Neural Collaborative Filtering (NCF)
The goal is to build a Neural Collaborative Filtering (NCF) model: a collaborative filtering model that uses a deep neural network to recommend movies to a user, based on that user's reviews and on similar movies. The model should produce a top 10 list of movie recommendations for a selected user.
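To make the idea concrete, here is a minimal NumPy sketch of the forward pass that NCF models commonly use: each user index and movie index selects an embedding vector, the two vectors are concatenated, and a small dense network turns them into a score between 0 and 1. The sizes, the random weights and the function name `ncf_forward` are assumptions for illustration only; this is not the model built in the cells below, and in practice the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration, not taken from the real dataset
n_users, n_movies, dim = 5, 7, 4

# One vector per user and per movie (random here, learned during training in practice)
user_emb = rng.normal(size=(n_users, dim))
movie_emb = rng.normal(size=(n_movies, dim))

# A tiny dense head: concat(user, movie) -> hidden ReLU layer -> sigmoid score
w1 = rng.normal(size=(2 * dim, 8))
b1 = np.zeros(8)
w2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)


def ncf_forward(user_idx, movie_idx):
    """Score (user, movie) pairs given 0-based integer index arrays."""
    x = np.concatenate([user_emb[user_idx], movie_emb[movie_idx]], axis=1)
    hidden = np.maximum(x @ w1 + b1, 0.0)        # ReLU
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))   # sigmoid

scores = ncf_forward(np.array([0, 0, 3]), np.array([1, 2, 6]))
print(scores)
```

The 0-based indices used for the lookups are the reason the user and movie IDs are enumerated during data processing.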

In [1]:
# Import dependencies
import matplotlib.pyplot as plt
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import inspect
import config
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [2]:
#Log into the SQL database to retrieve the data
protocol = 'postgresql'
username = config.Username
password = config.Password
host = 'localhost'
port = 5432
database_name = 'movies_db'
rds_connection_string = f'{protocol}://{username}:{password}@{host}:{port}/{database_name}'
engine = create_engine(rds_connection_string)
insp = inspect(engine)

In [3]:
#Perform an SQL search query to select all of the data within the data set and create a pandas data frame from it
sql_join = r"""SELECT movies.movie_id, movies.title, movies.genres, ratings.user_id, ratings.rating, ratings.timestamps
FROM movies
INNER JOIN ratings
ON movies.movie_id=ratings.movie_id;"""
joined_movies_df=pd.read_sql_query(sql_join, con=engine)

In [4]:
#Now that all the data has been imported from the SQL database, let's inspect it
joined_movies_df

Unnamed: 0,movie_id,title,genres,user_id,rating,timestamps
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,964981247
2,6,Heat (1995),Action|Crime|Thriller,1,4.0,964982224
3,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,1,5.0,964983815
4,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1,5.0,964982931
...,...,...,...,...,...,...
100831,166534,Split (2017),Drama|Horror|Thriller,610,4.0,1493848402
100832,168248,John Wick: Chapter Two (2017),Action|Crime|Thriller,610,5.0,1493850091
100833,168250,Get Out (2017),Horror,610,5.0,1494273047
100834,168252,Logan (2017),Action|Sci-Fi,610,5.0,1493846352


In [5]:
#Not all of that data is needed yet, though it will be later in the process. For now we only need the users' rating data
user_df = joined_movies_df[['user_id','movie_id','rating', 'timestamps']]
user_df

Unnamed: 0,user_id,movie_id,rating,timestamps
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


### Data Processing
The imported data now needs to be processed. The 'user_id' and 'movie_id' values are enumerated because the model works more effectively when the IDs are 0-indexed.

In [6]:
#We are going to create a unique list of every user within our dataset
User_list = user_df["user_id"].unique().tolist()
#Next we are going to enumerate each user within the list, this will show their user id and the order in which it appears within the data set.
#This is important as we may have missing users, it is also data that is used to construct the matrix later on to give recommendations.
User_list_enumerated = {x: i for i, x in enumerate(User_list)}
#Now we map the enumerated data to the user_id's and add it to the user_enumerated column
user_df["user_enumerated"] = user_df["user_id"].map(User_list_enumerated)
user_df["user_enumerated"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  user_df["user_enumerated"] = user_df["user_id"].map(User_list_enumerated)


0           0
1           0
2           0
3           0
4           0
         ... 
100831    609
100832    609
100833    609
100834    609
100835    609
Name: user_enumerated, Length: 100836, dtype: int64
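The warning shown above appears because `user_df` was created by selecting columns from another DataFrame. A minimal sketch of one way to avoid it, using a made-up toy DataFrame in place of `joined_movies_df`: take an explicit `.copy()` before adding columns. The sketch also shows that `pd.factorize` gives the same 0-based numbering as the dictionary approach, since both number values in order of first appearance.

```python
import pandas as pd

# Toy stand-in for joined_movies_df; the values are made up for illustration
ratings = pd.DataFrame({
    "user_id": [1, 1, 4, 4, 9],
    "movie_id": [10, 30, 10, 70, 30],
    "rating": [4.0, 5.0, 3.0, 2.5, 4.5],
})

# .copy() makes the selection an independent DataFrame, so adding a column
# to it does not trigger SettingWithCopyWarning
toy_df = ratings[["user_id", "movie_id", "rating"]].copy()

# Same dictionary approach as the notebook: ID -> order of first appearance
user_index = {uid: i for i, uid in enumerate(toy_df["user_id"].unique())}
toy_df["user_enumerated"] = toy_df["user_id"].map(user_index)

# pd.factorize also numbers values by order of first appearance
codes, uniques = pd.factorize(toy_df["user_id"])

print(toy_df["user_enumerated"].tolist())  # [0, 0, 1, 1, 2]
print(codes.tolist())                      # [0, 0, 1, 1, 2]
```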

The 'user_id' values are not the only data we are using: the movies are also inputs to the deep learning model. This means we also need to enumerate the 'movie_id' values so we can use them later in the process.

In [7]:
#We are going to create a unique list of every movie within our dataset
movie_list = user_df["movie_id"].unique().tolist()
#Next we are going to enumerate each movie within the list, this will show its movie id and the order in which it appears within the data set.
#This is important as we may have missing movie ids, it is also data that is used to construct the matrix later on to give recommendations.
movie_list_enumerated = {x: i for i, x in enumerate(movie_list)}
#Now we map the enumerated data to the movie_id's and add it to the movies_enumerated column
user_df["movies_enumerated"] = user_df["movie_id"].map(movie_list_enumerated)
user_df["movies_enumerated"]

0            0
1            1
2            2
3            3
4            4
          ... 
100831    3120
100832    2035
100833    3121
100834    1392
100835    2873
Name: movies_enumerated, Length: 100836, dtype: int64

The next step is to split the data and create the training data.

In [8]:
# Creating X values as "user_enumerated" & "movies_enumerated" in an array
x = user_df[["user_enumerated", "movies_enumerated"]].values
x

array([[   0,    0],
       [   0,    1],
       [   0,    2],
       ...,
       [ 609, 3121],
       [ 609, 1392],
       [ 609, 2873]], dtype=int64)

In [9]:
#Normalising the rating data between 0 and 1 makes the data easier to train

#Creating the min and max saves time when processing the data
min_rating = np.min(user_df["rating"])
max_rating = np.max(user_df["rating"])

#Normalising the rating data
y = user_df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
y

array([0.77777778, 0.77777778, 0.77777778, ..., 1.        , 1.        ,
       0.55555556])
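As a worked example of the min-max scaling above, the sketch below assumes ratings that run from 0.5 to 5.0, which is consistent with a 4.0 rating mapping to 0.7778 in the output. It uses a vectorised expression instead of `apply`, and adds a hypothetical helper, `to_rating`, that maps a model score back to the original rating scale.

```python
import numpy as np

# Toy ratings; the 0.5 to 5.0 range is an assumption for illustration
ratings = np.array([0.5, 3.0, 4.0, 5.0])
min_rating, max_rating = ratings.min(), ratings.max()

# Vectorised min-max scaling: (x - min) / (max - min)
y_scaled = (ratings - min_rating) / (max_rating - min_rating)


def to_rating(score):
    """Invert the scaling so a score in [0, 1] reads as a rating again."""
    return score * (max_rating - min_rating) + min_rating

print(np.round(y_scaled, 4))
print(to_rating(y_scaled))
```

For example, a 4.0 rating becomes (4.0 - 0.5) / (5.0 - 0.5) = 0.7778.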

In [10]:
# Splitting the data so 90% is used for training whilst 10% is held out for testing
train_indices = int(0.9 * user_df.shape[0])
train_indices

90752

In [11]:
#Creating the training data.
x_train, x_test, y_train, y_test = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
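The slicing above takes the first 90% of rows for training. Because the rows appear to be ordered by user, the held-out 10% then comes only from the last users. A minimal sketch of a shuffled split, using made-up toy arrays and NumPy only; `train_test_split` from scikit-learn, which is already imported, should do the same job with `test_size=0.1`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for x and y: 10 rows ordered by user, as in the notebook
x_toy = np.array([[u, m] for u in range(5) for m in range(2)])
y_toy = np.linspace(0.0, 1.0, len(x_toy))

# Shuffle the row positions once, then cut at 90%
order = rng.permutation(len(x_toy))
cut = int(0.9 * len(x_toy))
train_idx, test_idx = order[:cut], order[cut:]

x_tr, x_te = x_toy[train_idx], x_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]

print(x_tr.shape, x_te.shape)
```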

### Building a deep learning model

In [12]:
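#Building the deep learning model
#Note: the final layer here has 10 ReLU units, although each target in y is a single value between 0 and 1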
model = Sequential()
model.add(Dense(20, activation='relu'))
model.add(Dense(10, activation='relu'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [13]:
#Fitting the training data to the model
model.fit(x_train, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c9b051ac70>

In [14]:
#Evaluate the keras model, [Loss, Accuracy]
accuracy = model.evaluate(x_test, y_test, verbose=1)
print(accuracy)

[7.30519437789917, 0.0126933753490448]


In [15]:
#Using some of the columns from the imported data
movie_df = joined_movies_df[["movie_id","title","genres"]]
#Select the user to recommend
userId = 315
#Find movies the user has seen and movies the user has not seen
movies_watched_by_user = user_df[user_df.user_id == userId]
#Make movies not watched a pandas series so it can be read into the learning model
movies_not_watched = movie_df[~movie_df["movie_id"].isin(movies_watched_by_user.movie_id.values)]["movie_id"]

In [16]:
#Takes the movies not watched and finds their enumerated keys
movies_not_watched = list(
    set(movies_not_watched).intersection(set(movie_list_enumerated.keys()))
)
movies_not_watched = [[movie_list_enumerated.get(x)] for x in movies_not_watched]
user_encoder = User_list_enumerated.get(userId)
#Formats the users movies they haven't watched for model.predict
user_movie_array = np.hstack(
    ([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
#Predicts the best suited movies for the selected user
movie_recs = model.predict(user_movie_array).flatten()



In [17]:
#Sorts the movies into a top 10
top_10_movie_recs = movie_recs.argsort()[-10:][::-1]
top_10_movie_recs

array([ 1320, 90540, 92440, 91910, 91720, 89170, 79720, 75420, 74540,
       74110], dtype=int64)

In [18]:
#Provides a list of the movies recommended to the user
#Note: argsort returns positions in movie_recs, which are not the same as movie_df index labels
movies_to_watch = movie_df[movie_df.index.isin(top_10_movie_recs)]
movies_to_watch

Unnamed: 0,movie_id,title,genres
1320,2002,Lethal Weapon 3 (1992),Action|Comedy|Crime|Drama
74110,3566,"Big Kahuna, The (2000)",Comedy|Drama
74540,6101,Missing (1982),Drama|Mystery|Thriller
75420,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX
79720,97938,Life of Pi (2012),Adventure|Drama|IMAX
89170,2146,St. Elmo's Fire (1985),Drama|Romance
90540,193,Showgirls (1995),Drama
91720,8620,"Exterminating Angel, The (Ángel exterminador, ...",Comedy|Drama|Fantasy|Mystery
91910,4979,"Royal Tenenbaums, The (2001)",Comedy|Drama
92440,2109,"Jerk, The (1979)",Comedy
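One caveat about the lookup above: `argsort` returns positions in the score array, and those positions follow the order of the rows passed to `model.predict`, not the index labels of `movie_df`. A model with 10 output units also produces 10 scores per row once flattened. The sketch below shows one way to map positions back to movie IDs, assuming one score per candidate; the toy catalogue, scores and the name `candidate_ids` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy catalogue with one row per movie (titles are made up)
movies = pd.DataFrame({
    "movie_id": [10, 30, 70, 90],
    "title": ["A", "B", "C", "D"],
})

# Candidate movie IDs, in the same order as the rows given to the model
candidate_ids = np.array([30, 70, 90])

# Pretend model output of shape (n_candidates, 1), reduced to one score each
scores = np.array([[0.2], [0.9], [0.6]])[:, 0]

top_k = 2
top_positions = scores.argsort()[-top_k:][::-1]   # positions into candidate_ids
top_movie_ids = candidate_ids[top_positions]

# Look titles up by movie_id, keeping the ranking order
recs = movies.set_index("movie_id").loc[top_movie_ids].reset_index()
print(recs)
```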


In [19]:
#View the user's movie reviews for comparison
user_top_movies = joined_movies_df[["movie_id","title","genres","user_id","rating"]]
user_top_movies = user_top_movies[user_top_movies.user_id == userId]
#Note: sort_values returns a new DataFrame; the result is not assigned here, so the rows below are not sorted by rating
user_top_movies.sort_values("rating", ascending=False)
user_top_movies = user_top_movies[["movie_id","title","genres"]]
user_top_movies.head(10)

Unnamed: 0,movie_id,title,genres
48709,154,Beauty of the Day (Belle de jour) (1967),Drama
48710,599,"Wild Bunch, The (1969)",Adventure|Western
48711,902,Breakfast at Tiffany's (1961),Drama|Romance
48712,909,"Apartment, The (1960)",Comedy|Drama|Romance
48713,914,My Fair Lady (1964),Comedy|Drama|Musical|Romance
48714,924,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
48715,1028,Mary Poppins (1964),Children|Comedy|Fantasy|Musical
48716,1035,"Sound of Music, The (1965)",Musical|Romance
48717,1084,Bonnie and Clyde (1967),Crime|Drama
48718,1201,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western
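The table above lists the user's first ten rated movies in their original order, because `sort_values` returns a new DataFrame rather than sorting in place. A minimal sketch of the sorted version on a made-up toy DataFrame, chaining the filter, the sort and `head`:

```python
import pandas as pd

# Toy ratings table; titles and ratings are made up for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "user_id": [315, 315, 7, 315],
    "rating": [3.0, 5.0, 4.0, 4.5],
})

user_top = (
    toy[toy["user_id"] == 315]
    .sort_values("rating", ascending=False)   # returns a new, sorted DataFrame
    .head(2)
)
print(user_top["title"].tolist())  # ['B', 'D']
```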


The model trained and produced recommendations, but its results are poor: on the test set the loss is 7.31 and the accuracy is only 0.0127, so the recommendations are probably not very useful. The next part optimises the deep learning model to see whether the accuracy increases.
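One possible reason the accuracy figure is so low is the metric itself. Assuming Keras resolves `metrics=['accuracy']` to binary accuracy with a 0.5 threshold when the loss is binary cross-entropy, a prediction only counts as correct if the thresholded value equals the target exactly. With ratings scaled to fractions, only targets of exactly 0.0 or 1.0 can ever match. The NumPy sketch below reproduces that behaviour on made-up numbers; `binary_accuracy` here is a local helper, not the Keras function.

```python
import numpy as np


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of rows where the thresholded prediction equals the target exactly."""
    hard_pred = (y_pred > threshold).astype(y_true.dtype)
    return float(np.mean(hard_pred == y_true))

# Normalised ratings: only 0.0 and 1.0 can equal a thresholded prediction
y_true = np.array([0.0, 0.5555555, 0.7777777, 0.7777777, 1.0])
# Predictions that are numerically close to every target
y_pred = np.array([0.05, 0.55, 0.78, 0.77, 0.95])

print(binary_accuracy(y_true, y_pred))  # 0.4
```

Even though every prediction is close to its target, only two of the five rows count as correct. A regression metric such as mean absolute error may be a better fit for this kind of target.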

### Optimisation 1
Increasing the number of epochs and setting a smaller batch size.

In [20]:
#Building the deep learning model
model_opt1 = Sequential()
model_opt1.add(Dense(20, activation='relu'))
model_opt1.add(Dense(10, activation='relu'))
model_opt1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [21]:
#Fitting the training data to the model
model_opt1.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c9b184aac0>

In [22]:
#Evaluate the keras model, [Loss, Accuracy]
accuracy = model_opt1.evaluate(x_test, y_test, verbose=1)
print(accuracy)

[7.568582534790039, 0.012495041824877262]


Setting a batch size of 10 and increasing the epochs from 5 to 10 did not improve the model in this run: the accuracy is 0.0125, compared with 0.0127 for the initial model, and the loss rose slightly from 7.31 to 7.57. The loss is still high, meaning the results might not be very useful.

### Optimisation 2
Adding more layers and increasing the number of units in each layer.

In [23]:
#Building the deep learning model
model_opt2 = Sequential()
model_opt2.add(Dense(80, activation='relu'))
model_opt2.add(Dense(40, activation='relu'))
model_opt2.add(Dense(20, activation='relu'))
model_opt2.add(Dense(10, activation='relu'))
model_opt2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [24]:
#Fitting the training data to the model
model_opt2.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c9b1ed0a90>

In [25]:
#Evaluate the keras model, [Loss, Accuracy]
accuracy = model_opt2.evaluate(x_test, y_test, verbose=1)
print(accuracy)

[6.931399822235107, 0.08905196189880371]


This was disappointing. The extra layers and units lowered the loss only from 7.57 to 6.93, which is still almost 7, and although the accuracy rose from 0.0125 to 0.0891 it remains very low. Adding another layer with a different activation might help.

### Optimisation 3
Adding a sigmoid output layer.

In [26]:
#Building the deep learning model
model_opt3 = Sequential()
model_opt3.add(Dense(80, activation='relu'))
model_opt3.add(Dense(40, activation='relu'))
model_opt3.add(Dense(20, activation='relu'))
model_opt3.add(Dense(10, activation='relu'))
model_opt3.add(Dense(1, activation='sigmoid'))
model_opt3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [27]:
#Fitting the training data to the model
model_opt3.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c9b17ea130>

In [28]:
#Evaluate the keras model, [Loss, Accuracy]
accuracy = model_opt3.evaluate(x_test, y_test, verbose=1)
print(accuracy)

[0.6715357303619385, 0.08905196189880371]


The sigmoid layer was a great improvement for the loss, since the model now ends in a single output between 0 and 1, which matches the normalised ratings. The loss dropped from 6.93 to 0.6715, while the accuracy stayed at 0.0891. These are not particularly strong results by themselves, but compared with the previous results the lower loss is a good improvement.
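A loss of about 0.67 needs some context, because binary cross-entropy cannot reach zero when the targets are fractions rather than 0 or 1. The sketch below, which assumes the standard per-sample formula and a 4.0 rating scaled to 3.5 / 4.5, compares the loss for a perfect prediction with the loss for an uninformed prediction of 0.5. The helper `bce` is written here for illustration and is not the Keras implementation.

```python
import numpy as np


def bce(y_true, y_pred, eps=1e-7):
    """Per-sample binary cross-entropy for a target and prediction in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = 3.5 / 4.5            # a 4.0 rating after min-max scaling, about 0.7778

best = bce(y, y)         # prediction exactly equal to the target
uninformed = bce(y, 0.5) # prediction that ignores the inputs

print(round(float(best), 4), round(float(uninformed), 4))
```

For this target even a perfect prediction gives a loss of roughly 0.53, and always predicting 0.5 gives ln 2, about 0.69. A test loss in the 0.66 to 0.67 range is therefore not far from what an uninformed model would score.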

### Optimisation 4
Adding more ReLU layers, as they seemed more effective, and lowering the number of units in each layer.

In [29]:
#Building the deep learning model
model_opt4 = Sequential()
model_opt4.add(Dense(55, activation='relu'))
model_opt4.add(Dense(50, activation='relu'))
model_opt4.add(Dense(45, activation='relu'))
model_opt4.add(Dense(40, activation='relu'))
model_opt4.add(Dense(20, activation='relu'))
model_opt4.add(Dense(10, activation='relu'))
model_opt4.add(Dense(1, activation='sigmoid'))
model_opt4.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])


In [30]:
#Fitting the training data to the model
model_opt4.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c9b2f7c130>

In [31]:
#Evaluate the keras model, [Loss, Accuracy]
accuracy = model_opt4.evaluate(x_test, y_test, verbose=0)
print(accuracy)

[0.6636435389518738, 0.08905196189880371]


The accuracy stayed at 0.0891 and the loss fell only slightly, from 0.6715 to 0.6636, so overall the final optimisation had minimal effect on the results.

### Post optimisation results
After optimising the model, the recommendation results are as follows.

In [32]:
#Using some of the columns from the imported data
movie_df = joined_movies_df[["movie_id","title","genres"]]
#Select the user to recommend
userId = 315
#Find movies the user has seen and movies the user has not seen
movies_watched_by_user = user_df[user_df.user_id == userId]
#Make movies not watched a pandas series so it can be read into the learning model
movies_not_watched = movie_df[~movie_df["movie_id"].isin(movies_watched_by_user.movie_id.values)]["movie_id"]

In [33]:
#Takes the movies not watched and finds their enumerated keys
movies_not_watched = list(
    set(movies_not_watched).intersection(set(movie_list_enumerated.keys()))
)
movies_not_watched = [[movie_list_enumerated.get(x)] for x in movies_not_watched]
user_encoder = User_list_enumerated.get(userId)
#Formats the users movies they haven't watched for model.predict
user_movie_array = np.hstack(
    ([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
#Predicts the best suited movies for the selected user
movie_recs = model_opt4.predict(user_movie_array).flatten()



In [34]:
#Sorts the movies into a top 10
top_10_movie_recs = movie_recs.argsort()[-10:][::-1]
top_10_movie_recs

array([1035, 1030, 1029, 1026, 1036,  940,  931, 1024, 1098,  941],
      dtype=int64)

In [35]:
#Provides a list of the movies recommended to the user
#Note: argsort returns positions in movie_recs, which are not the same as movie_df index labels
movies_to_watch = movie_df[movie_df.index.isin(top_10_movie_recs)]
movies_to_watch

Unnamed: 0,movie_id,title,genres
931,3623,Mission: Impossible II (2000),Action|Adventure|Thriller
940,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...
941,4310,Pearl Harbor (2001),Action|Drama|Romance|War
1024,49286,"Holiday, The (2006)",Comedy|Romance
1026,2,Jumanji (1995),Adventure|Children|Fantasy
1029,21,Get Shorty (1995),Comedy|Crime|Thriller
1030,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
1035,110,Braveheart (1995),Action|Drama|War
1036,141,"Birdcage, The (1996)",Comedy
1098,5218,Ice Age (2002),Adventure|Animation|Children|Comedy


In [36]:
#View the user's movie reviews for comparison
user_top_movies = joined_movies_df[["movie_id","title","genres","user_id","rating"]]
user_top_movies = user_top_movies[user_top_movies.user_id == userId]
#Note: sort_values returns a new DataFrame; the result is not assigned here, so the rows below are not sorted by rating
user_top_movies.sort_values("rating", ascending=False)
user_top_movies = user_top_movies[["movie_id","title","genres"]]
user_top_movies.head(10)

Unnamed: 0,movie_id,title,genres
48709,154,Beauty of the Day (Belle de jour) (1967),Drama
48710,599,"Wild Bunch, The (1969)",Adventure|Western
48711,902,Breakfast at Tiffany's (1961),Drama|Romance
48712,909,"Apartment, The (1960)",Comedy|Drama|Romance
48713,914,My Fair Lady (1964),Comedy|Drama|Musical|Romance
48714,924,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
48715,1028,Mary Poppins (1964),Children|Comedy|Fantasy|Musical
48716,1035,"Sound of Music, The (1965)",Musical|Romance
48717,1084,Bonnie and Clyde (1967),Crime|Drama
48718,1201,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western


#### Final statistical analysis

In [37]:
#model.evaluate returns [loss, accuracy] for this model, not an F1 score and precision
loss, accuracy = model_opt4.evaluate(x_test, y_test, verbose=0)
print(f"Loss = {loss}, Accuracy = {accuracy}")

Loss = 0.6636435389518738, Accuracy = 0.08905196189880371


In [38]:
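#A second evaluate call returns the same [loss, accuracy] pair; it is not a recall score
loss_and_accuracy = model_opt4.evaluate(x_test, y_test, verbose=0)
print(f"[Loss, Accuracy] = {loss_and_accuracy}")

[Loss, Accuracy] = [0.6636435389518738, 0.08905196189880371]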
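`model.evaluate` returns the loss followed by the compiled metrics, which here is accuracy, so it does not report precision, recall or F1. A minimal sketch of how those three could be computed instead, assuming we treat a normalised rating at or above a cutoff as "liked". The cutoff of 0.7, the toy arrays and the helper name `precision_recall_f1` are all hypothetical choices for illustration.

```python
import numpy as np


def precision_recall_f1(liked_true, liked_pred):
    """Precision, recall and F1 from two boolean arrays of the same length."""
    tp = np.sum(liked_true & liked_pred)     # liked and predicted liked
    fp = np.sum(~liked_true & liked_pred)    # not liked but predicted liked
    fn = np.sum(liked_true & ~liked_pred)    # liked but predicted not liked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return float(precision), float(recall), float(f1)

# Made-up normalised ratings and model scores
y_true = np.array([1.0, 0.7778, 0.3333, 0.5556, 0.8889])
y_pred = np.array([0.9, 0.6, 0.8, 0.4, 0.75])

cutoff = 0.7  # hypothetical threshold for "the user liked this movie"
precision, recall, f1 = precision_recall_f1(y_true >= cutoff, y_pred >= cutoff)
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}")
```

In the notebook, the same function could be applied to `y_test` and the flattened output of `model_opt4.predict(x_test)`.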
