# Neural Collaborative Filter
In this notebook, we will develop a neural collaborative folder that aims to recommend a movie to a user based on their previous watching patterns. We will use two architectures, one an item-based recommender and the other a user-based recommender. An item-based recommender will take the genres of movies and find similar movies based on the user's preferences. A user-based recommender will find other similar users based on this like their age, occupation and gender and then recommend movies that other similar people have enjoyed.

In [2]:
pip install keras-tuner

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import load_model

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

We will now import our test and train data sets. The test data set takes the latest 20% of ratings from each user. So this data is not random, but aims to simulate what would happen in real life. There is another train/test split available which just has the test data as the latest 20% of ratings. We will compare the difference in performance of the models for using each data split later in the project. However, for now we will only use the split that takes the last 20% of raing from each individual user.

In [4]:
train_df = pd.read_csv('../Katherine W/dataSets/user_train_df.csv')
test_df = pd.read_csv('../Katherine W/dataSets/user_test_df.csv')


we now prepare the input data for the first model

In [5]:
train_df.index = range(1, len(train_df) + 1)
test_df.index = range(1, len(test_df) + 1)

In [6]:
num_users = 943
num_items = 1682
num_genres = 19
num_occupations = train_df['Occupation'].nunique()
num_genders = train_df['Gender'].nunique()


In [7]:
# Convert all specified columns to numeric, setting errors='coerce' to convert non-numeric values to NaN
genre_columns = ['Fantasy', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western','Film-Noir','Unknown','Action','Adeventure','Animation','Childrens','Comedy','Crime','Documentary','Drama']

train_df[genre_columns] = train_df[genre_columns].apply(pd.to_numeric, errors='coerce')

# Drop rows where any of the specified columns contain NaN values
train_df = train_df.dropna(subset=genre_columns)


test_df[genre_columns] = test_df[genre_columns].apply(pd.to_numeric, errors='coerce')
test_df = test_df.dropna(subset=genre_columns)

In [8]:
# Encoders
occupation_encoder = LabelEncoder()
gender_encoder = LabelEncoder()

In [9]:
# Prepare Input Features
genre_features = genre_columns
train_genre_input = train_df[genre_features].values
test_genre_input = test_df[genre_features].values
train_df['Occupation'] = occupation_encoder.fit_transform(train_df['Occupation'])
test_df['Occupation'] = occupation_encoder.transform(test_df['Occupation'])
train_df['Gender'] = gender_encoder.fit_transform(train_df['Gender'])
test_df['Gender'] = gender_encoder.fit_transform(test_df['Gender'])

Now to build the model

In [10]:

# Model Architecture
# Inputs
user_input = Input(shape=(1,), name='User_Input')
item_input = Input(shape=(1,), name='Item_Input')
genre_input = Input(shape=(num_genres,), name='Genre_Input')
age_input = Input(shape=(1,), name='Age_Input')
occupation_input = Input(shape=(1,), name='Occupation_Input')
gender_input = Input(shape=(1,), name='Gender Input')

In [11]:

# Embedding layers for user and item
user_embedding = Embedding(num_users+1, 50, name='User_Embedding')(user_input)
item_embedding = Embedding(num_items+1, 50, name='Item_Embedding')(item_input)
age_embedding = Embedding(110, 50, name='Age_Embedding')(age_input)
occupation_embedding = Embedding(num_occupations, 50, name='Occupation_Embedding')(occupation_input)
gender_embedding = Embedding(num_genders, 50, name='Gender_Embedding')(gender_input)

In [12]:
# Flatten embeddings
user_vec = Flatten()(user_embedding)
item_vec = Flatten()(item_embedding)
age_vec = Flatten()(age_embedding)
occupation_vec = Flatten()(occupation_embedding)
gender_vec = Flatten()(gender_embedding)

In [13]:
# Concatenate embeddings and genre input
concat = Concatenate()([user_vec, item_vec, genre_input,age_vec,occupation_vec,gender_vec])

In [14]:
# Dense layers
dense1 = Dense(128, activation='relu')(concat)
dropout1 = Dropout(0.3)(dense1)
dense2 = Dense(64, activation='relu')(dropout1)

output = Dense(1, activation='sigmoid')(dense2)

In [15]:
# Model definition
model = Model(inputs=[user_input, item_input, genre_input,age_input,occupation_input,gender_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


Prepare the training data. If the rating is above 3, we claim that the user enjoyed the film. We treat this as a binary variable, (1 if the user enjoyed the film and 0 otherwise)

In [16]:
# Preparing training data
train_user_input = train_df['User ID'].values
train_item_input = train_df['Item ID'].values
train_age_input = train_df['Age'].values
train_occupation_input = train_df['Occupation'].values
train_gender_input = train_df['Gender'].values
train_ratings = (train_df['Rating'] > 3).astype(int).values  # Binary rating: 1 if >3, else 0



test_user_input = test_df['User ID'].values
test_item_input = test_df['Item ID'].values
test_age_input = test_df['Age'].values
test_occuption_input = test_df['Occupation'].values
test_gender_input = test_df['Gender'].values
test_ratings = (test_df['Rating'] > 3).astype(int).values


In [17]:
model.summary()

Now to train the model

In [18]:
# Training the model
history = model.fit(
    [train_user_input, train_item_input, train_genre_input, train_age_input,train_occupation_input,train_gender_input],
    train_ratings,
    validation_data=([test_user_input, test_item_input, test_genre_input, test_age_input,test_occuption_input,test_gender_input], test_ratings),
    epochs=10,
    batch_size=64
)

Epoch 1/10




[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - accuracy: 0.6340 - loss: 0.6330 - val_accuracy: 0.7482 - val_loss: 0.5205
Epoch 2/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.7251 - loss: 0.5449 - val_accuracy: 0.7626 - val_loss: 0.4896
Epoch 3/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.7369 - loss: 0.5273 - val_accuracy: 0.7784 - val_loss: 0.4639
Epoch 4/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.7499 - loss: 0.5070 - val_accuracy: 0.7861 - val_loss: 0.4470
Epoch 5/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.7643 - loss: 0.4853 - val_accuracy: 0.8033 - val_loss: 0.4330
Epoch 6/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.7711 - loss: 0.4707 - val_accuracy: 0.8164 - val_loss: 0.4055
Epoch 7/10
[1m499/499[0m [32m━━━━━━━

The model has now been built and trained. Now to test it out. We take a random user, user_id, to see how the model behaves

In [21]:
# Predicting Recommendations for a User
user_id = 212  # Example: User ID for which to recommend
user_movies = np.array(range(num_items))  # All movies
# Find the age corresponding to the user ID in the train_df
user_age = train_df.loc[train_df['User ID'] == user_id, 'Age'].values[0]
user_occupation = train_df.loc[train_df['User ID'] == user_id, 'Occupation'].values[0]
user_gender = train_df.loc[train_df['User ID']== user_id, 'Gender'].values[0]

In [20]:
# Combine the DataFrames vertically (stacking them on top of each other)
combined_df = pd.concat([train_df, test_df], ignore_index=True)

genre_df = combined_df.copy()

genre_df.drop(columns=['timestamp','Age','Gender','Occupation','zip code','Release Date','URL','Movie Title','User ID','Item ID','Rating'],inplace=True)

# Genres for all movies
movie_genres_input = genre_df[:user_movies.shape[0]]

In [22]:
# Ensure all inputs are properly shaped and converted to the correct dtype
user_input_predict = np.full((user_movies.shape[0], 1), user_id, dtype=np.int32)  # Shape: (num_items, 1)
item_input_predict = user_movies.reshape(-1, 1).astype(np.int32)  # Shape: (num_items, 1)
movie_genres_input = movie_genres_input.astype(np.float32)  # Ensure genre input is float32
age_input_predict = np.full((user_movies.shape[0], 1), user_age, dtype=np.int32)  # Shape: (num_items, 1)
occupation_input_predict = np.full((user_movies.shape[0], 1), user_occupation, dtype=np.int32)  # Shape: (num_items, 1)
gender_input_predict = np.full((user_movies.shape[0], 1), user_gender, dtype=np.int32)  # Shape: (num_items, 1)

In [23]:
# Predict scores
predicted_scores = model.predict([np.full(user_movies.shape, user_id), user_movies, movie_genres_input,age_input_predict,occupation_input_predict,gender_input_predict])


[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step



[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step


In [24]:
# Recommend Top-5 Movies
recommended_movies = np.argsort(-predicted_scores.flatten())[:10]
print("Top 10 recommended movies for User ID", user_id, ":", recommended_movies)

Top 10 recommended movies for User ID 212 : [1500 1466 1368 1467 1599 1497 1625  174 1600  868]


Now we print the details of the recommended movies:

In [25]:
filtered_df = train_df[train_df['Item ID'].isin(recommended_movies)]

filtered_df.drop_duplicates(subset='Item ID', inplace=True)

filtered_df.drop(columns=['User ID','Rating','timestamp','Age','Gender','Occupation','zip code'],inplace=True)
filtered_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.drop_duplicates(subset='Item ID', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.drop(columns=['User ID','Rating','timestamp','Age','Gender','Occupation','zip code'],inplace=True)


Unnamed: 0,Item ID,Movie Title,Release Date,URL,Unknown,Action,Adeventure,Animation,Childrens,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
53,174,Raiders of the Lost Ark (1981),01-Jan-1981,http://us.imdb.com/M/title-exact?Raiders%20of%...,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
618,868,Hearts and Minds (1996),10-Jan-1997,http://us.imdb.com/M/title-exact?Hearts%20and%...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5538,1368,Mina Tannenbaum (1994),01-Jan-1994,http://us.imdb.com/M/title-exact?Mina%20Tannen...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7652,1466,Margarets Museum (1995)|01-Jan-1995||http://us...,0,0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7706,1467,"Saint of Fort Washington, The (1993)",01-Jan-1993,http://us.imdb.com/M/title-exact?Saint%20of%20...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9123,1500,Santa with Muscles (1996),08-Nov-1996,http://us.imdb.com/M/title-exact?Santa%20with%...,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9226,1497,"Line King: Al Hirschfeld, The (1996)",11-Oct-1996,"http://us.imdb.com/M/title-exact?Line%20King,%...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16044,1599,Someone Elses America (1995)|10-May-1996||http...,0,0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16109,1600,Guantanamera (1994),16-May-1997,http://us.imdb.com/M/title-exact?Guantanamera%...,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21110,1625,Nightwatch (1997),22-Apr-1997,http://us.imdb.com/M/title-exact?Nightwatch%20...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Now we look at the genres for each of these movies:

In [26]:
# Function to get column names with positive entries for each row
def get_positive_columns(row):
    return [col for col in row.index if row[col] == 1]

# Apply the function to each row
positive_columns_per_row = filtered_df.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

Row 0 has positive values in columns: ['Action', 'Adeventure']
Row 1 has positive values in columns: ['Drama']
Row 2 has positive values in columns: ['Drama']
Row 3 has positive values in columns: ['Comedy']
Row 4 has positive values in columns: ['Drama']
Row 5 has positive values in columns: ['Comedy']
Row 6 has positive values in columns: ['Documentary']
Row 7 has positive values in columns: ['Comedy']
Row 8 has positive values in columns: ['Comedy']
Row 9 has positive values in columns: ['Horror', 'Thriller']


Compare the genres to see if they are similar to the movies in the test data that the user has watched and liked.

In [27]:
true_ratings = test_df.loc[test_df['User ID'] == user_id].sort_values(by=['Rating'], ascending=False).head(5)

true_ratings.drop(columns=['User ID','URL','Rating','timestamp','Age','Gender','Occupation','zip code','Release Date'], inplace = True)

true_ratings

Unnamed: 0,Item ID,Movie Title,Unknown,Action,Adeventure,Animation,Childrens,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1267,423,E.T. the Extra-Terrestrial (1982),0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1266,179,"Clockwork Orange, A (1971)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [28]:
# Apply the function to each row
positive_columns_per_row = true_ratings.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

Row 0 has positive values in columns: ['Childrens', 'Drama', 'Fantasy', 'Sci-Fi']
Row 1 has positive values in columns: ['Sci-Fi']


## Look at tuning hyper-paramaters
We now look to optimise our model.

In [37]:
# Assuming df has been prepared with 'UserID', 'ItemID', 'Rating', and genre columns

# HyperModel for the neural network
class CollaborativeFilterHyperModel(kt.HyperModel):
    def build(self, hp):
        # Input dimensions
        num_users = 943
        num_items = 1682
        num_genres = 19
        num_occupations = train_df['Occupation'].nunique()
        num_genders = train_df['Gender'].nunique()
        # Model inputs
        user_input = Input(shape=(1,), name='User_Input')
        item_input = Input(shape=(1,), name='Item_Input')
        genre_input = Input(shape=(num_genres,), name='Genre_Input')
        occupation_input = Input(shape=(1,), name='Occupation_Input')
        gender_input = Input(shape=(1,), name='Gender_Input')

        # Hyperparameters for embeddings and dense layers
        user_embedding_dim = hp.Int('user_embedding_dim', min_value=32, max_value=128, step=32)
        item_embedding_dim = hp.Int('item_embedding_dim', min_value=32, max_value=128, step=32)
        age_embedding_dim = hp.Int('age_embedding_dim', min_value=32, max_value=128, step=32)
        occupation_embedding_dim = hp.Int('occupation_embedding_dim', min_value=32, max_value=128, step=32)
        gender_embedding_dim = hp.Int('gender_embedding_dim', min_value=32, max_value=128, step=32)
        
        # Embedding layers for user and item
        user_embedding = Embedding(num_users+1, user_embedding_dim, name='User_Embedding')(user_input)
        item_embedding = Embedding(num_items+1, item_embedding_dim, name='Item_Embedding')(item_input)
        age_embedding = Embedding(110, age_embedding_dim, name='Age_Embedding')(age_input)
        occupation_embedding = Embedding(num_occupations, occupation_embedding_dim, name='Occupation_Embedding')(occupation_input)
        gender_embedding = Embedding(num_genders, gender_embedding_dim, name='Gender_Embedding')(gender_input)
        
        # Flatten the embeddings
        user_vec = Flatten()(user_embedding)
        item_vec = Flatten()(item_embedding)
        age_vec = Flatten()(age_embedding)
        occupation_vec = Flatten()(occupation_embedding)
        gender_vec = Flatten()(gender_embedding)
        
        # Concatenate embeddings with genre input
        concat = Concatenate()([user_vec, item_vec, genre_input,age_vec,occupation_vec,gender_vec])

        # Dense layers with hyperparameter search
        dense1_units = hp.Int('dense1_units', min_value=64, max_value=512, step=64)
        dense2_units = hp.Int('dense2_units', min_value=32, max_value=256, step=32)

        dense1 = Dense(dense1_units, activation='relu')(concat)
        dropout1 = Dropout(hp.Float('dropout1', min_value=0.2, max_value=0.5, step=0.1))(dense1)
        dense2 = Dense(dense2_units, activation='relu')(dropout1)
        
        # Output layer
        output = Dense(1, activation='linear')(dense2)  # Rating is continuous, use linear activation

        # Model definition
        model = Model(inputs=[user_input, item_input, genre_input,age_input,occupation_input,gender_input], outputs=output)

        # Compile the model with an optimizer and loss function
        model.compile(optimizer=Adam(), loss='mse', metrics=['mae'])

        return model

# Preparing the training and testing data (just like before)


train_user_input = train_df['User ID'].values
train_item_input = train_df['Item ID'].values
train_genre_input = train_df[genre_columns].values
train_ratings = train_df['Rating'].values
train_age_input = train_df['Age'].values
train_occupation_input = train_df['Occupation'].values
train_gender_input = train_df['Gender'].values

test_user_input = test_df['User ID'].values
test_item_input = test_df['Item ID'].values
test_genre_input = test_df[genre_columns].values
test_ratings = test_df['Rating'].values
test_age_input = test_df['Age'].values
test_occupation_input = test_df['Occupation'].values
test_gender_input = test_df['Gender'].values

# Instantiate the tuner
tuner = kt.RandomSearch(
    CollaborativeFilterHyperModel(),
    objective='val_mae',  # We are optimizing for Mean Absolute Error
    max_trials=5,  # Number of different hyperparameter combinations to try
    executions_per_trial=3,  # Number of executions for each trial
    #directory='"C:/Users/kwhit/OneDrive/Documents/Maths 4th year/Data Science Toolbox 2024/Group-Assignment-2/Katherine W"',  # Directory to save tuning results
    project_name='collab_filter_search_all_features'  # Project name for Keras Tuner
)

# Search for the best hyperparameters
tuner.search(
    [train_user_input, train_item_input, train_genre_input,train_age_input,train_occupation_input,train_gender_input],
    train_ratings,
    validation_data=([test_user_input, test_item_input, test_genre_input,test_age_input,test_occupation_input,test_gender_input], test_ratings),
    epochs=10,
    batch_size=64
)

# Retrieve the best hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
print("Best hyperparameters:", best_hp.values)

# Build the model with the best hyperparameters
best_model = tuner.hypermodel.build(best_hp)

# Train the model using the best hyperparameters
history = best_model.fit(
    [train_user_input, train_item_input, train_genre_input,train_age_input,train_occupation_input,train_gender_input],
    train_ratings,
    validation_data=([test_user_input, test_item_input, test_genre_input,test_age_input,test_occupation_input,test_gender_input], test_ratings),
    epochs=10,
    batch_size=64
)

# Evaluate the best model
test_loss, test_mae = best_model.evaluate([test_user_input, test_item_input, test_genre_input,test_age_input,test_occupation_input,test_gender_input], test_ratings)
print(f"Test MAE: {test_mae}")

Trial 5 Complete [00h 00m 28s]
val_mae: 0.6559593478838602

Best val_mae So Far: 0.5421910683314005
Total elapsed time: 00h 03m 05s
Best hyperparameters: {'user_embedding_dim': 32, 'item_embedding_dim': 64, 'age_embedding_dim': 96, 'occupation_embedding_dim': 32, 'gender_embedding_dim': 64, 'dense1_units': 384, 'dense2_units': 96, 'dropout1': 0.2}
Epoch 1/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 2.0715 - mae: 1.0788 - val_loss: 0.9057 - val_mae: 0.7473
Epoch 2/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 0.9147 - mae: 0.7550 - val_loss: 0.8358 - val_mae: 0.7265
Epoch 3/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 0.8632 - mae: 0.7347 - val_loss: 0.7931 - val_mae: 0.7028
Epoch 4/10
[1m499/499[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 0.8270 - mae: 0.7187 - val_loss: 0.7334 - val_mae: 0.6825
Epoch 5/10
[1m499/499[0m [32m━━━━━━━━━━

In [38]:
# Save the best model to a file
best_model.save('best_item_collab_filter_model.h5')
print("Model saved!")



Model saved!


In [39]:
from tensorflow.keras.losses import MeanSquaredError

# Register the custom metric
mse = MeanSquaredError()

# Load the saved model
item_model = load_model('best_item_collab_filter_model.h5', custom_objects={'mse': mse})
print("Model loaded!")



Model loaded!


In [40]:
item_model.summary()

Now time to test our new optimised model in a similar way to before:

In [41]:
# Predicting Recommendations for a User
user_id = 76  # Example: User ID for which to recommend
user_movies = np.array(range(num_items))  # All movies

In [42]:
# Combine the DataFrames vertically (stacking them on top of each other)
combined_df = pd.concat([train_df, test_df], ignore_index=True)

genre_df = combined_df.copy()

genre_df.drop(columns=['timestamp','Age','Gender','Occupation','zip code','Release Date','URL','Movie Title','User ID','Item ID','Rating'],inplace=True)

In [44]:
# Genres for all movies
movie_genres_input = genre_df[:user_movies.shape[0]]

In [43]:
# Find the age corresponding to the user ID in the train_df
user_age = train_df.loc[train_df['User ID'] == user_id, 'Age'].values[0]
print(f"The age corresponding to User ID {user_id} is: {user_age}")
# Find the occupation corresponding to the user ID in the train_df
user_occupation = train_df.loc[train_df['User ID'] == user_id, 'Occupation'].values[0]
print(f"The occupation corresponding to User ID {user_id} is: {user_occupation}")
# Find the gender corresponding to the user ID in the train_df
user_gender = train_df.loc[train_df['User ID'] == user_id, 'Gender'].values[0]
print(f"The gender corresponding to User ID {user_id} is: {user_gender}")

The age corresponding to User ID 76 is: 20
The occupation corresponding to User ID 76 is: 18
The gender corresponding to User ID 76 is: 1


In [47]:
# Ensure all inputs are properly shaped and converted to the correct dtype
user_input_predict = np.full((user_movies.shape[0], 1), user_id, dtype=np.int32)  # Shape: (num_items, 1)
item_input_predict = user_movies.reshape(-1, 1).astype(np.int32)  # Shape: (num_items, 1)
movie_genres_input = movie_genres_input.astype(np.float32)  # Ensure genre input is float32
age_input_predict = np.full((user_movies.shape[0], 1), user_age, dtype=np.int32)  # Shape: (num_items, 1)
occupation_input_predict = np.full((user_movies.shape[0], 1), user_occupation, dtype=np.int32)  # Shape: (num_items, 1)
gender_input_predict = np.full((user_movies.shape[0], 1), user_gender, dtype=np.int32)  # Shape: (num_items, 1)

In [48]:
# Predict scores
predicted_scores = item_model.predict([np.full(user_movies.shape, user_id), user_movies, movie_genres_input,age_input_predict,occupation_input_predict,gender_input_predict])


[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step



[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step


In [49]:
# Recommend Top-5 Movies
recommended_movies = np.argsort(-predicted_scores.flatten())[:10]
print("Top 10 recommended movies for User ID", user_id, ":", recommended_movies)

Top 10 recommended movies for User ID 76 : [1458 1467  868 1064 1646 1396  178 1639  169  427]


In [50]:
filtered_df = train_df[train_df['Item ID'].isin(recommended_movies)]

filtered_df.drop_duplicates(subset='Item ID', inplace=True)
filtered_df.drop(columns=['User ID','Rating','timestamp','Age','Gender','Occupation','zip code'],inplace=True)

filtered_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.drop_duplicates(subset='Item ID', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.drop(columns=['User ID','Rating','timestamp','Age','Gender','Occupation','zip code'],inplace=True)


Unnamed: 0,Item ID,Movie Title,Release Date,URL,Unknown,Action,Adeventure,Animation,Childrens,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
104,169,"Wrong Trousers, The (1993)",01-Jan-1993,http://us.imdb.com/M/title-exact?Wrong%20Trous...,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
108,178,12 Angry Men (1957),01-Jan-1957,http://us.imdb.com/M/title-exact?12%20Angry%20...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
204,427,To Kill a Mockingbird (1962),01-Jan-1962,http://us.imdb.com/M/title-exact?To%20Kill%20a...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
618,868,Hearts and Minds (1996),10-Jan-1997,http://us.imdb.com/M/title-exact?Hearts%20and%...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1662,1064,Crossfire (1947),01-Jan-1947,http://us.imdb.com/M/title-exact?Crossfire%20(...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5725,1396,Stonewall (1995),26-Jul-1996,http://us.imdb.com/M/title-exact?Stonewall%20(...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7469,1458,"Damsel in Distress, A (1937)",01-Jan-1937,http://us.imdb.com/M/title-exact?Damsel%20in%2...,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
7706,1467,"Saint of Fort Washington, The (1993)",01-Jan-1993,http://us.imdb.com/M/title-exact?Saint%20of%20...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23477,1639,Bitter Sugar (Azucar Amargo) (1996),22-Nov-1996,http://us.imdb.com/M/title-exact?Bitter%20Suga...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23565,1646,Men With Guns (1997),06-Mar-1998,http://us.imdb.com/Title?Men+with+Guns+(1997/I),0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [51]:
# Function to get column names with positive entries for each row
def get_positive_columns(row):
    return [col for col in row.index if row[col] == 1]

# Apply the function to each row
positive_columns_per_row = filtered_df.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

Row 0 has positive values in columns: ['Animation', 'Comedy']
Row 1 has positive values in columns: ['Drama']
Row 2 has positive values in columns: ['Drama']
Row 3 has positive values in columns: ['Drama']
Row 4 has positive values in columns: ['Crime', 'Film-Noir']
Row 5 has positive values in columns: ['Drama']
Row 6 has positive values in columns: ['Comedy', 'Musical', 'Romance']
Row 7 has positive values in columns: ['Drama']
Row 8 has positive values in columns: ['Drama']
Row 9 has positive values in columns: ['Action', 'Drama']


In [52]:
true_ratings = test_df.loc[test_df['User ID'] == user_id].sort_values(by=['Rating'], ascending=False).head(5)

true_ratings.drop(columns=['User ID','URL','Rating','timestamp','Age','Gender','Occupation','zip code','Release Date'], inplace = True)

true_ratings

Unnamed: 0,Item ID,Movie Title,Unknown,Action,Adeventure,Animation,Childrens,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
508,582,"Piano, The (1993)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
506,223,Sling Blade (1996),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
507,1048,Shes the One (1996)|23-Aug-1996||http://us.imd...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
# Apply the function to each row
positive_columns_per_row = true_ratings.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

Row 0 has positive values in columns: ['Drama', 'Romance']
Row 1 has positive values in columns: ['Drama', 'Thriller']
Row 2 has positive values in columns: ['Adeventure', 'Horror']


# Look at coverage

In [55]:

# Step 1: Initialize a set to track recommended item IDs
recommended_items = set()

# Step 2: Loop through all users in the dataset
for user_id in train_df['User ID'].unique():
    user_movies = np.array(range(num_items))  # All movies
    user_age = train_df.loc[train_df['User ID'] == user_id, 'Age'].values[0]
    user_occupation = train_df.loc[train_df['User ID'] == user_id, 'Occupation'].values[0]
    user_gender = train_df.loc[train_df['User ID']==user_id, 'Gender'].values[0]
    # Combine the DataFrames vertically (stacking them on top of each other)
    combined_df = pd.concat([train_df, test_df], ignore_index=True)

    genre_df = combined_df.copy()
    genre_df.drop(columns=['timestamp', 'Age', 'Gender', 'Occupation', 'zip code', 
                           'Release Date', 'URL', 'Movie Title', 'User ID', 
                           'Item ID', 'Rating'], inplace=True)
    
    # Genres for all movies
    movie_genres_input = genre_df[:user_movies.shape[0]]
    
    # Ensure all inputs are properly shaped and converted to the correct dtype
    user_input_predict = np.full((user_movies.shape[0],), user_id, dtype=np.int32)  # Shape: (num_items,)
    item_input_predict = user_movies.astype(np.int32)  # Shape: (num_items,)
    movie_genres_input = movie_genres_input.astype(np.float32)  # Ensure genre input is float32
    age_input_predict = np.full((user_movies.shape[0], 1), user_age, dtype=np.int32)  # Shape: (num_items,)
    occupation_input_predict = np.full((user_movies.shape[0], 1), user_occupation, dtype=np.int32)  # Shape: (num_items,)
    gender_input_predict = np.full((user_movies.shape[0],1), user_gender, dtype=np.int32)  # Shape: (num_items,)
    # Predict scores
    predicted_scores = item_model.predict([user_input_predict, item_input_predict, movie_genres_input,age_input_predict,occupation_input_predict,gender_input_predict]).flatten()

    # Convert predictions to NumPy for easier manipulation
    predicted_scores = predicted_scores.flatten()

    # Get top N recommendations (e.g., N=10)
    top_n_indices = np.argsort(predicted_scores)[::-1][:10]  # Get indices of top N scores

    # Add the top recommended items to the set
    recommended_items.update(user_movies[top_n_indices])  # Use user_movies to map indices to actual item IDs

# Step 3: Calculate coverage
coverage = (len(recommended_items) / num_items) * 100
print(f"Coverage: {coverage:.2f}%")


[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step



[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 902us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 941us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 943us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 960us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 904us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 990us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 962us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━

In [56]:
len(recommended_items)

193

In [57]:
print(f"Coverage: {coverage:.2f}%")

Coverage: 11.47%


## MSE of model predictions

In [58]:

# Step 1: Initialize lists to store the results
user_ids = []
item_ids = []
actual_ratings = []
predicted_ratings = []

# Step 2: Loop through each user in the dataset
for user_id in combined_df['User ID'].unique():
    # Get all items rated by this user
    user_data = combined_df[combined_df['User ID'] == user_id]
    items_rated_by_user = user_data['Item ID'].values
    actual_ratings_dict = dict(zip(user_data['Item ID'], user_data['Rating']))

    user_age = user_data['Age'].values[0]
    user_occupation = user_data['Occupation'].values[0]
    user_gender = user_data['Gender'].values[0]
    
    # Predict scores for all items
    user_movies = np.array(range(num_items))  # All movies
    user_input = np.full((num_items,), user_id, dtype=np.int32)
    item_input = user_movies.astype(np.int32)
    genre_input = genre_df.values[:num_items]  # Ensure genre_input has the same size as num_items
    age_input = np.full((num_items, 1), user_age, dtype=np.int32)  # Shape: (num_items, 1)
    occupation_input = np.full((num_items, 1), user_occupation, dtype=np.int32)  # Shape: (num_items, 1)
    gender_input = np.full((num_items, 1), user_gender, dtype=np.int32)  # Shape: (num_items, 1)
    # Predict scores
    predicted_scores = item_model.predict([user_input, item_input, genre_input,age_input,occupation_input,gender_input]).flatten()

    # Append data for each movie
    for idx, item_id in enumerate(user_movies):
        user_ids.append(user_id)
        item_ids.append(item_id)
        predicted_ratings.append(predicted_scores[idx])
        # Get actual rating if it exists; otherwise, append NaN
        actual_ratings.append(actual_ratings_dict.get(item_id, np.nan))

# Step 3: Create a DataFrame
results_df = pd.DataFrame({
    'User ID': user_ids,
    'Item ID': item_ids,
    'Actual Rating': actual_ratings,
    'Predicted Rating': predicted_ratings
})

# Display the first few rows of the DataFrame
print(results_df.head())


[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 952us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 941us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 906us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 942us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 942us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 865us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 865us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 904us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━

In [59]:
results_df = results_df.dropna()
results_df

Unnamed: 0,User ID,Item ID,Actual Rating,Predicted Rating
1,1,1,5.0,3.566705
2,1,2,3.0,3.918499
3,1,3,4.0,3.369549
4,1,4,3.0,3.816869
5,1,5,3.0,2.117733
...,...,...,...,...
1583790,943,1028,2.0,2.333222
1583806,943,1044,3.0,3.301507
1583809,943,1047,2.0,3.035149
1583829,943,1067,2.0,3.213472


In [60]:
#Calculating the mse:


# Assuming 'results_df' has the columns 'UserID', 'PredictedRating', and 'ActualRating'

# Step 1: Initialize an empty list to store MSE values for each user
mse_per_user = []

# Step 2: Group by UserID and calculate MSE for each user
for user_id, user_data in results_df.groupby('User ID'):
    # Get actual and predicted ratings for this user
    actual_ratings = user_data['Actual Rating']
    predicted_ratings = user_data['Predicted Rating']
    
    # Calculate MSE for this user
    mse = np.mean((actual_ratings - predicted_ratings) ** 2)
    
    # Append the MSE value to the list
    mse_per_user.append({'User ID': user_id, 'MSE': mse})

# Convert the list to a DataFrame for better readability (optional)
mse_df = pd.DataFrame(mse_per_user)

# Show the MSE values per user
mse_df

Unnamed: 0,User ID,MSE
0,1,0.972181
1,2,1.602208
2,3,1.170620
3,4,0.906310
4,5,1.180528
...,...,...
937,939,0.412638
938,940,0.867505
939,941,0.694671
940,942,0.748788


In [61]:
mean_value = mse_df['MSE'].mean()

print(f"MSE of the model overal is: {mean_value}")

MSE of the model overal is: 0.9270201554327671


## Counting correlated suggestions

In [73]:

# Initialize a counter for the event
event_count = 0

# Iterate over each user in the test_df
for user_id in test_df['User ID'].unique():
    # Get the actual movies that the user has in the test_df (movies they rated)
    user_test_data = test_df[test_df['User ID'] == user_id]
    
    # Filter movies with a rating above 3
    user_test_data = user_test_data[user_test_data['Rating'] > 3]
    
    if user_test_data.empty:
        continue  # Skip the user if no movies have a rating above 3
    
    # Get the movies and their ratings
    test_movies = user_test_data['Item ID'].values
    test_ratings = user_test_data['Rating'].values
    
    # Get the predicted scores for this user (all movies)
    user_movies = np.array(range(num_items))  # All movies
    genre_input = genre_df.values  # Assuming genre_df has the necessary genre features
    user_age = user_test_data['Age'].values[0]
    user_occupation = user_test_data['Occupation'].values[0]
    # Ensure that user_input is shaped as (num_items, 1), so it's a column vector
    user_input_predict = np.full((user_movies.shape[0], 1), user_id, dtype=np.int32)  # Shape: (num_items, 1)
    item_input_predict = user_movies.reshape(-1, 1).astype(np.int32)  # Shape: (num_items, 1)
    age_input_predict = np.full((user_movies.shape[0], 1), user_age, dtype=np.int32)  # Shape: (num_items, 1)
    occupation_input_predict = np.full((user_movies.shape[0], 1), user_occupation, dtype=np.int32)  # Shape: (num_items, 1)
    # Ensure genre_input is of shape (num_items, n_genres)
    genre_input_predict = genre_input[:user_movies.shape[0], :].astype(np.int32)  # Shape: (num_items, n_genres)
    # Predict the scores for all items (movies)
    predicted_scores = item_model.predict([user_input_predict, item_input_predict, genre_input_predict,age_input_predict,occupation_input_predict]).flatten()
    
    # Get the top 10 recommended movies based on the predicted scores
    top_n_indices = np.argsort(predicted_scores)[::-1][:10]  # Top 10 movies
    
    # Loop through each movie in the test data
    for movie_id in test_movies:
        if movie_id in top_n_indices:
            event_count += 1

# Print the final count of events
print(f"Number of times a movie with a rating above 3 appeared in the top 10 recommendations: {event_count}")


[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step




[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 942us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step   
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 982us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 975us/step
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0

In [74]:
print('The number of times a recommended movie has appeared in the test set and actually liked (rating above 3) is:',event_count)

The number of times a recommended movie has appeared in the test set and actually liked (rating above 3) is: 95


In [75]:
# Filter the test_df to only include rows where the rating is greater than 3
ratings_above_3 = test_df[test_df['Rating'] > 3]

# Count how many rows meet this condition
count_ratings_above_3 = ratings_above_3.shape[0]

# Print the result
print(f"Number of ratings above 3: {count_ratings_above_3}")
print('Percentage of enjoyed movies that show up in recommendations is:',(event_count/count_ratings_above_3)*100)


Number of ratings above 3: 3012
Percentage of enjoyed movies that show up in recommendations is: 3.154050464807437


## Scaling the model

Recommender systems are likely to have to deal with huge volumes of data, we therefore think about how our model would cope with being scaled to 100000 times its size. Neural networks are known for coping well wih large amounts of data, and actually perform better when there is more data available. As more users join, and with more data being provided the neural netowrk will learn more and more pathways which had previously been quite sparse. Therefore it is expected that as the volume of data increases the neura netwrks performance will also increase. 

On the other hand, with larger volumes of data, the model will take a long time to train. 

## MSE
The MSE for the model is less about 0.9. As the ratings are out of 5, this would mean that on average a predicted rating is out by around 1. Although this isn't particuarly good, as the model ranks the top 10 it can be argued that it is likely that the model will suggest films that the user would enjoy (rating >=3). It is unlikely that the model would suggest a movie that the user would acually rate 2 or less.

## Coverage
The coverage of the model is quite low, at 10.54%, which means that the model favours particular movies over others. This could be because collaborative filters are quite heavily influenced on popularity which can lead to over-fitting. With this in mind, when we look to scale the model by 100,000 times it is likely that the coverage will increase. This is because wih more data, any sparse data will be filled in and so the network can learn patterns about the less popular options. With more data comes more diversity and so with better understanding of the more niche preferences it's hoped that the model will be able to provide more personalised recommendations.

## Matching Recommendations to Test Data
The proportion of recommendations that actually appeared in the test set is incredibly low at only 3.1%, however his does not provide a particuarly good measure for how well the model is performing. Although the users didnt actually watch many of the model's recommendations it does not mean that they wouldn't enjoy them. It's likely that the model recommends movies that they will enjoy more than the ones they actually watched (movies in test set). As the model is scaled up, it isn't particurly likely that this percentage will increase, however as explained this is not necessarily a bad thing.

# Now to look at user-Based recommender

We start by preparing the input data as before

In [None]:
num_users = 943
num_items = 1682
num_occupations = 21  # Number of unique occupations
num_genders = 2        # Gender (Male or Female)

In [None]:
# Encoding UserID, ItemID, Gender, Occupation
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()
gender_encoder = LabelEncoder()
occupation_encoder = LabelEncoder()

train_df['User ID'] = user_encoder.fit_transform(train_df['User ID'])
train_df['Item ID'] = item_encoder.fit_transform(train_df['Item ID'])
train_df['Gender'] = gender_encoder.fit_transform(train_df['Gender'])
train_df['Occupation'] = occupation_encoder.fit_transform(train_df['Occupation'])

test_df['User ID'] = user_encoder.fit_transform(test_df['User ID'])
test_df['Item ID'] = item_encoder.fit_transform(test_df['Item ID'])
test_df['Gender'] = gender_encoder.fit_transform(test_df['Gender'])
test_df['Occupation'] = occupation_encoder.fit_transform(test_df['Occupation'])

In [None]:
# Normalize the exact age feature
scaler = MinMaxScaler()
train_df['Age'] = scaler.fit_transform(train_df[['Age']])  # Normalize age between 0 and 1
# Normalize the exact age feature
scaler = MinMaxScaler()
test_df['Age'] = scaler.fit_transform(test_df[['Age']])  # Normalize age between 0 and 1


#Now normalize ratings for each user
train_df['Rating'] = scaler.fit_transform(train_df[['Rating']])  
test_df['Rating'] = scaler.fit_transform(test_df[['Rating']])  


In [None]:
# Prepare Input Features
train_user_input = train_df[['User ID', 'Age', 'Gender', 'Occupation']].values
train_item_input = train_df['Item ID'].values
train_ratings = train_df['Rating'].values

test_user_input = test_df[['User ID', 'Age', 'Gender', 'Occupation']].values
test_item_input = test_df['Item ID'].values
test_ratings = test_df['Rating'].values

Now to build the model:

In [None]:
# Model Architecture
# Inputs
user_input = Input(shape=(4,), name='User_Input')  # 4 features: UserID, Age, Gender, Occupation
item_input = Input(shape=(1,), name='Item_Input')


In [None]:
# Embedding layers for user and item
user_embedding = Embedding(num_users, 50, name='User_Embedding')(user_input[:, 0])  # UserID
age_input = user_input[:, 1:]  # Age, Gender, Occupation (no embedding for continuous age)
age_vec = Dense(10, activation='relu')(age_input)  # Dense layer to process age, gender, occupation inputs
item_embedding = Embedding(num_items, 50, name='Item_Embedding')(item_input)


In [None]:
# Flatten embeddings
user_vec = Flatten()(user_embedding)
item_vec = Flatten()(item_embedding)

In [None]:
# Concatenate all embeddings (user, item, age, gender, occupation)
concat = Concatenate()([user_vec, age_vec, item_vec])


In [None]:
# Dense layers
dense1 = Dense(128, activation='relu')(concat)
dropout1 = Dropout(0.3)(dense1)
dense2 = Dense(64, activation='relu')(dropout1)
output = Dense(1, activation='linear')(dense2)  # Rating is a continuous value, so use linear activation


In [None]:
# Model definition
user_model = Model(inputs=[user_input, item_input], outputs=output)
user_model.compile(optimizer='adam', loss='mse', metrics=['mae'])


In [None]:
user_model.summary()

Now train the model

In [None]:
# Training the model
history = user_model.fit(
    [train_user_input, train_item_input],
    train_ratings,
    validation_data=([test_user_input, test_item_input], test_ratings),
    epochs=10,
    batch_size=64
)

Test on random users to see how it performs

In [None]:
# Predicting Recommendations for a User
user_id = 3 # Example: User ID for which to recommend
user_age = train_df.loc[train_df['User ID'] == user_id, 'Age'].values[0]
user_gender = train_df.loc[train_df['User ID'] == user_id, 'Gender'].values[0]
user_occupation = train_df.loc[train_df['User ID'] == user_id, 'Occupation'].values[0]

user_data = np.array([[user_id, user_age, user_gender, user_occupation]])  # Example: Age=0.5 (normalized), Gender=0, Occupation=0 for simplicity (you would use actual data)
user_movies = np.array(range(num_items))  # All movies


In [None]:
# Predict scores
predicted_scores = user_model.predict([np.tile(user_data, (num_items, 1)), user_movies])


In [None]:
#recommend Top-5 Movies
recommended_movies = np.argsort(-predicted_scores.flatten())[:10]
print("Top 10 recommended movies for User ID", user_id, ":", recommended_movies)

In [None]:
filtered_user_df = train_df[train_df['Item ID'].isin(recommended_movies)]
                            
filtered_user_df.drop_duplicates(subset='Item ID', inplace=True)
filtered_user_df.drop(columns=['Gender','Rating','Age','Occupation','zip code','User ID','URL'],inplace=True)

filtered_user_df

In [None]:
# Function to get column names with positive entries for each row
def get_positive_columns(row):
    return [col for col in row.index if row[col] == 1]


# Apply the function to each row
positive_columns_per_row_user = filtered_user_df.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row_user):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

In [None]:
true_user_ratings = test_df.loc[test_df['User ID'] == user_id].sort_values(by=['Rating'], ascending=False).head(5)

true_user_ratings.drop(columns=['User ID','URL','Rating','timestamp','Age','Gender','Occupation','zip code','Release Date'], inplace = True)

true_user_ratings

In [None]:
# Apply the function to each row
positive_columns_per_row = true_user_ratings.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

## Tune hyper-parameters
Now to tune the hyper-paramters for this model.

In [None]:
# HyperModel for the neural network
class CollaborativeFilterHyperModel(kt.HyperModel):
    def build(self, hp):
        # Input dimensions
        num_users = 943
        num_items = 1682

        # Model inputs
        user_input = Input(shape=(1,), name='User_Input')
        item_input = Input(shape=(1,), name='Item_Input')
        age_input = Input(shape=(1,), name='Age_Input')
        occupation_input = Input(shape=(1,), name='Occupation_Input')  # Numerical occupation input
        gender_input = Input(shape=(1,), name='Gender_Input')  # Numerical gender input

        # Hyperparameters for embeddings and dense layers
        user_embedding_dim = hp.Int('user_embedding_dim', min_value=32, max_value=128, step=32)
        item_embedding_dim = hp.Int('item_embedding_dim', min_value=32, max_value=128, step=32)

        # Embedding layers for user and item
        user_embedding = Embedding(num_users + 1, user_embedding_dim, name='User_Embedding')(user_input)
        item_embedding = Embedding(num_items + 1, item_embedding_dim, name='Item_Embedding')(item_input)

        # Flatten the embeddings
        user_vec = Flatten()(user_embedding)
        item_vec = Flatten()(item_embedding)

        # Concatenate embeddings with additional scalar features
        concat = Concatenate()([user_vec, item_vec, age_input, occupation_input, gender_input])

        # Dense layers with hyperparameter search
        dense1_units = hp.Int('dense1_units', min_value=64, max_value=512, step=64)
        dense2_units = hp.Int('dense2_units', min_value=32, max_value=256, step=32)

        dense1 = Dense(dense1_units, activation='relu')(concat)
        dropout1 = Dropout(hp.Float('dropout1', min_value=0.2, max_value=0.5, step=0.1))(dense1)
        dense2 = Dense(dense2_units, activation='relu')(dropout1)

        # Output layer
        output = Dense(1, activation='linear')(dense2)  # Rating is continuous, use linear activation

        # Model definition
        model = Model(inputs=[user_input, item_input, age_input, occupation_input, gender_input], outputs=output)

        # Compile the model with an optimizer and loss function
        model.compile(optimizer=Adam(), loss='mse', metrics=['mae'])

        return model

# Preparing the training and testing data (just like before)
train_user_input = train_df['User ID'].values
train_item_input = train_df['Item ID'].values
train_age_input = train_df['Age'].values.reshape(-1, 1)
train_occupation_input = train_df['Occupation'].values.reshape(-1, 1)  # Numerical occupation
train_gender_input = train_df['Gender'].values.reshape(-1, 1)  # Numerical gender
train_ratings = train_df['Rating'].values

test_user_input = test_df['User ID'].values
test_item_input = test_df['Item ID'].values
test_age_input = test_df['Age'].values.reshape(-1, 1)
test_occupation_input = test_df['Occupation'].values.reshape(-1, 1)  # Numerical occupation
test_gender_input = test_df['Gender'].values.reshape(-1, 1)  # Numerical gender
test_ratings = test_df['Rating'].values

# Instantiate the tuner
tuner = kt.RandomSearch(
    CollaborativeFilterHyperModel(),
    objective='val_mae',  # We are optimizing for Mean Absolute Error
    max_trials=5,  # Number of different hyperparameter combinations to try
    executions_per_trial=3,  # Number of executions for each trial
    project_name='collab_filter_search'  # Project name for Keras Tuner
)

# Search for the best hyperparameters
tuner.search(
    [train_user_input, train_item_input, train_age_input, train_occupation_input, train_gender_input],
    train_ratings,
    validation_data=(
        [test_user_input, test_item_input, test_age_input, test_occupation_input, test_gender_input],
        test_ratings
    ),
    epochs=10,
    batch_size=64
)

# Retrieve the best hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
print("Best hyperparameters:", best_hp.values)

# Build the model with the best hyperparameters
best_user_model = tuner.hypermodel.build(best_hp)

# Train the model using the best hyperparameters
history = best_user_model.fit(
    [train_user_input, train_item_input, train_age_input, train_occupation_input, train_gender_input],
    train_ratings,
    validation_data=(
        [test_user_input, test_item_input, test_age_input, test_occupation_input, test_gender_input],
        test_ratings
    ),
    epochs=10,
    batch_size=64
)

# Evaluate the best model
test_loss, test_mae = best_user_model.evaluate(
    [test_user_input, test_item_input, test_age_input, test_occupation_input, test_gender_input],
    test_ratings
)
print(f"Test MAE: {test_mae}")


In [None]:
# Save the best model to a file
best_user_model.save('best_user_collab_filter_model.h5')
print("Model saved!")

In [None]:
# Load the saved model
best_user_model = load_model('best_user_collab_filter_model.h5', custom_objects={'mse': mse})
print("Model loaded!")


In [None]:
best_user_model.summary()

We now again look at the models performance

In [None]:
# Predicting Recommendations for a Specific User
user_id = 374  # Example: User ID for which to recommend

# Fetch user-specific data
user_age = train_df.loc[train_df['User ID'] == user_id, 'Age'].values[0]
user_gender = train_df.loc[train_df['User ID'] == user_id, 'Gender'].values[0]
user_occupation = train_df.loc[train_df['User ID'] == user_id, 'Occupation'].values[0]

# Prepare inputs for all movies for this user
user_input = np.full((num_items,), user_id)  # Same user_id for all movies
item_input = np.arange(1, num_items + 1)  # Movie IDs from 1 to num_items
age_input = np.full((num_items,), user_age)  # Same age for all movies
gender_input = np.full((num_items,), user_gender)  # Same gender for all movies
occupation_input = np.full((num_items,), user_occupation)  # Same occupation for all movies

# Make predictions
predicted_ratings = best_user_model.predict([user_input, item_input, age_input, occupation_input, gender_input])

# Combine results into a DataFrame for easy interpretation
recommendations = pd.DataFrame({
    'Movie ID': item_input,
    'Predicted Rating': predicted_ratings.flatten()
})

# Sort movies by predicted rating in descending order
recommended_movies = recommendations.sort_values(by='Predicted Rating', ascending=False)[:10]


In [None]:
# Recommend Top-5 Movies
recommended_movies = np.argsort(-predicted_scores.flatten())[:10]
print("Top 10 recommended movies for User ID", user_id, ":", recommended_movies)

In [None]:
filtered_user_df = train_df[train_df['Item ID'].isin(recommended_movies)]
                            
filtered_user_df.drop_duplicates(subset='Item ID', inplace=True)
filtered_user_df.drop(columns=['Gender','Rating','Age','Occupation','zip code','User ID','URL','timestamp'],inplace=True)

filtered_user_df

In [None]:
# Function to get column names with positive entries for each row
def get_positive_columns(row):
    return [col for col in row.index if row[col] == 1]


# Apply the function to each row
positive_columns_per_row_user = filtered_user_df.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row_user):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

In [None]:
true_user_ratings = test_df.loc[test_df['User ID'] == user_id].sort_values(by=['Rating'], ascending=False).head(5)

true_user_ratings.drop(columns=['User ID','URL','Rating','timestamp','Age','Gender','Occupation','zip code','Release Date'], inplace = True)

true_user_ratings

In [None]:
# Apply the function to each row
positive_columns_per_row = true_user_ratings.apply(get_positive_columns, axis=1)

# Show the result
for idx, positive_cols in enumerate(positive_columns_per_row):
    print(f"Row {idx} has positive values in columns: {positive_cols}")

This user-based model doesn't change its recommendations based on the user_id we test. This could need some further investigations. However the item-based recommender seems to work quite well. So we will take this model forward to compare with the other models built by the rest of the group.