# Movies Recommendation System
- This project was developed to practice building a recommendation system using a Neural Network (NN) model implemented in Keras. It applies a content-based filtering approach on movie data. The dataset includes user preferences (favorite genres), movie genre details, and user ratings. The model predicts a user’s rating for each movie and recommends the top 10 movies with the highest predicted ratings.
- The dataset used in this project is originally from the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/) but was accessed through the course materials provided in [ML Specialization](https://www.coursera.org/specializations/machine-learning-introduction). 

## Sections
- [Importing Packages](#Importing-Packages)
- [Data Exploration & Cleaning](#Data-Exploration-&-Cleaning)
- [Prepare the Data before Modeling](#Prepare-the-Data-before-Modeling)
- [Building the Model](#Building-the-Model)
- [Make Recommendations](#Make-Recommendations)

***Allah al-Musta'an (God is the one whose help is sought)***

---
---

## Importing Packages

In [1]:
# Data processing 
import numpy as np
import pandas as pd

# Packages for preprocessing, modeling, and evaluating
import tensorflow as tf
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow.keras.ops as k

import importlib
import utils # custom functions

In [2]:
np.set_printoptions(precision = 4, suppress = True)
pd.set_option('display.max_columns', None)

---
---

## Data Exploration & Cleaning

### Users
First, we will load the users training data, understand the data variables & dimensions, and identify any noise. After that, we will clean the noisy data and remove the variables that are not relevant to our study.

#### Data Understanding

In [3]:
users_data = utils.load_data('users', 'users_header')

In [4]:
users_data.head()

Unnamed: 0,user id,rating count,rating ave,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
0,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
1,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
2,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
3,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
4,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89


In [5]:
users_data.shape

(50884, 17)

In [6]:
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50884 entries, 0 to 50883
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   user id       50884 non-null  float64
 1   rating count  50884 non-null  float64
 2   rating ave    50884 non-null  float64
 3   Action        50884 non-null  float64
 4   Adventure     50884 non-null  float64
 5   Animation     50884 non-null  float64
 6   Children      50884 non-null  float64
 7   Comedy        50884 non-null  float64
 8   Crime         50884 non-null  float64
 9   Documentary   50884 non-null  float64
 10  Drama         50884 non-null  float64
 11  Fantasy       50884 non-null  float64
 12  Horror        50884 non-null  float64
 13  Mystery       50884 non-null  float64
 14  Romance       50884 non-null  float64
 15  Sci-Fi        50884 non-null  float64
 16  Thriller      50884 non-null  float64
dtypes: float64(17)
memory usage: 6.6 MB


In [7]:
users_data.isna().sum()

user id         0
rating count    0
rating ave      0
Action          0
Adventure       0
Animation       0
Children        0
Comedy          0
Crime           0
Documentary     0
Drama           0
Fantasy         0
Horror          0
Mystery         0
Romance         0
Sci-Fi          0
Thriller        0
dtype: int64

In [8]:
users_data[users_data.duplicated(keep = False)].sort_values(
    by = [col for col in users_data.columns],
).head(n = 10)

Unnamed: 0,user id,rating count,rating ave,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
0,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
1,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
2,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
3,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
4,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
5,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
6,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
7,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
8,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
9,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89


In [9]:
len(np.unique(users_data['user id']))

397

***NOTES***  

There are 397 users represented in the dataset. For each user, the `rating_count`, `rating_ave`, and per-genre average rating variables are available. These are the key findings:  
- Most of the rating samples are duplicated. According to the data source ([ML Specialization](https://www.coursera.org/specializations/machine-learning-introduction)), this was done to boost the underrepresented samples.
- The final version of the data contains 50884 samples with 17 columns each.
- Variable names should be cleaned, for example by replacing the space in `user id` with an underscore and standardizing the case.
- The `user_id` variable is stored as float values. It will be cast to an integer.
- For the purpose of the study, we will use only the per-genre average rating variables and remove the other rating variables.
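
The column-name cleanup described in the notes can be sketched as follows. The body of `utils.clean_col_names` is not shown in this notebook, so `clean_col_names_sketch` is a hypothetical stand-in that only reproduces the behavior visible in the outputs (lowercase names, spaces replaced by underscores, hyphens kept).

```python
import pandas as pd


def clean_col_names_sketch(df: pd.DataFrame) -> list:
    """Hypothetical stand-in for utils.clean_col_names.

    Lowercases each column name and replaces spaces with underscores.
    Hyphens are left untouched, so 'Sci-Fi' becomes 'sci-fi'.
    """
    return [col.strip().lower().replace(" ", "_") for col in df.columns]


# Toy frame that only carries a few of the raw header names
demo = pd.DataFrame(columns=["user id", "rating ave", "Sci-Fi"])
cleaned = clean_col_names_sketch(demo)
print(cleaned)  # ['user_id', 'rating_ave', 'sci-fi']
```

Returning a list (rather than renaming in place) matches how the notebook assigns the result back to `users_data.columns`.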

#### Data Cleaning

In [10]:
# clean the col names
users_data.columns = utils.clean_col_names(users_data)

In [11]:
users_data.head()

Unnamed: 0,user_id,rating_count,rating_ave,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,horror,mystery,romance,sci-fi,thriller
0,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
1,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
2,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
3,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89
4,2.0,22.0,4.0,3.95,4.25,0.0,0.0,4.0,4.12,4.0,4.04,0.0,3.0,4.0,0.0,3.88,3.89


In [12]:
# cast the user_id col
users_data['user_id'] = users_data['user_id'].astype(int)

In [13]:
users_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50884 entries, 0 to 50883
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   user_id       50884 non-null  int32  
 1   rating_count  50884 non-null  float64
 2   rating_ave    50884 non-null  float64
 3   action        50884 non-null  float64
 4   adventure     50884 non-null  float64
 5   animation     50884 non-null  float64
 6   children      50884 non-null  float64
 7   comedy        50884 non-null  float64
 8   crime         50884 non-null  float64
 9   documentary   50884 non-null  float64
 10  drama         50884 non-null  float64
 11  fantasy       50884 non-null  float64
 12  horror        50884 non-null  float64
 13  mystery       50884 non-null  float64
 14  romance       50884 non-null  float64
 15  sci-fi        50884 non-null  float64
 16  thriller      50884 non-null  float64
dtypes: float64(16), int32(1)
memory usage: 6.4 MB


In [14]:
# prepare the train data for the modeling process
users_preferences = np.array(
    users_data.drop(
        columns = ['user_id', 'rating_count', 'rating_ave']
    ).values
)

In [15]:
users_preferences

array([[3.95, 4.25, 0.  , ..., 0.  , 3.88, 3.89],
       [3.95, 4.25, 0.  , ..., 0.  , 3.88, 3.89],
       [3.95, 4.25, 0.  , ..., 0.  , 3.88, 3.89],
       ...,
       [3.55, 3.7 , 3.94, ..., 3.67, 3.61, 3.6 ],
       [3.55, 3.7 , 3.94, ..., 3.67, 3.61, 3.6 ],
       [3.55, 3.7 , 3.94, ..., 3.67, 3.61, 3.6 ]])

In [16]:
users_preferences.shape

(50884, 14)

### Movies
The same approach used for the users data is applied here.

#### Data Understanding

In [17]:
movies_data = utils.load_data('movies', 'movies_header')

In [18]:
movies_data.head()

Unnamed: 0,movie id,year,ave rating,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
0,6874,2003,3.961832,1,0,0,0,0,1,0,0,0,0,0,0,0,1
1,8798,2004,3.761364,1,0,0,0,0,1,0,1,0,0,0,0,0,1
2,46970,2006,3.25,1,0,0,0,1,0,0,0,0,0,0,0,0,0
3,48516,2006,4.252336,0,0,0,0,0,1,0,1,0,0,0,0,0,1
4,58559,2008,4.238255,1,0,0,0,0,1,0,1,0,0,0,0,0,0


In [19]:
movies_data.shape

(50884, 17)

In [20]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50884 entries, 0 to 50883
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   movie id     50884 non-null  int64  
 1   year         50884 non-null  int64  
 2   ave rating   50884 non-null  float64
 3   Action       50884 non-null  int64  
 4   Adventure    50884 non-null  int64  
 5   Animation    50884 non-null  int64  
 6   Children     50884 non-null  int64  
 7   Comedy       50884 non-null  int64  
 8   Crime        50884 non-null  int64  
 9   Documentary  50884 non-null  int64  
 10  Drama        50884 non-null  int64  
 11  Fantasy      50884 non-null  int64  
 12  Horror       50884 non-null  int64  
 13  Mystery      50884 non-null  int64  
 14  Romance      50884 non-null  int64  
 15  Sci-Fi       50884 non-null  int64  
 16  Thriller     50884 non-null  int64  
dtypes: float64(1), int64(16)
memory usage: 6.6 MB


In [21]:
movies_data.isna().sum()

movie id       0
year           0
ave rating     0
Action         0
Adventure      0
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          0
Fantasy        0
Horror         0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
dtype: int64

In [22]:
movies_data.duplicated().sum()

50037

In [23]:
movies_data[movies_data.duplicated(keep = False)].sort_values(
    by = [col for col in movies_data.columns]
).head(10)

Unnamed: 0,movie id,year,ave rating,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller
1015,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
3062,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
4675,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
8603,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
12408,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
13290,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
29020,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
32248,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
34691,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0
39937,4054,2001,2.84375,0,0,0,0,0,0,0,1,0,0,0,1,0,0


In [24]:
len(np.unique(movies_data['movie id']))

847

***NOTES***  
There are 847 movies represented in the dataset. The dataset contains the release `year` of each movie, a binary indicator for each genre that applies to the movie, and the movie's `ave_rating`. A movie may belong to one or more genres. These are the key findings:
- As with the users data, there are many duplicates.
- The data contains 50884 rows with 17 columns (the same as the users data).
- Only minor cleaning is needed: the columns will be renamed. The `movie id` column is already stored as an integer, so no cast is required.
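
The two duplicate checks used above count different things, which a toy frame (made-up rows, not the real data) makes concrete. With the default `keep='first'`, `duplicated()` flags only the extra copies, whereas `keep=False` flags every row that has a twin. This is consistent with the numbers in the notebook: 50884 rows minus 50037 flagged copies leaves 847, the number of unique movie IDs.

```python
import pandas as pd

# Toy data: movie 4054 appears three times, movie 6874 once
demo = pd.DataFrame({
    "movie_id": [4054, 4054, 4054, 6874],
    "year": [2001, 2001, 2001, 2003],
})

# Default keep='first': the first occurrence is NOT flagged, only later copies
extra_copies = int(demo.duplicated().sum())

# keep=False: every row belonging to a group of identical rows is flagged
rows_in_dup_groups = int(demo.duplicated(keep=False).sum())

# Rows that remain after removing the extra copies
unique_rows = len(demo.drop_duplicates())

print(extra_copies, rows_in_dup_groups, unique_rows)  # 2 3 2
```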

#### Data Cleaning

In [25]:
# clean col names
movies_data.columns = utils.clean_col_names(movies_data)

In [26]:
movies_data.head()

Unnamed: 0,movie_id,year,ave_rating,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,horror,mystery,romance,sci-fi,thriller
0,6874,2003,3.961832,1,0,0,0,0,1,0,0,0,0,0,0,0,1
1,8798,2004,3.761364,1,0,0,0,0,1,0,1,0,0,0,0,0,1
2,46970,2006,3.25,1,0,0,0,1,0,0,0,0,0,0,0,0,0
3,48516,2006,4.252336,0,0,0,0,0,1,0,1,0,0,0,0,0,1
4,58559,2008,4.238255,1,0,0,0,0,1,0,1,0,0,0,0,0,0


In [27]:
# prepare for the modeling process
movies_features = np.array(
    movies_data.drop(
        columns = ['movie_id']
    ).values
)

In [28]:
movies_features[0]

array([2003.    ,    3.9618,    1.    ,    0.    ,    0.    ,    0.    ,
          0.    ,    1.    ,    0.    ,    0.    ,    0.    ,    0.    ,
          0.    ,    0.    ,    0.    ,    1.    ])

In [29]:
movies_features.shape

(50884, 16)

### Ratings
The ratings vector represents our target variable. Each sample represents the rating `(y)` given by a user `(j)` to a movie `(i)`. 

In [30]:
ratings = utils.load_data('ratings', np_arr = True)

In [31]:
ratings[0:10]

array([4. , 3.5, 4. , 4. , 4.5, 5. , 4.5, 3. , 3. , 3. ])

- For example, the first rating (4) is the score given by the first user in `users_data` (ID 2) to the first movie in `movies_data` (ID 6874).

In [32]:
ratings = ratings.reshape(-1,1)

In [33]:
ratings.shape

(50884, 1)

#### Building a dataframe that contains IDs and ratings

In [34]:
ratings_df = pd.concat(
    [movies_data['movie_id'], users_data['user_id'], pd.DataFrame(ratings, columns = ['rating'])], axis = 1
)

In [35]:
ratings_df.head(10)

Unnamed: 0,movie_id,user_id,rating
0,6874,2,4.0
1,8798,2,3.5
2,46970,2,4.0
3,48516,2,4.0
4,58559,2,4.5
5,60756,2,5.0
6,68157,2,4.5
7,71535,2,3.0
8,71535,2,3.0
9,71535,2,3.0
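
The repeated rows for movie 71535 in the output above show the duplication at the level of (user, movie) pairs. One way to quantify it, sketched here on just the ten rows printed above, is to group by both IDs and count the rows per pair.

```python
import pandas as pd

# The ten rows shown in ratings_df.head(10)
demo = pd.DataFrame({
    "movie_id": [6874, 8798, 46970, 48516, 58559, 60756, 68157, 71535, 71535, 71535],
    "user_id": [2] * 10,
    "rating": [4.0, 3.5, 4.0, 4.0, 4.5, 5.0, 4.5, 3.0, 3.0, 3.0],
})

# Number of rows per (user, movie) pair, most repeated first
pair_counts = (
    demo.groupby(["user_id", "movie_id"])
    .size()
    .sort_values(ascending=False)
)
n_unique_pairs = len(pair_counts)

print(pair_counts.head(3))
print("unique (user, movie) pairs:", n_unique_pairs)
```

Running the same `groupby` on the full `ratings_df` would show how strongly each pair was oversampled.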


### Info about Movies

In [36]:
movies_info = utils.load_data('info_movies')

In [37]:
movies_info.head()

Unnamed: 0,movieId,title,genres
0,4054,Save the Last Dance (2001),Drama|Romance
1,4069,"Wedding Planner, The (2001)",Comedy|Romance
2,4148,Hannibal (2001),Horror|Thriller
3,4149,Saving Silverman (Evil Woman) (2001),Comedy|Romance
4,4153,Down to Earth (2001),Comedy|Fantasy|Romance


In [38]:
movies_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  847 non-null    int64 
 1   title    847 non-null    object
 2   genres   847 non-null    object
dtypes: int64(1), object(2)
memory usage: 20.0+ KB


---
---

## Prepare the Data before Modeling

In [39]:
movies_features.shape, users_preferences.shape, ratings.shape

((50884, 16), (50884, 14), (50884, 1))

### Scaling the data

In [40]:
# movies
movies_scaler = StandardScaler().fit(movies_features)
movies_scaled = movies_scaler.transform(movies_features)

# users
users_scaler = StandardScaler().fit(users_preferences)
users_scaled = users_scaler.transform(users_preferences)

# ratings
ratings_scaler = MinMaxScaler().fit(ratings)  # min_max scaler (bounded range): no need for negatives
ratings_scaled = ratings_scaler.transform(ratings)
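
Because the model is trained on `ratings_scaled`, its raw outputs are on the 0 to 1 scale and have to be mapped back to rating units before they are shown to a user. The sketch below reproduces, in plain NumPy and on toy values, the arithmetic that `MinMaxScaler` applies with its default `feature_range=(0, 1)` and the corresponding inverse. It is an illustration of the idea, not the notebook's own code.

```python
import numpy as np

# Toy ratings (not the real data), shaped (n_samples, 1) like `ratings`
ratings_toy = np.array([[0.5], [3.0], [4.0], [5.0]])

# Forward transform: (x - min) / (max - min), computed per column
r_min = ratings_toy.min(axis=0)
r_max = ratings_toy.max(axis=0)
scaled = (ratings_toy - r_min) / (r_max - r_min)

# Inverse transform: the same mapping that inverse_transform performs,
# which would turn model outputs back into rating units
restored = scaled * (r_max - r_min) + r_min

print(scaled.ravel())
print(restored.ravel())
```

The predicted ratings reported later (values above 4) are on the original scale, which suggests that `utils.recommend_movies` applies the inverse transform through `ratings_scaler`.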

### Splitting the data

In [41]:
GLOBAL_RANDOM_STATE = 42
GLOBAL_TEST_SIZE = 0.2

train_idx, test_idx = train_test_split(
    range(ratings.shape[0]),  # total number of examples
    test_size = GLOBAL_TEST_SIZE,
    random_state = GLOBAL_RANDOM_STATE,
    shuffle = True
)


movies_train, movies_test = movies_scaled[train_idx], movies_scaled[test_idx]
users_train, users_test = users_scaled[train_idx], users_scaled[test_idx]
ratings_train, ratings_test = ratings_scaled[train_idx], ratings_scaled[test_idx]
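
Splitting the row indices once and reusing them for all three arrays is what keeps each user row, movie row, and rating aligned. The sketch below shows the idea on toy arrays. It uses a NumPy permutation as a stand-in for `train_test_split`, so the exact indices differ from the ones produced above.

```python
import numpy as np

n = 10
users_toy = np.arange(n).reshape(-1, 1) * 10     # toy user features
movies_toy = np.arange(n).reshape(-1, 1) * 100   # toy movie features
ratings_toy = np.arange(n).reshape(-1, 1)        # toy targets

# Shuffle the row indices once, then carve off 20% for testing
rng = np.random.default_rng(42)
idx = rng.permutation(n)
n_test = int(n * 0.2)
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Indexing every array with the SAME indices preserves row alignment
users_tr, movies_tr, ratings_tr = (
    users_toy[train_idx], movies_toy[train_idx], ratings_toy[train_idx]
)

print(np.hstack([users_tr, movies_tr, ratings_tr]))
```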

---
---

## Building the Model

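- The network structure: two networks with identical layer sizes, one for users and one for movies, each produce a 32-dimensional vector. The two vectors are normalized and combined with a dot product to give the predicted (scaled) rating.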
  
![](theNN.png)

In [42]:
num_outputs = 32
tf.random.set_seed(1)

# Users NN
user_nn = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu'),
    tf.keras.layers.Dense(units = 128, activation = 'relu'),
    tf.keras.layers.Dense(units = num_outputs)
])


# Movies NN
movie_nn = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 256, activation = 'relu'),
    tf.keras.layers.Dense(units = 128, activation = 'relu'),
    tf.keras.layers.Dense(units = num_outputs)
])

# Inputting the user data
user_input = tf.keras.layers.Input(shape = (users_scaled.shape[1], ))
vu = user_nn(user_input)
vu = k.normalize(vu, axis = 1)


# Inputting the movie data
movie_input = tf.keras.layers.Input(shape = (movies_scaled.shape[1], ))
vm = movie_nn(movie_input)
vm = k.normalize(vm, axis = 1)

# Output layer
output = tf.keras.layers.Dot(axes = 1)([vu, vm])
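
The output of this architecture is a dot product of two normalized vectors, which is the cosine similarity between the user and movie embeddings and is therefore bounded in [-1, 1]. The NumPy sketch below checks that identity on random vectors. It assumes L2 normalization along axis 1, which is the default order of `keras.ops.normalize`.

```python
import numpy as np

rng = np.random.default_rng(0)
vu_toy = rng.normal(size=(5, 32))  # stand-in for user embeddings
vm_toy = rng.normal(size=(5, 32))  # stand-in for movie embeddings


def l2_normalize(x, axis=1):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


# Row-wise dot product of unit vectors (what Dot(axes=1) computes here)
dot_of_unit = np.sum(l2_normalize(vu_toy) * l2_normalize(vm_toy), axis=1)

# Textbook cosine similarity for comparison
cosine = np.sum(vu_toy * vm_toy, axis=1) / (
    np.linalg.norm(vu_toy, axis=1) * np.linalg.norm(vm_toy, axis=1)
)

print(np.allclose(dot_of_unit, cosine))  # True
```

Since the ratings are min-max scaled to [0, 1], the targets fall inside the range this output can represent.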

In [43]:
model = tf.keras.Model(
    inputs = [user_input, movie_input],
    outputs = output
)

In [44]:
model.summary()

In [45]:
tf.random.set_seed(1)
model.compile(
    loss = tf.keras.losses.MeanSquaredError(),
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.01)
)

model.fit(
    x = (users_train, movies_train),
    y = ratings_train,
    epochs = 30
)

Epoch 1/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0368
Epoch 2/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0287
Epoch 3/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0274
Epoch 4/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0266
Epoch 5/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0258
Epoch 6/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0250
Epoch 7/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0244
Epoch 8/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0238
Epoch 9/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0233
Epoch 10/30
1273/1273 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x23f1cfef310>

In [46]:
model.evaluate(
    x = (users_test, movies_test),
    y = ratings_test
)

319/319 ━━━━━━━━━━━━━━━━━━━━ 0s 812us/step - loss: 0.0207


0.020917516201734543
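
The test loss above is a mean squared error on the min-max scaled ratings, so it is not directly readable in rating units. A rough conversion is sketched below. The rating range of 4.5 is an assumption (ratings spanning 0.5 to 5.0); in the notebook the exact value would be available as `ratings_scaler.data_range_[0]`.

```python
import math

test_mse_scaled = 0.020917516201734543  # value returned by model.evaluate above

# ASSUMPTION: ratings span 0.5 to 5.0, so the min-max range is 4.5.
# In the notebook, ratings_scaler.data_range_[0] holds the fitted range.
assumed_rating_range = 4.5

# RMSE on the scaled targets, then rescaled to rating units
rmse_scaled = math.sqrt(test_mse_scaled)
rmse_rating_units = rmse_scaled * assumed_rating_range

print(f"RMSE (scaled): {rmse_scaled:.3f}")
print(f"RMSE (rating units, assumed range): {rmse_rating_units:.2f}")
```

Under that assumption the typical error is somewhat more than half a rating point.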

---
---

## Make Recommendations

In [47]:
scalers = {
    'users_scaler' : users_scaler,
    'movies_scaler' : movies_scaler,
    'ratings_scaler' : ratings_scaler
}

### Predicting movies for a user

In [48]:
target_user_id = 42
ratings_df[ratings_df['user_id'] == target_user_id]

Unnamed: 0,movie_id,user_id,rating
2557,4153,42,3.0
2558,4367,42,3.0


In [49]:
pred_df_user42 = utils.recommend_movies(target_user_id, model, None, **scalers)
pred_df_user42.head(10)

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step


Unnamed: 0,user_id,movie_id,predicted_rating
2288,42,6618,4.584171
139,42,6539,4.487654
860,42,60684,4.43921
910,42,76251,4.420294
189,42,31878,4.382574
1288,42,40815,4.371706
1807,42,61024,4.362256
1396,42,81229,4.356258
3021,42,104913,4.327493
3819,42,115210,4.326573
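
The body of `utils.recommend_movies` is not shown here, so the following is only a hypothetical outline of the core step: pair one user vector with every unique movie, score each pair, and sort. The `27/27` progress line is consistent with this, since 847 unique movies at Keras's default prediction batch size of 32 gives 27 batches. Scaling and inverse scaling are left out, and `predict_fn` is a stand-in for `model.predict`.

```python
import numpy as np
import pandas as pd


def recommend_sketch(user_vec, movies_df, predict_fn, top_n=10):
    """Hypothetical outline: score every unique movie for one user.

    user_vec   : 1-D sequence of the user's genre preferences
    movies_df  : DataFrame with 'movie_id' plus numeric feature columns
    predict_fn : callable(users_2d, movies_2d) -> 1-D array of scores
                 (stands in for model.predict; an assumption of this sketch)
    """
    unique_movies = movies_df.drop_duplicates(subset=["movie_id"])
    movie_feats = unique_movies.drop(columns=["movie_id"]).to_numpy(dtype=float)

    # Repeat the single user vector once per candidate movie
    user_rep = np.tile(np.asarray(user_vec, dtype=float), (len(unique_movies), 1))

    scores = predict_fn(user_rep, movie_feats)
    out = pd.DataFrame({
        "movie_id": unique_movies["movie_id"].to_numpy(),
        "predicted_rating": scores,
    })
    return out.sort_values("predicted_rating", ascending=False).head(top_n)


# Toy usage: the score is the dot product of preferences and genre flags
toy_movies = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],   # movie 1 is duplicated on purpose
    "comedy":   [1, 1, 0, 1],
    "drama":    [0, 0, 1, 1],
})
toy_user = [5.0, 1.0]  # likes comedy far more than drama
top = recommend_sketch(
    toy_user, toy_movies, lambda u, m: (u * m).sum(axis=1), top_n=2
)
print(top)
```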


In [50]:
## Explore the recommended movies features

recommendations_user42 = pd.concat([
    # movies that user has watched before
    pd.merge(
        pd.merge(
            # the predicted ratings
            pred_df_user42.drop(columns = 'user_id'),
        
            # the true ratings
            (
                ratings_df[ratings_df['user_id'] == target_user_id]
                .drop_duplicates(subset = ['movie_id'])
                .drop(columns = ['user_id'])
                .rename(columns = {'rating' : 'actual_rating'})
            ),
            how = 'inner',  # keep only movies found in both dataframes (the user has rated them)
            on  = ['movie_id']
        ),
    
        # movie_features
        movies_data.drop_duplicates(subset = ['movie_id']),
        how = 'inner',
        on  = 'movie_id'
    ),

    
    # top 10 recommendations (likely to include movies the user has not watched before)
    pd.merge(
        pd.merge(
            # the top 10 recommended movies 
            pred_df_user42.drop(columns = 'user_id').head(10),
        
            # the true ratings
            (
                ratings_df[ratings_df['user_id'] == target_user_id]
                .drop_duplicates(subset = ['movie_id'])
                .drop(columns = ['user_id'])
                .rename(columns = {'rating' : 'actual_rating'})
            ),
            how = 'left', # to bring movies in pred_df, whether they are in ratings (user watched them) or not
            on  = 'movie_id'
        ),
    
        # movie_features
        movies_data.drop_duplicates(subset = ['movie_id']),
        how = 'inner',
        on  = 'movie_id'
    )  
]).drop(columns = ['year', 'ave_rating'])
recommendations_user42

Unnamed: 0,movie_id,predicted_rating,actual_rating,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,horror,mystery,romance,sci-fi,thriller
0,4153,2.860564,3.0,0,0,0,0,1,0,0,0,1,0,0,1,0,0
1,4367,2.849543,3.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
0,6618,4.584171,,1,0,0,0,1,0,0,0,0,0,0,0,0,0
1,6539,4.487654,,1,1,0,0,1,0,0,0,1,0,0,0,0,0
2,60684,4.43921,,1,0,0,0,0,0,0,1,0,0,1,0,1,1
3,76251,4.420294,,1,0,0,0,1,0,0,0,0,0,0,0,0,0
4,31878,4.382574,,1,0,0,0,1,0,0,0,0,0,0,0,0,0
5,40815,4.371706,,0,1,0,0,0,0,0,0,1,0,0,0,0,1
6,61024,4.362256,,1,0,0,0,1,1,0,0,0,0,0,0,0,0
7,81229,4.356258,,1,0,0,0,1,0,0,0,0,0,0,0,0,0


### Predicting movies for a new user

In [51]:
ratings_df[ratings_df['user_id'] == 666]

Unnamed: 0,movie_id,user_id,rating


- No ratings are found, so this is a new user.

In [52]:
# new user that loves crime, drama, and comedy
new_user_id = 666
new_user = {}
for col in users_data.columns:
    if col == 'user_id':
        new_user[col] = new_user_id
    elif col in ['crime', 'drama', 'comedy']:
        new_user[col] = 5
    else:
        new_user[col] = 0
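
The loop above can also be written as a small reusable helper. `make_user_profile` below is a hypothetical function, not part of `utils`. It builds the same kind of dictionary and additionally rejects genre names that do not match a column, which guards against typos such as `sci_fi` when the column is `sci-fi`.

```python
def make_user_profile(user_id, columns, liked=(), liked_score=5):
    """Hypothetical helper: build a preference dict like the loop above.

    Columns listed in `liked` get `liked_score`; every other column gets 0.
    """
    unknown = set(liked) - set(columns)
    if unknown:
        raise ValueError(f"Unknown genre(s): {sorted(unknown)}")
    profile = {
        col: (liked_score if col in liked else 0)
        for col in columns
        if col != "user_id"
    }
    return {"user_id": user_id, **profile}


# Shortened column list for illustration; the notebook would pass users_data.columns
demo_columns = ["user_id", "rating_count", "rating_ave", "action", "comedy", "crime", "drama"]
demo_user = make_user_profile(666, demo_columns, liked=("crime", "drama", "comedy"))
print(demo_user)
```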

In [53]:
pred_df_new_user = utils.recommend_movies(new_user_id, model, user_preferences = new_user, **scalers)
pred_df_new_user.head(10)

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


Unnamed: 0,user_id,movie_id,predicted_rating
710,666,7323,4.197736
7054,666,6620,4.195607
1075,666,4979,4.194088
1918,666,5577,4.15608
4956,666,8366,4.143809
231,666,5902,4.063844
254,666,6942,4.060277
2052,666,48738,4.059286
7330,666,5380,4.055204
167,666,8949,4.04278


In [54]:
# Explain recommended movies features
pd.merge(
    # top 10 recommendations
    pred_df_new_user.head(10),

    # movies features
    movies_data.drop_duplicates(subset = ['movie_id']),

    how = 'inner',
    on  = ['movie_id']
).drop(columns = ['user_id', 'ave_rating', 'year'])

Unnamed: 0,movie_id,predicted_rating,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,horror,mystery,romance,sci-fi,thriller
0,7323,4.197736,0,0,0,0,1,0,0,1,0,0,0,0,0,0
1,6620,4.195607,0,0,0,0,1,0,0,1,0,0,0,0,0,0
2,4979,4.194088,0,0,0,0,1,0,0,1,0,0,0,0,0,0
3,5577,4.15608,0,0,0,0,1,0,0,1,0,0,0,0,0,0
4,8366,4.143809,0,0,0,0,1,0,0,1,0,0,0,0,0,0
5,5902,4.063844,0,0,0,0,1,0,0,1,0,0,0,1,0,0
6,6942,4.060277,0,0,0,0,1,0,0,1,0,0,0,1,0,0
7,48738,4.059286,0,0,0,0,0,0,0,1,0,0,0,0,0,1
8,5380,4.055204,0,0,0,0,1,0,0,1,0,0,0,1,0,0
9,8949,4.04278,0,0,0,0,1,0,0,1,0,0,0,1,0,0


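- The recommended movies align with the user's comedy and drama preferences: every movie in the top 10 is tagged as drama and nine are tagged as comedy. None of them is tagged as crime.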

### Recommending for a user with no preferences

In [55]:
ratings_df[ratings_df['user_id'] == 777]

Unnamed: 0,movie_id,user_id,rating


In [56]:
new_user2_id = 777
new_user2 = {}

for col in users_data.columns:
    if col == 'user_id':
        new_user2[col] = new_user2_id
    else:
        new_user2[col] = 0
new_user2

{'user_id': 777,
 'rating_count': 0,
 'rating_ave': 0,
 'action': 0,
 'adventure': 0,
 'animation': 0,
 'children': 0,
 'comedy': 0,
 'crime': 0,
 'documentary': 0,
 'drama': 0,
 'fantasy': 0,
 'horror': 0,
 'mystery': 0,
 'romance': 0,
 'sci-fi': 0,
 'thriller': 0}

In [57]:
pred_df_new_user2 = utils.recommend_movies(
    new_user2_id,
    model, 
    new_user2,
    **scalers
)

pred_df_new_user2.head(10)

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


Unnamed: 0,user_id,movie_id,predicted_rating
860,777,60684,3.61928
403,777,7438,3.581844
1741,777,6773,3.576361
2070,777,51935,3.496224
415,777,48774,3.48175
527,777,27773,3.475237
128,777,5618,3.443218
891,777,71899,3.428383
3344,777,69481,3.428376
1465,777,108932,3.391225


In [58]:
pd.merge(
    # recommended movies
    pred_df_new_user2.head(10),

    # movies features
    movies_data.drop_duplicates(subset = ['movie_id']),

    how = 'inner',
    on  = ['movie_id']
).drop(columns = ['user_id', 'year'])

Unnamed: 0,movie_id,predicted_rating,ave_rating,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,horror,mystery,romance,sci-fi,thriller
0,60684,3.61928,3.988372,1,0,0,0,0,0,0,1,0,0,1,0,1,1
1,7438,3.581844,3.868182,1,0,0,0,0,0,0,1,0,0,0,0,0,1
2,6773,3.576361,3.704545,0,0,1,0,1,0,0,0,1,0,0,0,0,0
3,51935,3.496224,3.86,1,0,0,0,0,0,0,1,0,0,0,0,0,1
4,48774,3.48175,3.945312,1,1,0,0,0,0,0,1,0,0,0,0,1,1
5,27773,3.475237,4.089744,0,0,0,0,0,0,0,0,0,0,1,0,0,1
6,5618,3.443218,4.155172,0,1,1,0,0,0,0,0,1,0,0,0,0,0
7,71899,3.428383,4.2,0,0,1,0,1,0,0,1,0,0,0,0,0,0
8,69481,3.428376,4.058824,1,0,0,0,0,0,0,1,0,0,0,0,0,1
9,108932,3.391225,3.870968,1,1,1,1,1,0,0,0,1,0,0,0,0,0


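- The `predicted_rating` values fall in the same general range as the `ave_rating` values, though they are consistently lower (by roughly 0.1 to 0.8 points in the rows above). This suggests that, for a user with no preferences, the model's predictions are driven mainly by the `ave_rating` feature.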
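
One detail worth keeping in mind for the all-zero profile: if it is standardized with `users_scaler` before prediction (which seems likely, since the scaler is passed to `utils.recommend_movies`, but is an assumption here), it does not reach the network as a zero vector. The toy sketch below reproduces the `StandardScaler` arithmetic in NumPy to show that a raw zero profile maps to `-mean/std`, while the average user is the one that maps to zeros.

```python
import numpy as np

# Toy per-genre averages for three users (not the real data)
prefs = np.array([
    [4.0, 0.0, 3.0],
    [3.0, 2.0, 5.0],
    [5.0, 4.0, 4.0],
])

# StandardScaler stores these as mean_ and scale_ (population std, ddof=0)
mean = prefs.mean(axis=0)
std = prefs.std(axis=0)

# A raw all-zero profile becomes -mean/std after standardization
zero_profile = np.zeros((1, prefs.shape[1]))
zero_scaled = (zero_profile - mean) / std

# The "average user" is the profile that becomes all zeros
mean_scaled = (mean.reshape(1, -1) - mean) / std

print(zero_scaled)
print(mean_scaled)
```

Under that assumption, a zero profile reads to the model as a user who rates every genre well below average, which may partly explain why the predictions sit below `ave_rating`.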

### Saving model and scalers

In [60]:
utils.save_object("recommender_dnn_model", model, is_tf_obj = True)
utils.save_object("fitted_scalers", scalers)
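
The implementation of `utils.save_object` is not shown. The sketch below is a hypothetical version that is consistent with the two calls above: Keras objects are written with their own `.save` method and everything else is pickled. The file extensions, the `out_dir` parameter, and the `load_pickle` helper are assumptions of this sketch.

```python
import pickle
import tempfile
from pathlib import Path


def save_object_sketch(name, obj, is_tf_obj=False, out_dir="."):
    """Hypothetical stand-in for utils.save_object; the real helper may differ."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if is_tf_obj:
        # Keras models provide .save(filepath); '.keras' is the native format
        path = out_dir / f"{name}.keras"
        obj.save(str(path))
    else:
        # Plain Python objects (such as a dict of fitted scalers) are pickled
        path = out_dir / f"{name}.pkl"
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    return path


def load_pickle(path):
    """Load an object written by the pickle branch above."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a plain dict standing in for the scalers
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_object_sketch("fitted_scalers", {"demo": [1, 2, 3]}, out_dir=tmp)
    restored_obj = load_pickle(saved_path)

print(restored_obj)  # {'demo': [1, 2, 3]}
```

Saving the scalers together with the model matters because new inputs must be transformed with the same fitted statistics at prediction time.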

---
---

***Alhamdulillah***