## *Last edit by DLao - 2020/09 updated with full data*





















![](https://cdn.statically.io/img/thakoni.com/f=auto%2Cq=30/wp-content/uploads/2020/06/1591106722_Lucifer-Season-5-Release-Date-Cast-Netflix-And-Everything-You.jpg)
# Netflix Analytics - Movie Recommendation through Correlations / CF
<br>

I love Netflix! Doesn't everyone?

This project aims to build a movie recommendation mechanism for Netflix. The dataset comes directly from Netflix. It consists of 4 text data files, each containing over 20M rows that cover over 4K movies and 400K customers. Altogether: **over 17K movies** and **500K+ customers**!

<br>
One of the major challenges is loading all of this data into the Kernel for analysis. The Kernel ran out of memory many times, and I tried several ways of loading the data more efficiently. Suggestions are welcome!

This kernel will be updated regularly. Let's get started!

<br>
Feel free to fork and upvote if this notebook is helpful to you in some way!


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
# Import everything from tensorflow.keras so standalone Keras and tf.keras objects are not mixed
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Reshape, Dot, Embedding, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

# Importing Data
### The raw data files are not in CSV format, so we first pre-process them into a single CSV file and then load that file into a pandas DataFrame

In [None]:
# Raw input files
files = ['../input/netflix-prize-data/combined_data_1.txt',
         '../input/netflix-prize-data/combined_data_2.txt',
         '../input/netflix-prize-data/combined_data_3.txt',
         '../input/netflix-prize-data/combined_data_4.txt']

# Each raw file lists a "movie_id:" header line followed by that movie's "user_id,rating,date" rows.
# Drop the header lines and prepend movie_id to every rating row, combining all files into one CSV.
# The conversion is skipped if data.csv already exists from a previous run.
if not os.path.isfile('data.csv'):
    with open('data.csv', mode='w') as data:
        for file in files:
            print("Opening file: {}".format(file))
            with open(file) as f:
                for line in f:
                    line = line.strip()
                    if line.endswith(':'):
                        movie_id = line.replace(':', '')
                    else:
                        data.write(movie_id + ',' + line + '\n')

# Read all data into a pd dataframe
df = pd.read_csv('data.csv', names=['movie_id', 'user_id','rating','date'])

df
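
To make the conversion step concrete, here is a minimal sketch that runs on a tiny in-memory sample laid out like the raw files (a `movie_id:` header line followed by `user_id,rating,date` rows). The sample values and the helper name `to_csv_rows` are made up for illustration and are not part of the dataset or the notebook.

```python
import io

def to_csv_rows(lines):
    """Yield 'movie_id,user_id,rating,date' strings from raw-format lines."""
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith(':'):
            # Header line: remember the movie id for the rows that follow
            movie_id = line[:-1]
        else:
            # Rating row: prepend the current movie id
            yield movie_id + ',' + line

# Made-up sample in the raw layout (two movies, three ratings)
sample = io.StringIO(
    "1:\n"
    "101,3,2005-09-06\n"
    "102,5,2005-05-13\n"
    "2:\n"
    "101,4,2005-09-05\n"
)

rows = list(to_csv_rows(sample))
for row in rows:
    print(row)
```

Each output row carries its own `movie_id`, which is what lets `pd.read_csv` load the combined file as a flat table.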

# Pre-process data
### From dataframe df, let's keep only a smaller dataset: the 2,000 most-rated movies and the 10,000 most active users (those who gave the most ratings), saved into a new dataframe: lite_rating_df

In [None]:
# Count ratings per user and keep the 10,000 most active users
group = df.groupby('user_id')['rating'].count()
top_users = group.sort_values(ascending=False)[:10000]

# Count ratings per movie and keep the 2,000 most-rated movies
group = df.groupby('movie_id')['rating'].count()
top_movies = group.sort_values(ascending=False)[:2000]

# Keep only the ratings given by a top user to a top movie
lite_rating_df = df[df['user_id'].isin(top_users.index)
                    & df['movie_id'].isin(top_movies.index)].copy()

# Re-index users and movies to contiguous integer ids (0..n_users-1 and 0..n_movies-1)
user_enc = LabelEncoder()
lite_rating_df['user'] = user_enc.fit_transform(lite_rating_df['user_id'].values)
movie_enc = LabelEncoder()
lite_rating_df['movie'] = movie_enc.fit_transform(lite_rating_df['movie_id'].values)

n_movies = lite_rating_df['movie'].nunique()
n_users = lite_rating_df['user'].nunique()

# print(n_movies, n_users)
lite_rating_df
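
The same filter-then-re-index idea can be sketched without pandas on a handful of made-up ratings. Everything here (the toy `ratings` list, the cutoffs of 2, the variable names) is an assumption for illustration. The sorted-id mapping imitates what `LabelEncoder` does for integer ids.

```python
from collections import Counter

# Made-up (user_id, movie_id, rating) triples
ratings = [
    (10, 7, 4), (10, 8, 5), (10, 9, 3),
    (20, 7, 2), (20, 8, 4),
    (30, 7, 5),
]
n_top_users, n_top_movies = 2, 2  # toy cutoffs (the notebook uses 10000 and 2000)

# Count ratings per user and per movie over the full data
user_counts = Counter(u for u, _, _ in ratings)
movie_counts = Counter(m for _, m, _ in ratings)
top_users = {u for u, _ in user_counts.most_common(n_top_users)}
top_movies = {m for m, _ in movie_counts.most_common(n_top_movies)}

# Keep only ratings where both the user and the movie made the cut
kept = [(u, m, r) for u, m, r in ratings if u in top_users and m in top_movies]

# Map the surviving ids to contiguous integers, in sorted order
user_index = {u: i for i, u in enumerate(sorted({u for u, _, _ in kept}))}
movie_index = {m: i for i, m in enumerate(sorted({m for _, m, _ in kept}))}
encoded = [(user_index[u], movie_index[m], r) for u, m, r in kept]

print(encoded)
```

Contiguous ids matter because an embedding layer looks rows up by position, so the ids must run from 0 up to the number of distinct users (or movies) minus one.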

# Prepare data for training

In [None]:
X = lite_rating_df[['user', 'movie']].values
y = lite_rating_df['rating'].values

# Hold out 10% of the data as a test set (to evaluate the final model)
X_training, X_test, y_training, y_test = train_test_split(X, y, test_size=0.1)

# Split the remainder into training and validation sets (to monitor performance during training)
X_train, X_val, y_train, y_val = train_test_split(X_training, y_training, test_size=0.1)

# Set the embedding dimension d of the matrix factorization
e_dimension = 50

X_train_array = [X_train[:, 0], X_train[:, 1]]
X_val_array = [X_val[:, 0], X_val[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]
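
Because the two splits are applied one after the other, the final proportions are not 80/10/10. A quick arithmetic sketch shows this, using a hypothetical row count of 1,000,000. The real count depends on the filtered data, and scikit-learn's own rounding may differ by a row.

```python
n_rows = 1_000_000   # hypothetical number of ratings, for illustration only
test_size = 0.1      # first split: fraction held out for testing
val_size = 0.1       # second split: fraction of the remainder used for validation

n_test = round(n_rows * test_size)
n_training = n_rows - n_test
n_val = round(n_training * val_size)
n_train = n_training - n_val

print(n_train, n_val, n_test)
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```

So roughly 81% of the rows are used for training, 9% for validation, and 10% for testing.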

# Build and train deep learning model
### Embeddings are used to represent each user and each movie in the data. The dot product of a user's embedding (a row of the n_users x e_dimension user embedding matrix) and a movie's embedding (a row of the n_movies x e_dimension movie embedding matrix) should be a good approximation of that user's rating for that movie. The model's goal is to minimize the distance between this dot product and the actual ratings (the training target)
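
As a small illustration of the dot-product idea, the NumPy sketch below scores user/movie pairs from two embedding matrices. The toy sizes and the random embeddings are assumptions: in the notebook the embeddings are learned during training, and the dot product is additionally passed through dense layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, e_dimension = 4, 3, 2   # toy sizes, not the notebook's

# One embedding row per user and per movie (random here, learned in practice)
U = rng.normal(size=(n_users, e_dimension))
M = rng.normal(size=(n_movies, e_dimension))

def predict(user, movie):
    """Score a single (user, movie) pair as the dot product of their embeddings."""
    return float(U[user] @ M[movie])

# All pairs at once: an (n_users, n_movies) matrix of scores
full = U @ M.T

# A batch of pairs, the way a model sees them: row-wise dot products
user_idx = np.array([0, 2, 3])
movie_idx = np.array([1, 1, 0])
batch_scores = np.sum(U[user_idx] * M[movie_idx], axis=1)

print(full.shape)
print(batch_scores)
```

The batch computation corresponds to what a `Dot(axes=1)` layer does with the two reshaped embedding outputs: one score per (user, movie) row.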

In [None]:
user = Input(shape=(1,))
u = Embedding(n_users, e_dimension, embeddings_initializer='he_normal',
              embeddings_regularizer=l2(1e-6))(user)
u = Reshape((e_dimension,))(u)
movie = Input(shape=(1,))
m = Embedding(n_movies, e_dimension, embeddings_initializer='he_normal',
              embeddings_regularizer=l2(1e-6))(movie)
m = Reshape((e_dimension,))(m)

x = Dot(axes=1)([u, m])
# Build last deep learning layers 
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(1)(x)

model = Model(inputs=[user, movie], outputs=x)
model.compile(loss='mean_squared_error',
              optimizer=Adam(learning_rate=0.001),
              metrics=[tf.keras.metrics.RootMeanSquaredError()]
              )

# Stop training early once the validation loss fails to improve for 1 epoch (patience=1)
callbacks_list = [keras.callbacks.EarlyStopping(monitor='val_loss',
                                                patience=1,
                                                ),
                  # Save the model whenever the validation loss improves
                  keras.callbacks.ModelCheckpoint(
                      filepath='Model_1',
                      monitor='val_loss',
                      save_best_only=True,
                      )]

# Print model info summary
model.summary()  

history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=20,
                    verbose=1, 
                    callbacks=callbacks_list,
                    validation_data=(X_val_array, y_val)
                    )

# Save the model (it is a good habit to always save a model after training)
# Note: this writes the final weights to the same path used by the checkpoint callback above
model.save("Model_1")
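
To show how patience-based early stopping behaves, here is a simplified pure-Python sketch. It is not Keras' actual implementation (it ignores options such as a minimum improvement threshold), and the helper name `stopping_epoch` and the validation losses are made up.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls at epoch 4
val_losses = [0.95, 0.88, 0.85, 0.86, 0.84]

print(stopping_epoch(val_losses, patience=1))
print(stopping_epoch(val_losses, patience=2))
```

With `patience=1`, training stops at the first epoch that fails to improve, even though a later epoch might have done better. A larger patience trades extra training time for tolerance to such noise.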

In [None]:
# Visualize the training and validation loss

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'ro', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
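# Evaluate RMSE on the held-out test set
rmse = tf.keras.metrics.RootMeanSquaredError()
# update_state expects (y_true, y_pred); flatten the predictions from shape (N, 1) to (N,)
rmse.update_state(y_test, model.predict(X_test_array).flatten())
rmse.result().numpy()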
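
For reference, RMSE can also be computed by hand. The NumPy sketch below uses made-up ratings and predictions, and the helper name `rmse_by_hand` is hypothetical. Flattening both inputs first avoids accidentally broadcasting an `(N,)` array against an `(N, 1)` array into an `(N, N)` matrix.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error between two arrays holding the same number of values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up example: true ratings and column-shaped predictions
y_true_toy = [4, 3, 5, 1]
y_pred_toy = [[3.5], [3.0], [4.0], [2.0]]

toy_rmse = rmse_by_hand(y_true_toy, y_pred_toy)
print(toy_rmse)
```

Here the errors are 0.5, 0, 1 and -1, so the mean squared error is 0.5625 and the RMSE is 0.75.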

## Conclusion
### On the test set, our model's RMSE is 0.7731. This looks like a large improvement over Cinematch's performance (0.9514) or that of the prize-winning team “BellKor’s Pragmatic Chaos” (0.8567), but the comparison is not really fair. Our model was not trained on the original full dataset, whose rating matrix is much sparser, and it cannot be tested on the qualifying dataset (Netflix's test data) because the competition has closed, so any comparison would hardly be valid.

### Therefore, my main purpose in this notebook is to show a deep learning approach to the challenge that is simple to apply and reaches decent accuracy. Any comments on further improvements or corrections are very welcome.