# Content-based model
This model was designed by [Dao Vang](https://www.linkedin.com/in/daovang/). Given a user, it produces a recommend/do-not-recommend list of movies. Based on the user's historical ratings of various movies, the neural network learns the characteristics of movies, such as genres and tags, and predicts whether the user would like a new movie. Below is the entire notebook from his [GitHub repository](https://github.com/dao-v/Movie_Recommendation_System). 
We used the [MovieLens 10M Dataset](https://grouplens.org/datasets/movielens/10m/) instead of the 25M dataset used in the original model. 

## Movie Recommendation System:

This script was designed to predict the ratings of movies unseen by users according to the genres and tags associated with the movies they've liked and disliked. The dataset was obtained from grouplens (University of Minnesota, https://grouplens.org/datasets/movielens/), specifically the "MovieLens 25M Dataset": http://files.grouplens.org/datasets/movielens/ml-25m.zip (250 MB)<br>
The resulting DataFrame produced by this script will choose the top ten movies with the highest predicted ratings.

To see a more step-by-step walkthrough of the code, see my Medium post: https://medium.com/@dv930/recommendation-system-for-movies-movielens-grouplens-171d30be334e

### Predictive Model Design:
Predictions are the result of three models. The first model feeds the genres of the movies that the user liked and disliked, plus the genres of the movie in question, into a neural network (deep learning model) and outputs a predicted rating. The second model is similar to the first, except that it uses the tags associated with the movies the user has watched and the tags associated with the movie in question to predict a rating. The third model takes the two predicted ratings from the first two models and predicts a final rating using linear regression.
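
The third, blending step can be illustrated with a minimal sketch. It assumes ordinary least squares via `numpy` as a stand-in for the linear regression named above; the arrays `genre_pred`, `tag_pred`, and `true_rating` and the helper `final_rating` are made-up illustrations, not values or names from the notebook.

```python
import numpy as np

# Hypothetical predictions from the genre model and the tag model for five
# (user, movie) pairs, plus the ratings the users actually gave.
genre_pred = np.array([3.5, 4.0, 2.0, 4.5, 3.0])
tag_pred = np.array([3.0, 4.5, 2.5, 4.0, 3.5])
true_rating = np.array([3.0, 4.5, 2.0, 4.5, 3.5])

# Design matrix: one column per base model plus an intercept column.
X = np.column_stack([genre_pred, tag_pred, np.ones_like(genre_pred)])

# Ordinary least squares fit of the final rating on the two base predictions.
coef, *_ = np.linalg.lstsq(X, true_rating, rcond=None)


def final_rating(genre_score, tag_score):
    """Blend the two base predictions with the fitted weights."""
    return float(coef[0] * genre_score + coef[1] * tag_score + coef[2])


blended = final_rating(4.0, 4.5)
print(round(blended, 2))
```

The fitted weights indicate how much the blend leans on each base model, which is one reason a simple linear model is a common choice for this kind of final step.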

Not every user will be included in this script: a user who has watched only one movie with few or no tags cannot be fed into the tags model.

(Files: links.csv, genome-scores.csv, and genome-tags.csv are not used in this script)

### How to Use:
- To begin using this script, download the "MovieLens 25M Dataset" and extract the folder (ml-25m) into the same directory where this script is located.
- [Optional] Second, unzip data.zip (from the GitHub repository) into the same directory.
- Next, simply run all the cells in this notebook and all additional folders will be produced as needed. 
- Lastly, the top ten movies (as well as the predictions for all movies for a specific user) will be displayed in cells 20-22, using the userId of 6550 as an example. These two DataFrames will not be saved automatically and will only be there to show the results. 

### Things to Note:
- All users are included in the statistics in cell 18, including those who've only rated a couple of movies. Many recommendation systems attempt to address this by borrowing the profiles of other users who've watched/rated the same movies and applying them to the user with less data. This script does not directly deal with users with low data--instead, the movie(s) that the users did watch will have a majority (or sole) impact on what the models will predict.
- Most of the cells will finish running within a couple of hours at most. However, cell 13 will most likely run for over 2 days before finishing. A copy of the resulting DataFrame is included in the GitHub repository in the correct location for the script to find the CSV file. If your dataset differs from the "MovieLens 25M Dataset", or you simply want to run the entire script, remove the triple quotes in cell 13 before running.
- The very last cell (23) can be run to produce predictions for all users. However, the removal of the triple single quotes (comment syntax) is required before running.
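
Part of cell 13's cost may come from how tags are counted: it calls `list.count` once for every tag in a user's list, which rescans the whole list each time. A minimal sketch of a single-pass alternative, assuming only the standard library's `collections.Counter`; the helper name `top_tags` and the sample tags are made up for illustration and are not part of the notebook:

```python
from collections import Counter


def top_tags(tags, limit=50):
    """Return up to `limit` distinct tags, most frequent first."""
    counts = Counter(tags)  # counts every tag in one pass over the list
    return [tag for tag, _ in counts.most_common(limit)]


liked_tags = ["sci-fi", "space", "sci-fi", "dystopia", "space", "sci-fi"]
print(top_tags(liked_tags, limit=2))
```

Reading one CSV file per rated movie is likely to remain a large share of the runtime, so this sketch addresses only the counting step.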


### Minimum System Requirements & Runtimes (IMPORTANT):
Since the "ratings.csv" file is extremely large (25+ million rows), this script requires a large amount of RAM if using a personal system. My system has 48 GB of RAM and it is fully utilized during the model training for random forest (cell 18).  If you do not have that much RAM and (understandably) will not upgrade/add more RAM to your system, it is advised to slice "ratings.csv" in cell 2 down to a subset that is more manageable for your system. In addition, setting a maximum depth for each random forest tree and lowering the number of trees trained in cell 18 might be required. From my experience, selecting a subset of "ratings.csv" for training does not seem to have a large impact on the performance of the models. If possible, loading in "ratings.csv" and shuffling it before subsetting is recommended since the CSV file is ordered by users.
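
The shuffle-then-subset advice can be sketched as follows. This is an illustration under stated assumptions: the small DataFrame stands in for "ratings.csv", and `n_rows` and the `random_state` value are arbitrary choices, not settings from the notebook.

```python
import pandas as pd

# Small stand-in for ratings.csv, ordered by userId as in the real file.
ratings_df = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 40, 20, 30, 50],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5],
})

n_rows = 5  # hypothetical subset size; pick what fits in memory

# Shuffle all rows reproducibly, then keep the first n_rows.
subset_df = (
    ratings_df.sample(frac=1, random_state=42)
    .head(n_rows)
    .reset_index(drop=True)
)
print(subset_df)
```

With the real file, `ratings_df` would come from the `pd.read_csv` call in cell 2. Note that sampling rows thins out each user's history; sampling whole users instead would be a different trade-off.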

The entire script (without cells 13 and 23) should finish running within a day if the system is up-to-date on hardware and no major background applications are running. This script was run using the GPU version of TensorFlow. If you do not have TensorFlow installed, have a supported Nvidia GPU in your system, and are using an Anaconda distribution, install TensorFlow GPU using the instructions by Anaconda: https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/

The last cell (cell 23) is expected to take many days to run.

All components created and the original dataset will take up about 2.13 GB of storage. Running cell 23 is estimated to additionally take about 150 MB of storage.


(No Python parallel programming is used in this script. Only Sklearn uses all CPU cores, during the random forest training.)

To compare, my system specifications are: <br>
CPU = Intel i5-9600k (6-cores, factory settings) <br>
GPU = Nvidia RTX 2070<br>
Primary Storage Device = NVMe, Western Digital Black SN750<br>
RAM Speed = DDR4 3000


<br><br><br><br>
VERSION 1.01<br><br>
Version 1.01 Updates:<br>
~ Changed all "\\" to "/" (for universal usage in Windows and Linux)<br>
~ Changed all "to_csv()" to include the "index = False" argument<br>
~ Changed all "read_csv()" to remove all "index_col = 0" arguments<br>
~ "common_tags_df" creation fixed ("tag" column was becoming an index when using ".groupby()")<br>
~ Exporting vectorized_dict dictionary as a pickle file (as well as loading it in). Adding the pickle file to data.zip<br>

<br><br>
Questions? Comments? Email me: dv930@nyu.edu

In [None]:
# CELL 1

import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import classification_report
from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression
#from spellchecker import SpellChecker # pyspellchecker

import re, os, math, sklearn, datetime, pickle

In [None]:
# CELL 2

# Loading in all relevant datasets (ignoring links.csv, genome-scores.csv, genome-tags.csv)
## Datasets are from: https://grouplens.org/datasets/movielens/25m/
### Datasets are stored in the original folder name, "ml-25m"

movies_df = pd.read_csv('ml-25m/movies.csv')
ratings_df = pd.read_csv('ml-25m/ratings.csv').iloc[:500000, :]  ## Change the "500000" to your desired size
tags_df = pd.read_csv('ml-25m/tags.csv')

In [None]:
# CELL 3

movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10681 entries, 0 to 10680
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  10681 non-null  int64 
 1   movieId     10681 non-null  int64 
 2   title       10681 non-null  object
 3   genres      10681 non-null  object
dtypes: int64(2), object(2)
memory usage: 333.9+ KB


In [None]:
# CELL 4

ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    500000 non-null  int64  
 1   userId        500000 non-null  int64  
 2   movieId       500000 non-null  int64  
 3   rating        500000 non-null  float64
 4   timestamp     500000 non-null  int64  
 5   user_emb_id   500000 non-null  int64  
 6   movie_emb_id  500000 non-null  int64  
dtypes: float64(1), int64(6)
memory usage: 26.7 MB


In [None]:
# CELL 5

tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95580 entries, 0 to 95579
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  95580 non-null  int64 
 1   userId      95580 non-null  int64 
 2   movieId     95580 non-null  int64 
 3   tags        95564 non-null  object
 4   timestamp   95580 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 3.6+ MB


In [None]:
# CELL 6

# Moving the data out of ratings_df and tags_df for the last movie the user liked to be used as the label:
## Rating of 4+ = liked
# Creating directories:
if os.path.exists('data/') != True:
    os.mkdir('data/')
    
if os.path.exists('data/last_liked_tags/') != True: 
    os.mkdir('data/last_liked_tags/')
    
# Getting the last movie liked from ratings_df:
ratings_df_copy = ratings_df.copy()
tags_df_copy = tags_df.copy()

users_list = list(set(ratings_df_copy.userId)) ## List of all users in the dataset

ratings_index_list = [] ## These empty lists will be used to remove the last liked movies from the ratings_df and tags_df_mod copies
tags_index_list = []

last_ratings_df = pd.DataFrame() ## Want to save all the last liked movies rated into a single CSV file

counter = 0

for user in users_list:
    try: ## Some users did not rate a movie highly enough and will be removed from the dataset
        temp_df = ratings_df_copy[ratings_df_copy.userId == user].copy()
        temp_df = temp_df[temp_df.rating >= 4] # = Liked Movie

        last_time = max(temp_df.timestamp) ## If the user did not have a "liked" movie, this will return an error

        temp_df = temp_df[temp_df.timestamp == last_time] ## Isolating the last liked movie rated for each user
        
        if len(temp_df) > 1: ## Some of the movies were rated at the same timestamp; only the last one on spliced DF will be removed
            temp_df = temp_df.iloc[[len(temp_df)-1]]
            
        ratings_index_list.append(temp_df.index.values[0]) ## Appending the index of the last movies watched

        if counter == 0:
            last_ratings_df = temp_df
            counter = 1

        else:
            last_ratings_df = pd.concat([last_ratings_df, temp_df], ignore_index= True)
        
    except Exception:
        ratings_index_list.append(ratings_df_copy[ratings_df_copy.userId == user].index.values[0]) ## Adding the index of the users who did not highly rate a movie
    
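    try:  ## Some users have not created tags
        temp_df = tags_df_copy[tags_df_copy.userId == user].copy()
        ## tags_df has no "rating" column, so the last liked movie is looked up from the user's ratings
        liked_df = ratings_df_copy[(ratings_df_copy.userId == user) & (ratings_df_copy.rating >= 4)]
        last_movie = liked_df[liked_df.timestamp == liked_df.timestamp.max()].movieId.values[-1]  ## Raises IndexError if the user has no liked movie
        temp_df = temp_df[temp_df.movieId == last_movie]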
        
        
        if len(temp_df) == 0: ## MOST USERS DID NOT CREATE TAG(S) FOR THE LAST MOVIE LIKED
            continue
            
        else:
            temp_df.to_csv('data/last_liked_tags/' + str(user) + '.csv', index = False)  ###!!! THESE TAGS WILL NOT BE USED AND IS STORED FOR EXAMINATION PURPOSES; these must be removed for "proper" datasets when training to exclude data related to the label from being used in the training data
            tags_index_list.extend(list(temp_df.index.values))  ## This is a .extend since there are most likely more than one timestamp per movie
    
    except Exception:
        pass
    
last_ratings_df.to_csv('data/last_liked_ratings.csv', index = False)

# Removing the last movies from ratings_df_copy and tags_df_copy:
ratings_df_removed = ratings_df_copy.drop(ratings_index_list)

tags_df_removed = tags_df_copy.drop(tags_index_list)

ratings_df_removed.to_csv('data/ratings_df_last_liked_movie_removed.csv', index = False)
tags_df_removed.to_csv('data/tags_df_last_liked_movie_removed.csv', index = False)

In [None]:
# CELL 7

'''
	tags.csv (remove timestamp column)
		- Lower case all tags
		- Remove tags with 1-2 letter words and remove parenthesis from the tags
			These are opinions/more like mini-reviews
			It will be hard to compare these into same categories 
			KEEP ALL TAGS WITH "based"
		- Spell check all tags (use pyspellchecker: https://pypi.org/project/pyspellchecker/) (Disabled since it seems all tags have been spell checked) 
		- Create a new DF
			Associate each movie to the tags (ignoring/removing the userId)
				Then add all same tags together
		- Create a new DF
			Associate each userId with all the tags they inputted
				Count all the tags
						Thinking is that the tag with the most counts will be the subject/genre/type that the user likes to watch the most
						(Goal for this DF is to describe the user)
'''

## Deleting 'timestamp' column since it is not informative (not examining viewer's social behaviors)
## Dropping all NaN values, which are only seen in the tag column--this is to eliminate considering NaNs when looping below
tags_df_removed = pd.read_csv('data/tags_df_last_liked_movie_removed.csv')

tags_df_mod = tags_df_removed.copy().drop('timestamp', axis=1).dropna()
tags_df_mod['tag'] = tags_df_mod['tags'].str.lower()# Making all tags lowercased for uniform format


for index, row in tags_df_mod.iterrows():
    tag = row.tag#.split()  ## splitting words for spell check
    
    correct_tag = re.sub(r' \([^)]*\)', '', tag)  ## Removing all parenthesis and its contents, including the whitespace before
        
    # First if:
    if 'based' in correct_tag: ## This is necessary because it is a common tag and avoids the other if statements downstream
        tags_df_mod.loc[index, 'tag'] = correct_tag
        continue
        
    # Second if:    
    if '-' in correct_tag: ## This is to keep "sci-fi" from being removed in the next if statement
        tags_df_mod.loc[index, 'tag'] = correct_tag
        continue
        
    # Third if:    
    if re.findall(r'\b\w{2}\b', correct_tag):
        tags_df_mod.loc[index, 'tag'] = np.nan ## Replacing tags that contain a two-letter word; Need to maintain index ordering, will delete NaNs later
        
    elif re.findall(r'\b\w{1}\b', correct_tag):
        tags_df_mod.loc[index, 'tag'] = np.nan ## Replacing tags that contain a one-letter word
        
    elif tag == correct_tag: ## This is for better performance since replacing significantly slows the process
        continue
        
    else:
        tags_df_mod.loc[index, 'tag'] = correct_tag ## Saves the corrected tag
        pass
        
tags_df_mod = tags_df_mod.dropna() # Dropping all tags that contain words of two letters or fewer

tags_df_mod.to_csv('data/tags_df_mod.csv', index = False)
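
The cleaning rules in cell 7 can also be written as one small function, which makes them easier to test in isolation. This is a sketch that mirrors the rules above as I read them; the function name `clean_tag` is an assumption and does not appear in the notebook.

```python
import re


def clean_tag(tag):
    """Return the cleaned tag, or None if the tag should be dropped."""
    # Lowercase, then remove a parenthesised note and the space before it.
    cleaned = re.sub(r' \([^)]*\)', '', tag.lower())
    # Tags containing "based" or a hyphen (for example "sci-fi") are always kept.
    if 'based' in cleaned or '-' in cleaned:
        return cleaned
    # Drop tags that contain any standalone one- or two-character word.
    if re.search(r'\b\w{1,2}\b', cleaned):
        return None
    return cleaned


for raw in ["Sci-Fi", "based on a book", "so bad", "Time Travel (classic)"]:
    print(repr(raw), '->', repr(clean_tag(raw)))
```

Applied to a DataFrame, such a function could be passed to `Series.map` in place of the `iterrows` loop, followed by the same `dropna()` call.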
        

In [None]:
# CELL 8

# Creating a new DF that contains the most common tags for each movie ("movieId"):

## This will create a new DF for each movie and will store this file since there is no easy storage method for this task
### Storage will be in the "data" folder under the "movie_tags" subfolder:
    
if os.path.exists('data/movie_tags/') != True: # Creating movie_tags subfolder
    os.mkdir('data/movie_tags/')
    
## Creating a copy of tags_df_mod and dropping userID:
tags_df_mod = pd.read_csv('data/tags_df_mod.csv')

tags_df_no_user = tags_df_mod.copy().drop('userId', axis= 1)

## Obtaining a list of all movieId with tags:
### !!!! The set() function does not put the list in perfect order. Some of the IDs are out-of-place.
movieId_list = list(set(tags_df_no_user.movieId))  

for movieId in movieId_list:
    df_select = tags_df_no_user[tags_df_no_user.movieId == movieId].copy().drop('movieId', axis= 1)
    
    df_select['COUNT'] = 1
    
    df_select_group = df_select.groupby(['tag']).count()
    
    df_select_group = df_select_group.sort_values(by=['COUNT'], ascending= False).reset_index()
    
    df_select_group.to_csv('data/movie_tags/' + str(movieId) + '.csv', index = False)

In [None]:
# CELL 9

# Creating a new DF that contains the most common tags for each user ("userId"):
## This DF is similar to the movieId DF that is previously created except this ties the tags in with each user
### This can be used in conjunction with the most common genres watched by the user to help determine which movies they like to watch
if os.path.exists('data/user_tags/') != True: # Creating user_tags subfolder
    os.mkdir('data/user_tags/')
    
## Creating a copy of tags_df_mod and dropping movieId:
tags_df_mod = pd.read_csv('data/tags_df_mod.csv')

tags_df_user = tags_df_mod.copy().drop('movieId', axis= 1)

## Obtaining a list of all userId with tags:
userId_list = list(set(tags_df_user.userId))

for userId in userId_list:
    df_select = tags_df_user[tags_df_user.userId == userId].copy().drop('userId', axis= 1)
    
    df_select['COUNT'] = 1
    
    df_select_group = df_select.groupby(['tag']).count()
    
    df_select_group = df_select_group.sort_values(by=['COUNT'], ascending= False).reset_index()
    
    df_select_group.to_csv('data/user_tags/' + str(userId) + '.csv', index = False)

In [None]:
# CELL 10

# Creating another DF that contains the most common tags created by users:
## Common = the tag was used 35 times or more
tags_df_mod = pd.read_csv('data/tags_df_mod.csv')

common_tags_df = tags_df_mod.groupby(['tag']).count().sort_values('userId', ascending= False).copy().drop('movieId', axis= 1).reset_index()

common_tags_df = common_tags_df[common_tags_df.userId >= 35]

common_tags_df.to_csv('data/common_tags.csv', index = False)

In [None]:
# CELL 11

'''
movies.csv
		- Move all years to its own Year column
		- Expand all genres into their own columns and use 0 & 1 as no or yes
			REMOVE "IMAX" from genre
		- Add average user rating from ratings.csv (include average + std & average - std)
			Also add number of users who watched the movie
'''
ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')
movies_df_mod = movies_df.copy()

movies_df_mod['YEAR'] = 0
movies_df_mod['UPPER_STD'] = 0
movies_df_mod['LOWER_STD'] = 0
movies_df_mod['AVG_RATING'] = 0
movies_df_mod['VIEW_COUNT'] = 0

# Making the genres into columns:
## First, need to obtain a list of all the genres in the dataset.
#### !!!! Note: "IMAX" is not listed in the readme but is present in the dataset. "Children's" in the readme is "Children" in the dataset.
genres_list = []
for index, row in movies_df.iterrows():
    try:
        genres = row.genres.split('|')
        genres_list.extend(genres)
    except Exception:
        genres_list.append(row.genres)
        
genres_list = list(set(genres_list))
genres_list.remove('IMAX')
genres_list.remove('(no genres listed)') # Replace with 'None'
genres_list.append('None')

for genre in genres_list: # Creating new columns with names as genres
    movies_df_mod[genre] = 0  # 0 = movie is not considered in that genre


for index, row in movies_df_mod.iterrows():
    movieId = row.movieId
    title = row.title
    
    try:
        genres = row.genres.split('|') ## Multiple genres for the movie is separated by '|' in the one string; converts to list
    except Exception:
        genres = [row.genres] ## Fallback if genres is not a string; wrapping in a list avoids splitting a value into characters
        
        
    #print(index)
    
    # Extracting the year from the title:
    try: ## Some titles do not have the year--these will be removed downstream to remove the need to access the OMDb API (http://www.omdbapi.com/)
        matcher = re.compile(r'\(\d{4}\)')  ## Raw string; need to extract '(year)' from the title in case there is a year in the title
        parenthesis_year = matcher.search(title).group(0)
        matcher = re.compile(r'\d{4}') ## Matching the year from the already matched '(year)'
        year = matcher.search(parenthesis_year).group(0)

        movies_df_mod.loc[index, 'YEAR'] = int(year)
    
    except Exception:
        pass
    
    # Merging info from ratings_df into movies_df
    try:
        ratings_df_select = ratings_df_removed[ratings_df_removed.movieId == movieId]  ## Gathering the reviews for the movies
        std = np.std(ratings_df_select.rating)
        average_rating = np.mean(ratings_df_select.rating)

        upper_std = average_rating + std

        if upper_std > 5:   # This is to prevent the upper range from passing the max rating value
            upper_std = 5

        lower_std = average_rating - std

        if lower_std < 0.5:
            lower_std = 0.5

        view_count = len(ratings_df_select)

        movies_df_mod.loc[index, 'UPPER_STD'] = upper_std
        movies_df_mod.loc[index, 'LOWER_STD'] = lower_std
        movies_df_mod.loc[index, 'AVG_RATING'] = average_rating
        movies_df_mod.loc[index, 'VIEW_COUNT'] = view_count
        
    except Exception:
        pass

    
    # Changing all columns that are labelled as genres to 1 if the movie is in that genre:
    if 'IMAX' in genres:
        genres.remove('IMAX')
        
    if '(no genres listed)' in genres:
        genres.remove('(no genres listed)')
        genres.append('None')
        
    for genre in genres:
        movies_df_mod.loc[index, genre] = 1
        
movies_df_mod = movies_df_mod[movies_df_mod.YEAR != 0] ## Removing all movies without years in the title
movies_df_mod = movies_df_mod[movies_df_mod.VIEW_COUNT != 0] ## Removing all movies that have not been rated

movies_df_mod.to_csv('data/movies_mod.csv', index = False)
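
The genre columns and the year built in cell 11 can also be derived without a row loop. A hedged sketch, assuming pandas' `Series.str.get_dummies` and `Series.str.extract`; the three sample movies and the name `movies_flags_df` are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for movies.csv.
movies_df = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Untitled Short"],
    "genres": ["Animation|Children|Comedy", "Action|Crime|IMAX", "(no genres listed)"],
})

# One 0/1 column per genre; drop IMAX and rename the "no genres" marker.
genre_flags = movies_df["genres"].str.get_dummies(sep="|")
genre_flags = genre_flags.drop(columns=["IMAX"], errors="ignore")
genre_flags = genre_flags.rename(columns={"(no genres listed)": "None"})

# Year: the four digits of the first "(dddd)" group in the title, 0 if absent.
year = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

movies_flags_df = movies_df[["movieId", "title"]].join(genre_flags)
movies_flags_df["YEAR"] = pd.to_numeric(year).fillna(0).astype(int)
print(movies_flags_df)
```

As in the cell above, rows with `YEAR == 0` could then be filtered out before saving.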

In [None]:
# CELL 12

# Combining ratings_df and movies_df_mod together:
movies_df_mod = pd.read_csv('data/movies_mod.csv')

ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')

ratings_movies_df = ratings_df_removed.merge(movies_df_mod, how= 'left', on= 'movieId').dropna()  ## Some of the movies were removed when creating movies_df_mod, which will result in nan values for some rows

# Getting a count of all the liked and dislike genres and transforming it into a percentage (liked genre counts / all liked genres counts)
## If the user rated the movie 4+, then they liked it. If lower than 4, then they disliked it.
users_list = list(set(ratings_movies_df.userId))

total_user_like_df = pd.DataFrame()
total_user_dislike_df = pd.DataFrame()

progress_counter_1 = 0
progress_counter_2 = .10

for user in users_list:
    temp_df = ratings_movies_df[ratings_movies_df.userId == user]
    like_df = temp_df[temp_df.rating >= 4].iloc[:, 14:] ## Only selecting the genres
    dislike_df = temp_df[temp_df.rating < 4].iloc[:, 14:]
    
    liked_total_counts = 0
    liked_dict = {'userId': user,'War': 0, 'Animation': 0, 'Horror': 0, 'Sci-Fi': 0, 'Fantasy': 0, 'Thriller': 0, 'Crime': 0, 'Mystery': 0, 
                  'Documentary': 0, 'Children': 0, 'Action': 0, 'Adventure': 0, 'Musical': 0,'Film-Noir': 0, 'Drama': 0, 
                  'Romance': 0, 'Comedy': 0, 'Western': 0, 'None': 0}
    
    disliked_total_counts = 0
    disliked_dict = {'userId': user,'War': 0, 'Animation': 0, 'Horror': 0, 'Sci-Fi': 0, 'Fantasy': 0, 'Thriller': 0, 'Crime': 0, 'Mystery': 0, 
                  'Documentary': 0, 'Children': 0, 'Action': 0, 'Adventure': 0, 'Musical': 0,'Film-Noir': 0, 'Drama': 0, 
                  'Romance': 0, 'Comedy': 0, 'Western': 0, 'None': 0}   
    
    progress_counter_1 += 1
    if progress_counter_1 / len(users_list) >= progress_counter_2:
        print(progress_counter_1 / len(users_list) * 100, '%')
        progress_counter_2 += .10
    
    for genre in list(like_df.columns): ## Getting all the genre counts for liked and disliked, separately
        if len(like_df) == 0: ## If the user has not given a movie a rating of 4 or higher
            pass
        
        else:
            liked_total_counts += sum(like_df[genre])
        
        
        if len(dislike_df) == 0: ## If the user has not given a movie a rating of 3.5 or lower
            pass
        
        else:
            disliked_total_counts += sum(dislike_df[genre])
        
        
    for genre in list(like_df.columns):
        if liked_total_counts == 0: 
            pass
        
        else:
            liked_genre_total_counts = sum(like_df[genre])
            liked_dict[genre] = liked_genre_total_counts/liked_total_counts
            
            
        if disliked_total_counts == 0:
            pass
        
        else:
            disliked_genre_total_counts = sum(dislike_df[genre])
            disliked_dict[genre] = disliked_genre_total_counts/disliked_total_counts
        
    
    user_like_df = pd.DataFrame(liked_dict, index=[0]) ## Even though some users have not rated a movie higher or lower than 4, the zero counts will still be added for complete-ness
    user_dislike_df = pd.DataFrame(disliked_dict, index=[0])
    
    # Concatenating the user total counts 
    if len(total_user_like_df) == 0:
        total_user_like_df = user_like_df
    
    else:
        total_user_like_df = pd.concat([total_user_like_df, user_like_df], ignore_index= True)
        
    if len(total_user_dislike_df) == 0:
        total_user_dislike_df = user_dislike_df
        
    else:
        total_user_dislike_df = pd.concat([total_user_dislike_df, user_dislike_df], ignore_index= True)
        
total_user_like_df.to_csv('data/total_user_like_df.csv', index = False)
total_user_dislike_df.to_csv('data/total_user_dislike_df.csv', index = False)
        
##########################################
# The reason why the counts are in percentage is so that the counts/genres are scaled against each other rather than a raw count
## This is more important for the models since someone who rated a lot of movies vs someone who rated a few movies would have higher counts
## but the higher counts is not meaningful and will most likely skew the model weights 

10.024516480523019 %
20.021792427131572 %
30.019068373740126 %
40.01634432034868 %
50.01362026695724 %
60.010896213565786 %
70.00817216017434 %
80.0054481067829 %
90.00272405339145 %
100.0 %
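
The per-user genre shares computed in cell 12 can be sketched with a `groupby` instead of a loop over users. This is an illustration under assumptions: the sample ratings, the three genre columns, and the names `genre_cols` and `liked_share` are invented here, and the genre columns are selected by name rather than by position.

```python
import pandas as pd

# Stand-in for ratings_movies_df with three genre columns.
ratings_movies_df = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3],
    "rating": [5.0, 4.0, 2.0, 4.5, 2.0],
    "Action": [1, 0, 1, 0, 1],
    "Comedy": [0, 1, 1, 1, 0],
    "Drama": [1, 1, 0, 0, 0],
})
genre_cols = ["Action", "Comedy", "Drama"]

# Count liked genres per user (rating of 4+ = liked).
liked = ratings_movies_df[ratings_movies_df.rating >= 4]
liked_counts = liked.groupby("userId")[genre_cols].sum()

# Divide each user's row by its own total so the shares sum to 1.
liked_share = liked_counts.div(liked_counts.sum(axis=1), axis=0).fillna(0)

# Users with no liked movie get an all-zero row, matching the loop's behaviour.
all_users = ratings_movies_df.userId.unique()
liked_share = liked_share.reindex(all_users, fill_value=0)
print(liked_share)
```

The disliked shares would follow the same pattern with `rating < 4`.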


In [None]:
# CELL 13

# !!! This cell will most likely take 2 days or more on a personal system. 
# !!! A premade CSV file (like_dislike_tags.csv) and pickle file are already included in the GitHub repository, in case you do not want to wait and the original dataset is the "MovieLens 25M Dataset".
# !!! Otherwise, remove the triple single quotes before running this cell.

# Creating a dictionary of vectorized tags:
if os.path.exists('data/final/') != True:
    os.mkdir('data/final/')

'''
common_tags = pd.read_csv('data/common_tags.csv', index_col= False)

tags = list(set(common_tags.tag))

vector_counter = 0
vectorized_dict = {}

for tag in tags:
    vectorized_dict[tag] = vector_counter
    vector_counter += 1

ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')

user_list = list(set(ratings_df_removed.userId))

like_dislike_tags = pd.DataFrame()
index_counter = 0

progress_counter_1 = 0
progress_counter_2 = 5
start_time = datetime.datetime.now()
print('Start Time:', start_time)

for user in user_list:
    progress_counter_1 += 1

    temp_ratings_df = ratings_df_removed[ratings_df_removed.userId == user]
    like_tags_df = pd.DataFrame()
    dislike_tags_df = pd.DataFrame()
        
    for index, row in temp_ratings_df.iterrows():  ## Creating tags for each user
        try: ### This is to check if the movie tags exist
            if row.rating >= 4: # Like
                temp_movie_df = pd.read_csv('data/movie_tags/{}.csv'.format(str(int(row.movieId)))) ## This oddly turns the movieId into a float, most likely to match the other data types in the selected series

                if len(like_tags_df) == 0:
                    like_tags_df = temp_movie_df

                else:
                    like_tags_df = pd.concat([like_tags_df, temp_movie_df], ignore_index= True)

            else:
                temp_movie_df = pd.read_csv('data/movie_tags/{}.csv'.format(str(int(row.movieId))))

                if len(dislike_tags_df) == 0:
                    dislike_tags_df = temp_movie_df

                else:
                    dislike_tags_df = pd.concat([dislike_tags_df, temp_movie_df], ignore_index= True)
        except Exception:
            pass
                
    ## Counting all tags
    try:  ### This is to check if the user has movies they've liked or disliked. Users who only have liked movies will be skipped (example: userId 173)
        like_tags_list = list(like_tags_df.tag)
        dislike_tags_list = list(dislike_tags_df.tag)
    except Exception:
        continue
    
    like_dict = {}
    dislike_dict = {}
    
    for tag in like_tags_list:
        like_dict[tag] = like_tags_list.count(tag) * -1  ### This is multiplied by -1 to convert it to a negative numerical count for the sorting that will be done next
    
    for tag in dislike_tags_list:
        dislike_dict[tag] = dislike_tags_list.count(tag) * -1
        
    ## Sorting the dictionary by the tag counts (smallest to largest is by default and simplest; in this case, the multiplication by -1 makes the tags with the largest counts the first in the sorted list)
    like_tags_counted = sorted(like_dict, key= lambda tag: like_dict[tag])  ## Returns a list of the tags
    dislike_tags_counted = sorted(dislike_dict, key= lambda tag: dislike_dict[tag])
    
    ## Converting the tags to vectorized tags but only for the first 50 tags from the like and dislike tags counted lists
    like_tags_vectorized = []
    dislike_tags_vectorized = []
    
    if len(like_tags_counted) < 50:  ## Checking to make sure there is 50 tags in the counted lists
        num_like_tags = len(like_tags_counted)
    else:
        num_like_tags = 50
        
    if len(dislike_tags_counted) < 50: 
        num_dislike_tags = len(dislike_tags_counted)
    else:
        num_dislike_tags = 50
    
    for tag in like_tags_counted[:num_like_tags]:
        try:  ### The tag might not exist in the vectorized dictionary
            tag_vector = vectorized_dict[tag]
            like_tags_vectorized.append(tag_vector)
        except Exception:
            pass
        
    for tag in dislike_tags_counted[:num_dislike_tags]:
        try:
            tag_vector = vectorized_dict[tag]
            dislike_tags_vectorized.append(tag_vector)
        except Exception:
            pass
        
    if len(like_tags_vectorized) < 20 or len(dislike_tags_vectorized) < 20:
        continue  ## If any of the two are not 20 tags in length, then the user will be skipped
    
    ## Obtaining the most liked and disliked tags, 20 tags each, and adding it to like_dislike_tags:
    like_dislike_dict = {}
    
    like_dislike_dict['userId'] = user
    
    for x in range(20):
        like_dislike_dict['LIKE_' + str(x)] = like_tags_vectorized[x]
        like_dislike_dict['DISLIKE_' + str(x)] = dislike_tags_vectorized[x]
    
    concat_df = pd.DataFrame(like_dislike_dict, index=[0])
    
    if len(like_dislike_tags) == 0:
        like_dislike_tags = concat_df
    
    else:
        like_dislike_tags = pd.concat([like_dislike_tags, concat_df], ignore_index= True)
    
    if (progress_counter_1 / len(user_list)) * 100 >= progress_counter_2:
        print((progress_counter_1 / len(user_list)) * 100, '% completed')
        print('Processing Time:', datetime.datetime.now() - start_time)
        print('Current Time:', datetime.datetime.now())
        progress_counter_2 += 5

like_dislike_tags = like_dislike_tags.astype('int64')
like_dislike_tags.to_csv('data/final/like_dislike_tags.csv', index = False)
with open('data/vectorized_dict.pkl', 'wb') as writer:
    # Saving the vectorized tag dictionary as a pickle file; THIS IS THE REFERENCE TO KNOW WHICH VECTOR IS ASSOCIATED TO THE TAG (string)
    pickle.dump(vectorized_dict, writer)
'''

In [None]:
# CELL 14

# Creating a movie tags profile to complement the user tags:
if os.path.exists('data/final/') != True:
    os.mkdir('data/final/')
    
movies_df_mod = pd.read_csv('data/movies_mod.csv')
movieId_list = list(movies_df_mod.movieId)
del movies_df_mod

movie_tags_df = pd.DataFrame()
index_counter = 0

progress_counter_1 = 0
progress_counter_2 = 5
start_time = datetime.datetime.now()
print('Start Time:', start_time)

with open('data/vectorized_dict.pkl', 'rb') as reader:
    vectorized_dict = pickle.load(reader)

for movie in movieId_list:
    progress_counter_1 += 1

    try:
        temp_df = pd.read_csv('data/movie_tags/{}.csv'.format(movie))  ## The tags are already in order of most counts and then alphabetically

        if len(temp_df) < 5: ## Skipping movies with fewer than 5 tags
            continue 

        vectorized_tag = []
        movie_tags = list(temp_df.tag)

        for tag in movie_tags:
            try:
                tag_vector = vectorized_dict[tag]
                vectorized_tag.append(tag_vector)
            except Exception:
                pass

        if len(vectorized_tag) < 5: ## Skipping movies with fewer than 5 common tags; the earlier length check is redundant but avoids unnecessary work
            continue 

        movie_tags_df.loc[index_counter, 'movieId'] = movie

        for x in range(5):
            movie_tags_df.loc[index_counter, 'TAG_' + str(x)] = vectorized_tag[x]
            
        index_counter += 1
            
    except Exception:
        pass
    
    if (progress_counter_1 / len(movieId_list)) * 100 >= progress_counter_2:
        print((progress_counter_1 / len(movieId_list)) * 100, '% completed')
        print('Processing Time:', datetime.datetime.now() - start_time)
        print('Current Time:', datetime.datetime.now())
        progress_counter_2 += 5

movie_tags_df.to_csv('data/final/movie_tags_df.csv', index = False)

Start Time: 2022-12-02 23:56:41.605465
5.001147052076164 % completed
Processing Time: 0:00:00.553252
Current Time: 2022-12-02 23:56:42.158768
10.002294104152329 % completed
Processing Time: 0:00:01.032032
Current Time: 2022-12-02 23:56:42.637615
15.003441156228492 % completed
Processing Time: 0:00:01.751781
Current Time: 2022-12-02 23:56:43.357338
20.016058729066298 % completed
Processing Time: 0:00:02.190087
Current Time: 2022-12-02 23:56:43.795576
25.005735260380824 % completed
Processing Time: 0:00:02.733589
Current Time: 2022-12-02 23:56:44.339098
30.018352833218625 % completed
Processing Time: 0:00:03.158096
Current Time: 2022-12-02 23:56:44.763595
35.03097040605643 % completed
Processing Time: 0:00:03.803827
Current Time: 2022-12-02 23:56:45.409331
40.009176416609314 % completed
Processing Time: 0:00:04.230041
Current Time: 2022-12-02 23:56:45.835547
45.01032346868548 % completed
Processing Time: 0:00:04.666073
Current Time: 2022-12-02 23:56:46.271610
50.03441156228493 % complete
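
The tag encoding used in cell 14 can be illustrated in isolation. The sketch below is a simplified, hypothetical helper (`encode_tags` is not part of the original notebook): it maps tag strings to their integer codes, ignores tags missing from the dictionary, and returns `None` when fewer than five known tags remain. The toy dictionary is built from a list so the codes are deterministic; the notebook builds it from a `set`, whose order is arbitrary.

```python
def encode_tags(movie_tags, vectorized_dict, n_tags=5):
    """Return the first n_tags integer codes of known tags, or None if too few."""
    # Tags that are not in the dictionary are skipped, as in the try/except above.
    codes = [vectorized_dict[tag] for tag in movie_tags if tag in vectorized_dict]
    if len(codes) < n_tags:
        return None  # The movie would be skipped
    return codes[:n_tags]


# Toy dictionary (assumed values, for illustration only):
toy_tags = ['noir', 'war', 'heist', 'classic', 'twist', 'dark']
toy_dict = {tag: code for code, tag in enumerate(toy_tags)}

enough = encode_tags(['noir', 'unknown', 'war', 'heist', 'classic', 'twist', 'dark'], toy_dict)
too_few = encode_tags(['noir', 'war'], toy_dict)
print(enough)   # [0, 1, 2, 3, 4]
print(too_few)  # None
```

Because the codes are arbitrary integers rather than ordered quantities, a tree-based model such as the random forest used later is a more natural consumer of them than a model that assumes numeric distance between codes.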

## Model Training:



In [None]:
# CELL 15

def stats(predictions, true, flex_range= 0.5):
    predictions_list = []
    round_list = np.arange(0.5, 5.5, 0.5)

    for value in predictions:
        value_ori = value
        compare_diff = 99999
        value_round = 0

        for rating in round_list:
            compare_value = abs(value_ori - rating)

            if compare_value < compare_diff: ## The prediction is rounded to the rating with the smallest absolute difference
                compare_diff = compare_value
                value_round = rating

        predictions_list.append(value_round)

    prediction_dict = {'PREDICTION': predictions_list, 'TRUE': list(true)}
    prediction_compare_df = pd.DataFrame(prediction_dict)

    rating_accuracy = 0
    like_dislike_tp = 0  ## "Positive" = Like
    like_dislike_tn = 0  ## "Negative" = Dislike
    like_dislike_fp = 0
    like_dislike_fn = 0
    prediction_length = len(prediction_compare_df)

    ## Making the accuracy definition more flexible by covering a larger range:
    rating_accuracy_flex = 0  ## If the prediction was within +/- 0.5 of the actual
    like_dislike_tp_flex = 0  ## If the prediction was 3.5+ (instead of 4+), then it is a like
    like_dislike_tn_flex = 0  ## If the prediction was 3.0-, then it is a dislike
    like_dislike_fp_flex = 0
    like_dislike_fn_flex = 0

    progress_counter = 0

    for index, row in prediction_compare_df.iterrows():
        predict_like = 0
        true_like = 0

        if row.PREDICTION >= 4:
            predict_like = 1

        if row.TRUE >= 4:
            true_like = 1

        if row.PREDICTION == row.TRUE:  ## This is if the exact predicted rating value is the same as the actual value
            rating_accuracy += 1

        if predict_like == true_like:
            if predict_like == 1:  ## true_like is also 1 here, because the outer condition already established that the two are equal
                like_dislike_tp += 1  ## True Positive

            else:
                like_dislike_tn += 1  ## True Negative

        else:
            if predict_like == 1:
                like_dislike_fp += 1  ## False Positive

            else:
                like_dislike_fn += 1 ## False Negative

        ####### FLEX starts:
        predict_like_flex = 0
        true_like_flex = 0

        if row.PREDICTION >= 3.5:
            predict_like_flex = 1

        if row.TRUE >= 3.5:
            true_like_flex = 1

        if row.PREDICTION >= (row.TRUE - flex_range) and row.PREDICTION <= (row.TRUE + flex_range):  
            rating_accuracy_flex += 1

        if predict_like_flex == true_like_flex:
            if predict_like_flex == 1:  
                like_dislike_tp_flex += 1 

            else:
                like_dislike_tn_flex += 1 

        else:
            if predict_like_flex == 1:
                like_dislike_fp_flex += 1 

            else:
                like_dislike_fn_flex += 1 

        progress_counter += 1
        if progress_counter % 100000 == 0:
            print(str(progress_counter / prediction_length * 100) + '%')

    rating_accuracy = rating_accuracy / prediction_length
    like_dislike_accuracy = (like_dislike_tp + like_dislike_tn) / prediction_length

    rating_accuracy_flex = rating_accuracy_flex / prediction_length
    like_dislike_accuracy_flex = (like_dislike_tp_flex + like_dislike_tn_flex) / prediction_length

    print('True Positive: {}, True Negative: {}, False Positive {}, False Negative {}'.format(like_dislike_tp, like_dislike_tn, like_dislike_fp, like_dislike_fn))
    print('Rating Accuracy: {}, Catagorical Accuracy (Like/Dislike) {}'.format(rating_accuracy, like_dislike_accuracy))
    print('------------------------------------------------------------------------------------------------------------')
    print('FLEX True Positive: {}, FLEX True Negative: {}, FLEX False Positive {}, FLEX False Negative {}'.format(like_dislike_tp_flex, like_dislike_tn_flex, like_dislike_fp_flex, like_dislike_fn_flex))
    print('FLEX Rating Accuracy: {}, FLEX Catagorical Accuracy (Like/Dislike) {}'.format(rating_accuracy_flex, like_dislike_accuracy_flex))
    return
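
For large test sets, the row-by-row loop in `stats` can be slow. The following is a minimal vectorized sketch of the same two ideas (rounding to the nearest half-star rating and counting like/dislike outcomes), assuming NumPy is available. The helper names `round_to_half` and `like_confusion` are hypothetical and not part of the original notebook. Ties are rounded down to mirror the strict `<` comparison in the loop above.

```python
import numpy as np


def round_to_half(predictions):
    """Round each prediction to the nearest rating in 0.5, 1.0, ..., 5.0."""
    p = np.asarray(predictions, dtype=float).ravel()
    rounded = np.ceil(p * 2 - 0.5) / 2  # Nearest multiple of 0.5, ties go down
    return np.clip(rounded, 0.5, 5.0)   # Keep results inside the rating scale


def like_confusion(predictions, true, threshold=4.0):
    """Return (TP, TN, FP, FN), where 'positive' means rating >= threshold."""
    pred_like = round_to_half(predictions) >= threshold
    true_like = np.asarray(true, dtype=float).ravel() >= threshold
    tp = int(np.sum(pred_like & true_like))
    tn = int(np.sum(~pred_like & ~true_like))
    fp = int(np.sum(pred_like & ~true_like))
    fn = int(np.sum(~pred_like & true_like))
    return tp, tn, fp, fn


print(round_to_half([3.74, 4.26, 0.1, 5.4, 3.25]).tolist())  # [3.5, 4.5, 0.5, 5.0, 3.0]
print(like_confusion([3.74, 4.26, 4.1, 2.0], [4.0, 4.5, 3.0, 2.5]))  # (1, 1, 1, 1)
```

Passing `threshold=3.5` gives the counts that `stats` reports under the FLEX label.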

In [None]:
# CELL 16

def merge_shuffle_split(split=0.5):
    movies_df_mod = pd.read_csv('data/movies_mod.csv')
    ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')
    
    # Since ratings_df_removed is the template for merging, it will be shuffled:
    ratings_df_removed = shuffle(ratings_df_removed)
    
    # Selecting a certain range from ratings_df_removed, train + test:
    selection_range = int(len(ratings_df_removed) * (split))
    ratings_df_removed = ratings_df_removed.iloc[: selection_range, :]
    
    # Merging begins:
    ratings_df_removed = ratings_df_removed.merge(movies_df_mod, how= 'left', on= 'movieId').dropna()
    del movies_df_mod


    # Changing the columns names to differentiate between the columns of total_user_like_df and total_user_dislike_df:
    total_user_like_df = pd.read_csv('data/total_user_like_df.csv')

    like_columns = list(total_user_like_df.columns)
    like_columns_modified = []

    for column in like_columns:
        if column == 'userId':
            like_columns_modified.append('userId')
        else:
            modify_column = 'user_like_' + column
            like_columns_modified.append(modify_column)

    total_user_like_df.columns = like_columns_modified

    ratings_df_removed = ratings_df_removed.merge(total_user_like_df, how= 'left', on= 'userId').dropna()
    del total_user_like_df
    

    total_user_dislike_df = pd.read_csv('data/total_user_dislike_df.csv')    

    dislike_columns = list(total_user_dislike_df.columns)
    dislike_columns_modified = []

    for column in dislike_columns:
        if column == 'userId':
            dislike_columns_modified.append('userId')
        else:
            modify_column = 'user_dislike_' + column
            dislike_columns_modified.append(modify_column)

    total_user_dislike_df.columns = dislike_columns_modified

    # Merging all the DFs to create one final DF:
    ratings_df_removed = ratings_df_removed.merge(total_user_dislike_df, how= 'left', on= 'userId').dropna()

    # Removing loaded DFs to save on RAM space:
    del total_user_dislike_df

    movie_tags_df = pd.read_csv('data/final/movie_tags_df.csv')
    ratings_df_removed = ratings_df_removed.merge(movie_tags_df, how= 'left', on= 'movieId').dropna()
    del movie_tags_df

    like_dislike_tags = (pd.read_csv('data/final/like_dislike_tags.csv')).astype('int64')
    ratings_df_removed = ratings_df_removed.merge(like_dislike_tags, how= 'left', on= 'userId').dropna()
    del like_dislike_tags
    
    like_columns_modified.remove('userId')
    dislike_columns_modified.remove('userId')
    like_columns.remove('userId')
    
    genres_like = ratings_df_removed.loc[:, like_columns_modified]
    genres_dislike = ratings_df_removed.loc[:, dislike_columns_modified]
    genres_movie = ratings_df_removed.loc[:, like_columns]
    
    # Generating the columns for the tag inputs for random forest:
    rf_columns = []
    for x in range(20): 
        rf_columns.append('LIKE_' + str(x))
        rf_columns.append('DISLIKE_' + str(x))
    for x in range(5):
        rf_columns.append('TAG_' + str(x))
        
    rf_input = ratings_df_removed.loc[:, rf_columns]
    
    ratings = list(ratings_df_removed.rating)
    
    del ratings_df_removed
    
    return genres_like, genres_dislike, genres_movie, rf_input, ratings
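
The rename-then-merge pattern in `merge_shuffle_split` can be tried on toy data. This is a small sketch with made-up frames, assuming pandas is available; `prefix_except_key` is a hypothetical helper that does what the column loops above do. It also shows that a left merge followed by `dropna()` keeps the same rows as an inner merge when the only missing values come from unmatched keys.

```python
import pandas as pd

# Toy data (made up for illustration):
ratings = pd.DataFrame({'userId': [1, 1, 2, 3],
                        'movieId': [10, 20, 10, 30],
                        'rating': [4.0, 2.5, 5.0, 3.0]})
user_like = pd.DataFrame({'userId': [1, 2], 'Comedy': [3, 0], 'War': [1, 4]})


def prefix_except_key(df, prefix, key='userId'):
    """Prefix every column name except the merge key."""
    return df.rename(columns={c: prefix + c for c in df.columns if c != key})


user_like_prefixed = prefix_except_key(user_like, 'user_like_')

# Pattern used in the notebook: left merge, then drop rows without a match.
left_then_drop = ratings.merge(user_like_prefixed, how='left', on='userId').dropna()

# Alternative: inner merge, which never creates the unmatched rows.
inner = ratings.merge(user_like_prefixed, how='inner', on='userId')

print(list(inner.columns))  # ['userId', 'movieId', 'rating', 'user_like_Comedy', 'user_like_War']
print(len(left_then_drop), len(inner))  # 3 3
```

One difference worth noting: the left merge introduces NaN for user 3, which converts the integer genre columns to floats, whereas the inner merge keeps their integer dtype.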

In [None]:
# CELL 17

# Using Deep Learning/TensorFlow as the first model:
## The goal of the model is to predict the rating the person would give to each movie
### There will be three inputs: user liked genres, user disliked genres, and movie genres
### The label will be the actual rating for the movie that the user gave it
user_liked_genres = keras.Input(shape= (20,))
user_disliked_genres = keras.Input(shape= (20,))
movie_genres = keras.Input(shape= (20,))

## Liked genres Input:
liked_input = keras.layers.Dense(20, activation= 'relu')(user_liked_genres)
liked_hidden_1 = keras.layers.Dense(50, activation= 'relu')(liked_input)
liked_hidden_2 = keras.layers.Dense(50, activation= 'relu')(liked_hidden_1)

## Disliked genres Input:
disliked_input = keras.layers.Dense(20, activation= 'relu')(user_disliked_genres)
disliked_hidden_1 = keras.layers.Dense(50, activation= 'relu')(disliked_input)
disliked_hidden_2 = keras.layers.Dense(50, activation= 'relu')(disliked_hidden_1)

## Movie genres Input:
movie_input = keras.layers.Dense(20, activation= 'relu')(movie_genres)
movie_hidden_1 = keras.layers.Dense(50, activation= 'relu')(movie_input)
movie_hidden_2 = keras.layers.Dense(50, activation= 'relu')(movie_hidden_1)

## Merging:
merged_model = keras.layers.concatenate([liked_hidden_2, disliked_hidden_2, movie_hidden_2])
merged_model_hidden_1 = keras.layers.Dense(150, activation= 'relu')(merged_model)
merged_model_hidden_2 = keras.layers.Dense(75, activation= 'relu')(merged_model_hidden_1)
merged_model_hidden_3 = keras.layers.Dense(50, activation= 'relu')(merged_model_hidden_2)

## Output Layer:
output_rating = keras.layers.Dense(1, activation= 'sigmoid')(merged_model_hidden_3)

## Molding the Model together:
genres_model = keras.Model(inputs= [user_liked_genres, user_disliked_genres, movie_genres], outputs= output_rating)

## Compiling the Model:
genres_model.compile(optimizer= keras.optimizers.Adam(learning_rate=0.001), loss= 'mean_squared_error')
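
Because the output layer uses a sigmoid, the network can only produce values between 0 and 1, so the ratings are divided by 5 before training and predictions are multiplied by 5 afterwards. The sketch below illustrates that scaling with NumPy only; the helper names and the example loss value are assumptions for illustration, not part of the original notebook.

```python
import math

import numpy as np

MAX_RATING = 5.0  # Highest rating on the MovieLens scale


def scale_ratings(ratings):
    """Map ratings from 0.5-5.0 to 0.1-1.0 so they fit the sigmoid's output range."""
    return np.asarray(ratings, dtype=float) / MAX_RATING


def unscale_predictions(predictions):
    """Map sigmoid outputs back to the rating scale."""
    return np.asarray(predictions, dtype=float) * MAX_RATING


def scaled_mse_to_rating_rmse(scaled_mse):
    """Convert a mean squared error on scaled ratings into an RMSE in rating units."""
    return math.sqrt(scaled_mse) * MAX_RATING


scaled = scale_ratings([0.5, 3.0, 5.0])
print(scaled.tolist())                       # [0.1, 0.6, 1.0]
print(unscale_predictions(scaled).tolist())  # [0.5, 3.0, 5.0]
print(scaled_mse_to_rating_rmse(0.04))       # about 1.0
```

The last helper is a convenient way to read the training loss: a scaled mean squared error of 0.04 corresponds to a typical error of roughly one rating point.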

In [None]:
# CELL 18

# Models Training:
if os.path.exists('models/') != True: 
    os.mkdir('models/')

# Generating the datasets:
genres_like, genres_dislike, genres_movie, rf_input, ratings = merge_shuffle_split() # Default split of the whole ratings.csv dataset is set to be 50%; already shuffled

train_split = 0.5 ## This would be about 25% of original ratings.csv dataset
split_index = int(len(ratings) * train_split)

genres_like_train = genres_like.iloc[: split_index, :]
genres_like_test = genres_like.iloc[split_index :, :]
del genres_like ## Attempting to save RAM space

genres_dislike_train = genres_dislike.iloc[: split_index, :]
genres_dislike_test = genres_dislike.iloc[split_index :, :]
del genres_dislike

genres_movie_train = genres_movie.iloc[: split_index, :]
genres_movie_test = genres_movie.iloc[split_index :, :]
del genres_movie

ratings_scaled = np.array(ratings) / 5
ratings_scaled_train = ratings_scaled[: split_index]
ratings_scaled_test = ratings_scaled[split_index :]

batch_size = 500
epochs = 10

def scheduler(epoch):
    if epoch < 5:
        return 0.001
    else:
        return 0.001 * math.exp(0.1 * (5 - epoch))

Learning_Rate_Callback = keras.callbacks.LearningRateScheduler(scheduler)

class Save_Progress_Callback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None): ## Saving and printing after each epoch
        lr = float(tf.keras.backend.get_value(self.model.optimizer.learning_rate))
        print("Epoch {}, loss is {:7.3f}, validation loss is {:7.3f}, learning rate is {}.".format(epoch, logs["loss"], logs["val_loss"], lr))
            
### NOTE: verbose must be set to 0, as the output is too long/large and will crash the notebook
genres_model.fit(x= [genres_like_train, genres_dislike_train, genres_movie_train], 
                  y= ratings_scaled_train, 
                  epochs= epochs, verbose= 0, batch_size= batch_size, validation_split= 0.1, shuffle= True,
                  callbacks=[Learning_Rate_Callback, Save_Progress_Callback()])

genres_model.save('models/genres_model.h5', overwrite= True, include_optimizer= True)

# _____________________________________________________________________________________________________
# Tag Model, Random Forest:
rf_input_train = rf_input.iloc[: split_index, :]
rf_input_test = rf_input.iloc[split_index :, :]

ratings_train = ratings[: split_index]
ratings_test = ratings[split_index :]

random_forest = RandomForestRegressor(n_estimators= 100, max_features= 'sqrt', verbose=2, random_state= 1, n_jobs= -1) ## The number of trees is set to 100 due to high RAM usage
random_forest.fit(rf_input_train, ratings_train)
print(random_forest.score(rf_input_test, ratings_test))

# Saving RF model:
with open('tags_model.sav', 'wb') as writer:
    pickle.dump(random_forest, writer)


genres_model_predictions = (genres_model.predict(x= [genres_like_test, genres_dislike_test, genres_movie_test])) * 5 # Rescale back to original values
random_forest_predict = random_forest.predict(rf_input_test)

print('genres Model Stats:')
stats(genres_model_predictions, ratings_scaled_test * 5)
print('Tags Model Stats:')
stats(random_forest_predict, ratings_test)

# Creating a input for the combine_model:
genres_model_predictions_list = []

for prediction in genres_model_predictions:
    genres_model_predictions_list.append(prediction[0])
    
merged_predictions = pd.DataFrame({'genres_model': genres_model_predictions_list, 
                                   'tag_model': list(random_forest_predict), 
                                   'genres_true': list(np.array(list(ratings_scaled_test)) * 5), 
                                   'tag_true': ratings_test}, 
                                  index= list(range(len(ratings_test))))

# Using a linear regression for predictions adjustment:
X = merged_predictions.loc[:, ['genres_model', 'tag_model']]
y = np.array(merged_predictions.loc[:, 'genres_true']) 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)

line_reg = LinearRegression(n_jobs= -1).fit(X_train, y_train)
print('Linear Regression R2:', line_reg.score(X_test, y_test))
line_reg_predictions = line_reg.predict(X_test)

# Saving linear regression model:
with open('combine_model.sav', 'wb') as writer:
    pickle.dump(line_reg, writer)

# Clipping the predictions that are out of bounds (valid ratings are 0.5 to 5):
line_reg_predictions_rounded = []

for prediction in line_reg_predictions:
    rounded = prediction
    if rounded > 5:
        rounded = 5
    elif rounded < 0.5:
        rounded = 0.5
    
    line_reg_predictions_rounded.append(rounded)
        

stats(line_reg_predictions_rounded, y_test)

Epoch 0, loss is   0.050, validation loss is   0.042, learning rate is 0.0010000000474974513.
Epoch 1, loss is   0.041, validation loss is   0.041, learning rate is 0.0010000000474974513.
Epoch 2, loss is   0.041, validation loss is   0.040, learning rate is 0.0010000000474974513.
Epoch 3, loss is   0.040, validation loss is   0.040, learning rate is 0.0010000000474974513.
Epoch 4, loss is   0.040, validation loss is   0.041, learning rate is 0.0010000000474974513.
Epoch 5, loss is   0.040, validation loss is   0.039, learning rate is 0.0010000000474974513.
Epoch 6, loss is   0.039, validation loss is   0.039, learning rate is 0.0009048373904079199.
Epoch 7, loss is   0.039, validation loss is   0.039, learning rate is 0.0008187307394109666.
Epoch 8, loss is   0.039, validation loss is   0.039, learning rate is 0.0007408182136714458.
Epoch 9, loss is   0.039, validation loss is   0.040, learning rate is 0.0006703200633637607.
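
The learning rates in this log follow directly from the `scheduler` function in cell 18: the rate stays at 0.001 through epoch 5 and then decays exponentially. A quick check using only the standard library, with the function copied unchanged:

```python
import math


def scheduler(epoch):
    if epoch < 5:
        return 0.001
    else:
        return 0.001 * math.exp(0.1 * (5 - epoch))


# Learning rate for each of the 10 epochs, rounded for display:
rates = [round(scheduler(epoch), 10) for epoch in range(10)]
for epoch, rate in enumerate(rates):
    print(epoch, rate)
```

The small differences from the logged values (for example 0.0010000000474974513 instead of 0.001) are consistent with the optimizer storing the rate as a 32-bit float.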


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


building tree 1 of 100
building tree 2 of 100
...
building tree 73 of 100
building tree 74

[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    8.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    1.1s finished


0.12249606221093867


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    1.0s finished


genres Model Stats:
True Positive: 3043, True Negative: 37200, False Positive 1205, False Negative 44225
Rating Accuracy: 0.11149370280018209, Catagorical Accuracy (Like/Dislike) 0.46972791894762644
------------------------------------------------------------------------------------------------------------
FLEX True Positive: 51157, FLEX True Negative: 3830, FLEX False Positive 27079, FLEX False Negative 3607
FLEX Rating Accuracy: 0.5980297176473335, FLEX Catagorical Accuracy (Like/Dislike) 0.6418241452966512
Tags Model Stats:
True Positive: 25004, True Negative: 28239, False Positive 10166, False Negative 22264
Rating Accuracy: 0.21632252868464977, Catagorical Accuracy (Like/Dislike) 0.6214676735961154
------------------------------------------------------------------------------------------------------------
FLEX True Positive: 48944, FLEX True Negative: 8079, FLEX False Positive 22830, FLEX False Negative 5820
FLEX Rating Accuracy: 0.5866142191822394, FLEX Catagorical Accuracy (Like
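
The third model combines the two predicted ratings with a linear regression and then clips the result to the valid rating range. The sketch below reproduces that idea on made-up numbers using NumPy's least-squares solver instead of scikit-learn's `LinearRegression`, so it is a stand-in for the notebook's approach rather than the same code. The function names `fit_linear_blend` and `blend` are hypothetical.

```python
import numpy as np


def fit_linear_blend(genres_pred, tags_pred, true_ratings):
    """Fit true ~ a * genres_pred + b * tags_pred + c by ordinary least squares."""
    X = np.column_stack([genres_pred, tags_pred, np.ones(len(genres_pred))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(true_ratings, dtype=float), rcond=None)
    return coef  # Array [a, b, c]


def blend(coef, genres_pred, tags_pred, low=0.5, high=5.0):
    """Combine two predictions and clip the result to the rating scale."""
    X = np.column_stack([genres_pred, tags_pred, np.ones(len(genres_pred))])
    return np.clip(X @ coef, low, high)


# Made-up predictions whose true rating is exactly 0.4 * genres + 0.6 * tags + 0.1:
genres = np.array([3.0, 4.0, 2.0, 5.0])
tags = np.array([4.0, 3.0, 5.0, 1.0])
true = 0.4 * genres + 0.6 * tags + 0.1

coef = fit_linear_blend(genres, tags, true)
print(np.round(coef, 6))

# Out-of-range combinations are clipped to 0.5 and 5.0:
extreme = blend(coef, np.array([0.0, 6.0]), np.array([0.0, 6.0]))
print(extreme)
```

Clipping matters because a linear model is unbounded: extrapolating from two high (or two low) inputs can land outside 0.5 to 5.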

In [None]:
# CELL 19

def top_10_recommendations(userId):
    # Loading all the datasets needed:
    movies_df_mod = pd.read_csv('data/movies_mod.csv')
    ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')

    
    # Gathering all the movies in the dataset:
    not_watched = list(movies_df_mod.movieId)
    
    # Selecting all movies that have not been seen by the user:
    ratings_df_removed = ratings_df_removed[ratings_df_removed.userId == userId]
    
    if len(ratings_df_removed) ==  0:  ## First check for valid users/users with enough information 
        return print('User {} does not have enough information. 1'.format(userId))
    
    ratings_df_removed = ratings_df_removed.merge(movies_df_mod, how= 'left', on= 'movieId').dropna()
    
    if len(ratings_df_removed) ==  0:  ## Second check
        return print('User {} does not have enough information. 2'.format(userId))
    
    watched = list(ratings_df_removed.movieId)
    del ratings_df_removed  ## Not all variables seem to be cleared when a function finishes; this ensures it is removed from RAM
    
    # Finding the movies the user has not watched:
    for movie in watched:
        if movie in not_watched:
            not_watched.remove(movie)
            
    # Loading in users' liked and disliked genres:
    total_user_like_df = pd.read_csv('data/total_user_like_df.csv')
    total_user_dislike_df = pd.read_csv('data/total_user_dislike_df.csv') 

    
    # Selecting from total_user_like_df and total_user_dislike_df to isolate only the userId input:
    total_user_like_df = total_user_like_df[total_user_like_df.userId == userId]
    
    if len(total_user_like_df) ==  0:  ## Third check
        return print('User {} does not have enough information. 3'.format(userId))
    
    total_user_dislike_df = total_user_dislike_df[total_user_dislike_df.userId == userId]
    if len(total_user_dislike_df) ==  0:  ## Fourth check
        return print('User {} does not have enough information. 4'.format(userId))
            
    # Changing the columns names to differentiate between the columns of total_user_like_df and total_user_dislike_df:

    like_columns = list(total_user_like_df.columns)
    like_columns_modified = []

    for column in like_columns:
        if column == 'userId':
            like_columns_modified.append('userId')
        else:
            modify_column = 'user_like_' + column
            like_columns_modified.append(modify_column)

    total_user_like_df.columns = like_columns_modified
    
    dislike_columns = list(total_user_dislike_df.columns)
    dislike_columns_modified = []

    for column in dislike_columns:
        if column == 'userId':
            dislike_columns_modified.append('userId')
        else:
            modify_column = 'user_dislike_' + column
            dislike_columns_modified.append(modify_column)

    total_user_dislike_df.columns = dislike_columns_modified

    # Loading in tags:
    movie_tags_df = pd.read_csv('data/final/movie_tags_df.csv')
    like_dislike_tags = (pd.read_csv('data/final/like_dislike_tags.csv')).astype('int64')
    
    # Selecting the movies that have not been seen from movie_tags_df and merging movies_df_mod and movie_tags_df:
    template_df = pd.DataFrame({'movieId': not_watched}, index= list(range(len(not_watched)))) ## Creating a template DF for merging
    template_df = template_df.merge(movies_df_mod, how= 'left', on= 'movieId').dropna()
    template_df = template_df.merge(movie_tags_df, how= 'left', on= 'movieId').dropna()
    del movie_tags_df
    
    # Selecting the user's tags:
    like_dislike_tags = like_dislike_tags[like_dislike_tags.userId == userId]
    if len(like_dislike_tags) ==  0:  ## Fifth check
        return print('User {} does not have enough information. 5'.format(userId))

    # Adding a userId column to the template DF so that merging is possible with total_user_like_df, total_user_dislike_df, and like_dislike_tags
    template_df['userId'] = userId
    template_df = template_df.merge(total_user_like_df, how= 'left', on= 'userId').dropna()
    del total_user_like_df
    template_df = template_df.merge(total_user_dislike_df, how= 'left', on= 'userId').dropna()
    del total_user_dislike_df
    template_df = template_df.merge(like_dislike_tags, how= 'left', on= 'userId').dropna()
    del like_dislike_tags
    
    like_columns_modified.remove('userId')
    dislike_columns_modified.remove('userId')
    like_columns.remove('userId')

    # Generating the columns for the tag inputs for random forest:
    rf_columns = []
    for x in range(20): 
        rf_columns.append('LIKE_' + str(x))
        rf_columns.append('DISLIKE_' + str(x))
    for x in range(5):
        rf_columns.append('TAG_' + str(x))
        
    # Selecting out the inputs from the template DF by column names:
    genres_like_input = template_df.loc[:, like_columns_modified]
    genres_dislike_input = template_df.loc[:, dislike_columns_modified]
    genres_movie_input = template_df.loc[:, like_columns]
    
    tags_input = template_df.loc[:, rf_columns]
    
    # Saving the movieId list:
    movieId_list = list(template_df.movieId)
    
    del template_df
    
    # Loading in all models:
    genres_model = tf.keras.models.load_model('models/genres_model.h5', compile=True)
    with open('tags_model.sav', 'rb') as reader:
        tags_model = pickle.load(reader)
    with open('combine_model.sav', 'rb') as reader:
        combine_model = pickle.load(reader)
    
    # Predicting with the genres model and tags model:
    genres_model_predictions = (genres_model.predict(x= [genres_like_input, genres_dislike_input, genres_movie_input])) * 5 ## Rescaling up; predicts a scaled and bound (sigmoid, 0-1) values
    tags_model_predictions = tags_model.predict(tags_input)
    
    # Extracting and changing the Keras predictions into a 1-D format (list):
    genres_model_predictions_list = []

    for prediction in genres_model_predictions:
        genres_model_predictions_list.append(prediction[0])
    
    # Using the predictions from the two models as the inputs for the combine_model
    # (the column names and order match those used when the linear regression was fitted in cell 18):
    combine_input = pd.DataFrame({'genres_model': genres_model_predictions_list, 
                                  'tag_model': tags_model_predictions}, 
                                 index= list(range(len(genres_model_predictions))))
    
    combine_model_predictions = combine_model.predict(combine_input)
    
    # Clipping the predictions that are out of bounds (valid ratings are 0.5 to 5):
    combine_model_predictions_rounded = []

    for prediction in combine_model_predictions:
        rounded = prediction
        if rounded > 5:
            rounded = 5
        elif rounded < 0.5:
            rounded = 0.5

        combine_model_predictions_rounded.append(rounded)
    
    # Adding all predictions into one DF:
    predictions_df = pd.DataFrame({'movieId': movieId_list,
                                   'genres_predictions': genres_model_predictions_list, 
                                  'tags_predictions': tags_model_predictions,
                                  'combine_predictions': combine_model_predictions_rounded}, 
                                 index= list(range(len(movieId_list))))
    
    # Sorting by combine_model_predictions_rounded and selecting the first 10 highest predicted ratings:
    best_movies_df = predictions_df.sort_values(by= ['combine_predictions'], ascending=False).iloc[:10, :]
    
    # Adding the movie titles and information to the top 10:
    best_movies_df = best_movies_df.merge(movies_df_mod, how= 'left', on= 'movieId').dropna()
    del movies_df_mod
    
    return predictions_df, best_movies_df
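
The core selection step of `top_10_recommendations` (drop the movies the user has already watched, then keep the highest predicted ratings) can be sketched on its own. This example assumes pandas is available; `top_n_unwatched` is a hypothetical helper, the prediction values are the first five rows shown for user 4 in the output of cell 21, and the watched list is made up.

```python
import pandas as pd


def top_n_unwatched(predictions_df, watched, n=10):
    """Return the n unwatched movies with the highest combined prediction."""
    unseen = predictions_df[~predictions_df['movieId'].isin(list(watched))]
    return unseen.nlargest(n, 'combine_predictions')


predictions_df = pd.DataFrame({
    'movieId': [1, 2, 5, 6, 7],
    'combine_predictions': [3.706718, 3.750420, 3.608618, 4.202621, 3.829804],
})

# Assume (for illustration) that the user has already watched movie 6:
top = top_n_unwatched(predictions_df, watched=[6], n=2)
print(list(top['movieId']))  # [7, 2]
```

Filtering with `isin` avoids the repeated `list.remove` calls used above, which scan the whole list for every watched movie.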
    
    

In [None]:
# CELL 20

predictions_df, best_movies_df = top_10_recommendations(4) 



[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


In [None]:
# CELL 21

predictions_df

Unnamed: 0,movieId,genres_predictions,tags_predictions,combine_predictions
0,1,3.643119,3.550,3.706718
1,2,3.489480,3.800,3.750420
2,5,3.478448,3.635,3.608618
3,6,3.501935,4.355,4.202621
4,7,3.505412,3.880,3.829804
...,...,...,...,...
2587,63113,3.253168,4.000,3.671451
2588,63876,3.271477,3.380,3.198663
2589,64285,3.034932,3.515,3.067965
2590,64620,3.184769,3.575,3.266035


In [None]:
# CELL 22

best_movies_df

Unnamed: 0.1,movieId,genres_predictions,tags_predictions,combine_predictions,Unnamed: 0,title,genres,YEAR,UPPER_STD,LOWER_STD,...,Comedy,Adventure,Crime,Documentary,Fantasy,Film-Noir,Horror,Musical,War,None
0,64906,4.078344,4.14,4.611391,10651,"Battle of Britain, The (Why We Fight, 4) (1943)",Documentary|War,1943,4.0,4.0,...,0,0,0,1,0,0,0,0,1,0
1,942,3.899832,4.295,4.554841,925,Laura (1944),Crime|Film-Noir|Mystery,1944,4.901032,3.672497,...,0,0,1,0,0,1,0,0,0,0
2,7769,4.031487,3.97,4.429635,7455,Legend of the Village Warriors (Bangrajan) (2000),Action|Drama|War,2000,4.138071,3.195262,...,0,0,0,0,0,0,0,0,1,0
3,3683,3.948617,4.06,4.417679,3594,Blood Simple (1984),Crime|Drama|Film-Noir,1984,4.906642,3.317358,...,0,0,1,0,0,1,0,0,0,0
4,1179,3.941441,4.045,4.398586,1156,"Grifters, The (1990)",Crime|Drama|Film-Noir,1990,4.682722,2.841869,...,0,0,1,0,0,1,0,0,0,0
5,26007,3.845189,4.16,4.392992,8413,"Unknown Soldier, The (Tuntematon sotilas) (1955)",Drama|War,1955,4.0,4.0,...,0,0,0,0,0,0,0,0,1,0
6,55052,3.870701,4.125,4.390895,10000,Atonement (2007),Drama|Romance|War,2007,4.637385,3.106205,...,0,0,0,0,0,0,0,0,1,0
7,3068,3.737947,4.29,4.38824,2983,"Verdict, The (1982)",Drama|Mystery,1982,4.484691,3.273204,...,0,0,0,0,0,0,0,0,0,0
8,44633,3.863849,4.13,4.387973,9472,"Devil and Daniel Johnston, The (2005)",Documentary,2005,4.0,3.5,...,0,0,0,1,0,0,0,0,0,0
9,32892,3.923865,4.05,4.38489,9047,My Name Is Ivan (a.k.a. Ivan's Childhood) (Iva...,Drama|War,1962,4.5,4.5,...,0,0,0,0,0,0,0,0,1,0


## Optional:
To loop through all users and save their predictions, remove the triple single quotes at the start and end of the next cell.

In [None]:
# CELL 23

'''
if os.path.exists('predictions/') != True: 
    os.mkdir('predictions/')

if os.path.exists('predictions/full_predictions') != True: 
    os.mkdir('predictions/full_predictions')
    
if os.path.exists('predictions/top_10') != True: 
    os.mkdir('predictions/top_10')

ratings_df_removed = pd.read_csv('data/ratings_df_last_liked_movie_removed.csv')
userId_list = list(set(ratings_df_removed.userId))
del ratings_df_removed

progress_counter_1 = 0
progress_counter_2 = 5

for user in userId_list:
    progress_counter_1 += 1
    
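    result = top_10_recommendations(user)
    if result is None:  ## top_10_recommendations returns None for users without enough information
        continue
    predictions_df, best_movies_df = result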
    
    predictions_df.to_csv('predictions/full_predictions/full_predictions - {}.csv'.format(user), index = False)
    
    best_movies_df.to_csv('predictions/top_10/top_10 - {}.csv'.format(user), index = False)
    
    if progress_counter_1 / len(userId_list) * 100 >= progress_counter_2:
        print(progress_counter_1 / len(userId_list) * 100, '% Completed')
        progress_counter_2 += 5

'''
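
The three `os.path.exists` / `os.mkdir` checks at the top of this cell can be collapsed with `os.makedirs(..., exist_ok=True)`, which creates missing parent directories and does nothing if the directory already exists. A minimal sketch, using a temporary directory so it can be run anywhere; `ensure_prediction_dirs` is a hypothetical helper and the folder names are taken from the cell above.

```python
import os
import tempfile


def ensure_prediction_dirs(base='predictions'):
    """Create the output folders used for saving predictions, if they are missing."""
    paths = [os.path.join(base, 'full_predictions'), os.path.join(base, 'top_10')]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # Also creates `base` when needed
    return paths


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'predictions')
    created = ensure_prediction_dirs(base)
    ensure_prediction_dirs(base)  # Safe to call again
    all_exist = all(os.path.isdir(path) for path in created)

print(all_exist)  # True
```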