# Content-based Movies Recommender System Using User Profile and Movie Genres

<center>
    <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQZnLZ3TVnHb50UzEDg38xyMFpgjUUSVzTiL_4b-XsTKLEm71zqQPxXSFMFVRXS4sP5ZA&usqp=CAU" width="600" alt="cognitiveclass.ai logo" />
</center>


The most common type of content-based recommendation system is to recommend items to users based on their profiles. The user's profile revolves around that user's preferences and tastes. It is shaped based on user ratings, including the number of times a user has clicked on different items or liked those items.

The recommendation process is based on the similarity between those items. The similarity or closeness of items is measured based on the similarity in the content of those items. When we say content, we're talking about things like the item's category, tag, genre, and so on. Essentially the features about an item.

### Table of Content
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#wrangle">Preliminary Wranglingt</a></li>
<li><a href="#webscrape">Webscrape image_url and extract url from data in links.csv</a></li>
<li><a href="#moviegenre">Genegrate movie genre dataframe</a></li>
<li><a href="#usergenre">Generate user's profile dataframe</a></li>
<li><a href="#recommend">Build Recommendation system</a></li>
<li><a href="#testing">Test Recommendation system</a></li>
</ul>

<a id='introduction'></a>
### Introduction

#### Objectives
The following are the main objectives of this notebook:
- Generate a user profile based on movie genres and rating
- Generate movie recommendations based on a user's profile and movie genres


#### Steps taken for movies recommender systems
- Webscrape image_url and extract url from data in links.csv
- Extract features from movies data (such as genres). 
- Based on the movie genres and users' ratings, 
- Build user profiles dataframe.
- Use the user profile feature vectors and movies genre feature vectors constructed, with several computational methods, such as a simple dot product, to compute or predict an interest score for each movie
- recommend those movies with the highest interest scores.

A user profile can be seen as the user feature vector that mathematically represents a user's learning interests.

#### About Dataset

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

<a id='wrangle'></a>
### Preliminary Wrangling

Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

##### Load movies dataframe `movies.csv`

In [2]:
movies_df = pd.read_csv('movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


##### Load user's ratings dataframe `ratings.csv`

In [3]:
users_df = pd.read_csv('ratings.csv')

# Drop timestamp column
users_df.drop(['timestamp'], axis=1, inplace=True)
users_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


##### Load links dataframe `links.csv`

In [4]:
# load 
link_df = pd.read_csv('links.csv')
link_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


Check information of the dataframes

In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [6]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


In [7]:
link_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


<a id='webscrape'></a>
### Webscrape image_url and extract url from data in links.csv

In [8]:
# Create empty columns with null values
link_df['url'] = np.nan
link_df['img_url'] = np.nan


for idx, tmdbId in tqdm(enumerate(link_df['tmdbId']), total = len(link_df['tmdbId'])):
    try:
        # Get url
        url = 'https://www.themoviedb.org/movie/' + str(tmdbId)
        link_df['url'][idx] = url

        # assign the response to a object
        response = requests.get(url)

        # Use BeautifulSoup() to create a BeautifulSoup object from a response text content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find image container
        obj = soup.find('div', 'image_content backdrop').img

        # Get image url
        image_url = 'https://www.themoviedb.org' + obj.get('data-src')

        # Link image url
        link_df['img_url'][idx] = image_url
    except AttributeError:
        image_url = 'https://www.firstcolonyfoundation.org/wp-content/uploads/2022/01/no-photo-available.jpeg'
        link_df['img_url'][idx] = image_url

100%|████████████████████████████████████████████████████████████████████████████| 9742/9742 [1:24:47<00:00,  1.91it/s]


##### Save link_df

In [10]:
# Save link_df
link_df.to_csv('link_df.csv', index=False)

<a id='moviegenre'></a>
### Genegrate movie genre dataframe

In [11]:
# Copy movies_df
movies_genres_df = movies_df.copy()

# Merge movies_genres_df and link_df
movies_genres_df = movies_genres_df.merge(link_df, on='movieId')
movies_genres_df

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,url,img_url
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0,https://www.themoviedb.org/movie/8844.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0,https://www.themoviedb.org/movie/15602.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0,https://www.themoviedb.org/movie/31357.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0,https://www.themoviedb.org/movie/11862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,5476944,432131.0,https://www.themoviedb.org/movie/432131.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,5914996,445030.0,https://www.themoviedb.org/movie/445030.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
9739,193585,Flint (2017),Drama,6397426,479308.0,https://www.themoviedb.org/movie/479308.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,8391976,483455.0,https://www.themoviedb.org/movie/483455.0,https://www.themoviedb.org/t/p/w300_and_h450_b...


In [12]:
# Create arrays of zeros
zeros = np.zeros(movies_genres_df.shape[0], dtype=int)

# for each row in movies_genres_df['genres'], 
# extract the genres to appropriate(individual columns)
for idx, row in enumerate(movies_genres_df['genres']):
    row = row.split('|')
    for genre in row:
        if genre not in movies_genres_df.columns:
            # Create new columns
            movies_genres_df[genre] = zeros
        
        # insert in dataframe
        movies_genres_df[genre][idx] = 1

In [13]:
# Drop genres column
movies_genres_df = movies_genres_df.drop(['genres', 'imdbId', 'tmdbId'], axis=1)
movies_genres_df.head()

Unnamed: 0,movieId,title,url,img_url,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),https://www.themoviedb.org/movie/8844.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),https://www.themoviedb.org/movie/15602.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),https://www.themoviedb.org/movie/31357.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),https://www.themoviedb.org/movie/11862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


##### Save movies_genres_df

In [14]:
movies_genres_df.to_csv('movies_genres.csv', index=False)

<a id='usergenre'></a>
### Generate user's profile dataframe

In [15]:
profiles_df = users_df.merge(movies_genres_df, on='movieId')
profiles_df

Unnamed: 0,userId,movieId,rating,title,url,img_url,Adventure,Animation,Children,Comedy,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,1,4.0,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,5,1,4.0,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,7,1,4.5,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,15,1,2.5,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,17,1,4.5,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100831,610,160341,2.5,Bloodmoon (1997),https://www.themoviedb.org/movie/30948.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100832,610,160527,4.5,Sympathy for the Underdog (1971),https://www.themoviedb.org/movie/90351.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100833,610,160836,3.0,Hazard (2005),https://www.themoviedb.org/movie/70193.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100834,610,163937,3.5,Blair Witch (2016),https://www.themoviedb.org/movie/351211.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [16]:
genre_columns = profiles_df.loc[:, 'Adventure':]
genre_columns.columns

Index(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Mystery', 'Sci-Fi',
       'War', 'Musical', 'Documentary', 'IMAX', 'Western', 'Film-Noir',
       '(no genres listed)'],
      dtype='object')

#### Generate Weighted matrix
> For each user ratings, we calculated the genre_rating i.e the genreate ofthe movie multiply by the user rating

In [17]:
for col in genre_columns.columns:
    profiles_df[col] = profiles_df['rating'] * profiles_df[col]

In [18]:
profiles_df.head(3)

Unnamed: 0,userId,movieId,rating,title,url,img_url,Adventure,Animation,Children,Comedy,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,1,4.0,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,4.0,4.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,1,4.0,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,4.0,4.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,1,4.5,Toy Story (1995),https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...,4.5,4.5,4.5,4.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# Drop unnecssary columns
profiles_df = profiles_df.drop(['movieId', 'rating', 'title'], axis=1)

# Groupby userId to get the weighted Genre Vectore
profiles_df = profiles_df.groupby(['userId'], as_index=False).sum()
profiles_df.head()

Unnamed: 0,userId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,373.0,136.0,191.0,355.0,202.0,112.0,308.0,389.0,196.0,...,59.0,75.0,169.0,99.0,103.0,0.0,0.0,30.0,5.0,0.0
1,2,12.5,0.0,0.0,28.0,0.0,4.5,66.0,43.5,38.0,...,3.0,8.0,15.5,4.5,0.0,13.0,15.0,3.5,0.0,0.0
2,3,30.0,2.0,2.5,9.0,13.5,2.5,12.0,50.0,1.0,...,37.5,5.0,63.0,2.5,0.5,0.0,0.0,0.0,0.0,0.0
3,4,106.0,24.0,38.0,365.0,70.0,196.0,418.0,83.0,103.0,...,17.0,80.0,34.0,25.0,64.0,8.0,3.0,38.0,16.0,0.0
4,5,26.0,26.0,37.0,52.0,29.0,34.0,95.0,28.0,46.0,...,3.0,4.0,5.0,10.0,22.0,0.0,11.0,6.0,0.0,0.0


#### Save profiles_df

In [20]:
profiles_df.to_csv('profiles.csv', index=False)

<a id='recommend'></a>
### Build recommendation system

In [21]:
# Load saved date
profiles_df = pd.read_csv('profiles.csv')
movies_genres_df = pd.read_csv('movies_genres.csv')
users_df = pd.read_csv('ratings.csv')

In [22]:
all_movies = set(movies_genres_df['movieId'].values)

def generate_score(user_id):
    # get user profile
    user_profile = profiles_df[profiles_df['userId'] == user_id]
    
    # Now let's get the test user vector by excluding the `user` column
    user_vector = user_profile.iloc[0, 1:].values
    
    # Get watched movies
    watched_movies = users_df[users_df['userId'] == user_id]['movieId'].to_list()
    watched_movies = set(watched_movies)
    
    # Get seen movies df
    seen_movies_genres = movies_genres_df[movies_genres_df['movieId'].isin(watched_movies)]
    seen_movies_genres = seen_movies_genres[['movieId', 'title']]
    
    # Reset Index
    seen_movies_genres = seen_movies_genres.reset_index(drop=True)
    
    
    # Get unseen movies
    unseen_movies = all_movies.difference(watched_movies)
    
    # Get genre vectors of unseen movies
    unseen_movies_genres = movies_genres_df[movies_genres_df['movieId'].isin(unseen_movies)]
    
    # Now let's get the movie matrix by excluding `movieId` and `title` columns:
    movie_matrix = unseen_movies_genres.iloc[:, 4:].values
    
    # user np.dot() to get the recommendation scores for each movie
    scores = np.dot(movie_matrix, user_vector)
    
    # Get unseen dataframe
    unseen_df = unseen_movies_genres[['movieId', 'title', 'img_url', 'url']]
    
    # load scores column to unseen dataframe
    unseen_df['score'] = pd.Series(scores)
    
    # Sort by score columns
    unseen_df = unseen_df.sort_values(by=['score'], ascending=False)
    # Reset index
    unseen_df = unseen_df.reset_index(drop=True)
    
    # Return dataframe
    return unseen_df[:10], seen_movies_genres

In [23]:
recommended_movies, seen_movies = generate_score(50)

In [24]:
recommended_movies

Unnamed: 0,movieId,title,img_url,url,score
0,74095,Wicked City (Yôjû toshi) (1987),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/21453.0,1518.5
1,4574,Blind Fury (1989),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/19124.0,1420.0
2,108689,"I, Frankenstein (2014)",https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/100241.0,1378.5
3,6687,My Boss's Daughter (2003),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/2830.0,1372.0
4,4798,Indiscreet (1958),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/22874.0,1347.0
5,66915,Rock-A-Doodle (1991),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/20421.0,1285.5
6,1797,Everest (1998),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/21736.0,1260.0
7,50802,Because I Said So (2007),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/1257.0,1254.0
8,119964,A Merry Friggin' Christmas (2014),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/286532.0,1253.0
9,27604,Suicide Club (Jisatsu saakuru) (2001),https://www.themoviedb.org/t/p/w300_and_h450_b...,https://www.themoviedb.org/movie/12720.0,1252.0


In [25]:
seen_movies.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
2,111,Taxi Driver (1976)
3,165,Die Hard: With a Vengeance (1995)
4,296,Pulp Fiction (1994)


###
#### ***------------------------------------------------THE END!!!----------------------------------------------***
# Author

## [Emuejevoke Eshemitan](https://www.linkedin.com/in/emuejevoke-eshemitan/)
