This dataset describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files 'links.csv', 'movies.csv', 'ratings.csv' and 'tags.csv'. More details about the contents and use of all these files follow:

- User Ids: Unique and anonymized.
- Movie Ids: Only movies with at least one rating are included.

In [342]:
# General-purpose libraries

import pandas as pd
import numpy as np                     # For mathematical calculations
import seaborn as sns                  # For data visualization
import matplotlib.pyplot as plt        # For plotting graphs
%matplotlib inline

Let's go through the data files one by one.

In [343]:
dfratings = pd.read_csv('ratings.csv')
dfratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Ratings are made on a 5-star scale, with half-star increments.
The timestamp records the moment when the rating was made (as a Unix timestamp in seconds). It is not used in the steps below, so it can be dropped.
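
Before dropping the column, it can help to see what the raw values mean. The sketch below assumes the timestamps are seconds since the Unix epoch (UTC), as in the MovieLens documentation, and uses a small hand-built frame (`ratings_sample`, a hypothetical name) with values from the preview above rather than the full file.

```python
import pandas as pd

# Toy frame built from the first two rows of the ratings preview.
ratings_sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
ratings_sample["rated_at"] = pd.to_datetime(ratings_sample["timestamp"], unit="s")
print(ratings_sample[["movieId", "rated_at"]])
```

Keeping a converted column like this would be one way to support time-windowed statistics later on.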

In [344]:
dfratings = dfratings.drop('timestamp', axis=1)

In [345]:
dfratings.isnull().sum()

userId     0
movieId    0
rating     0
dtype: int64

In [346]:
dfratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
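
The dataset description makes a few claims that can be checked directly against the ratings table (number of users, at least 20 ratings per user, half-star increments). The helper below is a minimal sketch; `summarize_ratings` is a hypothetical name, and the toy frame only illustrates the call.

```python
import pandas as pd

def summarize_ratings(ratings: pd.DataFrame) -> dict:
    """Return basic counts that can be compared with the dataset description."""
    per_user = ratings.groupby("userId").size()
    return {
        "n_ratings": len(ratings),
        "n_users": ratings["userId"].nunique(),
        "n_movies": ratings["movieId"].nunique(),
        "min_ratings_per_user": int(per_user.min()),
        "rating_values": sorted(ratings["rating"].unique().tolist()),
    }

# Toy data for illustration only; on the real file, pass dfratings instead.
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.5, 2.5],
})
summary = summarize_ratings(toy)
print(summary)
```

On the real data, `summarize_ratings(dfratings)` can be compared with the figures quoted at the top of the notebook.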


In [347]:
dftags = pd.read_csv('tags.csv')
dftags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


This table contains the free-text tags applied by users, keyed by userId and movieId. The meaning, value, and purpose of a particular tag is determined by each user.

In [348]:
dftags = dftags.drop('timestamp', axis=1)

In [349]:
dftags.isnull().sum()

userId     0
movieId    0
tag        0
dtype: int64

In [350]:
dftags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   userId   3683 non-null   int64 
 1   movieId  3683 non-null   int64 
 2   tag      3683 non-null   object
dtypes: int64(2), object(1)
memory usage: 86.4+ KB


In [351]:
dfmovies = pd.read_csv('movies.csv')
dfmovies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The title column includes both the title of the movie and the year of release. Movies are also classified into genres:

In [352]:
dfmovies['genres'].value_counts()

Drama                                    1053
Comedy                                    946
Comedy|Drama                              435
Comedy|Romance                            363
Drama|Romance                             349
                                         ... 
Animation|Children|Comedy|Horror            1
Drama|Mystery|Romance|Sci-Fi|Thriller       1
Animation|Comedy|Fantasy|Sci-Fi             1
Comedy|Crime|Drama|Western                  1
Action|Romance|Western                      1
Name: genres, Length: 951, dtype: int64

Each movie belongs to one or more of the following genres, in any combination:
* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western

In [353]:
dfmovies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [354]:
dfmovies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Let's separate the title and the year of release into two different columns:

In [355]:
# Split the 'title' column into two columns, 'title' and 'year'.
# Split on the last '(' so that titles containing their own parentheses stay intact.
df_temp = dfmovies['title'].str.rsplit('(', n=1, expand=True)
df_temp.columns = ['title', 'year']
dfmovies['title'] = df_temp['title'].str.strip()
dfmovies['year'] = df_temp['year']

In [356]:
# Remove the closing parenthesis ')' from the 'year' column
dfmovies['year'] = dfmovies['year'].str.replace(')', '', regex=False).str.strip()
dfmovies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
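
After the split, `year` is still a string column, and some titles may contain extra parentheses or no year at all. As an alternative sketch (not the approach used in the cells above), a regular expression anchored at the end of the title can extract the year and convert it to a nullable integer. The titles below are illustrative, and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative titles: the second has parentheses inside the title itself,
# the third has no year at all.
titles = pd.Series([
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Untitled Example",
])

# Capture a 4-digit year in parentheses at the very end of the string.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
# Convert to a nullable integer so that missing years stay missing (<NA>).
year = pd.to_numeric(year, errors="coerce").astype("Int64")
# Remove the trailing "(year)" from the title, leaving other parentheses alone.
clean_title = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(pd.DataFrame({"title": clean_title, "year": year}))
```

A numeric year makes later filtering or sorting by release date simpler than with strings.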


We separate the genres into individual columns: if a movie belongs to a specific genre, the column is encoded with 1, otherwise 0.

In [357]:
genre_names = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
               'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
               'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
# One indicator column per genre: 1 if the genre name appears in the 'genres' string, else 0
for genre in genre_names:
    dfmovies[genre] = dfmovies['genres'].apply(lambda x, g=genre: 1 if g in x else 0)
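
The same kind of 0/1 encoding can also be obtained in one call with `Series.str.get_dummies`, which splits on a separator and creates one column per distinct label. This is a sketch on a small hand-built frame (`movies_sample` is a hypothetical name). Unlike the substring test above, it matches whole labels and creates a column for every label present in the data, not only the listed genres.

```python
import pandas as pd

movies_sample = pd.DataFrame({
    "movieId": [1, 2, 5],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy",
    ],
})

# One indicator column per distinct genre found in the data.
genre_flags = movies_sample["genres"].str.get_dummies(sep="|")
encoded = pd.concat([movies_sample[["movieId"]], genre_flags], axis=1)
print(encoded)
```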

In [358]:
dfmovies=dfmovies.drop('genres', axis=1)
dfmovies.head()

Unnamed: 0,movieId,title,year,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995,0,1,1,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,1995,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [359]:
dflinks = pd.read_csv('links.csv')
dflinks.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


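imdbId is the movie's identifier on http://www.imdb.com. A link can be built as https://www.imdb.com/title/tt + imdbId zero-padded to 7 digits, for example https://www.imdb.com/title/tt0114709/ for Toy Story.

tmdbId is the movie's identifier on https://www.themoviedb.org, for example https://www.themoviedb.org/movie/862-toy-story for Toy Story.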
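
As a minimal sketch of how these links could be built from the two id columns: the helper names `imdb_url` and `tmdb_url` are hypothetical, the 7-digit zero padding of IMDb ids is an assumption based on the Toy Story example, and the TMDB helper assumes the numeric id alone is enough in the URL. Since tmdbId is stored as a float, it is cast to an integer and missing values are handled explicitly.

```python
import math
from typing import Optional

def imdb_url(imdb_id: int) -> str:
    """Build an IMDb title URL; ids are zero-padded to 7 digits (assumption)."""
    return f"https://www.imdb.com/title/tt{int(imdb_id):07d}/"

def tmdb_url(tmdb_id: Optional[float]) -> Optional[str]:
    """Build a TMDB URL, or return None when the id is missing (NaN or None)."""
    if tmdb_id is None or math.isnan(tmdb_id):
        return None
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"

print(imdb_url(114709))
print(tmdb_url(862.0))
print(tmdb_url(float("nan")))
```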

In [360]:
dflinks.isnull().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

In [361]:
dflinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


Let's build a dictionary of the most common tags for each movie

In [362]:
dftags['tag']=dftags['tag'].str.lower() ## Convert all tags to lowercase

In [363]:
pd.DataFrame(dftags['tag'].value_counts())

Unnamed: 0,tag
in netflix queue,131
atmospheric,41
thought-provoking,24
superhero,24
funny,24
...,...
hot actress,1
scifi masterpiece,1
casey affleck,1
indonesia,1


Let's keep only the tags that appear more than 10 times

In [364]:
tag_counts = dftags['tag'].value_counts()
to_keep = tag_counts[tag_counts > 10].index.tolist() ## These are the tags to keep for our model
to_keep

['in netflix queue',
 'atmospheric',
 'thought-provoking',
 'superhero',
 'funny',
 'surreal',
 'sci-fi',
 'disney',
 'religion',
 'quirky',
 'suspense',
 'dark comedy',
 'psychology',
 'twist ending',
 'visually appealing',
 'crime',
 'politics',
 'comedy',
 'music',
 'mental illness',
 'action',
 'dark',
 'time travel',
 'high school',
 'mindfuck',
 'aliens',
 'dreamlike',
 'space',
 'black comedy',
 'shakespeare',
 'journalism',
 'heist',
 'disturbing',
 'stephen king',
 'holocaust',
 'mafia',
 'emotional',
 'court',
 'anime',
 'christmas',
 'satire',
 'classic',
 'adultery',
 'adolescence',
 'ghosts',
 'animation',
 'comic book',
 'psychological',
 'boxing',
 'imdb top 250',
 'bittersweet']

In [365]:
# Let's use get_dummies to obtain a one-hot encoding of the tags, then apply to_keep
# to keep only the columns for tags that appear more than 10 times
dict_tags=pd.get_dummies(dftags['tag'], dtype=int)
dict_tags=dict_tags[to_keep]

In [366]:
# Let's add these columns to the original tags DataFrame
dftags=dftags.join(dict_tags)

In [367]:
dftags.head()

Unnamed: 0,userId,movieId,tag,in netflix queue,atmospheric,thought-provoking,superhero,funny,surreal,sci-fi,...,classic,adultery,adolescence,ghosts,animation,comic book,psychological,boxing,imdb top 250,bittersweet
0,2,60756,funny,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,60756,highly quotable,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,60756,will ferrell,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,89774,boxing story,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,89774,mma,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [368]:
## Then we group by movieId, so we have the count of each kept tag for a specific movie.
## The text columns ('userId' is an id, 'tag' is a string) are dropped before summing.
tagsbymovie = dftags.drop(['userId', 'tag'], axis=1).groupby('movieId').sum().reset_index()
tagsbymovie.head()

Unnamed: 0,movieId,in netflix queue,atmospheric,thought-provoking,superhero,funny,surreal,sci-fi,disney,religion,...,classic,adultery,adolescence,ghosts,animation,comic book,psychological,boxing,imdb top 250,bittersweet
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
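
The one-hot encoding followed by a group-by sum amounts to counting how often each kept tag was applied to each movie. As an alternative sketch, `pd.crosstab` produces the same kind of count table in one step. The data and the `tags_to_keep` list below are toy stand-ins for `dftags` and `to_keep`.

```python
import pandas as pd

tags_sample = pd.DataFrame({
    "userId": [2, 2, 7, 9],
    "movieId": [60756, 60756, 60756, 89774],
    "tag": ["funny", "will ferrell", "funny", "boxing"],
})
tags_to_keep = ["funny", "boxing"]  # stand-in for the to_keep list

# Keep only rows whose tag is in the selected list.
kept = tags_sample[tags_sample["tag"].isin(tags_to_keep)]
# Rows: movies, columns: tags, values: number of times the tag was applied.
counts = pd.crosstab(kept["movieId"], kept["tag"]).reset_index()
print(counts)
```

One difference to keep in mind: movies that only carry rare tags do not appear in this table at all, whereas the group-by version keeps them with all-zero rows. After a left merge and zero-fill, both lead to the same result.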


In [369]:
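# Left merge keeps movies without tags; fill only the tag columns with 0 so a missing year is not turned into 0
dfmovies = dfmovies.merge(tagsbymovie, on='movieId', how='left')
dfmovies[to_keep] = dfmovies[to_keep].fillna(0.0)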
dfmovies.head()

Unnamed: 0,movieId,title,year,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,classic,adultery,adolescence,ghosts,animation,comic book,psychological,boxing,imdb top 250,bittersweet
0,1,Toy Story,1995,0,1,1,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,1995,0,1,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,1995,0,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,1995,0,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,1995,0,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's merge the ratings DataFrame with our final movies/tags DataFrame:

In [372]:
df = dfratings.set_index('movieId').join(dfmovies.set_index('movieId')).reset_index()

In [373]:
df.head()

Unnamed: 0,movieId,userId,rating,title,year,Action,Adventure,Animation,Children,Comedy,...,classic,adultery,adolescence,ghosts,animation,comic book,psychological,boxing,imdb top 250,bittersweet
0,1,1,4.0,Toy Story,1995,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,5,4.0,Toy Story,1995,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,7,4.5,Toy Story,1995,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,15,2.5,Toy Story,1995,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,17,4.5,Toy Story,1995,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
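
When combining ratings with movie metadata, it can be useful to check that each rating matches exactly one movie. The sketch below uses `DataFrame.merge` with `validate` and `indicator` on toy frames (hypothetical names); it is a variation on the join above, not a replacement for it.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "movieId": [1, 1, 3],
    "userId": [1, 5, 1],
    "rating": [4.0, 4.0, 4.0],
})
movies_sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story", "Jumanji"],
})

merged = ratings_sample.merge(
    movies_sample,
    on="movieId",
    how="left",
    validate="many_to_one",  # raises MergeError if movieId is duplicated on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
# Ratings whose movieId has no row in the movies table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```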


## Non-personalized recommendations 
- Optional (Movies seen together) 
- Top 10 Movies (week, month, ...)
- Top 10 Movies per category (week, month, ...)
- Top 10 most popular (week, month, ...)
- Top 10 categories (week, month, ...)
- Top 10 tags (week, month, ...)
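
As a rough sketch of what a "Top 10 Movies" list could look like: rank movies by mean rating, require a minimum number of ratings, and optionally restrict to a time window. The function name `top_n_movies`, the `min_ratings` threshold and the toy data are all assumptions for illustration. The time window needs the `timestamp` column, which was dropped earlier in this notebook, so the ratings file would have to be reloaded or the column kept.

```python
import pandas as pd

def top_n_movies(ratings, n=10, min_ratings=2, since=None):
    """Rank movies by mean rating among those with at least `min_ratings` ratings.

    `since` (optional) is a pandas Timestamp; only ratings at or after it count.
    """
    if since is not None:
        rated_at = pd.to_datetime(ratings["timestamp"], unit="s")
        ratings = ratings[rated_at >= since]
    stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_ratings]
    return stats.sort_values(["mean", "count"], ascending=False).head(n)

# Toy ratings for illustration only.
toy = pd.DataFrame({
    "movieId":   [1, 1, 2, 2, 3],
    "rating":    [4.0, 5.0, 3.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
})
top = top_n_movies(toy, n=2)
recent = top_n_movies(toy, min_ratings=1, since=pd.Timestamp("2000-07-30 18:30:00"))
print(top)
print(recent)
```

The minimum-count filter keeps a movie with a single 5-star rating from outranking widely rated titles; the right threshold depends on the data.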