The task is to recommend movies to a user based on the movies they give as input.

Movies will be given by title.

In [1]:
%matplotlib inline
import gc
import difflib

import numpy as np
import pandas as pd

Start by inspecting our dataset

In [2]:
links_df = pd.read_csv('data/links.csv')
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [3]:
movies_df = pd.read_csv('data/movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings_df = pd.read_csv('data/ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
tags_df = pd.read_csv('data/tags.csv')
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Let's assume that person $A$ likes movies $M_0, M_1, ..., M_i$.

Then there is person $B$, who likes one or more movies from $M$; let's call them $M_j$.

This means that $A$ and $B$ share at least one movie that they both liked, so the other movies that either of them liked are likely to be liked by both $A$ and $B$.
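The assumption can be sketched on a toy table before trying it on the real data. This is only an illustration: the ratings below are made up, and `LIKE_THRESHOLD` and `co_liked_movies` are hypothetical names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy ratings table with the same columns as ratings.csv (values are made up).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3],
    'movieId': [10, 20, 30, 10, 20, 10, 40],
    'rating':  [5.0, 4.5, 2.0, 4.0, 5.0, 1.0, 5.0],
})

LIKE_THRESHOLD = 3.5  # assumed cut-off for "liked"; not taken from the dataset


def co_liked_movies(ratings, movie_id, threshold=LIKE_THRESHOLD):
    """Count how many users who liked `movie_id` also liked each other movie."""
    liked = ratings[ratings['rating'] >= threshold]
    # Users who liked the input movie (person B in the text).
    fans = liked.loc[liked['movieId'] == movie_id, 'userId']
    # Every other movie those users liked.
    others = liked[liked['userId'].isin(fans) & (liked['movieId'] != movie_id)]
    return others['movieId'].value_counts()


counts = co_liked_movies(toy_ratings, movie_id=10)
print(counts)
```

Counting over every user who liked the input movie, rather than looking at one user, is a design choice of this sketch; user 3 rated movie 10 poorly, so their favourite (movie 40) is not counted.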

------------------

In [6]:
input_movies = ['Toy Story (1995)']

Data about a movie

In [7]:
movie = movies_df[movies_df['title'] == input_movies[0]]
movie

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
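The exact comparison above returns an empty frame as soon as the input title differs slightly from the stored one. A minimal sketch of a more forgiving lookup, assuming the `difflib` import at the top is intended for this purpose; `resolve_title` and the sample title list are hypothetical.

```python
import difflib

# Made-up sample of titles in the "Title (Year)" format used by movies.csv.
known_titles = ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)']


def resolve_title(query, titles, cutoff=0.6):
    """Return the closest known title, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A title typed without the year still resolves to the stored title.
print(resolve_title('Toy Story', known_titles))
```

The resolved title could then be used in place of `input_movies[0]` in the lookup; the `cutoff` of 0.6 is simply the default of `difflib.get_close_matches` and may need tuning.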


Since we know the movie ID, we can check who rated and liked this movie

In [8]:
movie_rating = ratings_df[ratings_df['movieId'] == movie['movieId'].values[0]]
movie_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
516,5,1,4.0,847434962
874,7,1,4.5,1106635946
1434,15,1,2.5,1510577970
1667,17,1,4.5,1305696483


Now that we also know a user ID, we can check this user's taste

In [9]:
ratings_of_found_user = ratings_df[ratings_df['userId'] == movie_rating['userId'].values[0]]
ratings_of_found_user.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [10]:
# If the found user liked the input movie, their highest-rated movies are the
# candidates; otherwise their lowest-rated ones are.
user_liked_movie = movie_rating['rating'].values[0] > 2.5
movies_possibly_liked = ratings_of_found_user.sort_values(by='rating', ascending=not user_liked_movie)
    
movies_possibly_liked.head()

Unnamed: 0,userId,movieId,rating,timestamp
231,1,5060,5.0,964984002
185,1,2872,5.0,964981680
89,1,1291,5.0,964981909
90,1,1298,5.0,964984086
190,1,2948,5.0,964982191


Now check if we were correct

In [11]:
movies_df[movies_df['movieId'] == int(movies_possibly_liked.iloc[0]['movieId'])]

Unnamed: 0,movieId,title,genres
3673,5060,M*A*S*H (a.k.a. MASH) (1970),Comedy|Drama|War


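This clearly doesn't work: the top suggestion for Toy Story is M*A*S*H, a 1970 war comedy-drama picked from a single user's ratings.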

-----------------------

In [12]:
# Join tags, ratings, movies and links on movieId, then free the source frames.
df = tags_df.loc[:, ['tag', 'movieId']].merge(ratings_df.merge(movies_df.merge(links_df, on='movieId'), on='movieId'), on='movieId')
del tags_df, ratings_df, movies_df, links_df

In [13]:
df.head()

Unnamed: 0,tag,movieId,userId,rating,timestamp,title,genres,imdbId,tmdbId
0,funny,60756,2,5.0,1445714980,Step Brothers (2008),Comedy,838283,12133.0
1,funny,60756,18,3.0,1455749449,Step Brothers (2008),Comedy,838283,12133.0
2,funny,60756,62,3.5,1528934376,Step Brothers (2008),Comedy,838283,12133.0
3,funny,60756,68,2.5,1269123243,Step Brothers (2008),Comedy,838283,12133.0
4,funny,60756,73,4.5,1464196221,Step Brothers (2008),Comedy,838283,12133.0


In [14]:
print(len(np.unique(df['movieId'].values)), 'movie')
print(len(np.unique(df['genres'].values)), 'genre')
print(len(np.unique(df['tag'].values)), 'tag')
print(len(np.unique(df['userId'].values)), 'user')

1554 movie
370 genre
1584 tag
610 user


Preprocess text values

In [15]:
from sklearn.preprocessing import LabelEncoder

title_encoder = LabelEncoder()
genre_encoder = LabelEncoder()
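The encoders are only instantiated in the cell above. A minimal sketch of how they might be applied, assuming they are meant for the `title` and `genres` columns; the `toy` frame is made up and stands in for `df`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the merged frame; only the two text columns are shown.
toy = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Toy Story (1995)'],
    'genres': ['Adventure|Animation', 'Adventure|Children', 'Adventure|Animation'],
})

title_encoder = LabelEncoder()
genre_encoder = LabelEncoder()

# fit_transform learns the distinct strings and replaces each with an integer code.
toy['title'] = title_encoder.fit_transform(toy['title'])
toy['genres'] = genre_encoder.fit_transform(toy['genres'])

# The fitted encoder can map the codes back to the original strings.
decoded_titles = title_encoder.inverse_transform(toy['title'])
print(toy)
print(list(decoded_titles))
```

Keeping one encoder per column makes it possible to decode predictions back to readable titles and genres later. The `tag` column is also text and would need the same treatment under this assumption.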

Choose train and test set

In [16]:
# Separate the target (rating) from the remaining feature columns.
ratings = df['rating']
del df['rating']
y = ratings.values
X = df.values

In [17]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2)