# Movie Recommendation System
This is a group project for ISSS 610 Applied Machine Learning.  

Group Members: GENG Minghong, HUANG Lu, Manish Agarwal, TAO Xinru  

Project Timeline:

Time|Jobs
-----|-----
2020-03-21|Data Cleaning
2020-03-23|Exploring
**2020-03-30**|**Presentation**

# 0. Introduction 
The data of this project can be accessed here:  
https://smu.sharepoint.com/teams/ISSS610AppliedMachineLearning/Shared%20Documents/Forms/AllItems.aspx?viewid=1b247e4e%2D6a34%2D4c70%2D8abc%2D06cee2cccbea  

In [1]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# 1. Processing Data
## 1.1 Load Packages and Import the Data
The packages used in this project are:
- Pandas
- NumPy
- ...
- (To be filled)

In [2]:
import pandas as pd
import numpy as np

links= pd.read_csv('data/themoviesdataset/links.csv')


## 1.2 Data Cleaning 

In [3]:
from sklearn.metrics.pairwise import (linear_kernel, 
                                      cosine_similarity)
from ast import literal_eval
import datetime


### 1.2.1 keywords.csv

In [4]:
keywords= pd.read_csv(r'data/themoviesdataset/keywords.csv')
keywords['keywordsTr'] = keywords['keywords'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
keywordsTr=keywords.drop(['keywords'],axis=1)

In [5]:
keywordsTr.rename(columns = {'id':'tmdb_id'},inplace=True)
keywordsTr['tmdb_id'] = keywordsTr['tmdb_id'].apply(str)

keywordsTr.head()

Unnamed: 0,tmdb_id,keywordsTr
0,862,"[jealousy, toy, boy, friendship, friends, riva..."
1,8844,"[board game, disappearance, based on children'..."
2,15602,"[fishing, best friend, duringcreditsstinger, o..."
3,31357,"[based on novel, interracial relationship, sin..."
4,11862,"[baby, midlife crisis, confidence, aging, daug..."
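The parsing step above can be tried on a few toy rows. This is a minimal sketch, assuming the `keywords` column stores a stringified list of dicts with a `name` key, as the output suggests. The toy ids and names and the helper `extract_names` are made up for illustration.

```python
from ast import literal_eval

import pandas as pd

# Toy rows mimicking the stringified list-of-dicts layout of the `keywords` column
# (ids and names are invented for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "keywords": [
        "[{'id': 10, 'name': 'toy'}, {'id': 11, 'name': 'friendship'}]",
        "[]",
        None,  # missing value
    ],
})


def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = literal_eval(raw)
    return [item["name"] for item in parsed] if isinstance(parsed, list) else []


# fillna('[]') turns missing entries into an empty list before parsing
toy["keywordsTr"] = toy["keywords"].fillna("[]").apply(extract_names)
print(toy["keywordsTr"].tolist())
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing these strings.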


In [6]:
keywordsTr.to_csv('keywordsTr.csv')

### 1.2.2  rating.csv
We want to convert the Unix timestamp into a readable datetime format.  
Reference: https://blog.csdn.net/weixin_43790560/article/details/88412005

In [7]:
rating= pd.read_csv(r'data/themoviesdataset/ratings.csv')
#rating['timestampTr']=pd.to_datetime(rating['timestamp'],unit='s')
ratingTr=rating[['userId','movieId','rating','timestamp']]

In [8]:
ratingTr.to_csv('ratingTr.csv')

In [9]:
ratingTr.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


**Next, we need to decide which granularity to keep in `timestampTr`: day, week, or month?**
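To compare the candidate granularities, the sketch below converts two of the timestamps shown above and derives day, week, and month columns. The column names `day`, `week`, and `month` are assumptions for illustration, not part of the project.

```python
import pandas as pd

# Two timestamps taken from the ratingTr.head() output above
toy_ratings = pd.DataFrame({"timestamp": [1425941529, 1425942435]})

# Unix seconds -> pandas datetime
toy_ratings["timestampTr"] = pd.to_datetime(toy_ratings["timestamp"], unit="s")

# Candidate granularities (column names are illustrative)
toy_ratings["day"] = toy_ratings["timestampTr"].dt.floor("D")
toy_ratings["week"] = toy_ratings["timestampTr"].dt.to_period("W")
toy_ratings["month"] = toy_ratings["timestampTr"].dt.to_period("M")

print(toy_ratings[["timestampTr", "day", "week", "month"]])
```

Coarser periods produce fewer distinct values, which may help if the timestamp is later used as a grouping key.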

### 1.2.3 credits.csv

In [10]:
import pandas as pd
import numpy as np
from ast import literal_eval

In [11]:
credits = pd.read_csv(r'data/themoviesdataset/credits.csv')

In [12]:
#Rename the id column and change the data type
credits.rename(columns = {'id':'tmdb_id'},inplace=True)
credits['tmdb_id'] = credits['tmdb_id'].apply(str)

In [13]:
# Parse the stringified features into their corresponding python objects
credits['cast'] = credits['cast'].apply(literal_eval)
credits['crew'] = credits['crew'].apply(literal_eval)

In [14]:
# Define a function to get the director name
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Apply the function to get director name
credits['director'] = credits['crew'].apply(get_director)
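A quick check of `get_director` on toy crew lists. This is a sketch only, and the names are invented. The function returns the first crew member whose `job` is `'Director'`, and `np.nan` when there is none.

```python
import numpy as np
import pandas as pd


def get_director(x):
    # Same logic as the cell above: first crew member with job == 'Director'
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


# Invented crew lists for illustration
crew_with_director = [
    {'job': 'Screenplay', 'name': 'A. Writer'},
    {'job': 'Director', 'name': 'Jane Doe'},
    {'job': 'Director', 'name': 'Second Director'},  # ignored: only the first is returned
]
crew_without_director = [{'job': 'Editor', 'name': 'E. Ditor'}]

print(get_director(crew_with_director))
print(pd.isna(get_director(crew_without_director)))
```

Movies with co-directors keep only the first listed director under this approach.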

In [15]:
# Extract the cast names, then keep the first 3 (or the whole list if it has fewer than 3)
credits['cast'] = credits['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
credits['cast'] = credits['cast'].apply(lambda x: x[:3])

# Drop the crew column
credits = credits.drop(['crew'], axis=1)

In [16]:
creditsTr = credits
creditsTr.head(2)

Unnamed: 0,cast,tmdb_id,director
0,"[Tom Hanks, Tim Allen, Don Rickles]",862,John Lasseter
1,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",8844,Joe Johnston


### 1.2.4 movies_metadata.csv

In [17]:
movies_metadata= pd.read_csv(r'data/themoviesdataset/movies_metadata.csv',low_memory=False)

The original dataset provides 24 columns. We only need 12 of them.

In [18]:
# .copy() makes an independent DataFrame, so later column assignments do not act on a slice
movies_metadataTr = movies_metadata[['genres','id','imdb_id','overview','popularity','release_date',
                                     'revenue','runtime','status', 'title','vote_average','vote_count']].copy()
print(movies_metadataTr.shape)
movies_metadataTr.head(2)

(45466, 12)


Unnamed: 0,genres,id,imdb_id,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,Toy Story,7.7,5415.0
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Jumanji,6.9,2413.0


In [19]:
movies_metadataTr['genres'] = movies_metadataTr['genres'].apply(literal_eval)
# Get the genres
movies_metadataTr['genres'] = movies_metadataTr['genres'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [20]:
# Rename the column id to tmdb_id
movies_metadataTr.rename(columns = {'id':'tmdb_id'},inplace=True)

In [21]:
movies_metadataTr.head(2)

Unnamed: 0,genres,tmdb_id,imdb_id,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count
0,"[Animation, Comedy, Family]",862,tt0114709,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,Toy Story,7.7,5415.0
1,"[Adventure, Fantasy, Family]",8844,tt0113497,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Jumanji,6.9,2413.0


## 1.3 Merge the data

### 1.3.1 Movie

Merge `movies_metadataTr`, `creditsTr`, and `keywordsTr` on `tmdb_id`.

In [22]:
# Step 1: left-merge movies_metadataTr with creditsTr on tmdb_id
metadataTr_creditsTr = pd.merge(movies_metadataTr, 
                   creditsTr, 
                   how = 'left', 
                   on = ['tmdb_id'])

In [23]:
metadataTr_creditsTr_keywordsTr = pd.merge(metadataTr_creditsTr, 
                                           keywordsTr, 
                                           how = 'left', 
                                           on = ['tmdb_id'])

In [24]:
metadataTr_creditsTr_keywordsTr.head(2)

Unnamed: 0,genres,tmdb_id,imdb_id,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,cast,director,keywordsTr
0,"[Animation, Comedy, Family]",862,tt0114709,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,Toy Story,7.7,5415.0,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy, friendship, friends, riva..."
1,"[Adventure, Fantasy, Family]",8844,tt0113497,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Jumanji,6.9,2413.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'..."
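With `how='left'`, movies that have no match in the right table are kept, with missing values in the right-hand columns. A minimal sketch of how to count such rows using the `indicator` parameter of `pd.merge`. The toy frames and the id `'999'` are made up for illustration.

```python
import pandas as pd

# Toy tables keyed by tmdb_id as strings, matching the dtype used in the notebook
left = pd.DataFrame({
    "tmdb_id": ["862", "8844", "999"],  # '999' is a hypothetical id with no credits
    "title": ["Toy Story", "Jumanji", "Hypothetical Movie"],
})
right = pd.DataFrame({
    "tmdb_id": ["862", "8844"],
    "director": ["John Lasseter", "Joe Johnston"],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
checked = pd.merge(left, right, how="left", on="tmdb_id", indicator=True)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

The keys on both sides need the same dtype, which is why `tmdb_id` is cast to `str` in the cleaning cells.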


In [25]:
# Select rows where the column status == 'Released'
data_movies = metadataTr_creditsTr_keywordsTr[metadataTr_creditsTr_keywordsTr['status'] == 'Released']

In [26]:
# Fill missing values with a single space
data_movies = data_movies.fillna(' ')

In [27]:
data_movies.head(2)

Unnamed: 0,genres,tmdb_id,imdb_id,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,cast,director,keywordsTr
0,"[Animation, Comedy, Family]",862,tt0114709,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81,Released,Toy Story,7.7,5415.0,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy, friendship, friends, riva..."
1,"[Adventure, Fantasy, Family]",8844,tt0113497,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104,Released,Jumanji,6.9,2413.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'..."


Genre, overview, Keywords

### 1.3.2 User Rating Data

In [66]:
tags = pd.read_csv('data/ml-25m/tags.csv')

tags.shape

In [74]:
data_user = pd.merge(ratingTr, tags, how = 'left', on = ['userId','movieId','timestamp'])

In [86]:
data_user['tag'].value_counts()

Series([], Name: tag, dtype: int64)
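The empty result above suggests that no tag row matched a rating row on all three keys. One possible reason, not confirmed here, is that `tags.csv` comes from a different dataset folder (`data/ml-25m`) than the ratings, so ids and timestamps need not line up. A sketch for diagnosing this on toy data by counting inner-join matches for progressively smaller key sets. All values below are invented.

```python
import pandas as pd

# Invented toy data: timestamps never coincide between the two tables
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [110, 147, 110],
    "timestamp": [1425941529, 1425942435, 1425941600],
})
toy_tags = pd.DataFrame({
    "userId": [1, 3],
    "movieId": [110, 147],
    "timestamp": [1500000000, 1500000100],
    "tag": ["epic", "slow"],
})

key_sets = [
    ["userId", "movieId", "timestamp"],
    ["userId", "movieId"],
    ["movieId"],
]

# Number of matching row pairs for each choice of join keys
matches = {
    tuple(keys): len(pd.merge(toy_ratings, toy_tags, how="inner", on=keys))
    for keys in key_sets
}
print(matches)
```

If dropping `timestamp` from the keys yields matches on the real data, the timestamp is the likely culprit. If not, the user or movie ids themselves may not correspond.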

# 2. Content-based Recommender

### Part II: Credits, Genres and Keywords Based Recommender

In [28]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
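A short usage sketch of `clean_data`. Removing the spaces turns each name into a single token, so that, for example, two different people named "Tom" are not treated as sharing a word by the vectorizer.

```python
def clean_data(x):
    # Same logic as the cell above
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''


print(clean_data(["Tom Hanks", "Tim Allen"]))  # list input
print(clean_data("John Lasseter"))             # string input
print(repr(clean_data(float("nan"))))          # anything else becomes ''
```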

In [29]:
# Apply clean_data function to your features.
features = ['cast', 'keywordsTr', 'director', 'genres']

for feature in features:
    data_movies[feature] = data_movies[feature].apply(clean_data)

In [30]:
# Create a string that contains all the metadata which feeds to vectorizer
def create_pool(x):
    return ' '.join(x['keywordsTr']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
data_movies['pool'] = data_movies.apply(create_pool, axis=1)

In [31]:
data_movies['pool'].head(5)

0    jealousy toy boy friendship friends rivalry bo...
1    boardgame disappearance basedonchildren'sbook ...
2    fishing bestfriend duringcreditsstinger oldmen...
3    basedonnovel interracialrelationship singlemot...
4    baby midlifecrisis confidence aging daughter m...
Name: pool, dtype: object

In [32]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(data_movies['pool'])
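To see what the count matrix looks like, here is a sketch on two toy pool strings. The strings are illustrative, not taken from the dataset. Each row is a movie, each column a token, and each cell the number of times the token occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative pool strings (keywords + cast + director, already cleaned)
toy_pool = [
    "toy friendship tomhanks johnlasseter",
    "toy rivalry robinwilliams",
]

toy_count = CountVectorizer(stop_words='english')
toy_matrix = toy_count.fit_transform(toy_pool)

# Columns are ordered alphabetically by token
vocab = sorted(toy_count.vocabulary_)
print(vocab)
print(toy_matrix.toarray())
```

The result is a sparse matrix, and `toarray()` is used here only because the example is tiny.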

In [33]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
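Cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal sketch comparing a manual computation with `cosine_similarity` on two toy count vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors sharing one of their two tokens
a = np.array([1, 1, 0])
b = np.array([1, 0, 1])

# Manual formula: (a . b) / (|a| * |b|)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D inputs (one row per sample)
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Note that the full matrix is n × n and dense. For roughly 45,000 movies, that is on the order of 16 GB as float64, so computing similarities for one movie at a time may be more practical on a small machine.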

In [34]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(data_movies.index, index=data_movies['title']).drop_duplicates()

In [35]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies (position 0 is normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return data_movies['title'].iloc[movie_indices]

In [36]:
# Reset index of our main DataFrame and construct reverse mapping as before
data_movies = data_movies.reset_index()
indices = pd.Series(data_movies.index, index = data_movies['title'])
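When several movies share a title, looking up the title in such a Series returns more than one position. The sketch below uses a hypothetical variant, `recommend`, that falls back to the first match and excludes the query movie explicitly instead of assuming it ranks first. The toy titles and similarity values are invented.

```python
import numpy as np
import pandas as pd

# Toy catalogue in which the title "A" appears twice
toy_movies = pd.DataFrame({"title": ["A", "B", "A", "C"]})
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
])
toy_indices = pd.Series(toy_movies.index, index=toy_movies["title"])


def recommend(title, sim, top_n=2):
    """Hypothetical variant of get_recommendations that tolerates duplicate titles."""
    idx = toy_indices[title]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]  # assumption: use the first movie with this title
    order = np.argsort(-sim[idx])  # positions sorted by descending similarity
    order = [i for i in order if i != idx][:top_n]  # drop the query movie itself
    return toy_movies["title"].iloc[order].tolist()


print(recommend("A", toy_sim))
print(recommend("C", toy_sim))
```

Picking the first match is an arbitrary choice. Disambiguating by release year or `tmdb_id` would be an alternative.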

In [37]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

12485      The Dark Knight
10122        Batman Begins
18757            Last Exit
34178                 Rege
9226                Shiner
39869    Bullet to Beijing
8950        State of Grace
11417          Harsh Times
14934          Harry Brown
15034             Defendor
Name: title, dtype: object

In [38]:
get_recommendations('Toy Story', cosine_sim2)

2998                                           Toy Story 2
15372                                          Toy Story 3
25779                           Toy Story That Time Forgot
21923                                 Toy Story of Terror!
3310                                     Creature Comforts
27376                                                Anina
29355                                          Cheburashka
40528                   VeggieTales: Josh and the Big Wall
40537    VeggieTales: Minnesota Cuke and the Search for...
40995                                              Uncle P
Name: title, dtype: object