In [1]:
import pandas as pd
import numpy as np
import ast
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
movies = pd.read_csv("Data/tmdb_5000_movies.csv")
credits = pd.read_csv("Data/tmdb_5000_credits.csv")

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


The movies dataframe contains 20 columns:

1) budget : Money spent to make the movie<br>
2) genres : Genres of the movie (Action, Romance, Sci-Fi, etc.)<br>
3) homepage : Official webpage of the movie<br>
4) id : TMDB id of the movie<br>
5) keywords : Tags associated with the movie (action, alien, planet, food, ghost, etc.)<br>
6) original_language : Language in which the movie was originally released<br>
7) original_title : Original title of the movie<br>
8) overview : A short summary of the movie<br>
9) popularity : Popularity of the movie<br>
10) production_companies : Production companies associated with the movie<br>
11) production_countries : Countries in which the movie was produced<br>
12) release_date : Release date<br>
13) revenue : Money generated by the movie<br>
14) runtime : Length of the movie in minutes<br>
15) spoken_languages : Languages spoken in the movie<br>
16) status : Status of the movie (Post Production, Released, Rumored)<br>
17) tagline : Tagline<br>
18) title : Title of the movie<br>
19) vote_average : Average rating<br>
20) vote_count : Number of votes<br>

In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


The credits dataframe has 4 columns:

1) movie_id : ID of the movie<br>
2) title : Title of the movie<br>
3) cast : Cast of the movie<br>
4) crew : Crew of the movie<br>

In [19]:
#Merging the two data frames on the basis of 'title'
df = pd.merge(movies,credits,on='title',how='left')
df.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [20]:
temp = df.copy()
temp.drop(['id'],axis=1,inplace=True)
df['id']

0        19995
1          285
2       206647
3        49026
4        49529
         ...  
4804      9367
4805     72766
4806    231617
4807    126186
4808     25975
Name: id, Length: 4809, dtype: int64

# What we are going to do

There are basically three types of recommender systems:<br>
1) Content-based<br>
2) Collaborative filtering<br>
3) Hybrid<br>

<h1>1. Content-Based Recommender System</h1>
<p>A content-based recommender works from data that the user provides, either explicitly (ratings) or implicitly (clicking on a link). From this data we build a <b>user profile</b>, which is then used to make suggestions. As the user provides more input or acts on the recommendations, the engine becomes more accurate.</p>

### User Profile:
<p>In the user profile, we create vectors that describe the user’s preferences. To build it, we use the utility matrix, which describes the relationship between users and items. With this information, the best estimate we can make of which items a user likes is some aggregation of the profiles of those items.</p>

### Item Profile:

<p>In a content-based recommender, we must build a profile for each item that represents the important characteristics of that item.
For example, if the item is a movie, then its actors, director, release year and genre are its most significant features. We can also add its IMDb (Internet Movie Database) rating to the item profile.</p>
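
As a minimal sketch of what an item profile can look like, the snippet below encodes a movie as a 0/1 vector over a small feature vocabulary. The vocabulary `FEATURES` and the helper `item_profile` are hypothetical names chosen for illustration; they are not part of this notebook's pipeline.

```python
# Hypothetical feature vocabulary (illustrative only, not derived from the dataset).
FEATURES = ["Action", "Adventure", "Comedy", "JamesCameron", "SamWorthington"]

def item_profile(item_features, vocabulary=FEATURES):
    """Return a 0/1 vector marking which vocabulary features the item has."""
    present = set(item_features)
    return [1 if feature in present else 0 for feature in vocabulary]

avatar = item_profile(["Action", "Adventure", "JamesCameron", "SamWorthington"])
print(avatar)  # [1, 1, 0, 1, 1]
```

Two movies that share many features end up with vectors that point in a similar direction, which is what the similarity step later relies on.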

### Utility Matrix:

<p>The utility matrix captures each user’s preference for certain items. From the data gathered from the user, we need to find some relation between the items the user likes and those the user dislikes; the utility matrix serves this purpose. We assign a value to each user-item pair, known as the degree of preference, and arrange these values in a matrix of users against items to identify their preference relationships.</p>
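
The idea can be sketched in a few lines, assuming made-up ratings and item profiles. Here one row of a utility matrix (one user's degrees of preference) is combined with item profiles into a user profile by a rating-weighted average. That is only one possible aggregation, and all names and values below are hypothetical.

```python
# Hypothetical item profiles over the features [Action, Comedy, Romance].
item_profiles = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 0, 1],
}

# One row of a utility matrix: a single user's degree of preference
# for the items they have rated (made-up values on a 1-5 scale).
utility_row = {"Movie A": 5, "Movie B": 3}

def user_profile(ratings, profiles):
    """Rating-weighted average of the profiles of the rated items."""
    n_features = len(next(iter(profiles.values())))
    total_weight = sum(ratings.values())
    profile = [0.0] * n_features
    for title, rating in ratings.items():
        for j, value in enumerate(profiles[title]):
            profile[j] += rating * value
    return [p / total_weight for p in profile]

print(user_profile(utility_row, item_profiles))  # [1.0, 0.375, 0.0]
```

The resulting vector says this user leans strongly towards Action, somewhat towards Comedy, and has shown no interest in Romance so far.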

## NOTE: We will build our recommender system using the content-based approach

In [6]:
# Keep only the columns we need; .copy() avoids working on a slice of the merged frame.
df = df[['id','title', 'genres', 'keywords', 'overview', 'cast','crew']].copy()
df.head(1)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [7]:
#Check for null values
df.isnull().sum()

id          0
title       0
genres      0
keywords    0
overview    3
cast        0
crew        0
dtype: int64

In [8]:
# We could drop these 3 rows/movies, but I will replace the nulls with empty strings instead.
# The reason is explained later in this notebook.
df = df.fillna("")
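
One likely motivation for filling with an empty string (an assumption; the notebook does not state the reason at this point) is that later steps call string methods such as `split()` on `overview`. A missing value in pandas is a float `NaN`, which has no such method, whereas an empty string simply yields an empty token list:

```python
overview_missing = float("nan")   # how a missing text cell typically appears in pandas
overview_filled = ""              # the same cell after filling with an empty string

# An empty string splits into an empty list, which is harmless downstream.
print(overview_filled.split())    # []

# A float NaN has no split() method, so the same call raises AttributeError.
try:
    overview_missing.split()
except AttributeError as exc:
    print("split() fails on NaN:", exc)
```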

In [9]:
# Check For duplicate rows
df.duplicated().sum()

0


The next task is to simplify the genres column:<br>
Convert:<br>
[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]<br>
to<br>
["Action", "Adventure", "Fantasy", "Science Fiction"]<br>


In [10]:
def convert_json(k):
    # Collect the 'name' value of every dictionary in the string-encoded list
    names=[]
    k=ast.literal_eval(k) # ast.literal_eval converts the string representation of a list into a real list
    for dic in k:
        if 'name' in dic:
            names.append(dic['name'])
    return names
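
Since these columns hold JSON text, `json.loads` is an alternative to `ast.literal_eval`. The sketch below uses a hypothetical helper `names_from_json` and a shortened sample string; it shows both approaches agreeing on the sample, and a case (a JSON `false` literal) where only `json.loads` succeeds.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from_json(text):
    """Parse a JSON array of objects and keep each object's 'name' value."""
    return [item["name"] for item in json.loads(text) if "name" in item]

print(names_from_json(raw))                         # ['Action', 'Adventure']
print([d["name"] for d in ast.literal_eval(raw)])   # ['Action', 'Adventure']

# JSON literals such as true/false/null are not Python literals,
# so ast.literal_eval rejects them while json.loads accepts them.
tricky = '[{"name": "X", "adult": false}]'
print(names_from_json(tricky))                      # ['X']
try:
    ast.literal_eval(tricky)
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True
print(literal_eval_failed)                          # True
```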

In [11]:
df['genres'] = df.genres.apply(convert_json)

In [12]:
df['keywords'] = df['keywords'].apply(convert_json)

In [13]:
def convert_json1(k):
    # Same as convert_json, but keep only the first 4 names (the top-billed cast)
    names=[]
    k=ast.literal_eval(k) # ast.literal_eval converts the string representation of a list into a real list
    count=0
    for dic in k:
        if count>=4:
            break
        if 'name' in dic:
            names.append(dic['name'])
            count+=1
    return names

In [14]:
df['cast'] = df['cast'].apply(convert_json1)

In [15]:
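def fetch_director(k):
    # Return a one-element list with the first crew member whose job is 'Director'
    director=[]
    k=ast.literal_eval(k) # ast.literal_eval converts the string representation of a list into a real list
    for dic in k:
        if dic['job']=='Director':
            director.append(dic['name'])
            break
    return director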
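
A slightly more defensive variant is sketched below, assuming some crew entries might lack a `job` key (that is an assumption, not something observed in this dataset). The helper name `first_director` and the sample people are hypothetical.

```python
import ast

def first_director(crew_text):
    """Return a one-element list with the first director's name, or [] if none."""
    crew = ast.literal_eval(crew_text)
    # .get() avoids a KeyError when an entry has no 'job' key.
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [] if name is None else [name]

sample = ('[{"job": "Editor", "name": "Person A"}, '
          '{"job": "Director", "name": "Person B"}, '
          '{"name": "Person C"}]')
print(first_director(sample))  # ['Person B']
print(first_director("[]"))    # []
```

Returning a list rather than a bare string keeps the column compatible with the list concatenation used later to build the tags.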

In [16]:
df['crew'] = df.crew.apply(fetch_director)

In [17]:
# Also we are converting the overview column from string to list
df['overview'] = df.overview.apply(lambda x:x.split())

In [18]:
df.head(1)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[In, the, 22nd, century,, a, paraplegic, Marin...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]


In [19]:
df['keywords'] = df['keywords'].apply(lambda x: [i.replace(" ",'') for i in x])
df['genres'] = df['genres'].apply(lambda x: [i.replace(" ",'') for i in x])
df['cast'] = df['cast'].apply(lambda x: [i.replace(" ",'') for i in x])
df['crew'] = df['crew'].apply(lambda x: [i.replace(" ",'') for i in x])
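
The reason for removing spaces can be illustrated with a small sketch: once the tags are split on whitespace, a two-word name becomes two unrelated tokens, so different people sharing a first name would look partly alike. The helper `tokens` and the two example names are illustrative only.

```python
def tokens(names, join_names):
    """Lower-cased tokens as a whitespace tokenizer would see them."""
    if join_names:
        names = [n.replace(" ", "") for n in names]
    return set(" ".join(names).lower().split())

movie_a = ["Sam Worthington"]
movie_b = ["Sam Mendes"]

# With spaces kept, the two different people share the token 'sam'.
print(tokens(movie_a, False) & tokens(movie_b, False))  # {'sam'}

# With spaces removed, each full name is a single distinct token.
print(tokens(movie_a, True) & tokens(movie_b, True))    # set()
```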

In [20]:
df['tags'] = df['keywords']+df['genres']+df['cast']+df['crew']+df['overview']

In [21]:
df.head(2)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew,tags
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[cultureclash, future, spacewar, spacecolony, ..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[ocean, drugabuse, exoticisland, eastindiatrad..."


In [22]:
new_df = df[['id','title','tags']]
new_df

Unnamed: 0,id,title,tags
0,19995,Avatar,"[cultureclash, future, spacewar, spacecolony, ..."
1,285,Pirates of the Caribbean: At World's End,"[ocean, drugabuse, exoticisland, eastindiatrad..."
2,206647,Spectre,"[spy, basedonnovel, secretagent, sequel, mi6, ..."
3,49026,The Dark Knight Rises,"[dccomics, crimefighter, terrorist, secretiden..."
4,49529,John Carter,"[basedonnovel, mars, medallion, spacetravel, p..."
...,...,...,...
4804,9367,El Mariachi,"[unitedstates–mexicobarrier, legs, arms, paper..."
4805,72766,Newlyweds,"[Comedy, Romance, EdwardBurns, KerryBishé, Mar..."
4806,231617,"Signed, Sealed, Delivered","[date, loveatfirstsight, narration, investigat..."
4807,126186,Shanghai Calling,"[DanielHenney, ElizaCoupe, BillPaxton, AlanRuc..."


In [23]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
new_df = new_df.assign(tags=new_df['tags'].apply(lambda x:" ".join(x)))


In [24]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
new_df = new_df.assign(tags=new_df['tags'].str.lower())


In [25]:
# Creating vectors from the tags
cv = CountVectorizer(max_features=5000, stop_words='english')

In [26]:
vectors = cv.fit_transform(new_df['tags']).toarray()
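
To make the vectorization step concrete, here is a bag-of-words sketch in plain Python on three made-up tag strings. It only mirrors the core idea of counting vocabulary terms per document; it does not reproduce `CountVectorizer` details such as `max_features` or stop-word removal.

```python
from collections import Counter

# Made-up tag strings standing in for the 'tags' column.
docs = ["action space future", "space war space colony", "comedy romance"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are counts.
bow_vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)        # ['action', 'colony', 'comedy', 'future', 'romance', 'space', 'war']
for row in bow_vectors:
    print(row)
# [1, 0, 0, 1, 0, 1, 0]
# [0, 1, 0, 0, 0, 2, 1]
# [0, 0, 1, 0, 1, 0, 0]
```

Each movie becomes a row of counts, and rows with overlapping non-zero columns are the ones the similarity step treats as close.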

In [27]:
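# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv.get_feature_names_out().tolist()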

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '18th',
 '19',
 '1930s',
 '1940s',
 '1950s',
 '1960s',
 '1970s',
 '1980',
 '1980s',
 '1985',
 '1990s',
 '19th',
 '19thcentury',
 '20',
 '200',
 '2009',
 '20th',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '60s',
 '70',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'abandoned',
 'abducted',
 'abigailbreslin',
 'abilities',
 'ability',
 'able',
 'aboard',
 'abuse',
 'abusive',
 'academy',
 'accept',
 'accepted',
 'accepts',
 'access',
 'accident',
 'accidental',
 'accidentally',
 'accompanied',
 'accomplish',
 'account',
 'accountant',
 'accused',
 'ace',
 'achieve',
 'act',
 'acting',
 'action',
 'actionhero',
 'actions',
 'activist',
 'activities',
 'activity',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adam',
 'adambrody',
 'adams',
 'adamsandler',
 'adamshankman',
 'adaptation',
 'adapted',
 'addict',
 'addicted',
 'addiction',
 'adolescence',
 'adolesc

In [28]:
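stemmer = PorterStemmer()  # create the stemmer once and reuse it for every word

def stem(t):
    # Reduce each whitespace-separated word to its Porter stem
    return " ".join(stemmer.stem(i) for i in t.split())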

In [29]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
new_df = new_df.assign(tags=new_df['tags'].apply(stem))


In [30]:
# Creating vectors from the tags
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray()
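# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv.get_feature_names_out().tolist()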

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '17th',
 '18',
 '18th',
 '18thcenturi',
 '19',
 '1910',
 '1920',
 '1930',
 '1940',
 '1950',
 '1950s',
 '1960',
 '1960s',
 '1970',
 '1970s',
 '1971',
 '1974',
 '1976',
 '1980',
 '1985',
 '1990',
 '1999',
 '19th',
 '19thcenturi',
 '20',
 '200',
 '2003',
 '2009',
 '20th',
 '21st',
 '23',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '70',
 '80',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'abandon',
 'abduct',
 'abigailbreslin',
 'abil',
 'abl',
 'aboard',
 'abov',
 'abus',
 'academi',
 'accept',
 'access',
 'accid',
 'accident',
 'acclaim',
 'accompani',
 'accomplish',
 'account',
 'accus',
 'ace',
 'achiev',
 'acquaint',
 'act',
 'action',
 'actionhero',
 'activ',
 'activist',
 'activities',
 'actor',
 'actress',
 'actual',
 'ad',
 'adam',
 'adambrodi',
 'adamsandl',
 'adamshankman',
 'adapt',
 'add',
 'addict',
 'adjust',
 'admir',
 'admit',
 'adolesc',
 'adopt',
 'ador',
 'adrienbrodi'

In [31]:
similarity = cosine_similarity(vectors)
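
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand for a single pair; the function name `cosine` is illustrative, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch-level choice: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])   # parallel vectors, close to 1
no_overlap = cosine([1, 0, 0], [0, 3, 1])       # no shared terms, exactly 0
partial = cosine([1, 1, 0], [1, 0, 1])          # one shared term out of two, close to 0.5

print(round(same_direction, 6), no_overlap, round(partial, 6))
```

Because the measure depends only on direction, a movie with a long overview is not favoured over one with a short overview simply for containing more words.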

In [32]:
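def movie_rec(df,movie,similarity):
    # Index of the first row whose title matches exactly
    movie_index = df[df['title'] == movie].index[0]
    # Sort (index, score) pairs by score, highest first; skip the top entry
    # (expected to be the movie itself) and keep the next five
    movie_list = sorted(list(enumerate(similarity[movie_index])),reverse=True,key= lambda x:x[1])[1:6]
    
    for i in movie_list:
        print(df['title'].iloc[i[0]])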

In [33]:
movie_rec(new_df,'Batman',similarity)

Batman
Batman & Robin
Batman Begins
Batman Returns
The R.M.
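
Note that the queried title itself appears in the list above. The `[1:6]` slice assumes the queried movie always sorts first; when another row ties at the top score (for example a row with identical tags), that assumption may not hold. Whether that is what happens here is not confirmed by the notebook. A sketch that excludes the query by index instead of by rank, using the hypothetical helper `top_k_similar` and a made-up similarity row:

```python
import heapq

def top_k_similar(similarity_row, query_index, k=5):
    """Indices of the k highest-scoring items, never including the query itself."""
    candidates = (
        (score, idx)
        for idx, score in enumerate(similarity_row)
        if idx != query_index
    )
    return [idx for score, idx in heapq.nlargest(k, candidates)]

# Made-up row: item 2 ties with the query (item 0) at the top score.
row = [1.0, 0.2, 1.0, 0.7, 0.4]
print(top_k_similar(row, query_index=0, k=3))  # [2, 3, 4]
```

This only removes the query row; rows that merely share its title would still need to be filtered by title if they are unwanted.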


In [37]:
k = df.title.tolist()

In [44]:
k[4000:6000]

['The Dress',
 'A Guy Named Joe',
 'Blazing Saddles',
 'Friday the 13th: The Final Chapter',
 'Ida',
 'Maurice',
 'Beer League',
 'Riding Giants',
 'Timecrimes',
 'Silver Medalist',
 'Timber Falls',
 "Singin' in the Rain",
 'Fat, Sick & Nearly Dead',
 'A Haunted House',
 "2016: Obama's America",
 'That Thing You Do!',
 'Halloween III: Season of the Witch',
 'Escape from the Planet of the Apes',
 'Hud',
 'Kevin Hart: Let Me Explain',
 'My Own Private Idaho',
 'Garden State',
 'Before Sunrise',
 'Evil Words',
 "Jesus' Son",
 'Saving Face',
 'Brick Lane',
 'Robot & Frank',
 'My Life Without Me',
 'The Spectacular Now',
 'Religulous',
 'Fuel',
 "Valley of the Heart's Delight",
 'Eye of the Dolphin',
 '8: The Mormon Proposition',
 'The Other End of the Line',
 'Anatomy',
 'Sleep Dealer',
 'Super',
 'Christmas Mail',
 'Stung',
 'Antibirth',
 'Get on the Bus',
 'Thr3e',
 'Idiocracy',
 'The Rise of the Krays',
 'This Is England',
 'U.F.O.',
 'Bathing Beauty',
 'Go for It!',
 'Dancer, Texas Pop