#      Movie recommendation System (tmdb_5000 dataset)

## Overview of Recommendation system

#### What is Recommendation System?
A recommendation system is an artificial intelligence algorithm, usually associated with machine learning, that uses data(Big Data) to suggest or recommend additional products/items to consumers. These can be based on various criteria, including past purchases, search history, demographic information, and other factors as well.

#### Why is Recommendation System needed?
Recommender systems are information filtering systems that deal with the problem of information overload by filtering vital information fragment out of large amount of dynamically generated information according to user's preferences, interest, or observed behavior about item.<br>
Some of the most popular examples of recommender systems include the ones used by Amazon, Netflix, and Spotify. Amazon's recommender system is based on a combination of collaborative filtering and content-based algorithms. It uses past customer behavior to make recommendations for new products.

#### What are the types of Recommendation System?
##### There are three main types of recommendation systems –
##### 1. Collaborative Filtering
The collaborative filtering method is based on gathering and analyzing data on user’s behavior. This includes the user’s online activities and predicting what they will like based on the similarity with other users.

##### 2. Content-Based Filtering
Content-based filtering methods are based on the description of a product and a profile of the user’s preferred choices. In this recommendation system, products are described using keywords, and a user profile is built to express the kind of item this user likes.

##### 3. Hybrid Recommendation Systems
In hybrid recommendation systems, products are recommended using both content-based and collaborative filtering simultaneously to suggest a broader range of products to customers. This recommendation system is up-and-coming and is said to provide more accurate recommendations than other recommender systems.

## Project overview

1. In this project we will be using the tmdb_5000 dataset as this ammount of data can be decent enough for the wroking.<br>
2. We will be using the content based recommendation system.<br>
3. We will be using term-frequency and inverse-document-frequency vectorization for generating the text in vectors.<br>
4. We will be using the cosine function to find nearest vectors.

## Importing the required libraries

In [1]:
import numpy as np # for the calculations and transformation to array
import pandas as pd # for working with dataframes
import ast # for converting string into iterables
from nltk.stem.porter import PorterStemmer as ps # for extracting out the root word
from sklearn.metrics.pairwise import cosine_similarity as cs # for calculating the maximum similarites
from sklearn.feature_extraction.text import TfidfVectorizer # for converting text data (tags) into vectors
import pickle # for making the new data accessible

## Loading the datasets

In [2]:
#loading the datasets
movies = pd.read_csv("tmdb_5000_movies.csv")
credit = pd.read_csv("tmdb_5000_credits.csv")

In [3]:
#getting an overview od the movies data
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
#getting an overview od the movies data
movies.shape

(4803, 20)

In [5]:
#getting an overview od the credit data
credit.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
#getting an overview od the credit data
credit.shape

(4803, 4)

## Cleaning and Preprocessing the data

#### Dropping the not required attributes

We will be dropping the bellow listed attributes as they may not be useful for our recomendation system:-
1. budget :- the budget of movie does not help in recommendation of movies.
2. homepage :- wherthere the movie has an official homepage or not it does not help in recommendation of movies.
4. orignal_title :- will be using the title (english) for better working.
3. popularity :- numeric field.
4. production_companies :- production companies are not a good attribute for recommendation of movies.
5. production_countries :- production countries are not a good attribute for recommendation of movies.
6. release_date :- wherthere the movie has declared a release date or not it is not a good attribute of recommendation of movies.
7. revenue :- numeric field as well as a top revenue making movie may not be helpful in recommendation system.
8. runtime :- numeric field as well as a runtime of movie may not be helpful in recommendation system.
9. status :- wherthere the movie has released or not it does not help in recommendation of movies.
10. tagline :- may be vague or misleading.
11. vote_average :- numeric field.
12. vote_count :- numeric field.

In [7]:
#dropping the above listed columns
movies.drop(["budget","homepage","original_title","popularity","production_companies","production_countries","release_date","revenue","runtime","status","tagline","vote_average","vote_count"],axis=1,inplace=True)

In [8]:
# data after dropping the columns
movies.head()

Unnamed: 0,genres,id,keywords,original_language,overview,spoken_languages,title
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Avatar
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",Pirates of the Caribbean: At World's End
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,A cryptic message from Bond’s past sends him o...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Spectre
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,Following the death of District Attorney Harve...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",The Dark Knight Rises
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,"John Carter is a war-weary, former military ca...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",John Carter


In [9]:
#observing the language column
movies["original_language"].value_counts()

en    4505
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
cn      12
ru      11
ko      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: original_language, dtype: int64

In [10]:
#observing the language column
movies["original_language"].count()

4803

#### Conclusion:-
Since the values in "original_language" is more than 95% of english we will be discarding the "original_language" as well as "spoken_languages" as the most of them are in English.

In [11]:
#dropping the original_language",spoken_languages columns
movies.drop(["original_language","spoken_languages"],axis=1,inplace=True)

In [12]:
#data after the removing the language columns
movies.head()

Unnamed: 0,genres,id,keywords,overview,title
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",Avatar
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,Spectre
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,The Dark Knight Rises
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...",John Carter


<em>Note:- </em>We will be dropping the title from the credits dataset as it is already present in movies dataset and merge them both.

In [13]:
#merging the required data
credit.drop(["title"],axis=1,inplace=True)
credit.rename(columns = {'movie_id':'id'}, inplace = True)
movies =movies.merge(credit,on="id")

In [14]:
#checking for nulls
movies.isnull().sum()

genres      0
id          0
keywords    0
overview    3
title       0
cast        0
crew        0
dtype: int64

<em>Conclusion:- </em>As the overview is necessary for building tags so we will be dropping the 3 rows having null overview.

In [15]:
# dropping the null rows
movies.dropna(inplace=True)

In [16]:
#data after dropping the null rows
movies.isnull().sum()

genres      0
id          0
keywords    0
overview    0
title       0
cast        0
crew        0
dtype: int64

In [17]:
#checking for duplicacy
movies.duplicated().sum()

0

In [18]:
#observing the datatype of the data
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

<em>Conclusion:- </em>As we noticed the data in our various features are in string format and is containing a lot of redundant and be misleading data. So, We will be extracting the required data.

In [19]:
#function to get genres and keywords only
def get_name(obj):
    li=[]
    for i in ast.literal_eval(obj):
        li.append(i["name"])
    return li

In [20]:
#applying the function on the data
movies["genres"] = movies["genres"].apply(get_name)
movies["keywords"] = movies["keywords"].apply(get_name)

In [21]:
#data after operating the functions on it
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...",Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[Adventure, Fantasy, Action]",285,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[Action, Adventure, Crime]",206647,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[Action, Crime, Drama, Thriller]",49026,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[Action, Adventure, Science Fiction]",49529,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...",John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [22]:
#observing the cast column
movies.iloc[0].cast

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

<em>Conclusion:- </em>As we noticed the data in the cast feature has a number of actors in a single film so we will be taking out top 5 actors from the data.

In [23]:
#function to get the required 5 top actors
def get_actor_name(obj):
    li=[]
    count=0
    for i in ast.literal_eval(obj):
        li.append(i["name"])
        count+=1
        if count>5:
            break
    return li

In [24]:
#applying the function on the data
movies["cast"] = movies["cast"].apply(get_actor_name)

In [25]:
#data after applying the function on it
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...",Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[Adventure, Fantasy, Action]",285,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[Action, Adventure, Crime]",206647,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[Action, Crime, Drama, Thriller]",49026,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[Action, Adventure, Science Fiction]",49529,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...",John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [26]:
#observing the crew feature
movies.iloc[0].crew

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

<em>Conclusion:- </em>As the crew has a lot of members this can be a downgrade if we add all of them in the system. So, We will only be taking out the director name from that.

In [27]:
#function to get the director name only
def get_director(obj):
    li=[]
    for i in ast.literal_eval(obj):
        if i["job"]=="Director":
            li.append(i["name"])
            break
    return li

In [28]:
#applying the function on the data
movies["crew"] = movies["crew"].apply(get_director)

In [29]:
#data after applying the function on it
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...",Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,"[Adventure, Fantasy, Action]",285,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,"[Action, Adventure, Crime]",206647,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,"[Action, Crime, Drama, Thriller]",49026,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,"[Action, Adventure, Science Fiction]",49529,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...",John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


In [30]:
#tokenizing the overview feature for further processing
movies["overview"] = movies["overview"].apply(lambda x: x.split(" "))

In [31]:
# data after tokenizing
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,"[culture clash, future, space war, space colon...","[In, the, 22nd, century,, a, paraplegic, Marin...",Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,"[Adventure, Fantasy, Action]",285,"[ocean, drug abuse, exotic island, east india ...","[Captain, Barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,"[Action, Adventure, Crime]",206647,"[spy, based on novel, secret agent, sequel, mi...","[A, cryptic, message, from, Bond’s, past, send...",Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,"[Action, Crime, Drama, Thriller]",49026,"[dc comics, crime fighter, terrorist, secret i...","[Following, the, death, of, District, Attorney...",The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,"[Action, Adventure, Science Fiction]",49529,"[based on novel, mars, medallion, space travel...","[John, Carter, is, a, war-weary,, former, mili...",John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


In [32]:
#removing " " whitespaces to prevent from misleading tag formation
#like:- "Sam Alex" to "samalex" denoting that sam alex is one tag and not different or seperate 
li = ["genres","keywords","overview","cast","crew"]
for i in li:
    movies[i]=movies[i].apply(lambda x: [j.replace(" ","").lower() for j in x])

In [33]:
#data after lowering and removing white spaces
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew
0,"[action, adventure, fantasy, sciencefiction]",19995,"[cultureclash, future, spacewar, spacecolony, ...","[in, the, 22nd, century,, a, paraplegic, marin...",Avatar,"[samworthington, zoesaldana, sigourneyweaver, ...",[jamescameron]
1,"[adventure, fantasy, action]",285,"[ocean, drugabuse, exoticisland, eastindiatrad...","[captain, barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,"[johnnydepp, orlandobloom, keiraknightley, ste...",[goreverbinski]
2,"[action, adventure, crime]",206647,"[spy, basedonnovel, secretagent, sequel, mi6, ...","[a, cryptic, message, from, bond’s, past, send...",Spectre,"[danielcraig, christophwaltz, léaseydoux, ralp...",[sammendes]
3,"[action, crime, drama, thriller]",49026,"[dccomics, crimefighter, terrorist, secretiden...","[following, the, death, of, district, attorney...",The Dark Knight Rises,"[christianbale, michaelcaine, garyoldman, anne...",[christophernolan]
4,"[action, adventure, sciencefiction]",49529,"[basedonnovel, mars, medallion, spacetravel, p...","[john, carter, is, a, war-weary,, former, mili...",John Carter,"[taylorkitsch, lynncollins, samanthamorton, wi...",[andrewstanton]


In [34]:
#concatinating all the data in new feature named tags
movies["tags"] = movies["genres"]+movies["keywords"]+movies["overview"]+movies["cast"]+movies["crew"]
movies["tags"] =movies["tags"].apply(lambda x: " ".join(x))

In [35]:
#making a function for extracting the root word from the words
pstem = ps()
def stem_it(string):
    li=[]
    for i in string.split(" "):
        li.append(pstem.stem(i))
    return " ".join(li)

In [36]:
#applying the function on the data
movies["tags"] =movies["tags"].apply(stem_it)

In [37]:
#data after applying the function on it
movies.head()

Unnamed: 0,genres,id,keywords,overview,title,cast,crew,tags
0,"[action, adventure, fantasy, sciencefiction]",19995,"[cultureclash, future, spacewar, spacecolony, ...","[in, the, 22nd, century,, a, paraplegic, marin...",Avatar,"[samworthington, zoesaldana, sigourneyweaver, ...",[jamescameron],action adventur fantasi sciencefict culturecla...
1,"[adventure, fantasy, action]",285,"[ocean, drugabuse, exoticisland, eastindiatrad...","[captain, barbossa,, long, believed, to, be, d...",Pirates of the Caribbean: At World's End,"[johnnydepp, orlandobloom, keiraknightley, ste...",[goreverbinski],adventur fantasi action ocean drugabus exotici...
2,"[action, adventure, crime]",206647,"[spy, basedonnovel, secretagent, sequel, mi6, ...","[a, cryptic, message, from, bond’s, past, send...",Spectre,"[danielcraig, christophwaltz, léaseydoux, ralp...",[sammendes],action adventur crime spi basedonnovel secreta...
3,"[action, crime, drama, thriller]",49026,"[dccomics, crimefighter, terrorist, secretiden...","[following, the, death, of, district, attorney...",The Dark Knight Rises,"[christianbale, michaelcaine, garyoldman, anne...",[christophernolan],action crime drama thriller dccomic crimefight...
4,"[action, adventure, sciencefiction]",49529,"[basedonnovel, mars, medallion, spacetravel, p...","[john, carter, is, a, war-weary,, former, mili...",John Carter,"[taylorkitsch, lynncollins, samanthamorton, wi...",[andrewstanton],action adventur sciencefict basedonnovel mar m...


## Building the System

In [38]:
#creating a new dataframe for building the system
df = movies[["id","title","tags"]]

In [39]:
#overview of cleaned and preprocessed data
df.head()

Unnamed: 0,id,title,tags
0,19995,Avatar,action adventur fantasi sciencefict culturecla...
1,285,Pirates of the Caribbean: At World's End,adventur fantasi action ocean drugabus exotici...
2,206647,Spectre,action adventur crime spi basedonnovel secreta...
3,49026,The Dark Knight Rises,action crime drama thriller dccomic crimefight...
4,49529,John Carter,action adventur sciencefict basedonnovel mar m...


In [40]:
#defining the vectorizer object for converting text into vector
tfv = TfidfVectorizer(min_df=3,  max_features=None,
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3),
            stop_words = 'english')

In [41]:
#fitting the data and generating the sparse matrix of the output
tfv_matrix = tfv.fit_transform(df['tags'])
print(tfv_matrix)
print(tfv_matrix.shape)

  (0, 487)	0.14638111978770688
  (0, 299)	0.1301932340077509
  (0, 11266)	0.15766581701724686
  (0, 5832)	0.16938220983823407
  (0, 150)	0.16471036612147122
  (0, 14066)	0.15766581701724686
  (0, 14156)	0.15240417389208774
  (0, 5535)	0.11966000876155689
  (0, 480)	0.10396476910930964
  (0, 293)	0.08035897292535689
  (0, 8067)	0.15487013776078007
  (0, 6295)	0.13908879862700216
  (0, 10237)	0.14820283299645423
  (0, 14388)	0.15487013776078007
  (0, 13872)	0.13186904771025285
  (0, 16869)	0.14035806568332604
  (0, 13214)	0.15240417389208774
  (0, 2779)	0.15766581701724686
  (0, 12121)	0.09817305983858192
  (0, 11261)	0.09228126945207055
  (0, 5826)	0.08675791673895941
  (0, 15486)	0.1286416728623388
  (0, 1508)	0.06816900261292927
  (0, 10371)	0.09322390750588666
  (0, 15864)	0.1301932340077509
  :	:
  (4799, 12055)	0.10833263810391136
  (4799, 9234)	0.13417487642757142
  (4799, 2010)	0.26834975285514284
  (4799, 4596)	0.08843082178950815
  (4799, 13209)	0.11377326760815681
  (4799, 132

In [42]:
#calculating the cosine similarity of each data with respect to every other data
cosine_sim = cs(tfv_matrix, tfv_matrix)
print(cosine_sim)
print(cosine_sim.shape)

[[1.         0.02367474 0.0255502  ... 0.01389118 0.         0.        ]
 [0.02367474 1.         0.00788156 ... 0.01081861 0.         0.00365613]
 [0.0255502  0.00788156 1.         ... 0.01111705 0.         0.        ]
 ...
 [0.01389118 0.01081861 0.01111705 ... 1.         0.01337321 0.02622117]
 [0.         0.         0.         ... 0.01337321 1.         0.01265371]
 [0.         0.00365613 0.         ... 0.02622117 0.01265371 1.        ]]
(4800, 4800)


In [43]:
#defining the main recommendation function
def recommend(movie):
    movie_ind = df[df["title"]==movie].index[0]
    dist = cosine_sim[movie_ind]
    li = sorted(list(enumerate(dist)),reverse=True,key=lambda x:x[1])[1:11]
    for i in li:
        ind=i[0]
        print(df.iloc[ind].title)
#recommend("Batman Begins")

In [44]:
#creating a file of our work for further taskes
#pickle.dump(df,open("movies.pkl","wb"))
#pickle.dump(cosine_sim,open("similarity.pkl","wb"))