# Content-Based Movie Recommender System :

### Name :  Raj Shivprakash Poonam Jaiswal

#### Dataset  : https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

## What is a Recommender System?


A recommender system is a **system that seeks to predict or filter items according to a user's preferences.** Recommender systems are used in a variety of areas, including **movies, music, news, books, research articles, search queries, social tags, and products in general.**



## Types of Recommender Systems :

<img src = "Type of RS.png">

## What is a Content-Based Recommendation System?

Content-based filtering is a machine learning **technique that uses similarities in item features to make decisions.** It is often used in recommender systems, which are **algorithms designed to advertise or recommend things to users based on knowledge accumulated about the user.**
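
As a minimal sketch of the idea, assume each item is described by a set of feature tags. The catalogue, titles and tags below are made up for illustration, and Jaccard overlap is used only because it is short to write; this notebook itself compares word counts with cosine similarity.

```python
# Toy content-based filtering: items are compared by their own features.
# The catalogue below is hypothetical and only illustrates the idea.
catalogue = {
    "Movie A": {"action", "space", "alien"},
    "Movie B": {"action", "space", "robot"},
    "Movie C": {"romance", "comedy"},
}

def jaccard(a, b):
    """Share of features two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(title):
    """Return the other titles ranked by feature overlap with `title`."""
    target = catalogue[title]
    others = [t for t in catalogue if t != title]
    return sorted(others, key=lambda t: jaccard(target, catalogue[t]), reverse=True)

print(most_similar("Movie A"))  # ['Movie B', 'Movie C']
```

No information about other users is needed here; only the item features matter.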

## What is a Collaborative Recommendation System?

Collaborative filtering is a **technique that can filter out items that a user might like on the basis of reactions by similar users.** It works by **searching a large group of people and finding a smaller set of users with tastes similar to a particular user.**
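
A rough sketch of that neighbour search, assuming a small table of hypothetical user ratings (the names and numbers are invented and not taken from the dataset used in this notebook):

```python
import math

# Hypothetical ratings (user -> {movie: rating}); not taken from the dataset.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "carol": {"Movie A": 1, "Movie C": 5, "Movie E": 4},
}

def user_similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in shared)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)

def suggest(user):
    """Movies rated by the most similar user that `user` has not rated yet."""
    neighbour = max((v for v in ratings if v != user),
                    key=lambda v: user_similarity(user, v))
    return sorted(m for m in ratings[neighbour] if m not in ratings[user])

print(suggest("alice"))  # ['Movie D']
```

Here alice and bob agree on the movies they both rated, so bob's remaining movie is suggested to alice.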

## What is a Hybrid Recommendation System?

A hybrid recommender system **recommends items by combining two or more methods,** such as the content-based method, the collaborative filtering-based method, the demographic method and the knowledge-based method.
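
One simple way to combine two methods is a weighted average of their scores. The sketch below assumes both recommenders already produce a score between 0 and 1 per movie; the function name, the weight and the scores are illustrative only.

```python
def hybrid_score(content_score, collaborative_score, weight=0.5):
    """Blend two scores in [0, 1]; `weight` is the share given to the content score."""
    return weight * content_score + (1 - weight) * collaborative_score

# Hypothetical per-movie scores from two separate recommenders.
content = {"Movie B": 0.50, "Movie C": 0.00, "Movie D": 0.20}
collaborative = {"Movie B": 0.30, "Movie C": 0.10, "Movie D": 0.90}

ranked = sorted(content, key=lambda m: hybrid_score(content[m], collaborative[m]), reverse=True)
print(ranked)  # ['Movie D', 'Movie B', 'Movie C']
```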

### Stages of the Project : Content-Based Movie Recommendation System

**1. Data Preprocessing**

**2. Model Building**

**3. Website Building**

**4. Deployment**

In [1]:
# Import Python Libraries / Modules :

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Import the datasets into the notebook :

movies = pd.read_csv("movies.csv")
credits = pd.read_csv("credits.csv")

print("Both Data Imported Successfully!!!!!!")

Both Data Imported Successfully!!!!!!


#### Observe movies dataset 

In [3]:
# First row of the dataset, to see the attributes and their contents

movies.head(1)

# This dataset has 20 columns, and several of them (e.g. genres, keywords, production_companies) hold a
# list of dictionaries stored as a string. It is better to drop the columns that add little value, i.e. the
# attributes that are unlikely to influence the user's choice.
# So we will select 6-9 columns from this dataset to keep the cleaning manageable.

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
# Shape of the dataset (rows x columns)

movies.shape

(4803, 20)

In [5]:
# Total number of duplicate rows in a dataset

movies.duplicated().sum()

0

In [6]:
# Total NA values in each column of the dataset

movies.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

In [7]:
# Dataset datatypes information

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

#### Observe credits dataset 

In [8]:
# First row of the dataset, to see the attributes and their contents

credits.head(1)

# cast and crew are each a list of dictionaries stored as a string. Every dictionary describes one person,
# with keys such as "name" and "character" (cast) or "name", "job" and "department" (crew).

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [9]:
# Shape of the dataset (rows x columns)

credits.shape

(4803, 4)

In [10]:
# Total number of duplicate rows in a dataset

credits.duplicated().sum()

0

In [11]:
# Total NA values in each column of the dataset

credits.isnull().sum()

movie_id    0
title       0
cast        0
crew        0
dtype: int64

In [12]:
# Dataset datatypes information

credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [13]:
# After observing both datasets, we can merge them on the common column "title" to get a single dataset to work with.

new_dataset = movies.merge(credits, on="title")

In [14]:
# head of new_dataset after merge

new_dataset.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [15]:
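# Shape of new_dataset, expected -> (4809, 23): 20 + 4 columns minus the shared "title" column.
# There are more rows than the original 4803 because a few titles occur more than once.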

new_dataset.shape

(4809, 23)
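
A join on `title` pairs every matching row from both tables, so a title that occurs twice in each table produces four rows instead of two. The pure-Python sketch below mimics an inner join on made-up rows to show the effect; the rows and the helper `inner_join` are illustrative and not part of the notebook.

```python
# Hypothetical rows: two different films share the title "Twin Title".
movies_rows = [
    {"id": 1, "title": "Avatar"},
    {"id": 2, "title": "Twin Title"},
    {"id": 3, "title": "Twin Title"},
]
credits_rows = [
    {"movie_id": 1, "title": "Avatar"},
    {"movie_id": 2, "title": "Twin Title"},
    {"movie_id": 3, "title": "Twin Title"},
]

def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row whose key matches."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_key] == r[right_key]
    ]

by_title = inner_join(movies_rows, credits_rows, "title", "title")
by_id = inner_join(movies_rows, credits_rows, "id", "movie_id")

print(len(by_title))  # 5: the two "Twin Title" rows match each other crosswise
print(len(by_id))     # 3: one row per film
```

In pandas, an id-based join could be written as `movies.merge(credits, left_on="id", right_on="movie_id")`. That would change the row count and produce `title_x` / `title_y` columns, so the rest of the notebook would need adjusting; it is mentioned here only as an alternative.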

In [16]:
# The next step is to drop some columns from new_dataset, for the reasons discussed earlier.
# Columns we are not keeping include:

# 1. budget 2. homepage 3. id 4. original_language (95% are English) 5. original_title 6. popularity (numerical)
# 7. production_companies 8. production_countries 9. release_date (not sure)

new_dataset = new_dataset[['movie_id','title','overview','genres','keywords','cast','crew']]


In [17]:
# Head of new_dataset after dropping columns

new_dataset.head(1)

# This is the dataset we will work with. The next stage is preprocessing, where we build a dataset with 3 columns:
# 1. movie_id 2. title 3. tags (overview, genres, keywords, cast, crew)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [18]:
# Shape of new_dataset, expected -> (4809, 7)

new_dataset.shape

(4809, 7)

In [19]:
# Total NA values in each column of new_dataset

new_dataset.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [20]:
# Remove the 3 rows with a missing overview
# (reassigning instead of inplace=True avoids a SettingWithCopyWarning on the column subset)

new_dataset = new_dataset.dropna()

In [21]:
# Check the total NA values in each column of new_dataset again

new_dataset.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [22]:
# Total number of duplicate rows in a new_dataset

new_dataset.duplicated().sum()

0

In [23]:
# Dataset datatypes information

new_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4806 non-null   int64 
 1   title     4806 non-null   object
 2   overview  4806 non-null   object
 3   genres    4806 non-null   object
 4   keywords  4806 non-null   object
 5   cast      4806 non-null   object
 6   crew      4806 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.4+ KB


## 1. Data Preprocessing :

In [24]:
# Let's start cleaning with the genres column which, as discussed, contains a list of dictionaries.

new_dataset["genres"][0] # genres of the first movie

# We only need the "name" values, since they carry the useful information, so let's extract them with a function.

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [25]:
# Build a function called genres to extract the "name" values from the dictionaries.

import ast

# ast.literal_eval parses a string containing a Python literal (list, dict, number, ...) and raises an
# exception for anything else, so unlike eval it never executes arbitrary code.
# Prefer ast.literal_eval over eval whenever you need to parse such strings.

def genres(obj):
    names = []  # avoid calling a variable "list", which would shadow the built-in
    for i in ast.literal_eval(obj):  # each value is a string, so parse it into a list of dicts first
        names.append(i["name"])
    return names

In [26]:
# Now store the genre names back in the genres column

new_dataset["genres"] = new_dataset["genres"].apply(genres)

# Observe the changes in the genres column

new_dataset["genres"]

# Wow, we achieved it!

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object
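
To see what the parsing step does in isolation, here is a small sketch on a shortened sample string. The sample is trimmed from the genres value shown above; using `json.loads` as a cross-check assumes the column holds valid JSON, which appears to be the case for the values displayed here.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Both parsers turn the string into a list of dicts without executing code.
parsed_ast = ast.literal_eval(raw)
parsed_json = json.loads(raw)

names = [item["name"] for item in parsed_ast]
print(names)                      # ['Action', 'Adventure']
print(parsed_ast == parsed_json)  # True

# literal_eval rejects anything that is not a plain literal.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```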

In [27]:
# What's next? Clean the keywords column in the same fashion.

# Like genres, the keywords column contains a list of dictionaries.

new_dataset["keywords"][0]  # keywords of the first movie

# Again we only need the "name" values, so let's extract them with a function.

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [28]:
# Build a function called keywords that fetches only the "name" values from the dictionaries.

import ast

def keywords(obj):
    names = []
    for i in ast.literal_eval(obj):
        names.append(i["name"])
    return names

In [29]:
# Now store the keyword names back in the keywords column

new_dataset["keywords"] = new_dataset["keywords"].apply(keywords)

# Observe the changes in the keywords column

new_dataset["keywords"]

# Wow, we achieved it again!

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [30]:
# What's next? Two more columns: cast and crew.
# Let's look at the cast column first, then decide what we need to do and how.

new_dataset["cast"][0] # cast of the first movie

# It is a huge list, but we are only interested in the first 3 dictionaries, since they hold the lead actors' names,
# which will surely contribute to the tags column.

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [31]:
# As usual, build a function to get the desired result

import ast


def cast(obj):
    # keep the names of the first 3 cast members (fewer if the list is shorter)
    return [i["name"] for i in ast.literal_eval(obj)[:3]]

In [32]:
# Apply the function and check that it works as required
new_dataset["cast"] = new_dataset["cast"].apply(cast)

# Observe the changes in the cast column
new_dataset["cast"]

# Wow, we did it again!

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [33]:
new_dataset.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [34]:
# Since this approach worked for cast, let's try something similar for the crew column and observe the result.
# But first, let's look at the crew column.

new_dataset["crew"][0] # crew of the first movie

# It is a huge list, but we are only interested in the entries whose job is "Director", since the director's name
# will surely contribute to the tags column.

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [35]:
# As usual, build a function to get the desired result

import ast


def crew(obj):
    # keep only the names of crew members whose job is "Director"
    return [i["name"] for i in ast.literal_eval(obj) if i["job"] == "Director"]

In [36]:
# Apply the function and check that it works as required
new_dataset["crew"] = new_dataset["crew"].apply(crew)

# Observe the changes in the crew column
new_dataset["crew"]

# Wow, we did it again and again!

0                                [James Cameron]
1                               [Gore Verbinski]
2                                   [Sam Mendes]
3                            [Christopher Nolan]
4                               [Andrew Stanton]
                          ...                   
4804                          [Robert Rodriguez]
4805                              [Edward Burns]
4806                               [Scott Smith]
4807                               [Daniel Hsia]
4808    [Brian Herzlinger, Jon Gunn, Brett Winn]
Name: crew, Length: 4806, dtype: object
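
The four extraction functions follow the same pattern: parse the string, optionally filter, collect the names, optionally truncate. One possible refactor is a single parameterized helper. The name `extract_names` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
import ast

def extract_names(raw, limit=None, job=None):
    """Parse a stringified list of dicts and return the "name" values.

    limit: keep only the first `limit` names (as done for the cast).
    job:   keep only entries whose "job" equals this value (as done for the crew).
    """
    entries = ast.literal_eval(raw)
    if job is not None:
        entries = [e for e in entries if e.get("job") == job]
    names = [e["name"] for e in entries]
    return names if limit is None else names[:limit]

# Shortened, made-up samples in the same shape as the dataset columns.
crew_raw = '[{"job": "Editor", "name": "A"}, {"job": "Director", "name": "B"}]'
cast_raw = '[{"name": "P"}, {"name": "Q"}, {"name": "R"}, {"name": "S"}]'

print(extract_names(crew_raw, job="Director"))  # ['B']
print(extract_names(cast_raw, limit=3))         # ['P', 'Q', 'R']
```

With such a helper, each column could be cleaned with a call like `new_dataset["cast"].apply(extract_names, limit=3)`, since `Series.apply` forwards extra keyword arguments to the function.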

In [37]:
new_dataset.head()

# We have done a fabulous job so far, but we still have a lot of work to do to get our final cleaned dataset. Let's go!

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [38]:
# One thing you might have noticed: what if I search for Sam Mendes and the model suggests Sam Worthington?
# That is possible, because the vectorizer splits a name on spaces and treats "Sam" as a tag shared by both people.
# To avoid this, we remove the spaces so that each full name becomes a single tag.
# It is a very easy thing to do.

new_dataset["genres"] = new_dataset["genres"].apply(lambda x : [ i.replace(" " ,"") for i in x ])
new_dataset["keywords"] = new_dataset["keywords"].apply(lambda x : [ i.replace(" " ,"") for i in x ])
new_dataset["cast"] = new_dataset["cast"].apply(lambda x : [ i.replace(" " ,"") for i in x ])
new_dataset["crew"] = new_dataset["crew"].apply(lambda x : [ i.replace(" " ,"") for i in x ])

# Notice: "Science Fiction" becomes "ScienceFiction"

new_dataset.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
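
The effect of removing the spaces can be checked with a few lines of plain Python. The regular expression below is the default `token_pattern` of scikit-learn's `CountVectorizer`; the helper `tokens` is only for illustration.

```python
import re

# CountVectorizer's default token pattern: runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercased set of tokens found in `text`."""
    return set(TOKEN.findall(text.lower()))

with_spaces = tokens("Sam Mendes") & tokens("Sam Worthington")
without_spaces = tokens("SamMendes") & tokens("SamWorthington")

print(with_spaces)     # {'sam'}: the two people share a token
print(without_spaces)  # set(): each full name is its own token
```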


In [39]:
# Our final task is to split the overview into words and then merge all the columns into a new column called "tags".
# Let's look at the overview column first.

new_dataset["overview"][0] # overview of the first movie

# This is a string; we should convert it into a list so that it is easy to concatenate with the other lists.

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [40]:
new_dataset["overview"] = new_dataset["overview"].apply(lambda x : x.split())

In [41]:
new_dataset["overview"][0]

# Each word is now an individual element of a list. We can pick any element by its index, which shows that it is a
# list and no longer a string: new_dataset["overview"][0][5] --> 'paraplegic'


['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']

In [42]:
# Finally, we can merge all those important keywords into a separate column called "tags".

new_dataset['tags'] = new_dataset['overview'] + new_dataset['genres'] + new_dataset['keywords'] +new_dataset['cast'] + new_dataset['crew']

In [43]:
# Now that the tags column holds everything our model needs, the separate columns are redundant.
# We drop them and keep only three columns: movie_id (for fetching posters), title and tags.

new_dataset = new_dataset[["movie_id","title" ,"tags"]].copy()  # .copy() avoids a SettingWithCopyWarning below

# Turn the tags from a list into a string
new_dataset["tags"] = new_dataset["tags"].apply(lambda x : " ".join(x))

# Lowercase every letter (good practice, highly recommended)
new_dataset["tags"] = new_dataset["tags"].apply(lambda x : x.lower())

# Note:
# 1. from string to list: (lambda x : x.split())
# 2. from list to string: (lambda x : " ".join(x))
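
For a single row, the whole tag-building pipeline (split, concatenate, join, lowercase) looks like this. The values are a shortened, made-up row in the same shape as the list columns used here.

```python
# Made-up row in the same shape as the notebook's list columns.
overview = "A paraplegic Marine is dispatched to Pandora.".split()
genres = ["Action", "ScienceFiction"]
cast = ["SamWorthington"]
crew = ["JamesCameron"]

tags_list = overview + genres + cast + crew   # list concatenation
tags = " ".join(tags_list).lower()            # list -> single lowercase string

print(tags)
# a paraplegic marine is dispatched to pandora. action sciencefiction samworthington jamescameron
```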

In [44]:
new_dataset.head(5)

# That is what we wanted, and we did it! It took a lot of preprocessing, but we got there in the end!

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


## 2. Model Building :

In [45]:
# Note: this is not the first step in the plan of attack below, but it has to be done before vectorizing.

# To merge words that share a root (e.g. "love", "loved", "loving"), build a function that stems every word in the tags.

# step 1: import the library
import nltk

# step 2: import the class and create a stemmer
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

# build the function

def stemming(text):
    # stem each whitespace-separated word and join the stems back into one string
    return " ".join(ps.stem(word) for word in text.split())

In [46]:
# apply that function on tags column

new_dataset["tags"] = new_dataset["tags"].apply(stemming)


In [47]:
# Plan of attack:

# First we have to deal with the size of the vocabulary. Even though we have removed a lot of unused data from the
# dataset, the tags still contain far too many distinct words. We keep only the 5000 most frequent ones, so that
# if a user chooses a movie with the keywords "future" and "sciencefiction", the user also gets recommendations
# for movies that share those tags (for example time-travel movies).

# To achieve this we need to vectorize the text, here with CountVectorizer (bag of words).

# Basically we are building a 5000-dimensional space (one dimension per keyword) in which every movie is a vector
# whose components are the counts of those words in its tags.

# Why do we need text vectorization?

# Vectorization is the process of converting text data to numerical vectors.
# Those vectors are then used to build machine learning models. In other words, we extract numerical features
# from the text in order to build natural language processing models.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000 ,stop_words="english")

vectors = cv.fit_transform(new_dataset["tags"]).toarray()

In [48]:
# Shape of the vectorized matrix ----> expected (4806 rows (movies) x 5000 columns (most frequent words))

vectors.shape

(4806, 5000)
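
To make the counting concrete, here is a rough pure-Python sketch of a bag-of-words matrix on three made-up documents. It mimics the `max_features` idea (keep the most frequent words) but leaves out stop-word removal and the other details of `CountVectorizer`, so treat it as an illustration rather than a replacement.

```python
from collections import Counter

docs = [
    "space alien battle space",
    "alien romance",
    "space romance comedy",
]

# Vocabulary: the max_features most frequent words across all documents,
# listed in alphabetical order.
max_features = 3
totals = Counter(word for doc in docs for word in doc.split())
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary word.
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['alien', 'romance', 'space']
print(matrix)      # [[1, 0, 2], [1, 1, 0], [0, 1, 1]]
```

The words "battle" and "comedy" are dropped because they fall outside the three most frequent words, just as rare words fall outside the 5000 kept above.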

In [49]:
# Let's see those 5000 most frequent words
cv.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

# Two things I observed:
# 1. Numbers: we cannot remove them, because some movies are recognized by a number.
# 2. Words such as "accident" and "accidental" point to the same root, which is why the tags are stemmed first.

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '17th',
 '18',
 '18th',
 '18thcenturi',
 '19',
 '1910',
 '1920',
 '1930',
 '1940',
 '1944',
 '1950',
 '1950s',
 '1960',
 '1960s',
 '1970',
 '1970s',
 '1971',
 '1974',
 '1976',
 '1980',
 '1985',
 '1990',
 '1999',
 '19th',
 '19thcenturi',
 '20',
 '200',
 '2003',
 '2009',
 '20th',
 '21st',
 '23',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '70',
 '80',
 'aaron',
 'aaroneckhart',
 'abandon',
 'abduct',
 'abigailbreslin',
 'abil',
 'abl',
 'aboard',
 'abov',
 'abus',
 'academ',
 'academi',
 'accept',
 'access',
 'accid',
 'accident',
 'acclaim',
 'accompani',
 'accomplish',
 'account',
 'accus',
 'ace',
 'achiev',
 'acquaint',
 'act',
 'action',
 'actionhero',
 'activ',
 'activist',
 'activities',
 'actor',
 'actress',
 'actual',
 'ad',
 'adam',
 'adamsandl',
 'adamshankman',
 'adapt',
 'add',
 'addict',
 'adjust',
 'admir',
 'admit',
 'adolesc',
 'adopt',
 'ador',
 'adrienbrodi',
 'adult'

In [50]:
# Check the size of the vocabulary after stemming; earlier it was 5000

len(cv.get_feature_names_out())

# And of course it is the same, because max_features caps the vocabulary at 5000 words, stemmed or not.

5000

In [52]:
# Plan of attack for model building / recommendation algorithm:

# Step 1: We have 4806 rows (movies), each a vector in a 5000-dimensional word space. We need to calculate the angle
# between those vectors. The smaller the angle, the more similar the movies, and vice versa.

# Step 2: To achieve this we can use cosine similarity from sklearn.

# Step 3: Build a function that recommends the 5 most similar movies.

# Step 4: Try it out.
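
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. A small pure-Python sketch on made-up count vectors (the helper `cosine` and the three vectors are illustrative only):

```python
import math

def cosine(u, v):
    """cos(angle) between two equal-length vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

movie_a = [1, 0, 2]
movie_b = [2, 0, 4]   # same direction as movie_a, just longer
movie_c = [0, 3, 0]   # no words in common with movie_a

print(round(cosine(movie_a, movie_b), 6))  # 1.0
print(cosine(movie_a, movie_c))            # 0.0
```

Because only the direction matters, a movie with a long tag string is not favoured over one with a short tag string.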

In [54]:
# step 1 and 2 :

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)

In [56]:
# This shape makes sense because every movie is compared with every movie (including itself)
similarity.shape

(4806, 4806)

In [59]:
# Step 3 : 

def recommender(movie):
    
    # Find the row position of the movie. similarity is indexed by position, while the DataFrame index
    # has gaps after the NA rows were dropped, so convert the index label to a position.
    label = new_dataset[new_dataset["title"]== movie].index[0]
    movies_index = new_dataset.index.get_loc(label)
    
    # Similarity scores of that movie with all movies
    distances = similarity[movies_index]
    
    # Pair each score with its position, sort by score in descending order and take the top 5,
    # skipping the first entry, which is the movie itself
    movies_list = sorted(list(enumerate(distances)) ,reverse= True , key = lambda x : x[1])[1:6]
    
    for i in movies_list:
        
        # Print the title of each recommended movie, looked up by position
        print(new_dataset.iloc[i[0]].title)

In [62]:
# step 4 :

recommender("Gandhi")

Gandhi, My Father
Guiana 1838
The Wind That Shakes the Barley
Mr. Turner
A Passage to India
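
The ranking step can be tried on a tiny example without pandas. The titles and the similarity matrix below are invented, and the helper `top_k` is a sketch of the same idea as `recommender`; it excludes the query movie by position instead of assuming it sorts first.

```python
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Hypothetical symmetric similarity matrix; row i holds movie i's scores.
sim_matrix = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]

def top_k(title, k=2):
    """Titles of the k movies most similar to `title`, excluding itself."""
    position = titles.index(title)
    scored = [(i, s) for i, s in enumerate(sim_matrix[position]) if i != position]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scored[:k]]

print(top_k("Movie A"))  # ['Movie B', 'Movie D']
```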


In [63]:
# our final task is to pickle this model
import pickle

In [64]:
# Finally, dump everything into two files ("wb" mode opens the file for writing in binary format)

# file 1: information about the movies (movie_id, title, tags)
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new_dataset, f)

# file 2: similarity scores between movies (for the recommender system)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
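
On the loading side (for example in the website code), the files are read back with `pickle.load` in "rb" mode. The round trip below uses a small made-up dictionary and a temporary directory, so it does not touch the real `.pkl` files.

```python
import os
import pickle
import tempfile

data = {"title": ["Movie A", "Movie B"], "tags": ["space alien", "romance"]}

# Write and read inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.pkl")
    with open(path, "wb") as handle:      # "wb": write binary
        pickle.dump(data, handle)
    with open(path, "rb") as handle:      # "rb": read binary
        restored = pickle.load(handle)

print(restored == data)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.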

## 3. Website Building (VS code) :


## 4. Model Deployment (Heroku) :

In [65]:
# file 3: the movie data as a plain dictionary, since the DataFrame itself was not compatible for the movies list
with open('movies_dict.pkl', 'wb') as f:
    pickle.dump(new_dataset.to_dict(), f)