## **Problem Statement**

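Our objective is to build a movie recommender system using a content-based approach: a movie is recommended because its descriptive metadata (overview, genres, keywords, cast, and director) resembles that of a movie the user picks.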

## **Source**

https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

In [1]:
import pandas as pd
import numpy as np
import ast
import string  
import nltk

In [2]:
df = pd.read_csv('tmdb_5000_credits.csv')
df1 = pd.read_csv('tmdb_5000_movies.csv')

In [3]:
df.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [4]:
df1.values

array([[237000000,
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]',
        'http://www.avatarmovie.com/', ..., 'Avatar', 7.2, 11800],
       [300000000,
        '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]',
        'http://disney.go.com/disneypictures/pirates/', ...,
        "Pirates of the Caribbean: At World's End", 6.9, 4500],
       [245000000,
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 80, "name": "Crime"}]',
        'http://www.sonypictures.com/movies/spectre/', ..., 'Spectre',
        6.3, 4466],
       ...,
       [0,
        '[{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}, {"id": 10749, "name": "Romance"}, {"id": 10770, "name": "TV Movie"}]',
        'http://www.hallmarkchannel.com/signedsealeddelivered', ...,
        'Signed, Sealed, Delivered', 7.0, 6],
       [0, '[]', 'http://s

In [5]:
movies = df.merge(df1, on = 'title')
movies.shape


(4809, 23)
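
A merge on `title` pairs every row in one frame with every same-titled row in the other, so repeated titles can multiply rows. The sketch below uses made-up toy frames (the names `credits`, `details` and all values are hypothetical, not taken from the dataset) to show the effect, along with one possible alternative that joins on the id columns instead.

```python
import pandas as pd

# Hypothetical toy data: two different movies share the title "Batman".
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})
details = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "overview": ["overview A", "overview B", "overview C"],
})

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = credits.merge(details, on="title")
print(len(by_title))  # 5

# Merging on the id columns keeps one row per movie.
by_id = credits.merge(details, left_on="movie_id", right_on="id")
print(len(by_id))  # 3
```

Whether this matters for the recommender depends on how many titles repeat in the real data; `movies['title'].duplicated().sum()` would be one way to check.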

**Feature Selection**

We keep only the features the project needs: those that help build descriptive tags for each movie, which the recommender will then compare.

In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movie_id              4809 non-null   int64  
 1   title                 4809 non-null   object 
 2   cast                  4809 non-null   object 
 3   crew                  4809 non-null   object 
 4   budget                4809 non-null   int64  
 5   genres                4809 non-null   object 
 6   homepage              1713 non-null   object 
 7   id                    4809 non-null   int64  
 8   keywords              4809 non-null   object 
 9   original_language     4809 non-null   object 
 10  original_title        4809 non-null   object 
 11  overview              4806 non-null   object 
 12  popularity            4809 non-null   float64
 13  production_companies  4809 non-null   object 
 14  production_countries  4809 non-null   object 
 15  release_date         

In [7]:
movies.isnull().sum()

movie_id                   0
title                      0
cast                       0
crew                       0
budget                     0
genres                     0
homepage                3096
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
vote_average               0
vote_count                 0
dtype: int64

In [8]:
# Required columns for the recommender system. 
# movie_id, title, genres, overview, keywords,  cast, crew

movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [9]:
movies.shape

(4809, 7)

In [10]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [11]:
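movies.dropna(inplace=True)  # Remove rows with null values (the 3 missing overviews)
# reset_index returns a new DataFrame; the result is only displayed here, not assigned,
# so `movies` itself keeps its original index labels.
movies.reset_index(drop = True)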

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...
4801,9367,El Mariachi,El Mariachi just wants to play his guitar and ...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...","[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4802,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4803,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4804,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...,[],[],"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [12]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [13]:
movies.duplicated().sum()

0

Next we build a single tag for each movie by joining overview, keywords, genres, cast, and crew. The final dataset will then have only three columns: 'movie_id', 'title', and 'tags'.

In [14]:
movies.iloc[0]['genres']

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [15]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4806 non-null   int64 
 1   title     4806 non-null   object
 2   overview  4806 non-null   object
 3   genres    4806 non-null   object
 4   keywords  4806 non-null   object
 5   cast      4806 non-null   object
 6   crew      4806 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.4+ KB


All columns except movie_id are of object type. The genres, keywords, cast, and crew columns hold strings that look like lists of dictionaries, so they cannot be joined directly. We first need to extract the names, e.g. the genre names. To do that, we convert each string to a real list using the 'ast' library, then collect the 'name' values, and apply this to the whole column.
**The function ast.literal_eval from Python's ast (Abstract Syntax Trees) module is used to safely evaluate an expression node or a string containing a Python literal or container display. Unlike eval, it only evaluates expressions that are Python literals, such as strings, numbers, tuples, lists, dicts, booleans, and None. This makes it safer to use for parsing strings from untrusted sources because it doesn't execute arbitrary code.**
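
As a minimal sketch of that conversion, the snippet below parses one string shaped like a `genres` cell and pulls out the names. The variable names (`raw`, `parsed`, `names`, `rejected`) are illustrative only.

```python
import ast

# A string in the same shape as one `genres` cell.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)               # str -> list of dicts
names = [item["name"] for item in parsed]    # keep only the genre names
print(names)  # ['Action', 'Adventure']

# Input that is not a plain literal is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print(rejected)  # True
```

Because these columns are stored as JSON text, `json.loads` would likely work as well; `ast.literal_eval` is the approach this notebook uses.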

In [16]:
# def conversion(col):
#     New = []
#     for i in ast.literal_eval(col):
#         New.append(i['name'])
#     return New
# movies['genres'] = movies['genres'].apply(conversion)

movies['genres'] = movies['genres'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

In [17]:
movies['keywords'] = movies['keywords'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

In [18]:
movies['keywords']

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [19]:
# We focus on only the first three actors, so we take the names of the top 3 cast members of each movie.
def conversion2(col):
    # Parse the string into a list of dicts, then keep the names of the first three entries.
    return [i['name'] for i in ast.literal_eval(col)[:3]]


In [20]:
movies['cast'] = movies['cast'].apply(conversion2)

In [21]:
movies['cast']

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [22]:
movies['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [23]:
# def conversion3(col):
#     New2 = []
#     for i in ast.literal_eval(col):
#         if i['job']=='Director':
#             New2.append(i['name'])
#     return New2
# movies['crew'] = movies['crew'].apply(conversion3)

movies['crew'] = movies['crew'].apply(lambda x: [i['name'] for i in ast.literal_eval(x) if i['job']=='Director'])

In [24]:
movies['crew']

0                                [James Cameron]
1                               [Gore Verbinski]
2                                   [Sam Mendes]
3                            [Christopher Nolan]
4                               [Andrew Stanton]
                          ...                   
4804                          [Robert Rodriguez]
4805                              [Edward Burns]
4806                               [Scott Smith]
4807                               [Daniel Hsia]
4808    [Brian Herzlinger, Jon Gunn, Brett Winn]
Name: crew, Length: 4806, dtype: object

In [25]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [26]:
# Next: split the overview into a list of words, and remove the spaces inside multi-word names and terms
# (e.g. 'Sam Worthington' -> 'SamWorthington') so that each full name, surname included, is treated as a single token.

In [27]:
# Remove punctuation from the overview first; otherwise commas and periods stay attached to words after splitting and cause problems in later steps.
movies['overview'] = movies['overview'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
movies['overview'] = movies['overview'].apply(lambda x: x.split())
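
To see what this cell does to a single overview, here is a small standalone sketch. The sentence is made up for illustration; the second half shows that `string.punctuation` covers ASCII punctuation only, which is consistent with the curly apostrophe in "Bond’s" surviving in the output further down.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
table = str.maketrans("", "", string.punctuation)

text = "A war-weary captain, lost on Mars!"   # hypothetical overview text
tokens = text.translate(table).split()
print(tokens)  # ['A', 'warweary', 'captain', 'lost', 'on', 'Mars']

# string.punctuation is ASCII only, so a curly apostrophe (U+2019) is kept.
curly = "Bond\u2019s past"
print(curly.translate(table) == curly)  # True
```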

In [28]:
movies['genres'] = movies['genres'].apply(lambda x : [i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])

In [29]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century, a, paraplegic, Marine...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa, long, believed, to, be, de...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, warweary, former, milita...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [30]:
# creating 'tags' column in the dataset
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [31]:
movies.shape

(4806, 8)

In [32]:
# creating a new dataframe with only movie_id, title, tags
new_movies = movies[['movie_id', 'title', 'tags']]

In [33]:
new_movies.iloc[4805]

movie_id                                                25975
title                                       My Date with Drew
tags        [Ever, since, the, second, grade, when, he, fi...
Name: 4808, dtype: object

In [34]:
new_movies['tags']

0       [In, the, 22nd, century, a, paraplegic, Marine...
1       [Captain, Barbossa, long, believed, to, be, de...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, warweary, former, milita...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couples, honeymoon, is, upended,...
4806    [Signed, Sealed, Delivered, introduces, a, ded...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: tags, Length: 4806, dtype: object

In [35]:
# joining each list of tags into a single space-separated string
new_movies['tags'] = new_movies['tags'].apply(lambda x: " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_movies['tags'] = new_movies['tags'].apply(lambda x: " ".join(x))
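
The message above is pandas' SettingWithCopyWarning: `new_movies` was created by selecting columns from `movies`, and pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched here on a hypothetical stand-in frame (`movies_demo` and its values are made up), is to take an explicit copy when creating the subset.

```python
import pandas as pd

# Hypothetical stand-in for the `movies` frame.
movies_demo = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "tags": [["x", "y"], ["z"]],
    "extra": [0, 0],
})

# .copy() makes the column selection an independent DataFrame,
# so later assignments clearly target `subset` only.
subset = movies_demo[["movie_id", "title", "tags"]].copy()
subset["tags"] = subset["tags"].apply(" ".join)

print(subset["tags"].tolist())       # ['x y', 'z']
print(movies_demo["tags"].tolist())  # [['x', 'y'], ['z']]
```

Applied to this notebook, that would mean adding `.copy()` where `new_movies` is created; the results shown here would stay the same.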


In [36]:
new_movies['tags'][0]

'In the 22nd century a paraplegic Marine is dispatched to the moon Pandora on a unique mission but becomes torn between following orders and protecting an alien civilization Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [37]:
# converting them to lower case
new_movies['tags'] = new_movies['tags'].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_movies['tags'] = new_movies['tags'].apply(lambda x: x.lower())


In [38]:
print(new_movies['tags'][0])
print()
print('**********')
print()
print(new_movies['tags'][1])

in the 22nd century a paraplegic marine is dispatched to the moon pandora on a unique mission but becomes torn between following orders and protecting an alien civilization action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron

**********

captain barbossa long believed to be dead has come back to life and is headed to the edge of the earth with will turner and elizabeth swann but nothing is quite as it seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski


## **Word Vectorization**

We will convert the tags text of every movie into a vector, giving 4806 vectors in total. To recommend movies, we look for the vectors (that is, the movies) closest to the chosen one.

We will use the bag-of-words technique for this.

First, the words from the tags of all movies are pooled into a vocabulary. Then the frequency of each vocabulary word is counted in each tag. For example, if the first word w1 is 'romance', we count how many times it occurs in tag 1, then in tag 2, and so on; the same is done for every other word. The resulting counts are the coordinates of each tag in space, so each row is itself a vector. As a simple illustration, with a vocabulary of 3 words the counts for one tag, e.g. (5, 4, 3), place it in 3-dimensional space. Here we keep 5000 words, so the space has 5000 dimensions. We also need to remove stop words: words used to form sentences that contribute little to their meaning, e.g. are, is, of, in, am.

scikit-learn provides a class for this: 'CountVectorizer'.
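
The counting idea can be sketched in a few lines of plain Python before handing the job to the library. The tags and the tiny stop-word set below are made up for illustration, and this is not how `CountVectorizer` is implemented internally; it only mirrors the description above.

```python
from collections import Counter

# Hypothetical tags for three movies.
tags = [
    "space alien romance in space",
    "pirate ship romance",
    "alien space battle",
]
stop_words = {"the", "is", "of", "in", "a"}  # tiny illustrative list

# Count the words of each tag, skipping stop words.
counts = [Counter(w for w in t.split() if w not in stop_words) for t in tags]

# The vocabulary is every distinct word seen in any tag, in a fixed order.
vocab = sorted(set().union(*counts))

# Each tag becomes a vector of word counts, one coordinate per vocabulary word.
vectors = [[c[w] for w in vocab] for c in counts]

print(vocab)    # ['alien', 'battle', 'pirate', 'romance', 'ship', 'space']
print(vectors)  # [[1, 0, 0, 1, 0, 2], [0, 0, 1, 1, 1, 0], [1, 1, 0, 0, 0, 1]]
```

`CountVectorizer` does the same kind of counting, and with `max_features` it keeps only the most frequent terms across all tags.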

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [40]:
cv = CountVectorizer(max_features = 5000, stop_words = 'english')

In [41]:
vect = cv.fit_transform(new_movies['tags']).toarray()

In [42]:
vect.shape

(4806, 5000)

In [43]:
cv.get_feature_names()  # removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out()
# Some closely related words are counted separately, e.g. (actor, actors), (activity, activities),
# so we will apply stemming, which reduces words to their root form. The nltk library provides this.



['007',
 '10',
 '100',
 '10yearold',
 '11',
 '12',
 '12yearold',
 '13',
 '14',
 '15',
 '16',
 '16yearold',
 '17',
 '17yearold',
 '18',
 '18th',
 '18thcentury',
 '1930s',
 '1940s',
 '1950s',
 '1960s',
 '1970s',
 '1980s',
 '1985',
 '1990s',
 '19th',
 '19thcentury',
 '20',
 '2003',
 '2009',
 '20th',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '60s',
 '70s',
 'aaron',
 'aaroneckhart',
 'abandoned',
 'abducted',
 'abigailbreslin',
 'abilities',
 'ability',
 'able',
 'aboard',
 'abuse',
 'abusive',
 'academic',
 'academy',
 'accept',
 'accepted',
 'accepts',
 'access',
 'accident',
 'accidental',
 'accidentally',
 'accompanied',
 'accomplish',
 'account',
 'accountant',
 'accused',
 'ace',
 'achieve',
 'act',
 'acting',
 'action',
 'actionhero',
 'actionpacked',
 'actions',
 'activist',
 'activities',
 'activity',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adams',
 'adamsandler',
 'adamshankman',
 'adaptation',
 'adapted',
 'addict',
 'addiction',
 'a

In [44]:
from nltk.stem.porter import PorterStemmer #This imports the PorterStemmer class from the nltk.stem.porter module.
ps= PorterStemmer()   # Creating an Instance of PorterStemmer

In [45]:
def stem(text):
    final= []
    for i in text.split():
        final.append(ps.stem(i))
    return " ".join(final)

In [46]:
new_movies['tags'] = new_movies['tags'].apply(stem)
# new_movies['tags'] = new_movies['tags'].apply(lambda x: [ps.stem(i)])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_movies['tags'] = new_movies['tags'].apply(stem)


In [47]:
new_movies['tags']

0       in the 22nd centuri a parapleg marin is dispat...
1       captain barbossa long believ to be dead ha com...
2       a cryptic messag from bond’ past send him on a...
3       follow the death of district attorney harvey d...
4       john carter is a warweari former militari capt...
                              ...                        
4804    el mariachi just want to play hi guitar and ca...
4805    a newlyw coupl honeymoon is upend by the arriv...
4806    sign seal deliv introduc a dedic quartet of ci...
4807    when ambiti new york attorney sam is sent to s...
4808    ever sinc the second grade when he first saw h...
Name: tags, Length: 4806, dtype: object

In [48]:
# Again doing the vectorization
vector = cv.fit_transform(new_movies['tags']).toarray()

In [49]:
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [50]:
cv.get_feature_names()  # removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out()



['007',
 '10',
 '100',
 '10yearold',
 '11',
 '12',
 '12yearold',
 '13',
 '14',
 '15',
 '150',
 '16',
 '16yearold',
 '17',
 '17yearold',
 '18',
 '18th',
 '18thcenturi',
 '1910',
 '1920',
 '1930',
 '1940',
 '1950',
 '1960',
 '1970',
 '1974',
 '1976',
 '1980',
 '1985',
 '1990',
 '1999',
 '19th',
 '19thcenturi',
 '20',
 '2000',
 '2003',
 '2009',
 '20th',
 '21stcenturi',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '60',
 '70',
 '911',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'abandon',
 'abbi',
 'abduct',
 'abigailbreslin',
 'abil',
 'abl',
 'aboard',
 'aborigin',
 'abov',
 'absenc',
 'abus',
 'academ',
 'academi',
 'accept',
 'access',
 'accid',
 'accident',
 'acclaim',
 'accompani',
 'accomplish',
 'account',
 'accus',
 'ace',
 'achiev',
 'acquaint',
 'act',
 'action',
 'actionhero',
 'actionpack',
 'activ',
 'activist',
 'actor',
 'actress',
 'actual',
 'ad',
 'adam',
 'adamsandl',
 'adamshankman',
 'adapt',
 'add',
 'addict',
 'addit',
 'adjust',
 'admir',
 'admit',
 'adolesc'

In [51]:
vector.shape

(4806, 5000)

Now that every movie is represented as a vector, we will compute the distance between each movie and every other movie. Movies separated by smaller distances are more similar to each other.
How should that distance be calculated? Note that we will not use the Euclidean distance, which is defined as follows.
## Euclidean Distance Formula

For two points in $n$-dimensional space:

$p = (p_1, p_2, \ldots, p_n)$

$q = (q_1, q_2, \ldots, q_n)$

The Euclidean distance is:

$$ d(p, q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2} $$

The higher the dimensionality, the less reliable Euclidean distance becomes as a measure of similarity.
Instead we will use a measure based on the angle $\theta$ between two vectors: the smaller the angle, the more similar the movies.
scikit-learn provides the function cosine_similarity, which returns $\cos\theta$. Cosine distance is its complement (1 - cosine similarity), so the higher the similarity, the smaller the distance and the smaller the angle between the vectors.
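
For two vectors $p$ and $q$, cosine similarity is $\cos\theta = \dfrac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}$. The pure-Python sketch below (the helper names and the toy count vectors are made up for illustration) compares it with Euclidean distance on one specific case: two tags with the same mix of words but very different lengths.

```python
import math

def cosine_sim(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

short_tag = [1, 2, 0]    # hypothetical word counts
long_tag = [10, 20, 0]   # same word mix, ten times as much text
other_tag = [0, 1, 2]    # a different word mix

print(cosine_sim(short_tag, long_tag))   # close to 1.0: same direction
print(euclidean(short_tag, long_tag))    # about 20.1: far apart despite the same mix
print(cosine_sim(short_tag, other_tag))  # close to 0.4
```

In this toy case the angle ignores how long a tag is and looks only at which words it uses, which is one reason it suits word-count vectors.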

In [52]:
from sklearn.metrics.pairwise import cosine_similarity

In [53]:
similarity = cosine_similarity(vector)

In [54]:
similarity.shape  # similarity of each vector with each vector

(4806, 4806)

In [55]:
similarity
# The first row holds the similarity of the 1st movie with every movie. Its first element is 1 because that is the
# similarity of the movie with itself; the next element is its similarity with the second movie, then the third, and so on.
# For the same reason, the matrix has 1 along its diagonal.

array([[1.        , 0.08238526, 0.08492078, ..., 0.06419407, 0.02360961,
        0.        ],
       [0.08238526, 1.        , 0.06063391, ..., 0.02291746, 0.        ,
        0.02711631],
       [0.08492078, 0.06063391, 1.        , ..., 0.04724556, 0.        ,
        0.        ],
       ...,
       [0.06419407, 0.02291746, 0.04724556, ..., 1.        , 0.03940552,
        0.02112886],
       [0.02360961, 0.        , 0.        , ..., 0.03940552, 1.        ,
        0.06993786],
       [0.        , 0.02711631, 0.        , ..., 0.02112886, 0.06993786,
        1.        ]])

In [56]:
similarity[0] # first array. first movie

array([1.        , 0.08238526, 0.08492078, ..., 0.06419407, 0.02360961,
       0.        ])

## Creating Function for Recommendation

In [57]:
from operator import itemgetter

In [83]:
# Row position of the movie. A position is used rather than an index label because `similarity`
# is indexed by position and the DataFrame index has gaps after dropna.
new_movies[new_movies['title']=='Avatar'] # Masking
ind = np.flatnonzero(new_movies['title'].to_numpy() == 'Avatar')[0] # position finding
similarity[ind] # similarity row for that movie
mov = sorted(enumerate(similarity[ind]), reverse= True, key =itemgetter(1))[1:6]  # top 5, skipping the movie itself
print(mov)
mov_ind = [i[0] for i in mov]
print(mov_ind)
for i in mov_ind:
    print(new_movies.iloc[i]['title'])

[(2409, 0.26257545381445874), (539, 0.24715576637149034), (507, 0.24602771043141894), (1216, 0.23878346647045964), (1204, 0.23870495801314426)]
[2409, 539, 507, 1216, 1204]
Aliens
Titan A.E.
Independence Day
Aliens vs Predator: Requiem
Predators


In [98]:
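def recommendation(movie):
    # Row position of the movie; `similarity` is indexed by position, not by index label.
    index = np.flatnonzero(new_movies['title'].to_numpy() == movie)[0]
    distance = similarity[index]  # similarity scores (higher means more similar)
    movie_list = sorted(enumerate(distance), reverse= True, key =itemgetter(1))[1:6]  # top 5, skipping the movie itself
    print(movie_list)
    for i in movie_list:
        print(new_movies.iloc[i[0]]['movie_id'])
        print(new_movies.iloc[i[0]]['title'])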

In [99]:
recommendation("Tangled")

[(269, 0.20868250309207576), (2300, 0.20597661872511305), (1701, 0.2016397831543608), (255, 0.19596545041740515), (391, 0.18814909988405623)]
10198
The Princess and the Frog
129
Spirited Away
812
Aladdin
13700
Home on the Range
4523
Enchanted
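
The ranking step can also be written so that it returns a list and copes with an unknown title. The sketch below is a standalone variant on made-up toy data (`titles`, `sim` and `recommend` are hypothetical names), not a replacement for the function above.

```python
from operator import itemgetter

# Hypothetical titles and a matching 4x4 similarity matrix.
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]

def recommend(title, k=2):
    """Return the k most similar titles, or [] if the title is unknown."""
    if title not in titles:
        return []
    pos = titles.index(title)  # positional row in `sim`
    ranked = sorted(enumerate(sim[pos]), key=itemgetter(1), reverse=True)
    # Skip the movie itself by position rather than assuming it sorts first.
    return [titles[i] for i, _ in ranked if i != pos][:k]

print(recommend("A"))        # ['C', 'B']
print(recommend("Unknown"))  # []
```

Returning the titles instead of printing them makes the function easier to reuse, for example from an application that loads the saved similarity matrix.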


In [70]:
import pickle

In [73]:
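# Save the movie table as a plain dict; the `with` block closes the file afterwards.
with open('Movies_list.pkl', 'wb') as f:
    pickle.dump(new_movies.to_dict(), f)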

In [84]:
# Save the similarity matrix; the `with` block closes the file afterwards.
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
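
Reading the saved files back is the mirror image of the dump. The sketch below round-trips a small made-up dict through a temporary file so that it runs on its own; the payload only imitates the shape that `DataFrame.to_dict()` produces.

```python
import os
import pickle
import tempfile

# Hypothetical payload shaped like DataFrame.to_dict(): {column: {index: value}}.
payload = {"title": {0: "Avatar", 1: "Spectre"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Movies_list.pkl")
    with open(path, "wb") as fh:
        pickle.dump(payload, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == payload)  # True
```

With the real files, passing the loaded dict to `pd.DataFrame(...)` should rebuild the movie table, and the loaded similarity matrix can be indexed by row position as before.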