- The movie dataset downloaded from Kaggle is placed in [Data](../Data)
- The same dataset can also be downloaded from Kaggle directly with Python:
    ```py
    import kagglehub

    # Download latest version
    path = kagglehub.dataset_download("tmdb/tmdb-movie-metadata")

    print("Path to dataset files:", path)
    ```
- We keep a copy of the files in our repo, so that the data stays accessible even if the dataset is modified or removed on Kaggle
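
Keeping a local copy can be sketched as below. The helper name `copy_dataset_files` and the `*.csv` pattern are assumptions made for illustration; `source_dir` would be the `path` returned by `kagglehub.dataset_download`, and `target_dir` the repo's `Data` folder.

```python
import shutil
from pathlib import Path


def copy_dataset_files(source_dir, target_dir, pattern="*.csv"):
    """Copy every file matching `pattern` from source_dir into target_dir.

    Hypothetical helper: returns the sorted names of the copied files.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    copied = []
    for file in sorted(source.glob(pattern)):
        shutil.copy2(file, target / file.name)  # copy2 also keeps timestamps
        copied.append(file.name)
    return copied
```

A call such as `copy_dataset_files(path, "../Data")` would then place the CSV files next to the notebook's other data.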

In [1]:
import numpy as np
import pandas as pd
import sklearn

In [2]:
movies  = pd.read_csv('../Data/tmdb_5000_movies.csv')
credits = pd.read_csv('../Data/tmdb_5000_credits.csv')

In [3]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [5]:
type(movies)

pandas.core.frame.DataFrame

In [6]:
# Merging the two DataFrames on the 'title' column
movies = pd.merge(movies, credits, on="title")
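
One thing to watch when merging on `title`: titles are not guaranteed to be unique, and a merge on a repeated key pairs every matching row on the left with every matching row on the right. The toy frames below (made-up values, not rows from the dataset) show the effect, and how joining on the numeric id (`id` in movies, `movie_id` in credits) avoids it. This is a sketch of an alternative, not a change to the results shown in this notebook.

```python
import pandas as pd

# Toy data: two different movies share the title "Batman"
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Merging on the title pairs each "Batman" row with both credits rows
by_title = pd.merge(movies_demo, credits_demo, on="title")

# Merging on the id keeps exactly one row per movie
by_id = pd.merge(
    movies_demo,
    credits_demo.drop(columns="title"),  # avoid a duplicated title column
    left_on="id",
    right_on="movie_id",
)

print(len(by_title), len(by_id))
```

In the toy data the title merge yields five rows for three movies, while the id merge yields three.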

In [7]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

- We will build a set of tags for each movie and feed them to an ML model, so let us retain only the columns that are useful for that.
- Required columns:
    - id
        - the TMDB id, which we will use later during development to fetch details from the TMDB website
    - title
    - genres
    - keywords
    - overview
    - cast
    - crew

In [8]:
movies = movies[['id', 'title', 'genres', 'keywords', 'overview', 'cast', 'crew']]

In [9]:
movies.head(2) # Required columns

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [10]:
movies.shape

(4809, 7)

In [11]:
# Check for null values in the DataFrame
movies.isnull().sum()

id          0
title       0
genres      0
keywords    0
overview    3
cast        0
crew        0
dtype: int64

In [12]:
# There are 3 null values in the 'overview' column
# Drop those rows, because the overview is essential for building the tags
# Reassign instead of using inplace=True on a column slice, and reset the index so labels match row positions
movies = movies.dropna().reset_index(drop=True)

In [13]:
movies.isnull().sum()

id          0
title       0
genres      0
keywords    0
overview    0
cast        0
crew        0
dtype: int64

In [14]:
movies.shape

(4806, 7)
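
Dropping rows does not renumber the remaining ones: `dropna` keeps the original index labels, so labels and row positions drift apart unless the index is reset. This matters whenever a label taken from `.index` is later used as a position, for example to index a NumPy array. A small illustration with made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first", None, "third"],
})

cleaned = demo.dropna()
# Row "C" keeps label 2 although it is now at position 1
label_of_c = cleaned[cleaned["title"] == "C"].index[0]

renumbered = cleaned.reset_index(drop=True)
# After resetting, label and position agree again
position_of_c = renumbered[renumbered["title"] == "C"].index[0]

print(label_of_c, position_of_c)
```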

In [15]:
# Let us check for duplicate rows in the DataFrame
movies.duplicated().sum()

np.int64(0)

In [16]:
# Let us look at the raw genres value of one movie
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [17]:
# Convert the genres column from a stringified list of dictionaries to a plain list of names
import ast

def convert_to_list(obj):
    """Parse a stringified list of dictionaries and return the 'name' of each entry."""
    list_of_dicts = ast.literal_eval(obj)
    return [d['name'] for d in list_of_dicts]

In [18]:
convert_to_list(movies.iloc[0].genres)

['Action', 'Adventure', 'Fantasy', 'Science Fiction']
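
The raw values shown above are JSON text, so `json.loads` from the standard library is a natural alternative to `ast.literal_eval`. The sketch below (the function name `names_from_json` is made up for illustration) returns the same kind of list, and also copes with JSON literals such as `null`, which `ast.literal_eval` rejects because they are not Python literals.

```python
import ast
import json


def names_from_json(obj):
    """Parse a JSON array of objects and return each object's 'name'."""
    return [d["name"] for d in json.loads(obj)]


genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(names_from_json(genres_raw))

# A JSON null parses fine with json.loads ...
with_null = '[{"id": 1, "name": "Drama", "note": null}]'
print(names_from_json(with_null))

# ... but ast.literal_eval raises ValueError on it
try:
    ast.literal_eval(with_null)
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True
print(literal_eval_failed)
```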

In [19]:
movies['genres'].apply(convert_to_list).head(4)

0    [Action, Adventure, Fantasy, Science Fiction]
1                     [Adventure, Fantasy, Action]
2                       [Action, Adventure, Crime]
3                 [Action, Crime, Drama, Thriller]
Name: genres, dtype: object

In [20]:
movies['genres'] = movies['genres'].apply(convert_to_list)

In [21]:
movies.head(2)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [22]:
convert_to_list(movies.iloc[0].keywords)

['culture clash',
 'future',
 'space war',
 'space colony',
 'society',
 'space travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love affair',
 'anti war',
 'power relations',
 'mind and soul',
 '3d']

In [23]:
movies['keywords'].apply(convert_to_list).head(3)

0    [culture clash, future, space war, space colon...
1    [ocean, drug abuse, exotic island, east india ...
2    [spy, based on novel, secret agent, sequel, mi...
Name: keywords, dtype: object

In [24]:
movies['keywords'] = movies['keywords'].apply(convert_to_list)

In [25]:
movies.head(2)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [26]:
movies.iloc[0]['cast']

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [27]:
def convert_cast_to_list(obj):
    """Return the names of the first 3 cast members listed for a movie."""
    list_of_dicts = ast.literal_eval(obj)
    # Slicing is safe even when fewer than 3 cast members are listed
    return [character_info['name'] for character_info in list_of_dicts[:3]]


In [28]:
convert_cast_to_list(movies.iloc[0]['cast'])

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

In [29]:
# Keeping only the top 3 cast members in the dataset
movies['cast'] = movies['cast'].apply(convert_cast_to_list)

In [30]:
movies.head(2)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [31]:
# From the crew, we only care about the director
movies.iloc[0]['crew']

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [32]:
def get_director_from_crew(obj):
    list_of_dicts = ast.literal_eval(obj)
    for d in list_of_dicts:
        if d['job'] == 'Director':
            return [d['name']]  # a one-element list, so it can be concatenated with the other tag lists
    return []  # no director information available

In [33]:
get_director_from_crew(movies.iloc[0]['crew'])

['James Cameron']

In [34]:
movies['crew'] = movies['crew'].apply(get_director_from_crew) # let us consider only director in crew

In [35]:
movies.head(1)

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [36]:
movies.overview.head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [37]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [38]:
# Remove all ASCII punctuation (characters outside string.punctuation, such as the curly apostrophe ’, are left in place)
movies['overview'] = movies['overview'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

movies['overview'].head()

0    In the 22nd century a paraplegic Marine is dis...
1    Captain Barbossa long believed to be dead has ...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a warweary former military capt...
Name: overview, dtype: object
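
`string.punctuation` only lists ASCII characters, which is why the typographic apostrophe in "Bond’s" survives in row 2 above. If that matters, one option is to also drop every character whose Unicode category is a punctuation class. The helper below is a sketch of that idea (the name `strip_punctuation` is an assumption, not part of the notebook):

```python
import string
import unicodedata

ASCII_PUNCTUATION = set(string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation plus any Unicode punctuation character."""
    return "".join(
        ch for ch in text
        if ch not in ASCII_PUNCTUATION
        # Unicode categories starting with "P" are punctuation (Po, Pd, Pf, ...)
        and not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("A cryptic message from Bond’s past, a war-weary captain."))
```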

In [39]:
# Split each overview into a list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [40]:
movies.head()

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[In, the, 22nd, century, a, paraplegic, Marine...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Captain, Barbossa, long, believed, to, be, de...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[A, cryptic, message, from, Bond’s, past, send...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Following, the, death, of, District, Attorney...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[John, Carter, is, a, warweary, former, milita...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [41]:
# We have to remove the spaces inside each multi-word entity in genres, keywords, cast and crew.
# Otherwise a name is split into separate words, and the model can no longer tell which words belong to the same entity.
# Example: 'Sam Worthington' and 'Sam Mendes' would both contribute the word 'Sam', making two unrelated people look similar.

In [42]:
movies['cast'].head(2)

0    [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1       [Johnny Depp, Orlando Bloom, Keira Knightley]
Name: cast, dtype: object

In [43]:
movies['cast'].apply(lambda c:[obj.replace(" ", "") for obj in c]).head(2)

0    [SamWorthington, ZoeSaldana, SigourneyWeaver]
1       [JohnnyDepp, OrlandoBloom, KeiraKnightley]
Name: cast, dtype: object
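
To see why the spaces are removed, compare the word-level tokens two names share before and after joining them. The sketch uses plain `str.split` as a stand-in for the tokenizer applied later, which is an assumption made to keep the example short.

```python
def tokens(names):
    """Lowercase every name and split it on whitespace into a set of tokens."""
    return {token for name in names for token in name.lower().split()}


cast_a = ["Sam Worthington"]
crew_b = ["Sam Mendes"]

# With spaces, the two unrelated people share the token "sam"
shared_with_spaces = tokens(cast_a) & tokens(crew_b)

# Without spaces, each person becomes a single distinct token
joined_a = [name.replace(" ", "") for name in cast_a]
joined_b = [name.replace(" ", "") for name in crew_b]
shared_without_spaces = tokens(joined_a) & tokens(joined_b)

print(shared_with_spaces, shared_without_spaces)
```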

In [44]:
movies['cast'] = movies['cast'].apply(lambda c:[obj.replace(" ", "") for obj in c])

In [45]:
movies['genres'] = movies['genres'].apply(lambda c:[obj.replace(" ", "") for obj in c])

In [46]:
movies['crew'] = movies['crew'].apply(lambda c:[obj.replace(" ", "") for obj in c])

In [47]:
movies['keywords'] = movies['keywords'].apply(lambda c:[obj.replace(" ", "") for obj in c])

In [48]:
movies.head()

Unnamed: 0,id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century, a, paraplegic, Marine...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain, Barbossa, long, believed, to, be, de...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[John, Carter, is, a, warweary, former, milita...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [49]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [50]:
movies['tags'].head(2)

0    [In, the, 22nd, century, a, paraplegic, Marine...
1    [Captain, Barbossa, long, believed, to, be, de...
Name: tags, dtype: object

In [51]:
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x))

In [52]:
movies = movies.drop(columns=['keywords', 'genres', 'crew', 'cast', 'overview'])

In [53]:
movies.head(2)

Unnamed: 0,id,title,tags
0,19995,Avatar,In the 22nd century a paraplegic Marine is dis...
1,285,Pirates of the Caribbean: At World's End,Captain Barbossa long believed to be dead has ...


In [54]:
# Converting everything to lowercase
movies['tags'] = movies['tags'].apply(lambda x: x.lower())

In [55]:
movies.head(2)

Unnamed: 0,id,title,tags
0,19995,Avatar,in the 22nd century a paraplegic marine is dis...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believed to be dead has ...


In [56]:
movies.iloc[0]['tags']

'in the 22nd century a paraplegic marine is dispatched to the moon pandora on a unique mission but becomes torn between following orders and protecting an alien civilization action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [57]:
movies.iloc[1]['tags']

"captain barbossa long believed to be dead has come back to life and is headed to the edge of the earth with will turner and elizabeth swann but nothing is quite as it seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

- We will use the Bag of Words (BoW) model to convert the tags into vectors
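
As a rough picture of what Bag of Words does: build a vocabulary from all tag strings, then describe each movie by how often each vocabulary word occurs in its tags. The hand-rolled version below is only a sketch on made-up tags; the names `build_vocabulary` and `to_vector` are hypothetical, and a library vectorizer additionally handles tokenization rules, stop words and vocabulary size limits.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})


def to_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


tags = [
    "space alien action space",
    "pirate ocean action",
]
vocabulary = build_vocabulary(tags)
vectors_demo = [to_vector(doc, vocabulary) for doc in tags]

print(vocabulary)
print(vectors_demo)
```

Each position in a vector corresponds to one vocabulary word, so two movies with overlapping tags end up with non-zero counts in the same positions.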

In [58]:
# Problem: inflected forms such as activities, activity, activity's and activitys
# should all count as the same word, so we reduce each word to its stem

import nltk # famous NLP library
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [59]:
print(ps.stem('activities'), ps.stem('activity'), ps.stem('activitys'))

activ activ activ


In [60]:
print(ps.stem("Bond's"), ps.stem("Bond's"), ps.stem("bond's"), ps.stem("bond"), ps.stem("bonding"))

bond' bond' bond' bond bond


In [61]:
print(ps.stem('love'), ps.stem('loved'), ps.stem('loving'), ps.stem('loves')) # all should be same word

love love love love


In [62]:
print(ps.stem("organ"), ps.stem("organization"), ps.stem("organize"), ps.stem("organizing")) 

organ organ organ organ


In [63]:
# Address the problem by stemming every word in a tag string
def apply_stem(text):
    """Stem each whitespace-separated word and join the stems back into one string."""
    return " ".join(ps.stem(word) for word in text.split())

In [64]:
movies.iloc[2]['tags']

'a cryptic message from bond’s past sends him on a trail to uncover a sinister organization while m battles political forces to keep the secret service alive bond peels back the layers of deceit to reveal the terrible truth behind spectre action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom danielcraig christophwaltz léaseydoux sammendes'

In [65]:
apply_stem(movies.iloc[2]['tags'])

'a cryptic messag from bond’ past send him on a trail to uncov a sinist organ while m battl polit forc to keep the secret servic aliv bond peel back the layer of deceit to reveal the terribl truth behind spectr action adventur crime spi basedonnovel secretag sequel mi6 britishsecretservic unitedkingdom danielcraig christophwaltz léaseydoux sammend'

In [66]:
# applying the stemming function to all tags
movies['tags'] = movies['tags'].apply(apply_stem)

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')

In [68]:
vectors = cv.fit_transform(movies['tags']).toarray()

In [69]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(4806, 5000))

In [70]:
np.set_printoptions(threshold=np.inf)
cv.get_feature_names_out()

array(['007', '10', '100', '10yearold', '11', '11yearold', '12',
       '12yearold', '13', '14', '15', '150', '16', '16yearold', '17',
       '17yearold', '18', '18th', '18thcenturi', '1910', '1920', '1930',
       '1940', '1950', '1960', '1970', '1974', '1976', '1980', '1985',
       '1990', '1999', '19th', '19thcenturi', '20', '2000', '2003',
       '2009', '20th', '24', '25', '30', '300', '3d', '40', '50', '500',
       '60', '70', '911', 'aaron', 'aaroneckhart', 'abandon', 'abbi',
       'abduct', 'abigailbreslin', 'abil', 'abl', 'aboard', 'aborigin',
       'abov', 'abroad', 'absenc', 'absolut', 'abus', 'academ', 'academi',
       'accept', 'access', 'accid', 'accident', 'acclaim', 'accompani',
       'accomplish', 'account', 'accus', 'ace', 'achiev', 'acquaint',
       'act', 'action', 'actionhero', 'actionpack', 'activ', 'activist',
       'actor', 'actress', 'actual', 'ad', 'adam', 'adamsandl',
       'adamshankman', 'adapt', 'add', 'addict', 'addit', 'adjust',
       'admir', 

- In a high-dimensional space, Euclidean distance is not a reliable measure of similarity between two vectors
- Instead, we will use cosine similarity, which depends only on the angle between the two vectors
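
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ignores how long a vector is and looks only at its direction. The sketch below uses made-up word counts and hypothetical helper names to contrast it with Euclidean distance: two tag vectors with the same word proportions are maximally similar by cosine, yet still a noticeable distance apart.

```python
import math


def cosine_similarity_demo(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_a * norm_b)


def euclidean_distance_demo(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_tags = [1, 2, 0]  # word counts of a short description
long_tags = [2, 4, 0]   # same word proportions, twice as many words

print(cosine_similarity_demo(short_tags, long_tags))
print(euclidean_distance_demo(short_tags, long_tags))
```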

In [71]:
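# Import the function explicitly: `import sklearn` alone does not guarantee that sklearn.metrics is loaded
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)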

In [72]:
similarity.shape

(4806, 4806)

In [73]:
similarity[0][:10]

array([1.        , 0.08238526, 0.08492078, 0.0720633 , 0.18116433,
       0.12354155, 0.        , 0.16178756, 0.0594701 , 0.09443843])

In [74]:
movies[movies['title'] == "Avatar" ].index[0]

np.int64(0)

In [75]:
sorted(list(enumerate(similarity[0])), reverse = True, key = lambda x: x[1])[:10]

[(0, np.float64(1.0)),
 (2405, np.float64(0.26257545381445874)),
 (539, np.float64(0.24715576637149034)),
 (507, np.float64(0.24602771043141894)),
 (1214, np.float64(0.24037008503093257)),
 (1202, np.float64(0.2360960823249428)),
 (582, np.float64(0.23537827088100213)),
 (1192, np.float64(0.23372319715296228)),
 (61, np.float64(0.22880215766121476)),
 (778, np.float64(0.22875450543583872))]
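
Sorting the whole row works, but only the few best scores are needed. The same lookup can be sketched with `heapq.nlargest`, which avoids a full sort; the function name `top_matches` and the toy scores are assumptions for illustration.

```python
import heapq


def top_matches(scores, own_index, k=5):
    """Return the k (index, score) pairs with the highest score, skipping own_index."""
    candidates = (
        (index, score) for index, score in enumerate(scores) if index != own_index
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


row = [1.0, 0.08, 0.26, 0.0, 0.24]  # toy similarity row for movie 0
print(top_matches(row, own_index=0, k=2))
```

Skipping the movie by its own index, rather than dropping the first sorted entry, also stays correct if another row happens to have a score of 1.0.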

In [81]:
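# Recommend the top 5 movies based on similarity
def recommend(movie, top_movies_count=5):
    # Use the row position (not the index label), because `similarity` is indexed by position
    movie_index = np.flatnonzero(movies['title'].to_numpy() == movie)[0]
    distances   = similarity[movie_index]
    # Sort by similarity score, highest first, and skip the first entry (the movie itself)
    movie_list  = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:top_movies_count + 1]

    for i, _score in movie_list:
        print(movies.iloc[i].title)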

In [82]:
recommend("Avatar") # ML model is ready to recommend

Aliens
Titan A.E.
Independence Day
Aliens vs Predator: Requiem
Predators


In [83]:
recommend("Batman Begins")

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf


Now it is time to turn this into a website.