## Movie Recommender System

In this notebook, we build a content-based movie recommender system using machine learning.

There are three types of recommender systems:

    1. Content-based - Recommends items similar to what the user already buys, listens to, watches or reads, based on descriptive tags. (Similarity in content)
    
    2. Collaborative filtering - Recommends based on the interests of similar users. Users are grouped by the similarity of their previous ratings and reviews, so if one person in a group likes something, it can be recommended to the others in that group. (Similarity in users) e.g. Netflix, news feeds.
    
    3. Hybrid - A combination of the two above. e.g. YouTube.
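
To make the collaborative-filtering idea concrete, here is a minimal sketch on a made-up ratings matrix. The users, items, ratings and the "liked" threshold of 4 are all invented for illustration. It finds the user most similar to a target user and suggests items that this neighbour rated highly but the target has not rated.

```python
import numpy as np

# Hypothetical ratings: rows = users, columns = items, 0 = not rated.
items = ["Movie A", "Movie B", "Movie C", "Movie D"]
ratings = np.array([
    [5, 4, 0, 0],   # user 0 (the target user)
    [5, 5, 4, 0],   # user 1: tastes close to user 0
    [1, 0, 5, 4],   # user 2: different tastes
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
others = [u for u in range(len(ratings)) if u != target]
# The neighbour is the other user whose ratings point in the most similar direction.
neighbour = max(others, key=lambda u: cosine(ratings[target], ratings[u]))

# Suggest items the neighbour liked (rating >= 4) that the target has not rated.
suggestions = [
    items[j]
    for j in range(len(items))
    if ratings[target, j] == 0 and ratings[neighbour, j] >= 4
]
print(neighbour, suggestions)
```

The rest of this notebook uses the content-based approach only.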

Project planning:
    
    1. Data
    2. Preprocessing
    3. Model

#### 1. Data

We are working with the TMDB 5000 movie dataset from Kaggle:

https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

In [1]:
import numpy as np
import pandas as pd

In [17]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [18]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [19]:
credits.head(1)["cast"]

0    [{"cast_id": 242, "character": "Jake Sully", "...
Name: cast, dtype: object

In [20]:
credits.head(1)["cast"].values

array(['[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "ge

In [21]:
credits.head(1)["crew"]

0    [{"credit_id": "52fe48009251416c750aca23", "de...
Name: crew, dtype: object

In [22]:
credits.head(1)["crew"].values

array(['[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cam

It's better to combine the two datasets. We can merge them on either the movie id or the title; here we merge on the title.

In [23]:
movies = movies.merge(credits, on='title')

In [24]:
movies.shape

(4809, 23)
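
Merging on `title` pairs every row on the left with every row on the right that has the same title, so two different movies sharing a title produce extra rows. The toy frames below use invented data to illustrate this and to show that merging on the ID columns avoids it. Whether this applies to the merged frame above has not been checked here.

```python
import pandas as pd

# Toy data (invented): two different movies share the title "Batman".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast of 1", "cast of 2", "cast of 3"],
})

# Merge on title: each "Batman" movie row pairs with both "Batman" credit rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Merge on the ID columns: exactly one match per movie.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(by_title.shape)
print(by_id.shape)
```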

In [25]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [27]:
movies["original_language"].value_counts()

en    4510
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
ko      12
cn      12
ru      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: original_language, dtype: int64

In [29]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

#### Segregating columns into useful and not useful:
Useful columns to create tags:
    
    1. genres
    2. movie_id (identifier, e.g. for building a website)
    3. keywords
    4. overview
    5. title
    6. cast
    7. crew

Columns not useful for creating tags:
    
    1. budget
    2. homepage
    3. original_language (the majority is English)
    4. original_title (since we have title)
    5. popularity
    6. production_companies (since we generally don't recommend movies by this)
    7. production_countries
    8. release_date
    9. revenue
    10. runtime
    11. spoken_languages
    12. status
    13. tagline
    14. vote_average
    15. vote_count
    16. id (duplicates movie_id, which we keep)

In [30]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [31]:
# Create tag by combining overview, genres, keywords, cast and crew

#### 2. Data pre-processing

In [33]:
# Checking for null values
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [34]:
# Dropping rows with null values
# (reassigning avoids modifying a column-subset of the original frame in place)
movies = movies.dropna()

In [35]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [37]:
# Checking for duplicated values
movies.duplicated().sum()

0

##### Changing the format of the columns

In [38]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [40]:
# The above is a string containing a list of dictionaries. 
# Let's convert it into a list of genre names

In [45]:
# Helper function to convert the format
import ast

def convert(obj):
    """
    Parse a string containing a list of dicts and return a list of their 'name' values
    """
    names = []
    for i in ast.literal_eval(obj):
        names.append(i['name'])
    return names

In [52]:
import ast
# ast.literal_eval (from the built-in ast module) safely evaluates a string containing a Python literal
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [48]:
movies['genres'] = movies['genres'].apply(convert)

In [50]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [54]:
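# Changing the format of the keywords column
# Note: run this cell only once. Re-running it after the column already holds lists
# makes ast.literal_eval fail with the ValueError shown below.
movies['keywords'] = movies['keywords'].apply(convert)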

ValueError: malformed node or string: ['culture clash', 'future', 'space war', 'space colony', 'society', 'space travel', 'futuristic', 'romance', 'space', 'alien', 'tribe', 'alien planet', 'cgi', 'marine', 'soldier', 'battle', 'love affair', 'anti war', 'power relations', 'mind and soul', '3d']
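
The ValueError above is what `ast.literal_eval` raises when it receives a list instead of a string, which happens when the conversion runs on a column that has already been converted. One possible guard, sketched here with a hypothetical helper named `convert_safe` (not part of the original notebook), is to return already-parsed lists unchanged:

```python
import ast

def convert_safe(obj):
    """Return the 'name' values from a string holding a list of dicts.

    Hypothetical variant of `convert`: values that are already lists
    are returned as-is, so running the conversion twice is harmless.
    """
    if isinstance(obj, list):
        return obj
    return [item["name"] for item in ast.literal_eval(obj)]

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
once = convert_safe(raw)    # parses the string
twice = convert_safe(once)  # already a list, returned unchanged
print(once, twice)
```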

In [55]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [56]:
# Changing the format of cast and picking only the first 3 values
def convert_and_pick3(text):
    """
    Convert the format of the object and pick the first 3 names into a list
    """
    names = []
    # Slicing keeps only the first three entries of the parsed list
    for i in ast.literal_eval(text)[:3]:
        names.append(i['name'])
    return names
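
As a quick check of the top-three logic, the sketch below runs a compact version of the same idea on a hand-written cast string. The sample string is abbreviated from the Avatar row shown earlier and keeps only the `name` and `order` keys; the function name `pick_first3` is made up for this example.

```python
import ast

def pick_first3(text):
    """Parse the string and keep the names of the first three entries."""
    return [member["name"] for member in ast.literal_eval(text)[:3]]

sample_cast = (
    '[{"name": "Sam Worthington", "order": 0}, '
    '{"name": "Zoe Saldana", "order": 1}, '
    '{"name": "Sigourney Weaver", "order": 2}, '
    '{"name": "Stephen Lang", "order": 3}]'
)
top3 = pick_first3(sample_cast)
print(top3)
```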


In [57]:
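# Note: `convert` is applied here, so the full cast is kept and the outputs below reflect that.
# To keep only the top three cast members, apply `convert_and_pick3` instead.
movies['cast'] = movies['cast'].apply(convert)
movies.head()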

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [58]:
# Helper function to convert and fetch the director's name from crew
def fetch_director(text):
    """
    Fetching director's name from the crew
    """
    List = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            List.append(i['name'])
    return List 

In [59]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [60]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]


In [62]:
# Splitting each overview into a list of words
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [64]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


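We need to remove the spaces inside each entity (name, genre, keyword). Otherwise a name like "Sam Worthington" is later split into the separate words "Sam" and "Worthington", and the word "Sam" would wrongly match other names such as "Sam Mendes".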

In [65]:
# Helper function to remove spaces
def remove_spaces(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
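
The effect of removing spaces can be seen with a small tokenizer. The sketch below uses a regular expression that approximates the default tokenization of scikit-learn's `CountVectorizer` (lowercase, then keep runs of two or more word characters); the helper name `tokens` is made up for this example.

```python
import re

def tokens(text):
    # Approximates CountVectorizer's default tokenization:
    # lowercase, then keep runs of two or more word characters.
    return set(re.findall(r"(?u)\b\w\w+\b", text.lower()))

# With spaces, the two different people share the token "sam".
with_spaces = tokens("Sam Worthington") & tokens("Sam Mendes")

# Without spaces, each name is a single distinct token.
without_spaces = tokens("SamWorthington") & tokens("SamMendes")

print(with_spaces)
print(without_spaces)
```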

In [66]:
movies['cast'] = movies['cast'].apply(remove_spaces)
movies['crew'] = movies['crew'].apply(remove_spaces)
movies['genres'] = movies['genres'].apply(remove_spaces)
movies['keywords'] = movies['keywords'].apply(remove_spaces)

In [67]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton]


In [68]:
# Create a column 'tags' which is the concatenation of overview, genres, keywords, cast and crew
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [69]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [70]:
new_df = movies.drop(columns=['overview','genres','keywords','cast','crew'])

In [71]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [73]:
# Converting list into string
new_df['tags']= new_df['tags'].apply(lambda x:" ".join(x))

In [74]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [75]:
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver StephenLang MichelleRodriguez GiovanniRibisi JoelDavidMoore CCHPounder WesStudi LazAlonso DileepRao MattGerald SeanAnthonyMoran JasonWhyte ScottLawrence KellyKilgour JamesPatrickPitt SeanPatrickMurphy PeterDillon KevinDorman KelsonHenderson DavidVanHorn JacobTomuri MichaelBlain-Rozgay JonCurry LukeHawker WoodySchultz PeterMensah SoniaYee JahnelCurfman IlramChoi KylaWarren LisaRoumain DebraWilson ChrisMala TaylorKibby JodieLandau JulieLamm CullenB.Madden JosephBradyMadden FrankieTorres AustinWilson SaraWilson TamicaWashington-Miller LucyBriant NathanM

In [76]:
# Converting tags column into lower case
new_df['tags']= new_df['tags'].apply(lambda x: x.lower())
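
The tag-building steps (concatenating the list columns, joining into one string, lowercasing) can be traced on a one-row toy frame. The data below is invented and uses only three of the five columns to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "overview": [["A", "marine", "on", "Pandora"]],
    "genres": [["Action", "ScienceFiction"]],
    "crew": [["JamesCameron"]],
})

# Adding list-valued columns concatenates the lists row by row.
toy["tags"] = toy["overview"] + toy["genres"] + toy["crew"]

# Join each list into a single string, then lowercase it.
toy["tags"] = toy["tags"].apply(lambda words: " ".join(words).lower())

print(toy.loc[0, "tags"])
```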

In [77]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


#### Text vectorization using the bag-of-words technique

Now that we have the tags, the user gives a movie name and we should return the 5 movies most similar to it.

* How do we find similar movies? 
    Based on tags. We compute the similarity between the tags of every pair of movies.

* How do we calculate similarity between tags? 
    1. Counting shared words directly - easy but not reliable
    2. Converting tags into vectors and comparing the vectors
    
* What are other text vectorization techniques? 
    TF-IDF, word2vec

* How does bag of words work?
    It combines all the tags and picks the most common words as the vocabulary. It then counts how many times each vocabulary word appears in each tag and stores these counts in a table. Each row of the table is the vector for one movie, with one entry per vocabulary word holding that word's frequency. 
    
* Do we consider all words? 
    No, we eliminate stop words such as "is", "are", "that" and "this".

* Do we do all this manually?
    No, scikit-learn provides CountVectorizer for this. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    
* How do we find the similarity after converting the tags into vectors? 
    By computing the cosine similarity, i.e. the cosine of the angle between two vectors. The smaller the angle (and so the smaller the cosine distance), the more similar the movies.

![image.png](attachment:image.png)

In [80]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 5000, stop_words = 'english')

In [84]:
vectors = cv.fit_transform(new_df['tags']).toarray()

In [85]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [86]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [89]:
# Vocabulary: the most common words kept by the vectorizer
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zoo', 'zooeydeschanel', 'zoëkravitz'],
      dtype=object)

##### Stemming - a technique that reduces different forms of a word to a common root
['love', 'loved', 'loving'] = after stemming ['love', 'love', 'love']

In [90]:
# nltk is used to perform stemming
!pip install nltk


In [91]:
import nltk

In [92]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

In [93]:
def stem(text):
    """
    Stem every word in the text, i.e. replace each word with its root form
    (for example, 'actors' becomes 'actor').
    """
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)
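
A short check of what the Porter stemmer does to individual words, assuming `nltk` is installed. The words "believed" and "marine" are taken from the overviews above; the variable names are made up for this example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

words = ["love", "loved", "loving", "believed", "marine"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Note that a stem is not always a dictionary word; the stemmer only needs to map related forms to the same string.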

In [95]:
new_df['tags'] = new_df['tags'].apply(stem)

In [96]:
new_df.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."


In [97]:
# Text vectorization after stemming
vectors = cv.fit_transform(new_df['tags']).toarray()

In [99]:
vectors.shape

(4806, 5000)

In [100]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zoo', 'zooeydeschanel', 'zoëkravitz'],
      dtype=object)

In [102]:
# Finding the similarity between movies using cosine similarity (the cosine of the angle between 2 vectors)
# Cosine distance (1 - similarity) is inversely related to similarity

from sklearn.metrics.pairwise import cosine_similarity

In [103]:
cosine_similarity(vectors)

array([[1.        , 0.06666667, 0.07559289, ..., 0.0421637 , 0.        ,
        0.        ],
       [0.06666667, 1.        , 0.07559289, ..., 0.02108185, 0.        ,
        0.02222222],
       [0.07559289, 0.07559289, 1.        , ..., 0.02390457, 0.        ,
        0.        ],
       ...,
       [0.0421637 , 0.02108185, 0.02390457, ..., 1.        , 0.04264014,
        0.0421637 ],
       [0.        , 0.        , 0.        , ..., 0.04264014, 1.        ,
        0.08989331],
       [0.        , 0.02222222, 0.        , ..., 0.0421637 , 0.08989331,
        1.        ]])

In [105]:
cosine_similarity(vectors).shape

(4806, 4806)
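
Cosine similarity can also be computed by hand, which makes the link between similarity, distance and angle explicit. The two count vectors below are invented for illustration.

```python
import numpy as np

# Two invented count vectors over the same six-word vocabulary.
a = np.array([1, 1, 0, 0, 0, 1], dtype=float)
b = np.array([2, 0, 1, 0, 0, 1], dtype=float)

# cosine similarity = dot product divided by the product of the vector lengths
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine distance is defined as 1 - cosine similarity
cos_dist = 1.0 - cos_sim

# the angle between the vectors, in degrees
angle_deg = np.degrees(np.arccos(cos_sim))

print(cos_sim, cos_dist, angle_deg)
```

A vector compared with itself has an angle of 0 and a similarity of 1, which is why the diagonal of the similarity matrix is all ones.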

In [106]:
similarity = cosine_similarity(vectors)

In [107]:
# Similarity of 1st movie with all the other movies
similarity[0]

array([1.        , 0.06666667, 0.07559289, ..., 0.0421637 , 0.        ,
       0.        ])

In [110]:
# Similarity of 2nd movie with all the other movies
similarity[1]

# All the diagonal elements will be 1 since each is the similarity of a movie with itself

array([0.06666667, 1.        , 0.07559289, ..., 0.02108185, 0.        ,
       0.02222222])

In [114]:
# Function to recommend movies

def recommend(movie):
    # Row position of the requested movie. Positions are used (not index labels)
    # because rows were dropped earlier and `similarity` is indexed by position.
    matches = np.flatnonzero((new_df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"'{movie}' not found in the dataset")
        return
    movie_index = matches[0]
    scores = similarity[movie_index]
    
    # enumerate pairs each row position with its similarity score.
    # Sorting by the score (x[1]) in descending order puts the movie itself first,
    # so the slice [1:6] skips it and keeps the next five.
    movies_list = sorted(enumerate(scores), reverse=True, key=lambda x: x[1])[1:6]
    
    for i in movies_list:
        # i[0] is the row position of a similar movie
        print(new_df.iloc[i[0]].title)
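
Because rows were dropped with `dropna` earlier without resetting the index, index labels and row positions in `new_df` can differ. A lookup that feeds into a positional structure such as `similarity` or `.iloc` needs the position, not the label. The toy frame below (invented data) shows the mismatch and one way to remove it.

```python
import pandas as pd

# Toy frame: the row with a missing overview is dropped, leaving labels 0, 2, 3.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["x", None, "y", "z"],
}).dropna()

label = toy[toy["title"] == "D"].index[0]  # index label kept from before dropna
position = toy.index.get_loc(label)        # row position in the filtered frame
print(label, position)

# Resetting the index makes labels and positions agree again.
toy = toy.reset_index(drop=True)
label_after = toy[toy["title"] == "D"].index[0]
print(label_after)
```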
     

In [115]:
recommend("Batman Begins")

The Dark Knight
Batman
The Dark Knight Rises
Amidst the Devil's Wings
Rockaway


In [116]:
recommend('Gandhi')

Ramanujan
Guiana 1838
The Wind That Shakes the Barley
The Bounty
The Sea Inside
