## We will be building a Movie Recommender System using Machine Learning

### Kaggle Data Set Link : https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

In [1]:
import numpy as np
import pandas as pd

In [74]:
pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2023.3.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.3.23
Note: you may need to restart the kernel to use updated packages.


In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [3]:
# displaying the shape (rows, columns) of the movies data frame
movies.shape

(4803, 20)

In [4]:
credits.shape

(4803, 4)

In [5]:
movies.info() # displaying necessary information before beginning data preprocessing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [6]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [7]:
# Displaying a single movie

In [8]:
movies.head(1)
# here id is the movie's id in the TMDB database (TMDB is a website similar to IMDb)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [9]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Data Preprocessing

## Merging the two data frames

As we have 2 different data frames, we will merge them into a single data frame, which is easier to work with.

In [10]:
movies = movies.merge(credits,on='title')
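# the shape below tells us that the data frames are combined (20 + 4 columns, minus the shared title column = 23)
# the row count grows from 4803 to 4809 because a few titles are not unique, so those rows are matched more than once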

In [11]:
movies.shape

(4809, 23)
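The jump from 4803 to 4809 rows comes from merging on a key that is not unique. Here is a minimal sketch of the effect, using invented toy rows rather than the real dataset (`toy_movies` and `toy_credits` are illustrative names). It also shows one possible way around it, assuming `id` and `movie_id` identify the same film:

```python
import pandas as pd

# Toy stand-ins for the two CSV files; the rows are invented for illustration.
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],  # two different films share a title
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Merging on the non-unique title pairs every "Batman" with every "Batman".
by_title = toy_movies.merge(toy_credits, on="title")

# Merging on the unique ids keeps exactly one row per film.
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(by_title.shape)  # (5, 4)
print(by_id.shape)     # (3, 4)
```

The rest of this notebook continues with the title-based merge, so the row counts shown in the outputs stay as they are.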

In [12]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Removing Unnecessary Columns

**We are creating a content-based recommendation system, so we will keep only the columns that are useful for creating tags for a movie (columns that give descriptive information about it)**

`Columns that are going to be removed :`
1. budget : budget is not necessarily an important factor when recommending movies. Liking one high-budget movie does not mean you will like all high-budget movies
2. homepage : we don't need the URL of the movie's website
3. original language : when we look at the language data, we find that around 4600 out of 4800 movies are in English, so this column would not matter much
4. original title : we will keep only the English title of each movie, so we remove the original title to avoid inconsistencies
5. popularity :
6. production companies : we rarely recommend movies based on production companies; we usually do it based on features such as the actors
7. release date : might matter for people who like movies from a specific era or time period; left for future work
8. revenue : might matter, but not used in this approach
9. runtime : does not matter, it's just a numeric value
10. spoken_languages: not an important factor
11. status : it does not matter for our approach whether a movie is released or not
12. tagline : removing because the tagline sometimes does not reflect what the movie is actually about
13. vote_average:
14. vote_count:
15. id : removing because we already keep movie_id (both hold the same value)


In [13]:
# reassigning movies to the remaining important columns
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [14]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Dealing with Missing Data

In [15]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [16]:
# drop the rows that contain null/NaN values
movies.dropna(inplace=True)

In [17]:
movies.isnull().sum()
# now we can see that there are no null data in our dataset

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64
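Note that `dropna()` removes rows, not columns, by default. A small sketch on an invented three-row frame (`toy` is an illustrative name) shows this. It also shows the `subset` argument, which limits the check to chosen columns:

```python
import pandas as pd

# Invented rows; the second one has no overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["some text", None, "more text"],
})

# dropna() removes every row that contains at least one missing value.
cleaned = toy.dropna()

# subset restricts the check to the listed columns only.
cleaned_subset = toy.dropna(subset=["overview"])

print(cleaned["title"].tolist())  # ['A', 'C']
print(cleaned_subset.shape)       # (2, 2)
```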

## Dealing with Duplicate Data

In [18]:
movies.duplicated().sum()

0

## Formatting the Genre Column

In [19]:
movies.iloc[0].genres
# displaying the genres of the first movie, i.e. Avatar
# we can see that the genres are in the format of a list of dictionaries

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [20]:
# we want to convert the above format to the following
# ['action','adventure','fantasy','science fiction']

# basically we only need a list of all the genre names

In [21]:
movies.iloc[0].genres
# we can see in the output that the list of dictionaries is actually stored as a string, so we need to convert it using Python's ast module

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [22]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]') 

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [23]:
# creating a function that will convert our list of dictionaries to list
def convertGenre(obj):
    L = []

    for i in ast.literal_eval(obj): # to convert string to a list
        # for each i we will get a dictionary, and from that dictionary, we will access the name and append it to a list
        L.append(i['name'])
    return L 

In [24]:
movies['genres'].apply(convertGenre)
# we can see in output, we have got a list of all the genres of movies 

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object

In [25]:
movies['genres'] = movies['genres'].apply(convertGenre)


In [26]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Formatting the keyword column

In [27]:
movies.iloc[0].keywords

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [28]:
# we need to do the same thing that we did with the genres column: extract all the names present in each dictionary into a list
# we will use the same function because we need to extract name from keywords as well

movies['keywords'].apply(convertGenre)

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [29]:
movies['keywords'] = movies['keywords'].apply(convertGenre)

In [30]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Formatting the Cast Column

In [31]:
movies.iloc[0].cast

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

**For cast we will do the same kind of formatting as for keywords and genres, but here we only need the first 3 cast members (so we pick only the first 3 dictionaries from the list of dictionaries)**

In [32]:
# creating a function that will convert our list of dictionaries to list
def convertCast(obj):
    L = []
    counter = 0
    for i in ast.literal_eval(obj): # to convert string to a list
        # for each i we will get a dictionary, and from that dictionary, we will access the name and append it to a list
        if counter!=3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L 
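`convertGenre` and `convertCast` differ only in how many names they keep, so they could be folded into one helper. The sketch below is one possible way to do that. `extract_names` and its `limit` parameter are hypothetical names introduced here, not part of the notebook:

```python
import ast

def extract_names(obj, limit=None):
    """Return the 'name' of each dictionary in a stringified list.

    limit=None keeps every name; limit=3 keeps only the first three.
    """
    records = ast.literal_eval(obj)  # string -> list of dictionaries
    return [record["name"] for record in records[:limit]]

# The genres string of the first movie, as shown earlier.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

print(extract_names(sample))           # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
print(extract_names(sample, limit=3))  # ['Action', 'Adventure', 'Fantasy']
```

With such a helper, the cast column could be converted with `movies['cast'].apply(lambda x: extract_names(x, limit=3))`. The notebook keeps its separate functions.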

In [33]:
movies['cast'].apply(convertCast)
# we can see that we are getting 3 actors for each movie

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [34]:
movies['cast']=movies['cast'].apply(convertCast)

In [35]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Formatting Crew Column

Here we only need the dictionary where job title is "Director" because we only want data for director

In [36]:
movies.iloc[0].crew

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [37]:
def convertCrew(obj):
    L = []
    for i in ast.literal_eval(obj): # to convert string to a list
        # each i is a dictionary; we keep the name only when the job is "Director"
        if i['job'] == 'Director':
            L.append(i['name'])
            break # stop after the first director
    return L

In [38]:
movies['crew'].apply(convertCrew)

0           [James Cameron]
1          [Gore Verbinski]
2              [Sam Mendes]
3       [Christopher Nolan]
4          [Andrew Stanton]
               ...         
4804     [Robert Rodriguez]
4805         [Edward Burns]
4806          [Scott Smith]
4807          [Daniel Hsia]
4808     [Brian Herzlinger]
Name: crew, Length: 4806, dtype: object

In [39]:
movies['crew'] = movies['crew'].apply(convertCrew)

In [40]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


## Converting overview from string to list
The overview column is in the form of strings. So we will convert it to lists so that we can concatenate it with the other lists that we have made while creating tags

In [41]:
movies['overview'].apply(lambda x : x.split())

0       [In, the, 22nd, century,, a, paraplegic, Marin...
1       [Captain, Barbossa,, long, believed, to, be, d...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, war-weary,, former, mili...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couple's, honeymoon, is, upended...
4806    ["Signed,, Sealed,, Delivered", introduces, a,...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: overview, Length: 4806, dtype: object

In [42]:
movies['overview']= movies['overview'].apply(lambda x : x.split())

In [43]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


## A Problem!
Several entries in our lists contain spaces (for example "Sam Worthington" or "Science Fiction"), and this might create a problem.

Suppose we have 2 people, Sam Worthington and Sam Mendes.
When the tags are later split into words, both names produce the word "Sam", so our model cannot tell which Sam is meant.

That is why it is important that we remove the blank spaces inside each name, so that "Sam Worthington" becomes the single tag "SamWorthington".
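The effect can be checked with a few lines of plain Python. The helper `shared_tokens` is an illustrative name, and splitting on whitespace is a simplifying assumption about how the tags are later tokenized:

```python
def shared_tokens(tags_a, tags_b):
    """Words the two tag strings have in common after lowercasing and splitting on whitespace."""
    return set(tags_a.lower().split()) & set(tags_b.lower().split())

# With spaces, the first names become separate words and collide.
print(shared_tokens("Sam Worthington", "Sam Mendes"))  # {'sam'}

# With spaces removed, each person is a single, distinct word.
print(shared_tokens("SamWorthington", "SamMendes"))    # set()
```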

In [44]:
# we will use list comprehension
movies['genres'].apply(lambda x : [i.replace(" ","") for i in x])
# we will do this same thing for all columns

0       [Action, Adventure, Fantasy, ScienceFiction]
1                       [Adventure, Fantasy, Action]
2                         [Action, Adventure, Crime]
3                   [Action, Crime, Drama, Thriller]
4                [Action, Adventure, ScienceFiction]
                            ...                     
4804                       [Action, Crime, Thriller]
4805                               [Comedy, Romance]
4806               [Comedy, Drama, Romance, TVMovie]
4807                                              []
4808                                   [Documentary]
Name: genres, Length: 4806, dtype: object

In [45]:
movies['genres'] = movies['genres'].apply(lambda x : [i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x : [i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x : [i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x : [i.replace(" ","") for i in x])


In [46]:
movies.head(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


## Creating a new column named tags


In [47]:
# note: genres is not part of this sum, so genre names do not end up in the tags
movies['tags'] = movies['overview'] + movies['keywords'] + movies['cast'] + movies['crew']

In [48]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
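Adding list columns concatenates the lists row by row. The sketch below reproduces this for one row in plain Python, with values shortened from the Avatar row. The variant that also includes the genres is hypothetical and not what the notebook computes:

```python
# Values shortened from the Avatar row shown above.
overview = "In the 22nd century, a paraplegic Marine".split()
genres = ["Action", "ScienceFiction"]
keywords = ["cultureclash", "future"]
cast = ["SamWorthington", "ZoeSaldana"]
crew = ["JamesCameron"]

# "+" on lists concatenates them, which is what the column sum does per row.
tags = overview + keywords + cast + crew

# Hypothetical variant that also includes the genres.
tags_with_genres = overview + genres + keywords + cast + crew

print(" ".join(tags))
print(len(tags), len(tags_with_genres))  # 12 14
```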


## Creating a new DataFrame

In [49]:
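# .copy() makes final_df an independent DataFrame, so later assignments to final_df['tags'] are unambiguous
final_df = movies[['movie_id','title','tags']].copy()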

In [50]:
final_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is..."


In [51]:
# Converting the tags from list to string
final_df['tags'].apply(lambda x : " ".join(x))

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4804    El Mariachi just wants to play his guitar and ...
4805    A newlywed couple's honeymoon is upended by th...
4806    "Signed, Sealed, Delivered" introduces a dedic...
4807    When ambitious New York attorney Sam is sent t...
4808    Ever since the second grade when he first saw ...
Name: tags, Length: 4806, dtype: object

In [52]:
final_df['tags'] = final_df['tags'].apply(lambda x : " ".join(x))



In [53]:
final_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [54]:
final_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

### Lowercasing all the tags (in general this is recommended)

In [55]:
final_df['tags'] = final_df['tags'].apply(lambda x : x.lower())



In [56]:
final_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


# Vectorization 

**Our Task: Create a system where the user enters a movie and we recommend 5 similar movies**

In [57]:
# displaying all tags in first movie
final_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [58]:
# displaying all tags in second movie
final_df['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

**Our Challenge is to find similarities between tags of each movie**

We could score each pair of movies by counting the words their tags have in common. This is not a good approach, because raw overlap counts are often misleading: for example, a movie with a very long tag string shares more words with every other movie simply because it has more words.

This is why we choose vectorization: we convert the tags of every movie to a vector, and when a user says they like a certain movie, we recommend the movies whose vectors are closest to that movie's vector.

**This process is called Text Vectorization**

There are different techniques for text vectorization; the one we will use is called `bag of words`.

Other techniques include `TF-IDF` and `Word2Vec`.

## Bag of Words Approach

First we will combine all the tags together (all tags from all movies combined).

tag1+tag2+tag3+.....

Now we have a very long string of words. **From this string of words we will choose the 5000 most frequently used words**

Once we have these 5000 words, we count how many times each of them occurs in the tags of every single movie:

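|        | word1 | word2 | word3 | word4 | ... | word5000 |
|--------|-------|-------|-------|-------|-----|----------|
| movie1 | 3     | 5     | 1     | 0     | ... | 0        |
| movie2 | 5     | 1     | 2     | 4     | ... | 1        |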



We will get a matrix with one row per movie (about 4800 rows) and 5000 columns, one per word.

Here each single movie is converted into a vector.

**Important point: during vectorization we will not consider stop words (such as and, or, are, to, that, etc.)**

**We will use the `CountVectorizer` class from sklearn to do our vectorization (although we could also do the vectorization manually: concatenating all tags, extracting the most common words, and so on)**
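The manual route mentioned above can be sketched with the standard library. This is a simplified illustration, not a reimplementation of `CountVectorizer`: the function name `bag_of_words` is made up here, and it assumes that splitting on whitespace is a good enough tokenizer.

```python
from collections import Counter

def bag_of_words(documents, max_features, stop_words=frozenset()):
    """Count-vectorize documents using the max_features most frequent words."""
    tokenized = [
        [word for word in doc.lower().split() if word not in stop_words]
        for doc in documents
    ]
    # Step 1: count every word across all documents combined.
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Step 2: keep the most common words as the vocabulary (sorted for a stable column order).
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    # Step 3: one row per document, one column per vocabulary word.
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, vectors

docs = ["space alien space war", "pirate ship and pirate war"]
vocab, vecs = bag_of_words(docs, max_features=3, stop_words={"and"})
print(vocab)  # ['pirate', 'space', 'war']
print(vecs)   # [[0, 2, 1], [2, 0, 1]]
```

Each inner list in `vecs` is the vector of one document, matching the rows of the table above.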

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
cv = CountVectorizer(max_features=5000,stop_words='english')

In [61]:
cv.fit_transform(final_df['tags']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [62]:
cv.fit_transform(final_df['tags']).toarray().shape

(4806, 5000)

In [67]:
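vectors = cv.fit_transform(final_df['tags']).toarray() # keep the count matrix in a variable so we can index into it
vectors[0]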

array([0, 0, 0, ..., 0, 0, 0])

In [72]:
cv.get_feature_names_out() # these are the 5000 most used words that are present in our tags

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

In [73]:
len(cv.get_feature_names_out())

5000

The words loved, loving and love all carry the same meaning, and we don't need them as different tags in our tags column, because treating them as different words creates confusion when we try to relate movies to each other.

That is why we will use stemming, which reduces such variants to one common root, so that only one form of each word remains.

In [75]:
import nltk

In [76]:
from nltk.stem.porter import PorterStemmer 

ps = PorterStemmer()

In [85]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
        
    return " ".join(y) # converting the list we created above to string once again  

In [86]:
ps.stem('loving')

'love'

In [87]:
ps.stem('loved')

'love'

In [90]:
final_df['tags'].apply(stem)

0       in the 22nd century, a parapleg marin is dispa...
1       captain barbossa, long believ to be dead, ha c...
2       a cryptic messag from bond’ past send him on a...
3       follow the death of district attorney harvey d...
4       john carter is a war-weary, former militari ca...
                              ...                        
4804    el mariachi just want to play hi guitar and ca...
4805    a newlyw couple' honeymoon is upend by the arr...
4806    "signed, sealed, delivered" introduc a dedic q...
4807    when ambiti new york attorney sam is sent to s...
4808    ever sinc the second grade when he first saw h...
Name: tags, Length: 4806, dtype: object

In [91]:
final_df['tags'] = final_df['tags'].apply(stem)    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['tags'] = final_df['tags'].apply(stem)
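
The warning above usually appears when a DataFrame was created as a slice of another one and is then modified. Below is a minimal sketch of one common way to avoid it, assuming `final_df` was built by selecting rows or columns from a larger frame (the notebook's actual construction is not shown in this section). The names `movies_toy` and `final_toy` are hypothetical stand-ins, and recent pandas versions may not emit the warning at all.

```python
import pandas as pd

# Hypothetical toy frame standing in for the merged movies data.
movies_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "tags": ["Loving Dogs", "Loved Cats", None],
})

# Selecting rows/columns can hand back a slice of the original frame.
# An explicit .copy() makes an independent frame, so assigning to one of
# its columns later does not trigger SettingWithCopyWarning.
final_toy = movies_toy.dropna()[["movie_id", "title", "tags"]].copy()
final_toy["tags"] = final_toy["tags"].apply(str.lower)

print(final_toy["tags"].tolist())
```

The original `movies_toy` frame is left untouched; only the copy is modified.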


In [92]:
from sklearn.feature_extraction.text import CountVectorizer

In [93]:
cv = CountVectorizer(max_features=5000,stop_words='english')

In [94]:
cv.fit_transform(final_df['tags']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [95]:
vectors = cv.fit_transform(final_df['tags']).toarray()

In [97]:
cv.get_feature_names_out() # these are the 5000 most frequent words present in our tags
# NumPy abbreviates long arrays with '...' when printing, so not all 5000 words are displayed here

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

**Now we have our vectors: one per movie, 4806 in total, each with 5000 dimensions. Our next task is to measure the distance between every pair of vectors.**

**The smaller the distance, the more similar the two vectors (movies) are.**

**We will not use the Euclidean distance (the straight-line distance from the tip of one vector to the tip of another).**

Reason: Euclidean distance becomes less informative as the number of dimensions grows, and it depends on vector length (how many words a movie's tags contain). Our data has 5000 dimensions, so we avoid it.

**Instead we will use cosine similarity, which depends only on the angle between the vectors. The smaller the angle, the more similar the movies.**
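
To make the difference concrete, here is a small sketch with made-up word-count vectors (the three vectors and the helper `cosine_sim` are hypothetical, not taken from the notebook's data). It shows that Euclidean distance reacts to how long a tag string is, while cosine similarity only looks at direction.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word-count vectors over a 4-word vocabulary.
short_tags = np.array([1, 1, 0, 0])   # a movie with a short tag string
long_tags = np.array([3, 3, 0, 0])    # the same words, each used three times
other = np.array([0, 0, 1, 1])        # completely different words

# Euclidean distance: the movie with the same words looks *farther* away
# than the movie with no words in common.
print(np.linalg.norm(short_tags - long_tags))
print(np.linalg.norm(short_tags - other))

# Cosine similarity: same direction scores about 1, no shared words scores 0.
print(cosine_sim(short_tags, long_tags))
print(cosine_sim(short_tags, other))
```

In this toy case the Euclidean ranking is the opposite of what we want, which is the kind of behaviour that motivates using the angle instead.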

### Using the cosine similarity function from scikit-learn

In [99]:
from sklearn.metrics.pairwise import cosine_similarity # for non-negative count vectors the score lies between 0 and 1; 1 means the vectors point in the same direction

In [100]:
cosine_similarity(vectors)

array([[1.        , 0.        , 0.03184649, ..., 0.02475369, 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.02592379, 0.        ,
        0.0277137 ],
       [0.03184649, 0.        , 1.        , ..., 0.02680281, 0.        ,
        0.        ],
       ...,
       [0.02475369, 0.02592379, 0.02680281, ..., 1.        , 0.0412393 ,
        0.04454354],
       [0.        , 0.        , 0.        , ..., 0.0412393 , 1.        ,
        0.08817334],
       [0.        , 0.0277137 , 0.        , ..., 0.04454354, 0.08817334,
        1.        ]])

In [101]:
cosine_similarity(vectors).shape

(4806, 4806)

In [102]:
similarity = cosine_similarity(vectors)

In [104]:
similarity[0] # this tells us the similarity of the 1st movie with every other movie
# the similarity score of the 1st movie with itself is of course 1, because it is the same movie

array([1.        , 0.        , 0.03184649, ..., 0.02475369, 0.        ,
       0.        ])

In [105]:
similarity[1]

array([0.        , 1.        , 0.        , ..., 0.02592379, 0.        ,
       0.0277137 ])

## Creating the final recommend function

In [113]:
final_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."
...,...,...,...
4804,9367,El Mariachi,el mariachi just want to play hi guitar and ca...
4805,72766,Newlyweds,a newlyw couple' honeymoon is upend by the arr...
4806,231617,"Signed, Sealed, Delivered","""signed, sealed, delivered"" introduc a dedic q..."
4807,126186,Shanghai Calling,when ambiti new york attorney sam is sent to s...


In [112]:
final_df[final_df['title'] == 'Batman Begins'].index[0] # fetch index of movie

119

In [115]:
list(enumerate(similarity[0]))
# this gives us (position, similarity) pairs between movie 0 and all the other movies

[(0, 1.0),
 (1, 0.0),
 (2, 0.031846487764924096),
 (3, 0.06063390625908324),
 (4, 0.1292719224987548),
 (5, 0.04711428474315455),
 (6, 0.02195814375455294),
 (7, 0.05492350363810897),
 (8, 0.0),
 (9, 0.0),
 (10, 0.0),
 (11, 0.026783579200279007),
 (12, 0.0),
 (13, 0.0),
 (14, 0.028583097523751468),
 (15, 0.0),
 (16, 0.0),
 (17, 0.06587443126365883),
 (18, 0.06149400462680908),
 (19, 0.022518867455552243),
 (20, 0.015526752351113421),
 (21, 0.05970814340265321),
 (22, 0.0),
 (23, 0.0),
 (24, 0.0),
 (25, 0.03656362120635653),
 (26, 0.08458258116519013),
 (27, 0.13629325512727639),
 (28, 0.03300491809922247),
 (29, 0.023557142371577276),
 (30, 0.0),
 (31, 0.07151985398521515),
 (32, 0.023557142371577276),
 (33, 0.0),
 (34, 0.0),
 (35, 0.0),
 (36, 0.10713431680111603),
 (37, 0.029411764705882346),
 (38, 0.02195814375455294),
 (39, 0.0),
 (40, 0.030316953129541614),
 (41, 0.10461315619318828),
 (42, 0.0),
 (43, 0.049507377148833714),
 (44, 0.0),
 (45, 0.045037734911104486),
 (46, 0.03131121

In [118]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1]) # the lambda sorts by similarity score; without it the tuples would be sorted by index position


[(0, 1.0),
 (1216, 0.25142264225703365),
 (2409, 0.24404676504598818),
 (582, 0.21613144789263333),
 (507, 0.20812378052829614),
 (3730, 0.20797258270192573),
 (778, 0.2045239970259654),
 (3608, 0.19926334924652142),
 (539, 0.1980295085953348),
 (1920, 0.19474520942613002),
 (1204, 0.18981415059132414),
 (2786, 0.17503501050350123),
 (74, 0.16977493752543305),
 (1089, 0.16516802780306677),
 (3675, 0.1647705109143269),
 (1321, 0.16169041669088866),
 (3538, 0.15877683720748892),
 (529, 0.1569197342897824),
 (151, 0.15339299776947407),
 (2333, 0.150093837982278),
 (373, 0.14969623771302393),
 (1201, 0.14969623771302393),
 (4192, 0.14969623771302393),
 (47, 0.14890247043403096),
 (4048, 0.14777011582226215),
 (2971, 0.14666320798326962),
 (843, 0.1449427589131121),
 (2515, 0.1430397079704303),
 (1071, 0.14291548761875733),
 (305, 0.14269544824634822),
 (3327, 0.14269544824634822),
 (184, 0.14097096860865022),
 (1774, 0.140028008402801),
 (2731, 0.140028008402801),
 (3162, 0.140028008402800

In [119]:
# Now we only need to grab the first 5 indexes (except the one of the movie itself)
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6] 
# now we have the top 5 movies that are similar to movie 0

[(1216, 0.25142264225703365),
 (2409, 0.24404676504598818),
 (582, 0.21613144789263333),
 (507, 0.20812378052829614),
 (3730, 0.20797258270192573)]
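
As an alternative to sorting a list of tuples, the same top-k lookup can be sketched with NumPy's `argsort`. The `scores` array below is a hypothetical similarity row, not a row from the notebook's `similarity` matrix.

```python
import numpy as np

# Hypothetical similarity row for one movie (position 0 is the movie itself).
scores = np.array([1.0, 0.0, 0.03, 0.25, 0.13, 0.21])

# argsort sorts ascending, so negate the scores to get descending order.
# Skip the first result, which is the movie itself (score 1.0).
top3 = np.argsort(-scores)[1:4]

print(top3.tolist())  # → [3, 5, 4]
```

Like the `[1:6]` slice above, skipping the first result assumes the movie itself ranks first; if another movie has identical tags and also scores 1.0, the order between the two is not guaranteed.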

In [123]:
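def recommend(movie):
    # row position of the movie we pass; final_df's index labels have gaps (they run to 4808 for 4806 rows),
    # so we look up the position rather than the label to stay aligned with the rows of `similarity`
    movie_index = np.flatnonzero((final_df['title'] == movie).to_numpy())[0]
    distances = similarity[movie_index]
    # sort by similarity score, highest first, and skip the movie itself
    movie_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6] 
    
    for i in movie_list:
        # i[0] is the row position of a similar movie; printing the title is more useful than the position
        print(final_df.iloc[i[0]].title)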
    

In [124]:
recommend('Batman Begins')

The Dark Knight
Batman
Batman
Batman v Superman: Dawn of Justice
Synecdoche, New York


In [125]:
recommend('Avatar')

Aliens vs Predator: Requiem
Aliens
Battle: Los Angeles
Independence Day
Falcon Rising


In [126]:
recommend('Fight Club')

This Is Martin Bonner
Fight Valley
Paint Your Wagon
Death Race
Slither
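
The function above prints its results and raises an `IndexError` for a title that is not in the data. Below is a hedged sketch of a variant that returns a list and handles unknown titles. The names `recommend_titles`, `toy_df`, and `toy_sim` are hypothetical, and the toy index deliberately has a gap to mimic a DataFrame whose rows were dropped without resetting the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the index skips label 2, as after dropping a row.
toy_df = pd.DataFrame({"title": ["A", "B", "C", "D"]}, index=[0, 1, 3, 4])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])

def recommend_titles(movie, df, sim, k=2):
    """Return the k titles most similar to `movie`, or [] if it is unknown."""
    positions = np.flatnonzero((df["title"] == movie).to_numpy())
    if positions.size == 0:
        return []
    pos = positions[0]                 # row position, not index label
    order = np.argsort(-sim[pos])      # most similar first
    order = order[order != pos][:k]    # drop the movie itself
    return df["title"].iloc[order].tolist()

print(recommend_titles("C", toy_df, toy_sim))        # → ['A', 'D']
print(recommend_titles("Unknown", toy_df, toy_sim))  # → []
```

Working with row positions keeps the lookup aligned with the similarity matrix even when index labels and positions differ.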
