# Movie Recommender System

Let's build a content-based movie recommender system that uses text vectorization and similarity analysis to suggest movies similar to one the user already likes, based on movie attributes.

Content-based recommenders like this one use feature extraction, vectorization, and cosine similarity to analyze movie data. By representing each movie as a vector built from attributes such as genres, actors, and plot keywords, the system can compute similarity scores and identify related movies.

We start by collecting and preparing the dataset. I used the Kaggle [TMDb Movie Metadata Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata), which includes extensive information about movies, such as genres, cast members, directors, plot summaries, ratings, and more.

In [277]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [197]:
# Importing Data
Movies=pd.read_csv("tmdb.csv")
Credits=pd.read_csv("tmdb_credits.csv")

In [198]:
Movies.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [199]:
Credits.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [200]:
# Merging the two dataframes
Movies=Movies.merge(Credits,on="title")

In [201]:
Movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [202]:
print("Number of Columns:",Movies.shape[1])
print("Number of Rows:",Movies.shape[0])

Number of Columns: 23
Number of Rows: 4809
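
The merge above joins on `title`. If two different films share a title, a title-based join pairs every matching movie row with every matching credits row, which can add extra rows. The following is a minimal sketch on made-up toy data (not the TMDb files) that contrasts this with joining on the id columns (`id` in the movies frame, `movie_id` in the credits frame); the ids and cast values are invented for illustration.

```python
import pandas as pd

# Toy frames: two different films share the title "Batman" (ids are made up).
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs each "Batman" row with both credits rows (2 x 2 = 4).
by_title = movies.merge(credits, on="title")

# Joining on the numeric id keeps exactly one row per movie.
by_id = movies.merge(
    credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))  # 5 rows versus 3 rows
```

Whether the id-based join is preferable for the real files depends on how the two CSVs were prepared, so treat this as something to check rather than a required change.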


In [203]:
Movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

### Creating Tags

Tags summarize each movie's content in a single text field built from its overview, genres, keywords, cast, and crew. The recommender compares these tags to decide which movies are similar, so we first keep only the columns needed to build them.

In [205]:
Movies=Movies[["movie_id","title","overview","genres","keywords","cast","crew"]]

In [206]:
Movies

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...","[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...,[],[],"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [207]:
# Checking Null Values
Movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [208]:
# Overview has 3 null values, so we drop those rows using dropna
Movies.dropna(inplace=True)

In [209]:
Movies["overview"].isnull().sum()

0
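
One side effect of `dropna` is worth keeping in mind: it removes rows but keeps the original index labels, so labels and row positions stop matching after the first dropped row. This matters whenever a label is later used to address a NumPy array, which is indexed by position. A small sketch on made-up data:

```python
import pandas as pd

# Toy frame with one missing overview (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["x", None, "y", "z"],
})
toy = toy.dropna()

labels = list(toy.index)             # labels keep their original values
by_label = toy.loc[2, "title"]       # label 2 is the row for "C"
by_position = toy.iloc[2]["title"]   # position 2 is the row for "D"

# Optional: renumber so labels and positions agree again.
toy_reset = toy.reset_index(drop=True)
print(labels, by_label, by_position, list(toy_reset.index))
```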

In [210]:
# Duplicates in data
Movies.duplicated().sum()

0

### Data Preprocessing

In [211]:
# Genres
Movies.iloc[0].genres
# The genres are stored as a string that encodes a list of dictionaries
# We want a plain list of names: ["Action","Adventure","Fantasy","Science Fiction"]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [212]:
import ast # The ast.literal_eval function safely evaluates the string as a Python literal.
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]
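
The strings in these columns look like JSON, so `json.loads` from the standard library is a possible alternative to `ast.literal_eval`. The sketch below assumes the columns really are JSON-encoded; the second sample string is made up to show where the two parsers differ.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Both parsers agree on this input.
from_ast = ast.literal_eval(raw)
from_json = json.loads(raw)

# JSON literals such as null are valid JSON but not valid Python literals.
raw_with_null = '[{"id": 1, "name": null}]'
parsed_null = json.loads(raw_with_null)  # works; name becomes None
try:
    ast.literal_eval(raw_with_null)
    ast_ok = True
except ValueError:
    ast_ok = False

print(from_ast == from_json, parsed_null, ast_ok)
```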

In [213]:
def convert(obj):
    # Parse the string into a list of dicts and keep only each entry's "name"
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i["name"])
    return L

In [214]:
Movies["genres"]=Movies.genres.apply(convert) # Now we have genres in proper format

In [215]:
# Keywords
Movies["keywords"]=Movies.keywords.apply(convert)

In [216]:
# Cast
Movies["cast"]=Movies.cast.apply(convert)

In [217]:
Movies["cast"][0] # we only need top 3 cast 

['Sam Worthington',
 'Zoe Saldana',
 'Sigourney Weaver',
 'Stephen Lang',
 'Michelle Rodriguez',
 'Giovanni Ribisi',
 'Joel David Moore',
 'CCH Pounder',
 'Wes Studi',
 'Laz Alonso',
 'Dileep Rao',
 'Matt Gerald',
 'Sean Anthony Moran',
 'Jason Whyte',
 'Scott Lawrence',
 'Kelly Kilgour',
 'James Patrick Pitt',
 'Sean Patrick Murphy',
 'Peter Dillon',
 'Kevin Dorman',
 'Kelson Henderson',
 'David Van Horn',
 'Jacob Tomuri',
 'Michael Blain-Rozgay',
 'Jon Curry',
 'Luke Hawker',
 'Woody Schultz',
 'Peter Mensah',
 'Sonia Yee',
 'Jahnel Curfman',
 'Ilram Choi',
 'Kyla Warren',
 'Lisa Roumain',
 'Debra Wilson',
 'Chris Mala',
 'Taylor Kibby',
 'Jodie Landau',
 'Julie Lamm',
 'Cullen B. Madden',
 'Joseph Brady Madden',
 'Frankie Torres',
 'Austin Wilson',
 'Sara Wilson',
 'Tamica Washington-Miller',
 'Lucy Briant',
 'Nathan Meister',
 'Gerry Blair',
 'Matthew Chamberlain',
 'Paul Yates',
 'Wray Wilson',
 'James Gaylyn',
 'Melvin Leno Clark III',
 'Carvon Futrell',
 'Brandon Jelkes',
 'Mica

In [218]:
def top3(obj):
    # Keep only the first three names in the list
    return obj[:3]

In [219]:
Movies["cast"]=Movies['cast'].apply(top3)

In [220]:
Movies["cast"][0] # top 3 cast

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

In [221]:
# crew
Movies["crew"][0]
# We will extract director 

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [222]:
def fetch_director(obj):
    # Return a one-element list with the first crew member whose job is "Director"
    L=[]
    for i in ast.literal_eval(obj):
        if i["job"]=="Director":
            L.append(i["name"])
            break
    return L

In [223]:
Movies["crew"]=Movies["crew"].apply(fetch_director)
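
The same director lookup can be written more compactly with a generator and `next`. This is only a sketch: `first_director` is a hypothetical helper name, it assumes the crew string is JSON-encoded, and the sample string is a shortened version of the crew entry shown above.

```python
import json

def first_director(crew_json):
    """Return a one-element list with the first director's name, or [] if none."""
    crew = json.loads(crew_json)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [name] if name is not None else []

sample = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"},'
    ' {"job": "Director", "name": "James Cameron"}]'
)
print(first_director(sample))
print(first_director("[]"))
```

Using `m.get("job")` rather than `m["job"]` means an entry without a `job` key is skipped instead of raising a `KeyError`.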

In [224]:
# Overview (converting to a list of words)
Movies["overview"]=Movies["overview"].apply(lambda x:x.split())

In [225]:
Movies.head(3) 

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]


### Transformation

Next we remove the spaces inside multi-word names and phrases. The vectorizer splits text into individual words, so "Sam Worthington" and "Sam Mendes" would each contribute a separate "Sam" token, and the two people would look partly identical to the model. Joining the words turns them into the single tokens "SamWorthington" and "SamMendes", which keeps them distinct. We apply the same transformation to genres, keywords, cast, and crew.
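
The effect of removing spaces can be seen with a small token count. This sketch uses plain whitespace splitting as a simplified stand-in for the vectorizer's tokenizer; `tokens` is a hypothetical helper written only for this illustration.

```python
from collections import Counter

def tokens(names, strip_spaces):
    """Lowercase names and split on whitespace, optionally removing inner spaces first."""
    out = []
    for name in names:
        text = name.replace(" ", "") if strip_spaces else name
        out.extend(text.lower().split())
    return Counter(out)

names = ["Sam Worthington", "Sam Mendes"]
with_spaces = tokens(names, strip_spaces=False)
without_spaces = tokens(names, strip_spaces=True)

print(with_spaces)     # "sam" is shared by both people
print(without_spaces)  # each person is a single, distinct token
```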

In [227]:
Movies["genres"]=Movies["genres"].apply(lambda x: [i.replace(" ","") for i in x])

In [228]:
Movies["keywords"]=Movies["keywords"].apply(lambda x: [i.replace(" ","") for i in x])

In [229]:
Movies["crew"]=Movies["crew"].apply(lambda x: [i.replace(" ","") for i in x])

In [230]:
Movies["cast"]=Movies["cast"].apply(lambda x: [i.replace(" ","") for i in x])

In [231]:
Movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [232]:
# Creating Feature Tags

In [233]:
Movies["tags"]=Movies["overview"]+Movies["genres"]+Movies["keywords"]+Movies["cast"]+Movies["crew"]

In [234]:
Movies["tags"][0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.',
 'Action',
 'Adventure',
 'Fantasy',
 'ScienceFiction',
 'cultureclash',
 'future',
 'spacewar',
 'spacecolony',
 'society',
 'spacetravel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alienplanet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'loveaffair',
 'antiwar',
 'powerrelations',
 'mindandsoul',
 '3d',
 'SamWorthington',
 'ZoeSaldana',
 'SigourneyWeaver',
 'JamesCameron']

In [235]:
# Creating new dataframe containing ID, Title and tags
Data=Movies[["movie_id","title","tags"]]

In [236]:
Data

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is..."


In [237]:
# Converting List of words to strings in tags feature
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
Data=Data.assign(tags=Data["tags"].apply(lambda x:" ".join(x)))

In [238]:
Data["tags"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [239]:
# Converting strings in tag feature to lower case
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
Data=Data.assign(tags=Data["tags"].str.lower())

In [240]:
Data["tags"][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

### Vectorization

scikit-learn's CountVectorizer converts a collection of text documents into a matrix in which each row represents a document (here, one movie's tags) and each column holds the count of one term. Setting max_features=10000 limits the vocabulary to the 10,000 most frequent terms, which reduces dimensionality. The parameter stop_words="english" removes common English stop words such as "is", "in", and "the", because they carry little meaning and add noise to the analysis. The result is a numeric representation of each movie that can be compared mathematically.
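
To make the idea concrete, here is a simplified bag-of-words sketch in plain Python. It is not how CountVectorizer is implemented: `bag_of_words` and the tiny `STOP_WORDS` set are hypothetical, tokenization is a plain whitespace split, and the documents are made up. It only illustrates the three ideas from the paragraph above: stop-word removal, a vocabulary capped at the most frequent terms, and one row of counts per document.

```python
from collections import Counter

# Tiny illustrative stop-word set; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "to", "and"}

def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) for a list of lowercase strings."""
    tokenized = [[w for w in d.split() if w not in STOP_WORDS] for d in docs]
    totals = Counter(w for doc in tokenized for w in doc)
    # Keep the most frequent terms, then use alphabetical order for the columns.
    vocab = sorted(w for w, _ in totals.most_common(max_features))
    matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, matrix

docs = ["space alien space battle", "space alien moon", "spy in london"]
vocab, matrix = bag_of_words(docs, max_features=2)

print(vocab)   # the two most frequent terms
print(matrix)  # one row per document, one column per term
```

The third document shares no term with the capped vocabulary, so its row is all zeros; this is the price of limiting `max_features`.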

In [241]:
from sklearn.feature_extraction.text import CountVectorizer

In [242]:
cv=CountVectorizer(max_features=10000,stop_words="english")

In [243]:
vectors=cv.fit_transform(Data["tags"]).toarray()

In [244]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [246]:
len(cv.get_feature_names_out())

10000

In [247]:
cv.get_feature_names_out()[0:50]

array(['000', '007', '10', '100', '10th', '11', '12', '12th', '13', '14',
       '15', '150', '15th', '16', '16th', '17', '17th', '18', '1863',
       '1890', '18th', '18thcentury', '19', '1910', '1910s', '1920',
       '1920s', '1927', '1930s', '1937', '1940s', '1941', '1944', '1945',
       '1950', '1950s', '1955', '1959', '1960', '1960s', '1962', '1964',
       '1965', '1967', '1969', '1970s', '1971', '1972', '1973', '1974'],
      dtype=object)

#### nltk library
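The vocabulary still contains different forms of the same word, such as "action" and "actions", which are counted as two separate terms.
NLTK's PorterStemmer reduces words to a common stem, which normalizes the text and shrinks the vocabulary. Stemming only affects the similarity scores if it is applied before the vectorizer is fitted. In the cell order shown here, `vectors` was computed in In [243], before the tags are stemmed in In [254], so the vectorization cell has to be rerun on the stemmed tags for the stemming to take effect.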

In [248]:
import nltk

In [249]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [250]:
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [251]:
# Example of what it does
print(ps.stem("rain"))
print(ps.stem("raining"))
print(ps.stem("rains"))

rain
rain
rain
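
The sketch below shows why stemming has to come before vectorization: it changes which terms end up in the vocabulary. To stay self-contained it uses `crude_stem`, a hypothetical suffix stripper that is only a stand-in for PorterStemmer and does not reproduce its rules; the documents are made up.

```python
def crude_stem(word):
    """Hypothetical stand-in for PorterStemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vocabulary(docs, stem=None):
    """Sorted set of tokens, optionally stemmed before collection."""
    words = (w for d in docs for w in d.split())
    return sorted({stem(w) if stem else w for w in words})

docs = ["action hero", "actions heroes", "raining rain rains"]
raw_vocab = vocabulary(docs)
stemmed_vocab = vocabulary(docs, stem=crude_stem)

print(raw_vocab)      # seven separate terms
print(stemmed_vocab)  # three terms after stemming
```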


In [252]:
Data["tags"][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [253]:
stem('in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron')

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

In [254]:
# Applying this function to the tags column
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
Data=Data.assign(tags=Data["tags"].apply(stem))

### Similarity

Each movie is now a vector built from its attributes (overview, genres, keywords, cast, and crew). Cosine similarity gives a quantitative measure of how similar two movies are by computing the cosine of the angle between their vectors. In general the score ranges from -1 to 1, but because count vectors contain no negative values, the scores here fall between 0 (no terms in common) and 1 (same direction). This lets us efficiently compare movies and find those that share themes, genres, or people.
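
For two vectors u and v, the score is the dot product divided by the product of the vector lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A minimal hand-rolled version, with made-up count vectors, shows the calculation; `cosine` is a hypothetical helper and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

a = [1, 2, 0, 1]
b = [1, 1, 0, 0]
c = [0, 0, 3, 0]

print(cosine(a, b))  # some overlap: 3 / sqrt(12)
print(cosine(a, c))  # no shared terms
print(cosine(a, a))  # a vector compared with itself
```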

In [255]:
from sklearn.metrics.pairwise import cosine_similarity

In [256]:
cosine_similarity(vectors)

array([[1.        , 0.07644708, 0.05096472, ..., 0.02111002, 0.02272727,
        0.        ],
       [0.07644708, 1.        , 0.05714286, ..., 0.02366905, 0.        ,
        0.        ],
       [0.05096472, 0.05714286, 1.        , ..., 0.02366905, 0.        ,
        0.        ],
       ...,
       [0.02111002, 0.02366905, 0.02366905, ..., 1.        , 0.06333005,
        0.04174829],
       [0.02272727, 0.        , 0.        , ..., 0.06333005, 1.        ,
        0.04494666],
       [0.        , 0.        , 0.        , ..., 0.04174829, 0.04494666,
        1.        ]])

In [260]:
cosine_similarity(vectors).shape

(4806, 4806)

In [258]:
Similarity=cosine_similarity(vectors)

In [269]:
Similarity[0]# tells similarity of first movie with other movies

array([1.        , 0.07644708, 0.05096472, ..., 0.02111002, 0.02272727,
       0.        ])

#### Explanation
* `Similarity[0]` holds the similarity of the first movie (Avatar) to every movie in the dataset. The diagonal of the matrix is always 1, because each movie is perfectly similar to itself.


### Recommendation Function

We now create a `recommend` function that looks up the index of a given movie and reads its row of cosine similarity scores from the precomputed similarity matrix. It then sorts the scores, takes the 10 movies with the highest similarity (skipping the first entry, which is the input movie itself), and prints their titles. The recommendations are based on shared content such as genres, keywords, cast, and crew.

In [266]:
# Index of movies
Data[Data["title"]=="John Carter"].index[0]

4

In [267]:
# pairs each position with the similarity between the first movie and the movie at that position
list(enumerate(Similarity[0]))

[(0, 0.9999999999999999),
 (1, 0.07644707871564382),
 (2, 0.05096471914376255),
 (3, 0.03270349700838643),
 (4, 0.15231794896123563),
 (5, 0.09984038297885897),
 (6, 0.01696133828881896),
 (7, 0.15231794896123563),
 (8, 0.051708768999501914),
 (9, 0.06396021490668313),
 (10, 0.09090909090909091),
 (11, 0.06978631577988531),
 (12, 0.07756315349925287),
 (13, 0.03739787960033829),
 (14, 0.0978231976089037),
 (15, 0.03798685881987931),
 (16, 0.06897007348075543),
 (17, 0.12704195151210987),
 (18, 0.09731236802019037),
 (19, 0.07106690545187015),
 (20, 0.049642754424112145),
 (21, 0.07537783614444091),
 (22, 0.052486388108147805),
 (23, 0.08398387664337814),
 (24, 0.047088160934801115),
 (25, 0.045454545454545456),
 (26, 0.14301938838683886),
 (27, 0.15194743527951723),
 (28, 0.09417632186960223),
 (29, 0.05653337710833068),
 (30, 0.06333004963811235),
 (31, 0.1399731277389636),
 (32, 0.07720914364821829),
 (33, 0.08703882797784893),
 (34, 0.0),
 (35, 0.08122955416108235),
 (36, 0.15559397

In [268]:
# Now we sort by similarity in descending order; enumerate keeps each score paired with its position in the list.
sorted(list(enumerate(Similarity[0])),reverse=True,key=lambda x:x[1])

[(0, 0.9999999999999999),
 (539, 0.23268946049775863),
 (2409, 0.215365246126974),
 (1194, 0.21105794120443455),
 (1216, 0.2110579412044345),
 (260, 0.2038588765750502),
 (582, 0.20327890704543544),
 (507, 0.20259510388803326),
 (1204, 0.2000494010134742),
 (1920, 0.20004940101347418),
 (3730, 0.19738550848793068),
 (74, 0.19382872151427658),
 (83, 0.1918806447200494),
 (322, 0.1906925178491185),
 (3608, 0.18993429409939655),
 (495, 0.18860838403857944),
 (2333, 0.18815445225159544),
 (47, 0.18698939800169148),
 (4192, 0.18463723646899916),
 (942, 0.18370235837851734),
 (61, 0.18090680674665818),
 (466, 0.18090680674665818),
 (972, 0.17588161767036212),
 (4405, 0.17407765595569785),
 (1329, 0.17094086468945693),
 (2971, 0.16960013132499205),
 (973, 0.16898159235484367),
 (305, 0.1682904582015223),
 (1201, 0.1679677532867563),
 (1344, 0.16685595311797868),
 (2204, 0.16685595311797866),
 (4048, 0.1657596304538454),
 (3327, 0.16448792373994225),
 (1444, 0.16070608663330624),
 (220, 0.1599

In [270]:
Data["title"][0]

'Avatar'

In [271]:
Data["title"][539]

'Titan A.E.'

In [272]:
Data["title"][1216]

'Aliens vs Predator: Requiem'

In [273]:
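def recommend(movie):
    # Look up the row position (not the index label): Similarity is indexed by
    # position, and labels no longer match positions after dropna
    matches=np.flatnonzero((Data["title"]==movie).to_numpy())
    if len(matches)==0:
        print(f"{movie} not found in the dataset")
        return
    scores=Similarity[matches[0]]
    # Sort by similarity, highest first; skip entry 0 (the movie itself), keep the next 10
    movies_list_top_ten=sorted(enumerate(scores),reverse=True,key=lambda x:x[1])[1:11]
    for i in movies_list_top_ten:
        print(Data.iloc[i[0]].title)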

In [278]:
recommend("Batman Begins")

The Dark Knight
The Dark Knight Rises
Batman
Batman
Batman & Robin
Amidst the Devil's Wings
Batman v Superman: Dawn of Justice
Batman Forever
Defendor
Dead Man Down
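
Sorting the whole row and dropping the first entry assumes the queried movie always ranks first, which may not hold when another row has an identical score (for example, rows that share the same tags). A possible variation, sketched here with a hypothetical `top_k_similar` helper and made-up scores, excludes the query by position and uses `heapq.nlargest` so that only the top k entries are ranked.

```python
import heapq

def top_k_similar(scores, index, k=10):
    """Positions of the k highest scores, skipping the query row itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != index)
    return [i for i, _ in heapq.nlargest(k, candidates, key=lambda pair: pair[1])]

scores = [1.0, 0.08, 0.05, 0.03, 0.15, 0.10]  # made-up similarity row for movie 0
top = top_k_similar(scores, index=0, k=3)

# A tie with the query row: position 0 scores 1.0 against query position 1.
tied = top_k_similar([1.0, 1.0, 0.2], index=1, k=1)

print(top, tied)
```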
