<a href="https://colab.research.google.com/github/CodesByVishal/Movie-Reccomendation-System-ML-Project/blob/main/Movie_Reccomendation_System_ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Movies hold universal appeal, connecting people of all backgrounds. Despite this unity, our individual movie preferences remain distinct, ranging from specific genres like thrillers, romance, or sci-fi to focusing on favorite actors and directors. While it’s challenging to generalize movies everyone would enjoy, data scientists analyze behavioral patterns to identify groups of similar movie preferences in society. As data scientists, we extract valuable insights from audience behavior and movie attributes to develop the “Movie Recommendation System.”

Movie recommendation systems are not just about convenience; they represent a fascinating intersection of data science, machine learning, and user experience design. By analyzing vast amounts of data, such as your viewing history, ratings, and even the time you spend watching certain genres, these systems can make highly personalized recommendations that keep you engaged and satisfied. One of the most famous websites for movie recommendations is IMDb. Let’s delve into the fundamentals of the “Movie Recommendation System” to unlock the magic of personalized movie suggestions using machine learning algorithms.

# What is a Recommendation System?
A Recommendation System is a filtering program whose prime goal is to predict a user’s “rating” or “preference” for a domain-specific item or set of items. In our case, the domain-specific item is a movie. The main focus of our recommendation system is therefore to filter the catalog down to the movies a user is likely to prefer, given some data about that user.

# Why Recommendation Systems?
## Recommendation Systems are essential for several reasons:

- Recommendation Systems offer personalized suggestions based on user preferences and user ratings, ensuring that users discover content and products that are relevant and interesting to them.
- By providing tailored recommendations, users are more likely to engage with the platform, increasing user satisfaction and retention.
- E-commerce platforms like Amazon use recommendation engines to promote products, leading to higher sales and revenue as users discover and purchase items they might not have considered.
- In today’s digital landscape, recommendation systems help users navigate the overwhelming amount of available content, making it easier to find a particular movie or series in whichever language they prefer, such as English or Korean.
- Recommendation algorithms expose users to new and diverse content, expanding their horizons and introducing them to items they might have overlooked.
- Relying on past behavior and preferences, recommendation systems help users make informed decisions about complex and subjective choices such as movies, music, or books.



In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Loading Dataset
movie = pd.read_csv('/content/drive/MyDrive/Moive Reccomendation ML Project/tmdb_5000_movies.csv')
credits = pd.read_csv('/content/drive/MyDrive/Moive Reccomendation ML Project/tmdb_5000_credits.csv')

In [4]:
# Overview of Movie data
movie.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [5]:
# Overview of crew data
credits.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
# Merging the movie and credits tables on their shared 'title' column
movies = movie.merge(credits, on='title')
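
One thing worth checking with a title-based join is whether any titles repeat: if two different films share a title, every matching pair is emitted and the merged table gains extra rows. The sketch below illustrates this on made-up data (the frames `toy_movie` and `toy_credits` are hypothetical, not the TMDB files) and shows a key-based join on `id` / `movie_id` as one possible alternative. It is an illustration of the behavior, not a change to the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy tables; two different films share the title "Batman".
toy_movie = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs each "Batman" row with both "Batman" credits rows.
by_title = toy_movie.merge(toy_credits, on="title")

# Joining on the numeric key keeps exactly one row per film.
by_id = toy_movie.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```

Comparing `len()` before and after a merge is a quick way to notice this kind of row inflation.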

In [7]:
# Sampling 5 random rows from the merged dataset
movies.sample(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
2819,12000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",http://www.sonypictures.com/movies/thinklikeaman/,67660,"[{""id"": 5999, ""name"": ""advice""}, {""id"": 9673, ...",en,Think Like a Man,The balance of power in four couples’ relation...,6.035152,"[{""name"": ""Rainforest Films"", ""id"": 1309}]",...,122.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Let the mind games begin.,Think Like a Man,6.9,281,67660,"[{""cast_id"": 6, ""character"": ""Dominic"", ""credi...","[{""credit_id"": ""586ecf54c3a3683b8500bafb"", ""de..."
3607,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,13776,"[{""id"": 10738, ""name"": ""diner""}, {""id"": 158938...",en,Diner,"Set in 1959, Diner shows how five young men re...",5.87193,"[{""name"": ""Metro-Goldwyn-Mayer (MGM)"", ""id"": 8...",...,110.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Suddenly, life was more than French fries, gra...",Diner,6.9,83,13776,"[{""cast_id"": 1, ""character"": ""Edward 'Eddie' S...","[{""credit_id"": ""52fe459a9251416c7505c1fd"", ""de..."
2622,15000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.filminfocus.com/film/in_bruges,8321,"[{""id"": 167221, ""name"": ""bruges belgium""}, {""i...",en,In Bruges,"Ray and Ken, two hit men, are in Bruges, Belgi...",25.329493,"[{""name"": ""Blueprint Pictures"", ""id"": 2376}, {...",...,107.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Shoot first. Sightsee later.,In Bruges,7.4,1388,8321,"[{""cast_id"": 11, ""character"": ""Ray"", ""credit_i...","[{""credit_id"": ""52fe449ec3a36847f80a06b7"", ""de..."
3989,0,"[{""id"": 53, ""name"": ""Thriller""}, {""id"": 28, ""n...",,356483,"[{""id"": 2210, ""name"": ""climate change""}, {""id""...",en,Unnatural,Global climate change prompts a scientific cor...,0.687656,"[{""name"": ""August Heart Entertainment"", ""id"": ...",...,89.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Some things were never meant to be.,Unnatural,4.3,12,356483,"[{""cast_id"": 0, ""character"": ""Martin Nakos"", ""...","[{""credit_id"": ""56247aa892514171c501082d"", ""de..."
830,30000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,393,"[{""id"": 380, ""name"": ""brother brother relation...",en,Kill Bill: Vol. 2,The Bride unwaveringly continues on her roarin...,50.622607,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,136.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,The bride is back for the final cut.,Kill Bill: Vol. 2,7.6,3948,393,"[{""cast_id"": 1, ""character"": ""Beatrix 'The Bri...","[{""credit_id"": ""52fe423ec3a36847f800efa5"", ""de..."


In [8]:
movies.shape

(4809, 23)

# Starting EDA of the dataset


In [9]:
#checking null values
null = pd.DataFrame(movies.isnull().sum()).reset_index()
null = null.rename(columns={'index':'column', 0:'null_count'})
null['percentage'] = round((null['null_count']/len(movies))*100,2)
null

Unnamed: 0,column,null_count,percentage
0,budget,0,0.0
1,genres,0,0.0
2,homepage,3096,64.38
3,id,0,0.0
4,keywords,0,0.0
5,original_language,0,0.0
6,original_title,0,0.0
7,overview,3,0.06
8,popularity,0,0.0
9,production_companies,0,0.0
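
The same null summary can be built a little more compactly, since the mean of a boolean mask is the fraction of missing values. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a mostly-empty column, similar in spirit to 'homepage'.
toy = pd.DataFrame({
    "homepage": ["http://example.com", np.nan, np.nan, np.nan],
    "overview": ["text", "text", "text", np.nan],
    "budget": [1, 2, 3, 4],
})

null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    # mean() of True/False values gives the share of nulls per column
    "percentage": (toy.isnull().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

print(null_summary)
```

Sorting by `null_count` puts the columns that most need attention at the top.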


In [10]:
movies['homepage'].sample(10)

367                 http://www.theinterpretermovie.com/
3937                                                NaN
2813                 http://www.theperfectgamemovie.com
2694                                                NaN
2622          http://www.filminfocus.com/film/in_bruges
3170    http://www.newline.com/properties/setitoff.html
292                         http://www.eragonmovie.com/
1348                          http://www.seewinter.com/
1973                                                NaN
3076                       http://www.iamlovemovie.com/
Name: homepage, dtype: object

In [11]:
movies.drop('homepage', axis = 1, inplace = True )


In [12]:
movies.drop('tagline', axis = 1, inplace = True)

In [13]:
movies.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'title', 'vote_average', 'vote_count',
       'movie_id', 'cast', 'crew'],
      dtype='object')

In [14]:
# Dropping rows that still contain null values
movies.dropna(inplace=True)
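
Note that `dropna()` with no arguments removes every row that has a null in any column. When only some columns matter for the later steps, the `subset` parameter limits the check to those columns. A small sketch on hypothetical data (the frame `toy` is an assumption, not part of the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["text", np.nan, "text"],
    "runtime": [100.0, 90.0, np.nan],
})

# A null in any column removes the row: B and C are both dropped.
all_cols = toy.dropna()

# Only nulls in 'overview' count: just B is dropped.
only_overview = toy.dropna(subset=["overview"])

print(list(all_cols["title"]), list(only_overview["title"]))
```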

In [15]:
movies.isna().sum()

budget                  0
genres                  0
id                      0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
title                   0
vote_average            0
vote_count              0
movie_id                0
cast                    0
crew                    0
dtype: int64

In [16]:
# Checking duplicate values
movies.duplicated().sum()

0

In [17]:
movies.head(5)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,...,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",...,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [18]:
movies.describe()

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count,movie_id
count,4805.0,4805.0,4805.0,4805.0,4805.0,4805.0,4805.0,4805.0
mean,29048660.0,56855.301561,21.509226,82343600.0,106.909469,6.094527,690.902185,56855.301561
std,40714840.0,88195.44646,31.810774,162888300.0,22.551937,1.18772,1234.542302,88195.44646
min,0.0,5.0,0.000372,0.0,0.0,0.0,0.0,5.0
25%,800000.0,9009.0,4.682881,0.0,94.0,5.6,54.0,9009.0
50%,15000000.0,14608.0,12.929525,19184020.0,104.0,6.2,236.0,14608.0
75%,40000000.0,58431.0,28.350927,92921200.0,118.0,6.8,738.0,58431.0
max,380000000.0,447027.0,875.581305,2787965000.0,338.0,10.0,13752.0,447027.0


In [19]:
# ast.literal_eval parses a string such as '[{"id": 28, "name": "Action"}]'
# into a real Python list of dictionaries
import ast

In [20]:
#Creating a function to get the name of genres
def convert(obj):
  l = []
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l
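
To make the parsing step concrete, here is a minimal standalone sketch of the same idea written as a list comprehension. The helper name `names_from` is hypothetical. The sample string also happens to be valid JSON, so `json.loads` gives the same result here; whether that holds for every row of the dataset is an assumption that would need checking.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from(raw_text):
    """Return the "name" value of every dict in a stringified list."""
    return [entry["name"] for entry in ast.literal_eval(raw_text)]

names = names_from(raw)

# Alternative parser for comparison; works when the text is valid JSON.
names_via_json = [entry["name"] for entry in json.loads(raw)]

print(names)
```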

In [21]:
movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [22]:
#Applying the convert function on genre column
movies['genres'] = movies['genres'].apply(convert)

In [23]:
movies['genres'][0]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [24]:
movies['keywords'][0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [25]:
#Applying the convert function on keywords column
movies['keywords'] = movies['keywords'].apply(convert)

In [26]:
movies['keywords'][0]

['culture clash',
 'future',
 'space war',
 'space colony',
 'society',
 'space travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love affair',
 'anti war',
 'power relations',
 'mind and soul',
 '3d']

In [27]:
movies['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [28]:
# Creating a function to get the names of the first 5 cast members
def convert_top_5(obj):
  l = []
  count = 0
  for i in ast.literal_eval(obj):
    if count !=5:
      l.append(i['name'])
      count = count+1
    else:
      break

  return l
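
The counter-and-break loop above can also be expressed with a slice, which stops after the first `limit` entries and copes with casts shorter than five. A sketch with a hypothetical helper `first_names` and made-up cast data:

```python
import ast

def first_names(raw_text, limit=5):
    """Return the names of the first `limit` entries in a stringified list of dicts."""
    return [entry["name"] for entry in ast.literal_eval(raw_text)[:limit]]

# Hypothetical cast of seven, serialized the way literal_eval can read back.
raw_cast = str([{"name": f"Actor {n}", "order": n} for n in range(7)])

top = first_names(raw_cast)
print(top)
```

This relies on the list already being in billing order, as the `order` field in the cast sample suggests.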


In [29]:
#Applying the convert_top_5 function on cast column
movies['cast'] = movies['cast'].apply(convert_top_5)

In [30]:
movies['cast'][0]

['Sam Worthington',
 'Zoe Saldana',
 'Sigourney Weaver',
 'Stephen Lang',
 'Michelle Rodriguez']

In [31]:
#Creating a function to get the Director name from crew
def director(obj):
  l = []
  for i in ast.literal_eval(obj):
    if i['job'] =='Director':
      l.append(i['name'])
      break
  return l
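
The function returns a one-element list when a director is found and an empty list otherwise, which keeps the column consistent with the other list-valued columns. An equivalent sketch using `next` with a default (the name `first_director` is hypothetical; the sample crew entries are shortened from the Avatar row):

```python
import ast

def first_director(raw_text):
    """Return [name] for the first crew entry whose job is "Director", else []."""
    crew = ast.literal_eval(raw_text)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [name] if name is not None else []

raw_crew = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
    '{"job": "Director", "name": "James Cameron"}]'
)

print(first_director(raw_crew))
```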

In [32]:
movies['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [33]:
#Applying the director function on crew column to get the director name
movies['crew'] = movies['crew'].apply(director)

In [34]:
movies['crew'][0]

['James Cameron']

In [35]:
movies['production_companies'][0]

'[{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporation", "id": 306}, {"name": "Dune Entertainment", "id": 444}, {"name": "Lightstorm Entertainment", "id": 574}]'

In [36]:
#Applying the convert function on production_companies column
movies['production_companies'] = movies['production_companies'].apply(convert)

In [37]:
movies['production_companies'][0]

['Ingenious Film Partners',
 'Twentieth Century Fox Film Corporation',
 'Dune Entertainment',
 'Lightstorm Entertainment']

In [38]:
movies['production_countries'][0]

'[{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United Kingdom"}]'

In [39]:
#Applying the convert function on production_countries column
movies['production_countries'] = movies['production_countries'].apply(convert)

In [40]:
movies['production_countries'][0]

['United States of America', 'United Kingdom']

In [41]:
movies['spoken_languages'][0]

'[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\\u00f1ol"}]'

In [42]:
#Applying the convert function on spoken_languages column
movies['spoken_languages'] = movies['spoken_languages'].apply(convert)

In [43]:
movies['spoken_languages'][0]

['English', 'Español']

In [44]:
movies.sample(10)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,...,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,movie_id,cast,crew
883,52000000,"[Drama, Crime]",640,"[con man, biography, fbi agent, overhead camer...",en,Catch Me If You Can,"A true story about Frank Abagnale Jr. who, bef...",73.944049,"[Kemp Company, Splendid Pictures, Parkes/MacDo...",[United States of America],...,352114312,141.0,"[English, Français]",Released,Catch Me If You Can,7.7,3795,640,"[Leonardo DiCaprio, Tom Hanks, Christopher Wal...",[Steven Spielberg]
160,145000000,"[Fantasy, Action, Adventure, Animation, Comedy...",82702,"[father son relationship, wife husband relatio...",en,How to Train Your Dragon 2,The thrilling second chapter of the epic How T...,100.21391,"[DreamWorks Animation, Mad Hatter Entertainment]",[United States of America],...,609123048,102.0,[English],Released,How to Train Your Dragon 2,7.6,3106,82702,"[Jay Baruchel, Gerard Butler, Kristen Wiig, Jo...",[Dean DeBlois]
2254,0,"[Comedy, Drama, Foreign]",14652,[],fr,Bon voyage,Isabelle Adjani and Gerard Depardieu star in d...,0.49446,[],[France],...,0,114.0,"[Italiano, Français, Deutsch, English]",Released,Bon voyage,5.7,20,14652,"[Grégori Derangère, Gérard Depardieu, Isabelle...",[Jean-Paul Rappeneau]
2097,19000000,"[Crime, Drama, Thriller]",274,"[based on novel, psychopath, horror, suspense,...",en,The Silence of the Lambs,"FBI trainee, Clarice Starling ventures into a ...",18.174804,"[Orion Pictures, Strong Heart/Demme Production]",[United States of America],...,272742922,119.0,[English],Released,The Silence of the Lambs,8.1,4443,274,"[Jodie Foster, Anthony Hopkins, Scott Glenn, T...",[Jonathan Demme]
2096,20000000,[Comedy],10678,"[prison, ex-boyfriend, support, escape, lawyer]",en,Bringing Down the House,"Straight-laced lawyer, Peter Sanderson (Steve ...",8.351385,"[Hyde Park Films, Touchstone Pictures]",[United States of America],...,132675402,105.0,"[English, Français]",Released,Bringing Down the House,5.4,183,10678,"[Steve Martin, Queen Latifah, Eugene Levy, Joa...",[Adam Shankman]
876,75000000,"[Mystery, Thriller, Crime]",11456,"[menace, adoption, dangerous, adoptive father,...",en,Domestic Disturbance,A divorced father discovers that his 12-year-o...,8.41856,[Paramount Pictures],[United States of America],...,54249294,89.0,[English],Released,Domestic Disturbance,5.4,113,11456,"[John Travolta, Vince Vaughn, Teri Polo, Matt ...",[Harold Becker]
2218,20000000,"[Action, Thriller]",146198,"[heist, betrayal, dirty cop]",en,Triple 9,A gang of criminals and corrupt cops plan the ...,29.371987,"[Worldview Entertainment, Anonymous Content, M...",[United States of America],...,12639297,115.0,[English],Released,Triple 9,5.6,797,146198,"[Casey Affleck, Chiwetel Ejiofor, Woody Harrel...",[John Hillcoat]
4767,0,[Comedy],222250,[],en,A True Story,Mike and Matt own nothing and share everything...,0.970351,[Team Awesome Films],[],...,0,96.0,[],Released,A True Story,6.8,2,222250,"[Katrina Bowden, Jon Gries, Malcolm Goodwin, C...",[Malcolm Goodwin]
2228,20000000,"[Action, Comedy, Crime]",14396,[],en,Code Name: The Cleaner,"Cedric the Entertainer plays Jake, a seemingly...",9.36146,[New Line Cinema],[United States of America],...,10337477,84.0,[English],Released,Code Name: The Cleaner,4.7,78,14396,"[Cedric the Entertainer, Lucy Liu, Nicollette ...",[Les Mayfield]
4697,0,"[Drama, Mystery, Science Fiction, Thriller]",36549,[independent film],en,Yesterday Was a Lie,Hoyle a girl with a sharp mind and a weakness ...,0.145014,[],[United States of America],...,0,89.0,[English],Released,Yesterday Was a Lie,6.0,4,36549,"[Chase Masterson, John Newton, Kipleigh Brown,...",[James Kerwin]


In [45]:
movies.iloc[4695:4696]

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,...,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,movie_id,cast,crew
4699,0,"[Thriller, Drama, Science Fiction]",289180,[woman director],en,H.,H. is a modern interpretation of a classic Gre...,1.045623,[],"[United States of America, Argentina]",...,0,93.0,[English],Released,H.,6.5,4,289180,"[Robin Bartlett, Rebecca Dayan, Will Janowitz,...",[Rania Attieh]


In [46]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4805 entries, 0 to 4808
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4805 non-null   int64  
 1   genres                4805 non-null   object 
 2   id                    4805 non-null   int64  
 3   keywords              4805 non-null   object 
 4   original_language     4805 non-null   object 
 5   original_title        4805 non-null   object 
 6   overview              4805 non-null   object 
 7   popularity            4805 non-null   float64
 8   production_companies  4805 non-null   object 
 9   production_countries  4805 non-null   object 
 10  release_date          4805 non-null   object 
 11  revenue               4805 non-null   int64  
 12  runtime               4805 non-null   float64
 13  spoken_languages      4805 non-null   object 
 14  status                4805 non-null   object 
 15  title                 4805

# Data Manipulation for ML Part

In [49]:
# making sub-dataset
movies = movies[['movie_id', 'title',   'overview','genres', 'keywords','cast','crew']]

In [50]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]


In [51]:
# Splitting the overview text into a list of words so it can be concatenated with the other list columns
movies['overview']= movies['overview'].apply(lambda x: x.split())


In [52]:
#Removing space between cast name, crew name, keywords and genres
movies['cast']=movies['cast'].apply(lambda x : [i.replace(' ','') for i in x ])
movies['crew']=movies['crew'].apply(lambda x : [i.replace(' ','') for i in x ])
movies['keywords']=movies['keywords'].apply(lambda x : [i.replace(' ','') for i in x ])
movies['genres']=movies['genres'].apply(lambda x : [i.replace(' ','') for i in x ])
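
The reason for gluing multi-word names together is that the tags are later split on whitespace. If the space were kept, two unrelated people who share a first name would appear to share a token. A small sketch of that effect (the helper `tokens` is hypothetical; the two names come from the Avatar and Spectre rows):

```python
def tokens(names, join_names):
    """Lower-case whitespace tokens for a list of names, optionally removing inner spaces first."""
    out = set()
    for name in names:
        if join_names:
            name = name.replace(" ", "")
        out.update(name.lower().split())
    return out

film_a = ["Sam Worthington"]  # cast member of Avatar
film_b = ["Sam Mendes"]       # director of Spectre

shared_raw = tokens(film_a, False) & tokens(film_b, False)
shared_joined = tokens(film_a, True) & tokens(film_b, True)

print(shared_raw, shared_joined)
```

With spaces kept, the two films overlap on `sam`; with spaces removed, they share nothing, which is the intended behavior.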

In [53]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]


In [54]:
#merging different columns to get a single column with tags
movies['tags']= movies['overview']+ movies['genres']+ movies['keywords']+movies['cast']+movies['crew']

In [55]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [56]:
#Creating new data frame with desired columns
new_df = movies[['movie_id', 'title','tags']]
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [57]:
#Applying lambda function for converting list into string
new_df['tags'] = new_df['tags'].apply(lambda x : " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x : " ".join(x))
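
The warning above appears because `new_df` was created by selecting columns from `movies`, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it is to take an explicit copy when creating the subset. A minimal sketch on made-up data (`toy_movies` and `toy_new_df` are hypothetical names):

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "tags": [["x", "y"], ["z"]],
    "extra": [0, 0],
})

# .copy() makes the subset an independent DataFrame, so later assignments are unambiguous.
toy_new_df = toy_movies[["movie_id", "title", "tags"]].copy()
toy_new_df["tags"] = toy_new_df["tags"].apply(lambda x: " ".join(x))

print(toy_new_df["tags"].tolist())
```

The original frame keeps its list-valued `tags` column; only the copy is modified.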


In [58]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [59]:
# Changing tags to lower case for ease in future
new_df['tags'] = new_df['tags'].apply(lambda x : x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x : x.lower())


In [60]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez jamescameron'

In [61]:
# Importing NLTK to use the Porter stemmer, which reduces word variants (e.g. action/actions, actor/actors) to a common stem
import nltk

In [62]:
from nltk.stem.porter import PorterStemmer


In [63]:
ps = PorterStemmer()

In [64]:
# Creating a function that stems every word in a text
def stem(text):
  y = []
  for i in text.split():
    y.append(ps.stem(i))
  return " ".join(y)
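
A quick standalone check of what stemming does to a handful of words, written as a sketch with a hypothetical helper `stem_text` that mirrors the function above:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
    """Stem each whitespace-separated word and join the results back into one string."""
    return " ".join(ps.stem(word) for word in text.split())

stemmed = stem_text("action actions adventure fantasy dispatched")
print(stemmed)
```

Because the text is split on whitespace only, a word with trailing punctuation is passed to the stemmer with that punctuation attached, so it may not be reduced the same way as the bare word.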

In [65]:
# applying the stem function
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)


In [66]:
new_df['tags'][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav stephenlang michellerodriguez jamescameron'

In [67]:
# Building a bag-of-words vectorizer that drops English stop words and keeps the 5,000 most frequent terms
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')

In [68]:
# converting each movie to a vector
vectors = cv.fit_transform(new_df['tags']).toarray()

In [69]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0])