![alt text](bg1.jpg)

# About The Dataset

# Overview:
The TMDB 5000 Movie Dataset with Ratings merges the original TMDb 5000 Movie Dataset with additional user ratings. It is intended for data exploration, research, and machine-learning work on movie metadata and audience ratings.

# Datasets Included:
### 1. tmdb_5000_movies:

#### Columns:

* `budget` : The budget allocated for the movie.
* `genres` : Genres associated with the movie.
* `homepage` : The homepage URL of the movie.
* `movie_id` : The unique identifier assigned by TMDb.
* `keywords` : Keywords or tags related to the movie.
* `original_language` : The original language of the movie.
* `original_title` : The original title of the movie.
* `overview` : A brief overview or synopsis of the movie.
* `popularity` : Popularity score of the movie.
* `production_companies` : Companies involved in the movie's production.
* `production_countries` : Countries where the movie was produced.
* `release_date` : The date when the movie was released.
* `revenue` : The revenue generated by the movie.
* `runtime` : The duration of the movie in minutes.
* `spoken_languages` : Languages spoken in the movie.
* `status` : The production status of the movie.
* `tagline` : A tagline associated with the movie.
* `title` : The title of the movie.
* `vote_average` : Average user rating.
* `vote_count` : The number of votes the movie received.
* `ratingId` : Unique identifier for ratings.

### 2. tmdb_5000_credits:

#### Columns:

* `movie_id` : The unique identifier assigned by TMDb.
* `title` : The title of the movie.
* `cast` : Cast members of the movie.
* `crew` : Crew members involved in the movie.

### 3. tmdb_movie_ratings:

#### Columns:
 
* `userId` : Unique identifier for users.
* `ratingId` : Unique identifier for ratings.
* `rating` : User rating for the movie.
* `timestamp` : The timestamp when the rating was given.

# Key Features:

* `Comprehensive Movie Details` : Explore detailed information about 5000 movies, including budget, genres, production details, and more.
* `User Ratings` : Gain insight into audience reception through user ratings, supporting analysis of audience preferences.
* `Cast and Crew Information` : Delve into the cast and crew details for each movie, providing a comprehensive understanding of the creative forces behind the scenes.

# Potential Use Cases:

* `Predictive Modeling` : Develop machine learning models to predict user ratings based on various movie attributes.
* `Exploratory Data Analysis (EDA)` : Uncover trends and patterns in movie-related features, exploring correlations between budget, revenue, and user ratings.
* `Content Recommendation` : Leverage user ratings to build recommendation systems for personalized movie suggestions.

# Dataset Sources: 

1. [https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)

2. [https://kaggle.com/datasets/aayushsoni4/tmdb-5000-movie-dataset-with-ratings](https://kaggle.com/datasets/aayushsoni4/tmdb-5000-movie-dataset-with-ratings)

# Data Preprocessing

## Import necessary libraries

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ast

## Load the Datasets

In [4]:
movies = pd.read_csv('dataset/tmdb_5000_movies.csv')
credit = pd.read_csv('dataset/tmdb_5000_credits.csv') 

## How big is the Data?

In [5]:
movies.shape

(4803, 20)

In [6]:
credit.shape

(4803, 4)

## What does the data look like?

In [7]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [8]:
credit.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Merging the `movies` and `credit` dataframes into a single dataframe `movies` on the `title` column

In [9]:
movies = movies.merge(credit,on='title')
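A caveat on this step: `title` is not guaranteed to be unique, so merging on it can pair one movie's metadata with another movie's credits. The `info()` output further down reports 4809 rows although each input has 4803, which is consistent with a few repeated titles. A minimal sketch on made-up rows (the toy frames and their values are hypothetical), comparing a merge on `title` with a merge on the TMDb identifier (`id` in `movies`, `movie_id` in `credit`):

```python
import pandas as pd

# Two different movies share the title "Batman" (toy data)
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credit_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Merging on title cross-matches the two "Batman" rows (2 x 2 = 4 rows)
by_title = movies_toy.merge(credit_toy, on="title")

# Merging on the identifier keeps exactly one row per movie
by_id = movies_toy.merge(
    credit_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title))  # 5
print(len(by_id))     # 3
```

Whether the identifier merge is preferable for this notebook depends on the two files sharing identifiers for every movie, which the `head()` outputs above suggest but do not prove.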

In [10]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## What is the data type of the columns?

In [11]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

## Selection of the relevant features:
* movie_id
* title (original_title is discarded as it can be in multiple languages)
* overview (if two movies have similar summaries, they can be recommended together)
* release_date
* genres
* keywords
* cast
* crew

In [12]:
movies = movies[['movie_id','title','overview','release_date','genres','keywords','cast','crew']]

In [13]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...",2009-12-10,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",2007-05-19,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## Are there any Missing Values?

In [14]:
movies.isnull().sum()

movie_id        0
title           0
overview        3
release_date    1
genres          0
keywords        0
cast            0
crew            0
dtype: int64

In [15]:
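# Drop rows with missing values and renumber the index so that labels match row positions
movies = movies.dropna().reset_index(drop=True)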
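Index labels matter here because `dropna` keeps the original labels of the surviving rows. Label-based lookups such as `movies['tags'][i]` then stop lining up with row positions, which becomes relevant later when rows are matched against a positional similarity matrix. A minimal sketch on a toy frame (hypothetical data), with `reset_index(drop=True)` as one way to realign labels and positions:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["x", None, "y", "z"],
})

cleaned = toy.dropna()
labels_before = list(cleaned.index)   # a gap remains where row 1 was dropped

cleaned = cleaned.reset_index(drop=True)
labels_after = list(cleaned.index)    # labels are consecutive again

print(labels_before)  # [0, 2, 3]
print(labels_after)   # [0, 1, 2]
```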

## Are there any Duplicated Values?

In [16]:
movies.duplicated().sum()

0

## Converting the release date to the release year only

In [17]:
# Parse release_date as datetime
movies['release_date'] = pd.to_datetime(movies['release_date'])

# Keep only the year, as a string
movies['release_date'] = movies['release_date'].dt.strftime('%Y')

In [18]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [19]:
movies.iloc[0].keywords

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

## We need to convert the stringified list-of-dictionaries format of `genres` and `keywords` into plain Python lists

In [20]:
# '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

# This is a list of dictionaries stored as a string, so the string must first be parsed into a Python list.

# Target format: ['Action','Adventure','Fantasy','Science Fiction']

In [21]:
def string_to_list_convert1(text):
    # Parse the stringified list of dictionaries and keep only each 'name' value
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L
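The cell contents shown above look like JSON, so `json.loads` is a possible alternative to `ast.literal_eval`. The sketch below assumes every cell is valid JSON, which has not been verified for the whole dataset; the helper name `names_from_json` is hypothetical. One practical difference is that `ast.literal_eval` raises an error on JSON literals such as `null` or `true`, whereas `json.loads` accepts them.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

def names_from_json(text):
    """Return the 'name' field of each dictionary in a JSON-encoded list."""
    return [item["name"] for item in json.loads(text)]

via_json = names_from_json(raw)
via_ast = [item["name"] for item in ast.literal_eval(raw)]

print(via_json)  # ['Action', 'Science Fiction']
```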

In [22]:
movies['genres'] = movies['genres'].apply(string_to_list_convert1)
movies.head(2)

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...",2009,"[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",2007,"[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [23]:
movies['keywords'] = movies['keywords'].apply(string_to_list_convert1)
movies.head(2)

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...",2009,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",2007,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## Now from the `cast` column (a stringified list of dictionaries), we select the first 5 listed actors

In [24]:
movies['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [25]:
def string_to_list_convert2(text):
    # Keep the names of the first 5 cast entries only
    L = []
    for i in ast.literal_eval(text)[:5]:
        L.append(i['name'])
    return L

In [26]:
movies['cast'] = movies['cast'].apply(string_to_list_convert2)
movies.head()

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...",2009,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",2007,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,2015,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,2012,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...",2012,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Now from the `crew` column which is a list of dictionaries (in string format), we need to extract the director name where job = 'Director' 

In [27]:
movies['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [28]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [29]:
movies['crew'] = movies['crew'].apply(fetch_director)
movies.head()

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...",2009,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",2007,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,2015,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,2012,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...",2012,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


## Next we need to convert the `overview` column from string to list of words so that they can be easily concatenated with the other columns which are also in List format.

In [30]:
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [31]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.head(2)

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...",2009,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...",2007,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]


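## Now we remove the spaces inside multi-word names and phrases (e.g. `Sam Worthington` becomes `SamWorthington`) so that each one is treated as a single token; otherwise unrelated people who share a first name would appear similar.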

In [32]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
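To see why collapsing matters, consider two different people who share a first name. The sketch below uses plain whitespace tokenization as a simplified stand-in for what a text vectorizer does; the example names come from the outputs above, and `collapse_names` is a hypothetical variant of the helper.

```python
def collapse_names(names):
    """Remove spaces inside each name so it survives tokenization as one token."""
    return [name.replace(" ", "") for name in names]

cast_a = ["Sam Worthington"]
crew_b = ["Sam Mendes"]

# Without collapsing, both records share the token "sam"
raw_a = set(" ".join(cast_a).lower().split())
raw_b = set(" ".join(crew_b).lower().split())
shared_raw = raw_a & raw_b

# After collapsing, each person becomes one distinct token
col_a = set(" ".join(collapse_names(cast_a)).lower().split())
col_b = set(" ".join(collapse_names(crew_b)).lower().split())
shared_collapsed = col_a & col_b

print(shared_raw)        # {'sam'}
print(shared_collapsed)  # set()
```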

In [33]:
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

In [34]:
movies.head()

Unnamed: 0,movie_id,title,overview,release_date,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...",2009,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...",2007,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...",2015,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...",2012,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...",2012,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton]


## Finally we are merging the `overview`, `release_date`, `genres`, `keywords`, `cast` and `crew` columns into a single column named `tags`. 

In [35]:
# Concatenate columns into 'tags' column
movies['tags'] = movies['overview'].apply(lambda x: ' '.join(x)) + ' ' + \
                 movies['release_date'] + ' ' + \
                 movies['genres'].apply(lambda x: ' '.join(x)) + ' ' + \
                 movies['keywords'].apply(lambda x: ' '.join(x)) + ' ' + \
                 movies['cast'].apply(lambda x: ' '.join(x)) + ' ' + \
                 movies['crew'].apply(lambda x: ' '.join(x))


In [36]:
movies['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. 2009 Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver StephenLang MichelleRodriguez JamesCameron'

## Making the final dataframe containing only the `movie_id`, `title` and `tags`

In [37]:
movies = movies[['movie_id','title','tags']]
movies.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [38]:
# Converting the 'tags' column contents to lower case

movies['tags'] = movies['tags'].apply(lambda x: x.lower())
movies.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [39]:
# TODO: try incorporating popularity and year as separate columns and use them alongside tags when recommending movies

In [40]:
movies['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. 2009 action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez jamescameron'

In [41]:
# Counting the number of words in the tags
len(movies['tags'][0].split(" ")) 

60

In [42]:
movies['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. 2007 adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley stellanskarsgård chowyun-fat goreverbinski"

In [43]:
# Counting the number of words in the tags
len(movies['tags'][1].split(" ")) 

60

## Applying Lemmatization over the `tags` column

In [44]:
import spacy
 
# Load the spaCy English model
nlp = spacy.load('en_core_web_md')

In [45]:
def lemmatize(text):
    
    # Process the text using spaCy
    doc = nlp(text) 
    
    # Extract lemmatized tokens
    lemmatized_tokens = [token.lemma_ for token in doc]
    
    # Join the lemmatized tokens into a sentence
    lemmatized_text = ' '.join(lemmatized_tokens)
    
    return lemmatized_text

In [46]:
movies['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. 2009 action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez jamescameron'

In [47]:
lemmatize(movies['tags'][0])

'in the 22nd century , a paraplegic marine be dispatch to the moon pandora on a unique mission , but become tear between follow order and protect an alien civilization . 2009 action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelation mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez jamescameron'

In [48]:
movies['tags'] = movies['tags'].apply(lemmatize)

## Applying Stemming over the `tags` column (alternative to lemmatization, left commented out)

In [49]:
# import nltk
# from nltk.stem.porter import PorterStemmer

# ps = PorterStemmer()

In [50]:
# def stem(text):
#     y = []
    
#     for i in text.split():
#         y.append(ps.stem(i))
        
#     return " ".join(y)

In [51]:
# movies['tags'][0]

In [52]:
# stem(movies['tags'][0])

## Using the TF-IDF vectorizer (a weighted Bag of Words technique), we will convert the textual data into n-dimensional vectors so that we can compute the Cosine similarity between them

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

In [54]:
vectors = tfidf.fit_transform(movies['tags']).toarray()
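To illustrate what the resulting numbers mean, here is a minimal pure-Python sketch of TF-IDF weighting on a toy corpus. It is intended to follow the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but it omits tokenization rules, stop-word removal and `max_features`; the corpus and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "space alien battle",
    "space pirate ship",
    "pirate ship ocean",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents in which each term occurs
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    """Term count times smoothed IDF, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [
        counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
        for t in vocab
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors_toy = [tfidf_vector(doc) for doc in tokenized]

# "alien" occurs in one document, "space" in two, so "alien" gets the larger weight
alien_w = vectors_toy[0][vocab.index("alien")]
space_w = vectors_toy[0][vocab.index("space")]
print(alien_w > space_w)  # True
```

Rarer terms receive larger weights, which is why distinctive tokens such as collapsed actor names can influence similarity more than common words.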

In [55]:
vectors

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [56]:
vectors.shape

(4805, 5000)

In [57]:
tfidf.vocabulary_

{'century': 782,
 'marine': 2821,
 'dispatch': 1334,
 'moon': 3043,
 'pandora': 3308,
 'unique': 4713,
 'mission': 3010,
 'tear': 4474,
 'follow': 1750,
 'order': 3256,
 'protect': 3571,
 'alien': 213,
 'civilization': 893,
 '2009': 81,
 'action': 129,
 'adventure': 160,
 'fantasy': 1657,
 'sciencefiction': 3984,
 'cultureclash': 1129,
 'future': 1822,
 'society': 4201,
 'spacetravel': 4238,
 'futuristic': 1823,
 'romance': 3859,
 'space': 4232,
 'tribe': 4644,
 'alienplanet': 216,
 'soldier': 4205,
 'battle': 475,
 '3d': 96,
 'samworthington': 3943,
 'zoesaldana': 4994,
 'sigourneyweaver': 4126,
 'michellerodriguez': 2964,
 'jamescameron': 2325,
 'captain': 722,
 'long': 2713,
 'believe': 496,
 'dead': 1201,
 'come': 962,
 'life': 2670,
 'head': 2018,
 'edge': 1449,
 'earth': 1436,
 'turner': 4665,
 'elizabeth': 1476,
 'quite': 3611,
 'pron': 3561,
 '2007': 79,
 'ocean': 3220,
 'drugabuse': 1399,
 'exoticisland': 1606,
 'loveofone': 2746,
 'slife': 4177,
 'traitor': 4622,
 'shipwreck'

## We will now use `Cosine Similarity` (based on the angle between the vectors) to measure how similar the movies are. We will not use `Euclidean Distance` (the straight-line distance between vector tips), as it becomes less informative in high-dimensional spaces, a problem known as the "Curse of Dimensionality".

## Note: Cosine distance is 1 - cosine similarity, so a smaller distance means a higher similarity
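To make the relationship concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is commonly defined as one minus that value. A minimal pure-Python sketch (the helper name `cosine_similarity_1d` and the vectors are hypothetical):

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice the length
c = [0.0, 0.0, 3.0]   # orthogonal to a

sim_ab = cosine_similarity_1d(a, b)   # close to 1: same direction
sim_ac = cosine_similarity_1d(a, c)   # 0: nothing in common

dist_ab = 1.0 - sim_ab
dist_ac = 1.0 - sim_ac
```

Vector length does not affect the result, only direction does, which suits documents of different lengths.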

In [58]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(vectors)

In [59]:
similarity

array([[1.        , 0.01979536, 0.02786703, ..., 0.01967063, 0.00514235,
        0.        ],
       [0.01979536, 1.        , 0.01331543, ..., 0.02386098, 0.00304572,
        0.00758164],
       [0.02786703, 0.01331543, 1.        , ..., 0.0252822 , 0.02009737,
        0.00829941],
       ...,
       [0.01967063, 0.02386098, 0.0252822 , ..., 1.        , 0.03111044,
        0.04864359],
       [0.00514235, 0.00304572, 0.02009737, ..., 0.03111044, 1.        ,
        0.03672032],
       [0.        , 0.00758164, 0.00829941, ..., 0.04864359, 0.03672032,
        1.        ]])

In [60]:
similarity.shape

(4805, 4805)

In [61]:
similarity[0]

array([1.        , 0.01979536, 0.02786703, ..., 0.01967063, 0.00514235,
       0.        ])

## Movie Recommendation Function to display top 10 recommended movies similar to user input

In [62]:
def recommend(movie):
    # Use the row position (not the index label), because `similarity` is indexed by position
    index = np.flatnonzero(movies['title'].to_numpy() == movie)[0]
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    # Skip the first entry, which is the movie itself
    for i in distances[1:11]:
        print(movies.iloc[i[0]].title)
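A variant worth considering returns the titles instead of printing them, handles unknown titles, and excludes the query movie explicitly rather than relying on it sorting first. The sketch below is a toy under stated assumptions: `titles`, `similarity_toy` and `recommend_toy` are hypothetical stand-ins for the notebook's `movies` and `similarity`.

```python
import heapq

titles = ["Avatar", "Aliens", "Titanic", "Predators"]
similarity_toy = [
    [1.0, 0.6, 0.1, 0.4],
    [0.6, 1.0, 0.0, 0.7],
    [0.1, 0.0, 1.0, 0.05],
    [0.4, 0.7, 0.05, 1.0],
]

def recommend_toy(title, k=2):
    """Return the k titles most similar to `title`, or [] if it is unknown."""
    if title not in titles:
        return []
    pos = titles.index(title)
    # Pair each score with its row position, leaving out the query movie itself
    scores = [(s, i) for i, s in enumerate(similarity_toy[pos]) if i != pos]
    return [titles[i] for _, i in heapq.nlargest(k, scores)]

print(recommend_toy("Avatar"))   # ['Aliens', 'Predators']
print(recommend_toy("Unknown"))  # []
```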

In [63]:
recommend('Avatar')

Falcon Rising
Battle: Los Angeles
Aliens
Apollo 18
Star Trek Into Darkness
Lifeforce
Meet Dave
Titan A.E.
Predators
Independence Day


In [64]:
recommend('The Conjuring 2')

The Conjuring
Restoration
Insidious: Chapter 2
Witchboard
Darkness Falls
The Grudge
Insidious
Sardaarji
Exorcist II: The Heretic
Higher Ground


In [65]:
import pickle

# Persist the movie table and the similarity matrix for later use
with open('model/movie_dict.pkl','wb') as f:
    pickle.dump(movies.to_dict(), f)

with open('model/similarity.pkl','wb') as f:
    pickle.dump(similarity, f)
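If file size becomes a concern, the pickles can be written through `gzip`. The sketch below shows a compressed round trip on a small stand-in dictionary; the `.pkl.gz` file name and the payload are hypothetical, and whether compression pays off for the dense similarity matrix is an assumption that would need measuring.

```python
import gzip
import os
import pickle
import tempfile

payload = {
    "title": {0: "Avatar", 1: "Aliens"},
    "tags": {0: "space alien", 1: "alien battle"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movie_dict.pkl.gz")

    # Write the pickle through a gzip stream
    with gzip.open(path, "wb") as f:
        pickle.dump(payload, f)

    # Read it back the same way
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)  # True
```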

In [66]:
# Original

'''
Titan A.E.
Aliens
Aliens vs Predator: Requiem
Independence Day
Battle: Los Angeles
Small Soldiers
Predators
Lifeforce
Ender's Game
Falcon Rising
'''

# Using Lemmatization

'''
Independence Day
Aliens
Titan A.E.
Falcon Rising
Aliens vs Predator: Requiem
U.F.O.
Ender's Game
Lifeforce
Small Soldiers
The Fifth Element
'''

# Using Stemming

'''
Titan A.E.
Aliens
Aliens vs Predator: Requiem
Independence Day
Battle: Los Angeles
Small Soldiers
Predators
Lifeforce
Ender's Game
Falcon Rising
'''