### Project - Movies Recommendation System

Dataset - Kaggle TMDB Movies Dataset

Content based Recommendation System

Approach:

1.Merge Both the Dataset (Dataset with Movies till 2016 + Dataset of Movies 2017)

2.Then will Web Scrape for Movies of 2018,2019,2020,2021,2022 (Wikipedia & TMDB Database site)

3.Will Create a Features Called "Detail" that will include the Genres,overview keywords, Top 3 Cast & Lastly Director Names.

4.Now will Create Vector of Each Movies Using the Detail Feature.

5.Movies will be recommended based on Simliar Vector.

6.So When a User Enters his/her Favorite Movies. then Based on that Movies.similar Movies will be Recommended using the cosine similarity technique

In [1]:
#Importing the Required Libaries
import numpy as np
import pandas as pd
import ast
import matplotlib.pyplot as plt

# For Movie Web Scrapping
from tmdbv3api import TMDb
from tmdbv3api import Movie
import json
import requests

#Model & Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#To Save the Model.
import pickle

import warnings
warnings.filterwarnings("ignore")  #--to ignore warnings

In [2]:
#Loading the Dataset
movies = pd.read_csv('C:/Users/Akaash/Downloads/movies_data.csv')
credits = pd.read_csv('C:/Users/Akaash/Downloads/credits.csv')

In [3]:
#Checking movies Data
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
#Checking the Shape of Movies Dataset
movies.shape

(4803, 20)

In [5]:
#Converting the Release_date Feature to DateTime
movies['release_date'] = pd.to_datetime(movies['release_date'], errors='coerce')
#Creating A Year Column Feature
movies['year'] = movies['release_date'].dt.year
#Dropping Original Columns
movies = movies.drop(columns=['release_date'])
movies['year'].value_counts().sort_index()

1916.0      1
1925.0      1
1927.0      1
1929.0      2
1930.0      1
         ... 
2013.0    231
2014.0    238
2015.0    216
2016.0    104
2017.0      1
Name: year, Length: 90, dtype: int64

Inference:
We don't have enough data for the movies from 2017, 2018, 2019,2020 & 2021,We'll deal with it in the upcoming preprocessing steps

In [6]:
#Checking credit Data
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [7]:
#Mergeing Both the Dataset
df = movies.merge(credits,on='title')
df.shape

(4809, 23)

In [8]:
# Will Use these Feature to Create the Recommendation system
df1 = df[['id','title','overview','genres','keywords','cast','crew']]
df1.head()

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [9]:
#Checking NA Values
df1.isnull().sum()

id          0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [10]:
# Dropping the NA Values as Count of NA Value is Small
df1.dropna(inplace=True)
df1.shape

(4806, 7)

In [11]:
#Checking the genres Features for Preprocessing
df1.genres[0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Inference: It is a List of Dictionary, We want only a List of Genres. So will Preprocess this Feature

In [12]:
#To get a list of genres
def convert(text):
    List = []
    for i in ast.literal_eval(text):
        List.append(i['name']) 
    return List

In [13]:
#Calling the Function to get list of genres,keywords,cast
df1['genres'] = df1['genres'].apply(convert)
df1['keywords'] = df1['keywords'].apply(convert)
df1['cast'] = df1['cast'].apply(convert)
df1.head(2)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [14]:
#Getting only First 3 Actors of the Movies
df1['cast'] = df1['cast'].apply(lambda x:x[0:3])
df1.head(2)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [15]:
#Function to Get only the Director Name from Crew Features
def fetch_director(text):
    List = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            List.append(i['name'])
    return List

In [16]:
#Calling the fetch_director function to get Director Name's
df1['crew'] = df1['crew'].apply(fetch_director)
df1.head(2)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [17]:
#Checking the Overview Feature
df1['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

Inference: Overview Feature is a String will Convert this into a List so we can concat with other list

In [18]:
#Converting Overview Feature into a List
df1['overview'] = df1['overview'].apply(lambda x:x.split())
df1.head(2)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [19]:
#To Remove Space between a Single Entity
def collapse(L):
    List = []
    for i in L:
        List.append(i.replace(" ",""))
    return List

In [20]:
#Making Single Entities by Removing Space
df1['cast'] = df1['cast'].apply(collapse)
df1['crew'] = df1['crew'].apply(collapse)
df1['genres'] = df1['genres'].apply(collapse)
df1['keywords'] = df1['keywords'].apply(collapse)
df1.head(3)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]


Inference: Now Will Create a Detail Feature Which will include All Overview,genres,keywords,cast and crew feature 

In [21]:
#Creating Detail Feature
df1['Detail'] = df1['overview'] + df1['genres'] + df1['keywords'] + df1['cast'] + df1['crew']
#Dropping Original Columns
final_df = df1.drop(columns=['overview','genres','keywords','cast','crew'])
#Checking the final_df
final_df.head()

Unnamed: 0,id,title,Detail
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [22]:
#Converting the Detail column back to string/Text
final_df['Detail'] = final_df['Detail'].apply(lambda x: " ".join(x))
final_df.head()

Unnamed: 0,id,title,Detail
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


#### Movies of 2017

In [23]:
#Loading the Dataset
movies17 = pd.read_csv('C:/Users/Akaash/Downloads/movies_data_17.csv')
keywords17 = pd.read_csv('C:/Users/Akaash/Downloads/keywords_17.csv')
credits17 = pd.read_csv('C:/Users/Akaash/Downloads/credits_17.csv')

In [24]:
#Converting the Release_date Feature to DateTime
movies17['release_date'] = pd.to_datetime(movies17['release_date'], errors='coerce')
#Creating A Year Column Feature
movies17['year'] = movies17['release_date'].dt.year
#Dropping Original Columns
movies17 = movies17.drop(columns=['release_date'])
movies17['year'].value_counts().sort_index()

1874.0       1
1878.0       1
1883.0       1
1887.0       1
1888.0       2
          ... 
2015.0    1905
2016.0    1604
2017.0     532
2018.0       5
2020.0       1
Name: year, Length: 135, dtype: int64

Inference: will Take 2017 movies data from here.

In [25]:
# Getting only 2017 movies as we already have movies up to the year 2016 Previously 
# We don't have enough data for the movies from 2018, 2019 and 2020. 
# We'll deal with it in the upcoming preprocessing step
df2 = movies17.loc[movies17.year == 2017,['id','title','overview','genres']]
df2.shape

(532, 4)

In [26]:
#Converting id feature to int
df2['id']=pd.to_numeric(df2['id'],errors='coerce')
#Mergeing Movies & Keywords Dataset
df2 = df2.merge(keywords17,on='id')
#Mergeing Both the Dataset
df2 = df2.merge(credits17,on='id')
df2.head(2)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,166426,Pirates of the Caribbean: Dead Men Tell No Tales,"Thrust into an all-new paycheck, a down-on-his...","[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...","[{'id': 658, 'name': 'sea'}, {'id': 3799, 'nam...","[{'cast_id': 1, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4c9cc3a36847f8236a65', 'de..."
1,141052,Justice League,Fueled by his restored faith in humanity and i...,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'id': 849, 'name': 'dc comics'}, {'id': 9717...","[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '55ef66dbc3a3686f1700a52d', 'de..."


In [27]:
df2.isnull().sum()

id           0
title        0
overview    12
genres       0
keywords     0
cast         0
crew         0
dtype: int64

In [28]:
# Dropping the NA Values as Count of NA Value is Small
df2.dropna(inplace=True)
df2.shape

(520, 7)

In [29]:
#Preprocessing Movies of 2017

#Calling the Function to get list of genres,keywords,cast
df2['genres'] = df2['genres'].apply(convert)
df2['keywords'] = df2['keywords'].apply(convert)
df2['cast'] = df2['cast'].apply(convert)

#Getting only First 3 Actors of the Movies
df2['cast'] = df2['cast'].apply(lambda x:x[0:3])

#Calling the fetch_director function to get Director Name's
df2['crew'] = df2['crew'].apply(fetch_director)

#Converting Overview Feature into a List
df2['overview'] = df2['overview'].apply(lambda x:x.split())

#Making Single Entities by Removing Space
df2['cast'] = df2['cast'].apply(collapse)
df2['crew'] = df2['crew'].apply(collapse)
df2['genres'] = df2['genres'].apply(collapse)
df2['keywords'] = df2['keywords'].apply(collapse)

#Creating Detail Feature
df2['Detail'] = df2['overview'] + df2['genres'] + df2['keywords'] + df2['cast'] + df2['crew']
#Dropping Original Columns
final_df17 = df2.drop(columns=['overview','genres','keywords','cast','crew'])

#Converting the Detail column back to string/Text
final_df17['Detail'] = final_df17['Detail'].apply(lambda x: " ".join(x))
final_df17.head()

Unnamed: 0,id,title,Detail
0,166426,Pirates of the Caribbean: Dead Men Tell No Tales,"Thrust into an all-new paycheck, a down-on-his..."
1,141052,Justice League,Fueled by his restored faith in humanity and i...
2,284053,Thor: Ragnarok,Thor is imprisoned on the other side of the un...
3,283995,Guardians of the Galaxy Vol. 2,The Guardians must fight to keep their newfoun...
4,245842,The King's Daughter,King Louis XIV's quest for immortality leads h...


Inference: Now will append this 2017's movies with previous data

In [30]:
#Appending the Dataset (Previous + 2017 Movies)
final_df = final_df.append(final_df17)
#Checking the final_df
final_df.shape

(5326, 3)

#### Movies of 2018

Extracting features of 2018 movies from Wikipedia

In [31]:
#Link
link = "https://en.wikipedia.org/wiki/List_of_American_films_of_2018"
# Getting the 4 table present in the link as 4 dfs
df3 = pd.read_html(link, header=0)[2]
df4 = pd.read_html(link, header=0)[3]
df5 = pd.read_html(link, header=0)[4]
df6 = pd.read_html(link, header=0)[5]

#Appending all 4 df into 1
df7 = df3.append(df4.append(df5.append(df6,ignore_index=True),ignore_index=True),ignore_index=True)
df7.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,5,Insidious: The Last Key,Universal Pictures / Blumhouse Productions / S...,Adam Robitel (director); Leigh Whannell (scree...,[2],
1,JANUARY,5,The Strange Ones,Vertical Entertainment,Lauren Wolkstein (director); Christopher Radcl...,[3],
2,JANUARY,5,Stratton,Momentum Pictures,"Simon West (director); Duncan Falconer, Warren...",[4],
3,JANUARY,10,Sweet Country,Samuel Goldwyn Films,"Warwick Thornton (director); David Tranter, St...",[5],
4,JANUARY,12,The Commuter,Lionsgate / StudioCanal / The Picture Company,Jaume Collet-Serra (director); Byron Willinger...,[6],


In [32]:
#Initializing to Scapes the Genre
tmdb = TMDb()
tmdb.api_key = 'eab26b9d5f2cc3ad149a18a460342632'
tmdb_movie = Movie()
#For ID
def get_id(x):
    ID = []
    result = tmdb_movie.search(x)
    if not result:
        return np.NaN
    else:
        movie_id = result[0].id
        response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(movie_id, tmdb.api_key))
        data_json = response.json()
        response.close()
        if data_json['id']:
            ID.append(data_json['id'])
            val = ','.join(str(i) for i in ID)
            return val
        else:
            np.NaN

def get_genre(x):
    genres = []
    result = tmdb_movie.search(x)
    if not result:
        return np.NaN
    else:
        movie_id = result[0].id
        response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(movie_id, tmdb.api_key))
        data_json = response.json()
        response.close()
        if data_json['genres']:
            genre_str = " " 
            for i in range(0,len(data_json['genres'])):
                genres.append(data_json['genres'][i]['name'])
            return genre_str.join(genres)
        else:
            np.NaN
        
#For Overview
def get_overview(x):
    overview = []
    result = tmdb_movie.search(x)
    if not result:
        return np.NaN
    else:
        movie_id = result[0].id
        response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(movie_id, tmdb.api_key))
        data_json = response.json()
        response.close()
        if data_json['overview']:
            overview_str = " " 
            for i in range(0,len(data_json['overview'])):
                overview.append(data_json['overview'])
            return overview_str.join(overview)
        else:
            np.NaN

In [33]:
#Creating the ID, Genres, overview Columns
df7['id'] = df7['Title'].map(lambda x: get_id(str(x)))
df7['genres'] = df7['Title'].map(lambda x: get_genre(str(x)))
df7['overview'] = df7['Title'].map(lambda x: get_overview(str(x)))
df7.head(2)

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.,id,genres,overview
0,JANUARY,5,Insidious: The Last Key,Universal Pictures / Blumhouse Productions / S...,Adam Robitel (director); Leigh Whannell (scree...,[2],,406563,Horror Mystery Thriller,Parapsychologist Elise Rainier and her team tr...
1,JANUARY,5,The Strange Ones,Vertical Entertainment,Lauren Wolkstein (director); Christopher Radcl...,[3],,426258,Thriller Drama,Mysterious events surround the travels of two ...


In [34]:
#Taking the Important Columns
df8 = df7[['id','Title','Cast and crew','genres','overview']]
df8 = df8.rename(columns={'Title':'title'})
df8.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview
0,406563,Insidious: The Last Key,Adam Robitel (director); Leigh Whannell (scree...,Horror Mystery Thriller,Parapsychologist Elise Rainier and her team tr...
1,426258,The Strange Ones,Lauren Wolkstein (director); Christopher Radcl...,Thriller Drama,Mysterious events surround the travels of two ...


In [35]:
#Checking NA Values.
df8.isnull().sum()

id               0
title            0
Cast and crew    0
genres           2
overview         0
dtype: int64

In [36]:
# Dropping the NA Values as Count of NA Value is Small
df8.dropna(inplace=True)
df8.shape

(270, 5)

In [37]:
#Get Actors Name First 3 Actors
def get_actor(x):
    return ((x.split("screenplay); ")[-1]).split(", ")[0:3])

In [38]:
#Calling the get actor Function.
df8['cast'] = df8['Cast and crew'].map(lambda x: get_actor(x))
df8.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview,cast
0,406563,Insidious: The Last Key,Adam Robitel (director); Leigh Whannell (scree...,Horror Mystery Thriller,Parapsychologist Elise Rainier and her team tr...,"[Lin Shaye, Angus Sampson, Leigh Whannell]"
1,426258,The Strange Ones,Lauren Wolkstein (director); Christopher Radcl...,Thriller Drama,Mysterious events surround the travels of two ...,"[Alex Pettyfer, James Freedson-Jackson, Emily ..."


In [39]:
#Function to fetch Directors
def get_director(x):
    if " (director)" in x:
        return (x.split(" (director)")[0:1])
    elif " (directors)" in x:
        return (x.split(" (directors)")[0:1])
    else:
        return (x.split(" (director/screenplay)")[0:1])

In [40]:
#Calling the get_director function to get Director Name's
df8['crew'] = df8['Cast and crew'].map(lambda x: get_director(x))
#Dropping Original Columns
df8 = df8.drop(columns=['Cast and crew'])
df8.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,406563,Insidious: The Last Key,Horror Mystery Thriller,Parapsychologist Elise Rainier and her team tr...,"[Lin Shaye, Angus Sampson, Leigh Whannell]",[Adam Robitel]
1,426258,The Strange Ones,Thriller Drama,Mysterious events surround the travels of two ...,"[Alex Pettyfer, James Freedson-Jackson, Emily ...",[Lauren Wolkstein]


Inference: Features is a String will Convert this into a List so we can concat with other Features.

In [41]:
#Converting Feature into a List
df8['genres'] = df8['genres'].apply(lambda x:x.split())
df8['overview'] = df8['overview'].apply(lambda x:x.split())
df8.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,406563,Insidious: The Last Key,"[Horror, Mystery, Thriller]","[Parapsychologist, Elise, Rainier, and, her, t...","[Lin Shaye, Angus Sampson, Leigh Whannell]",[Adam Robitel]
1,426258,The Strange Ones,"[Thriller, Drama]","[Mysterious, events, surround, the, travels, o...","[Alex Pettyfer, James Freedson-Jackson, Emily ...",[Lauren Wolkstein]


In [42]:
#Making Single Entities by Removing Space
df8['cast'] = df8['cast'].apply(collapse)
df8['crew'] = df8['crew'].apply(collapse)
df8['genres'] = df8['genres'].apply(collapse)
df8.head(3)

Unnamed: 0,id,title,genres,overview,cast,crew
0,406563,Insidious: The Last Key,"[Horror, Mystery, Thriller]","[Parapsychologist, Elise, Rainier, and, her, t...","[LinShaye, AngusSampson, LeighWhannell]",[AdamRobitel]
1,426258,The Strange Ones,"[Thriller, Drama]","[Mysterious, events, surround, the, travels, o...","[AlexPettyfer, JamesFreedson-Jackson, EmilyAlt...",[LaurenWolkstein]
2,348389,Stratton,"[Action, Thriller]","[A, British, Special, Boat, Service, commando,...","[DominicCooper, AustinStowell, GemmaChan]",[SimonWest]


In [43]:
#Creating Detail Feature
df8['Detail'] = df8['overview'] + df8['genres'] + df8['cast'] + df8['crew']
#Dropping Original Columns
final_df18 = df8.drop(columns=['overview','genres','cast','crew'])

#Converting the Detail column back to string/Text
final_df18['Detail'] = final_df18['Detail'].apply(lambda x: " ".join(x))
final_df18.head()

Unnamed: 0,id,title,Detail
0,406563,Insidious: The Last Key,Parapsychologist Elise Rainier and her team tr...
1,426258,The Strange Ones,Mysterious events surround the travels of two ...
2,348389,Stratton,A British Special Boat Service commando tracks...
3,468210,Sweet Country,"It’s 1929 on the vast, desert-like, Eastern Ar..."
4,399035,The Commuter,"A businessman, on his daily commute home, gets..."


In [44]:
#Appending the Dataset (Previous + 2018 Movies)
final_df = final_df.append(final_df18)
#Checking the final_df
final_df.shape

(5596, 3)

#### Movies of 2019

In [45]:
link = "https://en.wikipedia.org/wiki/List_of_American_films_of_2019"
df9 = pd.read_html(link, header=0)[2]
df10 = pd.read_html(link, header=0)[3]
df11 = pd.read_html(link, header=0)[4]
df12 = pd.read_html(link, header=0)[5]

#Appending all 4 df into 1
df13 = df9.append(df10.append(df11.append(df12,ignore_index=True),ignore_index=True),ignore_index=True)
df13.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,Ref.
0,JANUARY,4,Escape Room,Columbia Pictures / Original Film,"Adam Robitel (director); Bragi F. Schut, Maria...",[2]
1,JANUARY,4,Rust Creek,IFC Films,Jen McGowan (director); Julie Lipson (screenpl...,[3]
2,JANUARY,4,American Hangman,Hangman Justice Productions,Wilson Coneybeare (director/screenplay); Donal...,[4]
3,JANUARY,11,A Dog's Way Home,Columbia Pictures,Charles Martin Smith (director); W. Bruce Came...,[5]
4,JANUARY,11,The Upside,STX Entertainment,Neil Burger (director); Jon Hartmere (screenpl...,[6]


In [46]:
#Creating the ID, Genres, overview Columns
df13['id'] = df13['Title'].map(lambda x: get_id(str(x)))
df13['genres'] = df13['Title'].map(lambda x: get_genre(str(x)))
df13['overview'] = df13['Title'].map(lambda x: get_overview(str(x)))
df13.head(2)

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,Ref.,id,genres,overview
0,JANUARY,4,Escape Room,Columbia Pictures / Original Film,"Adam Robitel (director); Bragi F. Schut, Maria...",[2],522681,Horror Thriller Mystery,Six strangers find themselves in circumstances...
1,JANUARY,4,Rust Creek,IFC Films,Jen McGowan (director); Julie Lipson (screenpl...,[3],561362,Thriller Drama,When an overachieving college senior makes a w...


In [47]:
#Taking the Important Columns
df14 = df13[['id','Title','Cast and crew','genres','overview']]
df14 = df14.rename(columns={'Title':'title'})
df14.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview
0,522681,Escape Room,"Adam Robitel (director); Bragi F. Schut, Maria...",Horror Thriller Mystery,Six strangers find themselves in circumstances...
1,561362,Rust Creek,Jen McGowan (director); Julie Lipson (screenpl...,Thriller Drama,When an overachieving college senior makes a w...


In [48]:
#Checking NA Values.
df14.isnull().sum()

id               0
title            0
Cast and crew    0
genres           0
overview         0
dtype: int64

Inference: No NA Values Presents

In [49]:
#Calling the get actor Function.
df14['cast'] = df14['Cast and crew'].map(lambda x: get_actor(x))
df14.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview,cast
0,522681,Escape Room,"Adam Robitel (director); Bragi F. Schut, Maria...",Horror Thriller Mystery,Six strangers find themselves in circumstances...,"[Taylor Russell, Logan Miller, Deborah Ann Woll]"
1,561362,Rust Creek,Jen McGowan (director); Julie Lipson (screenpl...,Thriller Drama,When an overachieving college senior makes a w...,"[Hermione Corfield, Jay Paulson, Sean O'Bryan]"


In [50]:
#Calling the get_director function to get Director Name's
df14['crew'] = df14['Cast and crew'].map(lambda x: get_director(x))
#Dropping Original Columns
df14 = df14.drop(columns=['Cast and crew'])
df14.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522681,Escape Room,Horror Thriller Mystery,Six strangers find themselves in circumstances...,"[Taylor Russell, Logan Miller, Deborah Ann Woll]",[Adam Robitel]
1,561362,Rust Creek,Thriller Drama,When an overachieving college senior makes a w...,"[Hermione Corfield, Jay Paulson, Sean O'Bryan]",[Jen McGowan]


Inference: Features is a String will Convert this into a List so we can concat with other Features.

In [51]:
#Converting Feature into a List
df14['genres'] = df14['genres'].apply(lambda x:x.split())
df14['overview'] = df14['overview'].apply(lambda x:x.split())
df14.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522681,Escape Room,"[Horror, Thriller, Mystery]","[Six, strangers, find, themselves, in, circums...","[Taylor Russell, Logan Miller, Deborah Ann Woll]",[Adam Robitel]
1,561362,Rust Creek,"[Thriller, Drama]","[When, an, overachieving, college, senior, mak...","[Hermione Corfield, Jay Paulson, Sean O'Bryan]",[Jen McGowan]


In [52]:
#Making Single Entities by Removing Space
df14['cast'] = df14['cast'].apply(collapse)
df14['crew'] = df14['crew'].apply(collapse)
df14['genres'] = df14['genres'].apply(collapse)
df14.head(3)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522681,Escape Room,"[Horror, Thriller, Mystery]","[Six, strangers, find, themselves, in, circums...","[TaylorRussell, LoganMiller, DeborahAnnWoll]",[AdamRobitel]
1,561362,Rust Creek,"[Thriller, Drama]","[When, an, overachieving, college, senior, mak...","[HermioneCorfield, JayPaulson, SeanO'Bryan]",[JenMcGowan]
2,567738,American Hangman,[Thriller],"[An, unidentified, man, posts, a, live, feed, ...","[DonaldSutherland, VincentKartheiser, OliverDe...",[WilsonConeybeare]


In [53]:
#Creating Detail Feature
df14['Detail'] = df14['overview'] + df14['genres'] + df14['cast'] + df14['crew']
#Dropping Original Columns
final_df19 = df14.drop(columns=['overview','genres','cast','crew'])

#Converting the Detail column back to string/Text
final_df19['Detail'] = final_df19['Detail'].apply(lambda x: " ".join(x))
final_df19.head()

Unnamed: 0,id,title,Detail
0,522681,Escape Room,Six strangers find themselves in circumstances...
1,561362,Rust Creek,When an overachieving college senior makes a w...
2,567738,American Hangman,An unidentified man posts a live feed on socia...
3,508763,A Dog's Way Home,"The adventure of Bella, a dog who embarks on a..."
4,440472,The Upside,Phillip is a wealthy quadriplegic who needs a ...


In [54]:
#Appending the Dataset (Previous + 2019 Movies)
final_df = final_df.append(final_df19)
#Checking the final_df
final_df.shape

(5838, 3)

#### Movies of 2020

In [55]:
link = "https://en.wikipedia.org/wiki/List_of_American_films_of_2020"
df15 = pd.read_html(link, header=0)[2]
df16 = pd.read_html(link, header=0)[3]
df17 = pd.read_html(link, header=0)[4]
df18 = pd.read_html(link, header=0)[5]

#Appending all 4 df into 1
df19 = df15.append(df16.append(df17.append(df18,ignore_index=True),ignore_index=True),ignore_index=True)
df19.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,3,The Grudge,Screen Gems / Stage 6 Films / Ghost House Pict...,Nicolas Pesce (director/screenplay); Andrea Ri...,[2],
1,JANUARY,10,Underwater,20th Century Fox / TSG Entertainment / Chernin...,"William Eubank (director); Brian Duffield, Ada...",[3],
2,JANUARY,10,Like a Boss,Paramount Pictures,"Miguel Arteta (director); Sam Pitman, Adam Col...",[4],
3,JANUARY,10,Three Christs,IFC Films,Jon Avnet (director/screenplay); Eric Nazarian...,,
4,JANUARY,10,Inherit the Viper,Barry Films / Tycor International Film Company,Anthony Jerjen (director); Andrew Crabtree (sc...,[5],


In [56]:
#Creating the ID, Genres, overview Columns
df19['id'] = df19['Title'].map(lambda x: get_id(str(x)))
df19['genres'] = df19['Title'].map(lambda x: get_genre(str(x)))
df19['overview'] = df19['Title'].map(lambda x: get_overview(str(x)))
df19.head(2)

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.,id,genres,overview
0,JANUARY,3,The Grudge,Screen Gems / Stage 6 Films / Ghost House Pict...,Nicolas Pesce (director/screenplay); Andrea Ri...,[2],,1970,Horror Mystery Thriller,An American nurse living and working in Tokyo ...
1,JANUARY,10,Underwater,20th Century Fox / TSG Entertainment / Chernin...,"William Eubank (director); Brian Duffield, Ada...",[3],,443791,Action Horror Science Fiction Thriller,After an earthquake destroys their underwater ...


In [57]:
#Taking the Important Columns
df20 = df19[['id','Title','Cast and crew','genres','overview']]
df20 = df20.rename(columns={'Title':'title'})
df20.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview
0,1970,The Grudge,Nicolas Pesce (director/screenplay); Andrea Ri...,Horror Mystery Thriller,An American nurse living and working in Tokyo ...
1,443791,Underwater,"William Eubank (director); Brian Duffield, Ada...",Action Horror Science Fiction Thriller,After an earthquake destroys their underwater ...


In [58]:
#Checking NA Values.
df20.isnull().sum()

id               1
title            0
Cast and crew    0
genres           1
overview         1
dtype: int64

In [59]:
# Dropping the NA Values as Count of NA Value is Small
df20.dropna(inplace=True)
df20.shape

(273, 5)

In [60]:
#Calling the get actor Function.
df20['cast'] = df20['Cast and crew'].map(lambda x: get_actor(x))
df20.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview,cast
0,1970,The Grudge,Nicolas Pesce (director/screenplay); Andrea Ri...,Horror Mystery Thriller,An American nurse living and working in Tokyo ...,"[Andrea Riseborough, Demián Bichir, John Cho]"
1,443791,Underwater,"William Eubank (director); Brian Duffield, Ada...",Action Horror Science Fiction Thriller,After an earthquake destroys their underwater ...,"[Kristen Stewart, Vincent Cassel, Jessica Henw..."


In [61]:
#Calling the get_director function to get Director Name's
df20['crew'] = df20['Cast and crew'].map(lambda x: get_director(x))
#Dropping Original Columns
df20 = df20.drop(columns=['Cast and crew'])
df20.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,1970,The Grudge,Horror Mystery Thriller,An American nurse living and working in Tokyo ...,"[Andrea Riseborough, Demián Bichir, John Cho]",[Nicolas Pesce]
1,443791,Underwater,Action Horror Science Fiction Thriller,After an earthquake destroys their underwater ...,"[Kristen Stewart, Vincent Cassel, Jessica Henw...",[William Eubank]


Inference: Features is a String will Convert this into a List so we can concat with other Features.

In [62]:
#Converting Feature into a List
df20['genres'] = df20['genres'].apply(lambda x:x.split())
df20['overview'] = df20['overview'].apply(lambda x:x.split())
df20.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,1970,The Grudge,"[Horror, Mystery, Thriller]","[An, American, nurse, living, and, working, in...","[Andrea Riseborough, Demián Bichir, John Cho]",[Nicolas Pesce]
1,443791,Underwater,"[Action, Horror, Science, Fiction, Thriller]","[After, an, earthquake, destroys, their, under...","[Kristen Stewart, Vincent Cassel, Jessica Henw...",[William Eubank]


In [63]:
#Making Single Entities by Removing Space
df20['cast'] = df20['cast'].apply(collapse)
df20['crew'] = df20['crew'].apply(collapse)
df20['genres'] = df20['genres'].apply(collapse)
df20.head(3)

Unnamed: 0,id,title,genres,overview,cast,crew
0,1970,The Grudge,"[Horror, Mystery, Thriller]","[An, American, nurse, living, and, working, in...","[AndreaRiseborough, DemiánBichir, JohnCho]",[NicolasPesce]
1,443791,Underwater,"[Action, Horror, Science, Fiction, Thriller]","[After, an, earthquake, destroys, their, under...","[KristenStewart, VincentCassel, JessicaHenwick]",[WilliamEubank]
2,526019,Like a Boss,[Comedy],"[Two, female, friends, with, very, different, ...","[TiffanyHaddish, RoseByrne, SalmaHayek]",[MiguelArteta]


In [64]:
#Creating Detail Feature
df20['Detail'] = df20['overview'] + df20['genres'] + df20['cast'] + df20['crew']
#Dropping Original Columns
final_df20 = df20.drop(columns=['overview','genres','cast','crew'])

#Converting the Detail column back to string/Text
final_df20['Detail'] = final_df20['Detail'].apply(lambda x: " ".join(x))
final_df20.head()

Unnamed: 0,id,title,Detail
0,1970,The Grudge,An American nurse living and working in Tokyo ...
1,443791,Underwater,After an earthquake destroys their underwater ...
2,526019,Like a Boss,Two female friends with very different ideals ...
3,418533,Three Christs,Dr. Alan Stone breaks new ground for treatment...
4,634904,Inherit the Viper,"Since the death of their father, the Riley sib..."


In [65]:
#Appending the Dataset (Previous + 2020 Movies)
final_df = final_df.append(final_df20)
#Checking the final_df
final_df.shape

(6111, 3)

#### Movies of 2021

In [66]:
link = "https://en.wikipedia.org/wiki/List_of_American_films_of_2021"
df21 = pd.read_html(link, header=0)[3]
df22 = pd.read_html(link, header=0)[4]
df23 = pd.read_html(link, header=0)[5]
df24 = pd.read_html(link, header=0)[6]

#Appending all 4 df into 1
df25 = df21.append(df22.append(df23.append(df24,ignore_index=True),ignore_index=True),ignore_index=True)
df25.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,1.0,Shadow in the Cloud,Vertical Entertainment,Roseanne Liang (director/screenplay); Max Land...,[2],
1,JANUARY,13.0,The White Tiger,Netflix,Ramin Bahrani (director/screenplay); Adarsh Go...,,
2,JANUARY,14.0,Locked Down,HBO Max / Warner Bros. Pictures,Doug Liman (director); Steven Knight (screenpl...,[3],
3,JANUARY,15.0,The Dig,Netflix / Clerkenwell Films,Simon Stone (director); Moira Buffini (screenp...,[4],
4,JANUARY,15.0,Outside the Wire,Netflix,"Mikael Håfström (director); Rob Yescombe, Rowa...",[5],


In [67]:
#Creating the ID, Genres, overview Columns
df25['id'] = df25['Title'].map(lambda x: get_id(str(x)))
df25['genres'] = df25['Title'].map(lambda x: get_genre(str(x)))
df25['overview'] = df25['Title'].map(lambda x: get_overview(str(x)))
df25.head(2)

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.,id,genres,overview
0,JANUARY,1.0,Shadow in the Cloud,Vertical Entertainment,Roseanne Liang (director/screenplay); Max Land...,[2],,675327,Horror Action War,A WWII pilot traveling with top secret documen...
1,JANUARY,13.0,The White Tiger,Netflix,Ramin Bahrani (director/screenplay); Adarsh Go...,,,628534,Drama,An ambitious Indian driver uses his wit and cu...


In [68]:
#Taking the Important Columns
df26 = df25[['id','Title','Cast and crew','genres','overview']]
df26 = df26.rename(columns={'Title':'title'})
df26.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview
0,675327,Shadow in the Cloud,Roseanne Liang (director/screenplay); Max Land...,Horror Action War,A WWII pilot traveling with top secret documen...
1,628534,The White Tiger,Ramin Bahrani (director/screenplay); Adarsh Go...,Drama,An ambitious Indian driver uses his wit and cu...


In [69]:
#Checking NA Values.
df26.isnull().sum()

id               1
title            1
Cast and crew    1
genres           1
overview         1
dtype: int64

In [70]:
# Dropping the NA Values as Count of NA Value is Small
df26.dropna(inplace=True)
df26.shape

(356, 5)

In [71]:
#Calling the get actor Function.
df26['cast'] = df26['Cast and crew'].map(lambda x: get_actor(x))
df26.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview,cast
0,675327,Shadow in the Cloud,Roseanne Liang (director/screenplay); Max Land...,Horror Action War,A WWII pilot traveling with top secret documen...,"[Chloë Grace Moretz, Taylor John Smith, Beulah..."
1,628534,The White Tiger,Ramin Bahrani (director/screenplay); Adarsh Go...,Drama,An ambitious Indian driver uses his wit and cu...,"[Adarsh Gourav, Rajkummar Rao, Priyanka Chopra..."


In [72]:
#Calling the get_director function to get Director Name's
df26['crew'] = df26['Cast and crew'].map(lambda x: get_director(x))
#Dropping Original Columns
df26 = df26.drop(columns=['Cast and crew'])
df26.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,675327,Shadow in the Cloud,Horror Action War,A WWII pilot traveling with top secret documen...,"[Chloë Grace Moretz, Taylor John Smith, Beulah...",[Roseanne Liang]
1,628534,The White Tiger,Drama,An ambitious Indian driver uses his wit and cu...,"[Adarsh Gourav, Rajkummar Rao, Priyanka Chopra...",[Ramin Bahrani]


Inference: Features is a String will Convert this into a List so we can concat with other Features.

In [73]:
#Converting Feature into a List
df26['genres'] = df26['genres'].apply(lambda x:x.split())
df26['overview'] = df26['overview'].apply(lambda x:x.split())
df26.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,675327,Shadow in the Cloud,"[Horror, Action, War]","[A, WWII, pilot, traveling, with, top, secret,...","[Chloë Grace Moretz, Taylor John Smith, Beulah...",[Roseanne Liang]
1,628534,The White Tiger,[Drama],"[An, ambitious, Indian, driver, uses, his, wit...","[Adarsh Gourav, Rajkummar Rao, Priyanka Chopra...",[Ramin Bahrani]


In [74]:
#Making Single Entities by Removing Space
df26['cast'] = df26['cast'].apply(collapse)
df26['crew'] = df26['crew'].apply(collapse)
df26['genres'] = df26['genres'].apply(collapse)
df26.head(3)

Unnamed: 0,id,title,genres,overview,cast,crew
0,675327,Shadow in the Cloud,"[Horror, Action, War]","[A, WWII, pilot, traveling, with, top, secret,...","[ChloëGraceMoretz, TaylorJohnSmith, BeulahKoale]",[RoseanneLiang]
1,628534,The White Tiger,[Drama],"[An, ambitious, Indian, driver, uses, his, wit...","[AdarshGourav, RajkummarRao, PriyankaChopraJonas]",[RaminBahrani]
2,741228,Locked Down,"[Comedy, Crime, Drama]","[During, a, COVID-19, lockdown,, sparring, cou...","[AnneHathaway, ChiwetelEjiofor, StephenMerchant]",[DougLiman]


In [75]:
#Creating Detail Feature
df26['Detail'] = df26['overview'] + df26['genres'] + df26['cast'] + df26['crew']
#Dropping Original Columns
final_df21 = df26.drop(columns=['overview','genres','cast','crew'])

#Converting the Detail column back to string/Text
final_df21['Detail'] = final_df21['Detail'].apply(lambda x: " ".join(x))
final_df21.head()

Unnamed: 0,id,title,Detail
0,675327,Shadow in the Cloud,A WWII pilot traveling with top secret documen...
1,628534,The White Tiger,An ambitious Indian driver uses his wit and cu...
2,741228,Locked Down,"During a COVID-19 lockdown, sparring couple Li..."
3,532865,The Dig,"As WWII looms, a wealthy widow hires an amateu..."
4,775996,Outside the Wire,"In the near future, a drone pilot is sent into..."


In [76]:
#Appending the Dataset (Previous + 2021 Movies)
final_df = final_df.append(final_df21)
#Checking the final_df
final_df.shape

(6467, 3)

#### Movies of 2022

In [77]:
link = "https://en.wikipedia.org/wiki/List_of_American_films_of_2022"
df27 = pd.read_html(link, header=0)[3]
df28 = pd.read_html(link, header=0)[4]
df29 = pd.read_html(link, header=0)[5]
df30 = pd.read_html(link, header=0)[6]

#Appending all 4 df into 1
df31 = df27.append(df28.append(df29.append(df30,ignore_index=True),ignore_index=True),ignore_index=True)
df31.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,7.0,The 355,Universal Pictures / Freckle Films / FilmNatio...,Simon Kinberg (director/screenplay); Theresa R...,[2],
1,JANUARY,7.0,The Legend of La Llorona,Saban Films,Patricia Harris Seeley (director/screenplay); ...,[3],
2,JANUARY,7.0,The Commando,Saban Films,Asif Akbar (director); Koji Steven Sakai (scre...,[4],
3,JANUARY,14.0,Scream,Paramount Pictures / Spyglass Media Group,"Matt Bettinelli-Olpin, Tyler Gillett (director...",[5],
4,JANUARY,14.0,Hotel Transylvania: Transformania,Amazon Studios / Columbia Pictures / Sony Pict...,"Jennifer Kluska, Derek Drymon (directors); Amo...",[6],


In [78]:
#Creating the ID, Genres, overview Columns
df31['id'] = df31['Title'].map(lambda x: get_id(str(x)))
df31['genres'] = df31['Title'].map(lambda x: get_genre(str(x)))
df31['overview'] = df31['Title'].map(lambda x: get_overview(str(x)))
df31.head(2)

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.,id,genres,overview
0,JANUARY,7.0,The 355,Universal Pictures / Freckle Films / FilmNatio...,Simon Kinberg (director/screenplay); Theresa R...,[2],,522016,Action Thriller,"A group of top female agents from American, Br..."
1,JANUARY,7.0,The Legend of La Llorona,Saban Films,Patricia Harris Seeley (director/screenplay); ...,[3],,631947,Horror Thriller,"While vacationing in Mexico, a couple discover..."


In [79]:
#Taking the Important Columns
df32 = df31[['id','Title','Cast and crew','genres','overview']]
df32 = df32.rename(columns={'Title':'title'})
df32.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview
0,522016,The 355,Simon Kinberg (director/screenplay); Theresa R...,Action Thriller,"A group of top female agents from American, Br..."
1,631947,The Legend of La Llorona,Patricia Harris Seeley (director/screenplay); ...,Horror Thriller,"While vacationing in Mexico, a couple discover..."


In [80]:
#Checking NA Values.
df32.isnull().sum()

id               1
title            1
Cast and crew    1
genres           2
overview         1
dtype: int64

In [81]:
# Dropping the NA Values as Count of NA Value is Small
df32.dropna(inplace=True)
df32.shape

(159, 5)

In [82]:
#Calling the get actor Function.
df32['cast'] = df32['Cast and crew'].map(lambda x: get_actor(x))
df32.head(2)

Unnamed: 0,id,title,Cast and crew,genres,overview,cast
0,522016,The 355,Simon Kinberg (director/screenplay); Theresa R...,Action Thriller,"A group of top female agents from American, Br...","[Jessica Chastain, Lupita Nyong'o, Penélope Cruz]"
1,631947,The Legend of La Llorona,Patricia Harris Seeley (director/screenplay); ...,Horror Thriller,"While vacationing in Mexico, a couple discover...","[Autumn Reeser, Antonio Cupo, Danny Trejo]"


In [83]:
#Calling the get_director function to get Director Name's
df32['crew'] = df32['Cast and crew'].map(lambda x: get_director(x))
#Dropping Original Columns
df32 = df32.drop(columns=['Cast and crew'])
df32.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522016,The 355,Action Thriller,"A group of top female agents from American, Br...","[Jessica Chastain, Lupita Nyong'o, Penélope Cruz]",[Simon Kinberg]
1,631947,The Legend of La Llorona,Horror Thriller,"While vacationing in Mexico, a couple discover...","[Autumn Reeser, Antonio Cupo, Danny Trejo]",[Patricia Harris Seeley]


Inference: Features is a String will Convert this into a List so we can concat with other Features.

In [84]:
#Converting Feature into a List
df32['genres'] = df32['genres'].apply(lambda x:x.split())
df32['overview'] = df32['overview'].apply(lambda x:x.split())
df32.head(2)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522016,The 355,"[Action, Thriller]","[A, group, of, top, female, agents, from, Amer...","[Jessica Chastain, Lupita Nyong'o, Penélope Cruz]",[Simon Kinberg]
1,631947,The Legend of La Llorona,"[Horror, Thriller]","[While, vacationing, in, Mexico,, a, couple, d...","[Autumn Reeser, Antonio Cupo, Danny Trejo]",[Patricia Harris Seeley]


In [85]:
#Making Single Entities by Removing Space
df32['cast'] = df32['cast'].apply(collapse)
df32['crew'] = df32['crew'].apply(collapse)
df32['genres'] = df32['genres'].apply(collapse)
df32.head(3)

Unnamed: 0,id,title,genres,overview,cast,crew
0,522016,The 355,"[Action, Thriller]","[A, group, of, top, female, agents, from, Amer...","[JessicaChastain, LupitaNyong'o, PenélopeCruz]",[SimonKinberg]
1,631947,The Legend of La Llorona,"[Horror, Thriller]","[While, vacationing, in, Mexico,, a, couple, d...","[AutumnReeser, AntonioCupo, DannyTrejo]",[PatriciaHarrisSeeley]
2,753232,The Commando,"[Action, Crime, Thriller]","[An, elite, DEA, agent, returns, home, after, ...","[MickeyRourke, MichaelJaiWhite]",[AsifAkbar]


In [86]:
#Creating Detail Feature
df32['Detail'] = df32['overview'] + df32['genres'] + df32['cast'] + df32['crew']
#Dropping Original Columns
final_df22 = df32.drop(columns=['overview','genres','cast','crew'])

#Converting the Detail column back to string/Text
final_df22['Detail'] = final_df22['Detail'].apply(lambda x: " ".join(x))
final_df22.head()

Unnamed: 0,id,title,Detail
0,522016,The 355,"A group of top female agents from American, Br..."
1,631947,The Legend of La Llorona,"While vacationing in Mexico, a couple discover..."
2,753232,The Commando,An elite DEA agent returns home after a failed...
3,646385,Scream,Twenty-five years after a streak of brutal mur...
4,585083,Hotel Transylvania: Transformania,"When Van Helsing's mysterious invention, the ""..."


In [87]:
#Appending the Dataset (Previous + 2022 Movies)
final_df = final_df.append(final_df22)
#Checking the final_df
final_df.shape

(6626, 3)

Inference: So This is Our Final Movies Data with 6624 Unique Movies, Now will Train Recommand System

In [88]:
#Dropping a Datapoints As it Doesn't Exist on TMDB 
final_df = final_df.drop([final_df.index[3297]])
final_df.reset_index(inplace = True, drop = True)
final_df.shape

(6625, 3)

In [89]:
#Dropping a Datapoints As it Doesn't Exist on TMDB 
final_df = final_df.drop([final_df.index[4621]])
final_df.reset_index(inplace = True, drop = True)
final_df.shape

(6624, 3)

In [90]:
#Using Count Vectorizer to Get Vector of Movies
cv = CountVectorizer(max_features=5000,stop_words='english')
#Vectorizing
vector = cv.fit_transform(final_df['Detail']).toarray()
#shape
vector.shape

(6624, 5000)

#### Training System

In [91]:
#Calculating Cosine Similarity
similarity = cosine_similarity(vector)
similarity

array([[1.00000000e+00, 1.30066495e-01, 7.27392967e-02, ...,
        0.00000000e+00, 7.95386898e-04, 0.00000000e+00],
       [1.30066495e-01, 1.00000000e+00, 8.83021571e-02, ...,
        0.00000000e+00, 9.65563073e-04, 5.12988695e-02],
       [7.27392967e-02, 8.83021571e-02, 1.00000000e+00, ...,
        0.00000000e+00, 5.22438629e-02, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.00000000e+00, 0.00000000e+00, 1.18899866e-05],
       [7.95386898e-04, 9.65563073e-04, 5.22438629e-02, ...,
        0.00000000e+00, 1.00000000e+00, 5.97607128e-02],
       [0.00000000e+00, 5.12988695e-02, 0.00000000e+00, ...,
        1.18899866e-05, 5.97607128e-02, 1.00000000e+00]])

In [92]:
#Saving the Model & Movies_data in a Pickle file.
pickle.dump(final_df,open('C:/Users/Akaash/Downloads/movies_list.pkl','wb'))
pickle.dump(similarity,open('C:/Users/Akaash/Downloads/recommand.pkl','wb'))

In [93]:
#Getting a index of a Movie
final_df[final_df['title'] == 'Whiplash'].index[0]

3870

In [94]:
#Recommend Function
def recommend(movie,n):
    index = final_df[final_df['title'] == movie].index[0]
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    for i in distances[1:n]:
        print(final_df.iloc[i[0]].title)

In [95]:
#Calling the Recommend Function to Recommend Movies
recommend('Whiplash',4)

Alone With Her
R100
Selena


In [96]:
recommend('Black Swan',5)

Desert Dancer
ABCD (Any Body Can Dance)
Addicted
Showgirls
