# Movie Recommender I

## Plan Walkthrough

We have the IMDB Dataset movie dataset consisting of movie_metadata, ratings, links, keywords and credits.
And as per my planning, I'll be using features such as genre, keywords, overview, cast and crew to relate
movies of my choice and recommend movies with respect to that. All of these information can be found within
the IMDB review page for any specific movie. My initial plan was to also include reviews but reviews can be
polarizing frm person to person and might create conflict in the decision factor of the LSTM algorithm that
I'm gonna use. So I'll be merging all the different datasets and processing, wrangling and extracting only
the required features from them. That's the first part of data processing which will result in having movie
information from the year 1984 to 2013.

Here comes the 2nd part, where I wished to get watchable, newer movies for which I need new and latest data.
So I scraped the movie titles from the year 2013-2023 from Wikipedia(forgot they had an API), then I scraped
the necessary features (genre, keywords, overview, etc) for all the new movie titles from IMDB's own website.
The source code for the scraper has been uploaded in the github, make sure to check that out. I had to utilize
Python packages like Multiprocessing & Pooling to run my program and produce the scraped data in batches. 
Multi-threading would have been faster for read-write process,but my program had to open 4 links and scrape
data for individual movies which was not efficient at all; I know, right. In short I was scraping movie info
one at a time and appending that to a shared memory pool whcich was being converted to a dataframe and then
producing the batch out-put as a CSV file.

In the 3rd part, I tried to clean and process the data, and since I knew what features I'll be requiring, I
only collected those features, so no feature additions. With that being done, I had two datasets which I am
cleaning and processing below. Once that is done, I'll be using the LSTM algorithm to generate recommendations.

## Part-1 And Part-3

As mentioned above, the part-1 and part-3 of my entire plan has been done below.
After I test the results, I will the taking the code to a python program based
environment and making a fully functional program out of it which will help avoid
the clutter that we're facing in the notebook environment.

## Processing the IMDB dataset:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Trying to collect the new batch-wise movie dataset and merge them.

In [2]:
df_batch1 = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\batch1.csv")
df_batch2 = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\batch1.csv")
df_batch3 = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\batch1.csv")

In [3]:
merged_raw_df = pd.concat([df_batch1, df_batch2, df_batch3], axis = 0, ignore_index = True)

In [4]:
merged_raw_df.head()

Unnamed: 0,id,keywords,genres,original_title,overview,cast,crew
0,3829920,helicopter;wildfire;baby;firefighter;forest fi...,Disaster;Action;Biography;Drama,Only%20the%20Brave,"In;2007;Prescott,;Arizona,;Eric;Marsh;of;the;P...",Josh Brolin;Miles Teller;Jeff Bridges;Jennifer...,Joseph Kosinski;Sean Flynn;Ken Nolan;Eric Warr...
1,1758810,snow;police investigation;serial killer;promis...,Serial Killer;Crime;Drama;Mystery;Thriller,The%20Snowman,When;an;elite;crime;squad's;lead;detective;inv...,Michael Fassbender;Rebecca Ferguson;Charlotte ...,Tomas Alfredson;Jo Nesbø;Peter Straughan;Hosse...
2,6217804,sequel;second part;directed by star;rhyme in t...,Comedy;Drama;Horror;Mystery,Boo%202!%20A%20Madea%20Halloween,The;film;opens;after;school;on;Tiffany's;18th;...,Tyler Perry;Cassi Davis;Patrice Lovely;Yousef ...,Tyler Perry;Tyler Perry;Brian Schulman;Ozzie A...
3,1230168,homeless man;friendship;kindness;husband wife ...,Biography;Drama,Same%20Kind%20of%20Different%20as%20Me,Ron;Hall;(Greg;Kinnear);lost;track;of;what;mat...,Greg Kinnear;Renée Zellweger;Djimon Hounsou;Jo...,Michael Carney;Ron Hall;Denver Moore;Lynn Vinc...
4,2620590,texas;chainsaw;nurse;barn;sheriff;violence;ori...,Serial Killer;Slasher Horror;Teen Horror;Crime...,Leatherface,A;violent;teen;and;three;others;kidnap;a;young...,Stephen Dorff;Lili Taylor;Sam Strike;Vanessa G...,Alexandre Bustillo;Julien Maury;Seth M. Sherwo...


In [5]:
merged_raw_df['original_title'] = merged_raw_df['original_title'].apply(lambda x: x.strip().replace('%20',' '))

Now, starting with processing the IMDB dataset along with extracting features 

In [7]:
metadata_raw_2 = pd.read_excel(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\Merged metadata new.xlsx")

In [8]:
rating_raw = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\ratings.csv")

In [9]:
metadata_raw = pd.read_excel(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\movies_metadata_xls.xlsx")

In [14]:
metadata_raw = metadata_raw[(metadata_raw['original_language'] == 'en') | (metadata_raw['original_language'] == 'hi')]

In [16]:
credits_raw = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\credits.csv")
credits_raw.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [17]:
keywords_raw = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\keywords.csv")
keywords_raw.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [18]:
links_raw = pd.read_csv(r"C:\Users\aniru\OneDrive\Documents\SRC\SRC 4\Movie recc\links.csv")
links_raw.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [19]:
metadata_raw = metadata_raw[['id','genres','original_title','overview','production_companies']]   
#selecting required colomns from tables
credits_raw = credits_raw[['id','cast','crew']]
keywords_raw = keywords_raw[['id','keywords']]
print(metadata_raw.shape)

(32777, 5)


In [20]:
metadata_raw.isnull().sum()       

id                       0
genres                   0
original_title           0
overview                71
production_companies     2
dtype: int64

In [21]:
metadata_raw_2.isnull().sum()       

id                  0
keywords          167
genres             78
original_title     70
overview           70
cast              159
crew               79
dtype: int64

In [22]:
credits_raw.isnull().sum()

id      0
cast    0
crew    0
dtype: int64

In [23]:
keywords_raw.isnull().sum()

id          0
keywords    0
dtype: int64

In [24]:
metadata_raw.loc[:,'id'] = metadata_raw.loc[:,'id'].fillna(0)
metadata_raw.loc[:,'id']=pd.to_numeric(metadata_raw.loc[:,'id']).astype('Int64')
merged_data = keywords_raw.merge(metadata_raw, on = 'id')
merged_data = merged_data.merge(credits_raw, on = 'id')

In [25]:
merged_data.head()

Unnamed: 0,id,keywords,genres,original_title,overview,production_companies,cast,crew
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Jumanji,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'id': 35, 'name': 'Comedy'}]",Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."


In [26]:
import ast
def toList(text):
    list = []
    for i in ast.literal_eval(text):                 #to convert the string object into a list
            list.append(i['name'])
    return list 

In [27]:
def toList2(data):
    res = []
    for word in data.strip().split(';'):
        res.append(word)
    return res

In [28]:
merged_data['genres'] = merged_data['genres'].apply(toList)
merged_data['keywords'] = merged_data['keywords'].apply(toList)

In [29]:
metadata_raw_2[['genres','keywords']] = metadata_raw_2[['genres','keywords']].astype(str)

In [30]:
metadata_raw_2['genres'] = metadata_raw_2['genres'].apply(toList2)
metadata_raw_2['keywords'] = metadata_raw_2['keywords'].apply(toList2)

In [31]:
metadata_raw_2.head()

Unnamed: 0,id,keywords,genres,original_title,overview,cast,crew
0,3829920,"[helicopter, wildfire, baby, firefighter, fore...","[Disaster, Action, Biography, Drama]",Only%20the%20Brave,"In;2007;Prescott,;Arizona,;Eric;Marsh;of;the;P...",Josh Brolin;Miles Teller;Jeff Bridges;Jennifer...,Joseph Kosinski;Sean Flynn;Ken Nolan;Eric Warr...
1,1758810,"[snow, police investigation, serial killer, pr...","[Serial Killer, Crime, Drama, Mystery, Thriller]",The%20Snowman,When;an;elite;crime;squad's;lead;detective;inv...,Michael Fassbender;Rebecca Ferguson;Charlotte ...,Tomas Alfredson;Jo NesbÃ¸;Peter Straughan;Hoss...
2,6217804,"[sequel, second part, directed by star, rhyme ...","[Comedy, Drama, Horror, Mystery]",Boo%202!%20A%20Madea%20Halloween,The;film;opens;after;school;on;Tiffany's;18th;...,Tyler Perry;Cassi Davis;Patrice Lovely;Yousef ...,Tyler Perry;Tyler Perry;Brian Schulman;Ozzie A...
3,1230168,"[homeless man, friendship, kindness, husband w...","[Biography, Drama]",Same%20Kind%20of%20Different%20as%20Me,Ron;Hall;(Greg;Kinnear);lost;track;of;what;mat...,Greg Kinnear;RenÃ©e Zellweger;Djimon Hounsou;J...,Michael Carney;Ron Hall;Denver Moore;Lynn Vinc...
4,2620590,"[texas, chainsaw, nurse, barn, sheriff, violen...","[Serial Killer, Slasher Horror, Teen Horror, C...",Leatherface,A;violent;teen;and;three;others;kidnap;a;young...,Stephen Dorff;Lili Taylor;Sam Strike;Vanessa G...,Alexandre Bustillo;Julien Maury;Seth M. Sherwo...


In [32]:
metadata_raw_2['original_title'] = metadata_raw_2['original_title'].astype(str)

In [33]:
metadata_raw_2['original_title'] = metadata_raw_2['original_title'].apply(lambda x: x.strip().replace('%20',' '))

In [34]:
metadata_raw_2[['original_title','overview']] = metadata_raw_2[['original_title','overview']].astype(str)

In [35]:
def getDir(obj):
    list = []
    for i in ast.literal_eval(obj):           #to convert the string object into a list
        if i['job'] == 'Director':            # to get the name of the director of the movie 
            list.append(i['name'])
            break;
    return list 

In [36]:
merged_data['crew'] = merged_data['crew'].apply(getDir)

In [37]:
def get_dir2(data):
    res = []
    res.append(data.strip().split(';')[0])
    return res

In [38]:
metadata_raw_2[['crew']] = metadata_raw_2[['crew']].astype(str)

In [39]:
metadata_raw_2['crew'] = metadata_raw_2['crew'].apply(get_dir2)

In [40]:
def castNames(text):
    list = []
    counter = 0
    for i in ast.literal_eval(text):           #to convert the string object into a list
        if(counter<5):
            list.append(i['name'])
            counter+=1
        else:
            break
    return list 

In [41]:
merged_data['cast'] = merged_data['cast'].apply(castNames)

In [42]:
metadata_raw_2[['cast']] = metadata_raw_2[['cast']].astype(str)

In [43]:
def castNames2(data):
    res = []
    counter = 0
    for name in data.strip().split(';'):
        if counter < 5:
            res.append(name)
            counter+=1
        else:
            break
    return res

In [44]:
metadata_raw_2['cast'] = metadata_raw_2['cast'].apply(castNames2)

In [45]:
merged_data['overview'] = merged_data['overview'].astype(str)
merged_data['overview'] = merged_data['overview'].apply(lambda x : x.split())

In [46]:
metadata_raw_2['overview'] = metadata_raw_2['overview'].apply(lambda x : x.split(';'))

In [47]:
def collapse(list):
    collapsed_list = []
    for i in list:
        collapsed_list.append(i.replace(" ",""))
    return collapsed_list

In [48]:
merged_data['keywords'] = merged_data['keywords'].apply(collapse)
merged_data['genres'] = merged_data['genres'].apply(collapse)
merged_data['crew'] = merged_data['crew'].apply(collapse)
merged_data['cast'] = merged_data['cast'].apply(collapse)

In [49]:
metadata_raw_2['keywords'] = metadata_raw_2['keywords'].apply(collapse)
metadata_raw_2['genres'] = metadata_raw_2['genres'].apply(collapse)
metadata_raw_2['crew'] = metadata_raw_2['crew'].apply(collapse)
metadata_raw_2['cast'] = metadata_raw_2['cast'].apply(collapse)

In [50]:
merged_data.head()

Unnamed: 0,id,keywords,genres,original_title,overview,production_companies,cast,crew
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[Animation, Comedy, Family]",Toy Story,"[Led, by, Woody,, Andy's, toys, live, happily,...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[TomHanks, TimAllen, DonRickles, JimVarney, Wa...",[JohnLasseter]
1,8844,"[boardgame, disappearance, basedonchildren'sbo...","[Adventure, Fantasy, Family]",Jumanji,"[When, siblings, Judy, and, Peter, discover, a...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...",[JoeJohnston]
2,15602,"[fishing, bestfriend, duringcreditsstinger, ol...","[Romance, Comedy]",Grumpier Old Men,"[A, family, wedding, reignites, the, ancient, ...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[WalterMatthau, JackLemmon, Ann-Margret, Sophi...",[HowardDeutch]
3,31357,"[basedonnovel, interracialrelationship, single...","[Comedy, Drama, Romance]",Waiting to Exhale,"[Cheated, on,, mistreated, and, stepped, on,, ...",[{'name': 'Twentieth Century Fox Film Corporat...,"[WhitneyHouston, AngelaBassett, LorettaDevine,...",[ForestWhitaker]
4,11862,"[baby, midlifecrisis, confidence, aging, daugh...",[Comedy],Father of the Bride Part II,"[Just, when, George, Banks, has, recovered, fr...","[{'name': 'Sandollar Productions', 'id': 5842}...","[SteveMartin, DianeKeaton, MartinShort, Kimber...",[CharlesShyer]


In [51]:
metadata_raw_2.head()

Unnamed: 0,id,keywords,genres,original_title,overview,cast,crew
0,3829920,"[helicopter, wildfire, baby, firefighter, fore...","[Disaster, Action, Biography, Drama]",Only the Brave,"[In, 2007, Prescott,, Arizona,, Eric, Marsh, o...","[JoshBrolin, MilesTeller, JeffBridges, Jennife...",[JosephKosinski]
1,1758810,"[snow, policeinvestigation, serialkiller, prom...","[SerialKiller, Crime, Drama, Mystery, Thriller]",The Snowman,"[When, an, elite, crime, squad's, lead, detect...","[MichaelFassbender, RebeccaFerguson, Charlotte...",[TomasAlfredson]
2,6217804,"[sequel, secondpart, directedbystar, rhymeinti...","[Comedy, Drama, Horror, Mystery]",Boo 2! A Madea Halloween,"[The, film, opens, after, school, on, Tiffany'...","[TylerPerry, CassiDavis, PatriceLovely, Yousef...",[TylerPerry]
3,1230168,"[homelessman, friendship, kindness, husbandwif...","[Biography, Drama]",Same Kind of Different as Me,"[Ron, Hall, (Greg, Kinnear), lost, track, of, ...","[GregKinnear, RenÃ©eZellweger, DjimonHounsou, ...",[MichaelCarney]
4,2620590,"[texas, chainsaw, nurse, barn, sheriff, violen...","[SerialKiller, SlasherHorror, TeenHorror, Crim...",Leatherface,"[A, violent, teen, and, three, others, kidnap,...","[StephenDorff, LiliTaylor, SamStrike, VanessaG...",[AlexandreBustillo]


In [56]:
merged_data['collection'] = merged_data['keywords'] + merged_data['genres'] + merged_data['overview'] + merged_data['cast'] + merged_data['crew']

In [57]:
metadata_raw_2['collection'] = metadata_raw_2['keywords'] + metadata_raw_2['genres'] + metadata_raw_2['overview'] + metadata_raw_2['cast'] + metadata_raw_2['crew']

In [58]:
merged_data = merged_data[['id','original_title', 'collection']]

In [59]:
metadata_raw_2 = metadata_raw_2[['id','original_title', 'collection']]

In [60]:
final_merge = pd.concat([merged_data,metadata_raw_2], axis = 0, ignore_index = True)

In [64]:
final_merge

Unnamed: 0,id,original_title,collection
0,862,Toy Story,"[jealousy, toy, boy, friendship, friends, riva..."
1,8844,Jumanji,"[boardgame, disappearance, basedonchildren'sbo..."
2,15602,Grumpier Old Men,"[fishing, bestfriend, duringcreditsstinger, ol..."
3,31357,Waiting to Exhale,"[basedonnovel, interracialrelationship, single..."
4,11862,Father of the Bride Part II,"[baby, midlifecrisis, confidence, aging, daugh..."
...,...,...,...
36135,14307536,A Jazzman's Blues,"[love, story, murder, secret, 1930s, 1980s, ye..."
36136,57773,The Munsters,"[sitcom, monster, spoof, husbandwiferelationsh..."
36137,15474916,Smile,"[psychological, jumpscare, curse, goinginsane,..."
36138,9731598,Bros,"[mansucksaman'sfinger, barechestedmale, gaykis..."


In [65]:
final_merge['collection'] = final_merge['collection'].apply(lambda x: " ".join(x))

In [80]:
final_merge

Unnamed: 0,id,original_title,collection
0,862,Toy Story,jealousy toy boy friendship friends rivalry bo...
1,8844,Jumanji,boardgame disappearance basedonchildren'sbook ...
2,15602,Grumpier Old Men,fishing bestfriend duringcreditsstinger oldmen...
3,31357,Waiting to Exhale,basedonnovel interracialrelationship singlemot...
4,11862,Father of the Bride Part II,baby midlifecrisis confidence aging daughter m...
...,...,...,...
36135,14307536,A Jazzman's Blues,love story murder secret 1930s 1980s year1937 ...
36136,57773,The Munsters,sitcom monster spoof husbandwiferelationship f...
36137,15474916,Smile,psychological jumpscare curse goinginsane evil...
36138,9731598,Bros,mansucksaman'sfinger barechestedmale gaykiss g...


In [83]:
final_merge.drop_duplicates(inplace = True)

### Creating the recommender:

Now that we have our final data, we can move on to creating the model and running our recommender system.

In [71]:
# Importing Required packages:
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

In [73]:
# Creating Porter stemmer object
ps = PorterStemmer()

In [74]:
# We'd be needing the base form of words to act as rating factor for different movies
# so we'll be stemming down the words in the collection.
def stem(text):
    y = []
    
    for i in text.split():
        y.append(ps.stem(i))
        
    return " ".join(y)

In [85]:
# applying the stemming process to our collection of data
final_merge['collection'] = final_merge['collection'].apply(stem)

In [89]:
# Creating a vector from our vocabulary of words and removing the common non-relevnt words
cv = CountVectorizer(max_features=5000,stop_words='english')

In [90]:
# Fitting the vector of vocabulary with our collection of words for each individual movie
# its like creating a matrix consisting of movies in each row and every distinct word of
# our vocabulary in the columns and every movie will be having a vector which will go through
# only those columns of which the words descrivbes the movie best
vector = cv.fit_transform(final_merge['collection']).toarray()
vector.shape

(34899, 5000)

In [91]:
# defining a similarity object which will make relations among movies and according to which
# we can have our recommender sustem give similarly vectorised movies with our choice provided.
similarity = cosine_similarity(vector)

In [92]:
# the final function which will give us recommendation based on our choice
def recommend(movie):
    index = final_merge[final_merge['original_title'] == movie].index[0]
    distances = similarity[index]
    most_similar_movies = sorted(list(enumerate(distances)),reverse=True,key = lambda x: x[1])[1:6]
    
    for i in most_similar_movies:
        print(final_merge.iloc[i[0]].original_title)
# Make sure to type the correct name of the movie.. we dont have a closest match system yet ;P
# will be adding closest match of movies searched by using python's 'difflib' library

In [118]:
recommend('Life of Pi')

Bright Eyes
Cluny Brown
A Horse for Danny
No, No, Nanette
The Lion King
