Building a movie recommender with a TF-IDF vectorizer and a pre-trained BERT model

DATASET : https://www.kaggle.com/jrobischon/wikipedia-movie-plots

Only a subset of this dataset is used: 3,000 movies in total (2,000 American and 1,000 British).

---





**APPROACH 1: USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TFIDF)**

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
dataset_path="drive/My Drive/wiki_movie_plots_deduped.csv"
data=pd.read_csv(dataset_path)

In [4]:
data.head()

,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [5]:
len(data)

34886

In [6]:
import numpy as np
np.unique(data['Origin/Ethnicity'])

array(['American', 'Assamese', 'Australian', 'Bangladeshi', 'Bengali',
       'Bollywood', 'British', 'Canadian', 'Chinese', 'Egyptian',
       'Filipino', 'Hong Kong', 'Japanese', 'Kannada', 'Malayalam',
       'Malaysian', 'Maldivian', 'Marathi', 'Punjabi', 'Russian',
       'South_Korean', 'Tamil', 'Telugu', 'Turkish'], dtype=object)

In [7]:
len(data.loc[data['Origin/Ethnicity']=='American'])

17377

In [8]:
len(data.loc[data['Origin/Ethnicity']=='Telugu'])

1311

In [9]:
len(data.loc[data['Origin/Ethnicity']=='British'])

3670

In [10]:
df1=pd.DataFrame(data.loc[data['Origin/Ethnicity']=='American'].iloc[13000:15000])
df2=pd.DataFrame(data.loc[data['Origin/Ethnicity']=='British'].iloc[:1000])
data=pd.concat([df1,df2],ignore_index=True)

In [11]:
len(data)

3000

In [12]:
finaldata=data[["Title","Plot"]]
finaldata=finaldata.set_index('Title')

In [13]:
finaldata.head(10)

Title,Plot
Good Burger,"On the first day of summer, slacker high schoo..."
Good Will Hunting,Twenty-year-old Will Hunting of South Boston i...
Goodbye America,As the U.S. Subic Bay naval base's operations ...
Gridlock'd,"Set in Detroit, Gridlock'd centers around hero..."
Grosse Pointe Blank,Professional assassin Martin Blank finds himse...
Hacks,Brian (Stephen Rea) is a television writer-pro...
Hard Eight,"Sydney, a gambler in his 60s, finds a young ma..."
Henry Fool,Socially inept garbage-man Simon Grim is befri...
Hercules,"In Ancient Greece, after imprisoning the Titan..."
Highball,A newly married couple decides to improve thei...


In [14]:
finaldata["Plot"].iloc[0]

"On the first day of summer, slacker high school student Dexter Reed takes his mother's car on a joyride while she is on a business trip and accidentally crashes into and damages the car of his teacher, Mr. Wheat. Dexter is in danger of going to jail, as he does not have a driver's license or insurance. But Mr. Wheat agrees to let Dexter pay for the damages to both cars in exchange for not calling the police on Dexter. With the damages estimated at $1,900, Dexter is forced to get a summer job. After being dismissed from the new, soon-to-open Mondo Burger for clashing with and insulting the owner/manager, Kurt Bozwell, he ends up finding employment at Good Burger where he meets and reluctantly befriends dimwitted Ed and a host of other colorful employees. Initially, neither of them are aware that it was Ed who inadvertently caused Dexter's car accident; Ed had been on his way to make a delivery, and skated in front of Dexter, causing him to swerve out of control and crash into Mr. Wheat

In [15]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [16]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
VERB_CODES = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

In [18]:
def preprocess_sentences(text):
    """Lowercase, lemmatize, and drop stop words and non-alphabetic tokens."""
    text = text.lower()
    temp_sent = []
    words = nltk.word_tokenize(text)
    tags = nltk.pos_tag(words)
    for i, word in enumerate(words):
        # Lemmatize verbs with the verb POS; everything else uses the default (noun).
        if tags[i][1] in VERB_CODES:
            lemmatized = lemmatizer.lemmatize(word, 'v')
        else:
            lemmatized = lemmatizer.lemmatize(word)
        # Keep only alphabetic tokens that are not stop words.
        if lemmatized not in stop_words and lemmatized.isalpha():
            temp_sent.append(lemmatized)

    finalsent = ' '.join(temp_sent)
    # Contraction clean-up. Because only alphabetic tokens survive the filter
    # above, these replacements currently find nothing to replace.
    finalsent = finalsent.replace("n't", " not")
    finalsent = finalsent.replace("'m", " am")
    finalsent = finalsent.replace("'s", " is")
    finalsent = finalsent.replace("'re", " are")
    finalsent = finalsent.replace("'ll", " will")
    finalsent = finalsent.replace("'ve", " have")
    finalsent = finalsent.replace("'d", " would")
    return finalsent


In [19]:
finaldata["plot_processed"]=finaldata["Plot"].apply(preprocess_sentences)

In [20]:
finaldata.head()

Title,Plot,plot_processed
Good Burger,"On the first day of summer, slacker high schoo...",first day summer slacker high school student d...
Good Will Hunting,Twenty-year-old Will Hunting of South Boston i...,hunt south boston genius though work janitor m...
Goodbye America,As the U.S. Subic Bay naval base's operations ...,subic bay naval base operation slowly wind nav...
Gridlock'd,"Set in Detroit, Gridlock'd centers around hero...",set detroit gridlock center around heroin addi...
Grosse Pointe Blank,Professional assassin Martin Blank finds himse...,professional assassin martin blank find depres...


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec=TfidfVectorizer()

In [22]:
tfidf_movieid=tfidfvec.fit_transform((finaldata["plot_processed"]))

In [23]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim=cosine_similarity(tfidf_movieid,tfidf_movieid)
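
To make the two steps above concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity on a toy corpus. The helper names `tfidf_vectors` and `cosine` and the three toy plots are illustrative assumptions, not part of the notebook. The IDF formula is assumed to follow the smoothed default documented for scikit-learn's `TfidfVectorizer`, and tokenisation is a plain whitespace split, so exact values may differ from the library's.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per whitespace-tokenised document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF (assumption: same form as scikit-learn's default).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Hypothetical, already-preprocessed toy plots.
toy_plots = [
    "wizard school magic friend",
    "wizard magic dark lord",
    "burger summer job friend",
]
vecs = tfidf_vectors(toy_plots)
sim_wizards = cosine(vecs[0], vecs[1])   # two shared terms
sim_mixed = cosine(vecs[0], vecs[2])     # one shared term
print(round(sim_wizards, 3), round(sim_mixed, 3))
```

The two wizard plots share more terms than the wizard and burger plots, so their similarity comes out higher. That ordering is what the recommender relies on.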

In [24]:
indices=pd.Series(finaldata.index)

In [25]:
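def recommendations(title, cosine_sim = cos_sim):
    """Return the 10 titles whose plots are most similar to `title`."""
    recommended_movies = []
    # Row position of the first movie with this title.
    index = indices[indices == title].index[0]
    similarity_scores = pd.Series(cosine_sim[index]).sort_values(ascending = False)
    # Skip rank 0, which is expected to be the movie itself (similarity 1).
    top_10_movies = list(similarity_scores.iloc[1:11].index)
    for i in top_10_movies:
        recommended_movies.append(finaldata.index[i])
    return recommended_movies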
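
The function above drops the top-ranked entry on the assumption that it is the query movie itself. If two movies have identical plots or duplicate titles, that assumption may not hold. A hedged alternative is to exclude the query by its row position instead. The sketch below uses a hypothetical helper `top_k_similar` and a made-up 4x4 similarity matrix, not the notebook's data.

```python
import numpy as np

def top_k_similar(sim_matrix, titles, title, k=10):
    """Return the k titles most similar to `title`, skipping the query row itself."""
    query = titles.index(title)                     # first occurrence of the title
    scores = np.asarray(sim_matrix[query], dtype=float)
    order = np.argsort(-scores, kind="stable")      # highest similarity first
    best = [i for i in order if i != query][:k]     # exclude by position, not by rank
    return [titles[i] for i in best]

# Hypothetical toy data for illustration only.
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(top_k_similar(toy_sim, toy_titles, "A", k=2))
```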

In [26]:
recommendations("Harry Potter and the Chamber of Secrets")

["Harry Potter and the Sorcerer's Stone",
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Prisoner of Azkaban',
 'Six Ways to Sunday',
 'Twilight',
 ' The Tailor of Panama',
 'Kiss Kiss Bang Bang',
 'Palmetto',
 'The Four Feathers',
 'Dumb and Dumberer: When Harry Met Lloyd']

In [27]:
recommendations("Ice Age")

['Ice Age: The Meltdown',
 'Bella',
 ' East Side Story',
 'Mimic',
 'Go',
 'Flushed Away',
 'Blow',
 'Brown Sugar',
 'Lords of Dogtown',
 'Dinosaur']

In [28]:
recommendations("Blackmail")

['The Last Days of Disco',
 'Killing Me Softly',
 'The Longest Yard',
 'Closer',
 'Bringing Out the Dead',
 'Catch Me If You Can',
 'Novocaine',
 'The Phantom Light',
 'Transporter 2',
 'The Transporter']



---



**APPROACH 2: TRANSFER LEARNING WITH A PRE-TRAINED BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS (BERT) MODEL**


This approach uses transfer learning in the form of feature extraction: the weights of a pre-trained BERT model are kept frozen, and the model is used only to turn each plot into an embedding vector. It is not trained further on our dataset.
The model is loaded with PyTorch through the Hugging Face transformers package. We use DistilBERT, a lighter and faster distilled version of BERT.


**BERT model methodology:**

BERT is a stack of Transformer encoder layers. It captures the context of a sentence by attending to the words on both the left and the right of each token, i.e. bidirectionally. It is a pre-trained language model whose pre-training relies on two tasks:
1.   Masked Language Modelling (MLM): some input tokens are hidden and the model learns to predict them from the surrounding context.
2.   Next Sentence Prediction (NSP): the model learns to predict whether one sentence follows another in the original text.
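
As a rough illustration of how MLM training inputs are built, the sketch below hides a random subset of tokens and records the originals as labels. It is a simplification under stated assumptions: the function name `mask_tokens`, the masking probability, and the example sentence are all hypothetical. BERT's actual recipe also sometimes substitutes a random token or leaves the chosen token unchanged instead of always inserting `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; return (masked_tokens, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a model would only be scored on the hidden positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

# Hypothetical example sentence.
sentence = "the wizard returns to school for his second year".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```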




In [29]:
!pip install transformers # Hugging Face Transformers library (provides the BERT models)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

In [30]:
import torch

In [31]:
import transformers

In [32]:
import warnings
warnings.filterwarnings('ignore')

In [33]:
bert_model=transformers.DistilBertModel
berttokenizer=transformers.DistilBertTokenizer
weights_type="distilbert-base-uncased"

In [34]:
tokenizer=berttokenizer.from_pretrained(weights_type)
model=bert_model.from_pretrained(weights_type)


In [35]:
inputs=finaldata["Plot"].apply((lambda plot: tokenizer.encode(plot, add_special_tokens=True,max_length=100,truncation=True)))

In [36]:
inputs.iloc[0]

[101,
 2006,
 1996,
 2034,
 2154,
 1997,
 2621,
 1010,
 19840,
 2121,
 2152,
 2082,
 3076,
 14375,
 7305,
 3138,
 2010,
 2388,
 1005,
 1055,
 2482,
 2006,
 1037,
 6569,
 15637,
 2096,
 2016,
 2003,
 2006,
 1037,
 2449,
 4440,
 1998,
 9554,
 19119,
 2046,
 1998,
 12394,
 1996,
 2482,
 1997,
 2010,
 3836,
 1010,
 2720,
 1012,
 10500,
 1012,
 14375,
 2003,
 1999,
 5473,
 1997,
 2183,
 2000,
 7173,
 1010,
 2004,
 2002,
 2515,
 2025,
 2031,
 1037,
 4062,
 1005,
 1055,
 6105,
 2030,
 5427,
 1012,
 2021,
 2720,
 1012,
 10500,
 10217,
 2000,
 2292,
 14375,
 3477,
 2005,
 1996,
 12394,
 2000,
 2119,
 3765,
 1999,
 3863,
 2005,
 2025,
 4214,
 1996,
 2610,
 2006,
 14375,
 1012,
 2007,
 1996,
 12394,
 4358,
 102]

In [37]:
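import numpy as np
def padding(list_of_sent):
  """Right-pad every token-id list with zeros up to max_len."""
  output=[]
  max_len=100  # must match the max_length passed to tokenizer.encode
  for sent in list_of_sent.values:
    padded_sent=sent+[0]*(max_len-len(sent))  # 0 is the padding token id
    output.append(padded_sent)
  output=np.array(output)
  return output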

In [38]:
inputs=padding(inputs)

In [39]:
inputs[11]

array([  101,  2848, 17935, 28139,  1010,  5650, 22688,  1010,  9658,
       15333,  6826, 10762,  1010,  1998,  4656,  4895,  4590,  2024,
        2176,  7587,  2359,  2718,  3549,  2551,  2005,  1037,  2167,
        4759,  9452,  3029,  2040,  2031,  7376,  1037,  1002,  2184,
        2454,  7421,  1011, 11965,  2075,  3274,  9090,  1012,  1996,
       15862,  2404,  2009,  2503,  1037,  6556,  2491,  2482,  2000,
       13583,  2009,  2627,  3036,  2012,  2624,  3799,  2248,  3199,
        1012,  2174,  1010,  1037, 17434,  4666,  1011,  2039,  5158,
        1010,  4786,  1037,  2450,  2315,  3680,  1012, 23484,  2000,
       21089,  2202,  1996, 15862,  1005,  4524,  4820,  1996,  6556,
        2491,  2482,  2096,  2016,  2003,  4192,  2188,  2000,  3190,
         102])

In [40]:
mask=np.where(inputs!=0,1,0)
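
The padding and masking steps can also be done together. The sketch below is one possible variant, with a hypothetical helper `pad_and_mask` and made-up token ids: it builds the attention mask from each sequence's length rather than from the padded values, so it does not depend on which id is used for padding.

```python
import numpy as np

def pad_and_mask(token_id_lists, max_len, pad_id=0):
    """Right-pad each id list to max_len and build the matching 0/1 attention mask."""
    padded = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(token_id_lists), max_len), dtype=np.int64)
    for row, ids in enumerate(token_id_lists):
        ids = ids[:max_len]                 # truncate anything longer than max_len
        padded[row, :len(ids)] = ids
        mask[row, :len(ids)] = 1            # 1 = real token, 0 = padding
    return padded, mask

# Hypothetical short sequences for illustration.
toy_ids = [[101, 2006, 1996, 102], [101, 2034, 102]]
padded, mask = pad_and_mask(toy_ids, max_len=5)
print(padded)
print(mask)
```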

In [41]:
embedded_inputs=torch.tensor(inputs)
attention_mask=torch.tensor(mask)

In [42]:
with torch.no_grad():
  final_states = model(embedded_inputs, attention_mask=attention_mask)

In [43]:
extracted_features=final_states[0][:,0,:].numpy()
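
The cell above takes the hidden state at position 0 (the `[CLS]` token) of every plot as its feature vector, after passing all 3,000 padded plots through the model in a single call. On machines with limited memory, the same extraction could be done in batches. The sketch below shows the idea with NumPy only. `cls_features_in_batches` and `fake_encoder` are hypothetical names, and `fake_encoder` merely stands in for a function that would wrap the model call and return an array of shape (batch, sequence length, hidden size).

```python
import numpy as np

def cls_features_in_batches(encode_fn, input_ids, attention_mask, batch_size=32):
    """Run encode_fn on slices of the inputs and stack the first-token vectors."""
    chunks = []
    for start in range(0, len(input_ids), batch_size):
        stop = start + batch_size
        hidden = encode_fn(input_ids[start:stop], attention_mask[start:stop])
        chunks.append(np.asarray(hidden)[:, 0, :])   # keep position 0, i.e. [CLS]
    return np.concatenate(chunks, axis=0)

def fake_encoder(ids, mask, hidden_size=4):
    """Hypothetical stand-in for the model: returns (batch, seq_len, hidden_size)."""
    ids = np.asarray(ids, dtype=float)
    return np.repeat(ids[:, :, None], hidden_size, axis=2)

toy_ids = np.arange(30).reshape(6, 5)     # 6 fake sequences of length 5
toy_mask = np.ones_like(toy_ids)
features = cls_features_in_batches(fake_encoder, toy_ids, toy_mask, batch_size=4)
print(features.shape)
```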

In [44]:
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
cos_sim=cosine_similarity(extracted_features,extracted_features)

In [46]:
indices=pd.Series(finaldata.index)

In [47]:
# Redefined so that the default `cosine_sim` binds to the BERT-based matrix:
# Python evaluates default arguments once, when the function is defined.
def recommendations(title, cosine_sim = cos_sim):
    recommended_movies = []
    index = indices[indices == title].index[0]
    similarity_scores = pd.Series(cosine_sim[index]).sort_values(ascending = False)
    top_10_movies = list(similarity_scores.iloc[1:11].index)
    for i in top_10_movies:
        recommended_movies.append(list(finaldata.index)[i])
    return recommended_movies

In [48]:
recommendations("Big Fat Liar")

['Warriors of Virtue',
 'The Long Weekend',
 ' Napoleon Dynamite',
 'Lloyd',
 'Orange County',
 'Loser',
 'Road Trip',
 'Accepted',
 'Hoot',
 "National Lampoon's Van Wilder"]

In [49]:
recommendations("Harry Potter and the Chamber of Secrets")

['Harry Potter and the Prisoner of Azkaban',
 "Harry Potter and the Sorcerer's Stone",
 "Pooh's Heffalump Movie",
 'The Magnet',
 'Harry Potter and the Goblet of Fire',
 'Shrek',
 'Elf',
 'Robots',
 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe',
 "Piglet's Big Movie"]

In [50]:
recommendations("Ice Age")

['Dreamcatcher',
 'Casper Meets Wendy',
 'Chicken Little',
 "Pooh's Heffalump Movie",
 'Shrek',
 "Clifford's Really Big Movie",
 'Muppets from Space',
 'Sexy Beast',
 'The Village',
 'Charlie and the Chocolate Factory']

In [51]:
recommendations("Blackmail")

['The 39 Steps',
 'The Interrupted Journey',
 'Gosford Park',
 'Brass Monkey',
 'Murder Without Crime',
 'London',
 'One Wild Oat',
 'Something Always Happens',
 'Piccadilly',
 'The Silent Passenger']