# Prototype 1 - BM25 3/3 Indexes - Correct Result



---



---



## Future Potential Adjustments:


*   Optimization: apply a multi-stage retrieval system
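
The multi-stage idea could take roughly the following shape. This is a minimal sketch, not part of the prototype: a cheap first stage uses an inverted index to collect only the documents that contain at least one query term, and a second stage scores just those candidates. The names `build_inverted_index`, `two_stage_search` and `count_scorer` are hypothetical, and the scorer is a placeholder that a BM25 scorer would replace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[token].add(doc_id)
    return index

def two_stage_search(query_terms, docs, index, scorer):
    # Stage 1: cheap candidate generation via the inverted index.
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())
    # Stage 2: run the (more expensive) scorer on the candidates only.
    scored = [(doc_id, scorer(query_terms, docs[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def count_scorer(query_terms, text):
    """Placeholder scorer (an assumption): counts query-term occurrences."""
    tokens = text.split()
    return sum(tokens.count(term) for term in query_terms)

docs = [
    "sole survivor gun battl boat",
    "juri holdout attempt",
    "soldier battl zulu warrior",
]
index = build_inverted_index(docs)
ranked = two_stage_search(["survivor", "battl"], docs, index, count_scorer)
print(ranked)
```

With only 1000 overviews the saving is small, but the same split keeps the expensive scoring step off documents that cannot match.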





---



---



# Imports & Installs

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer

from nltk.stem.porter import PorterStemmer

In [2]:
!pip install nltk



In [3]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True



---



---



# Input Query

Describe the film you would like to see as a list of keywords, for example 'battle', 'war', 'survivor'.

In [6]:
q_terms = ['survivor', 'battle']
df_q = pd.DataFrame(q_terms, columns=['Query'])
df_q

Unnamed: 0,Query
0,survivor
1,battle




---



---



# Dataset

In [7]:
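df0 = pd.read_csv('imdb_top_1000.csv')  # untouched copy, used at the end to show the original title and overview
df = pd.read_csv('imdb_top_1000.csv')   # working copy that the preprocessing steps modify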
df0.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


 

---



---



# Stop Word Removal

In [8]:
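stop_words = stopwords.words('english')
# Note: the membership test below is case-sensitive, so capitalised stop words ('An', 'The', 'In') are kept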
df['Overview'] = df['Overview'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
df_q['Query'] = df_q['Query'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

df['Overview']

0      Two imprisoned men bond number years, finding ...
1      An organized crime dynasty's aging patriarch t...
2      When menace known Joker wreaks havoc chaos peo...
3      The early life career Vito Corleone 1920s New ...
4      A jury holdout attempts prevent miscarriage ju...
                             ...                        
995    A young New York socialite becomes interested ...
996    Sprawling epic covering life Texas cattle ranc...
997    In Hawaii 1941, private cruelly punished boxin...
998    Several survivors torpedoed merchant ship Worl...
999    A man London tries help counter-espionage Agen...
Name: Overview, Length: 1000, dtype: object
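
The output above still starts several overviews with "An", "When", "The" and "In", because the stop-word list is lower case and the comparison is case-sensitive. A possible fix is to compare each word in lower case, as in the sketch below. The small `STOP_WORDS` set is only a stand-in for `stopwords.words('english')`, and `remove_stop_words` is a hypothetical helper, not a function from the notebook.

```python
# Small stand-in for stopwords.words('english'); an assumption for this sketch.
STOP_WORDS = {"a", "an", "and", "the", "in", "when", "of", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lower case so 'The' matches 'the'."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = remove_stop_words("The early life and career of Vito Corleone")
print(cleaned)
```

Only the comparison is lower-cased, so proper nouns such as "Vito Corleone" keep their capitalisation.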

 

---



---



# Lemmatization

In [9]:
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    words = text.split()
    words = [lemmatizer.lemmatize(word,pos='v') for word in words]
    return ' '.join(words)
df['Overview'] = df['Overview'].apply(lemmatize_words)
df_q['Query'] = df_q['Query'].apply(lemmatize_words)

df['Overview']

0      Two imprison men bond number years, find solac...
1      An organize crime dynasty's age patriarch tran...
2      When menace know Joker wreak havoc chaos peopl...
3      The early life career Vito Corleone 1920s New ...
4      A jury holdout attempt prevent miscarriage jus...
                             ...                        
995    A young New York socialite become interest you...
996    Sprawling epic cover life Texas cattle rancher...
997    In Hawaii 1941, private cruelly punish box uni...
998    Several survivors torpedo merchant ship World ...
999    A man London try help counter-espionage Agent....
Name: Overview, Length: 1000, dtype: object

 

---



---



# Stemming

In [10]:
porter_stemmer  = PorterStemmer()
def stemmer_words(text):
    words = text.split()
    words = [porter_stemmer.stem(word) for word in words]
    return ' '.join(words)
df['Overview'] = df['Overview'].apply(stemmer_words)
df_q['Query'] = df_q['Query'].apply(stemmer_words)

df['Overview']

0      two imprison men bond number years, find solac...
1      An organ crime dynasty' age patriarch transfer...
2      when menac know joker wreak havoc chao peopl g...
3      the earli life career vito corleon 1920 new yo...
4      A juri holdout attempt prevent miscarriag just...
                             ...                        
995    A young new york socialit becom interest young...
996    sprawl epic cover life texa cattl rancher fami...
997    In hawaii 1941, privat cruelli punish box unit...
998    sever survivor torpedo merchant ship world war...
999    A man london tri help counter-espionag agent. ...
Name: Overview, Length: 1000, dtype: object

 

---



---



# Dataset Reduction

In [11]:
# Optional: drop the last 900 of the 1000 observations to speed up experiments.
# Disabled here, so all 1000 observations are used.
#N=900
#df = df.iloc[:-N , :] 

In [12]:
df = df.drop(['Poster_Link', 'Certificate', 'No_of_Votes', 'Gross', 'Runtime', 'Meta_score','Director', 'Star1', 'Star2', 'Star3', 'Star4'], axis = 1)
df = df.rename(columns={'Series_Title': "Title",
                        'Released_Year': 'Year',
                        'IMDB_Rating': 'Rating'})

In [13]:
print(df.shape)
df.head()

(1000, 5)


Unnamed: 0,Title,Year,Genre,Rating,Overview
0,The Shawshank Redemption,1994,Drama,9.3,"two imprison men bond number years, find solac..."
1,The Godfather,1972,"Crime, Drama",9.2,An organ crime dynasty' age patriarch transfer...
2,The Dark Knight,2008,"Action, Crime, Drama",9.0,when menac know joker wreak havoc chao peopl g...
3,The Godfather: Part II,1974,"Crime, Drama",9.0,the earli life career vito corleon 1920 new yo...
4,12 Angry Men,1957,"Crime, Drama",9.0,A juri holdout attempt prevent miscarriag just...


 

---



---



# The Documents

In [14]:
documents=df['Overview'].unique()
print(documents.shape)
documents

(1000,)


array(['two imprison men bond number years, find solac eventu redempt act common decency.',
       "An organ crime dynasty' age patriarch transfer control clandestin empir reluct son.",
       'when menac know joker wreak havoc chao peopl gotham, batman must accept one greatest psycholog physic test abil fight injustice.',
       'the earli life career vito corleon 1920 new york citi portrayed, son, michael, expand tighten grip famili crime syndicate.',
       'A juri holdout attempt prevent miscarriag justic forc colleagu reconsid evidence.',
       "gandalf aragorn lead world men sauron' armi draw gaze frodo sam approach mount doom one ring.",
       'the live two mob hitmen, boxer, gangster wife, pair diner bandit intertwin four tale violenc redemption.',
       'In german-occupi poland world war ii, industrialist oskar schindler gradual becom concern jewish workforc wit persecut nazis.',
       'A thief steal corpor secret use dream-shar technolog give invers task plant idea mind c

 

---



---



# Resultant Indexed Query 

In [15]:
df_q['Query']

0    survivor
1       battl
Name: Query, dtype: object



---



---



# Vectorization (Bag of Words)

In [16]:
vectorizer = CountVectorizer(stop_words='english')
documents_vectorized = vectorizer.fit_transform(documents)
vocabulary = vectorizer.get_feature_names_out()

In [17]:
dataframe = pd.DataFrame(documents_vectorized.toarray(), columns=vocabulary)
dataframe.head()

Unnamed: 0,00,000,007,10,100,100th,11,1183,12,13,...,zero,zodiac,zombi,zombie,zon,zone,zorg,zuckerberg,zulu,édith
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
dfs = (dataframe > 0).sum(axis=0)
dfs.shape # one entry per vocabulary term: the number of the 1000 documents that contain it
dfs

00            1
000           5
007           2
10            2
100           1
             ..
zone          1
zorg          1
zuckerberg    1
zulu          1
édith         1
Length: 4728, dtype: int64

In [19]:
# then calculate idf:
N = dataframe.shape[0]

In [20]:
idfs = np.log10(N/dfs)
idfs

00            3.00000
000           2.30103
007           2.69897
10            2.69897
100           3.00000
               ...   
zone          3.00000
zorg          3.00000
zuckerberg    3.00000
zulu          3.00000
édith         3.00000
Length: 4728, dtype: float64
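
As a quick check of the values above, the IDF used here is the plain `log10(N / df)`. The sketch below reproduces the three distinct values shown for N = 1000 using only the standard library. The helper name `idf` is an assumption for illustration. Note that this is the classic TF-IDF form, not the probabilistic IDF that many BM25 descriptions use.

```python
import math

def idf(n_docs, doc_freq):
    """Plain log10(N / df), the form used in this notebook."""
    return math.log10(n_docs / doc_freq)

N = 1000
# Document frequencies 1, 2 and 5 correspond to '00', '007' and '000' above.
values = {d: round(idf(N, d), 5) for d in (1, 2, 5)}
print(values)
```

A term that appears in a single overview gets the maximum weight of 3, and the weight falls as the term becomes more common.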

 

---



---



# BM25 Frequency Conversion

In [21]:
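k_1 = 1.2 # term-frequency saturation parameter (single value)
b = 0.8 # strength of document-length normalisation (single value)
## Option 1: document length counting every whitespace-separated token (overwritten below)
dls = [len(d.split(' ')) for d in documents] # vector
## Option 2 (the one used): document length counting only the tokens kept by CountVectorizer
dls = dataframe.sum(axis=1).tolist()
print(dls)
avgdl = np.mean(dls) # average document length (single value)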

[11, 11, 17, 18, 10, 15, 13, 17, 13, 12, 15, 18, 13, 18, 19, 20, 22, 12, 18, 14, 17, 9, 9, 17, 12, 14, 17, 11, 16, 24, 19, 10, 10, 24, 17, 10, 12, 11, 11, 11, 10, 17, 19, 10, 14, 14, 10, 23, 14, 16, 15, 10, 17, 12, 21, 19, 9, 13, 15, 20, 12, 16, 13, 17, 17, 17, 14, 14, 9, 9, 16, 16, 15, 16, 11, 18, 13, 11, 13, 9, 11, 11, 11, 14, 23, 13, 12, 10, 23, 17, 10, 24, 9, 18, 9, 12, 18, 10, 12, 9, 14, 12, 13, 11, 17, 16, 21, 17, 9, 23, 9, 24, 4, 10, 15, 7, 16, 12, 12, 11, 9, 7, 10, 14, 14, 10, 21, 7, 16, 15, 12, 16, 11, 10, 16, 24, 11, 9, 22, 8, 14, 11, 17, 17, 20, 9, 14, 18, 15, 13, 14, 18, 13, 18, 12, 11, 11, 15, 21, 12, 9, 24, 15, 9, 13, 21, 9, 17, 21, 22, 11, 15, 15, 11, 9, 9, 14, 12, 15, 13, 14, 11, 10, 13, 10, 13, 8, 8, 9, 12, 13, 12, 14, 4, 13, 14, 11, 14, 11, 9, 12, 13, 18, 17, 9, 13, 11, 19, 6, 17, 12, 18, 22, 20, 10, 14, 9, 26, 18, 16, 11, 8, 13, 17, 15, 21, 15, 7, 6, 22, 14, 26, 14, 11, 8, 13, 13, 10, 7, 14, 14, 9, 12, 18, 14, 15, 9, 12, 8, 12, 7, 23, 13, 16, 13, 15, 11, 10, 14, 14, 

In [22]:
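# With these we can compute the BM25 term-frequency weight. Switching to NumPy arrays makes the broadcasting easier:
numerator = np.array((k_1 + 1) * dataframe)
# Per-document length normalisation, reshaped to (N, 1) so that it broadcasts across the vocabulary columns
denominator = np.array(k_1 *((1 - b) + b * (np.asarray(dls) / avgdl))).reshape(N,1) + np.array(dataframe)

BM25_tf = numerator / denominator

idfs = np.array(idfs)

# Weight each term's BM25 term frequency by its idf (broadcast across rows)
BM25_score = BM25_tf * idfs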
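
To make the matrix expression above easier to follow, here is the same term-frequency weight for a single term in a single document. It is a sketch that mirrors the cell's formula. The function name `bm25_tf` and the toy lengths are assumptions for illustration.

```python
def bm25_tf(tf, dl, avgdl, k_1=1.2, b=0.8):
    """BM25 term-frequency component for one term in one document."""
    length_norm = (1 - b) + b * (dl / avgdl)
    return (k_1 + 1) * tf / (k_1 * length_norm + tf)

absent = bm25_tf(tf=0, dl=10, avgdl=10)     # term not in the document
same_len = bm25_tf(tf=1, dl=10, avgdl=10)   # document of average length
short_doc = bm25_tf(tf=1, dl=5, avgdl=10)   # shorter than average
long_doc = bm25_tf(tf=1, dl=20, avgdl=10)   # longer than average
print(absent, same_len, short_doc, long_doc)
```

A single occurrence in an average-length document scores 1. The same occurrence counts for more in a shorter document and for less in a longer one, which is the effect `b` controls.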

In [23]:
np.array(dataframe)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [24]:
bm25_idf = pd.DataFrame(BM25_score, columns=vocabulary)
bm25_idf.head()

Unnamed: 0,00,000,007,10,100,100th,11,1183,12,13,...,zero,zodiac,zombi,zombie,zon,zone,zorg,zuckerberg,zulu,édith
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
cols = list(bm25_idf.columns.values)
cols[10:20]    # remove the [10:20] slice to see the full list of words to choose from

['14', '1431', '16', '160', '16th', '17', '170', '1820', '18th', '19']

In [26]:
sum_of = bm25_idf.sum(axis=0) 
df_sum = pd.DataFrame(sum_of, columns=['bm25_idf'])
print(df_sum.shape)
df_sum

(4728, 1)


Unnamed: 0,bm25_idf
00,2.206294
000,10.830017
007,4.427434
10,4.957533
100,3.401051
...,...
zone,3.528433
zorg,2.795518
zuckerberg,2.566999
zulu,3.665727


In [27]:
bm25_idf

Unnamed: 0,00,000,007,10,100,100th,11,1183,12,13,...,zero,zodiac,zombi,zombie,zon,zone,zorg,zuckerberg,zulu,édith
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
df_sum_gt = df_sum[df_sum['bm25_idf']>1]     # lt = less than,  gt = greater than
#df_sum_gt

 

---



---



# Scoring the Query

In [29]:
q_terms_only_df = bm25_idf[df_q['Query']]

score_q_d = q_terms_only_df.sum(axis=1)
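
The column lookup above raises a `KeyError` when a query term is not in the vocabulary. The sketch below shows the same per-document sum in plain Python, with unknown terms contributing zero. The function `score_documents` and the toy weights are made up for illustration and are not values from the dataset.

```python
def score_documents(query_terms, doc_term_scores):
    """Sum per-term scores for each document, ignoring unknown terms."""
    return [
        sum(term_scores.get(term, 0.0) for term in query_terms)
        for term_scores in doc_term_scores
    ]

# Toy per-document BM25 weights (made-up numbers for illustration).
doc_term_scores = [
    {"survivor": 1.5, "battl": 2.0},
    {"juri": 3.0},
    {"battl": 2.0},
]
# 'dragon' is in no document, so it simply adds nothing.
scores = score_documents(["survivor", "battl", "dragon"], doc_term_scores)
print(scores)
```

In the notebook itself, one option with the same effect would be to filter the query terms to those present in `bm25_idf.columns` before indexing.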

 

---



---



# Results Investigation

In [30]:
# Rank the documents by their BM25 score for the query, highest first
result = sorted(zip(documents,score_q_d.values), key = lambda tup:tup[1], reverse=True)
# The documents that match the query most closely now come first

In [31]:
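# Keep only documents with a non-zero score, i.e. those containing at least one query term
result = [(doc, score) for doc, score in result if score != 0.0]
result    # add [0] to show only the top result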

[('A sole survivor tell twisti event lead horrif gun battl boat, begin five crimin meet seemingli random polic lineup.',
  3.529105351646774),
 ('four week mysterious, incur viru spread throughout uk, hand survivor tri find sanctuary.',
  2.518873751498002),
 ('the lone survivor onslaught flesh-possess spirit hole cabin group stranger demon continu attack.',
  2.2727273981696503),
 ('sever survivor torpedo merchant ship world war II find lifeboat one crew member u-boat sink ship.',
  2.2010318679057392),
 ('A grow nation genet evolv ape lead caesar threaten band human survivor devast viru unleash decad earlier.',
  2.070405700691785),
 ("outnumb british soldier battl zulu warrior rorke' drift.",
  2.0500955672881513),
 ('A russian german sniper play game cat-and-mous battl stalingrad.',
  1.973312123001283),
 ('A young man surviv disast sea hurtl epic journey adventur discovery. while cast away, form unexpect connect anoth survivor: fearsom bengal tiger.',
  1.8507323088039045),
 ('aft

In [32]:
def_res = pd.DataFrame(result)
def_res

Unnamed: 0,0,1
0,A sole survivor tell twisti event lead horrif ...,3.529105
1,"four week mysterious, incur viru spread throug...",2.518874
2,the lone survivor onslaught flesh-possess spir...,2.272727
3,sever survivor torpedo merchant ship world war...,2.201032
4,A grow nation genet evolv ape lead caesar thre...,2.070406
5,outnumb british soldier battl zulu warrior ror...,2.050096
6,A russian german sniper play game cat-and-mous...,1.973312
7,A young man surviv disast sea hurtl epic journ...,1.850732
8,"after tragic accident, two stage magician enga...",1.773986
9,"frank, singl man rais child prodigi niec mary,...",1.773986


 

---



---



# Final Resultant Output

In [33]:
string_def = def_res.iloc[0,0]
string_def

'A sole survivor tell twisti event lead horrif gun battl boat, begin five crimin meet seemingli random polic lineup.'

In [34]:
result1 = df[df.eq(string_def).any(axis=1)]  # axis must be passed by keyword in recent pandas versions
result1

Unnamed: 0,Title,Year,Genre,Rating,Overview
41,The Usual Suspects,1995,"Crime, Mystery, Thriller",8.5,A sole survivor tell twisti event lead horrif ...


In [35]:
result1_index = result1.index[0]
result1_index

41

In [36]:
overview1 = df0.loc[result1_index,:].Overview
title1 = df0.loc[result1_index,:].Series_Title

In [37]:
print('The resultant movie title suggestion, based on your input query is:', title1)
print('Please see the overview of the movie below:\n', overview1)

The resultant movie title suggestion, based on your input query is: The Usual Suspects
Please see the overview of the movie below:
 A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.


 

---



---



The query asked for a film involving both a 'battle' and a 'survivor'. The best match is shown above.

This is a deliberately simple prototype that returns only the top match. The final version would list more movie matches.
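
Listing more matches could be done along the following lines. This is a sketch under stated assumptions, not the planned implementation: scores and titles are parallel lists, and the row index is carried through the ranking so that no lookup by processed overview text is needed. The function `top_k` and the film titles are hypothetical.

```python
import heapq

def top_k(scores, titles, k=3):
    """Return the k best (title, score) pairs that have a non-zero score."""
    ranked = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [(titles[i], s) for i, s in ranked if s > 0]

titles = ["Film A", "Film B", "Film C", "Film D"]  # hypothetical titles
scores = [0.0, 3.5, 2.0, 2.5]                      # made-up query scores
best = top_k(scores, titles, k=3)
for title, score in best:
    print(f"{title}: {score:.2f}")
```

In the notebook, `score_q_d` shares its index with `df0`, so the same index could be used to look up `Series_Title` and `Overview` directly.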



---



---

