# Aim:- Build a mini search engine

Sub Tasks :-
    1. Generate TF-IDF vectors for the given corpus
    2. Use 2-3 domains of documents (here the name of the Player, i.e. the speaking character, is considered)
    3. Take a passage of text as the query and suggest the most similar Play/Player from the dataset
    4. GUI implementation (optional)

# Packages Used

NLTK :-
    Used for basic text pre-processing (tokenization, stop-word removal, stemming).

Sklearn :-
    Used to build the TF-IDF matrix and to compute cosine similarity.

Pandas :-
    Used to load the data and filter it.

In [None]:
# Importing necessary libraries

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd

# Loading the data from the CSV file

In [None]:
df = pd.read_csv("Shakespeare_data.csv")
df.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


# Filtering the data down to the columns and players needed

In [None]:
# Drop the columns that are not needed for the search
df = df.drop(columns=['Dataline', 'PlayerLinenumber', 'ActSceneLine'])
df.head()

Unnamed: 0,Play,Player,PlayerLine
0,Henry IV,,ACT I
1,Henry IV,,SCENE I. London. The palace.
2,Henry IV,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,Henry IV,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,KING HENRY IV,"Find we a time for frighted peace to pant,"


In [None]:
df['Player'].value_counts()[:10]

GLOUCESTER        1920
HAMLET            1582
IAGO              1161
FALSTAFF          1117
KING HENRY V      1086
BRUTUS            1051
OTHELLO            928
MARK ANTONY        927
KING HENRY VI      917
DUKE VINCENTIO     909
Name: Player, dtype: int64

In [None]:
# Keep only the ten players with the most lines (taken from the counts above)
requires = ['GLOUCESTER', 'HAMLET', 'IAGO', 'FALSTAFF', 'KING HENRY V',
            'BRUTUS', 'OTHELLO', 'MARK ANTONY', 'KING HENRY VI', 'DUKE VINCENTIO']
df = df[df['Player'].isin(requires)]
df.head()

Unnamed: 0,Play,Player,PlayerLine
114,Henry IV,FALSTAFF,"Now, Hal, what time of day is it, lad?"
126,Henry IV,FALSTAFF,"Indeed, you come near me now, Hal, for we that..."
127,Henry IV,FALSTAFF,"purses go by the moon and the seven stars, and..."
128,Henry IV,FALSTAFF,"by Phoebus, he,'that wandering knight so fair...."
129,Henry IV,FALSTAFF,"I prithee, sweet wag, when thou art king, as, God"
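
The ten names used for filtering are copied by hand from the `value_counts()` output. As a minimal sketch, the same kind of list could be derived programmatically. The `toy` frame and `top_n` below are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for Shakespeare_data.csv.
# The row with a missing Player mimics a stage direction such as "ACT I".
toy = pd.DataFrame({
    "Player": ["HAMLET", "HAMLET", "HAMLET", "IAGO", "IAGO", "OTHELLO", None],
    "PlayerLine": ["a", "b", "c", "d", "e", "f", "ACT I"],
})

top_n = 2  # the notebook keeps 10 players

# value_counts() sorts by frequency (descending) and skips missing values.
top_players = toy["Player"].value_counts().index[:top_n].tolist()

# Keep only the lines spoken by those players and renumber the rows.
filtered = toy[toy["Player"].isin(top_players)].reset_index(drop=True)

print(top_players)
print(len(filtered))
```

Deriving the list this way keeps the filter in step with the data if the CSV changes, at the cost of making the chosen players less visible in the notebook.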


In [None]:
print('Size of dataset :-',len(df))

Size of dataset :- 11598


In [None]:
df.reset_index(drop=True,inplace=True)
print(df)

              Play      Player  \
0         Henry IV    FALSTAFF   
1         Henry IV    FALSTAFF   
2         Henry IV    FALSTAFF   
3         Henry IV    FALSTAFF   
4         Henry IV    FALSTAFF   
...            ...         ...   
11593  Richard III  GLOUCESTER   
11594  Richard III  GLOUCESTER   
11595  Richard III  GLOUCESTER   
11596  Richard III  GLOUCESTER   
11597  Richard III  GLOUCESTER   

                                              PlayerLine  
0                 Now, Hal, what time of day is it, lad?  
1      Indeed, you come near me now, Hal, for we that...  
2      purses go by the moon and the seven stars, and...  
3      by Phoebus, he,'that wandering knight so fair....  
4      I prithee, sweet wag, when thou art king, as, God  
...                                                  ...  
11593   Farewell, good cousin, farewell, gentle friends.  
11594                                             Exeunt  
11595                                             ACT IV  

In [None]:
temp=df.groupby('Player')
temp.first()

Unnamed: 0_level_0,Play,PlayerLine
Player,Unnamed: 1_level_1,Unnamed: 2_level_1
BRUTUS,Coriolanus,He has no equal.
DUKE VINCENTIO,Measure for measure,Escalus.
FALSTAFF,Henry IV,"Now, Hal, what time of day is it, lad?"
GLOUCESTER,Henry VI Part 1,England ne'er had a king until his time.
HAMLET,Hamlet,"[Aside] A little more than kin, and less than..."
IAGO,Othello,"'Sblood, but you will not hear me:"
KING HENRY V,Henry V,Where is my gracious Lord of Canterbury?
KING HENRY VI,Henry VI Part 1,"Uncles of Gloucester and of Winchester,"
MARK ANTONY,Antony and Cleopatra,There's beggary in the love that can be reckon'd.
OTHELLO,Othello,'Tis better as it is.


# Pre-processing the data

In [None]:
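# Build the stop-word set and the stemmer once instead of once per line.
# The NLTK stop-word and tokenizer data must already be downloaded.
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))
STOP_WORDS.update([',','?','""',"''",'.','!', "'",'"',"'d","'ll",'[',']','--',':',';','///'])
STEMMER = nltk.stem.porter.PorterStemmer()

def toLower(sentence):
    return sentence.lower()

def tokenizer(sentence):
    # Returns unique tokens only; not used by pre_process, which keeps repeats
    tokens = list(set(nltk.word_tokenize(sentence)))
    return tokens

def stopwords_removal(tokens):
    # Drop English stop words and the punctuation tokens listed above
    filtered_tokens = [i for i in tokens if i not in STOP_WORDS]
    return filtered_tokens

def stemming(tokens):
    # Reduce each token to its Porter stem, e.g. "abandoned" -> "abandon"
    stemmed_tokens = [STEMMER.stem(i) for i in tokens]
    return stemmed_tokens

def pre_process(text):
    # Split into sentences, then words; remove stop words; stem what is left
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    tokens = stopwords_removal(tokens)
    stems = stemming(tokens)
    return stems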
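
The `tokenizer` helper wraps its tokens in `set(...)`, which discards repeated words, while `pre_process` keeps them. A small sketch of why that difference matters for term frequency follows. The example line is made up, and `str.split` is only a stand-in for `nltk.word_tokenize`.

```python
from collections import Counter

line = "to be or not to be"   # hypothetical example line
tokens = line.split()         # stand-in for nltk.word_tokenize

counts_full = Counter(tokens)        # repeats kept: term frequency survives
counts_dedup = Counter(set(tokens))  # repeats dropped: every count becomes 1

print(counts_full["to"], counts_dedup["to"])
```

With de-duplicated tokens the TF part of TF-IDF carries no information, so only the list-based tokenization in `pre_process` is suitable as the vectorizer's tokenizer.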


In [None]:
# Define vectorizer parameters; pre_process replaces the default tokenizer
tfidf_vectorizer = TfidfVectorizer(max_features=15000,
                                 use_idf=True,tokenizer=pre_process)

# Fit the vectorizer on the player lines and build the document-term matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(df['PlayerLine'])

print(tfidf_matrix.shape)

(11598, 6282)
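
Each of the 11598 rows is one player line and each of the 6282 columns is one stemmed term. As a rough illustration of what a single row holds, the sketch below computes TF-IDF weights by hand. It assumes scikit-learn's default settings (raw counts, smoothed IDF, L2 normalisation), and the toy documents and function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data).
docs = [["king", "crown", "king"], ["crown", "sword"], ["sword", "moon"]]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf_vector(doc):
    # Weight = raw term count * IDF, then scale the vector to unit length.
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

In the first toy document "king" outweighs "crown" because it occurs twice there and in no other document.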


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out().tolist()
print(terms)

["'a", "'adieu", "'agrippa", "'all", "'an", "'as'", "'bi", "'bout", "'brutu", "'cert", "'content", "'curs", "'do", "'down", "'em", "'england", "'enough", "'faith", "'fli", "'for", "'fore", "'gainst", "'god", "'good", "'greed", "'harri", "'havior", "'help", "'hem", "'ho", "'if", "'impon", "'imprimi", "'inde", "'it", "'jealousi", "'larum", "'love", "'m", "'man", "'margaret", "'mongst", "'no", "'non", "'now", "'one", "'peac", "'putter", "'re", "'s", "'sblood", "'scape", "'scuse", "'seem", "'sees", "'shall", "'sleep", "'speak", "'stablish", "'stand", "'stroy", "'sweet", "'swound", "'t", "'te", "'that", "'the", "'there", "'these", "'thi", "'thou", "'to", "'to-morrow", "'tween", "'twere", "'twill", "'twixt", "'twould", "'ve", "'we", "'well", "'while", "'who", "'whore", "'your", "'zound", 'a-b', 'a-bird', 'a-curs', 'a-day', 'a-do', 'a-front', 'a-kil', 'a-piec', 'abandon', 'abhor', 'abhorson', 'abid', 'abil', 'abject', 'abl', 'aboard', 'abod', 'abomin', 'abound', 'abridg', 'abroach', 'abroad',

# Finding cosine similarity

In [None]:
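def get_cosine_matrix(sentence):
    # Project the query into the fitted TF-IDF space
    vect = tfidf_vectorizer.transform([sentence])
    # Cosine distance (1 - similarity) from the query to every line, in one call
    distances = 1 - cosine_similarity(vect, tfidf_matrix)[0]
    # Keyed by play, so a later line of the same play overwrites an earlier one
    dictionary = dict(zip(df['Play'], distances))
    # Sort plays by ascending distance (most similar first)
    dictionary = dict(sorted(dictionary.items(), key=lambda item: item[1]))
    return dictionary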
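
For reference, the quantity returned for each play is a cosine distance, i.e. one minus the cosine of the angle between two TF-IDF vectors. A minimal pure-Python sketch of that calculation follows. The function name and the toy vectors are hypothetical, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity_sparse(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector (e.g. a query with no known terms) matches nothing
    return dot / (norm_a * norm_b)

query = {"king": 1.0, "crown": 1.0}
line_a = {"king": 2.0, "crown": 2.0}  # same direction as the query
line_b = {"moon": 1.0}                # no terms in common with the query

dist_a = 1 - cosine_similarity_sparse(query, line_a)
dist_b = 1 - cosine_similarity_sparse(query, line_b)
print(dist_a, dist_b)
```

A distance near 0 means the two texts use the same terms in the same proportions; a distance of 1 means they share no terms at all.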

# Entering a query and finding similar plays

In [None]:
sentence = input("Enter the PlayerLine of Play to get recommendation :- \n")
# Score the query once and keep the five closest plays
matrix = list(get_cosine_matrix(sentence).keys())
lst = matrix[:5]
print("\nRelated Plays and Players:- \n",lst)

Enter the PlayerLine of Play to get recommendation :- 
# But I have that within which passeth show /// It is not nor it cannot come to good /// And makes each petty artery in this body /// That you must teach me. But let me conjure you, by /// Exit 

Related Plays and Players:- 
 ['Henry VI Part 2', 'Henry IV', 'Henry VI Part 1', 'Henry VI Part 3', 'Antony and Cleopatra']
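
Because the result dictionary in `get_cosine_matrix` is keyed by `Play` alone, each play ends up with the distance of whichever of its lines was scored last, and no player is reported. One possible alternative, sketched here with made-up distances, keeps the smallest distance per (play, player) pair and returns the closest few. `rank_best_matches` is a hypothetical helper, not part of the notebook.

```python
import heapq

def rank_best_matches(rows, top_k=5):
    """rows: iterable of (play, player, distance). Returns the top_k closest pairs."""
    best = {}
    for play, player, distance in rows:
        key = (play, player)
        # Keep only the best (smallest) distance seen for this pair.
        if key not in best or distance < best[key]:
            best[key] = distance
    # nsmallest returns the pairs sorted by ascending distance.
    return heapq.nsmallest(top_k, best.items(), key=lambda item: item[1])

# Made-up distances for illustration only.
rows = [
    ("Hamlet", "HAMLET", 0.20),
    ("Hamlet", "HAMLET", 0.90),
    ("Othello", "IAGO", 0.55),
    ("Othello", "OTHELLO", 0.40),
]
ranking = rank_best_matches(rows, top_k=2)
print(ranking)
```

In the notebook, the rows could be produced by zipping `df['Play']`, `df['Player']` and the distance array for the query.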


In [None]:
# Some example queries

# You tread upon my patience: but be sure /// Danger and disobedience in thine eye /// Whose tongue shall ask me for one penny cost /// We licence your departure with your son /// Betwixt that Holmedon and this seat of ours
# More dazzled and drove back his enemies /// Except it be to pray against thy foes /// These news would cause him once more yield the ghost /// Exit /// BISHOP
# But I have that within which passeth show /// It is not nor it cannot come to good /// And makes each petty artery in this body /// That you must teach me. But let me conjure you, by /// Exit
# It speaks against her with the other proofs /// Yet be content /// Is my lord angry /// When it hath blown his ranks into the air /// To kiss in private /// From this time forth I never will speak word
# hostess of the tavern a most sweet wench /// time and oft /// Thou hast the most unsavoury similes and art indeed /// Christendom /// and I paid nothing for it neither, but was paid for
# He has no equal /// Come, sir, come, we know you well enough /// In that there's comfort /// But hearts for the event /// And we will follow
# Escalus /// That does affect it. Once more, fare you well /// Goes all decorum. /// Enter ISABELLA and FRANCISCA /// Even like an o'ergrown lion in a cave,

# Learning Outcomes

1. Gained detailed knowledge of the pre-processing steps and their importance, along with technical terms such as corpus and vectorize
2. Learned the importance of filtering the corpus and the methodology behind it
3. Learned what cosine similarity, term frequency and TF-IDF are, along with their pros and cons