# Importing the necessary libraries

In [1]:
from inverted_index import InvertedIndex
import nltk
from utils import read_data
nltk.download('stopwords')
inv_ind = InvertedIndex()

# Initialization done

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ivanyanakiev1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Read the documents from the data folder

In [2]:
documents = read_data("./shakespeare")
# Print the second document (index 1) as a (title, text) tuple
print(documents[1])

('Othello', "\tOTHELLO\n\n\tDRAMATIS PERSONAE\n\nDUKE OF VENICE:\n\nBRABANTIO\ta senator.\n\n\tOther Senators.\n\t(Senator:)\n\t(First Senator:)\n\t(Second Senator:)\n\nGRATIANO\tbrother to Brabantio.\n\nLODOVICO\tkinsman to Brabantio.\n\nOTHELLO\ta noble Moor in the service of the Venetian state.\n\nCASSIO\this lieutenant.\n\nIAGO\this ancient.\n\nRODERIGO\ta Venetian gentleman.\n\nMONTANO\tOthello's predecessor in the government of Cyprus.\n\n\tClown, servant to Othello. (Clown:)\n\nDESDEMONA\tdaughter to Brabantio and wife to Othello.\n\nEMILIA\twife to Iago.\n\nBIANCA\tmistress to Cassio.\n\n\tSailor, Messenger, Herald, Officers, Gentlemen,\n\tMusicians, and Attendants.\n\t(Sailor:)\n\t(First Officer:)\n\t(Messenger:)\n\t(Gentleman:)\n\t(First Gentleman:)\n\t(Second Gentleman:)\n\t(Third Gentleman:)\n\t(First Musician:)\n\nSCENE\tVenice: a Sea-port in Cyprus.\n\n\tOTHELLO\n\nACT I\n\nSCENE I\tVenice. A street.\n\n\t[Enter RODERIGO and IAGO]\n\nRODERIGO\tTush! never tell me; I take 

## Print the number of documents and the title of each document

In [3]:
print(len(documents))
for i in documents:
    # Print document title
    print(i[0])

39
Julius Caesar
Othello
A Midsummer Night's Dream
Troilus and Cressida
King Richard II
King Henry IV, II
Titus Andronicus
Much Ado About Nothing
Love's Labour's Lost
The Two Gentlemen of Verona
The Comedy of Errors
Cymbeline
All's Well that Ends Well
Twelfth Night
King Lear
The Tempest
Macbeth
Venus and Adonis
Timon of Athens
King Henry VIII
The Merchant of Venice
A Lover's Complaint
King Henry VI
Measure for Measure
Collection of Shakespeare Sonnets
Antony and Cleopatra
King John
Coriolanus
King Henry V
The Merry Wives of Windsor
Romeo and Juliet
King Henry IV
Hamlet
King Richard III
Pericles, Prince of Tyre
The Taming of the Shrew
The Winter's Tale
As You Like It
The Rape of Lucrece


## Add documents to the inverted index

In [4]:
for i in documents:
    # Add document to inverted index
    inv_ind.add_document(i)
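
As a rough illustration of what adding documents to an inverted index involves, here is a minimal sketch. It is not the `InvertedIndex` class imported above: `build_inverted_index` and `toy_docs` are hypothetical names, and the tokenization is deliberately naive (lowercasing and stripping punctuation, with no stopword removal).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a {document title: count} dictionary.

    docs is a list of (title, text) tuples, the same shape as the
    tuples printed from `documents` above.
    """
    index = defaultdict(dict)
    for title, text in docs:
        for token in text.lower().split():
            # Naive normalization: strip surrounding punctuation only.
            term = token.strip(".,;:!?[]()'\"")
            if term:
                index[term][title] = index[term].get(title, 0) + 1
    return index


# Hypothetical toy collection, not taken from the Shakespeare data.
toy_docs = [
    ("Doc A", "The king of Scotland."),
    ("Doc B", "A king, a thane, and a king."),
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["king"])  # {'Doc A': 1, 'Doc B': 2}
```

Looking up a term returns every document that contains it together with the term frequency, which is the raw material for the weighting schemes used later.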

## Print some descriptive statistics to verify that everything works

In [5]:
print(inv_ind.get_total_terms())
print(inv_ind.get_total_docs())

19202
39


## Generate a term by document matrix using log entropy

## Explanation of Log-Entropy and its two components
Log-Entropy weighting is built on entropy, a measure of how surprising an event is given its probability. For example,
if an event has a 90% chance of occurring, then the surprise of seeing it is low. The weight of term i in document j is the product of the two components described below.

### Component 1
1. This is the logarithm of the frequency of term i in document j (the local weight). Taking the logarithm dampens large counts, so a term that occurs 100 times in a document does not count 100 times as much as a term that occurs once.
Since a term may occur many times in a document or not at all, this is commonly computed as `log(1 + tf_ij)`, which gives 0 for an absent term. The result is then multiplied by the next component.

### Component 2
2. The second component (the global weight) can be interpreted as the amount of surprise for a discrete variable X with probability P(X); in our case X is the term. To compute the surprise
we need the frequency of the term in the current document relative to its total frequency across all documents, `p_ij = tf_ij / gf_i`. This ratio is at most 1 and can be read as the probability that an occurrence of term i falls in document j. We sum `p_ij * log(p_ij)` over all documents and divide by the log of the total number of documents. In the usual formulation, 1 is added to this normalized sum, so a term spread evenly over all documents gets a global weight of 0 and a term concentrated in a single document gets a global weight of 1. The final value represents the surprise of an occurrence of term i: if the surprise is low then the probability was high, and if the surprise is high then the probability was low.
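
The two components can be sketched in NumPy on a toy matrix. This is an illustration of the standard log-entropy formula, not the code behind `calcLogEntropy()`: the function name `log_entropy_weights` and the matrix `toy_tf` are assumptions made for the example, and the notebook's own implementation may differ in details such as the logarithm base.

```python
import numpy as np


def log_entropy_weights(tf):
    """Weight a terms x documents count matrix with log-entropy.

    Assumes at least two documents, since the global weight divides
    by log(number of documents).
    """
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]

    # Component 1: local weight log(1 + tf_ij).
    local_w = np.log1p(tf)

    # Component 2: global weight 1 + sum_j p_ij * log(p_ij) / log(n_docs),
    # where p_ij = tf_ij / gf_i and gf_i is the term's total frequency.
    gf = tf.sum(axis=1, keepdims=True)
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    safe_p = np.where(p > 0, p, 1.0)  # treat 0 * log(0) as 0
    plogp = p * np.log(safe_p)
    global_w = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)

    return local_w * global_w


# Hypothetical counts: 2 terms (rows) in 4 documents (columns).
toy_tf = np.array([
    [2, 2, 2, 2],  # spread evenly: global weight 0
    [5, 0, 0, 0],  # concentrated in one document: global weight 1
])
weights = log_entropy_weights(toy_tf)
print(weights)
```

The evenly spread term ends up with weights of (approximately) zero everywhere, while the concentrated term keeps its full local weight in the one document that contains it.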


In [None]:
inv_ind.calcLogEntropy()
inv_ind.generate_term_by_doc_matrix(log_entropy=True)

### Perform a search query: "scotland kings and thanes" using the Log-Entropy weighting scheme

In [17]:
result = inv_ind.search("scotland kings and thanes", log_entropy=True)
for i in range(0,10):
    print(result[i])

('Macbeth', 0.07769052662351016)
('King Henry VI', 0.031869310554797074)
('King Henry IV', 0.030208435869742898)
('King Henry IV, II', 0.028145973814360462)
('King John', 0.027447611557131362)
('King Henry V', 0.027437757231046665)
('King Richard III', 0.026833074525408347)
('King Richard II', 0.02678812593487038)
('King Henry VIII', 0.02590799663639283)
("All's Well that Ends Well", 0.025433109729848902)


## Generate a term by document matrix using the TF model only

In [None]:
inv_ind.generate_term_by_doc_matrix()

### Test on the same data set with the same query and print top 10 results

In [12]:
result = inv_ind.search("scotland kings and thanes")
for i in range(0,10):
    print(result[i])

('King Henry V', 0.2659215354074205)
('King Henry VI', 0.2617837075388672)
('King John', 0.2472493685181954)
('King Richard II', 0.2253953514359277)
('King Lear', 0.20415867029886436)
('King Henry VIII', 0.1998881836179047)
('King Richard III', 0.18418950223762484)
('Hamlet', 0.1241932011402115)
("All's Well that Ends Well", 0.11190811971898096)
('King Henry IV', 0.10791586179996365)


## Generate term by doc matrix using the TF-IDF model

In [None]:
inv_ind.calcTFIDF()
inv_ind.generate_term_by_doc_matrix()
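
For comparison with the Log-Entropy scheme, here is a minimal sketch of one common TF-IDF variant, `tf * log(N / df)`. It is not the code behind `calcTFIDF()`, which may use a different variant; `tfidf_weights` and `toy_tf` are hypothetical names introduced for the example.

```python
import numpy as np


def tfidf_weights(tf):
    """Weight a terms x documents count matrix with tf * log(N / df)."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(tf, axis=1)
    # Guard against terms that appear in no document at all.
    idf = np.log(n_docs / np.maximum(df, 1))
    return tf * idf[:, np.newaxis]


# Hypothetical counts: 2 terms (rows) in 3 documents (columns).
toy_tf = np.array([
    [3, 1, 2],  # appears in every document: idf = log(3/3) = 0
    [4, 0, 0],  # appears in one document: idf = log(3/1)
])
tfidf = tfidf_weights(toy_tf)
print(tfidf)
```

A term that occurs in every document is weighted down to zero regardless of how often it occurs, which is the global information that the TF-only model lacks.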

### Test again on the same query and print the top 10 results

In [14]:
result = inv_ind.search("scotland kings and thanes", tfidf=True)
for i in range(0,10):
    print(result[i])

('Macbeth', 0.06342271855923712)
('King Henry VI', 0.01072781887182939)
('King Henry V', 0.009657775892130944)
('King John', 0.008469965122871384)
('King Richard II', 0.0077213170531481206)
('King Lear', 0.006993816919843515)
('King Henry IV', 0.006848287703722218)
('King Henry VIII', 0.0068475238333851225)
('King Richard III', 0.006723494450381792)
('King Henry IV, II', 0.005135632065177295)


## Does Log-Entropy work better or worse than TF and TF-IDF?
1. It performs better than TF. TF looks only at the term frequency within each document, so it tends to favor longer documents and ignores how a term is distributed across the whole collection. For the query above, TF does not place Macbeth in the top 10 at all, while Log-Entropy ranks it first.
2. Log-Entropy is calculated from probabilities and tries to determine the surprise of seeing a term. Its calculation accounts for more factors,
such as the frequency in the current document versus the total frequency.
3. Looking at the test query from the example shown in class, TF-IDF is comparable to Log-Entropy. Both rank Macbeth first and their top 10 lists share nine titles.
The rankings are very similar in nature, with the remaining titles appearing in a somewhat different order.
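
The similarity between the three rankings can be checked with a small sketch that counts how many titles two top 10 lists share. The lists are copied from the outputs printed above; the helper name `overlap` is introduced here only for illustration.

```python
# Top 10 titles copied from the search outputs above.
log_entropy_top10 = [
    "Macbeth", "King Henry VI", "King Henry IV", "King Henry IV, II",
    "King John", "King Henry V", "King Richard III", "King Richard II",
    "King Henry VIII", "All's Well that Ends Well",
]
tf_top10 = [
    "King Henry V", "King Henry VI", "King John", "King Richard II",
    "King Lear", "King Henry VIII", "King Richard III", "Hamlet",
    "All's Well that Ends Well", "King Henry IV",
]
tfidf_top10 = [
    "Macbeth", "King Henry VI", "King Henry V", "King John",
    "King Richard II", "King Lear", "King Henry IV", "King Henry VIII",
    "King Richard III", "King Henry IV, II",
]


def overlap(a, b):
    """Number of titles that appear in both rankings, ignoring order."""
    return len(set(a) & set(b))


print(overlap(log_entropy_top10, tfidf_top10))  # 9
print(overlap(log_entropy_top10, tf_top10))     # 8
print("Macbeth" in tf_top10)                    # False
```

Set overlap ignores order, so it only shows which titles are retrieved; a rank correlation measure would be needed to compare the ordering itself.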

### Conclusion
From these results we can conclude that, for this query, the Log-Entropy model gives clearly more accurate rankings than the TF-only model, and the data suggest a slight improvement over the TF-IDF model.