## Introduction

The purpose of this activity is to explore the use of different weighting schemes to measure term importance.
We will use the BeautifulSoup package to learn more about our data. BeautifulSoup parses HTML
and other markup formats and places the contents in a parse tree, where each node corresponds to a tag.
We can iterate over these tags and extract their text content.
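
As a minimal sketch of that idea, the snippet below parses a tiny made-up SGML fragment and pulls the text out of each tag. The sample string and the names `sample`, `docs`, `headlines`, and `bodies` are invented for illustration and are not part of the worksheet data.

```python
from bs4 import BeautifulSoup

# a tiny, made-up SGML fragment (hypothetical data, not from ap89.sample.txt)
sample = """
<DOC>
<DOCNO> AP-0001 </DOCNO>
<HEAD>First headline</HEAD>
<TEXT>
Body of the first document.
</TEXT>
</DOC>
<DOC>
<DOCNO> AP-0002 </DOCNO>
<HEAD>Second headline</HEAD>
<TEXT>
Body of the second document.
</TEXT>
</DOC>
"""

# html.parser lowercases tag names, so we search for 'doc', 'head', and 'text'
soup = BeautifulSoup(sample, 'html.parser')

# each 'doc' tag is a sub-tree of the parse tree
docs = soup.find_all('doc')

# .text returns the text contained in a tag
headlines = [d.find('head').text for d in docs]
bodies = [d.find('text').text.strip() for d in docs]

print(headlines)
print(bodies)
```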

As you work through this worksheet, look for the comments marked with triple hash marks (###); they give instructions.

### Step 1
* Walk through the code below, reading the comments to help you understand what it does.
* Execute the code.

In [28]:
# open the file, parse the documents, normalize the text, and create a nested list where each inner list
# is the text of a document in the collection

# import BeautifulSoup (from the bs4 module) for parsing the marked-up text, plus nltk and string
from bs4 import BeautifulSoup
import nltk
import string

### replace /home/jupyter-lballest@mtholyoke-8f27d with the name of your home directory
### you should only need to replace the part containing the username (e.g. lballest to your username)
file = '/home/jupyter-lballest/COMSC-341TE/ap89.sample.txt'

# create translation table for deleting punctuation
punctTable = str.maketrans('','',string.punctuation)

# create dashTable to replace dash with space
dashTable = str.maketrans('-',' ','')

def normalize(text):
    '''Replace dashes with spaces, remove punctuation, lowercase, and tokenize on whitespace.'''
    # replace dash with whitespace
    text = text.translate(dashTable)
    # delete punctuation
    line = text.translate(punctTable)
    # lowercase
    line = line.lower()
    # tokenize on whitespace
    tokens = line.split()
    return tokens


# open the sgml file of documents and read it as one long string.  Note this is not a good
# approach in general, since a larger file would need to be read a line at a time to avoid
# memory problems, but this is a reasonably small file.
# The with statement closes the file once it has been read.
with open(file, 'r') as fp:
    text = fp.read()

# parse with html (similar to SGML) parser which returns a tree structure
# containing our parsed document data
# each sgml tag will be a node in the tree, each 'doc' tag a sub-tree
soup = BeautifulSoup(text, 'html.parser')

# list to collect the list of tokens for each doc 
collectionText = []

# can use tags to access specific objects of each doc in the tree
# tags we care about are 'doc', 'head', and 'text'
# we can access the text of those objects using the .text member
docNum = 0
# iterate on the 'doc' nodes of the tree
for doc in soup('doc'):
    docNum += 1
    # initialize string to collect document headlines and text
    text = ""
    # get the headline nodes of each doc tag and save the text associated with that tag
    # iterate on each headline
    for head in doc.find_all('head'):
        text += head.text
        text += " "
    for body in doc.find_all('text'):
        text += body.text
        text += ' '
    #display the document before processing
    print("doc:",text)
    # tokenize and normalize doc text
    normalized = normalize(text)
    # append doc text
    collectionText.append(normalized)


doc: You Don't Need a Weatherman To Know '60s Films Are Here Eds: Also in Monday AMs report. 
   The celluloid torch has been passed to a new
generation: filmmakers who grew up in the 1960s.
   ``Platoon,'' ``Running on Empty,'' ``1969'' and ``Mississippi
Burning'' are among the movies released in the past two years from
writers and directors who brought their own experiences of that
turbulent decade to the screen.
   ``The contemporaries of the '60s are some of the filmmakers of
the '80s. It's natural,'' said Robert Friedman, the senior vice
president of worldwide advertising and publicity at Warner Bros.
   Chris Gerolmo, who wrote the screenplay for ``Mississippi
Burning,'' noted that the sheer passage of time has allowed him and
others to express their feelings about the decade.
   ``Distance is important,'' he said. ``I believe there's a lot of
thinking about that time and America in general.''
   The Vietnam War was a defining experience for many people in the
'60s, shattering th

### Step 2
The list *collectionText* created above contains the token list for each document.  In the next cell, we collect and store the tf values for each document in its own dictionary.  We then create another dictionary, *docTermWts*, where each key is a docid and each value is the corresponding term frequency dictionary for that document.
* walk through the code below reading the comments
* once you understand what it does, execute the code
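
The nested structure described above can be sketched on a toy collection. This is only an illustration: the tokens are made up, the names `toy_collection` and `toy_doc_tf` are hypothetical, and `collections.Counter` is used here as a shortcut for the explicit counting loop in the next cell.

```python
from collections import Counter

# made-up token lists standing in for collectionText
toy_collection = [
    ['fire', 'crews', 'fought', 'the', 'fire'],
    ['the', 'drug', 'trial', 'the', 'verdict'],
]

# outer dictionary: key is a docid, value is that document's tf dictionary
toy_doc_tf = {}
for toy_docid, tokens in enumerate(toy_collection, start=1):
    # Counter maps each token to the number of times it occurs in the list
    toy_doc_tf[toy_docid] = dict(Counter(tokens))

print(toy_doc_tf[1])   # {'fire': 2, 'crews': 1, 'fought': 1, 'the': 1}
```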

In [29]:


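# dictionary to store term frequencies for each document: key is docid, value is a tf dictionary
docTermWts = {}
# collection vocabulary: key is a term, value is the number of documents containing it (df);
# it must be initialized here because the loop below updates it
vocabulary = {}
docid = 0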
for doc in collectionText:
    docid += 1
    docVocab = {}
    for token in doc:
        # add token to doc vocab or increment count if already seen
        docCnt = docVocab.get(token,0)
        docVocab[token] = docCnt + 1
        # if first time token seen in doc, increment df in collection vocab
        if docCnt == 0:
            vocabulary[token] = vocabulary.get(token, 0) + 1
    # add the tf dict for this document to the frequency dict for the collection
    docTermWts[docid] = docVocab
    

### Step 3 
Next we use tf to rank terms and explore the quality of those terms with respect to their ability to represent the content of a document.  First, we collect the top n terms ranked by tf.  
* Walk through the code below, reading the comments.
* Once you understand what it does, run the next cell and answer the following questions:

1) Look at the documents printed in step 1.  Do the top 5 terms for each document describe what it is about?

2) If not, try increasing the value of n.  Do more terms do a better job of describing the document content? 

In [30]:
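def sortAndPrintTop(tfDict, numTop):
    '''Takes a term frequency dictionary and an integer specifying how many top terms to print.'''
    # build (frequency, term) tuples so that sorting orders by frequency first
    # (named ranked to avoid shadowing the built-in sorted function)
    ranked = []
    for k, v in tfDict.items():
        item = (v, k)
        ranked.append(item)
    ranked.sort(reverse=True)
    print(ranked[:numTop])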
    
#iterate on the term dictionaries for the documents in the collection
for docid in range(1,docNum+1):
    sortAndPrintTop(docTermWts[docid],5)
    

[(76, 'the'), (36, 'in'), (33, 'of'), (31, 'a'), (26, 'to')]
[(63, 'the'), (30, 'of'), (28, 'to'), (26, 'and'), (22, 'a')]
[(12, 'the'), (8, 'in'), (7, 'said'), (7, 'he'), (7, 'fire')]
[(39, 'the'), (20, 'of'), (15, 'in'), (13, 'to'), (13, 'drug')]
[(6, 'the'), (3, 'to'), (3, 'of'), (2, 'that'), (2, 'sen')]
[(45, 'the'), (28, 'in'), (22, 'of'), (21, 'and'), (20, 'to')]
[(20, 'in'), (14, 'and'), (11, 'of'), (10, 'the'), (8, 'were')]


### Step 4

Let's find out whether we can do better using idf to rank document terms.

Go back to Step 2 and modify the code to do the following:
1) Create an empty dictionary named vocabDF.
2) As the tf dictionary is created for each document, update vocabDF so that it contains key, value pairs where
each term is a key and the value is the document frequency for that term.
3) After all documents have been processed and vocabDF is complete, in the cell below, write code to iterate over vocabDF and update each value so that it stores idf rather than df.
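
To make the df and idf bookkeeping concrete, here is a minimal sketch on a toy collection. The worksheet does not fix an idf formula, so this assumes one common definition, idf = log(N / df), where N is the number of documents; the names `toy_collection`, `toy_df`, and `toy_idf` are hypothetical.

```python
import math

# made-up token lists standing in for collectionText
toy_collection = [
    ['the', 'fire', 'spread', 'the', 'fire'],
    ['the', 'drug', 'trial'],
    ['the', 'fire', 'trial'],
]

# document frequency: count each term once per document it appears in
toy_df = {}
for tokens in toy_collection:
    for term in set(tokens):
        toy_df[term] = toy_df.get(term, 0) + 1

# assumed idf definition: natural log of (number of docs / document frequency)
num_docs = len(toy_collection)
toy_idf = {term: math.log(num_docs / df) for term, df in toy_df.items()}

for term in sorted(toy_idf):
    print(term, toy_df[term], round(toy_idf[term], 3))
```

Under this definition, a term that occurs in every document (here `the`) gets an idf of 0, and rarer terms get larger values.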


In [31]:
import math
# calculate idf values for vocabulary


### Step 5
In the cell below, write code to rank the terms in each document according to idf values and print the top 15. Then answer the following question:
* How well does idf do at selecting good terms for representing what the document is about?
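
One possible shape for such a ranking is sketched below on toy data. The idf values are hand-picked for illustration rather than computed, and `toy_idf`, `toy_doc_tokens`, and `top_terms_by_idf` are hypothetical names. Ties are broken alphabetically here, which is an arbitrary choice.

```python
# hand-picked idf values for illustration only (not computed from real data)
toy_idf = {'the': 0.0, 'fire': 0.41, 'trial': 0.41, 'spread': 1.10}
toy_doc_tokens = ['the', 'fire', 'spread', 'the', 'fire']

def top_terms_by_idf(tokens, idf, num_top):
    '''Return the num_top unique terms of a document, highest idf first.'''
    unique_terms = set(tokens)
    # sort by descending idf, then alphabetically to make ties deterministic
    ranked = sorted(unique_terms, key=lambda t: (-idf[t], t))
    return ranked[:num_top]

print(top_terms_by_idf(toy_doc_tokens, toy_idf, 2))   # ['spread', 'fire']
```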

In [None]:
# rank document terms via idf and display the top 15 for each document

### Step 6
We should now have:
- docTermWts, a dictionary where each key is a docid and its value is a dictionary containing the frequency for each term occurring in that document.
- vocabDF, a dictionary where each key is a unique term in the collection vocabulary and its value is the idf for that term

In the next cell, write code that does the following:
- For each document, update its term frequency dictionary so that the value stored for each term is its tf.idf value rather than its tf value.  Note that you should not have to process the original text again, but merely iterate over the term dictionaries for each document.
- Write code to iterate over the term dictionaries for each document and print the top 15 terms.
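
The in-place update can be sketched on toy data as follows. This assumes tf.idf is simply the product tf × idf; the idf values are hand-picked for illustration, and `toy_doc_tf` and `toy_idf` are hypothetical stand-ins for docTermWts and vocabDF.

```python
# toy stand-in for docTermWts: docid -> {term: tf}
toy_doc_tf = {
    1: {'the': 2, 'fire': 2, 'spread': 1},
    2: {'the': 1, 'drug': 1, 'trial': 1},
}
# toy stand-in for vocabDF after it has been converted to idf (hand-picked values)
toy_idf = {'the': 0.0, 'fire': 0.5, 'spread': 1.0, 'drug': 1.0, 'trial': 0.5}

# overwrite each tf value with tf * idf; no need to revisit the original text
for toy_docid, term_wts in toy_doc_tf.items():
    for term in term_wts:
        term_wts[term] = term_wts[term] * toy_idf[term]

# rank each document's terms by the new weight, highest first
for toy_docid, term_wts in toy_doc_tf.items():
    ranked = sorted(term_wts.items(), key=lambda kv: kv[1], reverse=True)
    print(toy_docid, ranked[:2])
```

Overwriting values while looping over a dictionary's keys is safe because no keys are added or removed.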

Answer these questions: 
1) Which of the term rankings are best for describing document content?
2) Why do you think this is the case?  That is, why does the ranking you selected perform better than the others?

In [32]:
# calculate tf.idf weights for every term in every document

# sort each document's terms by tf.idf and print the top 15