#Key Word Generator

In the last [tutorial](http://nbviewer.ipython.org/github/Chuphay/hadoop/blob/master/tutorials/Inverted%20Document%20Search.ipynb) we looked at how to create an inverted index data structure, which let us quickly find which documents contained certain words.

We will use the output of that MapReduce job to help find the key words in a specific document. If you have not already run that algorithm, you should do so now.

That program produces a text file with lines similar to the following:

```
this hdfs://localhost:54310/user/hduser/words/file0.txt 1 hdfs://localhost:54310/user/hduser/words/file4.txt 1	
with hdfs://localhost:54310/user/hduser/words/file0.txt 1	
is hdfs://localhost:54310/user/hduser/words/file0.txt 2 hdfs://localhost:54310/user/hduser/words/file5.txt 1 
```

I produced these lines using a very limited data set.

To reiterate what we are looking at: each line starts with a word, followed by each file in which that word appeared and the number of times it appeared in that file.
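
As a rough illustration of that layout, here is a minimal Python 3 sketch that splits one such line into the word and its per-file counts. The function name `parse_index_line` is hypothetical (it is not part of the previous tutorial), and the sketch assumes that file paths contain no spaces.

```python
def parse_index_line(line):
    """Split one inverted-index line into (word, {file_path: count})."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    # After the word, the fields alternate: path, count, path, count, ...
    postings = {path: int(count) for path, count in zip(rest[0::2], rest[1::2])}
    return word, postings


sample = ("is hdfs://localhost:54310/user/hduser/words/file0.txt 2 "
          "hdfs://localhost:54310/user/hduser/words/file5.txt 1")
word, postings = parse_index_line(sample)
print(word, sum(postings.values()))  # is 3
```

Summing the counts gives the total number of times the word appears across all files, which is the quantity the rest of this tutorial relies on.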

We will now use this data to find the key words in a document.


###TF-IDF

To find the key words in a document we are generally interested in two things:

1.) How many times does a particular word show up in the given document? (If a word appears more often, it is probably more important.)

2.) How many times does a particular word show up in all the documents scanned so far?

The second condition is used to weight the importance of a given word. For example, if the word "the" appears several times in a document, is it important? Probably not, because it also appeared in many other documents.

TF-IDF is the way we balance these two ideas. TF-IDF stands for Term Frequency - Inverse Document Frequency. Term frequency refers to the number of times a particular word shows up in the document. Document frequency, as used in this tutorial, refers to how many times that word shows up in all documents divided by the number of words in all documents. (Many references instead define document frequency as the fraction of documents that contain the word; we use the word-count version here.)

Let's take a look at a small example to see how this is done:

```
doc1: "This is a sentence about a tree."

doc2: "This is a sentence about a frog."

```

To calculate the document frequency of "a", we see that "a" appears 4 times across the two documents and that there are 14 words in total. Therefore the document frequency of "a" is 4/14.

The inverse document frequency simply inverts the document frequency, so the inverse document frequency of "a" is 14/4.
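
In symbols, these two definitions can be written as follows. The names $n_w$ (the number of times word $w$ appears across all documents) and $N$ (the total number of words in all documents) are notation introduced here for illustration only:

$$
\mathrm{DF}(w) = \frac{n_w}{N}, \qquad \mathrm{IDF}(w) = \frac{1}{\mathrm{DF}(w)} = \frac{N}{n_w}
$$

For "a" in the example, $n_w = 4$ and $N = 14$, so $\mathrm{DF} = 4/14$ and $\mathrm{IDF} = 14/4 = 3.5$.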

To calculate the TF-IDF scores for, say, doc1, we simply iterate over every word in it, find its frequency within the document, and multiply that by the inverse document frequency for that word:

```
This: 1*(14/2) = 7
is: 1*(14/2) = 7
a: 2*(14/4) = 7
sentence: 1*(14/2) = 7
about: 1*(14/2) = 7
tree: 1*(14/1) = 14
```

What we see is that the TF-IDF score for "tree" is much higher than that of any other word. Notice also that even though "a" appears twice in doc1, it appears so often across all of the texts that its TF-IDF score is no higher than that of a word that appears only once.
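
The hand calculation can be checked with a short Python 3 sketch. The helper names `tokenize` and `simple_tfidf` are made up for this illustration, and the tokenizer assumes the same lowercase-and-split-on-non-word-characters rule that the mapper later in this tutorial uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def simple_tfidf(target, corpus):
    """Score each word of `target` as count_in_target * (total_words / count_in_corpus)."""
    corpus_counts = Counter(w for text in corpus for w in tokenize(text))
    total_words = sum(corpus_counts.values())
    term_counts = Counter(tokenize(target))
    return {w: n * total_words / corpus_counts[w] for w, n in term_counts.items()}


doc1 = "This is a sentence about a tree."
doc2 = "This is a sentence about a frog."
scores = simple_tfidf(doc1, [doc1, doc2])
print(scores["tree"], scores["a"])  # 14.0 7.0
```

Running the same function on doc2 would single out "frog" in the same way, since it is the only word that doc1 does not share.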


While this version of TF-IDF will certainly work, what one will often use instead is the log of the inverse document frequency, so that instead of
```
tree: 1*(14/1) = 14
```
we would have
```
tree: 1*log(14/1) = 2.6390573296152584
```
We do this because the frequency of words follows [Zipf's Law](http://en.wikipedia.org/wiki/Zipf%27s_law): a few words are extremely common while most are rare, so the raw ratio spans a very wide range, and the log compresses it.
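
To see the effect of the log on the toy example, here is a small Python 3 sketch. The function `log_tfidf` and its parameter names are hypothetical and simply restate the formula above using the natural log.

```python
import math


def log_tfidf(count_in_doc, count_in_corpus, total_words):
    """count_in_doc * ln(total_words / count_in_corpus)."""
    return count_in_doc * math.log(total_words / count_in_corpus)


# Raw scores from the example: the rare word vs. "this" (14 words in total).
raw_rare, raw_this = 1 * 14 / 1, 1 * 14 / 2
log_rare, log_this = log_tfidf(1, 1, 14), log_tfidf(1, 2, 14)

print(raw_rare / raw_this)  # 2.0
print(log_rare / log_this)  # closer to 1 here: the log narrows the gap
```

The ranking of the rare word against "this" is unchanged, but the distance between the two scores shrinks, which matters more on a real corpus where raw ratios can differ by several orders of magnitude.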

###Implementation

Now that we have seen how to calculate the TF-IDF score, let's get into the implementation of the MapReduce program. As always, I recommend you program this yourself, but you can use my code as a guide.

Here's my mapper:

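```python
import sys
import re

if len(sys.argv) != 2:
    sys.stderr.write("Proper usage: python tfidf_mapper.py textfile.txt\n")
    sys.exit(1)

# Count how many times each word occurs in the target document.
with open(sys.argv[1]) as f:
    words = [w for w in re.split(r'\W+', f.read().lower()) if w]
myWords = {}
for word in words:
    try:
        myWords[word]['num'] += 1
    except KeyError:
        # 'num': count in this document; 'tf': count across all documents (set below).
        myWords[word] = {'num': 1, 'tf': 0}

# Each stdin line comes from the inverted index: word file count file count ...
total = 0
for line in sys.stdin:
    internal_words = line.split()
    if not internal_words:
        continue  # skip blank lines
    # After the word, the fields alternate file, count; keep only the counts.
    internal_numbers = [int(n) for n in internal_words[2::2]]
    l = sum(internal_numbers)
    total += l
    try:
        myWords[internal_words[0]]['tf'] = l
    except KeyError:
        pass  # this word is not in the target document

# Emit: word, count in document, document length,
# count across all documents, total count seen by this mapper.
l = len(words)
for i in myWords:
    print("%s %d %d %d %d" % (i, myWords[i]['num'], l, myWords[i]['tf'], total))
```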


And here is my reducer:

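```python
import sys
import numpy as np

myDict = {}

# Each stdin line is one mapper record:
# word, count in document, document length, count across all documents, mapper total.
for line in sys.stdin:
    fields = line.split()
    if len(fields) != 5:
        continue  # skip blank or malformed lines
    word = fields[0]
    num, tot_num, tf, tf_tot = [int(x) for x in fields[1:]]

    try:
        # Several mappers may each see part of the index; add up their partial counts.
        myDict[word]['tf'] += tf
        myDict[word]['tf_tot'] += tf_tot
    except KeyError:
        myDict[word] = {'num': num, 'tot_num': tot_num, 'tf': tf, 'tf_tot': tf_tot}

tfidf_score = []
words = []
for i in myDict:
    words.append(i)
    num = float(myDict[i]['num'])
    bigN = float(myDict[i]['tf_tot'])
    smallN = float(myDict[i]['tf'])
    # The +1 avoids dividing by zero for words that are missing from the index.
    myLog = np.log(bigN/(smallN+1))
    tfidf_score.append(num*myLog)

# Print the words from highest to lowest score.
out = sorted(zip(tfidf_score, words), reverse=True)
for i in out:
    print("%s %s" % (i[1], i[0]))
```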
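
One detail of the reducer worth isolating is the `+1` in the denominator of the log. The Python 3 sketch below restates that scoring step as a standalone function; the name `smoothed_tfidf` and its parameter names are hypothetical, and `math.log` is used in place of `np.log` only to keep the example free of dependencies.

```python
import math


def smoothed_tfidf(count_in_doc, count_in_corpus, corpus_total):
    """count_in_doc * ln(corpus_total / (count_in_corpus + 1)), as in the reducer."""
    return count_in_doc * math.log(corpus_total / (count_in_corpus + 1.0))


# A word that occurs once in the document but is missing from the index
# (count_in_corpus == 0) still gets a finite score because of the +1.
unseen = smoothed_tfidf(1, 0, 14)

# A word that occurs twice in the document and 4 times in a 14-word corpus.
common = smoothed_tfidf(2, 4, 14)

print(unseen > common)  # True
```

Without the `+1`, a word that appears in the target document but not in the index would cause a division by zero; with it, such a word is treated as very rare and ends up near the top of the ranking.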


###Project Gutenberg

[Project Gutenberg](https://www.gutenberg.org/) is a website with a ton of free e-books. We downloaded a large selection of books from that site in order to run the key word generator algorithm on them.

First we used the inverted index MapReduce job from the last tutorial to build the inverted index. After that we ran the TF-IDF MapReduce program presented in this tutorial on "Moby Dick". Here are the ten words that the key word generator algorithm found:

