# Inverted Document Search

### Introduction

In this tutorial, we explore how to find documents that contain a specific word. 

One way to do that is to simply scan through every document and see if the word is contained in that document. This will work well if you are only going to do this type of search once or twice, but what if this kind of search is something that you need to do often?

We need a better way to do this type of search. The idea is to first build a data structure that supports it. Once that structure exists, we can quickly find which documents contain the word, rather than having to scan through all the documents every time.

### An Example

Let's take a look at an example, to make sure we know what we are talking about. Suppose you have the following three documents with their content:

<p style ="color:red"> Document 1:</p> The big cat ate the small dog.

<p style ="color:red">Document 2:</p> My cat has black eyes.

<p style ="color:red">Document 3:</p> For lunch we ate black beans.



If we were to search for "cat", both document 1 and document 2 would be returned, whereas if we searched for "lunch", only document 3 would be returned.
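As a baseline, the scan-every-document approach from the introduction can be sketched in a few lines of plain Python. This is only an illustration: the `documents` dictionary and the `scan_for_word` helper are names made up for this sketch, and the tokenization simply mirrors the regular expression used by the mapper later in this tutorial.

```python
import re

# The three example documents from above, keyed by document number.
documents = {
    1: "The big cat ate the small dog.",
    2: "My cat has black eyes.",
    3: "For lunch we ate black beans.",
}

def scan_for_word(word, docs):
    """Return the ids of all documents containing `word`, scanning each one."""
    word = word.lower()
    matches = []
    for doc_id, text in sorted(docs.items()):
        # Lowercase and split on non-word characters, dropping empty strings.
        tokens = [t for t in re.split(r'\W+', text.lower()) if t]
        if word in tokens:
            matches.append(doc_id)
    return matches

print(scan_for_word("cat", documents))    # [1, 2]
print(scan_for_word("lunch", documents))  # [3]
```

Every call re-reads every document, which is exactly the cost the rest of this tutorial tries to avoid.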

Now, what we want to do is first create a data structure that will help us do this kind of search more easily.

Basically, given some keywords, we want to ask our algorithm which documents match our query. To do that, we build an inverted index:

<table>
<tr><td style = "color:blue">Words</td><td style="color:red" colspan=3>Documents</td></tr>
<tr><td>the</td><td></td><td>doc 1</td></tr>
<tr><td>big</td><td></td><td>doc 1</td></tr>
<tr><td>cat</td><td></td><td>doc 1</td><td>doc 2</td></tr>
<tr><td>ate</td><td></td><td>doc 1</td><td>doc 3</td></tr>
<tr><td>small</td><td></td><td>doc 1</td></tr>
<tr><td>dog</td><td></td><td>doc 1</td></tr>
<tr><td>my</td><td></td><td>doc 2</td></tr>
<tr><td>has</td><td></td><td>doc 2</td></tr>
<tr><td>black</td><td></td><td>doc 2</td><td>doc 3</td></tr>
<tr><td>eyes</td><td></td><td>doc 2</td></tr>
<tr><td>for</td><td></td><td>doc 3</td></tr>
<tr><td>lunch</td><td></td><td>doc 3</td></tr>
<tr><td>we</td><td></td><td>doc 3</td></tr>
<tr><td>beans</td><td></td><td>doc 3</td></tr>
</table>




After making this inverted index, it is easy to find which documents contain a specific word without having to resort to scanning all the documents. This is why I call this kind of search an inverted document search: instead of searching each document to see which words it contains, we look up a word to see which documents contain it.
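To make the idea concrete before moving to MapReduce, here is a minimal in-memory sketch that builds the same index as the table above with a plain dictionary. The function name `build_inverted_index` and the `"doc 1"` style labels are choices made for this sketch, not part of any library.

```python
import re

documents = {
    "doc 1": "The big cat ate the small dog.",
    "doc 2": "My cat has black eyes.",
    "doc 3": "For lunch we ate black beans.",
}

def build_inverted_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.split(r'\W+', text.lower()):
            if word:  # re.split can yield empty strings at the edges
                index.setdefault(word, set()).add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index(documents)
print(index["cat"])             # ['doc 1', 'doc 2']
print(index["black"])           # ['doc 2', 'doc 3']
print(index.get("coffee", []))  # []
```

Once the index is built, each query is a single dictionary lookup instead of a pass over all the documents.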

### The Algorithm

If you have been following along with the other tutorials in this series, you should be fairly confident in how to program a MapReduce job to create the inverted index. Each mapper will scan through a document and find the words in that document. For example, if the first mapper were in charge of reading the first document in our example above, then the output from the mapper would be:

<p style ="color:red"> the -- Document 1 -- 2</p>  
<p style ="color:red"> big -- Document 1 -- 1</p>
<p style ="color:red"> cat -- Document 1 -- 1</p>
<p style ="color:red"> ate -- Document 1 -- 1</p>
<p style ="color:red"> small -- Document 1 -- 1</p>
<p style ="color:red"> dog -- Document 1 -- 1</p>

The 2 following "the" appears because "the" occurs twice in document 1.
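The counts listed above can be reproduced locally with a short sketch. It uses `collections.Counter` from the standard library, which is an alternative to the hand-rolled dictionary in my mapper further down; `count_words` is just a name chosen for this example.

```python
import re
from collections import Counter

def count_words(text):
    """Count how often each word occurs in `text`, ignoring case and punctuation."""
    return Counter(w for w in re.split(r'\W+', text.lower()) if w)

counts = count_words("The big cat ate the small dog.")
for word, n in counts.items():
    print("{} -- Document 1 -- {}".format(word, n))
```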

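The reducer then combines the mapper outputs, so that each word ends up on a single line together with every document it appears in and the corresponding counts.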

As always, I recommend you try to program this up yourself, but as a source of reference, here's my code:

Mapper:


```python
import os
import re
import sys

# Hadoop streaming exposes job settings as environment variables.
file_name = os.environ['mapreduce_map_input_file']      # the file this mapper is currently reading
file_start = os.environ['mapreduce_map_input_start']    # not used here, but may be useful for other tasks
file_length = os.environ['mapreduce_map_input_length']  # not used here either

# Count how many times each word occurs in this mapper's input.
myDict = {}
for line in sys.stdin:
    # Lowercase the line and split on runs of non-word characters, dropping empty strings.
    words = [i for i in re.split(r'\W+', line.lower()) if i]
    for word in words:
        myDict[word] = myDict.get(word, 0) + 1

# Emit one "word file_name count" line per distinct word.
for word in myDict:
    print("%s %s %d" % (word, file_name, myDict[word]))
```

And here's the reducer:

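```python
import sys

# data[word][file_name] holds how often the word occurs in that file.
data = {}
for line in sys.stdin:
    word, file_name, num = line.split()
    counts = data.setdefault(word, {})
    # Add rather than overwrite, in case a file was split across several mappers.
    counts[file_name] = counts.get(file_name, 0) + int(num)

# Emit one line per word: "word file_1 count_1 file_2 count_2 ...".
for word in data:
    out = [word]
    for file_name in data[word]:
        out.append(file_name)
        out.append(str(data[word][file_name]))
    print(" ".join(out))
```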
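If you want to check the reducer logic without a Hadoop cluster, one option is to run the same grouping step on a handful of hand-written lines. The sketch below assumes the mapper's `word file_name count` line format; the function `reduce_lines` and the file names `doc1.txt`, `doc2.txt`, `doc3.txt` are hypothetical and exist only for this dry run.

```python
def reduce_lines(lines):
    """Group mapper output lines ("word file count") by word."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, file_name, num = parts
        per_file = index.setdefault(word, {})
        per_file[file_name] = per_file.get(file_name, 0) + int(num)
    return index

# Hypothetical mapper output (the file names are made up for this dry run).
mapper_output = [
    "cat doc1.txt 1",
    "the doc1.txt 2",
    "cat doc2.txt 1",
    "black doc2.txt 1",
    "black doc3.txt 1",
]

index = reduce_lines(mapper_output)
for word in sorted(index):
    postings = " ".join("{} {}".format(f, n) for f, n in sorted(index[word].items()))
    print(word + " " + postings)
# black doc2.txt 1 doc3.txt 1
# cat doc1.txt 1 doc2.txt 1
# the doc1.txt 2
```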

### Term Search

Now let us write a MapReduce job to search for the documents that contain a certain word or, even better, any of several words.

First we have to copy the output of the previous step into a new folder, because the search job reads the inverted index as its input.

After that, it is fairly straightforward to search for documents that contain at least one of the given words. I will leave it up to you to work out how to search for documents that contain more than one of the words, and what that implies.

In fact, this MapReduce job does not need any reducers. I believe this is the first time we have run a MapReduce job without a reducer, so the way we invoke Hadoop streaming will change.

Here is the mapper:

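```python
import sys

if len(sys.argv) < 2:
    # Report the problem on stderr so it does not end up in the job output.
    sys.stderr.write("Need to supply a list of words to search for\n")
    sys.exit(1)
list_of_words = sys.argv[1:]

for line in sys.stdin:
    words = line.split()
    # Each index line starts with the word itself; skip blank lines.
    if words and words[0] in list_of_words:
        # line still ends with its newline, so print() leaves an empty line after each match
        print(line)
```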


We do not need a reducer for this map reduce job, and we specify this by calling hadoop in the following way:

```
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar -D mapred.reduce.tasks=0 -file term_finder.py -mapper "python term_finder.py is cat coffee" -input /user/hduser/terms/* -output /user/hduser/out_tests2
```
Finally, the output for this example on a small data set looks like this:


```
is hdfs://localhost:54310/user/hduser/words/file0.txt 1 hdfs://localhost:54310/user/hduser/words/file5.txt 1 hdfs://localhost:54310/user/hduser/words/file4.txt 1	
	
cat hdfs://localhost:54310/user/hduser/words/file4.txt 1	
```
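As a possible starting point for the multi-word question left open earlier, the result lines can be parsed back into Python structures and combined. The sketch below uses the two sample lines shown above; `parse_index_line` is a helper name invented here, and intersecting the file sets is only one way to read "contains more than one word". Note that it only considers words that were actually found, so "coffee", which has no line in the output, is ignored.

```python
def parse_index_line(line):
    """Split "word file count file count ..." into (word, {file: count})."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    postings = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest) - 1, 2)}
    return word, postings

# The two result lines from the sample output above.
results = [
    "is hdfs://localhost:54310/user/hduser/words/file0.txt 1 "
    "hdfs://localhost:54310/user/hduser/words/file5.txt 1 "
    "hdfs://localhost:54310/user/hduser/words/file4.txt 1",
    "cat hdfs://localhost:54310/user/hduser/words/file4.txt 1",
]

found = dict(parse_index_line(line) for line in results)

# Files that contain every word that was found (an "AND" query).
common = set.intersection(*(set(files) for files in found.values()))
print(sorted(common))
# ['hdfs://localhost:54310/user/hduser/words/file4.txt']
```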