# Program Overview
This script is part of a Hadoop map-reduce framework that reads through the OCR output of the London Times Digital Archive by Gale Cengage Learning and returns a heat map showing the relative frequency with which a given word appears on the front page. This is the map portion of the framework and is meant to be used with the accompanying reducer and final image producer.

Originally produced for a Big Data Analysis course at the Digital Humanities Summer Institute in 2016. Original idea by Pawel Pomorski. Initial coding by John Simpson.

# General Methodology
This mapper gives up the ability to walk the XML trees of the OCR files in favour of the line-by-line processing required by the streaming interface of Hadoop, which reads from standard input and writes to standard output. Reading line by line, the mapper first uses a regular expression (regex) to find text that matches the XML layout the OCR output uses for recognized words. When a line matches, the groups of characters captured by the regex are extracted and the captured word is compared, in lowercase, with the word being searched for. If they are the same, the numbers that give the grid rows where the word appeared set the range of keys to produce, and the column numbers that bound the grid location become the value attached to each of those keys. The key-value outputs are printed to standard output, where Hadoop will be able to grab them and pass them to the reducer.
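
To make the key-value step concrete, here is a minimal sketch of what happens to a single line. It reuses the mapper's regular expression, but the helper name `map_line` and the sample line are hypothetical and exist only for illustration; the real OCR files may contain coordinates in a different range.

```python
import re

# Same pattern as the mapper: group 1 is the position, group 2 is the word.
WORD_RE = re.compile(r"<wd pos=\"(\d+,\d+,\d+,\d+)\">([\w\'\",\[\]\{\}\(\)\.]+)</wd>")

def map_line(line, search_term):
    """Return the tab-separated key-value strings produced for one line (hypothetical helper)."""
    m = WORD_RE.search(line)
    if m is None or m.group(2).lower() != search_term:
        return []
    coords = [int(i) for i in m.group(1).split(',')]
    # One key per row from coords[0] to coords[2]; the value is the column range.
    return ['%s\t%s,%s' % (row, coords[1], coords[3])
            for row in range(coords[0], coords[2] + 1)]

sample = '<wd pos="10,200,12,260">War</wd>'  # made-up line, not taken from the archive
pairs = map_line(sample, "war")
for pair in pairs:
    print(pair)
```

Each printed line has a row number as the key, a tab, and then the first and last column as the value, which is the shape the reducer receives.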

## Declaring this to be a Python Script
While it is likely obvious within the Jupyter notebook environment in which this script was originally produced that this is a _Python_ script, it will not necessarily be obvious to the other programs that will use this script in a production environment. This first line tells those programs to use Python to interpret what follows. Note that the first two characters, `#!`, are usually referred to as a "shebang"; a line that starts with them names the interpreter that should be used to run the script.

In [None]:
#!/usr/bin/env python

## Python 2 to 3 compatibility

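Hadoop environments may use any Python version from 2.6 to 3+. To maximize compatibility the script calls `print` as a function, which is the only form Python 3+ accepts. This first line of code makes Python 2.6 and 2.7 treat `print` as a function as well, so the same code is parsed the same way everywhere.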

In [None]:
from __future__ import print_function
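
One consequence of treating `print` as a function is that any `%` formatting has to be applied to the string inside the parentheses. Writing `print('...') % (...)` instead would print the raw template and then fail, because `print()` returns `None`. The short sketch below shows the working form; the variable names and values are hypothetical, and on Python 2 it assumes the `__future__` import above has already been run.

```python
# Hypothetical values standing in for one row key and its column range.
row, first_col, last_col = 10, 200, 260

# Build the complete record first, then hand the finished string to print().
record = '%s\t%s,%s' % (row, first_col, last_col)
print(record)
```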

In [None]:


import sys
import re

searchTerm = "war"
searchTerm = searchTerm.lower()
regEx = re.compile(r"<wd pos=\"(\d+,\d+,\d+,\d+)\">([\w\'\",\[\]\{\}\(\)\.]+)</wd>")

# A regex is needed because the input arrives line by line on STDIN rather than
# through lxml, so the XML tree is not available.
# General process:
# 1. match line to <wd pos="(\d+,\d+,\d+,\d+)">([\w'",\[\]\(\)\{\}\.]+)</wd>
# 2. extract word group.
# 3. convert word group to lowercase.
# 4. if word group matches searchTerm then output the coordinates
# 5. next line

# input comes from STDIN (standard input), the equivalent of the keyboard or passing text
for line in sys.stdin:
    try:
        # remove leading and trailing spaces and end of line characters
        line = line.strip(' \n')
        m = regEx.search(line)
        if m is not None:
            if m.group(2).lower() == searchTerm:
                # split m.group(1) at the commas and convert each piece to an int
                coords = [int(i) for i in m.group(1).split(',')]
                # emit one key-value pair per row, items [0] to [2];
                # the value is the column range given by items [1] and [3]
                for row in range(coords[0], coords[2] + 1):
                    # the % formatting must happen inside the print() call
                    print('%s\t%s,%s' % (row, coords[1], coords[3]))

    except Exception:
        # skip any line that cannot be processed and move on to the next one
        continue
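
Because the mapper only reads lines and writes lines, its logic can be tried without a Hadoop cluster by feeding it an in-memory stream in place of standard input. The sketch below is one way to do that; the generator `run_mapper` and the sample lines are hypothetical and are not part of the original script.

```python
import io
import re

# Same pattern as the mapper above.
WORD_RE = re.compile(r"<wd pos=\"(\d+,\d+,\d+,\d+)\">([\w\'\",\[\]\{\}\(\)\.]+)</wd>")

def run_mapper(stream, search_term="war"):
    """Yield the mapper's key-value strings for every matching line in a stream."""
    for line in stream:
        m = WORD_RE.search(line.strip(' \n'))
        if m is None or m.group(2).lower() != search_term:
            continue  # not a word element, or not the word being searched for
        coords = [int(i) for i in m.group(1).split(',')]
        for row in range(coords[0], coords[2] + 1):
            yield '%s\t%s,%s' % (row, coords[1], coords[3])

# Made-up input standing in for a few lines of an OCR file.
fake_stdin = io.StringIO(
    u'<pg>1</pg>\n'
    u'<wd pos="5,40,6,90">WAR</wd>\n'
    u'<wd pos="7,10,9,50">peace</wd>\n'
)
results = list(run_mapper(fake_stdin))
print(results)
```

Only the line containing "WAR" produces output, one pair for each row it spans; the other lines are skipped silently, just as they would be in the streaming job.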