# Regular Expressions exercises

In this exercise you will extract structured data out of an OCR transcript of a historical report. 

The first step is to read in the file! Replace the path given here so that your
file opens correctly.

In [None]:
fh = open('/PATH/TO/YOUR/FILE/morbidity.txt', 'r', encoding='utf-8')
data = fh.read()
fh.close()
print(data)

Before we begin, here is a list of the bits of regular expression that will prove useful.

Flags
-----
Regular expressions usually take some set of *flags* that alter how the expression is treated. The two most useful ones to know about are:

    re.I   (Case-insensitive: don't pay attention to upper- or lowercase)
    re.M   (The text is a multi-line file; make ^ and $ work on every line.)
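
To see what these two flags change, here is a small sketch on a made-up string (the `sample` text is an assumption for illustration, not taken from the report):

```python
import re

# Hypothetical sample text, not from the report.
sample = "Total cases\ntotal deaths\nGrand TOTAL"

# Without flags: case-sensitive, and ^ only matches at the very start
# of the whole string.
plain = re.findall('^total', sample)

# re.I ignores case; re.M lets ^ match at the start of every line.
# Flags are combined with the | operator.
flagged = re.findall('^total', sample, re.I | re.M)

print(plain)
print(flagged)
```

The third line never matches, because its `TOTAL` is not at the start of the line.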
    

Characters, metacharacters, and patterns
----------------------------------------

A regular expression is a *pattern* specified using *characters* and *metacharacters*. A character is, well, any old thing that can appear in a text file. A metacharacter is a character that doesn't get treated as itself, but rather signals to the regular expression engine that you want to express something more complicated. Typical metacharacters are:

    .       (Match any character)
    [,+;]   (Match the character if it is any of the things inside the [])
    [^abc]  (Match anything *except* a, b, or c)
    (abc)   (Make abc a group: apply any of the following to the whole thing.)
    +       (Match the previous character or group one or more times)
    {3}     (Match the previous character or group exactly three times)
    {1,4}   (Match the previous character or group between 1 and 4 times)
    *       (Match the previous character or group zero or more times)
    ?       (If the previous character or group isn't there, treat the pattern as a match anyway)
    \       (The thing that follows is a metacharacter (if normally not) or a character (if normally meta))
    
So this means that:

* `(abc)+` will match `abc` or `abcabc` but not `abac`.
* `[abc]+` will match `a` or `b` or `abac` or really any combination of a, b, and c.
* If you want to match anything at all, you match `.*`. 
* If you want to match anything except the empty string, you match `.+`.
* If you want to match a period, you match `\.`; for a plus sign, `\+`.
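
These claims can be checked directly. The sketch below uses `re.fullmatch()`, which succeeds only when the whole string fits the pattern; the test strings are just illustrations.

```python
import re

# re.fullmatch() returns a match object if the ENTIRE string fits
# the pattern, and None otherwise.
checks = [
    ('(abc)+', 'abcabc'),   # the group repeated twice
    ('(abc)+', 'abac'),     # not a repetition of abc
    ('[abc]+', 'abac'),     # any mix of a, b, and c
    ('.*', ''),             # .* accepts the empty string
    ('.+', ''),             # .+ needs at least one character
    (r'\.', '.'),           # an escaped period is a literal period
    (r'\+', '+'),           # an escaped plus is a literal plus
]

results = [re.fullmatch(pattern, text) is not None for pattern, text in checks]
for (pattern, text), ok in zip(checks, results):
    print(repr(pattern), repr(text), ok)
```

The `r'...'` prefix marks a Python raw string, which keeps the backslash intact instead of letting Python interpret it first.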

There are a few more metacharacters that you should know about:

    \w   (match a "word" character, which is usually A-Z, a-z, 0-9, and _)
    \d   (match a "digit" character, which is generally 0-9)
    \s   (match any sort of "space" character, including space, tab, carriage return, etc.)
    
    ^   The beginning (of the string, or of the line with re.M)
    $   The end (of the string, or of the line with re.M)
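
Here is a short sketch of these character classes and anchors at work. The `line` string is a made-up example, not a line from the report.

```python
import re

# Hypothetical line of text, not from the report.
line = "Week 7:\t1,204 cases"

words = re.findall(r'\w+', line)     # runs of word characters
digits = re.findall(r'\d+', line)    # runs of digits
spaces = re.findall(r'\s', line)     # individual whitespace characters

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_week = re.search(r'^Week', line) is not None
ends_with_digit = re.search(r'\d$', line) is not None

print(words, digits, spaces, starts_with_week, ends_with_digit)
```

Notice that `\d+` splits `1,204` into two pieces, because the comma is not a digit.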



Problem 1.
----------
Get rid of everything that comes before the word STATISTICAL. You could use a regular
expression to do this, but you don't have to! You might also use the string
'find' method. If you have a string called 'message' with the contents
'Hello how are you today?', then you can look for the word 'you' like this:

    index = message.find('you')
    
and it will return the index in the string where the 'you' starts. If there
is no 'you' in the message, find() will return -1. You can use this index
to trim the string, if you like, using slicing.

    everything_after = message[index:]  # returns 'you today?'
    
You might want to print the first 100 characters or so of `data` when you are done,
to make sure you've got it right.
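
One thing to watch for is the -1 case. The sketch below wraps the idea in a small helper; `trim_before` is a hypothetical name used only for this illustration, and it works on the toy `message` rather than on the report.

```python
message = 'Hello how are you today?'

def trim_before(text, marker):
    """Return text starting at marker, or text unchanged if marker is absent."""
    index = text.find(marker)
    if index == -1:
        # Slicing with text[-1:] would silently keep only the last character.
        return text
    return text[index:]

found = trim_before(message, 'you')
missing = trim_before(message, 'them')

print(found[:100])
print(missing[:100])
```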

In [None]:
'YOUR CODE HERE'

Problem 2.
----------

Look for the header lines that have the date of the report (which was
21 Feb 1908). You'll need a regular expression for this part, and you'll 
have some dirty data! You should find a way to get the whole line that has 
this date (that is to say, the whole line and not just the date!). There
are 4 such lines.

In [None]:
import re

for item in re.finditer('YOUR EXPRESSION HERE', data, re.M):
    print(item.group(0))

Problem 3.
----------

Now that you have found all these lines, you can remove them.
HINT: You should use `re.sub()` for this.
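
As a rough illustration of how `re.sub()` removes whole lines, here is a sketch on a made-up string (the text and the pattern are assumptions for this example, not the answer to the problem):

```python
import re

# Hypothetical text, not from the report.
text = "keep this\nPAGE 12 of the draft\nkeep that\nPage 13 of the draft\n"

# Replace every whole line that starts with "page" (any case) with nothing.
cleaned = re.sub(r'^page \d+.*$', '', text, flags=re.I | re.M)

print(cleaned)
```

Pass the flags by keyword: the fourth positional argument of `re.sub()` is `count`, not `flags`. Also note that the line break after each removed line stays behind, so an empty line remains where the text was.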

In [None]:
data = 'YOUR CODE HERE (hint: use re.sub() where you used print() above.)'

Problem 4.
----------

You can also remove these lines about "Received out of regular order"
in much the same way. When you've done this, print out the data to
make sure that they have disappeared.

In [None]:
data = re.sub('^.*Received out of regular order.*$', '', data, flags=re.M)

# Make sure they are all gone
index = data.find('Received out of')
print(index)

Problem 5.
----------

Now the fun part begins. First let's look for the cities, which as you can see are
set apart with an em-dash. Here is an em-dash that you can copy and paste
if you don't know how to type one:
 —
See if you can get a list of all the place names
in the remaining text. Use `re.finditer()` to do this.
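
If `group(1)` is unfamiliar, the sketch below shows what it returns. It uses a made-up string with a colon as the separator, so the pattern is not the one you need for the report.

```python
import re

# Hypothetical text, not from the report; a colon stands in for the separator.
text = "apples: 3 crates\npears: 12 crates\n"

names = []
for item in re.finditer(r'^(\w+):', text, re.M):
    # group(0) is the whole match ("apples:"); group(1) is only the
    # part inside the first pair of parentheses ("apples").
    names.append(item.group(1))

print(names)
```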

In [None]:
count = 0

for item in re.finditer('CITY NAME BEFORE EM-DASH', data, re.M):
    print(item.group(1))
    count += 1
    
print(count)

Problem 6.
----------

Neat. But now do you notice that some of the lines have two place names?
We have to decide how to deal with that, and also how to deal with the
periods at the end, when they are there. Let's get the place names into
a format that looks like 

    Arizona

or

    Iowa, Cedar Rapids

Hint: for each match, you can use `re.sub()` to get rid of any periods,
and then you can use `re.search` to see if there are any — characters left.
If there are, you can split the matched text again and join it up with
the comma.

When you're done, print each place name to see if it looks right.
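
The steps in the hint can be sketched on a single string. The `raw` value here is a made-up example of what one match might look like, not text copied from the report; you still need your own expression to find the matches.

```python
import re

# Hypothetical matched text, not copied from the report.
raw = 'Iowa\u2014Cedar Rapids.'   # \u2014 is the em-dash character

placename = re.sub(r'\.', '', raw)        # drop any periods
if re.search('\u2014', placename):        # is an em-dash still there?
    parts = placename.split('\u2014')     # split into the two names
    placename = ', '.join(parts)          # join them with a comma

print(placename)
```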

In [None]:
for place in re.finditer('YOUR REGEXP FROM ABOVE', data, re.M):
    placename = 'THIS IS A GOOD PLACE TO USE re.sub() ON place'
    if 'MAYBE USE re.search() HERE TO SEE IF THERE IS STILL AN EM-DASH':
        'IF THERE IS, SPLIT IT UP AND JOIN IT WITH A COMMA'
    print(placename)

Problem 7.
----------

Now we need a good regular expression that will pick out numbers, even when
they have commas in the middle. Try it out. You should be able to find 511
numbers.

* Hint: even if the number has a comma, it should end with a digit!
* Hint: we should be able to find single-digit numbers too!

In [None]:
all_numbers = re.findall('YOUR EXPRESSION HERE', data)
for num in all_numbers:
    print(num)
print(len(all_numbers))

Intermission.
-------------

For the rest of this to make sense, we are going to need to get each of
the records on one line. We can do that by splitting the text at line breaks
(we write a line break as `\n`) and gluing consecutive non-blank lines back
together, so that only the blank lines between records still separate them.
I've done this for you, but take a good look and try to understand what is
going on here before you move on.

In [None]:
records = []
datalines = re.split("\n", data)
current_record = ''
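for line in datalines:
    # A blank (whitespace-only) line marks the end of a record.
    if re.search(r'^\s*$', line):
        if current_record != '':
            records.append(current_record)
            current_record = ''
    else:
        current_record += line
# Keep the last record if the data does not end with a blank line.
if current_record != '':
    records.append(current_record)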

for record in records:
    print(record)

Problem 8.
----------
So far we have been working with a giant blob (well, string) of data; now we
are working with a list of records. Now we will be operating our regular expressions
on each record in the list, rather than on the entire file.

We are going to try to pick out the total number of deaths from each record, and
add them up at the end to get an overall death count.

Look again at the file, and see how these totals
are expressed. You should find 53 entries.

* Hint: look at California.
* Hint: look at New Jersey.

In [None]:
entries_found = 0
total_deaths = 0

for record in records:
    for match in re.finditer('YOUR REGULAR EXPRESSION HERE', record):
        total = 'A GROUP OF YOUR MATCH'
        # Now you need to remove any commas in the number so that
        # you can convert it to an int and add it to the totals.
        total = re.sub('YOUR CODE HERE')
        # Well done, carry on.
        total_deaths += int(total)
        entries_found += 1
print("Found %d of 53 records" % entries_found)
print("All deaths in total: %d" % total_deaths)