In [1]:
import re
import geonamescache
from unidecode import unidecode
import pandas as pd

gc = geonamescache.GeonamesCache()

## Headlines

The first thing we'll do is read in the headlines we'll be examining. We clean each one by stripping leading and trailing whitespace and transliterating Unicode to ASCII, which simplifies the regex matching later.

In [2]:
with open("data/headlines.txt") as headline_file:
    # Strip leading/trailing whitespace (including the newline) and transliterate to ASCII
    headlines = [unidecode(line.strip()) for line in headline_file]
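
To make the cleaning step concrete, here is a minimal sketch of what it does to a single line. It uses the standard library's `unicodedata` as a rough stand-in for `unidecode`, so that it runs without extra packages. This is an assumption made for illustration: it only drops accents and other combining marks, whereas `unidecode` performs a fuller transliteration. The helper name `clean_line` and the sample strings are hypothetical.

```python
import unicodedata

def clean_line(line):
    """Strip surrounding whitespace and drop non-ASCII marks (rough stand-in for unidecode)."""
    stripped = line.strip()
    # NFKD splits accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", stripped)
    # Encoding to ASCII with errors="ignore" discards the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(clean_line("  Zika reported in São Paulo\n"))
print(clean_line("Outbreak in Medellín \n"))
```

Normalizing to ASCII up front means the location regexes only ever have to deal with one spelling of each name.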

## Regular Expressions

We need to determine which country and/or city, if any, each headline mentions. To do that, we compile one regular expression per country name and one per city name, then search each headline with them later.

In [3]:
countries = gc.get_countries_by_names().keys()
# Escape each name so characters like "." match literally, then pad it with
# word-boundary metacharacters to avoid matching inside longer words
country_regex_strgen = (r"\b{}\b".format(re.escape(unidecode(country))) for country in countries)
country_regexes = [re.compile(country, re.ASCII) for country in country_regex_strgen]

cities = gc.get_cities()
# Same as the country regex strings: escape, then add word-boundary metacharacters
city_regex_strgen = (r"\b{}\b".format(re.escape(unidecode(city['name']))) for city in cities.values())
city_regexes = [re.compile(city, re.ASCII) for city in city_regex_strgen]
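
The effect of the word boundaries, and of treating a name as literal text, is easiest to see on a small case. The sketch below uses only the standard `re` module. The names and headlines are made-up samples, not taken from `data/headlines.txt`.

```python
import re

# Without word boundaries, "Niger" also matches inside "Nigeria"
unbounded = re.compile("Niger")
bounded = re.compile(r"\b{}\b".format(re.escape("Niger")))

print(bool(unbounded.search("Zika outbreak spreads in Nigeria")))  # matches inside the longer word
print(bool(bounded.search("Zika outbreak spreads in Nigeria")))    # no match
print(bool(bounded.search("Hepatitis cases rise in Niger")))       # matches the whole word

# Without escaping, the "." in a name is a wildcard that matches any character
unescaped = re.compile(r"\b{}\b".format("St. Louis"))
escaped = re.compile(r"\b{}\b".format(re.escape("St. Louis")))

print(bool(unescaped.search("StX Louis")))  # the wildcard lets this through
print(bool(escaped.search("StX Louis")))    # only a literal "." is accepted
```

Note that the matching is case-sensitive, so a name only matches when the headline capitalizes it the same way.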

### Regex: Longest Match

When one location name contains another, several regexes may match the same headline. To pick the most specific location, we keep only the longest matching string.

For example, given the headline "Zika confirmed in Miami Beach", the regexes for "Miami" and "Miami Beach" both match, but since "Miami Beach" is longer, we associate the headline with Miami Beach.

In [4]:
def find_longest_match(regex_list, line):
    longest_match = ""
    for regex in regex_list:
        result = regex.search(line)
        if result:
            if len(result.group(0)) > len(longest_match):
                longest_match = result.group(0)
    # Return None if no regex matched (longest_match is still empty)
    return longest_match if longest_match else None
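
As a quick check of the longest-match rule, the sketch below runs the Miami example through a compact, self-contained variant of the same idea. The function name `longest_match` and the two-regex list are hypothetical, chosen so the snippet runs on its own without the full geonamescache lists.

```python
import re

def longest_match(regex_list, line):
    """Return the longest substring matched by any regex in regex_list, or None."""
    results = (regex.search(line) for regex in regex_list)
    matched_text = (result.group(0) for result in results if result)
    # default="" covers the case where nothing matched; "or None" converts it
    return max(matched_text, key=len, default="") or None

sample_regexes = [
    re.compile(r"\b{}\b".format(re.escape(name)))
    for name in ("Miami", "Miami Beach")
]

print(longest_match(sample_regexes, "Zika confirmed in Miami Beach"))
print(longest_match(sample_regexes, "Zika confirmed in Miami"))
print(longest_match(sample_regexes, "No location mentioned here"))
```

The first call prints `Miami Beach`, the second `Miami`, and the third `None`, since neither regex matches.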

In [5]:
data = dict(headline=[], country=[], city=[])
for headline in headlines:
    data['headline'].append(headline)
    data['country'].append(find_longest_match(country_regexes, headline))
    data['city'].append(find_longest_match(city_regexes, headline))

df = pd.DataFrame.from_dict(data)