So far, through a combination of approaches, I have managed to tag 79% of the articles as relevant.

- Fail cases are either correct rejections (the article really is irrelevant) or use phrasing too complex for the current rules
- Can some of these approaches be combined into an overarching function?
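
One way to approach the second question is to register each check under a name and report which ones fire, so that overlap between the approaches becomes visible. The sketch below is illustrative only: `CHECKS`, `matching_checks`, `is_relevant` and the two toy regex checks are hypothetical stand-ins, not the checks used in this notebook.

```python
import re

# Hypothetical stand-ins for the notebook's check functions; each takes
# the article text and returns True or False.
def has_homeless_phrase(article):
    return re.search(r"left homeless|families homeless", article.lower()) is not None

def has_evacuation_phrase(article):
    return re.search(r"people evacuated|evacuated people", article.lower()) is not None

# Ordered registry: cheap string checks would go first, parser-based ones last.
CHECKS = [
    ("homeless_phrase", has_homeless_phrase),
    ("evacuation_phrase", has_evacuation_phrase),
]

def matching_checks(article, checks=CHECKS):
    """Return the names of every check that fires for this article."""
    return [name for name, check in checks if check(article)]

def is_relevant(article, checks=CHECKS):
    """An article counts as relevant if at least one check fires."""
    return any(check(article) for _, check in checks)
```

Keeping the names alongside the functions would make it possible to count how many articles each approach contributes on its own, which may help decide which checks are redundant.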

In [751]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
# None disables column truncation; negative values are deprecated in pandas >= 1.0
pd.set_option('display.max_colwidth', None)
import re
from collections import Counter
import itertools

In [457]:
# The 'en' shortcut was removed in spaCy 3; load the English model by its package name
nlp = spacy.load('en_core_web_sm')

In [757]:
# Data source is 290 downloaded articles from the Training Data
df = pd.read_csv('https://s3-us-west-1.amazonaws.com/simon.bedford/d4d/article_contents.csv')
df = df.fillna('')

In [459]:
# Specified reporting terms from challenge description
reporting_terms = [
    'displaced', 'evacuated', 'forced to flee', 'homeless', 'in relief camp',
    'sheltered', 'relocated', 'destroyed housing', 'partially destroyed housing',
    'uninhabitable housing'
]

In [460]:
# Specified reporting units from challenge description
reporting_units = {
    'people': ['people', 'persons', 'individuals', 'children', 'inhabitants', 'residents', 'migrants'],
    'households': ['families', 'households', 'houses', 'homes']
}

In [752]:
direct_phrases = []
nouns = 'people|persons|families|individuals|children|inhabitants|residents|migrants|villagers'
nouns = nouns.split("|")
verbs = 'evacuated|displaced|fled|forced to flee|relocated|forced to leave'
verbs = verbs.split("|")

for n, v in list(itertools.product(nouns, verbs)):
    direct_phrases.append(n + " " + v)
    direct_phrases.append(v + " " + n)

In [754]:
nouns = 'houses|homes'
nouns = nouns.split("|")
verbs = 'destroyed|damaged|flooded|inundated|lost|collapsed|submerged|washed away|affected|demolished'
verbs = verbs.split("|")

for n, v in list(itertools.product(nouns, verbs)):
    direct_phrases.append(n + " " + v)
    direct_phrases.append(v + " " + n)

In [463]:
# Regexes for housing units and the impacts that apply to them
housing_units = re.compile('houses|homes')
housing_impacts = re.compile("destroyed|damaged|flooded|inundated|lost|collapsed|submerged|washed away|affected|demolished")

In [521]:
# Regexes for people units and the impacts that apply to them
people_units = re.compile('people|persons|families|individuals|children|inhabitants|residents|migrants')
people_impacts = re.compile('evacuated|displaced|fled|forced to flee|relocated|forced to leave')

In [642]:
units_regex = re.compile('households|houses|homes|people|persons|families|individuals|children|inhabitants|residents|migrants|villagers')
impacts_regex = re.compile('destroyed|damaged|flooded|inundated|lost|collapsed|submerged|washed away|affected|demolished|evacuated|displaced|fled|forced to flee|relocated|forced to leave')
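
These patterns match anywhere inside a word, so `fled` also matches inside `baffled` and `homes` inside `homestead`. A possible refinement, sketched here with a hypothetical helper `compile_terms`, wraps the escaped alternatives in `\b` word boundaries. Its effect on the results further down is untested.

```python
import re

def compile_terms(terms):
    """Compile words/phrases into one whole-word regex (hypothetical helper)."""
    # Longest alternatives first, so multi-word phrases are tried before
    # any shorter term that could overlap with them.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")

loose = re.compile("fled|homes")
strict = compile_terms(["fled", "homes", "forced to flee"])

sample = "the baffled owner of a homestead"
print(bool(loose.search(sample)))    # True: matches inside 'baffled'
print(bool(strict.search(sample)))   # False: no whole-word match
print(bool(strict.search("thousands fled their homes")))  # True
```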

In [546]:
def clean_string(s):
    # Replace non-breaking spaces with ordinary spaces so adjacent words are not fused
    return s.replace('\xa0', ' ')

In [771]:
# This tests for phrases that are direct combinations of reporting units and terms, for example
# evacuated people / people evacuated
# destroyed houses / houses destroyed
# It is possible that some of these combinations could occur in irrelevant documents

# Compile once rather than on every call
direct_phrases_regex = re.compile("|".join(direct_phrases))

def check_initial_combinations(article):
    # Always return a bool instead of falling through to None
    return bool(re.search(direct_phrases_regex, article.lower()))

In [475]:
# Noun phrase with housing units and housing impacts, examples:
# at least 60 homes were destroyed across three districts
# mark gunning said 116 houses in wye river and separation creek had been destroyed
# the landslide, which covered about 2 sq km (0.8 sq miles), damaged at least 11 homes, xinhua reported.
# as more than 8,000 people were evacuated from their homes,
# some 2,500 people were evacuated from hard-hit grimma, near leipzig.

def get_noun_phrase_sentences(article, units_regex, impacts_regex):
    sentences = []
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for np in d.noun_chunks:
            if re.search(units_regex, np.text) and re.search(impacts_regex, np.root.head.text):
                sentences.append(str(s))
    return sentences

def check_noun_phrases(article, units_regex, impacts_regex):
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for np in d.noun_chunks:
            if re.search(units_regex, np.text) and re.search(impacts_regex, np.root.head.text):
                return True

In [680]:
# Combinations of relevant units and impacts, examples:
# provide assistance to impacted and displaced families
# accommodation had been provided for about 600 displaced residents.
# it is expected that displaced families will need relief supplies

def get_simple_combinations_sentences(article, units_regex, impacts_regex):
    sentences = []
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for np in d.noun_chunks:
            if re.search(units_regex, np.text) and re.search(impacts_regex, np.text):
                sentences.append(str(s))
    return sentences

def check_simple_combinations(article, units_regex, impacts_regex):
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for np in d.noun_chunks:
            if re.search(units_regex, np.text) and re.search(impacts_regex, np.text):
                return True
            if re.search(units_regex, np.text) and re.search(impacts_regex, " ".join([l.text for l in np.rights])):
                return True

In [662]:
# Units as passive subjects, examples:
# Hundreds of homes have been destroyed in Algeria‘s southern city of Tamanrasset...
# confirming that 15 families have been evacuated to the town hall as a precaution against collapse.

def get_passive_subject_sentences(article, units_regex, impacts_regex):
    sentences = []
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for token in d:
            if re.search(impacts_regex, str(token)):
                children = [t for t in token.children]
                for c in children:
                    if c.dep_ in ('nsubjpass', 'nsubj'):
                        obj = " ".join([str(a) for a in c.subtree])
                        if re.search(units_regex, obj):
                            sentences.append(str(s))
    return sentences

def check_passive_subject(article, units_regex, impacts_regex):
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for token in d:
            if re.search(impacts_regex, str(token)):
                children = [t for t in token.children]
                for c in children:
                    if c.dep_ in ('nsubjpass', 'nsubj'):
                        obj = " ".join([str(a) for a in c.subtree])
                        if re.search(units_regex, obj):
                            return True

In [563]:
# Examples
# displaced residents said they couldn't believe how quickly the situation escalated
# around 20,000 people had to be evacuated from their homes

def get_test1_sentences(article, units_regex, impacts_regex):
    sentences = []
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for token in d:
            if re.search(impacts_regex, str(token)):
                ancestors = [t for t in token.ancestors]
                for a in ancestors:
                    if a.dep_ == 'ROOT':
                        children = [c for c in a.children]
                        for c in children:
                            if c.dep_ == 'nsubj' and re.search(units_regex, str(c)):
                                sentences.append(str(s))
    return sentences

def test1(article, units_regex, impacts_regex):
    doc = nlp(u"{}".format(article.lower()))
    for s in doc.sents:
        d = nlp(u"{}".format(s))
        for token in d:
            if re.search(impacts_regex, str(token)):
                ancestors = [t for t in token.ancestors]
                for a in ancestors:
                    if a.dep_ == 'ROOT':
                        children = [c for c in a.children]
                        for c in children:
                            if c.dep_ == 'nsubj' and re.search(units_regex, str(c)):
                                return True
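
Each `get_*_sentences` / check pair above repeats the same loop and differs only in whether it collects matches or returns early. One possible way to remove the duplication is to write the matching logic once as a generator and derive both behaviours from it. This is a sketch under assumptions: `iter_matches`, `get_sentences`, `check`, `mentions_displacement` and the naive punctuation-based sentence splitter are all hypothetical. In this notebook the sentences would come from spaCy's `doc.sents` and the predicate from the dependency tests.

```python
import re

def iter_matches(article, predicate):
    """Yield each sentence for which predicate(sentence) is true."""
    # Naive splitter, used only to keep the sketch self-contained.
    for sentence in re.split(r"(?<=[.!?])\s+", article.lower()):
        if predicate(sentence):
            yield sentence

def get_sentences(article, predicate):
    """Collect every matching sentence (for inspection)."""
    return list(iter_matches(article, predicate))

def check(article, predicate):
    """True as soon as one sentence matches; stops scanning early."""
    return any(True for _ in iter_matches(article, predicate))

# Toy predicate standing in for a unit/impact test.
def mentions_displacement(sentence):
    return bool(re.search(r"\bfamilies\b", sentence)
                and re.search(r"\bdisplaced\b", sentence))
```

With this shape, a new pattern only needs a new predicate, and the collecting and boolean variants cannot drift apart.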

In [737]:
# Simple tests based upon combinations of words
# Unlikely to occur elsewhere
# Examples 'left homeless'
simple_phrases = [
    'left homeless',
    'families homeless',
    'people homeless',
    'residents homeless',
    'evacuate their homes',
    'left their homes',
    'fled to relief camps',
    'flee from their homes',
    'people evacuated',
    'houses damaged',
    'houses submerged'
]
simple_phrases_regex = re.compile("|".join(simple_phrases))
def check_simple_phrases(article, simple_phrases_regex=simple_phrases_regex):
    if re.search(simple_phrases_regex, article.lower()):
        return True

In [758]:
def check_relevance(article):
    article = clean_string(article)
    if check_initial_combinations(article):
        return True
    if check_simple_phrases(article):
        return True
    if check_simple_combinations(article, units_regex, impacts_regex):
        return True
    if check_noun_phrases(article, units_regex, impacts_regex):
        return True
    if check_passive_subject(article, units_regex, impacts_regex):
        return True
    if test1(article, units_regex, impacts_regex):
        return True
    return False

In [759]:
df['is_relevant'] = df['content'].apply(check_relevance)

In [770]:
total = len(df)
relevant = (df['is_relevant'] == 1).sum()
print("{:.0f}% identified as relevant".format(relevant / total * 100))

79% identified as relevant


#### Fail Cases

In [769]:
for i, row in (df[df['is_relevant'] == 0].head(10)).iterrows():
    print(row['content'])
    print("\n")

UN Office for the Coordination of Humanitarian Affairs:  To learn more about OCHA's activities, please visit http://unocha.org/.


When you boat your way somewhere, even if just for a day out of the city,...


The Disaster Management Information System (DMIS) is a web-based working tool made accessible only to Red Cross and Red Crescent staff working in National Societies, delegations and Geneva headquarters. It is a system from which users will be able to access:  The DMIS project started in February 2001 as a follow up to Strategy 2010 and in response to the need for informed decisions, speed and efficient operational readiness. DMIS is the result of a major effort made by the Federation in addressing the complexity of information exchange in the humanitarian community and to support an efficient disaster preparedness and response for the whole Federation's Red Cross and Crescent network at a global level. DMIS continues in the same vein with the successor Strategy 2020.




The Disa