Extracting Facts from Text

Simon Bedford edited this page Feb 24, 2017 · 2 revisions

This tutorial assumes that you are working in a Jupyter notebook inside the notebooks directory.

Getting set up:

First, let's import the libraries / classes that we'll need:

import spacy
import os
import sys

spacy provides the NLP engine that we use to work with text. We need os and sys to add the project root to the Python path so that we can import our own custom modules.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from internal_displacement.interpreter import Interpreter
from internal_displacement.interpreter import Report
from internal_displacement.article import Article

Here we are importing three custom classes:

  • Interpreter provides functionality for extracting facts and other information from articles
  • Report is what we will use for creating and saving reports
  • Article is what we will use for working with articles

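Natural Language Processing is language-specific, so we need to load a pre-trained English model for text processing. The 'en' shortcut below works with the older spaCy releases this tutorial was written for; spaCy 3 and later require a full model package name such as 'en_core_web_sm':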

nlp = spacy.load('en')

Initialize the key lists of reporting units and terms that we wish to use:

Structural Terms and Units

structure_reporting_terms = ['destroyed', 'damaged', 'swept', 'collapsed', 'flooded', 'washed',
                             'inundated', 'evacuate']
structure_reporting_units = ["home", "house", "hut", "dwelling", "building", "shop", "business", "apartment",
                             "flat", "residence"]

Person-related Terms and Units

person_reporting_terms = ['displaced', 'evacuated', 'forced', 'flee', 'homeless', 'relief camp', 'sheltered',
                          'relocated', 'stranded', 'stuck', "killed", "dead", "died", "drown"]
person_reporting_units = ["families", "person", "people", "individuals", "locals", "villagers", "residents",
                          "occupants", "citizens", "households", "life"]

Disaster Identification Terms

relevant_article_terms = ['Rainstorm', 'hurricane', 'tornado', 'rain', 'storm', 'earthquake']
relevant_article_lemmas = [t.lemma_ for t in nlp(" ".join(relevant_article_terms))]

The last line of code converts the specific terms to their lemmatised form; this enables us to compare words even if they appear in slightly different forms within the text.
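
To see why comparing lemmas rather than raw words helps, here is a minimal, self-contained sketch. It does not use spaCy: TOY_LEMMAS, toy_lemma and mentions_relevant_term are hypothetical names, and the hand-written lookup table only stands in for the lemmas that spaCy would derive from its model.

```python
import re

# Hypothetical hand-written lemma table, for illustration only.
# spaCy derives lemmas from its language model rather than a fixed table like this.
TOY_LEMMAS = {
    "storms": "storm",
    "rains": "rain",
    "hurricanes": "hurricane",
    "earthquakes": "earthquake",
}


def toy_lemma(word):
    """Return the toy lemma of a word (lower-cased; unknown words map to themselves)."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


def mentions_relevant_term(text, relevant_lemmas):
    """Return True if any word in the text has a lemma in relevant_lemmas."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(toy_lemma(w) in relevant_lemmas for w in words)


relevant_terms = ['Rainstorm', 'hurricane', 'tornado', 'rain', 'storm', 'earthquake']
relevant_lemmas = {toy_lemma(t) for t in relevant_terms}

print(mentions_relevant_term("Several storms hit the coast", relevant_lemmas))   # True
print(mentions_relevant_term("The market reopened on Monday", relevant_lemmas))  # False
```

The plural "storms" only matches the listed term "storm" because both are reduced to the same lemma before comparison.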

Set the path to the data folder:

data_path = '../data'

Initialize an interpreter:

interpreter = Interpreter(nlp, person_reporting_terms, structure_reporting_terms, person_reporting_units,
                          structure_reporting_units, relevant_article_lemmas, data_path)

Now that we are all set up, we can use the interpreter to process some text.

Example 1

story = 'ALGIERS (AA) – Hundreds of homes have been destroyed in Algeria‘s southern city of Tamanrasset following several days of torrential rainfall, a local humanitarian aid official said Wednesday.  The city was pounded by rainfall from March 19 to March 24, according to Ghanom Sudani, a member of a government-appointed humanitarian aid committee.  He added that heavy rains had destroyed as many as 400 residences.  “Hundreds of families have had to leave their homes after they were inundated with water,” Sudani told The Anadolu Agency.  www.aa.com.tr/en  Last month neighbouring Tunisia experienced heavy rainfall and flooding in Jendouba City.'

article = Article(story, '', '', '', '', '', '')

Here we wrap the text in the Article class so that we can use the Interpreter functionality that expects an Article. Most of the article's attributes are initialized as empty strings.

Check which language the text is written in:

language = interpreter.check_language(article)
print(article.language)

en

Get the ISO codes for countries mentioned in the text:

countries = interpreter.extract_countries(article)
print(countries)

['DZ', 'TN']
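
The codes returned above are ISO 3166-1 alpha-2 country codes. As a rough illustration of the idea of mapping country mentions to codes, here is a simplified sketch; COUNTRY_CODES and toy_extract_countries are hypothetical names, the table covers only the three countries that appear in this tutorial, and nothing here is claimed to reflect how extract_countries is implemented.

```python
import re

# Tiny illustrative lookup of country names to ISO 3166-1 alpha-2 codes.
# A real implementation would need a complete country list and handling of name variants.
COUNTRY_CODES = {"Algeria": "DZ", "Tunisia": "TN", "Cambodia": "KH"}


def toy_extract_countries(text):
    """Return the codes of the listed countries whose names appear in the text.

    Codes come back in the order of the lookup table, not in order of appearance.
    """
    found = []
    for name, code in COUNTRY_CODES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(code)
    return found


sample = ("Hundreds of homes have been destroyed in Algeria's southern city of Tamanrasset. "
          "Last month neighbouring Tunisia experienced heavy rainfall.")
print(toy_extract_countries(sample))  # ['DZ', 'TN']
```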

Extract reports from the text:

reports = interpreter.process_article_new(article.content)
print(len(reports))

1

Visualize the article and its reports:

ALGIERS (AA) – Hundreds of homes have been destroyed in Algeria‘s southern city of Tamanrasset following several days of torrential rainfall, a local humanitarian aid official said Wednesday. The city was pounded by rainfall from March 19 to March 24, according to Ghanom Sudani, a member of a government-appointed humanitarian aid committee. He added that heavy rains had destroyed as many as 400 residences. “Hundreds of families have had to leave their homes after they were inundated with water,” Sudani told The Anadolu Agency. www.aa.com.tr/en Last month neighbouring Tunisia experienced heavy rainfall and flooding in Jendouba City.

for r in reports:
    r.display()

Location: ['Tamanrasset'] DateTime: ['March 19'] EventTerm: destroy SubjectTerm: residence Quantity: 400
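
To build some intuition for how a reporting term, a reporting unit and a quantity combine into a report like the one above, here is a deliberately simplified, regex-based sketch. It is an assumption-laden toy: TOY_TERMS, TOY_UNITS and toy_find_reports are hypothetical names, it does no lemmatisation or parsing, and it does not reproduce the logic of Interpreter.

```python
import re

# Small subsets of the tutorial's terms and units, in the inflected forms used by the sample text.
TOY_TERMS = {"destroyed", "inundated", "evacuated"}
TOY_UNITS = {"residences", "houses", "families"}


def toy_find_reports(text):
    """Pair a reporting term with '<number> <unit>' when both occur in the same sentence."""
    reports = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        terms = [w for w in re.findall(r"[a-z]+", lowered) if w in TOY_TERMS]
        if not terms:
            continue
        # A number (optionally with thousands separators) directly followed by a word.
        for quantity, unit in re.findall(r"(\d+(?:,\d{3})*)\s+([a-z]+)", lowered):
            if unit in TOY_UNITS:
                reports.append({"term": terms[0], "unit": unit, "quantity": quantity})
    return reports


sample = ("He added that heavy rains had destroyed as many as 400 residences. "
          "The city was pounded by rainfall from March 19 to March 24.")
print(toy_find_reports(sample))
# [{'term': 'destroyed', 'unit': 'residences', 'quantity': '400'}]
```

The dates in the second sentence are ignored because that sentence contains no reporting term and the numbers are not followed by a reporting unit.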

Example 2

Here we have an example of text that is poorly formatted:

story = "Due to high intensity of rainfall, Mekong River has swell and caused flooding to the surrounding areas. More flooding is expected if the rain continues. The provinces affected so far includes: Kampong Cham, Kratie, Stung Treng and Kandal12 out of Cambodia's 25 cities and provinces are suffering from floods caused by monsoon rains and Mekong River floodingIMPACT45 dead16,000 families were affected and evacuated3,080 houses inundated44,069 hectares of rice field were inundated5,617 hectares of secondary crops were inundatedRESPONSEThe local authorities provided response to the affected communities. More impact assessment is still conducted by provincial and national authorities.The government also prepared 200 units of heavy equipment in Phnom Penh and the provinces of Takeo, Svay Rieng, Oddar Meanchey and Battambang to divert water or mitigate overflows from inundated homes and farmland"

Prior to trying to extract reports we need to clean it up:

story = interpreter.cleanup(story)

"Due to high intensity of rainfall, Mekong River has swell and caused flooding to the surrounding areas. More flooding is expected if the rain continues. The provinces affected so far includes: Kampong Cham, Kratie, Stung Treng and Kandal. 12 out of Cambodia's 25 cities and provinces are suffering from floods caused by monsoon rains and Mekong River flooding. IMPACT. 45 dead. 16,000 families were affected and evacuated. 3,080 houses inundated. 44,069 hectares of rice field were inundated. 5,617 hectares of secondary crops were inundated. RESPONSE. The local authorities provided response to the affected communities. More impact assessment is still conducted by provincial and national authorities.The government also prepared 200 units of heavy equipment in Phnom Penh and the provinces of Takeo, Svay Rieng, Oddar Meanchey and Battambang to divert water or mitigate overflows from inundated homes and farmland"

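The kind of repair shown in the cleaned-up text can be approximated with a few regular-expression rules. The sketch below is only an approximation under stated assumptions: FUSED_BOUNDARY and toy_cleanup are hypothetical names, and the actual rules used by interpreter.cleanup may well differ.

```python
import re

# Boundaries where two sentences or headings appear to have been fused together:
#   lowercase letter followed by an uppercase letter or digit  ("Kandal12", "floodingIMPACT")
#   uppercase letter followed by a digit                        ("IMPACT45")
#   uppercase letter followed by a capitalised word             ("RESPONSEThe")
FUSED_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z0-9])"
    r"|(?<=[A-Z])(?=[0-9])"
    r"|(?<=[A-Z])(?=[A-Z][a-z])"
)


def toy_cleanup(text):
    """Insert '. ' at each fused boundary."""
    return FUSED_BOUNDARY.sub(". ", text)


print(toy_cleanup("Treng and Kandal12 out of 25 provinces"))
# Treng and Kandal. 12 out of 25 provinces
print(toy_cleanup("floodingIMPACT45 dead16,000 families were evacuatedRESPONSEThe local authorities"))
# flooding. IMPACT. 45 dead. 16,000 families were evacuated. RESPONSE. The local authorities
```

Rules this simple also split legitimate mixed-case names such as "McDonald", so they are only suitable as a starting point.
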
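Wrap the cleaned text in an Article, as in Example 1, and get the ISO codes for countries mentioned in the text:

article = Article(story, '', '', '', '', '', '')
countries = interpreter.extract_countries(article)
print(countries)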

['KH']

Extract reports from the text:

reports = interpreter.process_article_new(article.content)
print(len(reports))

2

for r in reports:
    r.display()

Location: ['Cambodia'] DateTime: [] EventTerm: inundate SubjectTerm: house Quantity: 3,080

Location: ['Cambodia'] DateTime: [] EventTerm: affect SubjectTerm: family Quantity: 16,000
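
The quantities above are displayed with thousands separators. If they are needed as numbers for further analysis, a small helper like the following can convert them. This is a minimal sketch that assumes the quantity is available as text in the displayed form; parse_quantity is a hypothetical helper, not part of the Report class.

```python
def parse_quantity(text):
    """Convert a quantity string such as '3,080' to an int by dropping thousands separators."""
    return int(text.replace(",", ""))


quantities = [parse_quantity(q) for q in ["3,080", "16,000", "400"]]
print(quantities)  # [3080, 16000, 400]
```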