This is a summary of the current 'best' approaches for extracting the relevant terms as requested by IDMC:
- Reporting term
- Reporting unit
- Displacement figure
- Location & Country

For reference we have:
1. Set of ~130 pre-labelled examples provided by IDMC (~118 in English)
2. Test set of excerpts that have been hand-labelled

Note: in many cases the correct labels are subjective, given the vagaries of language. Several 'terms' or 'units' can be mentioned in the same sentence or excerpt, and the locations in the text are often ambiguous.

In general, in order to choose the most likely or relevant terms from the text, we are using the following heuristics:

1. Things affecting 'People' take precedence over things affecting Structures
2. Where clarification is given in brackets, it is ignored (e.g. "1,000 families (3,000 people)" should give 1,000 families)
3. Reports of destroyed structures take precedence over reports of damaged structures
4. If there is still ambiguity, the first possible report is chosen
5. Where multiple possible countries are mentioned:
    - The country mentioned most often is chosen, or
    - Failing that, the first country mentioned is chosen
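
The report-selection heuristics (1, 3 and 4) can be expressed as a single sort key. The sketch below is illustrative only: `CandidateReport`, `report_priority` and `choose_report` are hypothetical names and are not part of the `internal_displacement` package, whose actual selection logic is not shown in this notebook.

```python
from dataclasses import dataclass


@dataclass
class CandidateReport:
    """Hypothetical container for one candidate report found in an excerpt."""
    position: int   # character offset of the report in the text
    subject: str    # "people" or "structures"
    term: str       # e.g. "displaced", "destroyed", "damaged"
    quantity: int
    unit: str


def report_priority(report):
    """Sort key: smaller tuples are preferred."""
    # Heuristic 1: reports about people beat reports about structures.
    affects_structures = report.subject != "people"
    # Heuristic 3: among structure reports, "destroyed" beats anything else.
    not_destroyed = report.subject == "structures" and report.term != "destroyed"
    # Heuristic 4: if still tied, the earliest report in the text wins.
    return (affects_structures, not_destroyed, report.position)


def choose_report(candidates):
    """Return the preferred candidate, or None if there are no candidates."""
    return min(candidates, key=report_priority) if candidates else None


candidates = [
    CandidateReport(0, "structures", "damaged", 50, "houses"),
    CandidateReport(40, "structures", "destroyed", 20, "houses"),
    CandidateReport(80, "people", "displaced", 1000, "families"),
]
best = choose_report(candidates)
```

Encoding the precedence as a tuple keeps the rules in one place, so adding or reordering a heuristic means changing one line rather than a chain of conditionals.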

### Reporting Unit: People or Households

Overall approach: combine the output of the rules-based approach and a machine learning model.

Machine learning model: Multinomial Naive Bayes

- Optimized on the ~130 training examples provided
- Grid search for the best alpha parameter

Text Pre-Processing:
    
1. Remove named entities (People, Dates, Locations, Organizations, Numbers)
2. Remove text in brackets
3. Clean unusual characters etc.
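
Steps 2 and 3 can be approximated with regular expressions. This is a minimal sketch, not the code behind `helper.cleanup`, which is not shown here; the function names and the set of characters treated as "unusual" are assumptions. Step 1 (removing named entities) is left out because it depends on a loaded spaCy model.

```python
import re

# Text in round or square brackets, plus any whitespace just before it.
BRACKETED = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")
# Assumption: anything outside this small set counts as an "unusual" character.
UNUSUAL = re.compile(r"[^\w\s.,;:%$'-]")


def remove_bracketed_text(text):
    """Drop bracketed clarifications such as '(3,000 people)'."""
    return BRACKETED.sub("", text)


def clean_characters(text):
    """Replace unusual characters with spaces and collapse repeated whitespace."""
    text = UNUSUAL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text):
    """Apply bracket removal, then character clean-up."""
    return clean_characters(remove_bracketed_text(text))


example = preprocess("1,000 families (3,000 people) were displaced")
```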

Features:
    
Words extracted using count vectorizer; stop words removed

In [4]:
%load_ext autoreload
%autoreload 2
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
import pandas as pd
import numpy as np
import spacy
import gensim
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import sklearn.pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from internal_displacement.interpreter import Interpreter
from internal_displacement.extracted_report import convert_quantity
from internal_displacement.excerpt_helper import Helper
from internal_displacement.excerpt_helper import MeanEmbeddingVectorizer


In [53]:
from sklearn.svm import SVC

In [2]:
nlp = spacy.load('en')
person_reporting_terms = [
    'displaced', 'evacuated', 'forced', 'flee', 'homeless', 'relief camp',
    'sheltered', 'relocated', 'stranded', 'stuck', 'accommodated']

structure_reporting_terms = [
    'destroyed', 'damaged', 'swept', 'collapsed',
    'flooded', 'washed', 'inundated', 'evacuate'
]

person_reporting_units = ["families", "person", "people", "individuals", "locals", "villagers", "residents",
                            "occupants", "citizens", "households"]

structure_reporting_units = ["home", "house", "hut", "dwelling", "building"]

relevant_article_terms = ['Rainstorm', 'hurricane',
                          'tornado', 'rain', 'storm', 'earthquake']
relevant_article_lemmas = [t.lemma_ for t in nlp(
    " ".join(relevant_article_terms))]

data_path = '../data'

In [5]:
interpreter = Interpreter(nlp, person_reporting_terms, structure_reporting_terms, person_reporting_units,
                          structure_reporting_units, relevant_article_lemmas, data_path,
                          model_path='../internal_displacement/classifiers/default_model.pkl',
                          encoder_path='../internal_displacement/classifiers/default_encoder.pkl')

In [6]:
# Initialize the helper
helper = Helper(nlp, '../internal_displacement/classifiers/unit_vectorizer.pkl', 
               '../internal_displacement/classifiers/unit_model.pkl',
               '../internal_displacement/classifiers/term_vectorizer.pkl',
               '../internal_displacement/classifiers/term_model.pkl',
               '../internal_displacement/classifiers/terem_svc.pkl')

In [7]:
w2v = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin', binary=True)

In [8]:
# Load the training data, remove non-English excerpts
df = pd.read_csv("../data/IDMC_fully_labelled.csv", encoding='latin1')
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [9]:
# Clean excerpts
df['cleaned_text'] = df['Excerpt'].apply(lambda x: helper.cleanup(x))

In [19]:
df['reports'] = df['cleaned_text'].apply(lambda x: interpreter.process_article_new(x))
df['most_likely_report'] = df['reports'].apply(lambda x: helper.get_report(x))

## Reporting Unit

#### Use grid search to optimize the Multinomial Naive Bayes alpha parameter

In [10]:
clf = MultinomialNB()

In [11]:
# Words as features
vectorizer = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(1,1))

In [25]:
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['Reporting unit'], test_size=0.2, random_state=42)

In [26]:
X_train_features = vectorizer.fit_transform(X_train)

In [27]:
parameters = dict(alpha=[0, 0.1, 1, 10])
cv = GridSearchCV(clf, param_grid=parameters)
cv.fit(X_train_features, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [0, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [28]:
X_test_features = vectorizer.transform(X_test)
y_predictions = cv.predict(X_test_features)

In [29]:
print(sklearn.metrics.classification_report( y_test, y_predictions ))

             precision    recall  f1-score   support

 Households       1.00      0.50      0.67         6
     People       0.86      1.00      0.92        18

avg / total       0.89      0.88      0.86        24



#### Results of only using the rules

In [31]:
reports = [interpreter.process_article_new(x) for x in X_test]
top_reports = [helper.get_report(x) for x in reports]

In [32]:
predicted_unit = [r[2] for r in top_reports]

In [33]:
print(sklearn.metrics.classification_report( y_test, predicted_unit ))

             precision    recall  f1-score   support

                  0.00      0.00      0.00         0
 Households       1.00      0.33      0.50         6
     People       1.00      0.39      0.56        18

avg / total       1.00      0.38      0.55        24





#### Results of using a combination

In [35]:
combined_unit = []

for p1, p2 in zip(y_predictions, predicted_unit):
    combined_unit.append(helper.combine_predictions(p1, p2))

In [36]:
print(sklearn.metrics.classification_report( y_test, combined_unit ))

             precision    recall  f1-score   support

 Households       1.00      0.50      0.67         6
     People       0.86      1.00      0.92        18

avg / total       0.89      0.88      0.86        24
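
The implementation of `helper.combine_predictions` is not shown in this notebook. Given that the rules were precise but frequently returned no unit on this test split, one plausible combination is to trust the rules whenever they return a label and fall back to the classifier otherwise. The function below is a sketch under that assumption, not the helper's actual code.

```python
def combine_predictions(classifier_label, rules_label):
    """Prefer the rules-based label when the rules produced one.

    Assumption: the rules return an empty string (or None) when nothing matched.
    """
    if rules_label:
        return rules_label
    return classifier_label


combined = [
    combine_predictions(p1, p2)
    for p1, p2 in zip(["People", "People", "Households"], ["", "Households", None])
]
```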



## Reporting Term

The reporting term can be one of 10 categories (e.g. Displaced, Evacuated, In Relief Camp).

The approach is similar to that for the reporting unit, i.e. combining the output of hand-crafted rules and a classifier.

In this case the classifier is even harder to train, as many of the categories have few or no examples.

#### Train Multinomial NB

In [86]:
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['Reporting term'], test_size=0.2, random_state=42)

In [87]:
clf = MultinomialNB()
vectorizer_1 = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(2,2))

In [88]:
X_train_features = vectorizer_1.fit_transform(X_train)

In [89]:
parameters = dict(alpha=[0, 0.1, 1, 10])
cv1 = GridSearchCV(clf, param_grid=parameters)
cv1.fit(X_train_features, y_train)



GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [0, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [90]:
X_test_features_1 = vectorizer_1.transform(X_test)
y_predictions = cv1.predict(X_test_features_1)

In [91]:
print(sklearn.metrics.classification_report( y_test, y_predictions ))

                   precision    recall  f1-score   support

Destroyed Housing       1.00      0.50      0.67         4
        Displaced       0.67      0.80      0.73         5
        Evacuated       0.67      0.91      0.77        11
   Forced to Flee       0.00      0.00      0.00         2
        Relocated       1.00      0.50      0.67         2

      avg / total       0.69      0.71      0.67        24





#### Train Linear SVC Using Word2Vec Features

In [92]:
vectorizer_2 = MeanEmbeddingVectorizer(w2v)
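
`MeanEmbeddingVectorizer` is imported from `internal_displacement.excerpt_helper` and its source is not reproduced here. The usual idea behind a mean-embedding vectorizer is to represent each text by the average of the vectors of its known words. The class below is a dependency-free sketch of that idea; the class name, the whitespace tokenization and the zero vector for texts with no known words are all assumptions.

```python
class SimpleMeanEmbeddingVectorizer:
    """Hypothetical stand-in: averages word vectors for each text."""

    def __init__(self, word_vectors, dim):
        self.word_vectors = word_vectors  # mapping: token -> sequence of floats
        self.dim = dim                    # length of each word vector

    def transform(self, texts):
        rows = []
        for text in texts:
            # Keep only tokens that have a vector (naive whitespace tokenization).
            vectors = [self.word_vectors[t] for t in text.split()
                       if t in self.word_vectors]
            if vectors:
                # Average component-wise across the word vectors.
                rows.append([sum(col) / len(vectors) for col in zip(*vectors)])
            else:
                # Assumption: texts with no known words map to the zero vector.
                rows.append([0.0] * self.dim)
        return rows


toy_vectors = {"people": [1.0, 0.0], "fled": [0.0, 1.0]}
features = SimpleMeanEmbeddingVectorizer(toy_vectors, dim=2).transform(
    ["people fled", "unknown words"])
```

With real embeddings, each row would have 300 components (the dimensionality of the GoogleNews vectors loaded above), giving a dense feature matrix suitable for the SVC.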

In [93]:
clf = SVC(probability=True, kernel='linear')

In [94]:
X_train_features = vectorizer_2.transform(X_train)

In [95]:
parameters = dict(C=[0.01, 0.1, 1, 10])
cv2 = GridSearchCV(clf, param_grid=parameters)
cv2.fit(X_train_features, y_train)



GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.01, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [96]:
X_test_features_2 = vectorizer_2.transform(X_test)
y_predictions_2 = cv2.predict(X_test_features_2)

In [97]:
print(sklearn.metrics.classification_report( y_test, y_predictions_2 ))

                   precision    recall  f1-score   support

Destroyed Housing       0.67      0.50      0.57         4
        Displaced       1.00      0.60      0.75         5
        Evacuated       0.61      1.00      0.76        11
   Forced to Flee       0.00      0.00      0.00         2
        Relocated       0.00      0.00      0.00         2

      avg / total       0.60      0.67      0.60        24





#### Combine the two classifier probabilities

In [98]:
p1 = cv1.predict_proba(X_test_features_1)
p2 = cv2.predict_proba(X_test_features_2)

In [101]:
p_combined = helper.combine_probabilities(p1, p2, cv1.best_estimator_.classes_)
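
How `helper.combine_probabilities` merges the two probability matrices is not shown. A common choice is to average the class probabilities row by row and pick the class with the highest mean. The sketch below assumes that approach, and also assumes both classifiers list their classes in the same order.

```python
def combine_probabilities(p1, p2, classes):
    """Average two probability matrices and return the top class per row.

    Assumption: column i of p1 and p2 both refer to classes[i].
    """
    labels = []
    for row1, row2 in zip(p1, p2):
        averaged = [(a + b) / 2 for a, b in zip(row1, row2)]
        best = max(range(len(classes)), key=averaged.__getitem__)
        labels.append(classes[best])
    return labels


classes = ["Displaced", "Evacuated"]
p1_toy = [[0.9, 0.1], [0.4, 0.6]]
p2_toy = [[0.3, 0.7], [0.2, 0.8]]
combined_labels = combine_probabilities(p1_toy, p2_toy, classes)
```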

In [102]:
print(sklearn.metrics.classification_report( y_test, p_combined ))

                   precision    recall  f1-score   support

Destroyed Housing       1.00      0.50      0.67         4
        Displaced       0.71      1.00      0.83         5
        Evacuated       0.79      1.00      0.88        11
   Forced to Flee       0.00      0.00      0.00         2
        Relocated       1.00      0.50      0.67         2

      avg / total       0.76      0.79      0.74        24





## Location & Country Extraction

The country extraction approach is currently based on spaCy entity extraction and hand-crafted rules for translating extracted locations into 3-digit ISO country codes.

If more than one country is identified in the fragment, the rules for choosing the country are:

1. Country that occurs the highest number of times or,
2. Country that appears first in the text
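
These two rules can be sketched with `collections.Counter`. The function below is a simplified illustration that works on a plain list of country codes in order of appearance and treats rule 2 as a tie-break for rule 1; the real `helper.choose_country` takes the output of `interpreter.extract_countries` and returns a richer structure, so the signature here is an assumption.

```python
from collections import Counter


def choose_country(codes):
    """Pick the most frequent code; ties go to the code that appears first."""
    if not codes:
        return None
    counts = Counter(codes)
    highest = max(counts.values())
    # Walk the codes in text order so the earliest of the tied codes wins.
    for code in codes:
        if counts[code] == highest:
            return code


most_common = choose_country(["KEN", "SOM", "SOM"])
first_on_tie = choose_country(["KEN", "SOM"])
```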

#### Test on the provided pre-labelled training data

Note: in some cases a country code is given even though no location is mentioned in the excerpt. We ignore these cases for testing purposes, since the information likely comes from a different source that will not be included in the specific NLP test.

In [105]:
df = pd.read_excel('../data/IDMC_fully_labelled.csv.xlsx')

In [106]:
# Remove rows where no location is mentioned
df = df[df['Mentions'] == 'Y'].copy()

In [107]:
def get_country_code(text):
    '''Extract country code from ; separated values'''
    return text.split(';')[1]

In [108]:
df['true_code'] = df['Location (Country)'].apply(lambda x: get_country_code(x))

In [109]:
df['locations'] = df['Excerpt'].apply(lambda x: interpreter.extract_countries(interpreter.cleanup(x)))

In [110]:
df['top_location'] = df['locations'].apply(lambda x: helper.choose_country(x))

In [111]:
df['extracted_code'] = df['top_location'].apply(lambda x: x[1])

In [112]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support( df['true_code'], df['extracted_code'], average='weighted')



In [113]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.8187748783724015, Recall: 0.7105263157894737, F1: 0.7461506329927383, Support: None


## Extracted Quantity

Extracted quantity is obtained by trying the following approaches:
    
1. Using the quantity extracted in the most likely report
2. Using a second rule based approach that also takes into account the predicted Unit (People / Households) and attempts to find the nearest number to that unit in the text
3. If neither of the above approaches yields a result, then returns the largest number found in the text
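
Approaches 2 and 3 can be illustrated as follows. This is a sketch under stated assumptions rather than the code behind `helper.get_number`: distance is measured in characters between the start of a number and the start of a unit word, only digit-based numbers are recognised, and `extract_quantity` is a hypothetical name.

```python
import re

# Integers with optional thousands separators, e.g. "300" or "1,200".
NUMBER = re.compile(r"\d+(?:,\d{3})*")


def extract_quantity(text, unit_words):
    """Return the number nearest to a unit word, else the largest number."""
    numbers = [(m.start(), int(m.group().replace(",", "")))
               for m in NUMBER.finditer(text)]
    if not numbers:
        return None
    unit_positions = [m.start()
                      for word in unit_words
                      for m in re.finditer(r"\b" + re.escape(word), text, re.IGNORECASE)]
    if unit_positions:
        # Approach 2: pick the number closest to any mention of the unit.
        _, value = min(numbers,
                       key=lambda n: min(abs(n[0] - u) for u in unit_positions))
        return value
    # Approach 3: no unit mentioned, so fall back to the largest number.
    return max(value for _, value in numbers)


nearest = extract_quantity("In 2017, floods left 1,200 families homeless",
                           ["families", "households"])
largest = extract_quantity("Some 300 fled and 4,500 remain", ["families"])
```

Anchoring on the unit word helps avoid picking up years or other figures that happen to be larger than the displacement figure itself.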

In [64]:
# Load the training data, remove non-English excerpts
df = pd.read_excel("../data/IDMC_fully_labelled.csv.xlsx")
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [65]:
df = df[['Excerpt', 'Displacement figure', 'Reporting unit']].copy()

#### Test 1: Existing hand-crafted rules

In [66]:
df['cleaned_excerpt'] = df['Excerpt'].apply(lambda x: remove_brackets(x))

In [67]:
df['extracted_report'] = df['cleaned_excerpt'].apply(lambda x: helper.get_report(interpreter.process_article_new(x)))

In [68]:
df['extracted_quantity'] = df['extracted_report'].apply(lambda x: x[0])

In [69]:
df['extracted_quantity'] = df['extracted_quantity'].fillna(0)
df['extracted_quantity'] = df['extracted_quantity'].astype(int)

In [70]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['extracted_quantity'], average='weighted')



In [71]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.5413223140495868, Recall: 0.5206611570247934, F1: 0.5234159779614325, Support: None


#### Test 2: Additional set of rules taking into account the Unit (People vs Households)

In [75]:
df['new_extracted_figure'] = df[['Excerpt', 'Reporting unit']].apply(lambda x: helper.get_number(x['Excerpt'], x['Reporting unit']), axis=1)

In [76]:
df['new_extracted_figure'] = df['new_extracted_figure'].fillna(0)
df['new_extracted_figure'] = df['new_extracted_figure'].astype(int)

In [77]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['new_extracted_figure'], average='weighted')



In [78]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.7252066115702479, Recall: 0.7107438016528925, F1: 0.7099567099567099, Support: None


#### Test 3: Combine the outputs of approaches 1 & 2

In [80]:
df['combined_quantity'] = df[['extracted_quantity', 'new_extracted_figure']].apply(lambda x: interpreter.combine_quantities(x['extracted_quantity'], x['new_extracted_figure']), axis=1)

In [81]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['combined_quantity'], average='weighted')



In [82]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.7603305785123967, Recall: 0.7603305785123967, F1: 0.7548209366391185, Support: None
