This is a summary of the current 'best' approaches for extracting the relevant terms as requested by IDMC:
- Reporting term
- Reporting unit
- Displacement figure
- Location & Country

For reference we have:
1. Set of ~130 pre-labelled examples provided by IDMC (~118 in English)
2. Test set of excerpts that have been hand-labelled

Note: in many cases the relevant labels are quite subjective given the vagaries of language; furthermore in some cases multiple 'terms' or 'units' can be mentioned in the same sentence or excerpt; similarly there can often be ambiguous locations within the text

In general, in order to choose the most likely or relevant terms from the text, we are using the following heuristics:

1. Things affecting 'People' take precedence over things affecting Structures
2. Where clarification is given in brackets it is ignored (i.e., 1,000 families (3,000 people) should give 1,000 families)
3. Reports of destroyed structures take precedence over reports of damaged structures
4. If there is still ambiguity, the first possible report is chosen
5. Where multiple possible countries are mentioned:
    - The most common country is chosen, or
    - The first country is chosen

### Reporting Unit: People or Households

Overall Approach: Combine output of rules-based and machine learning model

Machine Learning Model: Multinomial Naive Bayes
Optimized based on ~130 training examples provided
Grid Search for Best Alpha parameter

Text Pre-Processing:
    
1. Remove named entities (People, Dates, Locations, Organizations, Numbers)
2. Remove text in brackets
3. Clean unusual characters etc.

Features:
    
Words extracted using count vectorizer; stop words removed

In [1]:
%load_ext autoreload
%autoreload 2
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
import pandas as pd
import numpy as np
import spacy
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import sklearn.pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from internal_displacement.interpreter import Interpreter
from internal_displacement.extracted_report import convert_quantity

In [2]:
nlp = spacy.load('en')
person_reporting_terms = [
    'displaced', 'evacuated', 'forced', 'flee', 'homeless', 'relief camp',
    'sheltered', 'relocated', 'stranded', 'stuck', 'accommodated']

structure_reporting_terms = [
    'destroyed', 'damaged', 'swept', 'collapsed',
    'flooded', 'washed', 'inundated', 'evacuate'
]

person_reporting_units = ["families", "person", "people", "individuals", "locals", "villagers", "residents",
                            "occupants", "citizens", "households"]

structure_reporting_units = ["home", "house", "hut", "dwelling", "building"]

relevant_article_terms = ['Rainstorm', 'hurricane',
                          'tornado', 'rain', 'storm', 'earthquake']
relevant_article_lemmas = [t.lemma_ for t in nlp(
    " ".join(relevant_article_terms))]

data_path = '../data'

In [3]:
interpreter = Interpreter(nlp, person_reporting_terms, structure_reporting_terms, person_reporting_units,
                          structure_reporting_units, relevant_article_lemmas, data_path,
                          model_path='../internal_displacement/classifiers/default_model.pkl',
                          encoder_path='../internal_displacement/classifiers/default_encoder.pkl')

In [4]:
# Load the training data, remove non-English excerpts
df = pd.read_csv("../data/IDMC_fully_labelled.csv", encoding='latin1')
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [5]:
def remove_brackets(text):
    '''Remove items in brackets
    '''
    text = re.sub(r'(\(.+\))', '', text)
    text = re.sub(r'\s{2}', ' ', text)
    return text
    
def cleanup(text):
        '''Clean common errors in the text and remove irrelevant tokens and stop words'''
        text = re.sub(r'([a-zA-Z0-9])(IMPACT)', r'\1. \2', text)
        text = re.sub(r'([a-zA-Z0-9])(RESPONSE)', r'\1. \2', text)
        text = re.sub(r'(IMPACT)([a-zA-Z0-9])', r'\1. \2', text)
        text = re.sub(r'(RESPONSE)([a-zA-Z0-9])', r'\1. \2', text)
        text = re.sub(r'([a-zA-Z])(\d)', r'\1. \2', text)
        text = re.sub(r'(\d)\s(\d)', r'\1\2', text)
        text = text.replace('\r', ' ')
        text = text.replace('  ', ' ')
        text = text.replace("peole", "people")
        output = ''
        for char in text:
            if char in string.printable:
                output += char
        output = remove_irrelevant_tokens(output)
        return output

def remove_irrelevant_tokens(text):
    '''Remove phrases in brackets and irrelevant tokens (named entities and stop words)
    Return lemmatized text'''
    text = remove_brackets(text)
    output = []
    doc = nlp(text)
    for token in doc:
        if test_token(token):
            output.append(token)
    return " ".join([t.lemma_ for t in output])
    
def test_token(token):
    '''Test tokens for exclusion'''
    if token.like_num:
        return False
    elif token.ent_type_ in ('LOC', 'GPE', 'PERSON', 'ORG', 'DATE', 'FAC', 'NORP'):
        return False
    elif token.is_stop:
        return False
    else:
        return True

In [6]:
df = df[['Excerpt', 'Reporting unit']].copy()

In [7]:
# Clean excerpts using functions above
df['cleaned_text'] = df['Excerpt'].apply(lambda x: cleanup(x))

#### Use grid-search to optimize for MultiNomial Naive Bayes alpha param

In [8]:
clf = MultinomialNB()

In [9]:
# Words as features
vectorizer = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(1,1))

In [10]:
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['Reporting unit'], test_size=0.2, random_state=42)

In [11]:
X_train = vectorizer.fit_transform(X_train)

In [12]:
parameters = dict(alpha=[0, 0.1, 1, 10])
cv = GridSearchCV(clf, param_grid=parameters)
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [0, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [13]:
X_test = vectorizer.transform(X_test)
y_predictions = cv.predict(X_test)

In [14]:
print(sklearn.metrics.classification_report( y_test, y_predictions ))

             precision    recall  f1-score   support

 Households       1.00      0.50      0.67         6
     People       0.86      1.00      0.92        18

avg / total       0.89      0.88      0.86        24



In [15]:
cv.best_params_

{'alpha': 1}

In [16]:
### Train the classifier on all the training data
clf = MultinomialNB(alpha=1)
vectorizer = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(1,1))
X_train = vectorizer.fit_transform(df['cleaned_text'])
y_train = df['Reporting unit']

clf.fit(X_train, y_train)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

#### Run tests on the actual test data;

Note, this data is hand tagged which in many cases is subjective
There may be errors, or the tags may not agree with IDMC defined-tags

In [17]:
### Load the data and perform the same cleanup procedure
test_df = pd.read_excel("../data/IDETECT_test_dataset - NLP.csv.xlsx")
test_df['cleaned_text'] = test_df['excerpt'].apply(lambda x: cleanup(x))

#### Test 1: Results of only using the Multinomial NB classifier

In [18]:
X_test = vectorizer.transform(test_df['cleaned_text'])
y_test = test_df['Unit']

In [19]:
predicted_y = clf.predict(X_test)
print(classification_report( y_test, predicted_y ))

             precision    recall  f1-score   support

 Households       0.75      0.24      0.37        37
     People       0.91      0.99      0.95       275

avg / total       0.89      0.90      0.88       312



#### Test 2: Results of only using the hand-crafted rules

In [26]:
def minimum_loc(spans):
    '''Find the first character location in text for each report
    '''
    locs = []
    for s in spans:
        if s['type'] != 'loc':
            locs.append(s['start'])
    return min(locs)

def choose_report(reports):
    '''Choose report based on the heuristics mentioned in the first cell
    '''
    people_reports = []
    household_reports_1 = []
    household_reports_2 = []
    
    for r in reports:
        if r.subject_term == "People":
            people_reports.append(r)
        elif r.subject_term == "Households":
            if r.event_term in ("Partially Destroyed Housing", "Uninhabitable Housing"):
                household_reports_2.append(r)
            else:
                household_reports_1.append(r)
    if len(people_reports) > 0:
        report = first_report(people_reports)
    elif len(household_reports_1) > 0:
        report = first_report(household_reports_1)
    elif len(household_reports_2) > 0:
        report = first_report(household_reports_2)
    else:
        report = reports[0]
    
    return report
    

def first_report(reports):
    '''Choose the first report based on location in text'''
    report_locs = []
    for report in reports:
        report_locs.append((report, minimum_loc(report.tag_spans)))
    return sorted(report_locs, key=lambda x: x[1])[0][0]

def get_report(excerpt):
    '''Get reports based on Excerpt and choose the most relevant one'''
    reports = interpreter.process_article_new(excerpt)
    if len(reports) > 0:
        report = choose_report(reports)
        return report.quantity, report.event_term, report.subject_term, report.locations
    else:
        return 0, '', '', ''

In [22]:
# Try and clean the text and remove brackets
test_df['cleaned_excerpt'] = test_df['excerpt'].apply(lambda x: cleanup(remove_brackets(x)))

In [27]:
# Extract the report
test_df['extracted_report'] = test_df['cleaned_excerpt'].apply(lambda x: get_report(x))

In [28]:
# Get the predicted unit from the extracted report
test_df['predicted_unit'] = test_df['extracted_report'].apply(lambda x: x[2])

In [29]:
print(classification_report( y_test, test_df['predicted_unit'] ))

             precision    recall  f1-score   support

                  0.00      0.00      0.00         0
 Households       1.00      0.32      0.49        37
     People       0.97      0.63      0.76       275

avg / total       0.97      0.60      0.73       312



  'recall', 'true', average, warn_for)


#### Test 3: Combine classifier and hand-crafted results into single prediction
1. If they both agree -> done
2. If no rules-based prediction -> use classifier result
3. If both predictions exist but they differ -> use hand-crafted rules

In [30]:
def combine_predictions(classifier, rules):
    if classifier == rules:
        return classifier
    elif not rules or rules == '':
        return classifier
    else:
        return rules

In [31]:
combined_predictions = []
for p1, p2 in zip(predicted_y, test_df['predicted_unit']):
    combined_predictions.append(combine_predictions(p1, p2))

In [32]:
print(classification_report( y_test, combined_predictions))

             precision    recall  f1-score   support

 Households       0.90      0.51      0.66        37
     People       0.94      0.99      0.96       275

avg / total       0.93      0.94      0.93       312



## Country Extraction

Country extraction approach is currently based on Spacy entity extraction and hand-crafted rules for translated extracted locations into 3-digit ISO country codes

If more than one country is identified in the fragment, the rules for choosing the country are:

1. Country that occurs the highest number of times or,
2. Country that appears first in the text

#### Test on the provided pre-labelled training data

Note: in some cases a country code is still given, even though there are no locations mentioned in the excerpt
We ignore these cases for testing purposes (the information likely comes from a different source that will not
be included in the specific NLP test)

In [33]:
# Load the training data, remove non-English excerpts
df = pd.read_excel("../data/IDMC_fully_labelled.csv.xlsx", encoding='latin1')
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [35]:
### Remove those lines where no location is mentioned
df = df[df['Mentions'] == 'Y'].copy()

In [34]:
def get_country_code(text):
    '''Extract country code from ; separated values'''
    return text.split(';')[1]

In [36]:
df['true_code'] = df['Location (Country)'].apply(lambda x: get_country_code(x))

In [37]:
def choose_country(countries, first_country=''):
    '''Choose country out of possible countries
    Either return the most commonly mentioned country
    or the first country found
    '''
    
    if len(countries) == 0:
        return ''
    if len(countries) <= 1:
        return countries.most_common()[0][0]
    else:
        country_counts = list(countries.values())
        max_count = max(country_counts)
        if country_counts.count(max_count) > 1:
            return first_country
        else:
            return countries.most_common()[0][0]

def get_location(excerpt):
    '''Extract the country from a text fragment by
    identifying all possible countries and selecting
    from among them
    '''
    excerpt = interpreter.cleanup(excerpt)
    possible_countries, first_country = interpreter.extract_countries(excerpt)
    chosen_country = choose_country(possible_countries, first_country)
    return chosen_country

In [38]:
df['extracted_code'] = df['Excerpt'].apply(lambda x: get_location(x))

In [39]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support( df['true_code'], df['extracted_code'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [40]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.8293752283522104, Recall: 0.782608695652174, F1: 0.8009130965652705, Support: None


#### Test on the self-labelled testing data

In [41]:
test_df['extracted_code'] = test_df['excerpt'].apply(lambda x: get_location(x))

In [42]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(test_df['Country'].fillna(''), test_df['extracted_code'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [43]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.8543503949753949, Recall: 0.7596153846153846, F1: 0.7767415657104066, Support: None


### Reporting Term

Reporting term can be one of 10 categories (i.e. Displaced, Evacuated, In Relief Camp etc)

A similar approach is taken as per reporting unit (i.e. combining outcome of hand-crafted rules & classifier).

In this case, the classifier is even harder to train as many of the categories have very few or 0 examples

In [44]:
# Load the training data, remove non-English excerpts
df = pd.read_excel("../data/IDMC_fully_labelled.csv.xlsx", encoding='latin1')
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [45]:
df = df[['Excerpt', 'Reporting term']].copy()

In [46]:
df['cleaned_text'] = df['Excerpt'].apply(lambda x: cleanup(x))

In [47]:
clf = MultinomialNB()

In [48]:
vectorizer = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(2,2))

In [49]:
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['Reporting term'], test_size=0.2, random_state=42)

In [50]:
X_train = vectorizer.fit_transform(X_train)

In [51]:
parameters = dict(alpha=[0, 0.1, 1, 10])
cv = GridSearchCV(clf, param_grid=parameters)
cv.fit(X_train, y_train)



GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [0, 0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [52]:
X_test = vectorizer.transform(X_test)
y_predictions = cv.predict(X_test)

In [53]:
print(sklearn.metrics.classification_report( y_test, y_predictions ))

                   precision    recall  f1-score   support

Destroyed Housing       0.00      0.00      0.00         2
        Displaced       1.00      0.80      0.89         5
        Evacuated       0.71      0.92      0.80        13
   Forced to Flee       0.00      0.00      0.00         1
         Homeless       0.00      0.00      0.00         1
   In Relief Camp       0.00      0.00      0.00         1
        Relocated       1.00      1.00      1.00         1
        Sheltered       0.00      0.00      0.00         1

      avg / total       0.61      0.68      0.63        25



  'precision', 'predicted', average, warn_for)


In [54]:
cv.best_params_

{'alpha': 1}

In [55]:
### Train the classifier on all the training data
clf = MultinomialNB(alpha=1)
vectorizer = CountVectorizer(analyzer = "word", binary = True, 
                             ngram_range=(2,2))
X_train = vectorizer.fit_transform(df['cleaned_text'])
y_train = df['Reporting term']

clf.fit(X_train, y_train)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

### Run tests on the actual test data;

Note, this data is hand tagged which in many cases is subjective
There may be errors, or the tags may not agree with IDLT defined-tags

#### Test 1: Results of only using the Multinomial NB classifier

In [56]:
X_test = vectorizer.transform(test_df['cleaned_text'])
y_test = test_df['Term']

In [57]:
predicted_y = clf.predict(X_test)

In [58]:
print(classification_report( y_test, predicted_y ))

                             precision    recall  f1-score   support

          Destroyed Housing       0.44      0.40      0.42        10
                  Displaced       0.93      0.68      0.79       182
                  Evacuated       0.26      0.96      0.41        46
             Forced to Flee       0.00      0.00      0.00        39
                   Homeless       1.00      0.50      0.67         2
             In Relief Camp       0.00      0.00      0.00         8
Partially Destroyed Housing       0.00      0.00      0.00         2
                  Relocated       0.00      0.00      0.00         5
                  Sheltered       0.00      0.00      0.00        18

                avg / total       0.60      0.55      0.54       312



  'precision', 'predicted', average, warn_for)


#### Test 2: Results of only using the hand crafted rules

In [60]:
test_df['predicted_term'] = test_df['extracted_report'].apply(lambda x: x[1])

In [61]:
print(classification_report( y_test, test_df['predicted_term'] ))

                             precision    recall  f1-score   support

                                  0.00      0.00      0.00         0
          Destroyed Housing       1.00      0.20      0.33        10
                  Displaced       0.95      0.66      0.78       182
                  Evacuated       1.00      0.70      0.82        46
             Forced to Flee       0.81      0.64      0.71        39
                   Homeless       0.00      0.00      0.00         2
             In Relief Camp       0.00      0.00      0.00         8
Partially Destroyed Housing       0.00      0.00      0.00         2
                  Relocated       1.00      0.20      0.33         5
                  Sheltered       0.00      0.00      0.00        18

                avg / total       0.85      0.58      0.68       312



  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


#### Test 3: Results of combining classifier predictions and hand crafted rules

In [62]:
combined_predictions = []
for p1, p2 in zip(predicted_y, test_df['predicted_term']):
    combined_predictions.append(combine_predictions(p1, p2))

In [63]:
print(classification_report( y_test, combined_predictions))

                             precision    recall  f1-score   support

          Destroyed Housing       0.57      0.40      0.47        10
                  Displaced       0.94      0.74      0.82       182
                  Evacuated       0.34      0.96      0.50        46
             Forced to Flee       0.81      0.64      0.71        39
                   Homeless       1.00      0.50      0.67         2
             In Relief Camp       0.00      0.00      0.00         8
Partially Destroyed Housing       0.00      0.00      0.00         2
                  Relocated       1.00      0.20      0.33         5
                  Sheltered       0.00      0.00      0.00        18

                avg / total       0.74      0.67      0.67       312



  'precision', 'predicted', average, warn_for)


### Extracted Quantity

Extracted quantity is obtained in two ways:
    
1. Using the quantity extracted in the most likely report
2. Using a second rule based approach that also takes into account the predicted Unit (People / Households) and attempts to find the nearest number to that unit in the text
3. If neither of the above approaches yields a result, then returns the largest number found in the text

In [64]:
# Load the training data, remove non-English excerpts
df = pd.read_excel("../data/IDMC_fully_labelled.csv.xlsx", encoding='latin1')
df['lang'] = df['Excerpt'].apply(lambda x: interpreter.check_language(x))
df = df[df['lang'] == 'en'].copy()

In [65]:
df = df[['Excerpt', 'Displacement figure', 'Reporting unit']].copy()

#### Test 1: Existing hand-crafted rules

In [66]:
df['cleaned_excerpt'] = df['Excerpt'].apply(lambda x: remove_brackets(x))

In [67]:
df['extracted_report'] = df['cleaned_excerpt'].apply(lambda x: get_report(x))

In [68]:
df['extracted_quantity'] = df['extracted_report'].apply(lambda x: x[0])

In [69]:
df['extracted_quantity'] = df['extracted_quantity'].fillna(0)
df['extracted_quantity'] = df['extracted_quantity'].astype(int)

In [70]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['extracted_quantity'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [71]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.5413223140495868, Recall: 0.5206611570247934, F1: 0.5234159779614325, Support: None


#### Test 2: Additional set of rules taking into account the Unit (People vs Households)

In [73]:
person_units = ["person", "people", "individuals", "locals", "villagers", "residents",
                "occupants", "citizens", "IDP"]

household_units = ["home", "house", "hut", "dwelling", "building", "families", "households"]

person_lemmas =[t.lemma_ for t in nlp(" ".join(person_units))]
household_lemmas =[t.lemma_ for t in nlp(" ".join(household_units))]

In [74]:
def get_closest_number(possible_numbers, possible_units):
    '''Get closest number in text to word that most closely matches the given unit'''
    if len(possible_units) > 0:
        first_unit_place = possible_units[0][1]
        diffs = [(p, first_unit_place - n) for p, n in possible_numbers if first_unit_place - n > 0]
        if len(diffs) > 0:
            return sorted(diffs, key=lambda x: x[1])[0][0]
    return 0

def get_number(text, unit):
    '''Get number from text based on the given unit type'''
    # Remove brackets
    text = re.sub(r'(\(.+\))', '', text)
    text = re.sub(r'\s{2}', ' ', text)
    text = re.sub(r'(\d)\s(\d)', r'\1\2', text)
    lemmas = []
    if unit == "People":
        lemmas = person_lemmas
    else:
        lemmas = household_lemmas
    doc = nlp(text)
    possible_numbers = []
    possible_units = []
    for token in doc:
        if token.like_num:
            quantity = convert_quantity(token.text)
            if quantity:
                possible_numbers.append((quantity, token.i))
        if token.lemma_ in lemmas:
            possible_units.append((token, token.i))
    
    closest_num = get_closest_number(possible_numbers, possible_units)
    if closest_num and closest_num > 0:
        return closest_num
    elif len(possible_numbers) > 0:
        highest_possible = sorted(possible_numbers, key=lambda x: x[0])[0][0]
        return highest_possible
    else:
        return 0

In [75]:
df['new_extracted_figure'] = df[['Excerpt', 'Reporting unit']].apply(lambda x: get_number(x['Excerpt'], x['Reporting unit']), axis=1)

In [76]:
df['new_extracted_figure'] = df['new_extracted_figure'].fillna(0)
df['new_extracted_figure'] = df['new_extracted_figure'].astype(int)

In [77]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['new_extracted_figure'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [78]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.7252066115702479, Recall: 0.7107438016528925, F1: 0.7099567099567099, Support: None


#### Test 3: Combine the outputs of approaches 1 & 2

In [79]:
def combine_quantities(r1, r2):
    '''Combine output of reports-based rules & new rules'''
    if r1 and r1 > 0:
        return r1
    elif r2:
        return r2
    else:
        return 0

In [80]:
df['combined_quantity'] = df[['extracted_quantity', 'new_extracted_figure']].apply(lambda x: combine_quantities(x['extracted_quantity'], x['new_extracted_figure']), axis=1)

In [81]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(df['Displacement figure'], df['combined_quantity'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [82]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.7603305785123967, Recall: 0.7603305785123967, F1: 0.7548209366391185, Support: None


#### Finally, test this combined approach on the NLP test dataset

In [84]:
test_df['rules_1_quantity'] = test_df['extracted_report'].apply(lambda x: x[0])
test_df['rules_2_quantity'] = test_df[['excerpt', 'predicted_unit']].apply(lambda x: get_number(x['excerpt'], x['predicted_unit']), axis=1)

In [85]:
test_df['combined_quantity'] = test_df[['rules_1_quantity', 'rules_2_quantity']].apply(lambda x: combine_quantities(x['rules_1_quantity'], x['rules_2_quantity']), axis=1)

In [87]:
test_df['combined_quantity'] = test_df['combined_quantity'].astype(int)

In [90]:
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(test_df['Quantity'], test_df['combined_quantity'], average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [91]:
print("Precision: {}, Recall: {}, F1: {}, Support: {}".format(precision, recall, f1, support))

Precision: 0.8279914529914529, Recall: 0.7307692307692307, F1: 0.7614097367773838, Support: None
