Analysis of the extractions produced by our NRE v2 extractor on the finance data set (39,833 extractions from the first 4,000 documents).

In [1]:
import numpy as np
import pandas as pd
import unicodedata

In [2]:
# Read in data
columns = ["sentence", "arg1", "verb", "relation", "arg2", "value_1", "unit_1", "value_2", "unit_2",
           "q_measure_1", "quantulum_v1", "quantulum_u1", "q_measure_2", "quantulum_v2", "quantulum_u2", "arg1_f", "rel_f"]

df = pd.read_csv('finance_NewExtractions.csv', names=columns)
len(df)

39833

In [4]:
# Only keep strings containing a capital letter (Assumption: interesting entities are proper nouns)
def capital_filter(text):
    # Return the value unchanged if it has an uppercase letter, otherwise None
    if any(letter.isupper() for letter in str(text)):
        return text
    return None
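
As a quick illustration of what this filter keeps and drops, here is a minimal sketch. The sample strings are hypothetical and are not taken from the data set. Note that a value passes if it contains an uppercase letter anywhere, not only at the start.

```python
def capital_filter(text):
    # Same logic as the notebook cell above
    if any(letter.isupper() for letter in str(text)):
        return text
    return None

# Hypothetical sample values (assumption: not from the finance data set)
samples = ["Brent crude", "shares", "the S&P 500", "net income", float("nan")]

# str(float("nan")) is "nan", which has no uppercase letter, so it is dropped too
kept = [s for s in samples if capital_filter(s) is not None]
print(kept)
```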

In [7]:
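# Apply filters: blank out arg1_f values without a capital letter, coerce non-numeric values to NaN
df['arg1_f'] = df.arg1_f.apply(capital_filter)
df['value_1'] = pd.to_numeric(df['value_1'], errors='coerce')
df['value_2'] = pd.to_numeric(df['value_2'], errors='coerce')
# Keep only rows that have both a numeric value_1 and a surviving arg1_f
df2 = df[df.value_1.notna() & df.arg1_f.notna()].reset_index(drop=True)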
len(df2)

23055

In [8]:
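# Standardize common units
df2.loc[df2['unit_1'] == 'percent', 'unit_1'] = '%'
df2.loc[df2['unit_1'] == 'per cent', 'unit_1'] = '%'
df2.loc[df2['unit_1'] == 'dollars', 'unit_1'] = '$'
# A dollar value whose relation is 'barrel' is a price per barrel: move 'barrel' into the unit
df2.loc[(df2.rel_f == 'barrel') & (df2.unit_1 == '$'), 'unit_1'] = '$ per barrel'
# Collapse any double-space variant of the unit
df2.loc[df2['unit_1'] == '$  per barrel', 'unit_1'] = '$ per barrel'
# The unit now carries 'barrel', so clear it from the relation (np.NaN was removed in NumPy 2.0)
df2.loc[(df2.rel_f == 'barrel') & (df2.unit_1 == '$ per barrel'), 'rel_f'] = np.nan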
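
The synonym rewrites above could also be driven by a lookup table, which may be easier to extend as more unit spellings turn up. This is only a sketch on a hypothetical toy frame: `toy` and `UNIT_MAP` are illustrative names, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names (values are made up)
toy = pd.DataFrame({
    "unit_1": ["percent", "per cent", "dollars", "$", "%"],
    "rel_f": ["points", "points", "share", "barrel", "points"],
})

# Assumed synonym table; extend it as new spellings are found
UNIT_MAP = {"percent": "%", "per cent": "%", "dollars": "$"}
toy["unit_1"] = toy["unit_1"].replace(UNIT_MAP)

# Dollar values whose relation is 'barrel' become prices per barrel
toy.loc[(toy.rel_f == "barrel") & (toy.unit_1 == "$"), "unit_1"] = "$ per barrel"
# The unit now carries 'barrel', so the relation is cleared
toy.loc[(toy.rel_f == "barrel") & (toy.unit_1 == "$ per barrel"), "rel_f"] = np.nan

print(toy)
```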

In [9]:
# Group by arg1_f, rel_f, and unit_1; keep groups with more than 50 extractions
gdf1 = (df2.groupby(["arg1_f", "rel_f", "unit_1"])['value_1']
           .agg(["min", "max", "count"])
           .sort_values("count", ascending=False)
           .query("count > 50"))
# Show only groups whose values vary
gdf1[gdf1['min'] != gdf1['max']]

arg1_f,rel_f,unit_1,min,max,count
Dow Jones Industrial Average,points,%,0.01,2.67,293
Nasdaq Composite,points,%,0.0,3.5,252
S&P,points,%,0.01,1.44,100
Japan,MSCI broadest index,%,0.05,10.0,92
.DJI,points,%,0.01,3.29,89
Net income,share,$,10000000.0,19040000000.0,64
Dow e - minis,points,%,0.02,1.74,51
Nasdaq 100 e - minis,points,%,0.01,2.4,51
S&P 500 e - minis,points,%,0.01,1.86,51


In [10]:
# Dollars per barrel: first and third quartiles of value_1 per entity
# Count > 10
gdf2 = df2.query("unit_1 == '$ per barrel'").groupby(["arg1_f","unit_1"])["value_1"].agg(["count"])
gdf2['25%'] = df2.groupby(["arg1_f","unit_1"])['value_1'].quantile(q=0.25)
gdf2['75%'] = df2.groupby(["arg1_f","unit_1"])['value_1'].quantile(q=0.75)
gdf2.query("count > 10").sort_values("count", ascending=False)

arg1_f,unit_1,count,25%,75%
Brent crude futures,$ per barrel,84,63.8075,71.4425
U.S. crude,$ per barrel,82,53.765,60.17
U.S. West Texas Intermediate ( WTI ) crude futures,$ per barrel,60,54.1925,63.4075
Brent,$ per barrel,45,62.01,71.4
Brent crude,$ per barrel,41,66.9,70.38
U.S. crude futures,$ per barrel,28,55.7075,61.2225
Brent crude LCOc1 futures,$ per barrel,22,58.7875,63.74
U.S. West Texas Intermediate ( WTI ) crude oil futures,$ per barrel,22,55.8625,57.375
Brent futures,$ per barrel,17,62.28,69.72
U.S. West Texas Intermediate ( WTI ) crude CLc1 futures,$ per barrel,17,48.52,52.31
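
The count and both quartiles can also be computed in a single grouped pass, which avoids regrouping the full frame for each quantile. The sketch below uses pandas named aggregation on a hypothetical toy frame; the helper names `q25` and `q75` and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names (values are made up)
toy = pd.DataFrame({
    "arg1_f": ["Brent"] * 4 + ["U.S. crude"] * 4,
    "unit_1": ["$ per barrel"] * 8,
    "value_1": [60.0, 62.0, 64.0, 66.0, 50.0, 52.0, 54.0, 56.0],
})

def q25(s):
    # First quartile of a group's values
    return s.quantile(0.25)

def q75(s):
    # Third quartile of a group's values
    return s.quantile(0.75)

# One groupby, three output columns named like the notebook's table
summary = (
    toy.query("unit_1 == '$ per barrel'")
       .groupby(["arg1_f", "unit_1"])["value_1"]
       .agg(**{"count": "count", "25%": q25, "75%": q75})
)
print(summary)
```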


In [18]:
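# Nasdaq Composite index moves, grouped by verb and unit
# The counts cover Nasdaq Composite rows only; the 25%/75% columns are computed over all of df2
# Count > 10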
gdf3 = df2[df2.arg1.str.contains('Nasdaq Composite')].groupby(["verb","unit_1"])['value_1'].agg(["count"])
gdf3['25%'] = df2.groupby(["verb","unit_1"])['value_1'].quantile(q=0.25)
gdf3['75%'] = df2.groupby(["verb","unit_1"])['value_1'].quantile(q=0.75)
gdf3.query("count > 10")

verb,unit_1,count,25%,75%
added,%,96,0.2375,0.9
added,points,86,19.07,83.67
dropped,%,96,0.46,3.2
dropped,points,88,13.11,67.79
gained,%,26,0.3,3.4
gained,points,26,5.79,27.77
recorded,new highs,67,22.0,58.25
recorded,new lows,67,9.0,37.25
was down,%,50,0.14,0.8
was down,points,50,8.44,62.385


In [19]:
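# Dow Jones Industrial Average index moves, grouped by verb and unit
# The counts cover Dow Jones Industrial Average rows only; the 25%/75% columns are computed over all of df2
# Count > 10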
gdf3 = df2[df2.arg1.str.contains('Dow Jones Industrial Average')].groupby(["verb","unit_1"])['value_1'].agg(["count"])
gdf3['25%'] = df2.groupby(["verb","unit_1"])['value_1'].quantile(q=0.25)
gdf3['75%'] = df2.groupby(["verb","unit_1"])['value_1'].quantile(q=0.75)
gdf3.query("count > 10")

verb,unit_1,count,25%,75%
fell,%,92,0.6,4.8
fell,points,89,41.4525,190.1275
rose,%,101,0.5,3.825
rose,points,98,38.2575,174.4775
was,a.m. ET,37,9.0,11.0
was,p.m. ET,25,12.0,12.0
was down,%,57,0.14,0.8
was down,points,51,8.44,62.385
was up,%,60,0.245,0.955
was up,points,57,13.045,78.885
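
In the two index cells above, the counts come from rows filtered on `arg1`, while the quartile columns come from `df2.groupby(["verb","unit_1"])` over the whole frame and are then aligned on the index. That would explain why `was down` shows the same quartiles (0.14 and 0.8 for %, 8.44 and 62.385 for points) in both tables. A sketch of one way to restrict the quartiles to the filtered rows follows; `index_summary` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names (values are made up)
toy = pd.DataFrame({
    "arg1": ["Nasdaq Composite"] * 3 + ["Dow Jones Industrial Average"] * 3,
    "verb": ["fell"] * 6,
    "unit_1": ["%"] * 6,
    "value_1": [0.1, 0.2, 0.3, 1.0, 2.0, 3.0],
})

def index_summary(frame, name):
    # Filter first, then take count and quartiles from the same subset
    subset = frame[frame["arg1"].str.contains(name, regex=False, na=False)]
    grouped = subset.groupby(["verb", "unit_1"])["value_1"]
    out = grouped.agg(["count"])
    out["25%"] = grouped.quantile(q=0.25)
    out["75%"] = grouped.quantile(q=0.75)
    return out

nasdaq = index_summary(toy, "Nasdaq Composite")

# For comparison: the quartile taken over every row, as in the cells above
whole_q25 = toy.groupby(["verb", "unit_1"])["value_1"].quantile(q=0.25)
print(nasdaq)
print(whole_q25)
```

On the toy data the filtered quartile reflects only the Nasdaq Composite rows, while the whole-frame quartile is pulled up by the other index's larger values.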


In [23]:
# Example sentence containing 'Dow Jones Industrial Average'
df2.iloc[87].sentence

'The Dow Jones Industrial Average fell 17.78 points, or 0.06%, to 27,341.38, the S&P 500 fell 10.24 points, or 0.34%, to 3,004.06 and the Nasdaq Composite dropped 36.96 points, or 0.45%, to 8,221.23.'

#### Conclusions

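The extractor produces some interesting results, but getting correct NRE output for financial news is difficult:

The relation noun phrase is often missing, so the relation has to be inferred, which is not feasible in a rule-based system. In the example sentence above, "fell 17.78 points" never names the quantity that fell.

The unit may also be missing and require inference, as with "to 27,341.38" in the same sentence.

Entities (e.g. stock names/codes, fuels) are less likely to be extracted correctly.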
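
To make the first point concrete, the sketch below runs a toy surface pattern over the example sentence from cell 23. The regular expression is a hypothetical stand-in written for this illustration and is not one of the NRE v2 extractor's rules. It recovers entity, verb, value and unit, but the sentence offers no relation noun phrase to capture.

```python
import re

sentence = (
    "The Dow Jones Industrial Average fell 17.78 points, or 0.06%, to 27,341.38, "
    "the S&P 500 fell 10.24 points, or 0.34%, to 3,004.06 and the Nasdaq Composite "
    "dropped 36.96 points, or 0.45%, to 8,221.23."
)

# Hypothetical pattern (assumption): optional article, capitalized entity, verb, number, unit
PATTERN = re.compile(
    r"(?:[Tt]he )?(?P<entity>[A-Z][\w&]*(?: [\w&]+)*?) "
    r"(?P<verb>fell|dropped|rose|added|gained) "
    r"(?P<value>[\d.]+) (?P<unit>points)"
)

extractions = [
    (m["entity"], m["verb"], m["value"], m["unit"])
    for m in PATTERN.finditer(sentence)
]
for row in extractions:
    print(row)
```

Every tuple has a verb and a unit, but whether the value is a change in index level, a price, or something else has to come from outside the sentence.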