Analysis of the output of our NRE v2 extractor on the astrophysics data set (36,596 extractions).

This notebook aims to show how NREs can be used to extract facts for multiple entity and relation combinations simultaneously.

In [1]:
import numpy as np
import pandas as pd
import unicodedata

In [2]:
# Read data
columns = ["sentence", "arg1", "verb", "relation", "arg2", "value_1", "unit_1", "value_2", "unit_2",
           "q_measure_1", "quantulum_v1", "quantulum_u1", "q_measure_2", "quantulum_v2", "quantulum_u2", "arg1_f", "rel_f"]
# _f in the name indicates filtered to remove stopwords

df = pd.read_csv('astro_NewExtractions.csv', names = columns)
len(df)

36596

In [3]:
# Apply filters
df['value_1'] = pd.to_numeric(df['value_1'], errors='coerce')
df['value_2'] = pd.to_numeric(df['value_2'], errors='coerce')
df2 = df[df.value_1.notna() & df.arg1_f.notna()].reset_index(drop=True)
len(df2)

25872

In [4]:
# Standardize common units
df2.loc[df2['unit_1'] == 'percent', 'unit_1'] = '%'
# Kilometres are detected via the quantulum unit column, then written to unit_1
df2.loc[df2['quantulum_u1'] == 'kilometre', 'unit_1'] = 'kilometers'
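
As the number of unit spellings grows, one-line replacements like those in cell 4 become hard to maintain. Below is a minimal sketch of a table-driven alternative. The names `UNIT_ALIASES` and `normalize_units` are hypothetical, and only the first two pairs mirror cell 4 (where 'kilometre' is matched on `quantulum_u1` rather than `unit_1`); the remaining aliases are illustrative assumptions, not values taken from the data set.

```python
import pandas as pd

# Hypothetical alias table. Only the first two pairs mirror cell 4;
# the others are assumptions added for illustration.
UNIT_ALIASES = {
    "percent": "%",
    "kilometre": "kilometers",
    "kilometres": "kilometers",
    "km": "kilometers",
    "hour": "hours",
}

def normalize_units(units: pd.Series) -> pd.Series:
    """Map known aliases to one canonical spelling; leave other values as they are."""
    cleaned = units.str.strip().str.lower()
    # map() yields NaN for unknown keys, so fall back to the cleaned original
    return cleaned.map(UNIT_ALIASES).fillna(cleaned)

demo = pd.Series(["percent", "Kilometre", "hours", None, " km "])
normalized = normalize_units(demo)
print(normalized.tolist())
```

Applied to the notebook, this would be called as `df2['unit_1'] = normalize_units(df2['unit_1'])`. Missing units stay missing.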

In [18]:
# Number of (arg1, relation, unit) groups in gdf1, which is built in cell 12 below
gdf1.count()

min      432
max      432
count    432
dtype: int64

In [12]:
# Group data on arg1, relation, and unit
# Count > 1, showing only groups whose min and max differ
gdf1 = df2.groupby(["arg1_f","rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).sort_values("count", ascending=False).query("count > 1")
gdf1[gdf1['min'] != gdf1['max']]

arg1_f,rel_f,unit_1,min,max,count
orbit,inclination,°,0.00000,79.500,623
collaborative asteroid lightcurve link,diameter,kilometers,0.90000,216.100,469
lightcurve analysis,brightness variation,magnitude,0.05000,25.155,119
lightcurve analysis,rotation period,hours,2.42870,488.063,109
lightcurve analysis,brightness amplitude,magnitude,0.05000,1.000,81
lightcurve analysis,- defined rotation period,hours,2.59990,85.000,64
s - type asteroid,rotation period,hours,2.60000,16.190,29
asteroid,diameter,kilometers,1.10000,317.000,23
analysis,rotation period,hours,2.92270,93.000,21
analysis,brightness amplitude,magnitude,0.02000,1.300,20


In [6]:
# ASTEROID ROTATION PERIODS
# COUNT > 10
gdf2 = df2[df2.arg1.str.contains('asteroid') & df2.relation.str.contains('rotation period')]
gdf2.groupby(["arg1_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 10").sort_values("count", ascending=False)

arg1_f,unit_1,min,max,count
s - type asteroid,hours,2.5,130.0,39
carbonaceous c - type asteroid,hours,5.0,236.6,21
x - type asteroid,hours,3.4,23.5,16
stony s - type asteroid,hours,3.0,97.4,14


In [7]:
# COUNT > 2, Showing filtered relation phrase
gdf2.groupby(["arg1_f", "rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 2").sort_values("count", ascending=False)

arg1_f,rel_f,unit_1,min,max,count
s - type asteroid,rotation period,hours,2.6,16.19,29
carbonaceous c - type asteroid,rotation period,hours,5.7,24.98,14
x - type asteroid,rotation period,hours,5.5,23.5,11
stony s - type asteroid,rotation period,hours,3.4,97.4,10
dark d - type asteroid,rotation period,hours,6.6,20.0,9
asteroid,rotation period,hours,5.1,12.05,4
c - type asteroid,rotation period,hours,8.0,19.8,4
d - type asteroid,rotation period,hours,8.96,22.7,4
dark c - type asteroid,rotation period,hours,11.68,16.0,4
primitive p - type asteroid,rotation period,hours,11.0,17.1,4


In [8]:
# ASTEROID DIAMETERS
# COUNT > 2
gdf3 = df2[df2.arg1.str.contains('asteroid') & df2.relation.str.contains('diameter')]
gdf3.groupby(["arg1_f", "rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 2").sort_values("count", ascending=False)

arg1_f,rel_f,unit_1,min,max,count
collaborative asteroid lightcurve link,diameter,kilometers,0.9,216.1,469
asteroid,diameter,kilometers,1.1,317.0,23
collaborative asteroid lightcurve link,larger diameter,kilometers,5.7,44.22,13
collaborative asteroid lightcurve link,smaller diameter,kilometers,2.19,56.0,5


In [9]:
# D-TYPE ASTEROID
# COUNT > 10
gdf4 = df2[df2.arg1.str.contains('d[- ]+type +asteroid')]
gdf4.groupby(["rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 10").sort_values("count", ascending=False)

rel_f,unit_1,min,max,count
rotation period,hours,6.6,22.7,16


In [10]:
# S-TYPE ASTEROID
# COUNT > 2
gdf5 = df2[df2.arg1.str.contains('s[- ]+type +asteroid')]
gdf5.groupby(["rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 2").sort_values("count", ascending=False)

rel_f,unit_1,min,max,count
rotation period,hours,2.6,97.4,49
long rotation period,hours,24.0,130.0,6
short rotation period,hours,2.5,4.9,5


In [11]:
# STAR
# COUNT > 2
gdf6 = df2[df2.arg1.str.contains('star')]
gdf6.groupby(["rel_f","unit_1"])['quantulum_v1'].agg(["min","max","count"]).query("count > 2").sort_values("count", ascending=False)

rel_f,unit_1,min,max,count
sun,times mass,0.45,6.7,22
sun,times luminosity,0.4,852.0,9
photosphere,times suns luminosity,1.5,1346.0,7
mass,%,0.1,60.0,6
metallicity,%,94.0,144.0,5
suns radius,%,16.0,96.0,4
position angle,°,135.0,155.538,3


#### Conclusions

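The approach produces interesting results, especially in cell 7, which shows rotation-period ranges for various asteroid types.

However, it is still limited by extractor performance. For example, the most frequent subject for `diameter` in cell 8 is "collaborative asteroid lightcurve link", which names a data source rather than an asteroid.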
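
Cells 6 to 11 repeat the same pattern: filter rows by a pattern on `arg1` (and sometimes `relation`), group, aggregate `quantulum_v1`, and keep groups above a count threshold. A minimal sketch of a helper that captures this pattern follows. The function name `summarize` and its parameters are hypothetical, and the toy rows are invented for illustration; they are not taken from the data set.

```python
import pandas as pd

def summarize(df, arg1_pattern, group_cols, min_count, relation_pattern=None):
    """Filter by regex on arg1 (and optionally relation), then report
    min, max, and count of quantulum_v1 per group."""
    # na=False treats missing text as a non-match instead of propagating NaN
    mask = df["arg1"].str.contains(arg1_pattern, na=False)
    if relation_pattern is not None:
        mask &= df["relation"].str.contains(relation_pattern, na=False)
    stats = df[mask].groupby(group_cols)["quantulum_v1"].agg(["min", "max", "count"])
    # Keep groups with strictly more than min_count rows, largest first
    return stats[stats["count"] > min_count].sort_values("count", ascending=False)

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "arg1": ["an s - type asteroid", "the s - type asteroid", "s - type asteroid", "a star", None],
    "relation": ["rotation period of", "rotation period", "rotation period", "mass", "rotation period"],
    "arg1_f": ["s - type asteroid"] * 3 + ["star", None],
    "rel_f": ["rotation period"] * 3 + ["mass", "rotation period"],
    "unit_1": ["hours"] * 3 + ["%", "hours"],
    "quantulum_v1": [2.6, 16.19, 5.0, 10.0, 7.0],
})

result = summarize(toy, r"s[- ]+type +asteroid", ["arg1_f", "rel_f", "unit_1"],
                   min_count=2, relation_pattern="rotation period")
print(result)
```

With such a helper, each per-entity cell reduces to a single call, and the count threshold in the comment can no longer drift from the one in the code.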