In [1]:
%load_ext autoreload
%autoreload 2
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

def colorize(string, color="red"):
    return f"<span style=\"color:{color}\">{string}</span>"

# Problem description

### Subtask2: Detecting antecedent and consequence

Conveying causal insight is an inherent characteristic of counterfactuals. To further detect the causal knowledge conveyed in counterfactual statements, subtask 2 aims to locate the antecedent and the consequent in counterfactuals.
 
According to Nelson Goodman (1947, The problem of counterfactual conditionals), a counterfactual statement can be converted to a contrapositive with a true antecedent and consequent. Consider the example “Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier”; it can be transposed into “because her post-traumatic stress was not avoided, (we know) a combination of paroxetine and exposure therapy was not prescribed”. Such knowledge can not only be used to analyze the specific statement but can also be accumulated across corpora to develop domain causal knowledge (e.g., a combination of paroxetine and exposure therapy may help cure post-traumatic stress).
 
Please note that __in some cases a counterfactual statement contains only an antecedent and no consequent__. For example, in the sentence "Frankly, I wish he had issued this order two years ago instead of this year", only the antecedent can be identified. In subtask 2, when a sentence has no consequent, set both the consequent starting index and the consequent ending index (character indices) to '-1'. For details, please refer to the 'Evaluation' on this website.
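
The indexing convention described above can be sketched as follows. This is only an illustration, assuming inclusive character indices and `-1` for a missing span; `extract_span` is a hypothetical helper and not part of the task's official tooling.

```python
from typing import Optional


def extract_span(sentence: str, start: int, end: int) -> Optional[str]:
    """Return the span sentence[start..end] (end inclusive), or None if it is marked absent."""
    if start == -1 or end == -1:
        # -1 is the marker for "no such part in this sentence"
        return None
    return sentence[start:end + 1]


sentence = "I wish that anything I said mattered, to anyone."
antecedent = extract_span(sentence, 0, 46)   # antecedent covers characters 0..46
consequent = extract_span(sentence, -1, -1)  # no consequent in this sentence
print(antecedent)
print(consequent)
```

Returning `None` rather than an empty string keeps "no consequent" distinguishable from an empty span.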



In [2]:
!ls

analyze_roberta_large.ipynb  data_statistics.ipynb


In [3]:
import pandas as pd
!pwd
df = pd.read_csv('../../.data/semeval2020_5/train_task2.csv')

/home/ifajcik/PycharmProjects/semeval2020_task5/scripts/statistics


The training set contains this many examples:

In [4]:
len(df)

3551

In [5]:
import random
i = random.randrange(len(df))  # randint(0, len(df)) is inclusive and could index past the last row
print(df.iloc[i])
print("-"*50)
print(df["sentence"].iloc[i])
print("-"*50)
print(df["antecedent"].iloc[i])
print("-"*50)
print(df["consequent"].iloc[i])

sentenceID                                                       202696
sentence              I wish there were 10 Daily Kos-style sites to ...
antecedent            I wish there were 10 Daily Kos-style sites to ...
consequent                                                           {}
antecedent_startid                                                    0
antecedent_endid                                                     82
consequent_startid                                                   -1
consequent_endid                                                     -1
Name: 2696, dtype: object
--------------------------------------------------
I wish there were 10 Daily Kos-style sites to cover the universe of great Democrats.
--------------------------------------------------
I wish there were 10 Daily Kos-style sites to cover the universe of great Democrats
--------------------------------------------------
{}


In [18]:
# Look up one specific example by its sentenceID
s = df.loc[df["sentenceID"]==203483]
#print(s)
print("-"*50)
print(s["sentence"].iloc[0])
print("-"*50)
print(s["antecedent"].iloc[0])
print("-"*50)
print(s["consequent"].iloc[0])

--------------------------------------------------
"Only last year Pfizer tried a tax inversion, an unsuccessful merger with AstraZeneca that would have shifted Pfizer's tax home to Britain."
--------------------------------------------------
an unsuccessful merger with AstraZeneca that would have shifted Pfizer's tax home to Britain
--------------------------------------------------
would have shifted Pfizer's tax home to Britain


In [7]:
df["antecedent"].iloc[0]

'if the stimulus bill had become hamstrung by a filibuster threat or recalcitrant conservadems'

In [8]:
df["consequent"].iloc[0]

'I don\'t think any of us---even economic gurus like Paul Krugman---really, truly understand just how bad it could\'ve gotten "on Main Street"'

In [9]:
df["sentence"].iloc[0][df["consequent_startid"].iloc[0]:df["consequent_endid"].iloc[0]]

'I don\'t think any of us---even economic gurus like Paul Krugman---really, truly understand just how bad it could\'ve gotten "on Main Street'

Check whether all indices fit the annotation  
_Note: annotation indices are inclusive!_

In [10]:
for i in range(len(df)):
    assert df["sentence"].iloc[i][df["antecedent_startid"].iloc[i]:df["antecedent_endid"].iloc[i]+1] \
        == df["antecedent"].iloc[i]
    if df["consequent_startid"].iloc[i] != -1:  # -1 marks a missing consequent; 0 is a valid start index
        assert df["sentence"].iloc[i][df["consequent_startid"].iloc[i]:df["consequent_endid"].iloc[i]+1] \
        == df["consequent"].iloc[i]
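
A variant of the check above that collects mismatching rows, instead of stopping at the first failed assertion, may be easier to debug. The following is a sketch, assuming rows are plain dictionaries with the column names used in this dataset (for example obtained via `df.to_dict("records")`). `find_span_mismatches` is a hypothetical helper, and the two rows are made-up examples rather than dataset entries.

```python
def find_span_mismatches(rows):
    """Return (sentenceID, part) pairs whose inclusive indices do not reproduce the annotated text."""
    mismatches = []
    for row in rows:
        for part in ("antecedent", "consequent"):
            start, end = row[f"{part}_startid"], row[f"{part}_endid"]
            if start == -1:  # this part is not annotated for the sentence
                continue
            if row["sentence"][start:end + 1] != row[part]:
                mismatches.append((row["sentenceID"], part))
    return mismatches


# Made-up rows: the first is consistent, the second has a deliberately wrong end index.
rows = [
    {"sentenceID": 1, "sentence": "If it had rained, we would have stayed.",
     "antecedent": "If it had rained", "antecedent_startid": 0, "antecedent_endid": 15,
     "consequent": "we would have stayed", "consequent_startid": 18, "consequent_endid": 37},
    {"sentenceID": 2, "sentence": "I wish it had rained.",
     "antecedent": "I wish it had rained", "antecedent_startid": 0, "antecedent_endid": 18,
     "consequent": "{}", "consequent_startid": -1, "consequent_endid": -1},
]
print(find_span_mismatches(rows))
```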

__The consequent part does not always exist!__

In [11]:
df.loc[df['consequent_startid'] == -1]

Unnamed: 0,sentenceID,sentence,antecedent,consequent,antecedent_startid,antecedent_endid,consequent_startid,consequent_endid
9,200009,Thanks for the article on this new term that f...,wish all your articles were worthy of praise,{},62,105,-1,-1
16,200016,Raise whatever ya got handy and wish a Happy B...,Raise whatever ya got handy and wish a Happy B...,{},0,111,-1,-1
17,200017,"Investors should indeed bear some of the pain,...",she would have done better to venture into thi...,{},51,117,-1,-1
21,200021,"Later, Mr Chaffetz told Fox News his comment c...","his comment could have been phrased more ""smoo...",{},33,83,-1,-1
24,200024,I wish he would have said more at the State of...,I wish he would have said more at the State of...,{},0,55,-1,-1
...,...,...,...,...,...,...,...,...
3529,203529,That system has given rise to the so-called Ch...,"he had previously admitted to drug possession,...",{},254,354,-1,-1
3539,203539,"I wish that anything I said mattered, to anyone.","I wish that anything I said mattered, to anyone",{},0,46,-1,-1
3542,203542,"bu t the problem with that, is other people's ...",i just wish we could have all you idiot's abor...,{},67,133,-1,-1
3545,203545,"""In the past, I should have tried to talk him ...",I should have tried to talk him out of it,{},14,54,-1,-1


The consequent is missing in this many sentences (out of the total):

In [12]:
df_without_conseq = df.loc[df['consequent_startid'] == -1]
print(f"{len(df_without_conseq)} / {len(df)}")

520 / 3551


Let's check the distribution of sentence lengths (in whitespace-separated tokens) and whether a missing consequent correlates with sentence length.

In [13]:
all_lens = [len(s.split())  for s in df["sentence"].values.tolist()]
no_conseq_lens = [len(s.split())  for s in df_without_conseq["sentence"].values.tolist()]

In [14]:
all_lens[:10]

[45, 18, 82, 30, 16, 50, 30, 14, 39, 21]

In [15]:
import matplotlib.pyplot as plt

values1 = all_lens
values2= no_conseq_lens
bins=100
_range=(0,max(all_lens))

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111) 
ax.hist(values1, alpha=0.5, bins=bins, range=_range, color= 'b', label='All sentences')
ax.hist(values2, alpha=0.5, bins=bins, range=_range, color= 'r', label='Sentences without consequent')
ax.legend(loc='upper right', prop={'size':14})
plt.show()

<Figure size 800x600 with 1 Axes>

The distribution of sentences without a consequent is skewed slightly toward shorter lengths, but there does not seem to be a strong relationship between sentence length and a missing consequent.
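
The visual impression could also be backed by a number. One option, not used in the analysis above, is the point-biserial correlation between sentence length and a 0/1 flag for "no consequent", which is simply the Pearson correlation with a binary variable. The following is a pure-Python sketch; the lengths and flags are made-up toy values. In this notebook they would presumably come from `all_lens` and from `df['consequent_startid'] == -1`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data (assumption): token counts and a flag that is 1 when the consequent is missing.
lengths = [45, 18, 82, 30, 16, 50, 30, 14]
no_consequent = [0, 1, 0, 0, 1, 0, 0, 1]

r = pearson(lengths, no_consequent)
print(f"point-biserial r = {r:.3f}")
```

A value near 0 would support the reading that length and a missing consequent are largely unrelated; a clearly negative value would indicate that sentences without a consequent tend to be shorter.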