# 1 Exploratory Data Analysis

In [1]:
import json
import numpy as np

from collections import defaultdict as dd

### Read in data

In [2]:
with open('../data/raw/train-claims.json') as f:
    train_claims = json.load(f)

with open('../data/raw/dev-claims.json') as f:
    dev_claims = json.load(f)

with open('../data/raw/dev-claims-baseline.json') as f:
    dev_claims_baseline = json.load(f)

with open('../data/raw/evidence.json') as f:
    evidence = json.load(f)

with open('../data/raw/test-claims-unlabelled.json') as f:
    test_claims_unlabelled = json.load(f)

### Have a look at structure of train_claims and how many there are

In [7]:
train_claims

{'claim-1937': {'claim_text': 'Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
  'claim_label': 'DISPUTED',
  'evidences': ['evidence-442946', 'evidence-1194317', 'evidence-12171']},
 'claim-126': {'claim_text': 'El Niño drove record highs in global temperatures suggesting rise may not be down to man-made emissions.',
  'claim_label': 'REFUTES',
  'evidences': ['evidence-338219', 'evidence-1127398']},
 'claim-2510': {'claim_text': 'In 1946, PDO switched to a cool phase.',
  'claim_label': 'SUPPORTS',
  'evidences': ['evidence-530063', 'evidence-984887']},
 'claim-2021': {'claim_text': 'Weather Channel co-founder John Coleman provided evidence that convincingly refutes the concept of anthropogenic global warming.',
  'claim_label': 'DISPUTED',
  'evidences': ['evidence-1177431',
   'evidence-782448',
   'evidence-540069',
   'evidence-352655',
   'evidence-1007867']},
 'claim-2449'

In [8]:
len(train_claims)

1228

### Have a look at structure of dev_claims and how many there are

In [9]:
dev_claims

{'claim-752': {'claim_text': '[South Australia] has the most expensive electricity in the world.',
  'claim_label': 'SUPPORTS',
  'evidences': ['evidence-67732', 'evidence-572512']},
 'claim-375': {'claim_text': 'when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod\xaduces 1.3 per cent of this 3 per cent, then no amount of emissions reductio\xadn here will have any effect on global climate.',
  'claim_label': 'NOT_ENOUGH_INFO',
  'evidences': ['evidence-996421',
   'evidence-1080858',
   'evidence-208053',
   'evidence-699212',
   'evidence-832334']},
 'claim-1266': {'claim_text': 'This means that the world is now 1C warmer than it was in pre-industrial times',
  'claim_label': 'SUPPORTS',
  'evidences': ['evidence-889933', 'evidence-694262']},
 'claim-871': {'claim_text': '“As it happens, Zika may also be a good model of the second worrying effect — disease mutation.',
  'claim_label': 'NOT_ENOUGH_INFO',
  'evidences': ['evidence-422399

In [10]:
len(dev_claims)

154

### Have a look at structure of dev_claims_baseline (baseline predictions for the dev claims) and how many there are

In [11]:
dev_claims_baseline

{'claim-752': {'claim_text': '[South Australia] has the most expensive electricity in the world.',
  'claim_label': 'NOT_ENOUGH_INFO',
  'evidences': ['evidence-67732',
   'evidence-572512',
   'evidence-909871',
   'evidence-596058',
   'evidence-66394',
   'evidence-212071']},
 'claim-375': {'claim_text': 'when 3 per cent of total annual global emissions of carbon dioxide are from humans and Australia prod\xaduces 1.3 per cent of this 3 per cent, then no amount of emissions reductio\xadn here will have any effect on global climate.',
  'claim_label': 'NOT_ENOUGH_INFO',
  'evidences': ['evidence-832334',
   'evidence-699212',
   'evidence-1080858',
   'evidence-242107',
   'evidence-208053',
   'evidence-996421']},
 'claim-1266': {'claim_text': 'This means that the world is now 1C warmer than it was in pre-industrial times',
  'claim_label': 'SUPPORTS',
  'evidences': ['evidence-315434',
   'evidence-198055',
   'evidence-694262',
   'evidence-418670',
   'evidence-286139',
   'eviden

In [12]:
len(dev_claims_baseline)

154
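
The baseline file covers the same claim IDs as dev_claims but with predicted labels and evidence lists, so the two can be compared directly. Below is a minimal sketch of such a comparison on the two claims displayed above. The metric choice (per-claim evidence F1 averaged over claims, plus label accuracy) and the helper names `evidence_f1` and `score` are assumptions for illustration, not an official scoring script.

```python
# Toy sample copied from the first two records displayed above.
gold = {
    'claim-752': {'claim_label': 'SUPPORTS',
                  'evidences': ['evidence-67732', 'evidence-572512']},
    'claim-375': {'claim_label': 'NOT_ENOUGH_INFO',
                  'evidences': ['evidence-996421', 'evidence-1080858',
                                'evidence-208053', 'evidence-699212',
                                'evidence-832334']},
}
pred = {
    'claim-752': {'claim_label': 'NOT_ENOUGH_INFO',
                  'evidences': ['evidence-67732', 'evidence-572512',
                                'evidence-909871', 'evidence-596058',
                                'evidence-66394', 'evidence-212071']},
    'claim-375': {'claim_label': 'NOT_ENOUGH_INFO',
                  'evidences': ['evidence-832334', 'evidence-699212',
                                'evidence-1080858', 'evidence-242107',
                                'evidence-208053', 'evidence-996421']},
}


def evidence_f1(gold_ids, pred_ids):
    """F1 between the gold and predicted evidence ID sets of one claim."""
    gold_set, pred_set = set(gold_ids), set(pred_ids)
    overlap = len(gold_set & pred_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def score(gold, pred):
    """Return (mean evidence F1, label accuracy) over the gold claims."""
    f1s = []
    correct = 0
    for claim_id, gold_claim in gold.items():
        pred_claim = pred[claim_id]
        f1s.append(evidence_f1(gold_claim['evidences'], pred_claim['evidences']))
        correct += gold_claim['claim_label'] == pred_claim['claim_label']
    return sum(f1s) / len(f1s), correct / len(gold)


mean_f1, label_acc = score(gold, pred)
print(mean_f1, label_acc)
```

With the full files loaded, the same call would be `score(dev_claims, dev_claims_baseline)`.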

### Have a look at structure of evidence and how many there are

In [13]:
evidence

{'evidence-0': 'John Bennet Lawes, English entrepreneur and agricultural scientist',
 'evidence-1': 'Lindberg began his professional career at the age of 16, eventually moving to New York City in 1977.',
 'evidence-2': "``Boston (Ladies of Cambridge)'' by Vampire Weekend",
 'evidence-3': 'Gerald Francis Goyer (born October 20, 1936) was a professional ice hockey player who played 40 games in the National Hockey League.',
 'evidence-4': 'He detected abnormalities of oxytocinergic function in schizoaffective mania, post-partum psychosis and how ECT modified oxytocin release.',
 'evidence-5': 'With peak winds of 110 mph (175 km/h) and a minimum pressure of 972 mbar (hPa ; 28.71 inHg), Florence was the strongest storm of the 1994 Atlantic hurricane season.',
 'evidence-6': 'He is currently a professor of piano at the University of Wisconsin -- Madison since August 2000.',
 'evidence-7': 'In addition to known and tangible risks, unforeseeable black swan extinction events may occur, presenti

In [14]:
len(evidence)

1208827

### Have a look at structure of test_claims_unlabelled (which we will later refer to as 'future', because we will make our own test set) and how many there are

In [15]:
test_claims_unlabelled

{'claim-2967': {'claim_text': 'The contribution of waste heat to the global climate is 0.028 W/m2.'},
 'claim-979': {'claim_text': '“Warm weather worsened the most recent five-year drought, which included the driest four-year period on record in terms of statewide precipitation.'},
 'claim-1609': {'claim_text': 'Greenland has only lost a tiny fraction of its ice mass.'},
 'claim-1020': {'claim_text': '“The global reef crisis does not necessarily mean extinction for coral species.'},
 'claim-2599': {'claim_text': 'Small amounts of very active substances can cause large effects.'},
 'claim-2110': {'claim_text': "They changed the name from 'global warming' to 'climate change'"},
 'claim-1135': {'claim_text': 'Scientists confirm a mass bleaching event on the Great Barrier Reef this year has killed more corals than ever before, with more than two thirds destroyed across large swathes of the biodiverse site.'},
 'claim-712': {'claim_text': '“Instead of a three-foot increase in ocean levels b

In [16]:
len(test_claims_unlabelled)

153

### Have a look at how long each claim is (in characters), because BERT accepts at most 512 tokens (and subword tokenisation splits words into several tokens, eating up more positions)

In [17]:
# Length of each claim in characters (not words or tokens)
sent_length = []
n = 0
max_len = 0

for claim in train_claims:
    n += 1
    text = train_claims[claim]['claim_text']
    sent_length.append(len(text))

    # Print every claim that sets a new maximum length
    if len(text) > max_len:
        max_len = len(text)
        print(text, '\n')

print(sum(sent_length)/n)  # mean length in characters
print(max_len)             # maximum length in characters

Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life. 

The last time the planet was even four degrees warmer, Peter Brannen points out in The Ends of the World, his new history of the planet’s major extinction events, the oceans were hundreds of feet higher. 

When stomata-derived CO2 (red) is compared to ice core-derived CO2 (blue), the stomata generally show much more variability in the atmospheric CO2 level and often show levels much higher than the ice cores. 

The research also revealed how large areas of the polar ice caps could collapse and significant changes to ecosystems could see the Sahara Desert become green and the edges of tropical forests turn into fire-dominated savanna. 

Study finds low probability of both very low and very high climate sensitivities, and its lower estimate (as compared to the IPCC) is based on a new temperature reconstruction of the Last Glacial Maxim

In [19]:
# look at the distribution of lengths
np.percentile(sent_length, [0, 5, 10, 25, 50, 75, 90, 95, 100])

array([ 26.  ,  44.35,  57.7 ,  82.  , 115.  , 156.  , 198.  , 233.  ,
       332.  ])
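
The lengths above are character counts, while the BERT limit is expressed in tokens. The sketch below reports both character and whitespace-word counts for one of the claims shown earlier. The helper name `length_stats` is hypothetical. Treating the character count as a loose upper bound on the token count assumes every subword token covers at least one character. An exact check would need the tokenizer of the model actually used.

```python
MAX_TOKENS = 512  # BERT position limit, special tokens included


def length_stats(text):
    """Character count and whitespace-separated word count of a text."""
    return {'chars': len(text), 'words': len(text.split())}


sample = 'In 1946, PDO switched to a cool phase.'
stats = length_stats(sample)

# Heuristic only: words <= subword tokens <= characters (plus special tokens),
# so a text whose character count is well under the limit should fit.
likely_fits = stats['chars'] < MAX_TOKENS
print(stats, likely_fits)
```

By this heuristic the longest training claim (332 characters) should fit on its own, but the claim is usually paired with evidence text, so the combined length is what matters.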

### Look at the distribution of labels, to get a Zero-R (majority-class) baseline to compare the final classifier against

For train

In [20]:
from collections import defaultdict as dd

# Count claims per label and per number of evidence passages
label_distribution = dd(int)
evidence_number = dd(int)

for claim in train_claims:
    label_distribution[train_claims[claim]['claim_label']] += 1
    evidence_number[len(train_claims[claim]['evidences'])] += 1

print(label_distribution)
print(evidence_number)

defaultdict(<class 'int'>, {'DISPUTED': 124, 'REFUTES': 199, 'SUPPORTS': 519, 'NOT_ENOUGH_INFO': 386})
defaultdict(<class 'int'>, {3: 191, 2: 223, 5: 477, 1: 210, 4: 127})


For dev

In [23]:
## DISTRIBUTION OF LABELS

# Count claims per label and per number of evidence passages
label_distribution = dd(int)
evidence_number = dd(int)

for claim in dev_claims:
    label_distribution[dev_claims[claim]['claim_label']] += 1
    evidence_number[len(dev_claims[claim]['evidences'])] += 1

print(label_distribution)
print(evidence_number)

defaultdict(<class 'int'>, {'SUPPORTS': 68, 'NOT_ENOUGH_INFO': 41, 'REFUTES': 27, 'DISPUTED': 18})
defaultdict(<class 'int'>, {2: 29, 5: 52, 4: 16, 3: 26, 1: 31})
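
The Zero-R baseline always predicts the most frequent training label. A minimal sketch of that calculation, using the label counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Label counts copied from the train and dev outputs above.
train_labels = Counter({'DISPUTED': 124, 'REFUTES': 199,
                        'SUPPORTS': 519, 'NOT_ENOUGH_INFO': 386})
dev_labels = Counter({'SUPPORTS': 68, 'NOT_ENOUGH_INFO': 41,
                      'REFUTES': 27, 'DISPUTED': 18})

# Zero-R: pick the majority class on train, then score it on each split.
majority_label, majority_count = train_labels.most_common(1)[0]
train_acc = majority_count / sum(train_labels.values())
dev_acc = dev_labels[majority_label] / sum(dev_labels.values())

print(majority_label, round(train_acc, 3), round(dev_acc, 3))
```

The majority class is SUPPORTS, giving about 42% accuracy on train (519 of 1228) and about 44% on dev (68 of 154).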


### Have a look at how long each evidence passage is (in characters), because BERT accepts at most 512 tokens (and subword tokenisation splits words into several tokens, eating up more positions)

In [21]:
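## LENGTH of evidence (in characters, not words or tokens)

sent_length = []
n = 0
max_len = 0

for evid_id in evidence:  # avoid shadowing the built-in id()
    n += 1
    sent_length.append(len(evidence[evid_id]))

    # Print every passage that sets a new maximum length
    if len(evidence[evid_id]) > max_len:
        max_len = len(evidence[evid_id])
        print(evidence[evid_id], '\n')

print(sum(sent_length)/n)  # mean length in characters
print(max_len)             # maximum length in characters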

John Bennet Lawes, English entrepreneur and agricultural scientist 

Lindberg began his professional career at the age of 16, eventually moving to New York City in 1977. 

Gerald Francis Goyer (born October 20, 1936) was a professional ice hockey player who played 40 games in the National Hockey League. 

He detected abnormalities of oxytocinergic function in schizoaffective mania, post-partum psychosis and how ECT modified oxytocin release. 

With peak winds of 110 mph (175 km/h) and a minimum pressure of 972 mbar (hPa ; 28.71 inHg), Florence was the strongest storm of the 1994 Atlantic hurricane season. 

He is best known as author of The Prize : The Epic Quest for Oil, Money, and Power (1991) and The Quest : Energy, Security, and the Remaking of the Modern World (2011). 

The Academic Chronicle (Московско-Академическая летопись, Moskovskaya akademicheskaya letopis) or Suzdal ' Chronicle (Суздальская летопись, Suzdalskaya Letopis) is a late 15th-century compilation of other Russian-l

In [22]:
# look at the distribution of lengths
np.percentile(sent_length, [0, 5, 10, 25, 50, 75, 90, 95, 100])

array([1.000e+00, 4.600e+01, 5.400e+01, 7.400e+01, 1.060e+02, 1.500e+02,
       2.000e+02, 2.370e+02, 3.148e+03])
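
The percentiles show a long tail: the shortest passage has 1 character and the longest 3148. One way to inspect those extremes is to list the passage IDs outside a chosen length range. This is only a sketch: the helper `length_outliers`, its thresholds, and the two `toy-` entries are hypothetical.

```python
def length_outliers(passages, min_chars=10, max_chars=1000):
    """Return IDs of passages shorter than min_chars or longer than max_chars."""
    too_short = [pid for pid, text in passages.items() if len(text) < min_chars]
    too_long = [pid for pid, text in passages.items() if len(text) > max_chars]
    return too_short, too_long


toy_evidence = {
    # Real entry displayed above
    'evidence-0': 'John Bennet Lawes, English entrepreneur and agricultural scientist',
    # Hypothetical entries standing in for the extremes of the distribution
    'toy-short': '.',
    'toy-long': 'word ' * 700,
}

too_short, too_long = length_outliers(toy_evidence)
print(too_short, too_long)
```

On the full corpus the call would be `length_outliers(evidence)`, with thresholds adjusted after looking at a few examples.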

### Check how many distinct evidence passages are referenced by the original train and dev claims

In [None]:
seen_evidence = set()

for claim in train_claims:
    for evid in train_claims[claim]['evidences']:
        seen_evidence.add(evid)

for claim in dev_claims:
    for evid in dev_claims[claim]['evidences']:
        seen_evidence.add(evid)

In [None]:
len(seen_evidence)

3443
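
Only 3443 of the 1208827 evidence passages (roughly 0.3%) are referenced by a train or dev claim. The sketch below computes that fraction and also checks that every referenced ID actually exists in the evidence collection. The helper `evidence_coverage` and the toy data are hypothetical.

```python
def evidence_coverage(claim_sets, evidence):
    """Return (used IDs, IDs missing from evidence, fraction of evidence used)."""
    used = set()
    for claims in claim_sets:
        for claim in claims.values():
            used.update(claim['evidences'])
    # Referenced IDs that do not appear in the evidence collection
    missing = {evid for evid in used if evid not in evidence}
    coverage = (len(used) - len(missing)) / len(evidence)
    return used, missing, coverage


# Toy data for illustration only
toy_train = {'claim-a': {'evidences': ['evidence-0', 'evidence-2']}}
toy_dev = {'claim-b': {'evidences': ['evidence-2', 'evidence-9']}}
toy_evidence = {f'evidence-{i}': f'passage {i}' for i in range(4)}

used, missing, coverage = evidence_coverage([toy_train, toy_dev], toy_evidence)
print(len(used), missing, coverage)
```

On the real data the call would be `evidence_coverage([train_claims, dev_claims], evidence)`. A non-empty `missing` set would point to a data problem worth investigating before training a retriever.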