# FEVER Database Structures (4x)

I created four database structures to access the Wikipedia and claim data. The WikiDatabase gives access to the basic information from the Wikipedia pages and the ClaimDatabase gives access to the basic information from the claims. The text and claims in these two databases are not yet split into tokens. The WikiDatabaseNgrams and ClaimDatabaseNgrams databases give access to the tokenized text of the Wikipedia pages and claims. Tokenizing with the spaCy package is costly, so the tokenized text is stored separately for each tokenization method.

- WikiDatabase
- ClaimDatabase
- WikiDatabaseNgrams
- ClaimDatabaseNgrams
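
Because tokenization is costly, the n-gram databases store its result instead of recomputing it. The following is only a minimal sketch of that caching idea, not the project's implementation: the `token_cache` table, the `cached_tokens` helper and the whitespace tokenizer standing in for spaCy are all hypothetical.

```python
import json
import sqlite3


def simple_tokenize(text):
    # Stand-in for the costly spaCy tokenizer (assumption, for illustration only).
    return text.split()


def cached_tokens(conn, method, item_id, text, tokenizer=simple_tokenize):
    """Return the tokens for item_id, running the tokenizer only on a cache miss."""
    row = conn.execute(
        "SELECT tokens FROM token_cache WHERE method = ? AND item_id = ?",
        (method, item_id),
    ).fetchone()
    if row is not None:
        # Cache hit: decode the stored JSON list and skip tokenization.
        return json.loads(row[0])
    tokens = tokenizer(text)
    conn.execute(
        "INSERT INTO token_cache (method, item_id, tokens) VALUES (?, ?, ?)",
        (method, item_id, json.dumps(tokens)),
    )
    conn.commit()
    return tokens


# Hypothetical schema: one row per (tokenization method, item id).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_cache ("
    "method TEXT, item_id INTEGER, tokens TEXT, "
    "PRIMARY KEY (method, item_id))"
)

first = cached_tokens(conn, "tokenize", 1, "Tilda Swinton is a vegan .")
second = cached_tokens(conn, "tokenize", 1, "Tilda Swinton is a vegan .")
```

The second call reads the stored tokens and never invokes the tokenizer. Keying the cache on the method name keeps the results of different tokenization methods apart.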

# Wikipedia Database

In [1]:
import os 

from wiki_database import WikiDatabaseSqlite

import config

path_wiki_pages = os.path.join(config.ROOT, config.DATA_DIR, config.WIKI_PAGES_DIR, 'wiki-pages')
path_wiki_database_dir = os.path.join(config.ROOT, config.DATA_DIR, config.DATABASE_DIR)

wiki_database = WikiDatabaseSqlite(path_wiki_database_dir, path_wiki_pages)


WikiDatabase
***finished***


In [30]:
document_id = 1
title = '1986 NBA Finals'

## Title of Wikipedia page

In [4]:
wiki_database.get_item(input_type='id', input_value=document_id, output_type='title')

'1986 NBA Finals'

## Text of Wikipedia page

In [5]:
wiki_database.get_item(input_type='id', input_value=document_id, output_type='text')

"The 1986 NBA Finals was the championship round of the 1985 -- 86 NBA season . It pitted the Eastern Conference champion Boston Celtics against the Western Conference champion Houston Rockets , in a rematch of the 1981 Finals ( only Allen Leavell and Robert Reid remained from the Rockets ' 1981 team ) . The Celtics defeated the Rockets four games to two to win their 16th NBA championship . The championship would be the Celtics ' last until the 2008 NBA Finals . Larry Bird was named the Finals MVP .   On another note , this series marked the first time the `` NBA Finals '' branding was officially used , as they dropped the `` NBA World Championship Series '' branding which had been in use since the beginning of the league , though it had been unofficially called the `` NBA Finals '' for years .   Until the 2011 series , this was the last time the NBA Finals had started before June . Since game three , all NBA Finals games have been played in June . Starting with the following year , the

## List of sentences of Wikipedia page

In [6]:
wiki_database.get_item(input_type='id', input_value=document_id, output_type='lines')

['The 1986 NBA Finals was the championship round of the 1985 -- 86 NBA season .',
 "It pitted the Eastern Conference champion Boston Celtics against the Western Conference champion Houston Rockets , in a rematch of the 1981 Finals -LRB- only Allen Leavell and Robert Reid remained from the Rockets ' 1981 team -RRB- .",
 'The Celtics defeated the Rockets four games to two to win their 16th NBA championship .',
 "The championship would be the Celtics ' last until the 2008 NBA Finals .",
 'Larry Bird was named the Finals MVP .',
 '',
 '',
 "On another note , this series marked the first time the `` NBA Finals '' branding was officially used , as they dropped the `` NBA World Championship Series '' branding which had been in use since the beginning of the league , though it had been unofficially called the `` NBA Finals '' for years .",
 '',
 '',
 'Until the 2011 series , this was the last time the NBA Finals had started before June .',
 'Since game three , all NBA Finals games have been pl

## From title to document id

In [7]:
wiki_database.get_item(input_type='title', input_value=title, output_type='id')

1
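
The `get_item` calls above look up one field of a page by another (id to title, title to id). The following is a rough sketch of how such a lookup could work on top of SQLite. The `pages` table, its column names and this standalone `get_item` function are assumptions for illustration; the actual schema of `WikiDatabaseSqlite` may differ.

```python
import sqlite3

# Hypothetical column names, not taken from the project code.
ALLOWED_COLUMNS = {"id", "title", "text"}


def get_item(conn, input_type, input_value, output_type):
    """Return the output_type column of the page whose input_type column matches."""
    if input_type not in ALLOWED_COLUMNS or output_type not in ALLOWED_COLUMNS:
        raise ValueError("unknown input_type or output_type")
    # The column names are checked against a whitelist above, so they can be
    # formatted into the query; the value is passed as a bound parameter.
    query = f"SELECT {output_type} FROM pages WHERE {input_type} = ?"
    row = conn.execute(query, (input_value,)).fetchone()
    if row is None:
        raise KeyError(input_value)
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT UNIQUE, text TEXT)"
)
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    (1, "1986 NBA Finals", "placeholder page text"),
)

page_title = get_item(conn, "id", 1, "title")
page_id = get_item(conn, "title", "1986 NBA Finals", "id")
```

Making both `id` and `title` unique keys means either one can serve as the lookup key without scanning the whole table.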

# Claim Database

In [8]:
import os

from claim_database import ClaimDatabase

import config

path_raw_data = os.path.join(config.ROOT, config.DATA_DIR, config.RAW_DATA_DIR)
path_database_dir = os.path.join(config.ROOT, config.DATA_DIR, config.DATABASE_DIR)
claim_data_set = 'dev' # 'dev' or 'train'

claim_database = ClaimDatabase(path_dir_database=path_database_dir,
                               path_raw_data_dir=path_raw_data,
                               claim_data_set=claim_data_set,
                               wiki_database=wiki_database)


ClaimDatabase
***finished***


In [9]:
claim_id = 1

## claim id to original id of claim
The claim ids in the provided dataset are very large and appear random. I renumbered the claims consecutively, starting from zero. The following line retrieves the original claim id.

In [10]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='id_number')

194462
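
A minimal sketch of this renumbering, assuming the new id is simply the position of the claim in the dataset. The `renumber` helper is hypothetical. Only the pair 1 -> 194462 comes from the output above; the other ids are made-up placeholders.

```python
def renumber(original_ids):
    """Map arbitrary original ids to consecutive ids 0..N-1, and back."""
    new_to_original = list(original_ids)
    original_to_new = {orig: new for new, orig in enumerate(new_to_original)}
    return new_to_original, original_to_new


# 900001 and 900003 are placeholders; 194462 is the id shown above.
original_ids = [900001, 194462, 900003]
new_to_original, original_to_new = renumber(original_ids)

original_of_claim_1 = new_to_original[1]
new_id_of_194462 = original_to_new[194462]
```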

## verifiable

A claim can be 'VERIFIABLE' or 'NOT VERIFIABLE'. The output_type 'verifiable_str' returns the flag as a string and the output_type 'verifiable_int' returns it as an integer. The integer flag can be used to train the sentence retrieval model. Storing the integer version directly avoids any ambiguity about the encoding later in the model.

In [12]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='verifiable_str')

'NOT VERIFIABLE'

In [13]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='verifiable_int')

0

## label

A claim can be 'NOT ENOUGH INFO', 'SUPPORTS' or 'REFUTES'. The output_type 'label_str' returns the label as a string and the output_type 'label_int' returns it as an integer. The integer label can be used to train the label prediction stage. Storing the integer version directly avoids any ambiguity about the encoding later in the model.

In [14]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='label_str')

'NOT ENOUGH INFO'

In [15]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='label_int')

2
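
The string and integer versions can be kept consistent by defining each mapping once. In the sketch below, only 'NOT VERIFIABLE' -> 0 and 'NOT ENOUGH INFO' -> 2 are confirmed by the outputs above; the remaining integer values and all names are assumptions.

```python
# Confirmed by the outputs above: 'NOT VERIFIABLE' -> 0, 'NOT ENOUGH INFO' -> 2.
# The other integer values are assumptions made for this sketch.
VERIFIABLE_TO_INT = {"NOT VERIFIABLE": 0, "VERIFIABLE": 1}
LABEL_TO_INT = {"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2}

# Inverse mapping, useful for turning model predictions back into strings.
INT_TO_LABEL = {value: key for key, value in LABEL_TO_INT.items()}


def encode_labels(labels):
    """Convert a list of label strings to their integer codes."""
    return [LABEL_TO_INT[label] for label in labels]


encoded = encode_labels(["NOT ENOUGH INFO", "SUPPORTS"])
decoded = [INT_TO_LABEL[code] for code in encoded]
```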

## text of claim

The text of a claim can be recovered with the following line.

In [16]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='claim')

'Tilda Swinton is a vegan.'

# evidence

In [17]:
claim_id=7

The evidence of a claim is retrieved with the following line. The evidence is a nested list: one list per annotator, each holding one entry per evidence sentence. I also created a class Evidence, which makes this nested list easier to handle.

In [18]:
claim_database.get_item(input_type='id', input_value=claim_id, output_type='evidence')

[[[260471, 258880, 4307072, 3]], [[260473, 258882, 4307072, 3]]]

In [19]:
from claim_database import Evidence

evidence = Evidence(claim_database.get_item(input_type='id', input_value=claim_id, output_type='evidence'))

## nr annotators

In [20]:
evidence.get_nr_annotators()

2

## nr evidence sentences for annotator X

In [21]:
evidence.get_nr_evidence_sentences(annotator_nr=0)

1

## evidence document and sentence number

In [22]:
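annotator_nr = 0
sentence_nr = 0  # index into the evidence list of this annotator
# get_evidence returns the document id and the sentence number within that document.
# Note that the returned sentence number overwrites the index set above.
document_nr, sentence_nr = evidence.get_evidence(annotator_nr=annotator_nr, sentence_nr=sentence_nr)
document_nr, sentence_nr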

(4307072, 3)
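
To make the nested structure concrete, here is a minimal sketch of a wrapper that reproduces the three calls above. `EvidenceSketch` is a hypothetical stand-in, not the actual `Evidence` class. It assumes that the last two elements of each entry are the document id and the sentence number, which matches the output shown.

```python
class EvidenceSketch:
    """Thin wrapper around the nested evidence list (illustrative only)."""

    def __init__(self, evidence):
        # evidence[annotator][sentence] is a 4-element list; this sketch
        # only uses the last two elements.
        self.evidence = evidence

    def get_nr_annotators(self):
        return len(self.evidence)

    def get_nr_evidence_sentences(self, annotator_nr):
        return len(self.evidence[annotator_nr])

    def get_evidence(self, annotator_nr, sentence_nr):
        entry = self.evidence[annotator_nr][sentence_nr]
        # Assumed layout: entry[2] is the document id, entry[3] the sentence number.
        return entry[2], entry[3]


raw = [[[260471, 258880, 4307072, 3]], [[260473, 258882, 4307072, 3]]]
sketch = EvidenceSketch(raw)
```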

# N-Grams Claim Database

- method_tokenization (options): 'tokenize', 'tag', 'lower'
- n_gram (options): 1, 2
- delimiter_option (options): True, False

In [23]:
from claim_database_n_grams import ClaimDatabaseNgrams

method_tokenization = 'tokenize' 
n_gram = 1
delimiter_option = True
claim_database_n_grams = ClaimDatabaseNgrams(path_dir_database=path_database_dir,
                                             claim_database=claim_database,
                                             method_tokenization=method_tokenization,
                                             n_gram=n_gram,
                                             delimiter_option=delimiter_option)

ClaimDatabaseNgrams
***finished***


In [24]:
claim_id = 1

In [25]:
claim_database_n_grams.get_item(input_type='id', input_value=claim_id, output_type='claim')

['Tilda', 'Swinton', 'is', 'a', 'vegan', '.']

In [26]:
from claim_database_n_grams import ClaimDatabaseNgrams

method_tokenization = 'tag' 
n_gram = 1
delimiter_option = True
claim_database_n_grams = ClaimDatabaseNgrams(path_dir_database=path_database_dir,
                                             claim_database=claim_database,
                                             method_tokenization=method_tokenization,
                                             n_gram=n_gram,
                                             delimiter_option=delimiter_option)

ClaimDatabaseNgrams
***finished***


In [27]:
claim_database_n_grams.get_item(input_type='id', input_value=claim_id, output_type='claim')

['PROPN', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT']

# N-Grams Wikipedia Database

- method_tokenization (options): 'tokenize', 'tag', 'lower'
- n_gram (options): 1, 2
- delimiter_option (options): True, False

In [23]:
from wiki_database_n_grams import WikiDatabaseNgrams

method_tokenization = 'tokenize'
n_gram = 2
delimiter_option = False
wiki_database_n_grams = WikiDatabaseNgrams(path_dir_database=path_wiki_database_dir,
                                           wiki_database=wiki_database,
                                           method_tokenization=method_tokenization,
                                           n_gram=n_gram,
                                           delimiter_option=delimiter_option)


WikiDatabaseNgrams
***finished***


In [24]:
document_id=1


## Title

In [25]:
wiki_database_n_grams.get_item(input_type='id', input_value=document_id, output_type='title')

[['1986', 'NBA'], ['NBA', 'Finals']]
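
The bigram output above is what a sliding window over the token list produces. A minimal sketch, assuming the tokens are already available; the `n_grams` helper is hypothetical, and returning a flat list for n = 1 is a choice made here to mirror the unigram outputs shown in this notebook.

```python
def n_grams(tokens, n):
    """Return all n-grams of a token list as lists of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        # Mirrors the flat unigram output shown in this notebook.
        return list(tokens)
    # Slide a window of width n over the tokens.
    return [list(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title_bigrams = n_grams(["1986", "NBA", "Finals"], 2)
claim_unigrams = n_grams(["Tilda", "Swinton", "is", "a", "vegan", "."], 1)
```

Applying the window to the whole text yields n-grams that cross sentence boundaries, while applying it to each line separately keeps them within one sentence.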

## Text

In [26]:
wiki_database_n_grams.get_item(input_type='id', input_value=document_id, output_type='text')

[['The', '1986'],
 ['1986', 'NBA'],
 ['NBA', 'Finals'],
 ['Finals', 'was'],
 ['was', 'the'],
 ['the', 'championship'],
 ['championship', 'round'],
 ['round', 'of'],
 ['of', 'the'],
 ['the', '1985'],
 ['1985', '--'],
 ['--', '86'],
 ['86', 'NBA'],
 ['NBA', 'season'],
 ['season', '.'],
 ['.', 'It'],
 ['It', 'pitted'],
 ['pitted', 'the'],
 ['the', 'Eastern'],
 ['Eastern', 'Conference'],
 ['Conference', 'champion'],
 ['champion', 'Boston'],
 ['Boston', 'Celtics'],
 ['Celtics', 'against'],
 ['against', 'the'],
 ['the', 'Western'],
 ['Western', 'Conference'],
 ['Conference', 'champion'],
 ['champion', 'Houston'],
 ['Houston', 'Rockets'],
 ['Rockets', ','],
 [',', 'in'],
 ['in', 'a'],
 ['a', 'rematch'],
 ['rematch', 'of'],
 ['of', 'the'],
 ['the', '1981'],
 ['1981', 'Finals'],
 ['Finals', '('],
 ['(', 'only'],
 ['only', 'Allen'],
 ['Allen', 'Leavell'],
 ['Leavell', 'and'],
 ['and', 'Robert'],
 ['Robert', 'Reid'],
 ['Reid', 'remained'],
 ['remained', 'from'],
 ['from', 'the'],
 ['the', 'Rocket

## Lines

In [27]:
wiki_database_n_grams.get_item(input_type='id', input_value=document_id, output_type='lines')

[[['The', '1986'],
  ['1986', 'NBA'],
  ['NBA', 'Finals'],
  ['Finals', 'was'],
  ['was', 'the'],
  ['the', 'championship'],
  ['championship', 'round'],
  ['round', 'of'],
  ['of', 'the'],
  ['the', '1985'],
  ['1985', '--'],
  ['--', '86'],
  ['86', 'NBA'],
  ['NBA', 'season'],
  ['season', '.']],
 [['It', 'pitted'],
  ['pitted', 'the'],
  ['the', 'Eastern'],
  ['Eastern', 'Conference'],
  ['Conference', 'champion'],
  ['champion', 'Boston'],
  ['Boston', 'Celtics'],
  ['Celtics', 'against'],
  ['against', 'the'],
  ['the', 'Western'],
  ['Western', 'Conference'],
  ['Conference', 'champion'],
  ['champion', 'Houston'],
  ['Houston', 'Rockets'],
  ['Rockets', ','],
  [',', 'in'],
  ['in', 'a'],
  ['a', 'rematch'],
  ['rematch', 'of'],
  ['of', 'the'],
  ['the', '1981'],
  ['1981', 'Finals'],
  ['Finals', '-LRB-'],
  ['-LRB-', 'only'],
  ['only', 'Allen'],
  ['Allen', 'Leavell'],
  ['Leavell', 'and'],
  ['and', 'Robert'],
  ['Robert', 'Reid'],
  ['Reid', 'remained'],
  ['remained', 'f