# Using KERMIT - Building dataset -

This first notebook explains how to build the syntactic input for the KERMIT system using the KERMIT encoder.

Links for downloading the datasets used here are included, but you can also use a dataset of your choice.

## Install Packages
Before starting, make sure the following requirements are available:
- stanford-corenlp-full-2018-10-05: used to build the trees in parenthetical form.
- KERMIT: the library itself, listed here for completeness.
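
For readers new to the format, a tree in parenthetical form is a bracketed string in which every constituent is written as `(LABEL children...)`. The sketch below is only an illustration: the example tree is hand-written rather than real CoreNLP output, and `is_balanced` and `leaves` are hypothetical helpers that belong to neither KERMIT nor stanfordcorenlp.

```python
def is_balanced(tree):
    """Return True if every '(' in the string has a matching ')'."""
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def leaves(tree):
    """Return the words (leaf tokens) of a bracketed tree, left to right."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    words = []
    for prev, tok, nxt in zip(tokens, tokens[1:], tokens[2:]):
        # A word follows a label (so it is not right after "(")
        # and is immediately closed by ")".
        if tok not in ("(", ")") and prev != "(" and nxt == ")":
            words.append(tok)
    return words


# Hand-written example tree (an assumption, for illustration only).
example_tree = (
    "(ROOT (S (NP (DT The) (NN president)) "
    "(VP (VBD advised) (NP (DT the) (NN doctor))) (. .)))"
)

print(is_balanced(example_tree))
print(" ".join(leaves(example_tree)))
```

A quick check like `is_balanced` can help spot truncated or malformed trees before they are passed to the encoder.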

In [2]:
#Install stanfordcorenlp 
#!pip install stanfordcorenlp
#!wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
#!unzip -o stanford-corenlp-full-2018-10-05.zip

#Install KERMIT
#!git clone https://github.com/ART-Group-it/kerMIT
#!pip install ./kerMIT/kerMIT
#!apt install openjdk-8-jdk --assume-yes


## Import Statements

In [1]:
import time, pickle, ast, os
from tqdm import tqdm
import pandas as pd
import numpy as np

#helpers for reading/writing trees
from scripts.script import readP, writeTree
#helper for building DTKs
from scripts.script import createDTK
#helper for parsing sentences
from scripts.script import parse

In [3]:
!wget http://160.80.97.56:8006/dtk_trees_multiNLI_dev_sentence_2_tot.pkl

--2021-01-24 13:13:22--  http://160.80.97.56:8006/dtk_trees_multiNLI_dev_sentence_2_tot.pkl
Connecting to 160.80.97.56:8006... failed: Connection refused.


## Download  Dataset

The datasets used in our work were randomly sampled.

* *Note: the text of this tutorial refers to ag_news (train and test set), while the cell below reads a MultiNLI CSV file; you can use any other dataset you prefer.*

In [2]:
MULTI_NLI = "../../samir/MultiNLI_with_REL_final.csv"
HANS = "../../samir/HANS_with_REL_final.csv"

data_train = pd.read_csv(MULTI_NLI) #nrows=75000 skiprows=75001


In [3]:
len(data_train)

30000

In [4]:
data_train.head()

| Unnamed: 0.1 | Unnamed: 0 | gold_label | sentence1_binary_parse | sentence2_binary_parse | sentence1_parse | sentence2_parse | sentence1 | sentence2 | pairID | heuristic | subcase | template | s1_rel | s2_rel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | non-entailment | ( ( The president ) ( ( advised ( the doctor )... | ( ( The doctor ) ( ( advised ( the president )... | (ROOT (S (NP (DT The) (NN president)) (VP (VBD... | (ROOT (S (NP (DT The) (NN doctor)) (VP (VBD ad... | The president advised the doctor . | The doctor advised the president . | ex0 | lexical_overlap | ln_subject/object_swap | temp1 | (ROOT (S (NP (DT The) (NN-- REL1 -- president-... | (ROOT (S (NP (DT The) (NN-- REL3 -- doctor-- R... |
| 1 | 1 | non-entailment | ( ( The student ) ( ( saw ( the managers ) ) .... | ( ( The managers ) ( ( saw ( the student ) ) .... | (ROOT (S (NP (DT The) (NN student)) (VP (VBD s... | (ROOT (S (NP (DT The) (NNS managers)) (VP (VBD... | The student saw the managers . | The managers saw the student . | ex1 | lexical_overlap | ln_subject/object_swap | temp1 | (ROOT (S (NP (DT The) (NN-- REL1 -- student-- ... | (ROOT (S (NP (DT The) (NNS-- REL3 -- managers-... |
| 2 | 2 | non-entailment | ( ( The presidents ) ( ( encouraged ( the bank... | ( ( The banker ) ( ( encouraged ( the presiden... | (ROOT (S (NP (DT The) (NNS presidents)) (VP (V... | (ROOT (S (NP (DT The) (NN banker)) (VP (VBD en... | The presidents encouraged the banker . | The banker encouraged the presidents . | ex2 | lexical_overlap | ln_subject/object_swap | temp1 | (ROOT (S (NP (DT The) (NNS-- REL1 -- president... | (ROOT (S (NP (DT The) (NN-- REL3 -- banker-- R... |
| 3 | 3 | non-entailment | ( ( The senators ) ( ( supported ( the actor )... | ( ( The actor ) ( ( supported ( the senators )... | (ROOT (S (NP (DT The) (NNS senators)) (VP (VBD... | (ROOT (S (NP (DT The) (NN actor)) (VP (VBD sup... | The senators supported the actor . | The actor supported the senators . | ex3 | lexical_overlap | ln_subject/object_swap | temp1 | (ROOT (S (NP (DT The) (NNS-- REL1 -- senators-... | (ROOT (S (NP (DT The) (NN-- REL3 -- actor-- RE... |
| 4 | 4 | non-entailment | ( ( The actors ) ( ( avoided ( the bankers ) )... | ( ( The bankers ) ( ( avoided ( the actors ) )... | (ROOT (S (NP (DT The) (NNS actors)) (VP (VBD a... | (ROOT (S (NP (DT The) (NNS bankers)) (VP (VBD ... | The actors avoided the bankers . | The bankers avoided the actors . | ex4 | lexical_overlap | ln_subject/object_swap | temp1 | (ROOT (S (NP (DT The) (NNS-- REL1 -- actors-- ... | (ROOT (S (NP (DT The) (NNS-- REL3 -- bankers--... |


## Building parenthetical trees and encoding them as Universal Syntactic Embeddings
Here the loaded dataset is processed, converted into tree form, and encoded with the kerMIT encoder.


While the loop runs, each parenthetical tree is written to a text file, a log file records how many dataset rows have been processed, and the encoded trees are saved in pickle format.
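
To make the pickle step concrete, the sketch below stores a sequence of objects as consecutive pickle records in one file and reads them back. It is a sketch under assumptions: `append_pickles` and `read_pickles` are hypothetical helpers, and `read_pickles` only stands in for `readP` from `scripts.script`, whose actual implementation is not shown in this notebook. Small lists stand in for the encoded trees.

```python
import os
import pickle
import tempfile


def append_pickles(path, objects):
    """Append each object as its own pickle record at the end of the file."""
    with open(path, "ab") as f:
        for obj in objects:
            pickle.dump(obj, f)


def read_pickles(path):
    """Read back every record written by repeated pickle.dump calls."""
    records = []
    with open(path, "rb") as f:
        while True:
            try:
                records.append(pickle.load(f))
            except EOFError:  # raised once there are no more records
                break
    return records


# Demo on a temporary file; the lists stand in for encoded trees.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "dtk_trees_demo.pkl")
    append_pickles(demo_path, [[0.1, 0.2], [0.3, 0.4]])  # first checkpoint
    append_pickles(demo_path, [[0.5, 0.6]])              # second checkpoint
    restored = read_pickles(demo_path)

print(len(restored))
```

Opening the file in append mode avoids re-reading and rewriting the earlier records at every checkpoint, which is one possible alternative to the read-then-rewrite approach used in the cells below.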

## Building parenthetical trees for Training Set

In [7]:
#insert here your dataset name
name = "MultiNLI_final"
dataset_name = f'{name}_REL_sencence_1'


name = 'dtk_trees_'+dataset_name+'.pkl'
name2 = 'log_'+dataset_name+'.txt'
name3 = 'dt_'+dataset_name+'.txt'

i = 0
cont = 0
listTree = []
newList = []
oldList = []

tree = "(S)"
treeException = createDTK(tree)


for line in tqdm(data_train['s1_rel']):
    cont += 1
    try: 
        #the s1_rel column already holds parenthetical trees, so parsing is skipped
        #tree = (parse(line))
        tree = line
        treeDTK = createDTK(tree)
    except Exception:
        tree, treeDTK = "(S)", treeException
    
    listTree.append(treeDTK)   
    #write parenthetical tree
    writeTree(name3,tree)
    #roughly every 5000 trees, save the corresponding DTKs
    if i>5000:
        time.sleep(1)
        if os.path.isfile(name):
            #append new encoded tree in pickle file            
            oldList = readP(name) 
            newList = oldList+listTree
        else:
            newList = listTree
        
        #rewrite the pickle file with all trees encoded so far
        with open(name, 'wb') as f:
            for x in newList:
                pickle.dump(x, f)

        #log the number of rows processed so far
        with open(name2, "a+") as f:
            f.write(str(cont)+'..')
                  
        i = 0
        listTree = []
        newList = []
        oldList = []         
    else:
        i +=1
    
    #mark selected milestones in the log file
    if cont in (74998, 150000, 300000):
        with open(name2, "a+") as f:
            f.write(str(cont)+'..!!!')

#save the trees left over after the last checkpoint
if os.path.isfile(name):
    oldList = readP(name) 
    newList = oldList+listTree
else:
    newList = listTree      
with open(name, 'wb') as f:
    for x in newList:
        pickle.dump(x, f)

100%|██████████| 30000/30000 [14:00<00:00, 35.67it/s]
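
The control flow of the cell above (encode each item, fall back to a default on failure, flush a buffer every N items, then flush what is left) can be sketched as a small reusable function. This is a simplified illustration, not the notebook's code: `encode_in_chunks`, `toy_encode`, and `save_chunk` are hypothetical names, and the toy encoder merely stands in for `createDTK`.

```python
def encode_in_chunks(items, encode, fallback, save_chunk, chunk_size=5000):
    """Encode items one by one and hand them to save_chunk in batches.

    If encode raises for an item, `fallback` is stored instead, so the
    output stays aligned with the input rows.
    """
    buffer = []
    processed = 0
    for item in items:
        processed += 1
        try:
            buffer.append(encode(item))
        except Exception:
            buffer.append(fallback)
        if len(buffer) >= chunk_size:
            save_chunk(buffer)
            buffer = []  # start a new list so the saved one is not mutated
    if buffer:  # final flush for the leftover items
        save_chunk(buffer)
    return processed


def toy_encode(tree):
    """Stand-in encoder: the string length; fails on non-strings."""
    return len(tree)


saved_chunks = []
n_processed = encode_in_chunks(
    ["(S)", None, "(NP)"],  # None triggers the fallback path
    encode=toy_encode,
    fallback=-1,
    save_chunk=saved_chunks.append,
    chunk_size=2,
)
print(n_processed, saved_chunks)
```

Keeping the fallback value in place of a failed item preserves a one-to-one correspondence between dataset rows and encoded trees, which is also what the `treeException` placeholder achieves in the notebook.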


In [None]:
#68352

## Building parenthetical trees for Test Set

In [None]:
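#insert here your dataset name
#note: data_test is not defined in the cells above; load it first as a DataFrame with a 'sentence1' column
dataset_name = 'multiNLI_test_matched'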


name = 'dtk_trees_'+dataset_name+'.pkl'
name2 = 'log_'+dataset_name+'.txt'
name3 = 'dt_'+dataset_name+'.txt'

i = 0
cont = 0
listTree = []
newList = []
oldList = []

tree = "(S)"
treeException = createDTK(tree)


for line in tqdm(data_test['sentence1']):
    cont += 1
    try: 
        tree = (parse(line))
        treeDTK = createDTK(tree)
    except Exception:
        tree, treeDTK = "(S)", treeException
    
    listTree.append(treeDTK)   
    #write parenthetical tree
    writeTree(name3,tree)
    #roughly every 5000 trees, save the corresponding DTKs
    if i>5000:
        time.sleep(1)
        if os.path.isfile(name):
            #append new encoded tree in pickle file
            oldList = readP(name) 
            newList = oldList+listTree
        else:
            newList = listTree
        
        #rewrite the pickle file with all trees encoded so far
        with open(name, 'wb') as f:
            for x in newList:
                pickle.dump(x, f)

        #log the number of rows processed so far
        with open(name2, "a+") as f:
            f.write(str(cont)+'..')
                  
        i = 0
        listTree = []
        newList = []
        oldList = []         
    else:
        i +=1

#save the trees left over after the last checkpoint
if os.path.isfile(name):
    oldList = readP(name) 
    newList = oldList+listTree
else:
    newList = listTree      
with open(name, 'wb') as f:
    for x in newList:
        pickle.dump(x, f)