# Using KERMIT - Building the Dataset

This first notebook explains how to build the syntactic input for KERMIT_system using the KERMIT encoder.

Links for downloading the datasets we used are provided below; you can also use a dataset of your choice.

## Install Packages
Before starting, make sure the following are available:
- stanford-corenlp-full-2018-10-05: used to build the parse trees in parenthetical form.
- KERMIT: the library itself, which provides the encoder used in this notebook.

In [None]:
#Install stanfordcorenlp 
#!pip install stanfordcorenlp
#!wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
#!unzip stanford-corenlp-full-2018-10-05.zip

#Install KERMIT
#!git clone https://github.com/ART-Group-it/kerMIT
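
Parsing a full dataset takes a long time, so it can help to confirm that the CoreNLP folder is in place before starting. The sketch below is only an illustration: `corenlp_ready` is a hypothetical helper (not part of KERMIT or stanfordcorenlp), and it assumes the archive was unzipped into the current working directory. It also checks for `java`, since CoreNLP runs on the JVM.

```python
import os
import shutil

def corenlp_ready(corenlp_dir):
    """Report whether a CoreNLP folder looks usable (hypothetical helper)."""
    has_dir = os.path.isdir(corenlp_dir)
    # The distribution ships as a set of .jar files inside the unzipped folder.
    has_jars = has_dir and any(f.endswith('.jar') for f in os.listdir(corenlp_dir))
    return {
        'directory_found': has_dir,
        'jars_found': has_jars,
        'java_on_path': shutil.which('java') is not None,
    }

# Assumed location: the folder created by the unzip command above.
status = corenlp_ready('stanford-corenlp-full-2018-10-05')
print(status)
```

If any of the three values is `False`, the parsing cells further down are likely to fall back to the placeholder tree for every sentence.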


## Import Statements

In [1]:
import time, pickle, ast, os
from tqdm import tqdm
import pandas as pd
import numpy as np

#helpers for reading/writing trees
from scripts.script import readP, writeTree
#helper for building the DTK encoding of a tree
from scripts.script import createDTK
#helper for parsing sentences
from scripts.script import parse

## Download Dataset

The datasets used in our work were randomly sampled; the commented lines at the end of the next cell show how.

* *Note: this tutorial uses ag_news (train and test sets), but you can choose any other dataset.*

In [4]:
#download dataset ag_news

#!wget https://data.deepai.org/agnews.zip
#!unzip agnews.zip

#! wget "https://multi-classification.s3.eu-central-1.amazonaws.com/dbpedia_csv.tar.gz"
#! wget "https://multi-classification.s3.eu-central-1.amazonaws.com/yelp_review_polarity_csv.tar.gz"
#! wget "https://multi-classification.s3.eu-central-1.amazonaws.com/yelp_review_full_csv.tar.gz"

data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')


#to sample a dataset, shuffle its rows and keep the first n
#(data_original stands for the full DataFrame you loaded, e.g. data_train)

#data = data_original.iloc[np.random.permutation(len(data_original))]
#data = data[:70000]



In [7]:
data_train.head()

| Unnamed: 0 | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
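
The random sampling mentioned in this section (shuffle the rows, keep the first n) can also be written with pandas' `DataFrame.sample`. This is a minimal sketch on a toy frame, not the exact procedure used in our work: `sample_rows`, the seed, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def sample_rows(df, n, seed=0):
    """Return up to n randomly chosen rows of df, without replacement (hypothetical helper)."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)

# Toy stand-in for a loaded dataset such as data_train.
toy = pd.DataFrame({
    'Class Index': [1, 2, 3, 4, 1, 2],
    'Description': ['text a', 'text b', 'text c', 'text d', 'text e', 'text f'],
})

subset = sample_rows(toy, 4)
print(len(subset))
```

Fixing `random_state` makes the sample reproducible, which is useful when the parsing step has to be restarted on the same subset.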


## Building parenthetical trees and encoding them as Universal Syntactic Embeddings
Here the loaded dataset is parsed into parenthetical trees, and each tree is encoded with the kerMIT encoder.


While the loop runs, each tree is written to a text file as soon as it is parsed, a log file records the number of dataset rows processed so far, and the encoded trees are saved in pickle format.
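
The cells below save the encoded trees with one `pickle.dump` call per tree, so the resulting file is a sequence of pickled objects rather than a single pickled list. The implementation of `readP` is not shown in this notebook; the sketch below only illustrates one way such a file can be read back, using a hypothetical helper `read_all_pickled` and toy vectors in place of real encodings.

```python
import os
import pickle
import tempfile

def read_all_pickled(path):
    """Load every object written to path by successive pickle.dump calls."""
    items = []
    with open(path, 'rb') as f:
        while True:
            try:
                items.append(pickle.load(f))
            except EOFError:
                # Raised once the end of the file is reached.
                break
    return items

# Toy round trip: write three small "encodings" one by one, then read them back.
path = os.path.join(tempfile.mkdtemp(), 'dtk_trees_demo.pkl')
with open(path, 'wb') as f:
    for vector in ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6]):
        pickle.dump(vector, f)

restored = read_all_pickled(path)
print(len(restored))
```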

## Building parenthetical trees for Training Set

In [None]:
#insert your dataset name here
dataset_name = 'ag_news_train'


name = 'dtk_trees_'+dataset_name+'.pkl'   #encoded trees (pickle)
name2 = 'log_'+dataset_name+'.txt'        #progress log
name3 = 'dt_'+dataset_name+'.txt'         #parenthetical trees (text)

i = 0
cont = 0
listTree = []
newList = []
oldList = []

#fallback tree and encoding, used when a sentence cannot be parsed
tree = "(S)"
treeException = createDTK(tree)


for line in tqdm(data_train['Description']):
    cont += 1
    try: 
        tree = (parse(line))
        treeDTK = createDTK(tree)
    except Exception:
        tree, treeDTK = "(S)", treeException
    
    listTree.append(treeDTK)   
    #write the parenthetical tree to the text file
    writeTree(name3,tree)
    #about every 5000 trees, save the corresponding DTKs
    if i>5000:
        time.sleep(1)
        if os.path.isfile(name):
            #merge the new encoded trees with those already in the pickle file
            oldList = readP(name) 
            newList = oldList+listTree
        else:
            newList = listTree
        
        f=open(name, 'wb')
        
        for x in newList:
            pickle.dump(x, f)
        f.close()

        f=open(name2, "a+")
        f.write(str(cont)+'..')
        f.close()
                  
        i = 0
        listTree = []
        newList = []
        oldList = []         
    else:
        i +=1

#save the trees encoded since the last checkpoint
if os.path.isfile(name):
    oldList = readP(name) 
    newList = oldList+listTree
else:
    newList = listTree      
f=open(name, 'wb')
for x in newList:
    pickle.dump(x, f)
f.close()
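
The test-set cell below repeats the same loop, so the pattern (encode each row, substitute a fallback on failure, save in batches) could be factored into one function. The following is a sketch of that pattern only: `encode_with_checkpoints`, `toy_encode`, and the `flush` callback are hypothetical stand-ins and do not call `parse`, `createDTK`, or any file I/O.

```python
def encode_with_checkpoints(lines, encode, fallback, flush, batch_size=5000):
    """Encode each line, use fallback on failure, and flush every batch_size items."""
    buffer = []
    failures = 0
    for line in lines:
        try:
            buffer.append(encode(line))
        except Exception:
            # Mirror the notebook: keep a placeholder so row order is preserved.
            buffer.append(fallback)
            failures += 1
        if len(buffer) >= batch_size:
            flush(buffer)
            buffer = []
    if buffer:
        # Save whatever is left after the last full batch.
        flush(buffer)
    return failures

def toy_encode(text):
    """Stand-in encoder: counts words and fails on empty input."""
    if not text.strip():
        raise ValueError('empty sentence')
    return len(text.split())

batches = []
failures = encode_with_checkpoints(
    ['a b', '', 'c d e', 'f', 'g h'],
    toy_encode,
    fallback=0,
    flush=batches.append,
    batch_size=2,
)
print(batches, failures)
```

In the notebook's setting, `flush` would be the step that writes the buffered encodings to the pickle file and appends the row count to the log.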

## Building parenthetical trees for Test Set

In [None]:
#insert your dataset name here
dataset_name = 'ag_news_test'


name = 'dtk_trees_'+dataset_name+'.pkl'   #encoded trees (pickle)
name2 = 'log_'+dataset_name+'.txt'        #progress log
name3 = 'dt_'+dataset_name+'.txt'         #parenthetical trees (text)

i = 0
cont = 0
listTree = []
newList = []
oldList = []

#fallback tree and encoding, used when a sentence cannot be parsed
tree = "(S)"
treeException = createDTK(tree)


for line in tqdm(data_test['Description']):
    cont += 1
    try: 
        tree = (parse(line))
        treeDTK = createDTK(tree)
    except Exception:
        tree, treeDTK = "(S)", treeException
    
    listTree.append(treeDTK)   
    #write the parenthetical tree to the text file
    writeTree(name3,tree)
    #about every 5000 trees, save the corresponding DTKs
    if i>5000:
        time.sleep(1)
        if os.path.isfile(name):
            #merge the new encoded trees with those already in the pickle file
            oldList = readP(name) 
            newList = oldList+listTree
        else:
            newList = listTree
        
        f=open(name, 'wb')
        
        for x in newList:
            pickle.dump(x, f)
        f.close()

        f=open(name2, "a+")
        f.write(str(cont)+'..')
        f.close()
                  
        i = 0
        listTree = []
        newList = []
        oldList = []         
    else:
        i +=1

#save the trees encoded since the last checkpoint
if os.path.isfile(name):
    oldList = readP(name) 
    newList = oldList+listTree
else:
    newList = listTree      
f=open(name, 'wb')
for x in newList:
    pickle.dump(x, f)
f.close()