# Data augmentation

## 1 Current data statistics

### We read in the query, logical-form, and schema files and group the examples by logical-form length; each length group may still contain several subcategories

In [1]:
import numpy as np
import random
import math

logic_category_len = dict()
query_len = dict()
schema_len = dict()
with open('./rand.lo') as f_lo:
    with open('./rand.qu') as f_qu:
        with open('./rand.fi') as f_fi:
            logic_line, query_line, schema_line = f_lo.readline(), f_qu.readline(), f_fi.readline()
            while logic_line and query_line and schema_line:
                logic = logic_line.split()
#                 if len(logic) == 13:
#                     if logic[4] == 'less':
#                         logic[0] = 'argmax'
#                     else:
#                         logic[0] = 'argmin'
#                     logic.insert(2, logic[3])
                length = len(logic)
#                 if length ==0:
#                     continue
                if length not in logic_category_len:
                    logic_category_len[length] = []
                    query_len[length] = []
                    schema_len[length] = []
                logic_category_len[length].append(logic_line)
                query_len[length].append(query_line)
                schema_len[length].append(schema_line)
                logic_line, query_line, schema_line = f_lo.readline(), f_qu.readline(), f_fi.readline()
for key in logic_category_len.keys():
    value = logic_category_len[key]
    print 'length = %d, total examples: %d' %(key, len(value))

length = 2, total examples: 2
length = 4, total examples: 156
length = 6, total examples: 1253
length = 7, total examples: 4
length = 8, total examples: 624
length = 10, total examples: 687
length = 11, total examples: 697
length = 12, total examples: 488


Have a look at the data:

In [2]:
for i in range(len(logic_category_len[7])):
    print query_len[7][i]
    #print logic_category_len[14][i]

what is the difference between the nations with the most and least amount of bronze medals

how long in years has the this world series been occurring

what is the difference between the nations with the most and least amount of gold medals

what is the difference between the nations with the most and least amount of silver medals



### Now we collect all distinct schemas in a list for later use

In [3]:
schema_collect = []
with open('./rand.fi') as f_fi:
    for line in f_fi:
        if line in schema_collect:
            continue
        schema_collect.append(line)
    
for schema in schema_collect:
    print schema

Nation Rank Gold Silver Bronze Total

Name Year_inducted Position Apps Goals

Year 1st_Venue 2nd_Venue 3rd_Venue 4th_Venue 5th_Venue 6th_Venue

Player Matches Innings Runs Average 100s 50s Games_Played Field_Goals Free_Throws Points

Team County Wins Years_won Areas Prices

Country Masters U.S._Open The_Open PGA Total

Swara Position Short_name Notation Mnemonic

State No._of_candidates No._of_elected Total_no._of_seats_in_Assembly Year_of_Election

Discipline Amanda Bernie Javine_H Julia Michelle

Nation Name Position League_Apps League_Goals FA_Cup_Apps FA_Cup_Goals Total_Apps Total_Goals

Menteri_Besar Took_office Left_office Party



## 2 Data Preparation and Generation

### Next we do some data generation; the first goal is to double our current data size (to 8k~10k)

Building on our earlier work in the file ./data_prep/categorization.txt, we have several different sentence structures for each length category. For each sentence structure, we first check whether it can be applied to all schemas, to several schemas, or to just a single schema. We then tag each sentence and, for 'field' and 'value', recombine the data in both the query and the logical form. Finally, we add noise and replace synonyms in the queries to further vary the sentence structure.
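The last step, replacing synonyms and adding noise, might look like the minimal sketch below. The synonym table, the filler words, and the function name `perturb_query` are all hypothetical. Placeholders such as `<field>:0` are left untouched so that the query still lines up with its logical form.

```python
import random

# Hypothetical synonym table; a real one could come from a thesaurus or word vectors.
SYNONYMS = {
    'most': ['largest number of', 'highest number of'],
    'has': ['had', 'holds'],
}
# Hypothetical filler words used as noise.
NOISE = ['please', 'actually', 'exactly']


def perturb_query(template, rng, noise_prob=0.3):
    """Replace synonyms and optionally add one noise token; placeholders are kept."""
    out = []
    for token in template.split():
        if token.startswith('<'):
            # <field>:k and <value>:k placeholders stay as they are
            out.append(token)
        elif token in SYNONYMS:
            # choose among the synonyms and the original word
            out.append(rng.choice(SYNONYMS[token] + [token]))
        else:
            out.append(token)
    if rng.random() < noise_prob:
        # prepend the noise word so the order of the placeholders is preserved
        out.insert(0, rng.choice(NOISE))
    return ' '.join(out)


rng = random.Random(0)
template = 'which <field>:0 has the most <field>:1 championships'
variants = [perturb_query(template, rng) for _ in range(5)]
for v in variants:
    print(v)
```

Multi-word synonyms shift the token positions of the placeholders, so any position bookkeeping on the query side would have to be recomputed after this step.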

Let's start with the easiest length = 4:

In [4]:
import os,sys,inspect

import tagger as tg
import tag_utils as tu
from nltk.parse import stanford
from nltk import tree

currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir)
os.environ['STANFORD_PARSER'] = '/Users/richard_xiong/Documents/DeepLearningMaster/deep_parser'
os.environ['STANFORD_MODELS'] = '/Users/richard_xiong/Documents/DeepLearningMaster/deep_parser'

parser = stanford.StanfordParser(model_path='/Users/richard_xiong/Documents/DeepLearningMaster/deep_parser/englishPCFG.ser.gz')

#parsequery = "which nation has less than 6 <field:1> but its <field:2> medals are more than 14 "
#parsequery = "when the <field:1> was beijing and <field:2> was dubai , which city was the most recent <field:4>"
#parsequery = "for <field:0> with more than 400 <field:1> and <field:2> less than 14 , <field:0> has the most <field:3>"
# parsequery = "which state had the largest <field:1>, and its <field:2> are within 12 and 15"
# dependency_tree = parser.raw_parse_sents(('Hello, My name is Melroy', parsequery))

# for line in dependency_tree[1]:
#     line.draw()

Importing GloVe pretrained word vectors
		reading 10000 lines from GloVe file
		reading 20000 lines from GloVe file
		reading 30000 lines from GloVe file
		reading 40000 lines from GloVe file
		reading 50000 lines from GloVe file
Replacing GloVe word vectors as initialization


To find which value corresponds to which field, we first need to find:
1. the lowest common ancestor of each (value, field) pair;
2. for each value, the deepest of these ancestors. The ancestors of one value sit at different levels of the parse tree, and the deepest one, which is a subtree of all the others, contains the corresponding pair.

Possible functions (methods of the NLTK `Tree` class):

leaf_treeposition(self, index) ---> The tree position of the ``index``-th leaf in this tree. I.e., if ``tp=self.leaf_treeposition(i)``, then ``self[tp]==self.leaves()[i]``.

treeposition_spanning_leaves(self, start, end) ---> The tree position of the lowest descendant of this tree that dominates ``self.leaves()[start:end]``.

convert(cls, tree) ---> Convert a tree to a subtype of Tree, say, ParentedTree.

e.g.
(0, 0, 1, 0, 0, 1, 0)
(0, 0, 1, 0, 1, 1, 0, 0)
(0, 0, 1, 2, 0, 0, 0)
(0, 0, 1, 2, 1, 1, 0, 0)
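
A tree position is a tuple of child indices starting from the root, so the lowest common ancestor of two leaves is the longest common prefix of their tree positions. The sketch below applies this to the four example positions above. Which of them belong to fields and which to values is an assumption made for illustration, and `lowest_common_ancestor` and `match_values_to_fields` are hypothetical helper names.

```python
def lowest_common_ancestor(pos_a, pos_b):
    """Longest common prefix of two tree positions (tuples of child indices)."""
    prefix = []
    for a, b in zip(pos_a, pos_b):
        if a != b:
            break
        prefix.append(a)
    return tuple(prefix)


def match_values_to_fields(value_positions, field_positions):
    """Pair each value with the field whose common ancestor is deepest."""
    pairs = {}
    for v_name, v_pos in value_positions.items():
        best = max(
            field_positions,
            key=lambda f: len(lowest_common_ancestor(v_pos, field_positions[f])),
        )
        pairs[v_name] = best
    return pairs


# Assumed roles: the shorter positions are fields, the longer ones are values.
field_positions = {'field_a': (0, 0, 1, 0, 0, 1, 0),
                   'field_b': (0, 0, 1, 2, 0, 0, 0)}
value_positions = {'value_a': (0, 0, 1, 0, 1, 1, 0, 0),
                   'value_b': (0, 0, 1, 2, 1, 1, 0, 0)}

lca = lowest_common_ancestor(field_positions['field_a'], value_positions['value_a'])
pairs = match_values_to_fields(value_positions, field_positions)
print(lca)    # (0, 0, 1, 0)
print(pairs)  # value_a goes with field_a, value_b with field_b
```

With a real parse, the positions would come from `tree.leaf_treeposition(i)` for the leaf indices of the tagged field and value tokens.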

In [5]:
schema_collect[2] = "State Year_of_Election No._of_candidates No._of_elected Total_no._of_seats_in_Assembly \n"
schema_collect[3] = "Team Years_won County Wins Areas Prices \n"
schema_collect[4] = "Player Matches Innings Runs Average 100s 50s Games_Played Field_Goals Free_Throws Points \n"

schema_collect[6] = "Discipline Amanda Bernie Javine_H Julia Michelle \n"
schema_collect[8] = "Swara Position Short_name Notation Mnemonic \n"
schema_collect[7] = "Nation Name Position League_Apps League_Goals FA_Cup_Apps FA_Cup_Goals Total_Apps Total_Goals \n"
schema_collect[9] = "Year 1st_Venue 2nd_Venue 3rd_Venue 4th_Venue 5th_Venue 6th_Venue \n"

for schema in schema_collect:
    print schema

Nation Rank Gold Silver Bronze Total

Name Year_inducted Position Apps Goals

State Year_of_Election No._of_candidates No._of_elected Total_no._of_seats_in_Assembly 

Team Years_won County Wins Areas Prices 

Player Matches Innings Runs Average 100s 50s Games_Played Field_Goals Free_Throws Points 

Country Masters U.S._Open The_Open PGA Total

Discipline Amanda Bernie Javine_H Julia Michelle 

Nation Name Position League_Apps League_Goals FA_Cup_Apps FA_Cup_Goals Total_Apps Total_Goals 

Swara Position Short_name Notation Mnemonic 

Year 1st_Venue 2nd_Venue 3rd_Venue 4th_Venue 5th_Venue 6th_Venue 

Menteri_Besar Took_office Left_office Party



### Conventions
1. "o" stands for "ordinal" values, referring to schema_collect[0:4]
2. "n" stands for "numerical" values, referring to schema_collect[4:8]
3. "s" stands for "string" values, referring to schema_collect[7:]

In [6]:
collect4max = """which country has the most pga championships
which country had the most number of wins
which country won the largest haul of bronze medals
who was the last de player
which nation received the largest amount of gold medals
the team with the most gold medals
which nation was ranked last
the country that won the most medals was
what is the largest matches amount""".split('\n')

collect4min = """who was the first nation
what is the name of the first nation on this chart
what is the name of the swara that holds the first position
which country had the least bronze medals
who scored the least on whitewater_kayak
which state has the top no._of_elected amount
who was the top scorer in innings
what is the top listed player
who is the top ranked nation""".split('\n')

print collect4max

lo4max = 'select <field>:0 argmax <field>:1'
lo4min = 'select <field>:0 argmin <field>:1'

print schema_collect[5], collect4max[0]
tagged2, field_corr, value_corr, quTemp, _ = \
            tg.sentTagging_tree(parser, collect4max[0], schema_collect[5])
print field_corr
print value_corr 
print quTemp
print lo4max

['which country has the most pga championships', 'which country had the most number of wins', 'which country won the largest haul of bronze medals', 'who was the last de player', 'which nation received the largest amount of gold medals', 'the team with the most gold medals', 'which nation was ranked last', 'the country that won the most medals was', 'what is the largest matches amount']
Country Masters U.S._Open The_Open PGA Total
which country has the most pga championships
['<field>', 'Country']
['<field>', 'PGA']
[(5, 1)]
[]
[(1, 0)]
[]
Country PGA
<nan> <nan>
which <field>:0 has the most <field>:1 championships
select <field>:0 argmax <field>:1


### Note:
After tagging, each sentence can be turned into a query template. We now have the logical template, the query template, and several available schemas (annotated by 'o', 'n', 's'), so together with field_corr and value_corr we should be able to generate multiple sentences for several schemas.
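
As a rough illustration of that generation step, the sketch below fills the tagged query template and its logical template with field names from another schema. The `fill_template` helper and the chosen field assignment are hypothetical. Lower-casing the field name in the query while keeping the original name in the logical form is also an assumption.

```python
def fill_template(template, fields):
    """Replace every <field>:k token with fields[k]; other tokens are kept."""
    out = []
    for token in template.split():
        tag, sep, idx = token.partition(':')
        if tag == '<field>' and sep:
            out.append(fields[int(idx)])
        else:
            out.append(token)
    return ' '.join(out)


query_template = 'which <field>:0 has the most <field>:1 championships'
logic_template = 'select <field>:0 argmax <field>:1'

# Assumed assignment for the schema "Nation Rank Gold Silver Bronze Total":
# slot 0 takes a string field, slot 1 takes an int field.
fields = ['Nation', 'Gold']

new_query = fill_template(query_template, [f.lower() for f in fields])
new_logic = fill_template(logic_template, fields)
print(new_query)  # which nation has the most gold championships
print(new_logic)  # select Nation argmax Gold
```

The leftover word "championships" shows why each template first has to be checked for the schemas it can transfer to.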

In [10]:
def schemaRecommend(field_corr_old, special_code):
    ''' Transform the old field correspondence (a string) into a new field correspondence,
        represented as a list of value types, and use that set of value types to pick the
        schemas (plural) that can be used for augmentation later; every type in
        field_corr_new must be present in each selected schema.
        arguments --- special_code: may be used to indicate that the schema is not transferable.
        return --- field_corr_new: a list of value types
                   schemas: the schemas that the template can be augmented to, each containing
                   all the value types needed; also see 'special_code'.
    '''
    # Stub: field_corr_new and schemas are not computed inside this function yet.
    return field_corr_new, schemas

field_corr_new = ['string', 'int']
schema_aug = schema_collect[0:8]

def augment4max(quTemp, loTemp, field_corr, schema_aug):
    ''' Data augmentation from a pair of query template and logical template
        arguments --- field_corr: a list of value types, e.g. ['string','ordinal','int']; each index
                      corresponds to a position in the templates
                      schema_aug: a list of schemas (plural) that the template can be augmented to
        return --- collections of queries, logics, and fields
    '''
    queryCollect, logicCollect, fieldCollect = [], [], []
    config = tu.Config()
    
    # Step 1: preparation
    query = quTemp.split()
    logic = loTemp.split()
    qu_field = []  # positions of field in query
    qu_value = []  # positions of value in query
    lo_field = []  # positions of field in logic
    lo_value = []  # positions of value in logic
    for i in range(len(query)):
        reference = query[i].split(':')
        if len(reference) == 1:
            continue
        print reference
        idx = int(reference[1])
        if reference[0] == '<field>':
            qu_field.append((i, idx))
        else:
            qu_value.append((i, idx))
    print qu_field, qu_value
    for i in range(len(logic)):
        reference = logic[i].split(':')
        if len(reference) == 1:
            continue
        print reference
        idx = int(reference[1])
        if reference[0] == '<field>':
            lo_field.append((i, idx))
        else:
            lo_value.append((i, idx))
    print lo_field, lo_value
    
    # Step 2: augment to different schemas
    for j in range(len(schema_aug)):
        field_corr_dicts = []
        print '=== %d schema ===' %j
        schema = schema_aug[j].split()
        print schema
        # one sentence may contain several fields of the same type, so we go over field_corr
        for k in range(len(field_corr)):
            field_corr_dict = dict()
            for i in range(len(schema)):
                field = schema[i]
                #print field
                if schema[i] == 'Total' or schema[i] == 'Average':
                    continue
                value_type = config.field2word[schema[i]]['value_type']
                #print value_type
                if value_type == field_corr[k]:
                    if value_type == 'string':
                        #field_corr_dict[field] = config.field2word[schema[i]]['value_range']
                        field_corr_dict[field] = random.sample(config.field2word[schema[i]]['value_range'], 5)
                    elif value_type == 'int':
                        # use _ so the comprehension does not overwrite the loop variable i (Python 2)
                        field_corr_dict[field] = [random.randint(1, 99) for _ in range(5)]
                    elif value_type == 'date':
                        field_corr_dict[field] = [random.randint(1970, 2010) for _ in range(5)]
                    elif value_type == 'ordinal':
                        field_corr_dict[field] = [random.randint(1, 9) for _ in range(5)]
            field_corr_dicts.append(field_corr_dict)
        print field_corr_dicts 
        # now the list of dicts [{str_field1:[], str_field2:[], ...}, {int_field1:[], int_field2:[], ...}]
    return queryCollect, logicCollect, fieldCollect

augment4max(quTemp, lo4max, field_corr_new, schema_aug)

['<field>', '0']
['<field>', '1']
[(1, 0), (5, 1)] []
['<field>', '0']
['<field>', '1']
[(1, 0), (3, 1)] []
=== 0 schema ===
['Nation', 'Rank', 'Gold', 'Silver', 'Bronze', 'Total']
[{'Nation': ['Mongolia', 'US', 'Hungary', 'Bahamas', 'Austria']}, {'Bronze': [41, 71, 14, 17, 24], 'Silver': [85, 29, 6, 89, 95], 'Gold': [13, 22, 13, 37, 51]}]
=== 1 schema ===
['Name', 'Year_inducted', 'Position', 'Apps', 'Goals']
[{'Position': ['Defender', 'Forward', 'QB', 'Goalkeeper', 'DE'], 'Name': ['Ross_Jenkins', 'Ernie_Islip', 'Jack_Byers', 'Harry_Brough', 'Robert_Jones']}, {'Apps': [94, 40, 8, 8, 34], 'Goals': [8, 18, 64, 27, 49]}]
=== 2 schema ===
['State', 'Year_of_Election', 'No._of_candidates', 'No._of_elected', 'Total_no._of_seats_in_Assembly']
[{'State': ['West_Bengal', 'Manipur', 'Andhra_Pradesh', 'Goa', 'Florida']}, {'Total_no._of_seats_in_Assembly': [20, 75, 6, 93, 88], 'No._of_candidates': [45, 56, 54, 39, 43], 'No._of_elected': [51, 39, 40, 36, 85]}]
=== 3 schema ===
['Team', 'Years_won'

([], [], [])
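
The schema-selection step that the `schemaRecommend` docstring describes could be sketched as below. This is not the notebook's implementation: `FIELD_TYPES` is a small hypothetical stand-in for `config.field2word[...]['value_type']`, and `recommend_schemas` is a hypothetical name. The sketch assumes that a schema qualifies when it offers at least as many fields of each value type as the template needs.

```python
from collections import Counter

# Hypothetical field -> value type table (stand-in for config.field2word).
FIELD_TYPES = {
    'Nation': 'string', 'Rank': 'ordinal', 'Gold': 'int', 'Silver': 'int',
    'Bronze': 'int', 'Menteri_Besar': 'string', 'Took_office': 'date',
    'Left_office': 'date', 'Party': 'string',
}


def recommend_schemas(needed_types, schemas):
    """Return the schemas that contain enough fields of every needed value type."""
    needed = Counter(needed_types)
    selected = []
    for schema in schemas:
        # fields without a known type (here 'Total') are ignored
        available = Counter(FIELD_TYPES[f] for f in schema.split() if f in FIELD_TYPES)
        if all(available[t] >= n for t, n in needed.items()):
            selected.append(schema)
    return selected


schemas = ['Nation Rank Gold Silver Bronze Total',
           'Menteri_Besar Took_office Left_office Party']
selected = recommend_schemas(['string', 'int'], schemas)
print(selected)
```

Counting types instead of only testing membership matters for templates that need two fields of the same type.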