
bAbI refactored #97

Open · wants to merge 5 commits into develop

Conversation

vincentalbouy (Contributor)

OK guys, I am opening this pull request if you want to have a look; I need feedback. For now it contains a refactored version of Ryan's work:

  • A modified seq2seq_problem.py
  • A babi.yaml to run babi with LSTM
  • A first draft of BABI problem Class

I think the problem class needs a lot of improvement. I noticed Ryan didn't use any padding, so I am working on a new version with pack_padded_sequence, as @vmarois suggested. Also, the story parsing method was not commented, so I had a hard time understanding what he did, and I am not 100% sure the parsing is correct.
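For the padding discussion, here is a minimal sketch of how the variable-length stories could be padded and packed with pack_padded_sequence before being fed to an LSTM (the shapes, sizes and names are illustrative, not the actual class code):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Illustrative batch: three embedded stories of different lengths, embedding size 100.
stories = [torch.randn(7, 100), torch.randn(5, 100), torch.randn(3, 100)]
lengths = torch.tensor([7, 5, 3])  # sorted in decreasing order

# Pad to the longest story: [batch_size, max_seq_len, embedding_size].
padded = pad_sequence(stories, batch_first=True)

# Pack so that the LSTM skips the padded positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm = torch.nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor if per-timestep outputs are needed downstream.
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)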

@tkornuta-ibm (Contributor)

This pull request introduces 3 alerts when merging ddb720b into 68ad04d - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for Unused local variable
  • 1 for Missing call to __init__ during object initialization

Comment posted by LGTM.com

@tkornuta-ibm assigned tkornuta-ibm and unassigned vmarois on Nov 16, 2018
@tkornuta-ibm left a comment:

  1. Rename class

  2. Introduce abstract class and move methods from SeqToSeq to that class

  3. Change param names and default values

  4. Enhance the returned DataDict

  5. Is the configuration file OK, i.e., is the LSTM model (somehow) converging for a single task / many tasks / all tasks, with all samples / tenK samples?

tf.write(f.read())
return self.parse(file_data, add_punctuation)

def download(self, root, check=None):
Contributor:

OK, so that whole part should at some point be integrated with Emre's ProblemInitializer.

Contributor:

Why 2 download methods?

Contributor:

One downloads a single file from a given URL; the second one downloads the file(s?) (using the first one), unpacks it, and returns a path to it. IMO this is really 100% ProblemInitializer functionality.

Contributor Author:

OK, I need to know more about that. Will check with Emre.




def download_from_url(self, url, path):
Contributor:

This code is responsible for the download; we are moving that functionality to ProblemInitializer and integrating it with ProblemFactory.

Contributor Author:

OK, I will need to know more about that. Will check with Emre.


babi_tasks = list(range(1, 21))

params = {'directory': '/', 'tasks': babi_tasks, 'data_type': 'train', 'batch_size': 10,
          'embedding_type': 'glove.6B.100d', 'ten_thousand_examples': True,
          'one_hot_embedding': True, 'truncation_length': 50}
Contributor:

Generally we are standardizing parameter names (see #73).
For data, the variable name should be "data_folder", with the default value set to "~/data/babi/".

Please change that according to the convention used, e.g., in MNIST's init:

self.params.add_default_params({'data_folder': '~/data/mnist',

and then when "simulating" the config file:

params.add_config_params({'data_folder': '~/data/mnist',

Contributor Author:

got it
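Something along these lines, assuming the same ParamInterface API (add_default_params / add_config_params) as in the MNIST example above; the defaults shown are only illustrative:

import os

def setup_babi_params(params):
    """Sketch: register bAbI defaults following the naming convention from #73.
    `params` is assumed to expose the same add_default_params API as above."""
    params.add_default_params({'data_folder': '~/data/babi/',
                               'tasks': list(range(1, 21)),
                               'data_type': 'train',
                               'ten_thousand_examples': True,
                               'one_hot_embedding': True})
    return os.path.expanduser('~/data/babi/')

# And when "simulating" the config file in a __main__ test section:
# params.add_config_params({'data_folder': '~/data/babi/', 'tasks': [1, 2, 3]})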

@@ -75,6 +77,86 @@ def evaluate_loss(self, data_dict, logits):
return loss


def to_dictionary_indexes(self, dictionary, sentence):
Contributor:

All those methods look text-related to me. SeqToSeq problems are abstracted away from text.

I guess it is good to move them out of bAbI to a class higher in the hierarchy, as they will probably be reused by the other, multi-question version.

I propose to introduce a new class, inheriting from SeqToSeq, that will contain those methods, and maybe other methods shared with the second version of bAbI...

Contributor:

Do you think those methods can be useful in e.g. machine translation?

Contributor:

These should be moved to TextToText or another class related to NLP

Contributor Author:

Yes, OK, let's go for TextToText.
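For reference, a minimal sketch of the hierarchy being discussed, with the text helpers pulled out of the bAbI class into the intermediate NLP class (the class names follow the thread above, but the helper body is illustrative, not the PR code):

import torch

class SeqToSeqProblem:
    """Stand-in for the existing base sequence-to-sequence problem class."""
    pass

class TextToTextProblem(SeqToSeqProblem):
    """Intermediate class holding text-specific helpers shared by NLP problems."""

    def to_dictionary_indexes(self, dictionary, sentence):
        # Illustrative body: map each token of a tokenized sentence to its index.
        return torch.tensor([dictionary[word] for word in sentence], dtype=torch.long)

class bAbIQASingleQuestion(TextToTextProblem):
    """bAbI QA problem with a single question per story (name as agreed above)."""
    pass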




class BABI(SeqToSeqProblem):
Contributor:

The name of the dataset is bAbI (AI with capital letters), and as this is a version with a single question per story (if I understand it correctly), I suggest renaming it to something that reflects that fact, e.g.
bAbIQASingleQuestion

Contributor Author:

Ok done


self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
'targets': {'size': [-1], 'type': [torch.Tensor]},
'current_question': {'size': [-1, 1], 'type': [list, str]},
Contributor:

What is current_question, and why is it singular while the other names are plural?

Contributor Author:

They are all the question strings, so I changed the name to plural.

'targets': {'size': [-1], 'type': [torch.Tensor]},
'current_question': {'size': [-1, 1], 'type': [list, str]},
'masks': {'size': [-1], 'type': [torch.Tensor]},
}
Contributor:

Aside from that, we have twenty tasks in bAbI. I guess it would be good to store which task a given sample belongs to.

Similarly to the CLEVR case, we will need that to analyze how good we are, e.g., on stories with two supporting facts. Please note that when people report results on bAbI, they provide tables with separate accuracies for all 20 tasks, and then a single accuracy when trained jointly on all tasks...

Contributor Author:

Yes ok
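Something like the following could work for tracking the task of each sample and the per-task accuracies; the field and function names are just a sketch, not the final DataDict layout:

import torch
from collections import defaultdict

# Sketch: add the originating task id of each sample to the data definitions...
data_definitions = {'sequences': {'size': [-1, -1, 100], 'type': [torch.Tensor]},
                    'targets': {'size': [-1], 'type': [torch.Tensor]},
                    'questions': {'size': [-1, 1], 'type': [list, str]},
                    'tasks': {'size': [-1], 'type': [torch.Tensor]}}

# ...and accumulate per-task accuracies, as in the usual bAbI result tables.
correct, total = defaultdict(int), defaultdict(int)

def update_per_task_accuracy(task_ids, predictions, targets):
    for task, pred, target in zip(task_ids.tolist(), predictions, targets):
        total[task] += 1
        correct[task] += int(pred == target)

def per_task_accuracy():
    return {task: correct[task] / total[task] for task in sorted(total)}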


self.embedding_type = params['embedding_type']

self.embedding_size = 38
Contributor:

The size of the embedding is hardcoded, whereas, if I understand it correctly, you can use different ones by setting embedding_type. Is that OK?

Contributor Author:

Yes, I put it in the YAML as a parameter, but the parser couldn't find it. I will try again.


self.batch_size = params['batch_size']

self.memory_size = params['truncation_length']
Contributor:

What is that? What is truncated?

Contributor Author:

I think this was the padding size for Ryan. I have removed it.


self.default_values = {'input_item_size': self.embedding_size , 'output_item_size':self.embedding_size}

self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
Contributor:

We discussed that, and I do not really understand why you plugged self.memory_size in as the third dimension...

BATCH_SIZE x SEQ_LEN x ITEM_SIZE

ITEM_SIZE is, in this case, the size (number of bits) returned by the word embedding used for a single word, right?
If not, what is really stored in "sequences"? Is it the whole story?

Contributor:

Agreed, it would be best to separate the story (or context) from the actual question

Contributor:

But then a simple LSTM model won't work...

Contributor:

So probably there should be a switch, like "separate_question", that could be parametrized from the configuration file.

If true:
sequences = story
question = question
else:
sequences = story + question

Contributor Author:

Right now sequences = story + question
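A minimal sketch of the proposed "separate_question" switch; the key and parameter names are illustrative, not the actual implementation:

def build_sample(story_tokens, question_tokens, separate_question=False):
    """Sketch of the proposed switch: either expose the question as its own
    field, or append it to the story so that a plain LSTM model still
    receives a single sequence."""
    if separate_question:
        return {'sequences': story_tokens, 'questions': question_tokens}
    return {'sequences': story_tokens + question_tokens}

# e.g. build_sample(['Mary', 'went', 'home', '.'], ['Where', 'is', 'Mary', '?'])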



def build_dictionaries_one_hot(self):

Contributor:

The description is inconsistent with the name of the function. Is it really creating word embeddings, OR is it building dictionaries from all the words present in the stories/questions?

Why do we need two methods for it (one with one-hot and one without)?

Contributor Author:

We could keep only one. There is a boolean variable (one_hot_embedding) that decides whether you want one-hot encoding or a pretrained embedding. I thought this was a nice feature, so I kept it.
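A rough sketch of how the two methods could collapse into one, switching on the one-hot flag; the helper name and the vocabulary handling are assumptions, not the PR code:

def build_dictionaries(self):
    """Sketch of a single dictionary-building method; the two current variants
    differ only in whether a pretrained embedding is attached afterwards."""
    text = self.load_all_words()  # hypothetical helper: tokens from train/valid/test
    if self.one_hot_embedding:
        # One-hot case: the effective embedding size is the dictionary size.
        self.word_dic = {word: idx for idx, word in enumerate(sorted(set(text)))}
        self.embedding_size = len(self.word_dic)
    else:
        # Pretrained case: reuse the existing call from the PR.
        self.language.build_pretrained_vocab(text, vectors=self.embedding_type,
                                             tokenize=self.tokenize)
        self.embedding_size = 100  # e.g. glove.6B.100d returns vectors of size 100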

@tkornuta-ibm changed the title from "Babi refactored" to "bAbI refactored" on Nov 16, 2018
@tkornuta-ibm commented Nov 16, 2018:

Finally, can we agree that we will use the right name? It should be

bAbI

not Babi, BABI, etc.

And it is for a reason:

https://research.fb.com/downloads/babi/

bAbI comes from "baby AI".

And besides, it is only bAbI QA, not the whole bAbI, which contains more tasks...

@vmarois left a comment:

I understand bAbI is a hard problem to implement because of the padding etc., and this is exactly why this class needs to be clearer: it should contain documentation and comments, and variable names should reflect their use.
Have a look at CLEVR and COG for the constraints to respect in a Problem class.
Perhaps we should all agree together on what the content of a bAbI sample should be...

batch_size: &b 1
data_type: train
embedding_type: glove.6B.100d
embedding_size: 50
Contributor:

glove.6B.100d already uses an embedding dimension of size 100, so I do not understand embedding_size: 50.

embedding_size: 50
use_mask : false
joint_all: true
one_hot_embedding: true
Contributor:

How is this related to embedding_type: glove.6B.100d and embedding_size: 50?

Contributor Author:

So I just checked. You are right: glove.6B.100d embeddings are of size 100; when it is one-hot, the size is the size of the dictionary, so I will store the size of the dictionary instead.

tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length: 50
directory : ./
Contributor:

As @tkornut indicated, we should change that to data_folder: "~/data/babi/"

Contributor Author:

ok

data_type: valid
embedding_type: glove.6B.100d
joint_all: true
one_hot_embedding: true
Contributor:

Same remark here as above

Contributor Author:

So I just checked. You are right: glove.6B.100d embeddings are of size 100; when it is one-hot, the size is the size of the dictionary, so I will store the size of the dictionary instead.

one_hot_embedding: true
tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length : 50
Contributor:

Same question as @tkornut: is this related to the maximum size of a question, or is it related to padding?

data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='valid',
outmod="embedding")
data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='test',
outmod="embedding")
Contributor:

So data contains all the train, valid and test samples? If so, what is the point of data_type: train in the config file?

Contributor Author:

It is used somewhere else to load the right data. You are right, though, that it is not needed for building the dictionaries.

tf.write(f.read())
return self.parse(file_data, add_punctuation)

def download(self, root, check=None):
Contributor:

Why 2 download methods?

shutil.copyfileobj(gz, uncompressed)

#Return path to extracted dataset
return os.path.join(path, self.dirname)
Contributor:

This method looks quite generic, which is nice, so I would move it up in the class hierarchy.

Contributor Author:

ok
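A rough sketch of what a generic version of that helper could look like once moved up the hierarchy (or into ProblemInitializer); it assumes a plain gzip archive, and the function and parameter names are illustrative:

import gzip
import os
import shutil
import urllib.request

def download_and_extract(url, data_dir, dirname):
    """Sketch: download a .gz archive if it is not cached yet, decompress it,
    and return the path to the extracted directory."""
    os.makedirs(data_dir, exist_ok=True)
    archive_path = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    extracted_path = archive_path[:-3]  # strip the ".gz" suffix
    if not os.path.exists(extracted_path):
        with gzip.open(archive_path, 'rb') as gz, open(extracted_path, 'wb') as uncompressed:
            shutil.copyfileobj(gz, uncompressed)
    # Return path to the extracted dataset, as in the original method.
    return os.path.join(data_dir, dirname)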


from miprometheus.problems.problem import Problem
import torch
from miprometheus.utils.app_state import AppState
Contributor:

No need because Problem already has

self.app_state = AppState()

Contributor Author:

OK, but some classes like TextToTextProblem do not see it somehow.

@@ -75,6 +77,86 @@ def evaluate_loss(self, data_dict, logits):
return loss


def to_dictionary_indexes(self, dictionary, sentence):
Contributor:

These should be moved to TextToText or another class related to NLP


""" build embeddings from the chosen database / Example: glove.6B.100d """

self.language.build_pretrained_vocab(text, vectors=self.embedding_type, tokenize=self.tokenize)
Contributor:

Regarding the problem with the size of embeddings...

I think that language (by the way, I think this is a bad name; it means exactly nothing, and if I understand it correctly it is really a... TextEncoder?) should set and return that value (i.e. the size of the embeddings)...

@tkornuta-ibm (Contributor)

This pull request introduces 4 alerts when merging 0c052d4 into 3d4f868 - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for Unused local variable
  • 1 for Unused import
  • 1 for Missing call to __init__ during object initialization

Comment posted by LGTM.com
