
bAbI refactored #97

Open · wants to merge 5 commits into develop

Conversation

vincentalbouy (Contributor)

OK guys, I am opening this pull request if you want to have a look; I need feedback. For now it contains a refactored version of Ryan's work:

  • A modified seq2seq_problem.py
  • A babi.yaml to run babi with LSTM
  • A first draft of BABI problem Class

I think the problem class needs a lot of improvement. I noticed Ryan didn't use any padding, so I am working on a new version with pack_padded_sequence, as @vmarois suggested. Also, the story parsing method was not commented, so I had a hard time understanding what he did, and I am not 100% sure the parsing is correct.
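For the padding discussion, here is a minimal sketch of how the variable-length stories could be padded and packed with pack_padded_sequence before being fed to an LSTM (the shapes, sizes and names are illustrative, not the actual class code):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Illustrative batch: three embedded stories of different lengths, embedding size 100.
stories = [torch.randn(7, 100), torch.randn(5, 100), torch.randn(3, 100)]
lengths = torch.tensor([7, 5, 3])  # sorted in decreasing order

# Pad to the longest story: [batch_size, max_seq_len, embedding_size].
padded = pad_sequence(stories, batch_first=True)

# Pack so that the LSTM skips the padded positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm = torch.nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor if per-timestep outputs are needed downstream.
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)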

@tkornuta-ibm (Contributor)

This pull request introduces 3 alerts when merging ddb720b into 68ad04d - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for Unused local variable
  • 1 for Missing call to __init__ during object initialization

Comment posted by LGTM.com

@tkornuta-ibm assigned tkornuta-ibm and unassigned vmarois on Nov 16, 2018
@tkornuta-ibm left a comment:

  1. Rename class

  2. Introduce abstract class and move methods from SeqToSeq to that class

  3. Change param names and default values

  4. Enhance the returned DataDict

  5. Is the configuration file OK, i.e., is the LSTM model (somehow) converging for a single task / many tasks / all tasks, with all samples / tenK samples?

tf.write(f.read())
return self.parse(file_data, add_punctuation)

def download(self, root, check=None):
Contributor:

OK, so that whole part should at some point be integrated with Emre's ProblemInitializer.

Contributor:

Why 2 download methods?

Contributor:

One downloads a single file from a given URL; the second one downloads the file(s?) (using the first one), unpacks it, and returns a path to it. IMO this is really 100% ProblemInitializer functionality.

Contributor Author:

OK, I need to know more about that. Will check with Emre.




def download_from_url(self, url, path):
Contributor:

This code is responsible for the download; we are moving that functionality to ProblemInitializer and integrating it with ProblemFactory.

Contributor Author:

OK, I will need to know more about that. Will check with Emre.


babi_tasks = list(range(1, 21))

params = {'directory': '/', 'tasks': babi_tasks, 'data_type': 'train', 'batch_size': 10,
          'embedding_type': 'glove.6B.100d', 'ten_thousand_examples': True,
          'one_hot_embedding': True, 'truncation_length': 50}
Contributor:

Generally we are standardizing parameter names (see #73).
For data, the variable name should be "data_folder", with the default value set to "~/data/babi/".

Please change that according to the convention used, e.g., in MNIST's init:

self.params.add_default_params({'data_folder': '~/data/mnist',

and then when "simulating" the config file:

params.add_config_params({'data_folder': '~/data/mnist',

Contributor Author:

got it
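Something along these lines, assuming the same ParamInterface API (add_default_params / add_config_params) as in the MNIST example above; the defaults shown are only illustrative:

import os

def setup_babi_params(params):
    """Sketch: register bAbI defaults following the naming convention from #73.
    `params` is assumed to expose the same add_default_params API as above."""
    params.add_default_params({'data_folder': '~/data/babi/',
                               'tasks': list(range(1, 21)),
                               'data_type': 'train',
                               'ten_thousand_examples': True,
                               'one_hot_embedding': True})
    return os.path.expanduser('~/data/babi/')

# And when "simulating" the config file in a __main__ test section:
# params.add_config_params({'data_folder': '~/data/babi/', 'tasks': [1, 2, 3]})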

@@ -75,6 +77,86 @@ def evaluate_loss(self, data_dict, logits):
return loss


def to_dictionary_indexes(self, dictionary, sentence):
Contributor:

All those methods look text-related to me. SeqToSeq problems are abstracted away from text.

I guess it is good to move them out of bAbI to a class higher in the hierarchy, as they will probably be reused by the other, multi-question version.

I propose to introduce a new class, inheriting from SeqToSeq, that will contain those methods, and maybe other methods shared with the second version of bAbI...

Contributor:

Do you think those methods can be useful in e.g. machine translation?

Contributor:

These should be moved to TextToText or another class related to NLP

Contributor Author:

Yes, OK, let's go for TextToText.
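For reference, a minimal sketch of the hierarchy being discussed, with the text helpers pulled out of the bAbI class into the intermediate NLP class (the class names follow the thread above, but the helper body is illustrative, not the PR code):

import torch

class SeqToSeqProblem:
    """Stand-in for the existing base sequence-to-sequence problem class."""
    pass

class TextToTextProblem(SeqToSeqProblem):
    """Intermediate class holding text-specific helpers shared by NLP problems."""

    def to_dictionary_indexes(self, dictionary, sentence):
        # Illustrative body: map each token of a tokenized sentence to its index.
        return torch.tensor([dictionary[word] for word in sentence], dtype=torch.long)

class bAbIQASingleQuestion(TextToTextProblem):
    """bAbI QA problem with a single question per story (name as agreed above)."""
    pass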




class BABI(SeqToSeqProblem):
Contributor:

The name of the dataset is bAbI (AI with capital letters), and as this is a version with a single question per story (if I understand it correctly), I suggest renaming it to something that reflects that fact, e.g.
bAbIQASingleQuestion

Contributor Author:

Ok done


self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
'targets': {'size': [-1], 'type': [torch.Tensor]},
'current_question': {'size': [-1, 1], 'type': [list, str]},
Contributor:

What is current_question, and why is it singular while the other names are plural?

Contributor Author:

They are all the question strings, so I changed the name to plural.

'targets': {'size': [-1], 'type': [torch.Tensor]},
'current_question': {'size': [-1, 1], 'type': [list, str]},
'masks': {'size': [-1], 'type': [torch.Tensor]},
}
Contributor:

Aside from that, we have twenty tasks in bAbI. I guess it would be good to store which task a given sample belongs to.

Similarly to the CLEVR case, we will need that to analyze how good we are, e.g., on stories with two supporting facts. Please note that when people report results on bAbI, they provide tables with separate accuracies for all 20 tasks, and then a single accuracy when trained jointly on all tasks...

Contributor Author:

Yes ok
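Something like the following could work for tracking the task of each sample and the per-task accuracies; the field and function names are just a sketch, not the final DataDict layout:

import torch
from collections import defaultdict

# Sketch: add the originating task id of each sample to the data definitions...
data_definitions = {'sequences': {'size': [-1, -1, 100], 'type': [torch.Tensor]},
                    'targets': {'size': [-1], 'type': [torch.Tensor]},
                    'questions': {'size': [-1, 1], 'type': [list, str]},
                    'tasks': {'size': [-1], 'type': [torch.Tensor]}}

# ...and accumulate per-task accuracies, as in the usual bAbI result tables.
correct, total = defaultdict(int), defaultdict(int)

def update_per_task_accuracy(task_ids, predictions, targets):
    for task, pred, target in zip(task_ids.tolist(), predictions, targets):
        total[task] += 1
        correct[task] += int(pred == target)

def per_task_accuracy():
    return {task: correct[task] / total[task] for task in sorted(total)}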


self.embedding_type = params['embedding_type']

self.embedding_size = 38
Contributor:

The size of the embedding is hardcoded, whereas, if I understand it correctly, you can use different ones by setting embedding_type. Is that OK?

Contributor Author:

Yes, I put it in the YAML as a parameter, but the parser couldn't find it. I will try again.


self.batch_size = params['batch_size']

self.memory_size = params['truncation_length']
Contributor:

What is that? What is truncated?

Contributor Author:

I think this was the padding size for Ryan. I have removed it.


self.default_values = {'input_item_size': self.embedding_size , 'output_item_size':self.embedding_size}

self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
Contributor:

We discussed that, and I do not really understand why you plugged self.memory_size in as the third dimension...

BATCH_SIZE x SEQ_LEN x ITEM_SIZE

ITEM_SIZE is, in this case, the size (number of bits) returned by the word embedding used for a single word, right?
If not, what is really stored in "sequences"? Is it the whole story?

Contributor:

Agreed, it would be best to separate the story (or context) from the actual question

Contributor:

But then a simple LSTM model won't work...

Contributor:

So probably there should be a switch, like "separate_question", that could be parametrized from the configuration file.

If true:
sequences = story
question = question
else:
sequences = story + question

Contributor Author:

Right now sequences = story + question
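A minimal sketch of the proposed "separate_question" switch; the key and parameter names are illustrative, not the actual implementation:

def build_sample(story_tokens, question_tokens, separate_question=False):
    """Sketch of the proposed switch: either expose the question as its own
    field, or append it to the story so that a plain LSTM model still
    receives a single sequence."""
    if separate_question:
        return {'sequences': story_tokens, 'questions': question_tokens}
    return {'sequences': story_tokens + question_tokens}

# e.g. build_sample(['Mary', 'went', 'home', '.'], ['Where', 'is', 'Mary', '?'])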



def build_dictionaries_one_hot(self):

Contributor:

The description is inconsistent with the name of the function. Is it really creating word embeddings, OR is it building dictionaries from all the words present in the stories/questions?

Why do we need two methods for it (one with one-hot and one without)?

Contributor Author:

We could keep only one. There is a boolean variable (one_hot_embedding) that decides whether you want one-hot encoding or a pretrained embedding. I thought this was a nice feature, so I kept it.
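A rough sketch of how the two methods could collapse into one, switching on the one-hot flag; the helper name and the vocabulary handling are assumptions, not the PR code:

def build_dictionaries(self):
    """Sketch of a single dictionary-building method; the two current variants
    differ only in whether a pretrained embedding is attached afterwards."""
    text = self.load_all_words()  # hypothetical helper: tokens from train/valid/test
    if self.one_hot_embedding:
        # One-hot case: the effective embedding size is the dictionary size.
        self.word_dic = {word: idx for idx, word in enumerate(sorted(set(text)))}
        self.embedding_size = len(self.word_dic)
    else:
        # Pretrained case: reuse the existing call from the PR.
        self.language.build_pretrained_vocab(text, vectors=self.embedding_type,
                                             tokenize=self.tokenize)
        self.embedding_size = 100  # e.g. glove.6B.100d returns vectors of size 100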

@tkornuta-ibm changed the title from "Babi refactored" to "bAbI refactored" on Nov 16, 2018
@tkornuta-ibm commented Nov 16, 2018:

Finally, can we agree that we will use the right name? It should be

bAbI

not Babi, BABI, etc.

And it is for a reason:

https://research.fb.com/downloads/babi/

bAbI comes from "baby AI".

And besides, it is only bAbI QA, not the whole bAbI, which contains more tasks...

@vmarois left a comment:

I understand bAbI is a hard problem to implement because of the padding etc., and this is exactly why this class needs to be clearer: it should contain documentation and comments, and variable names should reflect their use.
Have a look at CLEVR and COG for the constraints to respect in a Problem class.
Perhaps we should all agree together on what the content of a bAbI sample should be...

batch_size: &b 1
data_type: train
embedding_type: glove.6B.100d
embedding_size: 50
Contributor:

glove.6B.100d already uses an embedding dimension of size 100, so I do not understand embedding_size: 50.

embedding_size: 50
use_mask : false
joint_all: true
one_hot_embedding: true
Contributor:

How is this related to embedding_type: glove.6B.100d and embedding_size: 50?

Contributor Author:

So I just checked. You are right: glove.6B.100d embeddings are of size 100; when it is one-hot, the size is the size of the dictionary, so I will store the size of the dictionary instead.

tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length: 50
directory : ./
Contributor:

As @tkornut indicated, we should change that to data_folder: "~/data/babi/"

Contributor Author:

ok

data_type: valid
embedding_type: glove.6B.100d
joint_all: true
one_hot_embedding: true
Contributor:

Same remark here as above

Contributor Author:

So I just checked. You are right: glove.6B.100d embeddings are of size 100; when it is one-hot, the size is the size of the dictionary, so I will store the size of the dictionary instead.

one_hot_embedding: true
tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length : 50
Contributor:

Same question as @tkornut: is this related to the maximum size of a question, or is it related to padding?

data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='valid',
outmod="embedding")
data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='test',
outmod="embedding")
Contributor:

So data contains all the train, valid and test samples? If so, what is the point of data_type: train in the config file?

Contributor Author:

It is used somewhere else to load the right data. You are right, though, that it is not needed for building the dictionaries.

tf.write(f.read())
return self.parse(file_data, add_punctuation)

def download(self, root, check=None):
Contributor:

Why 2 download methods?

shutil.copyfileobj(gz, uncompressed)

#Return path to extracted dataset
return os.path.join(path, self.dirname)
Contributor:

This method looks quite generic, which is nice, so I would move it up in the class hierarchy.

Contributor Author:

ok
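A rough sketch of what a generic version of that helper could look like once moved up the hierarchy (or into ProblemInitializer); it assumes a plain gzip archive, and the function and parameter names are illustrative:

import gzip
import os
import shutil
import urllib.request

def download_and_extract(url, data_dir, dirname):
    """Sketch: download a .gz archive if it is not cached yet, decompress it,
    and return the path to the extracted directory."""
    os.makedirs(data_dir, exist_ok=True)
    archive_path = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    extracted_path = archive_path[:-3]  # strip the ".gz" suffix
    if not os.path.exists(extracted_path):
        with gzip.open(archive_path, 'rb') as gz, open(extracted_path, 'wb') as uncompressed:
            shutil.copyfileobj(gz, uncompressed)
    # Return path to the extracted dataset, as in the original method.
    return os.path.join(data_dir, dirname)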


from miprometheus.problems.problem import Problem
import torch
from miprometheus.utils.app_state import AppState
Contributor:

No need because Problem already has

self.app_state = AppState()

Contributor Author:

OK, but some classes like TextToTextProblem do not see it somehow.

@@ -75,6 +77,86 @@ def evaluate_loss(self, data_dict, logits):
return loss


def to_dictionary_indexes(self, dictionary, sentence):
Contributor:

These should be moved to TextToText or another class related to NLP


""" build embeddings from the chosen database / Example: glove.6B.100d """

self.language.build_pretrained_vocab(text, vectors=self.embedding_type, tokenize=self.tokenize)
Contributor:

Regarding the problem with the size of embeddings...

I think that language (by the way, I think this is a bad name; it means exactly nothing, and if I understand it correctly it is really a... TextEncoder?) should set and return that value (i.e. the size of the embeddings)...

@tkornuta-ibm (Contributor)

This pull request introduces 4 alerts when merging 0c052d4 into 3d4f868 - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for Unused local variable
  • 1 for Unused import
  • 1 for Missing call to __init__ during object initialization

Comment posted by LGTM.com
