Reorganize code base directories into pipeline phases #59

lukehsiao · 2018-06-29T18:53:45Z

This PR starts the process of breaking up the codebase into the different phases of the pipeline.

See #56.

Removes the render_tree function and its dependencies.

Also removes the unused views, which were used in Snorkel for their SparkLabelAnnotator.

senwu · 2018-06-29T20:37:58Z

.travis.yml

@@ -42,7 +42,7 @@ before_script:
 - cd tests/e2e/
 - "./download_data.sh"
 - cd ../..
- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123
+- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123,E203


What do those parameters mean?

These are the flake8 error codes that we will allow (ignore). Full list here [1]. If we format with Black, as I would like to, we can probably drop E123.

ok, go ahead!
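For reference, the codes in the `--ignore` list can be spelled out as in the sketch below. This is illustrative only and not part of the PR: the `IGNORED` mapping and the `flake8_command` helper are hypothetical names, and the descriptions paraphrase the pycodestyle messages for each code.

```python
# Hypothetical helper (not part of this PR): documents each ignored flake8
# code and rebuilds the command line used in .travis.yml.
IGNORED = {
    "E731": "do not assign a lambda expression, use a def",
    "W503": "line break before binary operator",
    "E741": "ambiguous variable name (such as 'l', 'O', or 'I')",
    "E123": "closing bracket does not match indentation of opening line's bracket",
    "E203": "whitespace before ':' (Black formats slices this way)",
}


def flake8_command(package="fonduer", max_line_length=127):
    """Return the flake8 invocation as an argument list."""
    return [
        "flake8",
        package,
        "--count",
        f"--max-line-length={max_line_length}",
        "--statistics",
        "--show-source",
        # Dict keys keep insertion order, so the codes appear as listed above.
        "--ignore=" + ",".join(IGNORED),
    ]


if __name__ == "__main__":
    print(" ".join(flake8_command()))
```

Keeping the codes in one mapping means the reason for each exception sits next to the code itself, which is roughly what commenting the Makefile would achieve.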

senwu · 2018-06-29T20:38:51Z

CHANGELOG.rst

@@ -1,7 +1,19 @@
+Version 0.2.0 (coming soon...)


Let's make it 0.1.9?

Decided to go with 0.2.0 because it will not be backwards compatible. We didn't increment the minor version for the last backward compatible patch, but this one will be much more major, so it would be better to go to 0.2.0 to signify the changes.
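The choice follows one common convention for 0.x releases: a backwards-incompatible release bumps the minor number, and a compatible one bumps the patch number. A minimal sketch of that rule, assuming plain `MAJOR.MINOR.PATCH` strings; `next_version` is a hypothetical helper, and the starting version `0.1.8` is only an illustrative input, not taken from this PR.

```python
def next_version(current, breaking):
    """Bump a 0.x version: minor for breaking changes, patch otherwise."""
    major, minor, patch = (int(part) for part in current.split("."))
    if breaking:
        # A breaking change resets the patch number and bumps the minor one.
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    print(next_version("0.1.8", breaking=False))  # 0.1.9
    print(next_version("0.1.8", breaking=True))   # 0.2.0
```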

senwu · 2018-06-29T20:40:36Z

Makefile

@@ -7,7 +7,7 @@ test: dev check
 	pytest tests -v -rsXx	

 check:
-	flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123
+	flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123,E203


Can you add some comments in the file?

Hmm, this file is already very short. What do you think requires commenting?

I'll comment the specific errors we are allowing.

senwu · 2018-06-29T20:43:34Z

fonduer/__init__.py

 from fonduer.candidates import CandidateExtractor, OmniFigures, OmniNgrams
+from fonduer.candidates.matchers import (


Can you change the order of these imports to follow the pipeline and leave some comments?

These are just following convention and are alphabetical. I wouldn't change them.

Also, this whole file will look very different after the next PR, when we pull imports into submodules instead of having everything right at the root.

senwu · 2018-06-29T20:44:59Z

fonduer/__init__.py

-                              PersonMatcher, RegexMatchEach, RegexMatchSpan,
-                              Union)
-from fonduer.models import Document, Figure, Meta, Phrase, candidate_subclass
+from fonduer.meta import Meta


Maybe put Meta first?

senwu · 2018-06-29T20:52:25Z

fonduer/candidates/models/__init__.py

@@ -0,0 +1,3 @@
+from fonduer.candidates.models.candidate import Candidate, Marginal, candidate_subclass
+
+__all__ = ["Candidate", "Marginal", "candidate_subclass"]


Split Marginal out into learning?

Sure, I'll check it out. Any splitting of these model classes will happen in the next PR.

senwu · 2018-06-29T21:00:36Z

fonduer/candidates/models/candidate.py

+        )
+
+
+def init_models():


What does this function do?

Good catch, this was from when I was going to split the initialization. This should now be deleted.

ok, delete it then! :)

fonduer/features/structural_features.py

+    get_parent_tag,
+    get_prev_sibling_tags,
+    get_tag,
+    lowest_common_ancestor_depth,


senwu · 2018-06-29T21:26:52Z

fonduer/parser/visual.py

            return matches

        N = len(self.html_word_list)
        M = len(self.pdf_word_list)
-        assert (N > 0 and M > 0)
+        assert N > 0 and M > 0


Use logging.

Logging and assertions serve different purposes. Checking and throwing an exception would be a better fix if you don't like the assertion.

Yes, let's use logging!
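One way to reconcile the two positions is an explicit check that both logs and raises. This is a sketch only: `check_word_lists` is a hypothetical helper, not the code that landed in `fonduer/parser/visual.py`, and the choice of `ValueError` is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def check_word_lists(html_word_list, pdf_word_list):
    """Validate that both word lists are non-empty.

    Unlike a bare ``assert``, this check still runs when Python is
    started with ``-O``, which strips assert statements.
    """
    n, m = len(html_word_list), len(pdf_word_list)
    if n == 0 or m == 0:
        message = f"Cannot link words: {n} HTML words, {m} PDF words."
        logger.error(message)  # Record the failure for later debugging.
        raise ValueError(message)  # Stop execution, as the assert did.
    return n, m
```

Logging alone would record the problem but let execution continue with empty lists, so the sketch keeps the hard stop that the original assertion provided.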

senwu · 2018-06-29T21:27:49Z

fonduer/parser/visual.py

@@ -316,7 +340,7 @@ def display_links(self, max_rows=100):
                    pdf.append(b[1])
                    j.append(k)
                    break
-        assert (len(pdf) == len(html))
+        assert len(pdf) == len(html)


Running all fonduer files through Black allows us to remove E123 and will reduce future line diffs.

senwu · 2018-06-30T18:43:07Z

LGTM.

lukehsiao added 19 commits June 22, 2018 14:37

Add changelog placeholder

522ddfb

Remove futures package

5db4617

Update version number

70c6116

Update CHANGELOG

96dcb9f

Move annotation files into supervision submodule

eba02e1

Move lf_helpers into supervision submodule

883fdd4

Move candidates and matchers into submodule

a92a6b7

Move tree_struct into features and simplify

97e4a82

Removes the render_tree function and its dependencies.

Move visual parsing to parser, and visualizer to utils

0910b8f

Move utils to a submodule

457cb92

Restore normal test

71b5c4f

Move UDF into utils

8f001e0

Move feature settings into features submodule

b9e85de

Remove relative imports

9364990

Remove unused views

11a4135

This was used in Snorkel for their SparkLabelAnnotator.

Move candidate model to the candidate phase

baef323

Move context into parser models

958ecf1

Move annotation models to supervision submodule

fb1f754

Move meta to fonduer root

89d8388

lukehsiao added the clean-up label (Cleaning up the code or refactoring) Jun 29, 2018

lukehsiao added this to the v0.2.0 milestone Jun 29, 2018

lukehsiao self-assigned this Jun 29, 2018

lukehsiao requested a review from senwu June 29, 2018 18:53

lukehsiao changed the title ~~Reorganized code base directories into pipeline phases~~ Reorganize code base directories into pipeline phases Jun 29, 2018

lukehsiao added 2 commits June 29, 2018 11:55

Update changelog

b97e2ea

Update changelog

870adec

senwu approved these changes Jun 29, 2018

lukehsiao added 3 commits June 29, 2018 22:25

Remove unused init_models

c68ddd8

Run all fonduer files through black

5355c0d

Allows us to remove E123, and will reduce future line diffs.

Add comments to makefile

af56b39

Update travis tests

13333e9

lukehsiao force-pushed the 0.1.9 branch from aa4b26d to 62bd182 Compare June 30, 2018 17:33

Add logging for assert statements

fd65c6a

lukehsiao force-pushed the 0.1.9 branch from 62bd182 to fd65c6a Compare June 30, 2018 17:33

lukehsiao merged commit 22c11fb into master Jun 30, 2018

lukehsiao deleted the 0.1.9 branch June 30, 2018 19:13
