Reorganize code base directories into pipeline phases #59
Removes the render_tree function and its dependencies.
This was used in Snorkel for their SparkLabelAnnotator.
.travis.yml
Outdated
@@ -42,7 +42,7 @@ before_script:
- cd tests/e2e/
- "./download_data.sh"
- cd ../..
- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123
- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123,E203
What do those parameters mean?
These are the flake8 error codes that we will allow (ignore). Full list here [1]. If we format with Black, as I would like to, we can probably drop E123.
ok, go ahead!
@@ -1,7 +1,19 @@
Version 0.2.0 (coming soon...)
Let's make it 0.1.9?
I decided to go with 0.2.0 because it will not be backwards compatible. We didn't increment the minor version in the last backward compatible patch, but this one will be much more major, so it would be better to go to 0.2.0 to signify the changes.
ok!
Makefile
Outdated
@@ -7,7 +7,7 @@ test: dev check
pytest tests -v -rsXx

check:
flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123
flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123,E203
Can you add some comments in the file?
Hmm, this file is already very short. What do you think requires commenting?
I'll comment the specific errors we are allowing.
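For reference, a minimal sketch of what commenting the allowed errors could look like, written in Python rather than in the Makefile itself. The codes come from the `--ignore` list in this diff and the descriptions follow the pycodestyle documentation; `IGNORED_CODES` and `build_flake8_command` are hypothetical names used only for illustration, not part of fonduer.

```python
# Codes are copied from the --ignore list in this PR. The descriptions
# paraphrase the pycodestyle documentation.
IGNORED_CODES = {
    "E731": "do not assign a lambda expression, use a def",
    "W503": "line break before binary operator",
    "E741": "ambiguous variable name such as 'l', 'O', or 'I'",
    "E123": "closing bracket does not match indentation of opening line's bracket",
    "E203": "whitespace before ',', ';', or ':' (clashes with Black's slice style)",
}


def build_flake8_command(package="fonduer", max_line_length=127):
    """Assemble a flake8 invocation equivalent to the Makefile's check target."""
    return [
        "flake8",
        package,
        "--count",
        f"--max-line-length={max_line_length}",
        "--statistics",
        # Dicts preserve insertion order, so the codes appear as listed above.
        "--ignore=" + ",".join(IGNORED_CODES),
    ]


if __name__ == "__main__":
    print(" ".join(build_flake8_command()))
```

Keeping the codes next to their descriptions makes it easier to see which exceptions can be dropped later, for example E123 once the code is formatted with Black.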
from fonduer.candidates import CandidateExtractor, OmniFigures, OmniNgrams
from fonduer.candidates.matchers import (
Can you change the order of these imports to follow the pipeline and leave some comments?
These are just following convention and are alphabetical. I wouldn't change them.
Also, this whole file will look very different after the next PR, when we pull imports into submodules instead of having everything right at the root.
PersonMatcher, RegexMatchEach, RegexMatchSpan,
Union)
from fonduer.models import Document, Figure, Meta, Phrase, candidate_subclass
from fonduer.meta import Meta
Maybe put Meta first?
See above.
@@ -0,0 +1,3 @@
from fonduer.candidates.models.candidate import Candidate, Marginal, candidate_subclass

__all__ = ["Candidate", "Marginal", "candidate_subclass"]
Split Marginal out into the learning phase?
Sure, I'll check it out. Any splitting of these model classes will happen in the next PR.
great!
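For readers unfamiliar with the `__all__` line in the hunk above, here is a small, self-contained sketch of what it controls: which names a wildcard import picks up. The module name `demo_models` and the `internal_helper` function are hypothetical stand-ins built in memory; this is not fonduer's actual package layout.

```python
import sys
import types

# Build a throwaway module that mimics an __init__.py re-exporting three
# names. "demo_models" and "internal_helper" are made up for this sketch.
demo = types.ModuleType("demo_models")
exec(
    "class Candidate: pass\n"
    "class Marginal: pass\n"
    "def candidate_subclass(name):\n"
    "    return type(name, (Candidate,), {})\n"
    "def internal_helper():\n"
    "    return None\n"
    '__all__ = ["Candidate", "Marginal", "candidate_subclass"]\n',
    demo.__dict__,
)
sys.modules["demo_models"] = demo

# A wildcard import copies only the names listed in __all__.
namespace = {}
exec("from demo_models import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))

if __name__ == "__main__":
    print(exported)
```

Because `internal_helper` is not listed in `__all__`, the wildcard import skips it, although it remains reachable as `demo_models.internal_helper`. Moving `Marginal` to another subpackage later would then mostly be a matter of changing where it is imported from and listed.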
)


def init_models():
What does this function do?
Good catch, this was from when I was going to split the initialization. This should now be deleted.
ok, delete it then! :)
get_parent_tag,
get_prev_sibling_tags,
get_tag,
lowest_common_ancestor_depth,
fonduer/parser/visual.py
Outdated
return matches

N = len(self.html_word_list)
M = len(self.pdf_word_list)
assert (N > 0 and M > 0)
assert N > 0 and M > 0
use logging
Logging and assertions serve different purposes. Checking and throwing an exception would be a better fix if you don't like the assertion.
Yes, let's use logging!
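To make the distinction in this thread concrete, here is a minimal sketch of the "check and throw an exception" option, combined with a log message. `check_word_lists` is a hypothetical helper, not code from `fonduer/parser/visual.py`; only the two list names mirror the attributes in the diff. One practical difference worth noting: `assert` statements are skipped entirely when Python runs with `-O`, while an explicit check always executes.

```python
import logging

logger = logging.getLogger(__name__)


def check_word_lists(html_word_list, pdf_word_list):
    """Validate both word lists and return their lengths.

    Hypothetical replacement for `assert N > 0 and M > 0`.
    """
    n, m = len(html_word_list), len(pdf_word_list)
    if n == 0 or m == 0:
        # Logging records what happened; raising stops execution so the
        # caller cannot continue with an empty list.
        logger.error("Empty word list: html=%d, pdf=%d", n, m)
        raise ValueError(
            f"Expected non-empty word lists, got html={n}, pdf={m}"
        )
    return n, m
```

Logging on its own would record the problem but let the function continue with invalid input, which is why the two are paired here rather than treated as interchangeable.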
fonduer/parser/visual.py
Outdated
@@ -316,7 +340,7 @@ def display_links(self, max_rows=100):
pdf.append(b[1])
j.append(k)
break
assert (len(pdf) == len(html))
assert len(pdf) == len(html)
logging?
See above.
Allows us to remove E123, and will reduce future line diffs.
LGTM.
This PR starts the process of breaking up the codebase into the different phases of the pipeline.
See #56.