Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize code base directories into pipeline phases #59

Merged
merged 26 commits into from Jun 30, 2018
Merged

Conversation

lukehsiao
Copy link
Contributor

This PR starts the process of breaking up the codebase into the different phases of the pipeline.

See #56.

@lukehsiao lukehsiao added the clean-up Cleaning up the code or refactoring label Jun 29, 2018
@lukehsiao lukehsiao added this to the v0.2.0 milestone Jun 29, 2018
@lukehsiao lukehsiao self-assigned this Jun 29, 2018
@lukehsiao lukehsiao requested a review from senwu June 29, 2018 18:53
@lukehsiao lukehsiao changed the title Reorganized code base directories into pipeline phases Reorganize code base directories into pipeline phases Jun 29, 2018
.travis.yml Outdated
@@ -42,7 +42,7 @@ before_script:
- cd tests/e2e/
- "./download_data.sh"
- cd ../..
- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123
- flake8 fonduer --count --max-line-length=127 --statistics --show-source --ignore=E731,W503,E741,E123,E203
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do those parameters mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions that we will allow. Full list here [1]. If we format with Black like I would like to, we can probably drop E123.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, go ahead!

@@ -1,7 +1,19 @@
Version 0.2.0 (coming soon...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it 0.1.9?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to go with 0.2.0 because it will not be backwards compatible. We didn't increment the minor version last backward compatible patch, but this one will be much more major, it would be better to go to 0.2.0 to signify the changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

Makefile Outdated
@@ -7,7 +7,7 @@ test: dev check
pytest tests -v -rsXx

check:
flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123
flake8 fonduer --count --max-line-length=127 --statistics --ignore=E731,W503,E741,E123,E203
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add some comments in the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this file is already very short. What do you think requires commenting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll comment the specific errors we are allowing.

from fonduer.candidates import CandidateExtractor, OmniFigures, OmniNgrams
from fonduer.candidates.matchers import (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the order of these imports to follow the pipeline and leave some comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just following convention and are alphabetical. I wouldn't change them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this whole file will look very very different after the next PR when we pull imports into submodules instead of having everything right at the root.

PersonMatcher, RegexMatchEach, RegexMatchSpan,
Union)
from fonduer.models import Document, Figure, Meta, Phrase, candidate_subclass
from fonduer.meta import Meta
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put the Meta the first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@@ -0,0 +1,3 @@
from fonduer.candidates.models.candidate import Candidate, Marginal, candidate_subclass

__all__ = ["Candidate", "Marginal", "candidate_subclass"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split Marginal to the learning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll check it out. Any splitting of these model classes will happen in the next PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

)


def init_models():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this function do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this was from when I was going to split the initialization. This should now be deleted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, delete it then! :)

get_parent_tag,
get_prev_sibling_tags,
get_tag,
lowest_common_ancestor_depth,

This comment was marked as resolved.

This comment was marked as resolved.

return matches

N = len(self.html_word_list)
M = len(self.pdf_word_list)
assert (N > 0 and M > 0)
assert N > 0 and M > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging and assertions serve different purposes. Checking and throwing an exception would be a better fix if you don't like the assertion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's use logging!

@@ -316,7 +340,7 @@ def display_links(self, max_rows=100):
pdf.append(b[1])
j.append(k)
break
assert (len(pdf) == len(html))
assert len(pdf) == len(html)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@senwu
Copy link
Collaborator

senwu commented Jun 30, 2018

LGTM.

@lukehsiao lukehsiao merged commit 22c11fb into master Jun 30, 2018
@lukehsiao lukehsiao deleted the 0.1.9 branch June 30, 2018 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clean-up Cleaning up the code or refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants