-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfix/fix missing extensions in file detection #3926
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement :)
Let's add a test for each case:
The file-like object (open file) with name but not extension can go here as a third case:
, basically third test-case for this one should do the trick:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L605-L622
Maybe:
# -- case 3: file-like object has name with no extension --
"q3_invoices",
That leaves the file-path case here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L576-L578
Which can be parameterized for the case where a file-name is provided.
Then we also need a case for metadata_file_path here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L576-L578
That's same as earlier one I think, like:
# -- case 3: metadata_file_name value has no extension --
"q3_invoices",
@scanny unit tests updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not seeing the complete case coverage we want. Here are the three cases I believe:
file_path
is provided (so file is loaded from filesystem), but path has no extension.file
is provided and it has a name (usually because it was loaded from a temp file) and that path doesn't have an extension.file
is provided and doesn't have a name (e.g. BytesIO object) so we fall back tometadata_filename
,metadata_filename
is present so we use it but it doesn't have an extension.
Case 1: I don't see case 1 covered, that would be by parameterizing this test I believe:
https://github.com/Unstructured-IO/unstructured/pull/3926/files#diff-d88874605ea6403a7c07fd5ebb3bf5d795888e7fb2ecd90f15e60ef2300901a5R576-R578
Something like this:
@pytest.mark.parametrize(
("file_name", "expected_value"),
[
# -- case 1: file-path has an extension
("simple.docx", ".docx"),
# -- case 2: file-path has no extension
("q3_invoices", ""),
],
)
def it_derives_the_filename_extension_from_the_file_path_when_one_is_provided(
self, file_name: str, expected_value: str
):
ctx = _FileTypeDetectionContext(file_path=example_doc_path(file_name))
assert ctx.extension == expected_value
Case 2: I think that's covered by the case you added L612-L613.
Case 3: Looks like Case 3 is covered by the new test you added L626-L645 but you added other cases and the name is now misleading ("_nor_metadata" but you provide metadata_file_path
.
For Case 3 I'd be inclined to change the test you added to remove the redundant cases and improve the name something like this:
def and_it_returns_empty_string_extension_when_no_file_name_and_metadata_has_no_extension(self):
with open(example_doc_path("ideas-page.html"), "rb") as f:
file = io.BytesIO(f.read())
file.name = None
assert _FileTypeDetectionContext(file=file, metadata_file_path="q3_invoices").extension == ""
Let me know if you need help understanding how pytest
does its thing and how these are needed for the coverage we're looking for.
Description
Currently the logic breaks if the filename of the file does not include the extension and we explicitly index expecting it to exist, causing an error. This checks for that to exist first.