Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix/fix missing extensions in file detection #3926

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rbiseck3
Copy link
Contributor

Description

Currently the logic breaks if the filename of the file does not include the extension and we explicitly index expecting it to exist, causing an error. This checks for that to exist first.

@rbiseck3 rbiseck3 requested a review from scanny February 18, 2025 17:34
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement :)

Let's add a test for each case:

The file-like object (open file) with name but not extension can go here as a third case:
, basically third test-case for this one should do the trick:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L605-L622

Maybe:

# -- case 3: file-like object has name with no extension --
"q3_invoices",

That leaves the file-path case here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L576-L578

Which can be parameterized for the case where a file-name is provided.

Then we also need a case for metadata_file_path here:
https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/file_utils/test_filetype.py#L576-L578

That's same as earlier one I think, like:

# -- case 3: metadata_file_name value has no extension --
"q3_invoices",

@rbiseck3
Copy link
Contributor Author

@scanny unit tests updated

@rbiseck3 rbiseck3 requested a review from scanny February 18, 2025 19:31
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not seeing the complete case coverage we want. Here are the three cases I believe:

  1. file_path is provided (so file is loaded from filesystem), but path has no extension.
  2. file is provided and it has a name (usually because it was loaded from a temp file) and that path doesn't have an extension.
  3. file is provided and doesn't have a name (e.g. BytesIO object) so we fall back to metadata_filename, metadata_filename is present so we use it but it doesn't have an extension.

Case 1: I don't see case 1 covered, that would be by parameterizing this test I believe:
https://github.com/Unstructured-IO/unstructured/pull/3926/files#diff-d88874605ea6403a7c07fd5ebb3bf5d795888e7fb2ecd90f15e60ef2300901a5R576-R578

Something like this:

@pytest.mark.parametrize(
    ("file_name", "expected_value"),
    [
        # -- case 1: file-path has an extension
        ("simple.docx", ".docx"),
        # -- case 2: file-path has no extension
        ("q3_invoices", ""),
    ],
)
def it_derives_the_filename_extension_from_the_file_path_when_one_is_provided(
    self, file_name: str, expected_value: str
):
    ctx = _FileTypeDetectionContext(file_path=example_doc_path(file_name))
    assert ctx.extension == expected_value

Case 2: I think that's covered by the case you added L612-L613.

Case 3: Looks like Case 3 is covered by the new test you added L626-L645 but you added other cases and the name is now misleading ("_nor_metadata" but you provide metadata_file_path.

For Case 3 I'd be inclined to change the test you added to remove the redundant cases and improve the name something like this:

def and_it_returns_empty_string_extension_when_no_file_name_and_metadata_has_no_extension(self):
    with open(example_doc_path("ideas-page.html"), "rb") as f:
        file = io.BytesIO(f.read())
        file.name = None

    assert _FileTypeDetectionContext(file=file, metadata_file_path="q3_invoices").extension == ""

Let me know if you need help understanding how pytest does its thing and how these are needed for the coverage we're looking for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants