This repository was archived by the owner on Apr 12, 2026. It is now read-only.

chore(deps): bump unstructured from 0.13.7 to 0.14.0 #121

Merged
Daethyra merged 1 commit into streamlit from dependabot/pip/unstructured-0.14.0 on May 23, 2024

Conversation

@dependabot dependabot bot commented on behalf of github May 20, 2024

Bumps unstructured from 0.13.7 to 0.14.0.

Release notes

Sourced from unstructured's releases.

0.14.0

BREAKING CHANGES

  • Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.
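Callers that depended on table output from PDFs or images will probably need to opt back in after this bump. The following is a minimal sketch, assuming `partition_pdf()` accepts the `strategy` and `infer_table_structure` keyword arguments as in earlier releases; the helper names `pdf_partition_kwargs` and `partition_with_tables` are hypothetical and not part of unstructured.

```python
from typing import Any, Dict


def pdf_partition_kwargs(extract_tables: bool) -> Dict[str, Any]:
    """Build keyword arguments for partition_pdf() (hypothetical helper)."""
    # Assumption: table structure inference only applies to the hi_res strategy.
    kwargs: Dict[str, Any] = {"strategy": "hi_res"}
    if extract_tables:
        # Assumption: with the default now "off", tables are requested explicitly.
        kwargs["infer_table_structure"] = True
    return kwargs


def partition_with_tables(path: str):
    """Partition a PDF with table extraction requested explicitly."""
    # Imported lazily so the kwargs helper works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, **pdf_partition_kwargs(extract_tables=True))
```

Keeping the flag in one helper makes the extra model call, and its slower processing time, an explicit choice at each call site.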

Enhancements

  • Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
  • Faster evaluation. Support concurrent processing of documents during evaluation.
  • Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
  • Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.
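The release note does not say how the two working-directory parameters are supplied. The sketch below assumes they are read from environment variables of the same names and that they must be set before unstructured is imported; the helper `configure_working_dirs` and the directory layout are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def configure_working_dirs(base: Path) -> Dict[str, str]:
    """Point temporary storage at directories under `base` (hypothetical helper)."""
    work_dir = base / "unstructured-work"
    # One subdirectory per process keeps concurrent workers from sharing temp files.
    process_dir = work_dir / str(os.getpid())
    process_dir.mkdir(parents=True, exist_ok=True)

    settings = {
        "GLOBAL_WORKING_DIR": str(work_dir),
        "GLOBAL_WORKING_PROCESS_DIR": str(process_dir),
    }
    # Assumption: the library reads these names from the environment.
    os.environ.update(settings)
    return settings


settings = configure_working_dirs(Path(tempfile.gettempdir()))
```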

Features

  • Add form extraction basics (document elements and placeholder code in partition). This lays the groundwork for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a NotImplementedError.
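Code that wants to try form extraction before the models ship can treat `NotImplementedError` as "not available yet" rather than as a failure. This is a sketch of that guard only; `call_if_implemented` and `form_extraction_stub` are hypothetical names, and the stub stands in for whatever partition call eventually requests form extraction.

```python
from typing import Any, Callable, Optional, Tuple


def call_if_implemented(
    fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[bool, Optional[Any]]:
    """Call fn and report whether the feature behind it is available."""
    try:
        return True, fn(*args, **kwargs)
    except NotImplementedError:
        # Placeholder code path: the feature exists in name only.
        return False, None


def form_extraction_stub(path: str) -> list:
    # Hypothetical stand-in for a partition call with form extraction requested.
    raise NotImplementedError("form extraction models are not available")


available, forms = call_if_implemented(form_extraction_stub, "invoice.pdf")
```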

Fixes

  • Add missing starting_page_num param to partition_image
  • Make the filename and file params for partition_image and partition_pdf match the other partitioners
  • Fix include_slide_notes and include_page_breaks params in partition_ppt
  • Re-apply: skip accuracy calculation feature. Overwritten by mistake.
  • Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.
  • Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.
  • Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.
  • Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
  • Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
  • Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
  • AstraDB: option to prevent indexing metadata
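The raw-string fix in the list above is easy to illustrate with the standard library alone. In a plain string literal Python interprets backslash escapes before the regex engine sees them: a recognized escape such as `\b` silently changes meaning, and an unrecognized one such as `\d` is what triggers the warnings the release note mentions. A small sketch, not taken from unstructured's code:

```python
import re

text = "the cat sat"

# Plain literal: "\b" is the backspace character (U+0008), so this pattern
# searches for backspace + "cat" + backspace and finds nothing.
plain = re.compile("\bcat\b")
plain_match = plain.search(text)

# Raw literal: the backslash reaches the regex engine, where \b is a word boundary.
raw = re.compile(r"\bcat\b")
raw_match = raw.search(text)
```

Here `plain_match` is `None` while `raw_match` matches `cat`, which is why raw strings are the safer default for every regex pattern.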
Changelog

Sourced from unstructured's changelog.

Commits
  • 76831f1 refactor: partition_pdf() pass kwargs through fast strategy pipeline (#...
  • 9cd0e70 fix: reenable arm64 builds for docker (#3045)
  • 1c8b2b2 feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parametere...
  • ec987dc BREAKING CHANGE: revert table extraction off by default for PDFs and images (...
  • df8d39a fix: allow AstraDB to prevent indexing on metadata columns with long text (#3...
  • 934f1a4 fix: disable arm build for chainguard (#3039)
  • f320889 feat(docx): add strategy parameter to DOC and ODT (#3042)
  • 8644a3b fix(odt): fix disk-space leak in partition_odt() (#3037)
  • 0de9215 fix: use raw strings for regex patterns (#3029)
  • e6ada05 Feat: form parsing placeholders (#3034)
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot dependabot bot added the dependencies label ("Pull requests that update a dependency file") May 20, 2024
@dependabot dependabot bot force-pushed the dependabot/pip/unstructured-0.14.0 branch from 198fb4d to ab1a7d3 Compare May 22, 2024 04:08
Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.13.7 to 0.14.0.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.13.7...0.14.0)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot force-pushed the dependabot/pip/unstructured-0.14.0 branch from ab1a7d3 to ad31468 Compare May 22, 2024 04:13
@Daethyra Daethyra merged commit 8399a11 into streamlit May 23, 2024
@dependabot dependabot bot deleted the dependabot/pip/unstructured-0.14.0 branch May 23, 2024 05:35

Labels

dependencies Pull requests that update a dependency file
