Conversation

@danilojsl
Contributor

@danilojsl danilojsl commented Sep 30, 2025

Please merge before #14668

Description

This change lets readers accept a column of type String as input, in addition to the existing input types, so raw text can be passed in directly instead of only through files or external sources.

Motivation and Context

Before this patch, Spark NLP readers only supported inputs via file paths. That means if you already had a DataFrame with text content (say from another pipeline or a preliminary load), you had to write it to disk just to let the reader ingest it. This adds friction and overhead, especially in streaming or in-memory pipelines.

With this change, you can:

  • Feed raw text stored in a DataFrame column directly into Spark NLP readers, with no file I/O when the content is already in memory.
  • Simplify workflows and pipelines (no need for temporary file staging just to “read” back data).
  • Improve performance and resource usage in scenarios where input is already available as strings (e.g. generated, preprocessed, or coming from another system).
  • Make the reader APIs more flexible and general-purpose.

This enhancement broadens the usability of the readers and removes a common impedance mismatch in real-world ETL / NLP workflows.
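
To make the difference concrete, here is a minimal, library-free sketch of the two input paths. It deliberately does not use the Spark NLP API: `parse_text`, `read_from_path`, `read_from_string`, and `read_via_staging` are hypothetical stand-ins invented for illustration, and the "parsing" is just a paragraph split. It only shows why staging in-memory text to disk is redundant once a reader can take strings directly.

```python
import tempfile
from pathlib import Path


def parse_text(text: str) -> list[str]:
    """Hypothetical stand-in for a reader's parsing step: split text into paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def read_from_path(path: str) -> list[str]:
    """Path-only input: the content has to exist on disk before it can be read."""
    return parse_text(Path(path).read_text(encoding="utf-8"))


def read_from_string(text: str) -> list[str]:
    """String input: parse content that is already in memory."""
    return parse_text(text)


def read_via_staging(text: str) -> list[str]:
    """The old workaround: write in-memory text to a temp file just to read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        staged_file = Path(tmp) / "staged.txt"
        staged_file.write_text(text, encoding="utf-8")
        return read_from_path(str(staged_file))


# Text that is already in memory, e.g. the values of a string column.
rows = ["First paragraph.\n\nSecond paragraph.", "Only one paragraph."]

staged = [read_via_staging(t) for t in rows]   # one temp file per row
direct = [read_from_string(t) for t in rows]   # no disk round trip
```

Both paths produce the same parsed output; the direct path simply skips the write and read of a temporary file for every row.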

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl self-assigned this Sep 30, 2025
@danilojsl danilojsl added the new-feature Introducing a new feature label Sep 30, 2025
@danilojsl danilojsl marked this pull request as ready for review October 7, 2025 20:47
@danilojsl danilojsl requested a review from DevinTDHa October 7, 2025 20:47
@DevinTDHa DevinTDHa changed the base branch from master to release/615-release-candidate October 8, 2025 11:42
@DevinTDHa DevinTDHa merged commit fec5e08 into release/615-release-candidate Oct 8, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Oct 8, 2025
10 tasks
