[SPARKNLP-1291] Adding support for input string column on readers #14665

danilojsl · 2025-09-30T15:38:08Z

Please merge before #14668

Description

This change enables readers to accept a column of type String as input (in addition to existing types), so you can provide raw text directly rather than only files or external sources.

Motivation and Context

Before this patch, Spark NLP readers only supported inputs via file paths. That meant that if you already had a DataFrame with text content (say, from another pipeline or a preliminary load), you had to write it to disk just so the reader could ingest it. This added friction and overhead, especially in streaming or in-memory pipelines.

With this change, you can:

Feed raw text stored in a DataFrame column directly into Spark NLP readers, with no intermediate write to disk.
Simplify workflows and pipelines (no need for temporary file staging just to “read” back data).
Improve performance and resource usage in scenarios where input is already available as strings (e.g. generated, preprocessed, or coming from another system).
Make the reader APIs more flexible and general-purpose.

This enhancement broadens the usability of the readers and removes a common impedance mismatch in real-world ETL / NLP workflows.

How Has This Been Tested?

Screenshots (if appropriate):

Local Tests
Google Colab

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

[SPARKNLP-1291] Adding support for input string column on readers

21e651d

danilojsl self-assigned this Sep 30, 2025

danilojsl added the new-feature Introducing a new feature label Sep 30, 2025

[SPARKNLP-1291] Removing unnecessary tests

9bc3323

danilojsl marked this pull request as ready for review October 7, 2025 20:47

danilojsl requested a review from DevinTDHa October 7, 2025 20:47

danilojsl mentioned this pull request Oct 7, 2025

[SPARKNLP-1290] Introducing ReaderAssembler Annotator #14668

Merged

10 tasks

DevinTDHa changed the base branch from master to release/615-release-candidate October 8, 2025 11:42

DevinTDHa approved these changes Oct 8, 2025

DevinTDHa merged commit fec5e08 into release/615-release-candidate Oct 8, 2025
4 checks passed

DevinTDHa mentioned this pull request Oct 8, 2025

Spark NLP 6.1.5 Release #14669

Merged

10 tasks

3 participants
