![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_TXT_Reader_Demo.ipynb)

# Introducing the TXT Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().txt()` method in Spark NLP, which parses TXT file content from both local files and real-time URLs into a Spark DataFrame.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading TXT files was introduced in Spark NLP 6.0.0. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [1]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local files example, we will download a TXT file from the Spark NLP GitHub repo:

In [11]:
!mkdir -p txt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt -P txt-files

--2025-03-07 00:33:21--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1113-Adding-support-to-enhance-read-TXT-files/src/test/resources/reader/txt/simple-text.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300 [text/plain]
Saving to: ‘txt-files/simple-text.txt’


2025-03-07 00:33:21 (4.67 MB/s) - ‘txt-files/simple-text.txt’ saved [300/300]
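
If shell commands are not available (for example, outside Colab), the same download step can be sketched in plain Python. This is a minimal sketch using only the standard library; the helper name `download_to_dir` is hypothetical and not part of Spark NLP.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_to_dir(url: str, target_dir: str) -> str:
    """Download `url` into `target_dir` and return the local file path.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    # exist_ok=True keeps re-runs from failing when the directory already exists
    os.makedirs(target_dir, exist_ok=True)
    # Use the last path segment of the URL as the local file name
    file_name = os.path.basename(urlparse(url).path)
    local_path = os.path.join(target_dir, file_name)
    urllib.request.urlretrieve(url, local_path)
    return local_path


# Example call (requires network access), using the URL from the cell above:
# download_to_dir(
#     "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/"
#     "src/test/resources/reader/txt/simple-text.txt",
#     "txt-files",
# )
```

Because the directory is created with `exist_ok=True`, the helper can be called repeatedly without raising an error for an existing `txt-files` folder.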



## Parsing text from Local Files
Use the `txt()` method to parse text file content from local directories.

In [12]:
import sparknlp

txt_df = sparknlp.read().txt("./txt-files")
txt_df.show()

+--------------------+--------------------+
|                path|                 txt|
+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|
+--------------------+--------------------+



In [13]:
txt_df.select("txt").show(truncate=False)

+---+
|txt|
+---+
...
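
The `txt` column displays as a list of struct-like elements (the first one starts with `{Title, BIG DATA...`), so it can be easier to inspect with one element per row. The sketch below assumes `txt` is an array of structs, as the output above suggests; the helper name `explode_txt_elements` is hypothetical, and the struct's field names are deliberately not hard-coded because they are not shown in this notebook.

```python
def explode_txt_elements(df, column="txt"):
    """Return one row per parsed element, with the struct fields as columns.

    Hypothetical helper; assumes `column` is an array of structs and that
    the DataFrame has a `path` column, as in the output shown above.
    """
    # Imported inside the function so the helper can be defined
    # before the Spark session and its libraries are set up
    from pyspark.sql import functions as F

    exploded = df.select("path", F.explode(F.col(column)).alias("element"))
    # "element.*" expands every field of the struct into its own column
    return exploded.select("path", "element.*")


# Example call, assuming `txt_df` from the cell above:
# explode_txt_elements(txt_df).show(truncate=False)
```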

You can also read from distributed file systems such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
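
When a pipeline accepts paths from several sources, it can help to check which kind of storage a path points to before handing it to the reader. The following is a small sketch using only the standard library; the helper `storage_kind` and the `DISTRIBUTED_SCHEMES` set are hypothetical and simply mirror the schemes listed above.

```python
from urllib.parse import urlparse

# Schemes listed in this notebook; extend the set for your own environment
DISTRIBUTED_SCHEMES = {"dbfs", "hdfs", "abfss"}


def storage_kind(path: str) -> str:
    """Classify a path as 'local', 'distributed', or 'other' by its URI scheme.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    scheme = urlparse(path).scheme
    if scheme in DISTRIBUTED_SCHEMES:
        return "distributed"
    # Paths without a scheme, or with file:, point to the local file system
    if scheme in ("", "file"):
        return "local"
    return "other"
```

For example, `storage_kind("./txt-files")` returns `"local"`, while a path starting with `hdfs://` returns `"distributed"`.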

### Configuration Parameters

- `titleLengthSize`: Sets the length threshold, in characters, used to decide whether a piece of text should be treated as a title. The default is 50. If your text files need a different threshold, adjust this parameter accordingly. The example below changes the setting:

In [19]:
params = {"titleLengthSize": "5"}
txt_df = sparknlp.read(params).txt("./txt-files")
txt_df.show(truncate=False)

+----+---+
|path|txt|
+----+---+
...
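
To build intuition for what a length threshold like `titleLengthSize` does, here is a deliberately simplified pure-Python sketch. It is an assumption made for illustration, not Spark NLP's actual implementation: the real reader may apply additional checks, and the `NarrativeText` label and the helper name `classify_line` are hypothetical.

```python
def classify_line(line: str, title_length_size: int = 50) -> str:
    """Label a line as 'Title' or 'NarrativeText' using only its length.

    Simplified illustration; not the logic used inside Spark NLP.
    """
    text = line.strip()
    # Assumption: text no longer than the threshold counts as a title
    if 0 < len(text) <= title_length_size:
        return "Title"
    return "NarrativeText"


# With the default threshold, a short heading is labeled as a title;
# with a threshold of 5, the same heading no longer qualifies.
default_label = classify_line("BIG DATA ANALYTICS")
strict_label = classify_line("BIG DATA ANALYTICS", title_length_size=5)
```

Under this simplified rule, lowering the threshold makes fewer lines qualify as titles, which is the kind of change to look for when comparing the output above with the default run.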

- `storeContent`: Set to `false` by default. When set to `true`, the output includes an additional `content` column holding the raw content of the file.

In [18]:
params = {"storeContent": "true"}
txt_df = sparknlp.read(params).txt("./txt-files")
txt_df.show()

+--------------------+--------------------+--------------------+
|                path|                 txt|             content|
+--------------------+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|BIG DATA ANALYTIC...|
+--------------------+--------------------+--------------------+
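
Both examples above pass parameter values as strings (`"5"`, `"true"`). When settings come from Python code as integers or booleans, a small helper can convert them to that form. This is a sketch; `build_reader_params` is a hypothetical name, and whether the reader also accepts non-string values is not shown in this notebook, so the helper always produces strings.

```python
def build_reader_params(**options) -> dict:
    """Convert Python values into the string-valued dict used in the cells above.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    params = {}
    for key, value in options.items():
        if isinstance(value, bool):
            # Match the lowercase "true"/"false" spelling used in this notebook
            params[key] = "true" if value else "false"
        else:
            params[key] = str(value)
    return params


params = build_reader_params(titleLengthSize=5, storeContent=True)
# params can then be passed as in the cells above:
# txt_df = sparknlp.read(params).txt("./txt-files")
```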

