![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/util/Training_Helpers.ipynb)

In this notebook, we will explore the file systems supported by the `CoNLL`, `CoNLLU` and `POS` training classes in Spark NLP.

In [3]:
!mkdir datasets

In [4]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conllu/en.test.conllu
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll/test_conll_docid.txt
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/anc-pos-corpus-small/test-training.txt

--2023-03-15 00:27:56--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conllu/en.test.conllu
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1864 (1.8K) [text/plain]
Saving to: ‘en.test.conllu’


2023-03-15 00:27:56 (17.0 MB/s) - ‘en.test.conllu’ saved [1864/1864]

--2023-03-15 00:27:56--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll/test_conll_docid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 696 [text/plain]
Saving to: ‘test_conll_docid.txt’

In [5]:
!mv en.test.conllu ./datasets
!mv test_conll_docid.txt ./datasets
!mv test-training.txt ./datasets

Spark NLP supports the file systems below:


* Local file system: `file://` or `/my/path/`

* Distributed file system: `hdfs://` or `dbfs://`

* Cloud buckets: `s3a://` or `s3://`
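As a rough illustration of the grouping above, the sketch below classifies a dataset path by its URI scheme using only the standard library. The function name `storage_kind` and the group labels are hypothetical and are not part of Spark NLP.

```python
from urllib.parse import urlparse

# Scheme groups taken from the list above; the labels are illustrative only.
SCHEME_GROUPS = {
    "": "local",        # plain paths such as /my/path/ carry no scheme
    "file": "local",
    "hdfs": "distributed",
    "dbfs": "distributed",
    "s3": "cloud",
    "s3a": "cloud",
}


def storage_kind(path: str) -> str:
    """Return 'local', 'distributed', 'cloud', or 'unknown' for a dataset path."""
    scheme = urlparse(path).scheme.lower()
    return SCHEME_GROUPS.get(scheme, "unknown")


for example in [
    "./datasets/en.test.conllu",
    "file:///content/datasets/en.test.conllu",
    "s3a://my-bucket/datasets/en.test.conllu",  # hypothetical bucket name
]:
    print(example, "->", storage_kind(example))
```

A check like this can help catch a mistyped scheme before a read is attempted.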



Starting with spark-nlp 4.4.1, you can also set an S3 URI. To do so, the Spark session must be set up with the appropriate settings for both Spark NLP and Spark ML.

### Spark NLP Settings for S3

Spark NLP requires the following configuration:
1. **S3 Region**: Spark NLP needs the region in order to upload a file to your S3 bucket. It is defined in the config `spark.jsl.settings.aws.region`.
2. **Spark NLP JAR**: Using an S3 URI needs some custom configurations, so the spark-nlp JAR must also be included, either as a dependency of your application or during Spark session creation. Since we are using a notebook, we will add it while creating the Spark session, in one of the following configs:

- `spark.jars.packages` for Maven coordinates or `spark.jars` for a FAT JAR
3. **KryoSerializer**: We also recommend adding the parameters described for manually creating a Spark session in the requirements section of the [Spark NLP documentation](https://github.com/JohnSnowLabs/spark-nlp#requirements).
4. **Authenticating with S3**: This is needed to interact with external S3 buckets, and it requires an access key, a secret key, and a session token. Define the values in these configs:

- `spark.jsl.settings.aws.credentials.access_key_id`
- `spark.jsl.settings.aws.credentials.secret_access_key`
- `spark.jsl.settings.aws.credentials.session_token`

This configuration depends on your S3 bucket and AWS setup. This notebook showcases a connection through **Temporary Security Credentials**. **Please contact your administrator to choose the right setup, as well as the required keys/tokens.**
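The Spark NLP settings listed above can be gathered in one place before the session is built. The following is a minimal sketch: the helper name `spark_nlp_s3_settings` is hypothetical, and reading the values from the `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` environment variables is an assumption, not something this notebook does.

```python
import os


def spark_nlp_s3_settings(region, access_key, secret_key, session_token):
    """Collect the Spark NLP S3 settings named above into a single dict."""
    return {
        "spark.jsl.settings.aws.region": region,
        "spark.jsl.settings.aws.credentials.access_key_id": access_key,
        "spark.jsl.settings.aws.credentials.secret_access_key": secret_key,
        "spark.jsl.settings.aws.credentials.session_token": session_token,
    }


# Assumption: credentials are exported as environment variables.
nlp_settings = spark_nlp_s3_settings(
    region=os.environ.get("AWS_REGION", "us-east-1"),
    access_key=os.environ.get("AWS_ACCESS_KEY_ID", ""),
    secret_key=os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    session_token=os.environ.get("AWS_SESSION_TOKEN", ""),
)

# Print only the keys so that no secret ends up in the notebook output.
for key in nlp_settings:
    print(key)
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)`.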


### Spark ML Settings for S3

1. **AWS packages**: Alongside `hadoop-common` and its dependencies, S3A depends on two JARs: the `hadoop-aws` and `aws-java-sdk` packages. You will need to add these dependencies either to your application or to your Spark session. Since we are using a notebook, we will add these packages while creating the Spark session, in the following config:

- `spark.jars.packages`
2. **AWS File System**: `S3AFileSystem` must also be defined so that S3 can be accessed through the AWS SDK. Define the value in this config:

- `spark.hadoop.fs.s3a.impl`
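The two Spark ML settings can be assembled the same way. This is a sketch: the helper name `spark_ml_s3_settings` is hypothetical, and the Maven coordinates are the ones used in the session cell of this notebook, so the versions may need adjusting for your Spark and Hadoop build.

```python
# Maven coordinates as used in this notebook's session cell.
PACKAGES = [
    "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1",
    "org.apache.hadoop:hadoop-aws:3.3.1",
    "com.amazonaws:aws-java-sdk:1.11.901",
]


def spark_ml_s3_settings(packages):
    """Build the two Spark ML settings named above from Maven coordinates."""
    for coordinate in packages:
        # Each coordinate is expected to look like group:artifact:version.
        if len(coordinate.split(":")) != 3:
            raise ValueError(f"not a group:artifact:version coordinate: {coordinate}")
    return {
        "spark.jars.packages": ",".join(packages),
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


ml_settings = spark_ml_s3_settings(PACKAGES)
for key, value in ml_settings.items():
    print(key, "=", value)
```

`spark.jars.packages` takes a single comma-separated string, which is why the coordinates are joined rather than set one by one.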

Now, let's look at the Spark session creation below to see how each configuration and its value is defined for **Temporary Security Credentials**:

In [None]:
print("Enter your AWS Access Key:")
MY_ACCESS_KEY = input()

In [None]:
print("Enter your AWS Secret Key:")
MY_SECRET_KEY = input()

In [None]:
print("Enter your AWS Session Token:")
MY_SESSION_KEY = input()
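Note that `input()` echoes whatever is typed into the cell output. A possible alternative, sketched here under the assumption that hiding the typed value is desirable, is the standard library's `getpass`. The helper `read_secret` and the environment-variable fallback are hypothetical additions.

```python
import os
from getpass import getpass


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set, otherwise prompt without echoing."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(prompt)


# Example usage (commented out so the cell does not block waiting for input):
# MY_ACCESS_KEY = read_secret("AWS_ACCESS_KEY_ID", "Enter your AWS Access Key: ")
# MY_SECRET_KEY = read_secret("AWS_SECRET_ACCESS_KEY", "Enter your AWS Secret Key: ")
# MY_SESSION_KEY = read_secret("AWS_SESSION_TOKEN", "Enter your AWS Session Token: ")
```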

In [10]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars", "./sparknlp.jar") \
    .config("spark.jsl.settings.aws.credentials.access_key_id", MY_ACCESS_KEY) \
    .config("spark.jsl.settings.aws.credentials.secret_access_key", MY_SECRET_KEY) \
    .config("spark.jsl.settings.aws.credentials.session_token", MY_SESSION_KEY) \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.901") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

In [11]:
from sparknlp.pretrained import ResourceDownloader

## CoNLLU Examples

In [12]:
from sparknlp.training import CoNLLU

conllu_df = CoNLLU().readDataset(spark, "./datasets/en.test.conllu")

In [13]:
conllu_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|                form|                upos|                xpos|               lemma|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|What if Google Mo...|[{document, 0, 36...|[{document, 0, 36...|[{token, 0, 3, Wh...|[{pos, 0, 3, PRON...|[{pos, 0, 3, WP, ...|[{token, 0, 3, wh...|
|Google is a nice ...|[{document, 0, 30...|[{document, 0, 30...|[{token, 0, 5, Go...|[{pos, 0, 5, PROP...|[{pos, 0, 5, NNP,...|[{token, 0, 5, Go...|
|Does anybody use ...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 3, Do...|[{pos, 0, 3, AUX,...|[{pos, 0, 3, VBZ,...|[{token, 0, 3, do...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
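The `form`, `upos`, `xpos` and `lemma` columns correspond to fields of the CoNLL-U format, where each token line holds ten tab-separated fields. The plain-Python sketch below shows that layout for illustration only. It is not how `CoNLLU().readDataset` is implemented, and the sample text is hypothetical rather than copied from `en.test.conllu`.

```python
# The ten tab-separated fields of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text):
    """Yield one list of token dicts per sentence; comment lines are skipped."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical two-token sample.
sample = (
    "# text = What if\n"
    "1\tWhat\twhat\tPRON\tWP\t_\t0\troot\t_\t_\n"
    "2\tif\tif\tSCONJ\tIN\t_\t1\tdep\t_\t_\n"
)

sentences = list(parse_conllu(sample))
for token in sentences[0]:
    print(token["form"], token["upos"], token["xpos"], token["lemma"])
```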

In [14]:
conllu_df2 = CoNLLU().readDataset(spark, "file:///content/datasets/en.test.conllu")

In [15]:
conllu_df2.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|                form|                upos|                xpos|               lemma|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|What if Google Mo...|[{document, 0, 36...|[{document, 0, 36...|[{token, 0, 3, Wh...|[{pos, 0, 3, PRON...|[{pos, 0, 3, WP, ...|[{token, 0, 3, wh...|
|Google is a nice ...|[{document, 0, 30...|[{document, 0, 30...|[{token, 0, 5, Go...|[{pos, 0, 5, PROP...|[{pos, 0, 5, NNP,...|[{token, 0, 5, Go...|
|Does anybody use ...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 3, Do...|[{pos, 0, 3, AUX,...|[{pos, 0, 3, VBZ,...|[{token, 0, 3, do...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [16]:
conllu_df3 = CoNLLU().readDataset(spark, "s3://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/en.test.conllu")

In [17]:
conllu_df3.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|                form|                upos|                xpos|               lemma|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|What if Google Mo...|[{document, 0, 36...|[{document, 0, 36...|[{token, 0, 3, Wh...|[{pos, 0, 3, PRON...|[{pos, 0, 3, WP, ...|[{token, 0, 3, wh...|
|Google is a nice ...|[{document, 0, 30...|[{document, 0, 30...|[{token, 0, 5, Go...|[{pos, 0, 5, PROP...|[{pos, 0, 5, NNP,...|[{token, 0, 5, Go...|
|Does anybody use ...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 3, Do...|[{pos, 0, 3, AUX,...|[{pos, 0, 3, VBZ,...|[{token, 0, 3, do...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [18]:
conllu_df4 = CoNLLU().readDataset(spark, "s3a://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/en.test.conllu")

In [19]:
conllu_df4.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|                form|                upos|                xpos|               lemma|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|What if Google Mo...|[{document, 0, 36...|[{document, 0, 36...|[{token, 0, 3, Wh...|[{pos, 0, 3, PRON...|[{pos, 0, 3, WP, ...|[{token, 0, 3, wh...|
|Google is a nice ...|[{document, 0, 30...|[{document, 0, 30...|[{token, 0, 5, Go...|[{pos, 0, 5, PROP...|[{pos, 0, 5, NNP,...|[{token, 0, 5, Go...|
|Does anybody use ...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 3, Do...|[{pos, 0, 3, AUX,...|[{pos, 0, 3, VBZ,...|[{token, 0, 3, do...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

## CoNLL Examples

In [20]:
from sparknlp.training import CoNLL

conll_df = CoNLL().readDataset(spark, "./datasets/test_conll_docid.txt")

In [21]:
conll_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..

In [22]:
conll_df2 = CoNLL().readDataset(spark, "file:///content/datasets/test_conll_docid.txt")

In [23]:
conll_df2.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..

In [24]:
conll_df22 = CoNLL().readDataset(spark, "/content/datasets/test_conll_docid.txt")

In [25]:
conll_df22.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..

In [26]:
conll_df3 = CoNLL().readDataset(spark, "s3://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/test_conll_docid.txt")

In [27]:
conll_df3.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..

In [28]:
conll_df4 = CoNLL().readDataset(spark, "s3a://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/*")

In [29]:
conll_df4.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             # # # #|[{document, 0, 6,...|[{document, 0, 6,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|               # # #|[{document, 0, 4,...|[{document, 0, 4,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|                 # #|[{document, 0, 2,...|[{document, 0, 2,...|[{token, 0, 0, #,...|[{pos, 0, 0, sent...|[{named_entity, 0...|
|I Today Muiriel T...|[{document, 0, 55...|[{document, 0, 55...|[{token, 0, 0, I,...|[{pos, 0, 0, have...|[{named_entity, 0...|
|I Why But I I It ...|[{document, 0, 44...|[{document, 0, 44...|[{token, 0, 0, I,...|[{pos, 0, 0, don'..

In [30]:
conll_df5 = CoNLL().readDataset(spark, "s3://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/*")

In [31]:
conll_df5.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             # # # #|[{document, 0, 6,...|[{document, 0, 6,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|               # # #|[{document, 0, 4,...|[{document, 0, 4,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|                 # #|[{document, 0, 2,...|[{document, 0, 2,...|[{token, 0, 0, #,...|[{pos, 0, 0, sent...|[{named_entity, 0...|
|I Today Muiriel T...|[{document, 0, 55...|[{document, 0, 55...|[{token, 0, 0, I,...|[{pos, 0, 0, have...|[{named_entity, 0...|
|I Why But I I It ...|[{document, 0, 44...|[{document, 0, 44...|[{token, 0, 0, I,...|[{pos, 0, 0, don'..

In [32]:
conll_df6 = CoNLL().readDataset(spark, "./datasets/*")

In [33]:
conll_df6.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             # # # #|[{document, 0, 6,...|[{document, 0, 6,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|               # # #|[{document, 0, 4,...|[{document, 0, 4,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|                 # #|[{document, 0, 2,...|[{document, 0, 2,...|[{token, 0, 0, #,...|[{pos, 0, 0, sent...|[{named_entity, 0...|
|          Pierre|NNP|[{document, 0, 9,...|[{document, 0, 9,...|[{token, 0, 9, Pi...|[{pos, 0, 9, Vink...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..
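The `*` wildcard matches every file in the directory, which would explain why rows such as `# # # #` and `Pierre|NNP` appear in the output: they look like content from the CoNLL-U and POS files being parsed as CoNLL. The sketch below uses the standard library's `fnmatch` to show which of the three downloaded file names a pattern selects. Whether `readDataset` accepts a narrower pattern such as the second one is an assumption, since this notebook only demonstrates `*`.

```python
from fnmatch import fnmatch

# The three files placed in ./datasets earlier in this notebook.
FILES = ["en.test.conllu", "test_conll_docid.txt", "test-training.txt"]


def matching(pattern, names=FILES):
    """Return the file names selected by a shell-style pattern."""
    return [name for name in names if fnmatch(name, pattern)]


print(matching("*"))             # selects all three files
print(matching("*conll_*.txt"))  # selects only the CoNLL file
```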

In [34]:
conll_df7 = CoNLL().readDataset(spark, "file:///content/datasets/*")

In [35]:
conll_df7.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             # # # #|[{document, 0, 6,...|[{document, 0, 6,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|               # # #|[{document, 0, 4,...|[{document, 0, 4,...|[{token, 0, 0, #,...|[{pos, 0, 0, newd...|[{named_entity, 0...|
|                 # #|[{document, 0, 2,...|[{document, 0, 2,...|[{token, 0, 0, #,...|[{pos, 0, 0, sent...|[{named_entity, 0...|
|          Pierre|NNP|[{document, 0, 9,...|[{document, 0, 9,...|[{token, 0, 9, Pi...|[{pos, 0, 9, Vink...|[{named_entity, 0...|
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,..

## POS Test

In [36]:
from sparknlp.training import POS

In [37]:
pos = POS()
posDf = pos.readDataset(spark, "./datasets/test-training.txt", "|", "tags")

In [38]:
posDf.selectExpr("explode(tags) as tags").show(truncate=False)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []}       |
|{pos, 7, 12, NNP, {word -> Vinken}, []}      |
|{pos, 14, 14, ,, {word -> ,}, []}            |
|{pos, 16, 17, CD, {word -> 61}, []}          |
|{pos, 19, 23, NNS, {word -> years}, []}      |
|{pos, 25, 27, JJ, {word -> old}, []}         |
|{pos, 29, 29, ,, {word -> ,}, []}            |
|{pos, 31, 34, MD, {word -> will}, []}        |
|{pos, 36, 39, VB, {word -> join}, []}        |
|{pos, 41, 43, DT, {word -> the}, []}         |
|{pos, 45, 49, NN, {word -> board}, []}       |
|{pos, 51, 52, IN, {word -> as}, []}          |
|{pos, 47, 47, DT, {word -> a}, []}           |
|{pos, 56, 67, JJ, {word -> nonexecutive}, []}|
|{pos, 69, 76, NN, {word -> director}, []}    |
|{pos, 78, 81, NNP, {word -> Nov.}, []}       |
|{pos, 83, 84, CD, {word -> 29}, []}          |
|{pos, 81, 81, ., {word -> .}, []}      
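The third argument to `readDataset`, `"|"`, is the delimiter that separates each word from its tag. As a plain-Python illustration of that layout, the sketch below splits a line of `word|TAG` items. The sample line is inferred from the words and tags in the output above, so the exact file layout is an assumption, and `parse_pos_line` is a hypothetical helper rather than part of Spark NLP.

```python
def parse_pos_line(line: str, delimiter: str = "|"):
    """Split a whitespace-separated line of word<delimiter>tag items into pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last delimiter, so a word may contain it.
        word, sep, tag = item.rpartition(delimiter)
        if not sep:
            continue  # skip items that carry no tag
        pairs.append((word, tag))
    return pairs


# Sample inferred from the output above, not read from test-training.txt.
sample = "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ"

for word, tag in parse_pos_line(sample):
    print(word, tag)
```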

In [39]:
posDf2 = pos.readDataset(spark, "file:///content/datasets/test-training.txt", "|", "tags")

In [40]:
posDf2.selectExpr("explode(tags) as tags").show(truncate=False)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []}       |
|{pos, 7, 12, NNP, {word -> Vinken}, []}      |
|{pos, 14, 14, ,, {word -> ,}, []}            |
|{pos, 16, 17, CD, {word -> 61}, []}          |
|{pos, 19, 23, NNS, {word -> years}, []}      |
|{pos, 25, 27, JJ, {word -> old}, []}         |
|{pos, 29, 29, ,, {word -> ,}, []}            |
|{pos, 31, 34, MD, {word -> will}, []}        |
|{pos, 36, 39, VB, {word -> join}, []}        |
|{pos, 41, 43, DT, {word -> the}, []}         |
|{pos, 45, 49, NN, {word -> board}, []}       |
|{pos, 51, 52, IN, {word -> as}, []}          |
|{pos, 47, 47, DT, {word -> a}, []}           |
|{pos, 56, 67, JJ, {word -> nonexecutive}, []}|
|{pos, 69, 76, NN, {word -> director}, []}    |
|{pos, 78, 81, NNP, {word -> Nov.}, []}       |
|{pos, 83, 84, CD, {word -> 29}, []}          |
|{pos, 81, 81, ., {word -> .}, []}      

In [41]:
posDf3 = pos.readDataset(spark, "file:///content/datasets/test-training.txt", "|", "tags")

In [42]:
posDf3 = pos.readDataset(spark, "s3://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/test-training.txt", "|", "tags")

In [43]:
posDf3.selectExpr("explode(tags) as tags").show(truncate=False)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []}       |
|{pos, 7, 12, NNP, {word -> Vinken}, []}      |
|{pos, 14, 14, ,, {word -> ,}, []}            |
|{pos, 16, 17, CD, {word -> 61}, []}          |
|{pos, 19, 23, NNS, {word -> years}, []}      |
|{pos, 25, 27, JJ, {word -> old}, []}         |
|{pos, 29, 29, ,, {word -> ,}, []}            |
|{pos, 31, 34, MD, {word -> will}, []}        |
|{pos, 36, 39, VB, {word -> join}, []}        |
|{pos, 41, 43, DT, {word -> the}, []}         |
|{pos, 45, 49, NN, {word -> board}, []}       |
|{pos, 51, 52, IN, {word -> as}, []}          |
|{pos, 47, 47, DT, {word -> a}, []}           |
|{pos, 56, 67, JJ, {word -> nonexecutive}, []}|
|{pos, 69, 76, NN, {word -> director}, []}    |
|{pos, 78, 81, NNP, {word -> Nov.}, []}       |
|{pos, 83, 84, CD, {word -> 29}, []}          |
|{pos, 81, 81, ., {word -> .}, []}      

In [44]:
posDf4 = pos.readDataset(spark, "s3a://auxdata.johnsnowlabs.com/public/tmp/danilo/datasets/test-training.txt", "|", "tags")

In [45]:
posDf4.selectExpr("explode(tags) as tags").show(truncate=False)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []}       |
|{pos, 7, 12, NNP, {word -> Vinken}, []}      |
|{pos, 14, 14, ,, {word -> ,}, []}            |
|{pos, 16, 17, CD, {word -> 61}, []}          |
|{pos, 19, 23, NNS, {word -> years}, []}      |
|{pos, 25, 27, JJ, {word -> old}, []}         |
|{pos, 29, 29, ,, {word -> ,}, []}            |
|{pos, 31, 34, MD, {word -> will}, []}        |
|{pos, 36, 39, VB, {word -> join}, []}        |
|{pos, 41, 43, DT, {word -> the}, []}         |
|{pos, 45, 49, NN, {word -> board}, []}       |
|{pos, 51, 52, IN, {word -> as}, []}          |
|{pos, 47, 47, DT, {word -> a}, []}           |
|{pos, 56, 67, JJ, {word -> nonexecutive}, []}|
|{pos, 69, 76, NN, {word -> director}, []}    |
|{pos, 78, 81, NNP, {word -> Nov.}, []}       |
|{pos, 83, 84, CD, {word -> 29}, []}          |
|{pos, 81, 81, ., {word -> .}, []}      

In [46]:
posDf5 = pos.readDataset(spark, "file://content/datasets/test-training.txt", "|", "tags")

In [47]:
posDf5.selectExpr("explode(tags) as tags").show(truncate=False)

+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|{pos, 0, 5, NNP, {word -> Pierre}, []}       |
|{pos, 7, 12, NNP, {word -> Vinken}, []}      |
|{pos, 14, 14, ,, {word -> ,}, []}            |
|{pos, 16, 17, CD, {word -> 61}, []}          |
|{pos, 19, 23, NNS, {word -> years}, []}      |
|{pos, 25, 27, JJ, {word -> old}, []}         |
|{pos, 29, 29, ,, {word -> ,}, []}            |
|{pos, 31, 34, MD, {word -> will}, []}        |
|{pos, 36, 39, VB, {word -> join}, []}        |
|{pos, 41, 43, DT, {word -> the}, []}         |
|{pos, 45, 49, NN, {word -> board}, []}       |
|{pos, 51, 52, IN, {word -> as}, []}          |
|{pos, 47, 47, DT, {word -> a}, []}           |
|{pos, 56, 67, JJ, {word -> nonexecutive}, []}|
|{pos, 69, 76, NN, {word -> director}, []}    |
|{pos, 78, 81, NNP, {word -> Nov.}, []}       |
|{pos, 83, 84, CD, {word -> 29}, []}          |
|{pos, 81, 81, ., {word -> .}, []}      