![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# WordSegmenter

This notebook will cover the different parameters and usages of `WordSegmenter`.

**📖 Learning Objectives:**

1. Be able to split text into words in different languages.

2. Understand how to use the `WordSegmenter` annotator.

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [WordSegmenter](https://nlp.johnsnowlabs.com/docs/en/annotators#wordsegmenter)

- Python Docs : [WordSegmenter](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ws/word_segmenter/index.html#sparknlp.annotator.ws.word_segmenter.WordSegmenterModel)

- Scala Docs : [WordSegmenter](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ws/WordSegmenterModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public).

## **📜 Background**

An important part of text preprocessing is splitting text into an array of words that can then be used in many NLP tasks.

This task is more difficult in languages such as Chinese, Japanese, Korean, and Thai, where the words in a text are not separated by white space (or another delimiter).

For example, check the following text in Chinese:

> 我们都很喜欢自然语言处理！

The Chinese words run together without any separation, so how can we identify which combinations of ideograms form a word?

In this example, the words are:

- 我们 (we, composition of two ideograms)
- 都 (all, only one ideogram)
- 很 (very, only one ideogram)
- 喜欢 (like, composition of two ideograms)
- 自然语言处理 (NLP, composition of six ideograms), which can be broken down into:
  - 自然 (Natural)
  - 语言 (Language)
  - 处理 (Processing)

But there is no easy way to programmatically identify them! Thus, we need help from Machine Learning models. 
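
To make the problem concrete, here is a minimal sketch in plain Python (no Spark NLP involved) that applies naive whitespace tokenization to an English sentence and to the Chinese example above. The English sentence is only an illustrative translation added for comparison.

```python
# Naive whitespace tokenization works for space-delimited languages only.
english = "We all like natural language processing!"  # illustrative translation
chinese = "我们都很喜欢自然语言处理！"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(len(english_tokens), english_tokens)
print(len(chinese_tokens), chinese_tokens)
```

The English sentence is split into several tokens, while the Chinese sentence comes back as a single token because it contains no white space.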

In this notebook, we will introduce the Spark NLP annotators that can identify the words in this kind of text, either by using pretrained models or by training new ones.

John Snow Labs currently has pretrained models for Chinese, Japanese, Korean, and Thai.

## **🎬 Colab Setup**

Before going through the annotators, let's set up the environment and start a `spark` session.

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [None]:
from pyspark.sql import functions as F
from pyspark.ml import Pipeline

import sparknlp
# Import the two annotators used in this notebook
from sparknlp.annotator import (
    WordSegmenterApproach,
    WordSegmenterModel,
)
from sparknlp.base import DocumentAssembler, LightPipeline


Start the Spark session:

In [None]:
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `TOKEN`

## **🔎 Parameters**

- **model**: Part-of-Speech model.

- **pattern**: Regex pattern used to match delimiters (Default: "\\s+").

- **toLowercase**: Indicates whether to convert all characters to lowercase before tokenizing (Default: false). Useful when the text mixes several languages.

### ✌ Using pretrained models

We can use a pretrained model with the `WordSegmenterModel` annotator. For a list of available models, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Word+Segmentation).

This annotator acts like the `Tokenizer` annotator, but for languages where words have no clear separator (such as white space).

We will show how to use the Chinese pretrained model `wordseg_ctb9`.

In [None]:
# Chinese example
example_sentence = r"我们都很喜欢自然语言处理！"

Create the pipeline:

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Model trained on the Chinese Treebank 9 dataset
word_segmenter = (
    WordSegmenterModel.pretrained("wordseg_ctb9", "zh")
    .setInputCols(["document"])
    .setOutputCol("words_segmented")
    .setPattern("\\s+")
)


pipeline = Pipeline(stages=[document_assembler, word_segmenter])
example = spark.createDataFrame([[example_sentence]]).toDF("text")

model = pipeline.fit(example)
result = model.transform(example)
result.select(F.explode("words_segmented.result").alias("word")).show(truncate=False)

wordseg_ctb9 download started this may take some time.
Approximate size to download 2.2 MB
[OK!]
+----+
|word|
+----+
|我们|
|都  |
|很  |
|喜欢|
|自然|
|语言|
|处理|
|！  |
+----+



## Fast inference with [LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline)

We can use Spark NLP's `LightPipeline` to run fast inference directly on text (or a list of texts) instead of using Spark DataFrames.

Let's check how to do that.

In [None]:
# Simply define the LightPipeline on the PipelineModel
lp = LightPipeline(model)

lp.annotate(example_sentence)

{'document': ['我们都很喜欢自然语言处理！'],
 'words_segmented': ['我们', '都', '很', '喜欢', '自然', '语言', '处理', '！']}

In [None]:
for i, token in enumerate(lp.annotate(example_sentence)["words_segmented"]):
  print(f"Word {i}: {token}")

Word 0: 我们
Word 1: 都
Word 2: 很
Word 3: 喜欢
Word 4: 自然
Word 5: 语言
Word 6: 处理
Word 7: ！


Easy as that!
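
`annotate` returns only the token strings. If character positions are also needed, a minimal sketch like the one below can recover them from the original text. The helper `token_spans` is hypothetical (not part of Spark NLP), and the dictionary simply reproduces the `annotate` output shown above so the snippet runs without a Spark session.

```python
def token_spans(text, tokens):
    """Locate each token in `text`, scanning left to right.

    Returns (token, start, end) tuples with Python-style exclusive `end`.
    Hypothetical helper for illustration; not a Spark NLP function.
    """
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"Token {token!r} not found after position {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


# Copy of the `lp.annotate(example_sentence)` output shown above
annotation = {
    "document": ["我们都很喜欢自然语言处理！"],
    "words_segmented": ["我们", "都", "很", "喜欢", "自然", "语言", "处理", "！"],
}

spans = token_spans(annotation["document"][0], annotation["words_segmented"])
for token, start, end in spans:
    print(f"{start:>2}-{end:<2} {token}")
```

`LightPipeline` also offers `fullAnnotate`, which returns annotation objects carrying `begin` and `end` positions, so a helper like this is mainly useful when only the plain `annotate` output is at hand.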

## ⚡ **Training a new WordSegmenterModel**

To train a new model, we need to use the `WordSegmenterApproach` annotator.

The parameters of the annotator are:

- **model**: Part-of-Speech model.
- **pattern**: Regex pattern used to match delimiters (Default: "\\s+").
- **toLowercase**: Indicates whether to convert all characters to lowercase before tokenizing (Default: false). Useful when the text mixes several languages.
- **ambiguityThreshold**: Percentage of the total number of words that must be covered for a word to be marked as frequent (Default: 0.97)
- **frequencyThreshold**: Minimum number of times a tag must appear on a word for it to be marked as frequent (Default: 20)
- **nIterations**: Number of training iterations; more iterations tend to converge to better accuracy (Default: 5)
- **posCol**: Name of the column containing the POS tags that match tokens

The implemented model is a modification of the following reference paper:

> [Chinese Word Segmentation as Character Tagging (Xue, IJCLCLP 2003)](https://aclanthology.org/O03-4002/)


### Training data

The training data for the `WordSegmenterApproach` annotator is a text file in the same format used to train `Part-of-Speech` (POS) models: each ideogram/character is tagged with a label, and the character-tag pairs are separated by a delimiter.

We will use the following as training data to train a simple Korean Word Segmenter model (each character and its tag are separated by `|`, and character-tag pairs are separated by white space):

> 우|LL 리|RR 모|LL 두|MM 는|RR 자|LL 연|MM 어|RR 처|LL 리|MM 를|RR 좋|LL 아|MM 합|MM 니|MM 다|RR !|LR


Where the labels are:

* `LL`: The beginning of the word
* `MM`: Middle part of the word
* `RR`: The end of the word
* `LR`: A word formed of only one character

Create a text file with the training data:

In [None]:
with open("train_data.txt", "w", encoding="utf8") as f:
  f.write("우|LL 리|RR 모|LL 두|MM 는|RR 자|LL 연|MM 어|RR 처|LL 리|MM 를|RR 좋|LL 아|MM 합|MM 니|MM 다|RR !|LR ")
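
Before training, it can be worth checking that a tagged line decodes back to the words you expect, since a single wrong tag changes the word boundaries. The sketch below is one way to do that in plain Python; `tagged_line_to_words` is a hypothetical helper, not part of Spark NLP, and it only understands the four labels described above.

```python
def tagged_line_to_words(line, delimiter="|"):
    """Rebuild words from `char|TAG` pairs, raising on inconsistent tags.

    Hypothetical helper for illustration; not a Spark NLP function.
    """
    words = []
    current = ""  # characters of the word currently being built
    for pair in line.split():
        char, tag = pair.rsplit(delimiter, 1)
        if tag == "LR":
            if current:
                raise ValueError(f"LR inside an unfinished word at {pair!r}")
            words.append(char)
        elif tag == "LL":
            if current:
                raise ValueError(f"LL inside an unfinished word at {pair!r}")
            current = char
        elif tag == "MM":
            if not current:
                raise ValueError(f"MM without a preceding LL at {pair!r}")
            current += char
        elif tag == "RR":
            if not current:
                raise ValueError(f"RR without a preceding LL at {pair!r}")
            words.append(current + char)
            current = ""
        else:
            raise ValueError(f"Unknown tag in {pair!r}")
    if current:
        raise ValueError("Line ends in the middle of a word")
    return words


line = "우|LL 리|RR 모|LL 두|MM 는|RR 자|LL 연|MM 어|RR 처|LL 리|MM 를|RR 좋|LL 아|MM 합|MM 니|MM 다|RR !|LR "
decoded_words = tagged_line_to_words(line)
print(decoded_words)
```

Raising an error on an inconsistent sequence (for example an `MM` that does not follow an `LL`) is a design choice of this sketch; it surfaces labeling mistakes early instead of silently producing odd words.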

To read this kind of dataset, you can use the helper class [POS](https://nlp.johnsnowlabs.com/docs/en/training#pos-dataset).

In [None]:
from sparknlp.training import POS

In [None]:
train_data = POS().readDataset(spark, "train_data.txt")
train_data.show()

+-----------------------------+--------------------+--------------------+
|                         text|            document|                tags|
+-----------------------------+--------------------+--------------------+
|우 리 모 두 는 자 연 어 처...|[{document, 0, 32...|[{pos, 0, 0, LL, ...|
+-----------------------------+--------------------+--------------------+



Build the pipeline for training:

In [None]:
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

wordSegmenter = (
    WordSegmenterApproach()
    .setInputCols(["document"])
    .setOutputCol("token")
    .setPosColumn("tags") # Name in the training data obtained with POS class
    .setNIterations(10)
    .setFrequencyThreshold(1) # Since our data is very small
)

pipeline = Pipeline().setStages([documentAssembler, wordSegmenter])

In [None]:
%%time

pipelineModel = pipeline.fit(train_data)

CPU times: user 21.7 ms, sys: 6.07 ms, total: 27.7 ms
Wall time: 1.06 s


Try the trained model:

In [None]:
lp = LightPipeline(pipelineModel)

lp.annotate("우리모두는자연어처리를좋아합니다!")

{'document': ['우리모두는자연어처리를좋아합니다!'],
 'token': ['우리', '모두는', '자연어', '처리를', '좋아합니다', '!']}

That's it! Now you know how to train a new Word Segmenter model, as well as how to use a pretrained one!