![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb)



# [Word Segmenter](https://nlp.johnsnowlabs.com/docs/en/annotators#wordsegmenter)

In languages such as Chinese, Japanese or Korean, whitespace does not reliably
mark word boundaries, and a sentence can be a continuous run of characters.
Splitting such text into tokens requires knowledge of the language. The
WordSegmenter is trained on annotated text from these languages and splits it
into semantically correct parts.
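
As a small illustration of the problem, the following plain-Python sketch (it does not use Spark NLP) compares whitespace splitting on an English sentence and on the first Chinese sentence from the sample output further down. The English sentence and the variable names are made up for this example.

```python
# Whitespace splitting works for English but not for unsegmented Chinese text.
english = "American companies in China ask for an extension"
chinese = "在华美资企业要求延长给中国的贸易最惠国待遇"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(len(english_tokens))  # 8 words
print(len(chinese_tokens))  # 1: the whole sentence comes back as a single token

# Splitting into single characters goes too far in the other direction:
# multi-character words such as 企业 ("enterprise") are broken apart.
chars = list(chinese)
print(chars[:6])
```

Neither extreme yields usable tokens, which is why a trained segmenter is needed.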

Let's train a custom WordSegmenterModel that segments Chinese text into words.

## Installation

Run this block only if you are inside Google Colab, where it sets up Spark NLP.
Otherwise, skip it.

In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

## Training

### Starting Spark NLP

In [None]:
import sparknlp
spark = sparknlp.start()


print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 4.3.1
Apache Spark version: 3.3.0


### Training Data

To train your own model, a training dataset annotated with [Part-Of-Speech
tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) is required. The
data has to be loaded into a DataFrame that contains a column of Spark NLP
Annotations of type `"POS"`. The name of this column is set with `setPosColumn`.

For this example we will use some sample files parsed from the [OntoNotes 5.0 Dataset](https://github.com/taotao033/conll-formatted-ontonotes-5.0_for_chinese_language). To train a full model, the whole dataset needs to be retrieved.

In [None]:
!wget https://raw.githubusercontent.com/taotao033/conll-formatted-ontonotes-5.0_for_chinese_language/master/onto.train.ner
!wget https://raw.githubusercontent.com/taotao033/conll-formatted-ontonotes-5.0_for_chinese_language/master/onto.test.ner

Spark NLP offers helper classes to load this kind of data into Spark DataFrames.
The resulting DataFrame will have columns for the words, POS tags and NER tags.

In [None]:
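from sparknlp.training import CoNLL
from pyspark.sql.functions import *

# Read the tab-delimited CoNLL files into Spark DataFrames.
train = CoNLL(delimiter="\t").readDataset(spark, "onto.train.ner")

# Strip tab characters from the test text before it is passed to the pipeline.
test = CoNLL(delimiter="\t").readDataset(spark, "onto.test.ner") \
    .withColumn("text", regexp_replace("text", "\t", ""))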
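
To make the file layout more concrete, here is a plain-Python sketch that parses a few tab-delimited rows into words, POS tags and NER tags. It does not use Spark NLP. It assumes the CoNLL-2003 column order (word, POS tag, chunk tag, NER tag), which may differ from the downloaded files. The sample rows, the `Row` type and the `parse_conll` function are hypothetical and exist only for illustration.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    word: str
    pos: str
    ner: str


# Hypothetical sample rows (not taken from the dataset): one token per line,
# columns separated by tabs, sentences separated by a blank line.
SAMPLE = (
    "中\tNR\tO\tB-GPE\n"
    "国\tNR\tO\tI-GPE\n"
    "\n"
    "贸\tNN\tO\tO\n"
    "易\tNN\tO\tO\n"
)


def parse_conll(text: str, delimiter: str = "\t",
                pos_index: int = 1, label_index: int = 3) -> List[List[Row]]:
    """Group CoNLL-style lines into sentences of (word, pos, ner) rows."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document separator lines carry no tokens.
            continue
        cols = line.split(delimiter)
        current.append(Row(cols[0], cols[pos_index], cols[label_index]))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print("".join(row.word for row in sentence), [row.pos for row in sentence])
```

In the notebook itself, the `CoNLL` helper does this work and additionally wraps the values in Spark NLP annotations.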

### Pipeline

Now we will create the parts for the training pipeline. In this case it is
rather simple, as we only need to pass the annotations to the
WordSegmenterApproach annotator. We set the `posColumn` parameter to the name
of the column which was extracted (in this case `"pos"`). The resulting output
column will be `"token"`.

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

wordSegmenter = WordSegmenterApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPosColumn("pos") \
    .setNIterations(5)

pipeline = Pipeline().setStages([
    documentAssembler,
    wordSegmenter
])

pipelineModel = pipeline.fit(train)

After we have trained the model, we can use the resulting pipeline model to
transform the test data. Note that this model might not perform well: it was
trained on little data for only a few iterations and serves only to illustrate
the training process.

In [None]:
test_transformed = pipelineModel.transform(test)
test_transformed.select("token.result").show(5, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[在, 华, 美, 资, 企, 业, 要, 求, 延, 长, 给, 中, 国, 的, 贸, 易, 最, 惠, 国, 待, 遇]                                                                                                |
|[新, 华, 社, 华, 盛, 顿, ４, 月, ２, ０, 日, 电, （, 记, 者, 应, 谦, ）]                                                                                                            |
|[美, 国, 商, 会, 中, 国, 分, 会, 近, 日, 派, 出, 一, 个, ２, ５, 人, 组, 成, 的, 代, 表, 团, ，, 在, 华, 盛, 顿, 向, 国, 会, 和, 白, 宫, 展, 开, 为, 期, 一, 周, 的,