![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb)


# [Word Segmenter](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ws/WordSegmenterModel.scala)


[WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) can tokenize non-English text. Many languages, such as Korean, Japanese, or Chinese, are **not whitespace-separated**: their sentences are a concatenation of many symbols. Without **understanding the language**, splitting a sentence into its corresponding tokens is impossible. The WordSegmenterModel is trained to understand these languages and split them into semantically correct tokens.
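
To see why a trained model is needed, here is a minimal sketch in plain Python (no Spark required) of what the two naive strategies do to the example sentence used later in this notebook. The variable names are illustrative only and are not part of Spark NLP.

```python
# The example sentence used later in this notebook
text = "然而，這樣的處理也衍生了一些問題。"

# Whitespace splitting finds no boundaries, so the whole sentence stays in one piece
whitespace_tokens = text.split()
print(len(whitespace_tokens), whitespace_tokens)

# Character splitting goes too far: multi-character words such as 然而 are broken apart
char_tokens = list(text)
print(len(char_tokens), char_tokens)
```

Neither result matches word boundaries: the first is too coarse and the second too fine, which is the gap the WordSegmenterModel is meant to fill.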

In [2]:
import os
# Refresh the package index, then install Java 8
! apt-get update -qq > /dev/null
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install pyspark==2.4.4 > /dev/null


In [None]:
# Import Spark NLP and start a Spark session
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *

spark = sparknlp.start()

In [12]:
import pandas as pd

# Wrap the raw "text" column into a "document" annotation
document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Load the pretrained Chinese ("zh") word segmentation model
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")\
  .setInputCols(["document"])\
  .setOutputCol("words_segmented")

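pipeline = Pipeline(stages=[document_assembler, word_segmenter])
example = spark.createDataFrame(pd.DataFrame({'text': ["然而，這樣的處理也衍生了一些問題。"]}))

# Both stages are already fitted, so fit() only assembles the pipeline model
result = pipeline.fit(example).transform(example)
# show() truncates each cell to 20 characters by default; pass truncate=False to see every token
result.select('words_segmented.result').show()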

wordseg_gsd_ud_trad download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
+----------------------------+
|                      result|
+----------------------------+
|[然而, ，, 這樣, 的, 處理...|
+----------------------------+

