![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/multilingual/WordSegmenterMultilingual.ipynb)

In [3]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

In [6]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp

In [7]:
spark = sparknlp.start()

## Multilingual Inference

WordSegmenter offers two ways to handle multilingual text:
1. Enable the regex tokenizer with `setEnableRegexTokenizer(True)`. This works with the existing pretrained models.
2. Train a model on multilingual text. This is useful when a pretrained model with `setEnableRegexTokenizer(True)` does not yield good results.

Calling `setEnableRegexTokenizer(True)` makes WordSegmenter tokenize Latin words on whitespace and apply word-segmenter inference **only to non-Latin words**, as the example below shows.

**Note:** Three parameters control how Latin words are tokenized. See the [official documentation](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ws/WordSegmenterModel.html) for details.

This example uses text that mixes Thai and English words, so we load a Thai WordSegmenter model. Additional WordSegmenter models are listed on the [official models page](https://nlp.johnsnowlabs.com/models?q=Word+Segmenter).
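
To make the routing idea described above more concrete, here is a minimal pure-Python sketch. It is not Spark NLP's implementation: `route_chunks`, `LATIN_CHUNK`, and the placeholder segmenter `fake_segmenter` are hypothetical names, and the ASCII-only definition of "Latin" is an assumption made for illustration.

```python
import re

# Assumption: a chunk counts as "Latin" if it contains only ASCII letters,
# digits, apostrophes, or hyphens. The real annotator may use other rules.
LATIN_CHUNK = re.compile(r"[A-Za-z0-9'-]+")


def route_chunks(text, segment_non_latin):
    """Split on whitespace, keep Latin chunks as tokens, segment the rest."""
    tokens = []
    for chunk in text.split():
        if LATIN_CHUNK.fullmatch(chunk):
            # Latin chunk: whitespace tokenization is already enough.
            tokens.append(chunk)
        else:
            # Non-Latin chunk: hand it to the word segmenter.
            tokens.extend(segment_non_latin(chunk))
    return tokens


def fake_segmenter(chunk):
    """Hypothetical stand-in for model inference: only marks what it receives."""
    return [f"SEGMENT({chunk})"]


routed = route_chunks("สำหรับฐานลำโพง apple homepod", fake_segmenter)
print(routed)
```

Only the Thai chunk reaches the segmenter, while `apple` and `homepod` pass through unchanged.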

---



In [None]:
multilingual_text = "สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null"
multilingual_df = spark.createDataFrame([[multilingual_text]]).toDF("text")

In [9]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setEnableRegexTokenizer(True)

pipeline = Pipeline(stages=[document_assembler, word_segmenter])
result_df = pipeline.fit(multilingual_df).transform(multilingual_df)

wordseg_best download started this may take some time.
Approximate size to download 79.2 KB
[OK!]


In [10]:
result_df.show()

+--------------------+--------------------+--------------------+
|                text|            document|               token|
+--------------------+--------------------+--------------------+
|สำหรับฐานลำโพง ap...|[{document, 0, 91...|[{token, 0, 8, สำ...|
+--------------------+--------------------+--------------------+



In [11]:
result_df.select("token").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|token                                                                                                                                                     

## Training a Multilingual Model

We can also train our own multilingual model. This requires a training file in the expected format, in which every character is labeled with a tag, for English and Thai alike, as in the example below.

The training dataset uses the following tags:
- LL: Left Boundary of a word
- RR: Right Boundary of a word
- MM: Middle character of a word
- LR: A single character that can be seen as a word
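
The tagging scheme above can be sketched in a few lines of plain Python. The helpers `tag_word`, `tag_sentence`, and `words_from_tagged` are hypothetical names introduced here for illustration and are not part of Spark NLP. The sketch assumes each word is already split into units, because Thai units may keep a combining mark attached to its base character.

```python
def tag_word(units):
    """Tag the units of one word: LL first, MM middle, RR last, LR if alone."""
    if not units:
        raise ValueError("a word needs at least one unit")
    if len(units) == 1:
        return f"{units[0]}|LR"
    tags = ["LL"] + ["MM"] * (len(units) - 2) + ["RR"]
    return " ".join(f"{unit}|{tag}" for unit, tag in zip(units, tags))


def tag_sentence(words):
    """Join the tagged units of several words with single spaces."""
    return " ".join(tag_word(word) for word in words)


def words_from_tagged(line):
    """Inverse operation: rebuild the words from a tagged line."""
    words, current = [], ""
    for token in line.split():
        unit, tag = token.rsplit("|", 1)
        current += unit
        if tag in ("RR", "LR"):  # RR and LR both close a word
            words.append(current)
            current = ""
    return words


english = tag_sentence([list("apple"), list("homepod")])
thai = tag_word(["สำ", "ห", "รั", "บ"])  # units supplied by the caller
print(english)
print(thai)
print(words_from_tagged(english))
```

Rebuilding the words from the tags is what a trained segmenter effectively does at inference time: it predicts one tag per unit, and word boundaries follow from the `RR` and `LR` tags.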

In [12]:
# Each unit is written as unit|TAG, and units are separated by single spaces.
thai_word1 = "สำ|LL ห|MM รั|MM บ|RR ฐ|LL า|MM น|RR ลำ|LL โ|MM พ|MM ง|RR "
english_words = "a|LL p|MM p|MM l|MM e|RR h|LL o|MM m|MM e|MM p|MM o|MM d|RR "
thai_word2 = "อุ|LL ป|MM ก|MM ร|MM ณ์|RR เ|LL ค|MM รื่|MM อ|MM ง|RR เ|LL สี|MM ย|MM ง|RR ยึ|LL ด|RR ข|LL า|RR ตั้|LL ง|RR พื้|LL น|RR "
english_words2 = "s|LL p|MM e|MM a|MM k|MM e|MM r|RR s|LL t|MM a|MM n|MM d|MM s|RR n|LL u|MM l|MM l|RR"
thai_english_sentence = thai_word1 + english_words + thai_word2 + english_words2

# Write the same sentence five times to build a tiny training file.
with open('./thai_english.txt', 'w', encoding='utf-8') as training_file:
    for _ in range(5):
        training_file.write(thai_english_sentence + "\n")

In [13]:
! cat ./thai_english.txt

สำ|LL ห|MM รั|MM บ|RR ฐ|LL า|MM น|RR ลำ|LL โ|MM พ|MM ง|RR a|LL p|MM p|MM l|MM e|RR h|LL o|MM m|MM e|MM p|MM o|MM d|RR อุ|LL ป|MM ก|MM ร|MM ณ์|RR เ|LL ค|MM รื่|MM อ|MM ง|RR เ|LL สี|MM ย|MM ง|RR ยึ|LL ด|RR ข|LL า|RR ตั้|LL ง|RR พื้|LL น|RR s|LL p|MM e|MM a|MM k|MM e|MM r|RR s|LL t|MM a|MM n|MM d|MM s|RR n|LL u|MM l|MM l|RR
สำ|LL ห|MM รั|MM บ|RR ฐ|LL า|MM น|RR ลำ|LL โ|MM พ|MM ง|RR a|LL p|MM p|MM l|MM e|RR h|LL o|MM m|MM e|MM p|MM o|MM d|RR อุ|LL ป|MM ก|MM ร|MM ณ์|RR เ|LL ค|MM รื่|MM อ|MM ง|RR เ|LL สี|MM ย|MM ง|RR ยึ|LL ด|RR ข|LL า|RR ตั้|LL ง|RR พื้|LL น|RR s|LL p|MM e|MM a|MM k|MM e|MM r|RR s|LL t|MM a|MM n|MM d|MM s|RR n|LL u|MM l|MM l|RR
สำ|LL ห|MM รั|MM บ|RR ฐ|LL า|MM น|RR ลำ|LL โ|MM พ|MM ง|RR a|LL p|MM p|MM l|MM e|RR h|LL o|MM m|MM e|MM p|MM o|MM d|RR อุ|LL ป|MM ก|MM ร|MM ณ์|RR เ|LL ค|MM รื่|MM อ|MM ง|RR เ|LL สี|MM ย|MM ง|RR ยึ|LL ด|RR ข|LL า|RR ตั้|LL ง|RR พื้|LL น|RR s|LL p|MM e|MM a|MM k|MM e|MM r|RR s|LL t|MM a|MM n|MM d|MM s|RR n|LL u|MM l|MM l|RR
สำ|LL ห|MM รั|MM บ|RR ฐ|LL า|MM น|
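
A missing space between two tagged units is easy to overlook in a hand-written file like this one. The sketch below is a small sanity check that could be run before training. `find_format_errors` and `VALID_TAGS` are hypothetical names, and the rules it enforces (exactly one `|` per token, a known tag, and words that open with `LL` and close with `RR`) are inferred from the tag legend rather than taken from the Spark NLP reader.

```python
VALID_TAGS = {"LL", "MM", "RR", "LR"}


def find_format_errors(line):
    """Return (position, token, reason) for each problem in one tagged line."""
    errors = []
    tokens = line.split()
    inside_word = False  # True after LL or MM, until RR closes the word
    for position, token in enumerate(tokens):
        unit, separator, tag = token.rpartition("|")
        if not separator or not unit:
            errors.append((position, token, "expected unit|TAG"))
            continue
        if "|" in unit:
            errors.append((position, token, "more than one '|', missing space?"))
            continue
        if tag not in VALID_TAGS:
            errors.append((position, token, f"unknown tag {tag!r}"))
            continue
        starts_word = tag in ("LL", "LR")
        if starts_word and inside_word:
            errors.append((position, token, "previous word was not closed by RR"))
        elif not starts_word and not inside_word:
            errors.append((position, token, f"{tag} without a preceding LL"))
        inside_word = tag in ("LL", "MM")
    if inside_word:
        errors.append((len(tokens), "", "line ends inside a word"))
    return errors


clean_line = "a|LL p|MM e|RR x|LR"
broken_line = "a|LL b|RR c|LLd|RR e|LL f|RR"  # space missing after c|LL
print(find_format_errors(clean_line))
print(find_format_errors(broken_line))
```

To check the whole training file, the same function can be applied to each line read from `./thai_english.txt`.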

In [16]:
from sparknlp.training import POS

train_df = POS().readDataset(spark, "./thai_english.txt")
train_df.show()

+--------------------+--------------------+--------------------+
|                text|            document|                tags|
+--------------------+--------------------+--------------------+
|สำ ห รั บ ฐ า น ล...|[{document, 0, 13...|[{pos, 0, 1, LL, ...|
|สำ ห รั บ ฐ า น ล...|[{document, 0, 13...|[{pos, 0, 1, LL, ...|
|สำ ห รั บ ฐ า น ล...|[{document, 0, 13...|[{pos, 0, 1, LL, ...|
|สำ ห รั บ ฐ า น ล...|[{document, 0, 13...|[{pos, 0, 1, LL, ...|
|สำ ห รั บ ฐ า น ล...|[{document, 0, 13...|[{pos, 0, 1, LL, ...|
+--------------------+--------------------+--------------------+



In [17]:
document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

word_segmenter = WordSegmenterApproach() \
    .setInputCols("document") \
    .setOutputCol("token") \
    .setPosColumn("tags") \
    .setNIterations(5)

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

# Fit on the tagged training data, then segment the multilingual example text.
result_df = pipeline.fit(train_df).transform(multilingual_df)

result_df.show()

+--------------------+--------------------+--------------------+
|                text|            document|               token|
+--------------------+--------------------+--------------------+
|สำหรับฐานลำโพง ap...|[{document, 0, 91...|[{token, 0, 8, สำ...|
+--------------------+--------------------+--------------------+



In [18]:
result_df.select("token").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|token                                                                                                                                                     