![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.2_Multi_Lingual_Training_and_models.ipynb)

## Train a SOTA multi-lingual NLP classifier capable of understanding 100+ languages



### Limits of uni-lingual embeddings
Text embeddings are numerical vector representations of text that encode semantic information in a high-dimensional space. They are usually trained on a single language and can therefore only embed text from that language properly. The sentences `I love sausages` and `Ich liebe Bratwurst` have the same meaning and should be close together in embedding space. But most English embedding models have never seen German words like `Bratwurst`, so such a model cannot deduce that these sentences say the same thing and should lie close together.



### Solutions with multi-lingual embeddings
**Multi-lingual embeddings** are usually trained on 100+ languages at the same time, which enables them to encode text from all of these languages into a single shared semantic space. In this space, semantically similar pieces of text are mapped to points that lie close together, regardless of the language they are written in.
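
To make "a small distance in embedding space" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors are invented for illustration only and were not produced by any model; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real model output.
toy_embeddings = {
    "I love sausages": [0.9, 0.1, 0.2],
    "Ich liebe Bratwurst": [0.8, 0.2, 0.3],
    "Stock markets fell today": [0.1, 0.9, 0.1],
}

en = toy_embeddings["I love sausages"]
de = toy_embeddings["Ich liebe Bratwurst"]
other = toy_embeddings["Stock markets fell today"]

sim_translation = cosine_similarity(en, de)
sim_unrelated = cosine_similarity(en, other)

print(f"translation pair: {sim_translation:.3f}")
print(f"unrelated pair:   {sim_unrelated:.3f}")
```

With a good multi-lingual embedding, a translation pair should score noticeably higher than an unrelated pair, which is the behaviour the toy vectors are constructed to mimic.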

This enables many new use cases. You can train a classifier on an English dataset using multi-lingual embeddings, and the resulting model will be able to predict labels across all languages supported by the embedding. For example, if you use `LaBSE` or `XLM-RoBERTa`, your English dataset enables you to roll out a feature in 100+ countries, because the embeddings already generalize well enough for simple classifiers to leverage the shared space and yield good predictions.
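
The idea of training on one language and predicting on another can be sketched with a deliberately simple stand-in classifier. The example below uses a nearest-centroid rule, which is not what `ClassifierDLApproach` does internally (that is a neural network), and all vectors are hypothetical two-dimensional "embeddings" chosen by hand. It only illustrates why a shared space makes the transfer possible.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_centroids(labelled_vectors):
    """Group training vectors by label and return one centroid per label."""
    by_label = {}
    for label, vec in labelled_vectors:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    return min(centroids, key=lambda label: euclidean(centroids[label], vec))

# Hypothetical embeddings of ENGLISH training sentences (invented values).
english_train = [
    ("Business", [0.9, 0.1]),
    ("Business", [0.8, 0.2]),
    ("Sci/Tech", [0.1, 0.9]),
    ("Sci/Tech", [0.2, 0.8]),
]
centroids = fit_centroids(english_train)

# Hypothetical embedding of a GERMAN business sentence. In a shared
# multi-lingual space it lands near the English business examples.
german_business_sentence = [0.85, 0.25]
german_prediction = predict(centroids, german_business_sentence)
print(german_prediction)
```

The classifier never sees a German training example; it only relies on the German sentence being embedded near English sentences with the same meaning.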



There are many multi-lingual embeddings you can leverage via Spark NLP. You can find them all in the [Models Hub by setting the language filter to Multi-Lingual and the task to Embeddings](https://nlp.johnsnowlabs.com/models?language=xx&task=Embeddings).



| Name                                                                                                                   | Spark NLP Model Name               | language   |
|:-----------------------------------------------------------------------------------------------------------------------|:-----------------------------------|:-----------|
| GloVe Embeddings 6B 300 (Multilingual)                                                                                 | glove_6B_300                       | xx         |
| GloVe Embeddings 840B 300 (Multilingual)                                                                               | glove_840B_300                     | xx         |
| Multilingual BERT Embeddings (Base Cased)                                                                              | bert_multi_cased                   | xx         |
| Multilingual BERT Sentence Embeddings (Base Cased)                                                                     | sent_bert_multi_cased              | xx         |
| Universal Sentence Encoder Multilingual Large                                                                          | tfhub_use_multi_lg                 | xx         |
| Universal Sentence Encoder Multilingual                                                                                | tfhub_use_multi                    | xx         |
| Universal Sentence Encoder XLING English and German                                                                    | tfhub_use_xling_en_de              | xx         |
| Universal Sentence Encoder XLING English and Spanish                                                                   | tfhub_use_xling_en_es              | xx         |
| Universal Sentence Encoder XLING English and French                                                                    | tfhub_use_xling_en_fr              | xx         |
| Universal Sentence Encoder XLING Many                                                                                  | tfhub_use_xling_many               | xx         |
| BERT multilingual base model (cased)                                                                                   | bert_base_multilingual_cased       | xx         |
| BERT multilingual base model (uncased)                                                                                 | bert_base_multilingual_uncased     | xx         |
| DistilBERT base multilingual model (cased)                                                                             | distilbert_base_multilingual_cased | xx         |
| Twitter XLM-RoBERTa Base (twitter_xlm_roberta_base)                                                                    | twitter_xlm_roberta_base           | xx         |
| XLM-RoBERTa Base (xlm_roberta_base)                                                                                    | xlm_roberta_base                   | xx         |
| XLM-RoBERTa XTREME Base (xlm_roberta_xtreme_base)                                                                      | xlm_roberta_xtreme_base            | xx         |
| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base_br)                     | sent_bert_use_cmlm_multi_base_br   | xx         |
| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base)                        | sent_bert_use_cmlm_multi_base      | xx         |
| Multilingual Representations for Indian Languages (MuRIL)                                                              | bert_muril                         | xx         |
| Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages | sent_bert_muril                    | xx         |
| XLM-RoBERTa Base Sentence Embeddings (sent_xlm_roberta_base)                                                           | sent_xlm_roberta_base              | xx         |
| XLM-RoBERTa Large (xlm_roberta_large)                                                                                  | xlm_roberta_large                  | xx         |



## Colab Setup

In [None]:
! pip install -q pyspark==3.2.0 spark-nlp

In [None]:
import sparknlp
spark = sparknlp.start(spark32 = True)
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.0
Apache Spark version: 3.2.0


## Download Dataset

In [None]:
! wget -q http://ckl-it.de/wp-content/uploads/2021/02/news_category_test_multi_lingual.csv

In [None]:
dataset = spark.read \
      .option("header", True) \
      .csv('/content/news_category_test_multi_lingual.csv').limit(10000)
dataset.show()

+---+--------+--------------------+------------------------------+
|_c0|       y|                text|                test_sentences|
+---+--------+--------------------+------------------------------+
|  0|Business|Unions representi...|          టర్నర్ నెవాల్ వద్...|
|  1|Sci/Tech| TORONTO, Canada ...|          Торонто, Канада #...|
|  2|Sci/Tech| A company founde...|          Une société fondé...|
|  3|Sci/Tech| It's barely dawn...|          সবেমাত্র ভোর যখন ...|
|  4|Sci/Tech| Southern Califor...|          Көньяк Калифорния...|
|  5|Sci/Tech|"The British Depa...|           with the ostensi...|
|  6|Sci/Tech|"confessed author...|           something expert...|
|  7|Sci/Tech|\\FOAF/LOAF  and ...|          \ FOAF / LOAF- un...|
|  8|Sci/Tech|"Wiltshire Police...|          "வில்ட்ஷயர் பொலிஸ...|
|  9|Sci/Tech|In its first two ...|          ក្នុងរយៈពេលពីរឆ្ន...|
| 10|Sci/Tech| A group of techn...|          Техас Инструменты...|
| 11|Sci/Tech| Apple Computer I...|苹果计算机公司（AAPL.O）。 ...|
| 12|

In [None]:
# Split dataset 
train_df , test_df = dataset.randomSplit([0.7, 0.3])

# Create Pipeline With MultiLingual Embeddings and Trainable Classifier

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *


# The text to classify is in the "text" column
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# LaBSE produces language-agnostic sentence embeddings for each document
sent_embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# The class labels are in the "y" column
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("y")\
    .setMaxEpochs(60)\
    .setLr(0.005)

pipeline = Pipeline(
    stages = [
        document,
        sent_embeddings,
        classifier_dl
    ])
mutli_lingual_model = pipeline.fit(train_df)


labse download started this may take some time.
Approximate size to download 1.7 GB
[OK!]


In [None]:
preds = mutli_lingual_model.transform(test_df)
preds.show()

+----+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _c0|       y|                text|      test_sentences|            document| sentence_embeddings|               class|
+----+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|   0|Business|Unions representi...|టర్నర్ నెవాల్ వద్...|[{document, 0, 12...|[{sentence_embedd...|[{category, 0, 12...|
|1000|Business|Persistent econom...|                    |[{document, 0, 15...|[{sentence_embedd...|[{category, 0, 15...|
|1001|Business|The catheter that...|                    |[{document, 0, 24...|[{sentence_embedd...|[{category, 0, 24...|
|1006|Business|SCOTTISH  amp; So...|                    |[{document, 0, 15...|[{sentence_embedd...|[{category, 0, 15...|
|1007|  Sports|Teenage striker W...|                    |[{document, 0, 13...|[{sentence_embedd...|[{category, 0, 13...|
|1008|  Sports|  British police 

# Evaluate Multi-Lingual Model

In [None]:
df = preds.select(['y','class.result']).toPandas()
df['result'] = df['result'].apply(lambda x : x[0])
df

Unnamed: 0,y,result
0,Business,Business
1,Business,Business
2,Business,Business
3,Business,Business
4,Sports,Sports
...,...,...
2270,Business,Business
2271,Sports,Sci/Tech
2272,Sci/Tech,Sci/Tech
2273,Sports,Sports


In [None]:
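# We use sklearn to evaluate the results on the test dataset.
# Note: classification_report expects (y_true, y_pred). Predictions are passed first here,
# so the precision and recall columns below are swapped relative to the usual reading.
from sklearn.metrics import classification_report
print(classification_report(df['result'], df['y']))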

              precision    recall  f1-score   support

    Business       0.81      0.79      0.80       572
    Sci/Tech       0.82      0.81      0.81       595
      Sports       0.96      0.92      0.94       578
       World       0.84      0.92      0.88       530

    accuracy                           0.86      2275
   macro avg       0.86      0.86      0.86      2275
weighted avg       0.86      0.86      0.86      2275
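
scikit-learn's `classification_report` takes the ground-truth labels as its first argument and the predictions as its second. The following sketch, written in plain Python rather than with scikit-learn, shows how per-class precision and recall are derived and how they trade places when the two arguments are swapped. The rows are the nine `(y, result)` pairs visible in the DataFrame preview above, so the numbers describe only that small sample, not the full test set.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one label from parallel label lists."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == label and p == label
    )
    predicted = sum(1 for p in y_pred if p == label)  # rows predicted as label
    actual = sum(1 for t in y_true if t == label)     # rows truly of this label
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# (y, result) pairs copied from the visible rows of the preview.
rows = [
    ("Business", "Business"),
    ("Business", "Business"),
    ("Business", "Business"),
    ("Business", "Business"),
    ("Sports", "Sports"),
    ("Business", "Business"),
    ("Sports", "Sci/Tech"),
    ("Sci/Tech", "Sci/Tech"),
    ("Sports", "Sports"),
]
y_true = [y for y, _ in rows]
y_pred = [result for _, result in rows]

sports = precision_recall(y_true, y_pred, "Sports")
sports_swapped = precision_recall(y_pred, y_true, "Sports")

print("Sports (y_true first):", sports)
print("Sports (y_pred first):", sports_swapped)
```

Accuracy is unaffected by the argument order, but per-class precision, recall, and support are, so it is worth checking which column is passed first before comparing reports.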



In [None]:
# Train dataset metrics (predictions are again passed as the first argument)
preds = mutli_lingual_model.transform(train_df)
df = preds.select(['y','class.result']).toPandas()
df['result'] = df['result'].apply(lambda x : x[0])
print(classification_report(df['result'], df['y']))


              precision    recall  f1-score   support

    Business       0.92      0.88      0.90      1389
    Sci/Tech       0.91      0.90      0.90      1340
      Sports       0.98      0.96      0.97      1376
       World       0.88      0.95      0.92      1220

    accuracy                           0.92      5325
   macro avg       0.92      0.92      0.92      5325
weighted avg       0.92      0.92      0.92      5325



# The Model understands English
![en](https://www.worldometers.info/img/flags/small/tn_nz-flag.gif)

In [None]:
model=LightPipeline(mutli_lingual_model)

In [None]:
model.annotate("Businesses are the best way of making profit ")

{'class': ['Business'],
 'document': ['Businesses are the best way of making profit '],
 'sentence_embeddings': ['Businesses are the best way of making profit ']}

In [None]:
model.annotate("Science has advanced rapidly over the last century ")

{'class': ['Sci/Tech'],
 'document': ['Science has advanced rapidly over the last century '],
 'sentence_embeddings': ['Science has advanced rapidly over the last century ']}

# The Model understands German
![de](https://www.worldometers.info/img/flags/small/tn_gm-flag.gif)

In [None]:
# German for: 'Businesses are the best way of making profit'
model.annotate("Unternehmen sind der beste Weg, um Gewinn zu erzielen")

{'class': ['Business'],
 'document': ['Unternehmen sind der beste Weg, um Gewinn zu erzielen'],
 'sentence_embeddings': ['Unternehmen sind der beste Weg, um Gewinn zu erzielen']}

In [None]:
# German for: 'Science has advanced rapidly over the last century'
model.annotate("Die Wissenschaft hat im letzten Jahrhundert rasante Fortschritte gemacht ")

{'class': ['Sci/Tech'],
 'document': ['Die Wissenschaft hat im letzten Jahrhundert rasante Fortschritte gemacht '],
 'sentence_embeddings': ['Die Wissenschaft hat im letzten Jahrhundert rasante Fortschritte gemacht ']}

# The Model understands Chinese
![zh](https://www.worldometers.info/img/flags/small/tn_ch-flag.gif)

In [None]:
# Chinese for: 'Businesses are the best way of making profit'
model.annotate("創業是最好的盈利方式 ")

{'class': ['Business'],
 'document': ['創業是最好的盈利方式 '],
 'sentence_embeddings': ['創業是最好的盈利方式 ']}

In [None]:
# Chinese for: 'Science has advanced rapidly over the last century'
model.annotate("在上个世纪，科学发展迅速 ")

{'class': ['Sci/Tech'],
 'document': ['在上个世纪，科学发展迅速 '],
 'sentence_embeddings': ['在上个世纪，科学发展迅速 ']}

# The Model understands Afrikaans

![af](https://www.worldometers.info/img/flags/small/tn_sf-flag.gif)



In [None]:
# Afrikaans for: 'Businesses are the best way of making profit'
model.annotate("Besighede is die beste manier om wins te maak")

{'class': ['Business'],
 'document': ['Besighede is die beste manier om wins te maak'],
 'sentence_embeddings': ['Besighede is die beste manier om wins te maak']}

In [None]:
# Afrikaans for: 'Science has advanced rapidly over the last century'
model.annotate("Die wetenskap het die afgelope eeu vinnig gevorder ")

{'class': ['Sci/Tech'],
 'document': ['Die wetenskap het die afgelope eeu vinnig gevorder '],
 'sentence_embeddings': ['Die wetenskap het die afgelope eeu vinnig gevorder ']}

# The Model understands Urdu
![ur](https://www.worldometers.info/img/flags/small/tn_pk-flag.gif)

In [None]:
# Urdu for: 'There has been a great increase in businesses over the last decade'
model.annotate("پچھلے ایک دہائی کے دوران کاروباروں میں زبردست اضافہ ہوا ہے ")

{'class': ['Business'],
 'document': ['پچھلے ایک دہائی کے دوران کاروباروں میں زبردست اضافہ ہوا ہے '],
 'sentence_embeddings': ['پچھلے ایک دہائی کے دوران کاروباروں میں زبردست اضافہ ہوا ہے ']}

In [None]:
# Urdu for: 'Science has advanced rapidly over the last century'
model.annotate("سائنس گذشتہ صدی کے دوران تیزی سے ترقی کرچکی ہے ")

{'class': ['Sci/Tech'],
 'document': ['سائنس گذشتہ صدی کے دوران تیزی سے ترقی کرچکی ہے '],
 'sentence_embeddings': ['سائنس گذشتہ صدی کے دوران تیزی سے ترقی کرچکی ہے ']}

# The Model understands Hindi
![hi](https://www.worldometers.info/img/flags/small/tn_in-flag.gif)


In [None]:
# Hindi for: 'There has been a great increase in businesses over the last decade'
model.annotate("पिछले दशक में व्यवसायों में बहुत वृद्धि हुई है ")

{'class': ['Business'],
 'document': ['पिछले दशक में व्यवसायों में बहुत वृद्धि हुई है '],
 'sentence_embeddings': ['पिछले दशक में व्यवसायों में बहुत वृद्धि हुई है ']}

In [None]:
# Hindi for: 'Science has advanced rapidly over the last century'
model.annotate("विज्ञान पिछली सदी में तेजी से आगे बढ़ा है ")

{'class': ['Sci/Tech'],
 'document': ['विज्ञान पिछली सदी में तेजी से आगे बढ़ा है '],
 'sentence_embeddings': ['विज्ञान पिछली सदी में तेजी से आगे बढ़ा है ']}
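
The spot checks above can be summarized programmatically. The sketch below assumes the dictionary shape shown in the `annotate()` outputs of this notebook, where the `class` key holds a list with one predicted label. The label values are copied from those outputs; the helper name `predicted_label` is hypothetical and not part of Spark NLP.

```python
def predicted_label(annotation):
    """Return the single predicted label from an annotate()-style result dict."""
    return annotation["class"][0]

# `class` values copied from the outputs shown in this notebook, ordered as
# (business sentence, science sentence) for each language.
annotations_by_language = {
    "English":   [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
    "German":    [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
    "Chinese":   [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
    "Afrikaans": [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
    "Urdu":      [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
    "Hindi":     [{"class": ["Business"]}, {"class": ["Sci/Tech"]}],
}

expected_labels = ["Business", "Sci/Tech"]

# True for a language when every sentence received the expected label.
consistency = {
    language: [predicted_label(a) for a in annotations] == expected_labels
    for language, annotations in annotations_by_language.items()
}

for language, ok in consistency.items():
    print(f"{language:<10} {'consistent' if ok else 'MISMATCH'}")
```

In a live session the hard-coded dictionaries would be replaced by the results of `model.annotate(...)` for each sentence. Two sentences per language is only a sanity check, not an evaluation.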