

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_HRL.ipynb)






## **NER Model for 10 Different Languages**


## **1. Colab Setup**

In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp



In [2]:
!pip install --ignore-installed spark-nlp-display


In [3]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## **2. Start Spark Session**

In [4]:
spark = sparknlp.start()

## **3. Sample Texts for Each of the 10 Languages**

In [5]:
text_list_english = ["""Jerome Horsey was a resident of the Russia Company in Moscow from 1572 to 1585.""","""The 1906 San Francisco earthquake struck the coast of Northern California.""","""Jan Verhaas is a Dutch snooker and pool referee. He was born in Maassluis, and now lives in Brielle.""","""Ethiopian historians who married Rita Pankhurst in Addis Ababa have been married for more than a year."""]

In [6]:
text_list_arabic = ["""يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.""","""خريطة العالم بِيد أمير البحار العُثماني حاجي أحمد مُحيي الدين پیري، الشهير باسم پيري ريِّس، رُسمت سنة 1513م.""",""" ويحتفل بحياة وإنجازات مارتن لوثر كنغ، وهو زعيم بارز في الحقوق المدنية الأمريكية والأكثر شهرة بحملاته لإنهاء التمييز العنصري في وسائل النقل العامة والمساواة العرقية في الولايات المتحدة."""]

In [7]:
text_list_german = ["""Die Mona Lisa ist ein Ölgemälde aus dem 16. Jahrhundert, das von Leonardo geschaffen wurde. Es findet im Louvre in Paris statt.""","""Emilie Hartmanns Vater August Hartmann war Lehrer an der Hohen Karlsschule in Stuttgart, bis zu deren Auflösung 1793.""","""1794 wurde Emilie Hartmann geboren in Germany.""","""Jenny Staley Hoad (* 3. März 1934 in Melbourne als Jennifer Staley) ist eine ehemalige australische Tennisspielerin."""]

In [8]:
text_list_spanish = ["""El día de Martin Luther King, Jr. (en inglés Martin Luther King, Jr. Day) es un día festivo de los Estados Unidos marcado por el aniversario del natalicio del reverendo doctor Martin Luther King, Jr. Se celebra el tercer lunes de enero de cada año, que es aproximadamente la fecha del nacimiento de King, el 15 de enero de 1929.""","""Chalfie se graduó en la Universidad de Harvard y es profesor de biología en la Universidad de Columbia.""","""Nacida el 10 de agosto de 1943 como Veronica Yvette Bennett en Harlem del Este, Manhattan, Nueva York, de madre afroamericana y padre irlandés.""","""Ricardo Bofill Levi nació el 5 de diciembre de 1939 en Barcelona."""]

In [9]:
text_list_latvian = ["""Anna Haselborga ir zviedru kērlinga spēlētāja, 2018. gada ziemas olimpisko spēļu čempione un pasaules čempione jauktajās dubultspēlēs, dzimusi Stokholmā.""","""Aleksandrs Melderis dzimis 1909. gada 19. janvārī Jelgavā Pētera un Margrietas (dzim. Rozenes) Melderu ģimenē.""","""6. janvārī Eiropas Zāļu aģentūra (EZA) apstiprināja arī ASV kompānijas Moderna izstrādāto vakcīnu.""","""29. janvārī, EZA apstiprināja Oksfordas Universitātes un farmācijas kompānijas “AstraZeneca” izstrādāto vakcīnu."""]

In [10]:
text_list_dutch = ["""Amerigo Vespucci werd op 9 maart 1454 in Florence geboren, hij was dus een Genuees.""", """Van 23 juni tot 6 juli 1505 werd het Beleg van Arnhem opgezet door Filips de Schone.""","""Graham William Nash is een Engelse zanger en Graham William Nash is geboren in Blackpool.""","""Gaspard Ulliel was een Franse filmacteur en -model, en Gaspard Ulliel werd geboren in Frankrijk."""]

In [11]:
text_list_portuguese = ["""Kobe Bean Bryant foi um jogador de basquete profissional americano e Kobe Bean Bryant nasceu nos Estados Unidos.""","""O Museu Britânico localiza-se em Londres e foi fundado em 7 de junho de 1753.""","""Simon Marius era um astrônomo alemão, e Simon Marius nasceu em Gunzenhausen.""","""Muse é uma banda britânica de rock de Teignmouth, Devon, formada em 1994."""]

In [12]:
text_list_french = ["""Quand j'ai dit à John que je voulais déménager en Alaska, il m'a prévenu que j'aurais du mal à trouver un Starbucks là-bas.""","""Germaine Poinso-Chapuis est une avocate et femme politique française, née le 6 mars 1901 à Marseille et morte le 18 février 1981 dans la même ville.""","""Zine el-Abidine Ben Ali né le 3 septembre 1936 à Hammam Sousse et mort le 19 septembre 2019 à Djeddah, est un homme d'État tunisien.""","""Ricardo Bofill Leví est un architecte espagnol, né le 5 décembre 1939 à Barcelone où il est mort le 14 janvier 2022."""]

In [13]:
text_list_chinese = ["""史蒂夫·戴维斯 生于 英格兰""","""诺瓦克·德约科维奇 生于 贝尔格莱德""","""阿莱克西娅·普特利亚斯 出生于 西班牙"""]

In [14]:
text_list_italian = ["""Il Martin Luther King's Day è una festa nazionale degli Stati Uniti in onore dell'attivista e vincitore del Premio Nobel per la pace Martin Luther King (15 gennaio 1929 - 4 aprile 1968) che si celebra il terzo lunedì di gennaio, un giorno vicino a gennaio 15, giorno della sua nascita negli Stati Uniti.""","""Doraemon è un manga scritto e disegnato da Fujiko F. Fujio e pubblicato in Giappone dal dicembre 1969 all'aprile 1996 sul mensile CoroCoro Comic di Shōgakukan, per un totale di ventisette anni di attività.""","""James Watt nacque in Scozia il 19 gennaio 1736 da genitori presbiteriani.""","""Martin Luther King nacque ad Atlanta, negli Stati Uniti il 15 gennaio 1929."""]

## **4. Define Spark NLP pipeline**

The pipeline is run on every language in turn. The sample lists are named: **"text_list_english","text_list_arabic","text_list_german","text_list_spanish","text_list_latvian","text_list_dutch","text_list_portuguese","text_list_french","text_list_chinese","text_list_italian"**

In [15]:
text_list = ["text_list_english","text_list_arabic","text_list_german","text_list_spanish","text_list_latvian","text_list_dutch","text_list_portuguese","text_list_french","text_list_chinese","text_list_italian"]
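The pipeline in this section ends with a `NerConverter` stage, which merges token-level tags into entity chunks such as `Jerome Horsey` or `Russia Company`. The sketch below illustrates that grouping idea in plain Python. It assumes the classifier emits IOB-style tags (`B-PER`, `I-PER`, `O`, and so on); the tags shown are hand-written for illustration, not model output, and the helper `iob_to_chunks` is hypothetical. Spark NLP's `NerConverter` works on annotation objects rather than plain lists, so its handling of edge cases may differ.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs (illustrative helper)."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any, and reset the state
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == current_label:
            # Continuation of the chunk that is currently open
            current_tokens.append(token)
        elif tag.startswith("B-") or tag.startswith("I-"):
            # A new entity starts (a stray I- tag is treated like B-)
            flush()
            current_tokens, current_label = [token], tag[2:]
        else:
            # "O": the token is outside any entity
            flush()
    flush()
    return chunks


# Hand-written tags for part of the first English sample sentence (an assumption)
tokens = ["Jerome", "Horsey", "was", "a", "resident", "of", "the",
          "Russia", "Company", "in", "Moscow"]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
        "B-ORG", "I-ORG", "O", "B-LOC"]

chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Joining tokens with a single space is a simplification that suits space-delimited languages; it would not reproduce the Chinese samples faithfully.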


In [24]:
from sparknlp_display import NerVisualizer

# Map each list name to its sample texts so the loop can look them up directly
samples = {
    "text_list_english": text_list_english,
    "text_list_arabic": text_list_arabic,
    "text_list_german": text_list_german,
    "text_list_spanish": text_list_spanish,
    "text_list_latvian": text_list_latvian,
    "text_list_dutch": text_list_dutch,
    "text_list_portuguese": text_list_portuguese,
    "text_list_french": text_list_french,
    "text_list_chinese": text_list_chinese,
    "text_list_italian": text_list_italian,
}

for MODEL_NAME in text_list:

    # The stages are rebuilt on every iteration, so the pretrained models are
    # requested again for each language (hence the repeated download messages).

    # Wrap the raw text column into Spark NLP document annotations
    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Split documents into sentences with the multilingual ("xx") detector
    sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    # Split sentences into tokens
    tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Tag each token with the multilingual XLM-RoBERTa NER model
    tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("ner")

    # Merge token-level tags into entity chunks
    ner_converter = NerConverter()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

    nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

    # All stages are pretrained, so fitting on an empty DataFrame only assembles the model
    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    # Build the input DataFrame from the sample texts of the current language
    df = spark.createDataFrame(pd.DataFrame({"text": samples[MODEL_NAME]}))

    print("<----------------- MODEL NAME:", "\033[1m" + MODEL_NAME + "\033[0m", " ----------------- >")
    result = model.transform(df)

    # Pair each chunk with its metadata and emit one (chunk, ner_label) row per entity
    result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label"))\
        .show(truncate=False)

    # Visualize the third sample text (index 2) of the current language
    NerVisualizer().display(
        result = result.collect()[2],
        label_col = 'ner_chunk',
        document_col = 'document'
    )

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_english  ----------------- >
+-------------------+---------+
|chunk              |ner_label|
+-------------------+---------+
|Jerome Horsey      |PER      |
|Russia Company     |ORG      |
|Moscow             |LOC      |
|San Francisco      |LOC      |
|Northern California|LOC      |
|Jan Verhaas        |PER      |
|Maassluis          |LOC      |
|Brielle            |LOC      |
|Rita Pankhurst     |PER      |
|Addis Ababa        |LOC      |
+-------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_arabic  ----------------- >
+---------------------------+---------+
|chunk                      |ner_label|
+---------------------------+---------+
|الرياض                     |LOC      |
|فيصل بن بندر بن عبد العزيز |PER      |
|الرياض                     |LOC      |
|حاجي أحمد مُحيي الدين پیري،|PER      |
|پيري ريِّس،                |PER      |
|مارتن لوثر كنغ،            |PER      |
|الولايات المتحدة           |LOC      |
+---------------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_german  ----------------- >
+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Leonardo         |PER      |
|Louvre           |LOC      |
|Paris            |LOC      |
|Emilie Hartmanns |PER      |
|August Hartmann  |PER      |
|Hohen Karlsschule|ORG      |
|Stuttgart        |LOC      |
|Emilie Hartmann  |PER      |
|Germany          |LOC      |
|Jenny Staley Hoad|PER      |
|Melbourne        |LOC      |
|Jennifer Staley  |PER      |
+-----------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_spanish  ----------------- >
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|Martin Luther King, Jr |PER      |
|Estados Unidos         |LOC      |
|Martin Luther King, Jr |PER      |
|King                   |PER      |
|Chalfie                |PER      |
|Universidad de Harvard |ORG      |
|Universidad de Columbia|ORG      |
|Veronica Yvette Bennett|PER      |
|Harlem del Este        |LOC      |
|Manhattan              |LOC      |
|Nueva York             |LOC      |
|Ricardo Bofill Levi    |PER      |
|Barcelona              |LOC      |
+-----------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_latvian  ----------------- >
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|Anna Haselborga        |PER      |
|Stokholmā              |LOC      |
|Aleksandrs Melderis    |PER      |
|Jelgavā                |LOC      |
|Pētera                 |PER      |
|Margrietas             |PER      |
|Rozenes                |PER      |
|Melderu                |PER      |
|Eiropas Zāļu aģentūra  |ORG      |
|EZA                    |ORG      |
|ASV                    |LOC      |
|Moderna                |ORG      |
|EZA                    |ORG      |
|Oksfordas Universitātes|ORG      |
|“AstraZeneca”          |ORG      |
+-----------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_dutch  ----------------- >
+-------------------+---------+
|chunk              |ner_label|
+-------------------+---------+
|Amerigo Vespucci   |PER      |
|Florence           |LOC      |
|Arnhem             |LOC      |
|Filips de Schone   |PER      |
|Graham William Nash|PER      |
|Graham William Nash|PER      |
|Blackpool          |LOC      |
|Gaspard Ulliel     |PER      |
|Gaspard Ulliel     |PER      |
|Frankrijk          |LOC      |
+-------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_portuguese  ----------------- >
+----------------+---------+
|chunk           |ner_label|
+----------------+---------+
|Kobe Bean Bryant|PER      |
|Kobe Bean Bryant|PER      |
|Estados Unidos  |LOC      |
|Museu Britânico |LOC      |
|Londres         |LOC      |
|Simon Marius    |PER      |
|Simon Marius    |PER      |
|Gunzenhausen    |LOC      |
|Muse            |ORG      |
|Teignmouth      |LOC      |
|Devon           |LOC      |
+----------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_french  ----------------- >
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|John                   |PER      |
|Alaska                 |LOC      |
|Starbucks              |ORG      |
|Germaine Poinso-Chapuis|PER      |
|Marseille              |LOC      |
|Zine el-Abidine Ben Ali|PER      |
|Hammam Sousse          |LOC      |
|Djeddah                |LOC      |
|Ricardo Bofill Leví    |PER      |
|Barcelone              |LOC      |
+-----------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_chinese  ----------------- >
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|史蒂夫·戴维斯        |PER      |
|英格兰               |LOC      |
|诺瓦克·德约科维奇    |PER      |
|贝尔格莱德           |LOC      |
|阿莱克西娅·普特利亚斯|PER      |
|西班牙               |LOC      |
+---------------------+---------+



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
<----------------- MODEL NAME: text_list_italian  ----------------- >
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Stati Uniti       |LOC      |
|Martin Luther King|PER      |
|Stati Uniti       |LOC      |
|Fujiko F          |PER      |
|Fujio             |PER      |
|Giappone          |LOC      |
|CoroCoro Comic    |ORG      |
|Shōgakukan        |LOC      |
|James Watt        |PER      |
|Scozia            |LOC      |
|Martin Luther King|PER      |
|Atlanta           |LOC      |
|Stati Uniti       |LOC      |
+------------------+---------+
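
The tables above come from zipping each row's chunk texts with their metadata and exploding the pairs into one row per entity. A rough plain-Python equivalent of that reshaping is sketched below. The `rows` structure is a simplified, hypothetical stand-in for the `ner_chunk.result` and `ner_chunk.metadata` columns, filled with values from the English table, and `flatten_chunks` is an illustrative helper rather than a Spark NLP function.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Simplified stand-in for two rows of the ner_chunk column (values from the English table)
rows: List[Dict[str, list]] = [
    {
        "result": ["Jerome Horsey", "Russia Company", "Moscow"],
        "metadata": [{"entity": "PER"}, {"entity": "ORG"}, {"entity": "LOC"}],
    },
    {
        "result": ["San Francisco", "Northern California"],
        "metadata": [{"entity": "LOC"}, {"entity": "LOC"}],
    },
]


def flatten_chunks(rows: List[Dict[str, list]]) -> List[Tuple[str, str]]:
    """Pair each chunk with its entity label and emit one tuple per entity."""
    flat: List[Tuple[str, str]] = []
    for row in rows:
        # zip plays the role of arrays_zip; the outer loop plays the role of explode
        for chunk, meta in zip(row["result"], row["metadata"]):
            flat.append((chunk, meta["entity"]))
    return flat


pairs = flatten_chunks(rows)
for chunk, label in pairs:
    print(f"{chunk}\t{label}")

# Count how often each label occurs across all rows
label_counts = Counter(label for _, label in pairs)
print(label_counts)
```

A flat list of this shape is convenient for quick sanity checks, such as counting labels per language before inspecting the visualizations.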

