

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_HRL.ipynb)






## **NER Model for 10 Different Languages**


## **1. Colab Setup**

In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

In [2]:
!pip install --ignore-installed spark-nlp-display

Collecting spark-nlp-display
  Using cached spark_nlp_display-1.8-py3-none-any.whl (95 kB)
Collecting numpy
  Using cached numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting ipython
  Using cached ipython-7.31.1-py3-none-any.whl (792 kB)
Collecting spark-nlp
  Using cached spark_nlp-3.4.0-py2.py3-none-any.whl (140 kB)
Collecting svgwrite==1.4
  Using cached svgwrite-1.4-py3-none-any.whl (66 kB)
Collecting pandas
  Using cached pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting pygments
  Using cached Pygments-2.11.2-py3-none-any.whl (1.1 MB)
Collecting jedi>=0.16
  Using cached jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
Collecting backcall
  Using cached backcall-0.2.0-py2.py3-none-any.whl (11 kB)
Collecting setuptools>=18.5
  Using cached setuptools-60.5.0-py3-none-any.whl (958 kB)
Collecting decorator
  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0

In [3]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## **2. Start Spark Session**

In [4]:
spark = sparknlp.start()

## **3. Sample Texts for Each of the 10 Languages**

In [5]:
text_list_english = ["""Jerome Horsey was a resident of the Russia Company in Moscow from 1572 to 1585.""","""The 1906 San Francisco earthquake struck the coast of Northern California.""","""Jan Verhaas is a Dutch snooker and pool referee. He was born in Maassluis, and now lives in Brielle.""","""Ethiopian historians who married Rita Pankhurst in Addis Ababa have been married for more than a year."""]

In [6]:
text_list_arabic = ["""يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.""","""خريطة العالم بِيد أمير البحار العُثماني حاجي أحمد مُحيي الدين پیري، الشهير باسم پيري ريِّس، رُسمت سنة 1513م.""",""" ويحتفل بحياة وإنجازات مارتن لوثر كنغ، وهو زعيم بارز في الحقوق المدنية الأمريكية والأكثر شهرة بحملاته لإنهاء التمييز العنصري في وسائل النقل العامة والمساواة العرقية في الولايات المتحدة."""]

In [7]:
text_list_german = ["""Die Mona Lisa ist ein Ölgemälde aus dem 16. Jahrhundert, das von Leonardo geschaffen wurde. Es findet im Louvre in Paris statt.""","""Emilie Hartmanns Vater August Hartmann war Lehrer an der Hohen Karlsschule in Stuttgart, bis zu deren Auflösung 1793.""","""1794 wurde Emilie Hartmann geboren in Germany.""","""Jenny Staley Hoad (* 3. März 1934 in Melbourne als Jennifer Staley) ist eine ehemalige australische Tennisspielerin."""]

In [8]:
text_list_spanish = ["""El día de Martin Luther King, Jr. (en inglés Martin Luther King, Jr. Day) es un día festivo de los Estados Unidos marcado por el aniversario del natalicio del reverendo doctor Martin Luther King, Jr. Se celebra el tercer lunes de enero de cada año, que es aproximadamente la fecha del nacimiento de King, el 15 de enero de 1929.""","""Chalfie se graduó en la Universidad de Harvard y es profesor de biología en la Universidad de Columbia.""","""Nacida el 10 de agosto de 1943 como Veronica Yvette Bennett en Harlem del Este, Manhattan, Nueva York, de madre afroamericana y padre irlandés.""","""Ricardo Bofill Levi nació el 5 de diciembre de 1939 en Barcelona."""]

In [9]:
text_list_latvian = ["""Anna Haselborga ir zviedru kērlinga spēlētāja, 2018. gada ziemas olimpisko spēļu čempione un pasaules čempione jauktajās dubultspēlēs, dzimusi Stokholmā.""","""Aleksandrs Melderis dzimis 1909. gada 19. janvārī Jelgavā Pētera un Margrietas (dzim. Rozenes) Melderu ģimenē.""","""6. janvārī Eiropas Zāļu aģentūra (EZA) apstiprināja arī ASV kompānijas Moderna izstrādāto vakcīnu.""","""29. janvārī, EZA apstiprināja Oksfordas Universitātes un farmācijas kompānijas “AstraZeneca” izstrādāto vakcīnu."""]

In [10]:
text_list_dutch = ["""Amerigo Vespucci werd op 9 maart 1454 in Florence geboren, hij was dus een Genuees.""", """Van 23 juni tot 6 juli 1505 werd het Beleg van Arnhem opgezet door Filips de Schone.""","""Graham William Nash is een Engelse zanger en Graham William Nash is geboren in Blackpool.""","""Gaspard Ulliel was een Franse filmacteur en -model, en Gaspard Ulliel werd geboren in Frankrijk."""]

In [11]:
text_list_portuguese = ["""Kobe Bean Bryant foi um jogador de basquete profissional americano e Kobe Bean Bryant nasceu nos Estados Unidos.""","""O Museu Britânico localiza-se em Londres e foi fundado em 7 de junho de 1753.""","""Simon Marius era um astrônomo alemão, e Simon Marius nasceu em Gunzenhausen.""","""Muse é uma banda britânica de rock de Teignmouth, Devon, formada em 1994."""]

In [12]:
text_list_french = ["""Quand j'ai dit à John que je voulais déménager en Alaska, il m'a prévenu que j'aurais du mal à trouver un Starbucks là-bas.""","""Germaine Poinso-Chapuis est une avocate et femme politique française, née le 6 mars 1901 à Marseille et morte le 18 février 1981 dans la même ville.""","""Zine el-Abidine Ben Ali né le 3 septembre 1936 à Hammam Sousse et mort le 19 septembre 2019 à Djeddah, est un homme d'État tunisien.""","""Ricardo Bofill Leví est un architecte espagnol, né le 5 décembre 1939 à Barcelone où il est mort le 14 janvier 2022."""]

In [13]:
text_list_chinese = ["""史蒂夫·戴维斯 生于 英格兰""","""诺瓦克·德约科维奇 生于 贝尔格莱德""","""阿莱克西娅·普特利亚斯 出生于 西班牙"""]

In [14]:
text_list_italian = ["""Il Martin Luther King's Day è una festa nazionale degli Stati Uniti in onore dell'attivista e vincitore del Premio Nobel per la pace Martin Luther King (15 gennaio 1929 - 4 aprile 1968) che si celebra il terzo lunedì di gennaio, un giorno vicino a gennaio 15, giorno della sua nascita negli Stati Uniti.""","""Doraemon è un manga scritto e disegnato da Fujiko F. Fujio e pubblicato in Giappone dal dicembre 1969 all'aprile 1996 sul mensile CoroCoro Comic di Shōgakukan, per un totale di ventisette anni di attività.""","""James Watt nacque in Scozia il 19 gennaio 1736 da genitori presbiteriani.""","""Martin Luther King nacque ad Atlanta, negli Stati Uniti il 15 gennaio 1929."""]

In [15]:
text_list = ["text_list_english","text_list_arabic","text_list_german","text_list_spanish","text_list_latvian","text_list_dutch","text_list_portuguese","text_list_french","text_list_chinese","text_list_italian"]


In [16]:
# Creating input folders
import os
for MODEL_NAME in text_list:
  INPUT_FILE_PATH='/content/Ner_HRL/inputs/'+MODEL_NAME+'/'

  # Recreate the folder; -f keeps rm quiet when the folder does not exist yet
  !rm -rf $INPUT_FILE_PATH
  !mkdir -p $INPUT_FILE_PATH

  # MODEL_NAME is the name of one of the lists defined above, so look the list up by name
  texts = globals()[MODEL_NAME]

  # Each file holds a short preview line followed by the full text
  for i, v in enumerate(texts):
    with open(os.path.join(INPUT_FILE_PATH,'Example'+str(i+1)+'.txt'), 'w', encoding="utf8") as f:
      f.write(v[:min(len(v)-10, 100)]+'... \n'+v)
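Each example file written above holds a preview line (the start of the text followed by `... `) and then the full text. The sketch below shows one way to load such a file back. The helper names `write_example` and `read_example` are hypothetical and are not part of the notebook. It assumes the preview contains no newline, which holds for the single-line samples used here, and it writes to a temporary folder rather than `/content/Ner_HRL/inputs/`.

```python
import os
import tempfile


def write_example(path, text):
    """Write a preview line followed by the full text (same layout as the cell above)."""
    preview = text[:min(len(text) - 10, 100)] + '... '
    with open(path, 'w', encoding='utf8') as f:
        f.write(preview + '\n' + text)


def read_example(path):
    """Split an example file back into its preview line and full text."""
    with open(path, encoding='utf8') as f:
        # Assumption: the first newline separates the preview from the full text
        preview, full_text = f.read().split('\n', 1)
    return preview, full_text


sample = "Jerome Horsey was a resident of the Russia Company in Moscow from 1572 to 1585."

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'Example1.txt')
    write_example(example_path, sample)
    preview, full_text = read_example(example_path)

print(preview)
print(full_text)
```

Only `full_text` would be passed to the pipeline. The preview line is a short label for the example.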


In [17]:
# Creating output folders

for MODEL_NAME in text_list:
  OUTPUT_FILE_PATH='/content/Ner_HRL/outputs/'+MODEL_NAME+'/'

  # Recreate the folder; -f keeps rm quiet when the folder does not exist yet
  !rm -rf $OUTPUT_FILE_PATH
  !mkdir -p $OUTPUT_FILE_PATH

## **4. Define Spark NLP pipeline**
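The pipeline in this section ends with a `NerConverter` stage, which turns token-level tags into the entity chunks shown in the output tables. As a rough illustration of that idea, here is a simplified pure-Python sketch. It is not Spark NLP's implementation. It assumes the classifier emits IOB-style tags such as `B-PER` and `I-PER`, and the tags in the example are written by hand for illustration. The function name `merge_iob` is hypothetical.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk being built, if there is one
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current chunk
            current_tokens.append(token)
        else:
            # "O" tag: the token lies outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hand-written tags for part of the first English sample (an assumption, not model output)
tokens = ["Jerome", "Horsey", "was", "a", "resident", "of", "the",
          "Russia", "Company", "in", "Moscow"]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
        "B-ORG", "I-ORG", "O", "B-LOC"]

example_chunks = merge_iob(tokens, tags)
print(example_chunks)
```

The resulting pairs have the same shape as the `chunk` and `ner_label` columns that the cell below prints.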

In [18]:
from sparknlp_display import NerVisualizer
for MODEL_NAME in text_list:
    INPUT_FILE_PATH='/content/Ner_HRL/inputs/'+MODEL_NAME+'/'
    OUTPUT_FILE_PATH='/content/Ner_HRL/outputs/'+MODEL_NAME+'/'
    # Sorted input file names; used at the end of the loop to name the JSON outputs
    file_list=sorted(os.listdir(INPUT_FILE_PATH))


    # Wrap the raw text column in a Spark NLP document annotation
    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Multilingual ("xx") sentence splitter
    sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Multilingual XLM-RoBERTa token classifier that tags each token with an entity label
    tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("ner")

    # Merge token-level tags into entity chunks
    ner_converter = NerConverter()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

    nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

    # No stage needs training, so fitting on an empty DataFrame just builds the PipelineModel
    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    # MODEL_NAME is the name of one of the sample lists defined in section 3, so look it up by name
    df = spark.createDataFrame(pd.DataFrame({"text": globals()[MODEL_NAME]}))

    result = model.transform(df)

    # Pair each chunk with its metadata, then show the chunk text and its entity label
    result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

    # Optional visualization, left disabled. Row 3 exists only for lists with at least four texts.
    #NerVisualizer().display(
    #    result = result.collect()[3],
    #    label_col = 'ner_chunk',
    #    document_col = 'document'
    #)
    result = result.toPandas()

    # Write one JSON file per input text, named after the matching input file
    for i in result.index:
        result[['ner_chunk']].iloc[i].to_json(
            os.path.join(OUTPUT_FILE_PATH, file_list[i].split('.')[0]+".json"))

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
+-------------------+---------+
|chunk              |ner_label|
+-------------------+---------+
|Jerome Horsey      |PER      |
|Russia Company     |ORG      |
|Moscow             |LOC      |
|San Francisco      |LOC      |
|Northern California|LOC      |
|Jan Verhaas        |PER      |
|Maassluis          |LOC      |
|Brielle            |LOC      |
|Rita Pankhurst     |PER      |
|Addis Ababa        |LOC      |
+-------------------+---------+

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_hrl download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
+---------------------------+---------+
|chunk                      |ner_label|
+----------