

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb)






# ***`NER Model for African Languages`***


## 1. Colab Setup

In [None]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

In [None]:
import pandas as pd
import numpy as np
import json
import os

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_display import NerVisualizer

## 2. Start Spark Session

In [None]:
spark = sparknlp.start()
print ("Spark NLP Version :", sparknlp.version())
spark

Spark NLP Version : 4.2.8


### <font color='green'> 📍***xlm_roberta_large_token_classifier_masakhaner***</font>

*This model is an xlm_roberta_large checkpoint fine-tuned for NER on ten African languages (**Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá**).*


### <font color='green'> 📍***distilbert_base_token_classifier_masakhaner***</font>

*This model was fine-tuned on the MasakhaNER dataset for nine African languages (**Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá**), using DistilBERT embeddings and DistilBertForTokenClassification for NER.*


## 3. Sample texts for each of the African languages

In [None]:
text_list_amharic = ["""አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።""","""ሰማያዊ ፓርቲ ዛሬ በወቅታዊ የሀገሪቱ የፖለቲካ ጉዳዮች ላይ በመኢአድ ጽህፈት ቤት የሰጠው ጋዜጣዊ መግለጫ ይከተላል ፡፡""","""የ ዓመቱ አዲሱ የዚምባብዌ ፕሬዚደንት ኤመርሰን ምናንጋግዋ በሁለቱ ቻዎቻቸው አንፃር በዕድሜ ትንሹ ናቸው ።""","""ዶይቸ ቬለ ያነጋገራቸው የመብት ተሟጋቿ ሊንዳ ማዜሪሬ በዕድሙ በተዳከሙ መሪዎች ነው የምንተዳደረው በማለት የርሳቸውን እና የአህጉሩን ወጣት ትውልድ ቅሬታ ገልጸዋል ።""","""ሪታ ፓንክኸርስት የኢትዮጵያ ባለውለታ አዲስ አበባ ላይ ትዳር የመሰረቱት የኢትዮጵያ ታሪክ ተመራማሪዎች በትዳር ከ ዓመታት በላይ ዘልቀዋል ።""","""በሳልስቱ እስራኤል ዉስጥ በተደረገዉ አጠቃላይ ምርጫ አክራሪዉ የጠቅላይ ሚንስትር ቤንያሚን ኔትንያሁ ፓርቲ ሊኩድ አሸነፈ ።"""]

In [None]:
text_list_hausa = ["""A saurari cikakken rahoton wakilin Muryar Amurka Ibrahim Abdul'aziz""","""Najeriya : Kungiyar Ma'aikatan Jami'o'i Ta Shiga Yajin Aikin Gargadi""","""A ranar Juma’a mai zuwa ne wa’adin yajin aikin na gargadi zai kammala , kuma a hirar su da wakilin Muaryar Amurka , Komared Mohammed Jaji ya yi tsokaci game da mataki na gaba .""","""Kan haka Majalisar Dinkin Duniya ta zabi Aliko Dangote , da shugaban bankin raya Afirka , da wassu mutane 25 a fadin duniya su jagoranci magance matsalar tamowa , kafin shekara 2030 .""","""Temitope Olatoye Sugar shine mai wakiltar mazabar Lagelu da Akinyele daga jihar Oyo , a majalisar wakilan tarayyar Najeriya .""","""Tsohon mataimakin shugaban Najeriya , kuma dan takarar shugaban kasa a zaben 2019 karkashin jam’iyyar adawa ta PDP , Atiku Abubakar , ya yi Allah wadai da yunkurin da wasu sojoji suka yi na kifar da “ zababbiyar gwamnatin Habasha ."""]

In [None]:
text_list_igbo = ["""Osote onye - isi ndị ome - iwu Naịjirịa bụ Ike Ekweremadu ekwuola na ike agwụla ndị Sịnatị iji otu nkeji darajụụ akwanyere ndị egburu n'ime oke ọgbaghara dị na Naịjirịa oge ọ bula .""","""Okwu a Buhari kwuru na isi ndọrọndọrọ ọchịchị na 2015 bu ịhe eji kpụrụ ya na ọnụ ugbua , ọkachasị ka ụlọ ọrụ na - ahụ maka ọnụ ọgụgụ a na - akpọ National Bureau of Statistics ( NBS ) nwepụtara ozi n'akọwa na mmadụ ruru nde asaa na nari ise so na ndị enweghi ọrụ kemgbe afọ 2016 .""","""Google Africa kwuru n'igwe okwu Twitter sị : Taa , anyị na - akwanyere onye egwuregwu bọọlụ a ma ama , Stephen Keshi ugwu .""","""Keshi chịrị ndị otu egwuregwu Super Eagles kemgbe afọ 2011 ma durukwa ha gaa asọmpi dị iche iche nke gụnyere ; Iko Mba Afrika na 2013 ( nke ha bulatara Naịjirịa ) , iko mpaghara Afrịka dị iche iche na 2013 , ma nye aka wetara Naijiria ọnọdụ n'asọmpi Iko Mbaụwa niile na 2014 .""","""N' akwụkwọ ozi , ngalaba 'US Department' tinyere na websait ha , ha kwuru sị : Yunaited Steeti na - enwe obi mwute n' iyi ọrụ nke onye ndu ndị na - ama gọọmenti Kenya aka n'ihu bụ Raila Odinga duru onwe ya ka ọnwa Jenuwari gbara ịrị atọ .""","""Cheta na Gọọmenti etiti mechiburu ụlọọrụ ngosi TV atọ maka na ha gbasara ozi gosịrị Raila Odinga ebe ọ na - edu onwe ya iyi ọrụ ma kpọkwa onwe ya onyeisiala mba Kenya , ebe ulọikpe Kenya akwụsịrị mmechi ahụ ụbọchị Wenesde .""","""Taa , otu n'ime ndị kewapụtara n'otu ndọrọndọrọ ọchịchị APC kpọrọ ndị ntaakụkọ n'isi ụlọọrụ ha maka ị kọwa echiche ha n'esomokwu nke di n'etiti ndị APC nke Imo steeti . 
N'ọnụ okwu TOE Ekechi bụ onụ na - ekwuchitere otu a , ha na - ebo gọvanọ Okorocha ebubo na o nupuru iwu ji patu ha isi ọtụtụ""","""Otu kporo onweha 'The Coalition of Northern Groups' na bekee gwara onyeisiala Naịjirịa bụ Muhammadu Buhuri na onye chiburu dịka osote onyeisiala n'oge garaaga bụ Atiku Abubakar na ọ ga - adị mma maọbụrụ na ha abụọ wepuru aka n'ime ọsọ ị banye n'ọkwa ọchịchị dịka onyeisiala n'afọ 2019 ."""]

In [None]:
text_list_kinyarwanda = ["""Ambasaderi w’Umuryango w’Ubumwe bw’u Burayi mu Rwanda , Nicola Bellomo , aherutse gushima uko u Rwanda rurimo guhangana n’icyorezo cya Coronavirus , yizeza ko uyu muryango uzakomeza gufatanya na rwo muri uru rugamba no mu zindi gahunda z’iterambere .""","""Imibare ya Banki y’Isi yo kuwa 9 Mata igaragaza ko ubukungu bwo muri Afurika yo-munsi y’Ubutayu bwa Sahara , bwagizweho ingaruka na Coronavirus ndetse ko buzamanuka ku kigero cya - 2 .""","""Mu butumwa yanditse kuri Twitter kuri uyu kuwa Kane , Mateke yahishuye ko kuva kera na kare atemeraga amasezerano basinyanye n’u Rwanda agamije guhosha umwuka mubi uri hagati y’ibihugu byombi , mu gihe ari umwe mu bagombaga kuba bakurikirana uko ashyirwa mu bikorwa .""","""Amagambo ya Mateke anahura n’ay’umudepite Ruth Nankabirwa , kuri uyu kuwa Gatatu wabwiye bagenzi be mu Nteko Ishinga Amategeko ko Guverinoma ya Uganda ikwiye gukemura bwangu ikibazo ifitanye n’u Rwanda , ariko asa n’uca amarenga ku buryo bwakoreshwa .""","""Ubwo bari ku ngingo zijyanye n’uko Uganda ifasha imitwe yitwaje intwaro , Nduhungirehe yatanze urugero rw’igitero cyabaye mu ijoro rishyira ku itariki ya Kane Ukwakira aho abarwanyi b’umutwe wa RUD Urunana bateye mu Kinigi ."""]

In [None]:
text_list_luganda = ["""Phillip Wokorach , Justin Kimono ne Adrian Kisito be bamu ku baayambye Uganda , eyawangula empala zino omwaka oguwedde , okuva emabeganefuna obuwanguzi .""","""Oluvannyuma yaddukira mu Zimbabwe ngakozesa Paasipooti eyali mu mannya ga David Mubiru , kyokka aboobuyinza baamuyigga ne bamukomyawo mu Uganda , mu November 2016 , okumalayo ekibonerezo ekyemyaka ena nemyaka emirala ebiri , egyamwongerwako olwokutoloka mu kkomera .""","""DPC wa Rakai , Patience Baganzi yategeezezza nti bagenda kumukwasa poliisi ye Katwe mu Kampala gye yaddiza omusango avunaanibwe .""","""OMWAMI wa Ssabasajja owessaza lya Mawokota afudde kibwatukira nalekabanna Mawokota mu kiyongobero . Kayima David Ssekyeru afudde mu ngeri yentiisa bwaseredde nagwa mu kinaabiro nga egenze okunaaba bagenze okuyita ambulensi okumuddusa mu ddwaliro e Mmengo nafiira mu kkubo nga tebanatuuka mu ddwaliro . Ssekyeru abadde amaze wiiki emu nga mugonvugonvu kyokka abadde azeemu endasi kwekwewaliriza agende mu kinaabiro""","""Omwogezi wa poliisi mu Greater Masaka , ASP Paul Kangave yategeezezza Bukedde nti poliisi yatandikiddewo okunoonyereza oluvannyuma lwokufuna amawulire gokutemulwa kwomusuubuzi ono .""","""Mugisha baamukwatira Ndeeba mu Nsiike Zooni ku ntandikwa ya wiiki ewedde era yasooka kutegeeza poliisi nti munne Mulo yattibwa ekibinja kyaba bodaboda abaabalondoola nga baakamala okubba pikipiki e Mityana ne babataayiriza e Makindye ."""]

In [None]:
text_list_Nigerian = ["""Jii 2 go mane gin ja apiko moro ma ja higni 20 mane oyang nyinge kaka Kevin Omondi kod achiel kuom jowuoth mage mane oting' o mane iluongo ni Shopie Anyango ma ja higni 23 ne jotho mana kanyo gi kanyo e masirano mane ojuko lori moro mar kambi jo China kod apiko yoo Ringa""", """Japuonjreno ma wuoi ma jahigni 15 ochopo e nyim jayal bura Joseph Karanja kama odonjne kod ketho mar nego Noel Adhiambo midenyo ma jahigni 11 ; mane en japuonjre e skul ma Kosele Community Christian Center e kar chung' od bura ma Kasipul dwee mokalo .""","""Magi oyangi gi jawach eloo State House nyadendi Kanze Dena mane owacho ni jogo nyocha opim ne tuono e pimo manyocha otim chieng tich 4""","""Kanomedochiwo ler ewii wachno Kanze nowacho ni jotich duto mag State House ipimoga moting' o e kinde ka kinde moting' o nyaka jatend piny Kenya migosi Uhuru Kenyata gi familia mare mar ng' eyo chal margi ne tuo mar Covid - 19no kowacho ni jii 4 mane oyudi ni kod tuono sani jonie kar thieth ma Kenyatta University Teaching , Referal and Research Hospital ma gidhiyoe nyime gi yudo thieth""","""MCAsgo mane otelnegi kod Julius Nyambok ma Homa Bay Central ward ne jogolo rang' isi mag chenro buora mag dongruok moting' o kama ikanoe remo e kar thieth ma Homa Bay County Referal Hospital , kambi mogo ma chiro Kigoti kod kar pidho jamni ma Arujo"""]

In [None]:
text_list_Pidgin = ["""Popular cable satellite broadcaster DsTV , no get right to Bundesliga live matches for di 2019 / 2020 season so na pipo wey get StarTimes dey in luck because na dem get broadcast rights for Sub - Saharan Africa .""","""Whichever way wey you watch just know say you dey part of one billion pipo wey Bayern CEO Karl - Heinz Rummenigge don gauge say go watch dis weekend live matches See Saturday games .""","""Conditions Spain top league and working place of Lionel Messi dey torchlight June 12 as di date when dem go resume di season .""","""LA Lakers legend Kobe Bryant and im daughter Gianna plus seven oda die for helicopter crash for di city of Calabasa , California on Sunday 26 January .""","""Ighalo move go Chinese Super League for 2017 , first with Changchun Yatai .""","""Senegal and Liverpool forward Mane beat both Egypt player Mohammed Salah and Algeria winger Riyad Mahrez to win di award wey dem do for Egypt on Tuesday ."""]

In [None]:
text_list_Swahilu = ["""Wanamgambo wa ADF Mauaji ya Alhamisi katika mkoa wa Mbau kaskazini mwa Beni yanashukiwa kufanya na kundi la waasi la Allied Democratic Force , ADF , ambalo linahusika na mfululizo wa mauaji tangu kuanza kwa ghasia mwezi November .""","""Jeshi la Congo limegundua ‘kiwanda cha kutengeneza mabomu ya kienyeji’ katika kambi moja ya ADF waliyoiteka , msemaji wa jeshi jenerali Leon Richard Kasonga amesema Jumatano .""","""Wajumbe wa kikosi kazi cha virusi vya corona cha White House wamepangiwa kutoa ushuhuda mbele ya kamati ya Nishati na Biashara ya Baraza la Wawakilishi Jumanne , na Spika wa Baraza la Wawakilishi Nancy Pelosi amesema , “ Wananchi wa Marekani wanahitaji majibu kwa nini Rais Trump anataka upimaji upunguzwe kasi wakati wataalam wanasema upimaji zaidi unahitajika .""","""Siku Jumatano maafisa wawili wa Umoja wa Mataifa watawasilisha ripoti inayoeleza kwamba kuna ushahidi wa kutosha unaodhihirisha kwamba Saudi Arabia ilidukua simu ya Bezos .""","""Mahakama ya Juu ya Korea Kusini imeamrisha mahakama ya chini ifikirie tena moja ya mashtaka ya jinai dhidi ya Rais wa zamani Park Geun - hye ambaye alilazimishwa kuondoka madarakani mwaka 2017 kutokana na kashfa ya ufisadi .""","""Waziri Mkuu wa Uingereza Boris Johnson amesema ataheshimu utaratibu wa sheria lakini Uingereza itajiondowa kutoka Umoja wa Ulaya ( EU ) ifikapo Oktoba 31 ."""]

In [None]:
text_list_Wolof = ["""Dafa di sax , ni mu ame woon noonu fit moo taxoon ñu dàq ko , moom ak benn doomu Farãs bu daan wuyoo ci turu Daniel Cohn - Bendit , ca daara ju mag jooju , ci atum 1969 .""","""Usmaan Sonkoo ngi juddoo Cees ci atum 1974 .""","""Waaw , Isaa Sàll nekkoon na fi Njiitu ndajem diiwaanu Fatig ci njeexitalu atiy 1990 .""","""IRÃ NDAW : Komisaariya bu Ndaakaaru woolu na waaraatekatu Sentv bi .""","""Ciy ati 60 , bokkoon na ci ” Groupe de Grenoble ” kurél gu doon jéem a suqali làmmiñi réew mi mook ñoomin Asan Silla , Masàmba Sare ak Saaliyu Kànji ak it Ablaay Wàdd mi fi doonoon njiitu réewum Senegaal ."""]

In [None]:
text_list_Yorùbá = ["""Ẹgbẹ́ Ohùn Àgbáyé dúró ṣinṣin pẹ̀lú Luis Carlos , ẹbíi rẹ̀ , àti oníròyìn aládàáṣiṣẹ́ gbogbo àwọn tí ó ń mú ìjọba ṣe bí ó ti yẹ ní Venezuela .""","""Ilé - iṣẹ́ẹ Mohammed Sani Musa , Activate Technologies Limited , ni ó kó ẹ̀rọ Ìwé - pélébé Ìdìbò Alálòpẹ́ ( PVCs ) tí a lò fún ọdún - un 2019 , nígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All Progressives Congress ( APC ) fún Aṣojú Ìlà - Oòrùn Niger , ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí , Premium Times ṣe tẹ̀ ẹ́""","""Ishaku Elisha Abbo ti ẹgbẹ́ alátakò People’s Democratic Party ( PDP ) jẹ́ aṣojú tí ó ń ṣojú Ẹkùn - un Àríwá Adamawa ní Ìpínlẹ̀ Adamawa , ní ìlà - oòrùn àríwá orílẹ̀ èdèe Nàìjíríà .""","""Nínú oṣù Agẹmọ 2019 , ní ìṣojú ọlọ́pàá , Abbo ṣe àṣemáṣe pẹ̀lú òṣìṣẹ́bìnrin kan nínú ìsọ̀ ohun ìbálòpọ̀ ní olú - ìlú Nàìjíríà ní Abuja .""","""Abba Moro , tí í ṣe ọmọ ẹgbẹ́ẹ PDP , ni aṣojú fún ẹ̀ka Gúúsù Benue , àárín gbùngbùn àríwá Nàìjíríà .""","""Ní ọjọ́ 15 , oṣù Ẹrẹ́nà , ọdún - un 2014 , Moro , tí í ṣe ọ̀gá pátápátá ètò abélé , ló wà nídìí ìṣẹ̀lẹ̀ abanilọ́kànjẹ́ Ìgbanisíṣẹ́ Ẹ̀ṣọ̀ aṣọ́bodè Nàìjíríà tí àwọn ọ̀dọ́langba tí ó tó bíi ẹgbẹẹgbẹ̀rún 6 tó fẹ́ àyè iṣẹ́ ẹgbẹ̀rún 4 tí ó ṣí sílẹ̀ nínú iléeṣẹ́ Ẹ̀ṣọ̀ Aṣọ́bodè Nàìjíríà tí"""]

In [None]:
model_names = ["xlm_roberta_large_token_classifier_masakhaner", 
               "distilbert_base_token_classifier_masakhaner"]

In [None]:
xlm_roberta_text_list = [text_list_amharic, 
                         text_list_hausa, 
                         text_list_igbo , 
                         text_list_kinyarwanda, 
                         text_list_luganda, 
                         text_list_Nigerian, 
                         text_list_Pidgin, 
                         text_list_Swahilu, 
                         text_list_Wolof, 
                         text_list_Yorùbá]


In [None]:
distilbert_text_list = [text_list_hausa, 
                        text_list_igbo , 
                        text_list_kinyarwanda, 
                        text_list_luganda, 
                        text_list_Nigerian, 
                        text_list_Pidgin, 
                        text_list_Swahilu, 
                        text_list_Wolof, 
                        text_list_Yorùbá]
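
The two lists above carry no language labels, so the pipeline function further down has to recover each list's variable name from `globals()`. A minimal sketch of an alternative is shown here, assuming the samples are kept in a name-keyed dictionary instead. The names `samples_by_language`, `DISTILBERT_EXCLUDED`, and `select_samples` are hypothetical, and the placeholder strings stand in for the real `text_list_*` contents.

```python
# Hypothetical registry: language label -> list of sample sentences.
# Placeholder strings stand in for the text_list_* variables defined above.
samples_by_language = {
    "amharic": ["<amharic sentence 1>", "<amharic sentence 2>"],
    "hausa": ["<hausa sentence 1>"],
    "igbo": ["<igbo sentence 1>"],
}

# Assumption taken from the notebook: only Amharic is left out of distilbert_text_list.
DISTILBERT_EXCLUDED = frozenset({"amharic"})


def select_samples(samples, excluded=frozenset()):
    """Return (label, sentences) pairs in insertion order, skipping excluded labels."""
    return [(label, texts) for label, texts in samples.items() if label not in excluded]


xlm_roberta_samples = select_samples(samples_by_language)
distilbert_samples = select_samples(samples_by_language, DISTILBERT_EXCLUDED)

for label, texts in distilbert_samples:
    print(label, len(texts))
```

With this layout the label travels with the data, so a loop over `(label, texts)` pairs can print the label directly.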


## 4. Define Spark NLP pipeline

In [None]:
def ner_masakhaner(model_name, language_text):

    documentAssembler = DocumentAssembler()\
          .setInputCol("text")\
          .setOutputCol("document")

    sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
          .setInputCols(["document"])\
          .setOutputCol("sentence")

    tokenizer = Tokenizer()\
          .setInputCols(["sentence"])\
          .setOutputCol("token")

    ner_converter = NerConverter()\
          .setInputCols(["sentence", "token", "ner"])\
          .setOutputCol("ner_chunk")


    if model_name == 'xlm_roberta_large_token_classifier_masakhaner':
      tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx")\
          .setInputCols(["sentence",'token'])\
          .setOutputCol("ner")

    else:
      tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_base_token_classifier_masakhaner", "xx")\
          .setInputCols(["sentence",'token'])\
          .setOutputCol("ner")

    nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

    empty_data = spark.createDataFrame([[""]]).toDF("text")
    model = nlpPipeline.fit(empty_data)

    print("")
    # Print the model banner in green, then reset the terminal colour
    print("\u001b[32m*************************  MODEL NAME :  " + model_name + "   ***********************\u001b[0m", end ='\n\n')
    

    for text_name in language_text:
      # Look up the global variable name bound to this list (used only as a display label);
      # identity comparison avoids matching a different list with equal contents
      x = [ i for i, a in globals().items() if a is text_name][0]
      df = spark.createDataFrame(text_name, StringType()).toDF("text")
      print("")
      print("\u001b[31m*************************  LANGUAGE_TEXT :  " + x + "   ***********************\u001b[0m", end ='\n\n')
      
      
      # Result dataframe: one row per extracted chunk with its entity label
      # (show() prints at most the first 20 rows)
      result = model.transform(df)
      result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                           result.ner_chunk.metadata)).alias("cols")) \
            .select(F.expr("cols['0']").alias("chunk"),
                    F.expr("cols['1']['entity']").alias("ner_label"))\
            .show(truncate=False)

      # Visualization: render only the fourth sample text (index 3) of each language
      NerVisualizer().display(
          result = result.collect()[3],
          label_col = 'ner_chunk',
          document_col = 'document')
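
To compare the two models beyond reading the printed tables, it can help to tally how many chunks each one assigns to every label. The sketch below is a plain-Python illustration, not part of the pipeline: `Chunk` is a hypothetical stand-in that models only the `result` and `metadata` fields the notebook already reads, and the toy values are hand-written rather than produced by a model run.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a chunk annotation; only the two fields used here are modelled.
Chunk = namedtuple("Chunk", ["result", "metadata"])


def count_entity_labels(rows):
    """Tally entity labels over an iterable of rows, each a list of chunk annotations."""
    counts = Counter()
    for chunks in rows:
        for chunk in chunks:
            # The 'entity' metadata key is the one the notebook selects as ner_label.
            counts[chunk.metadata["entity"]] += 1
    return counts


# Hand-written toy rows, one inner list per input sentence.
toy_rows = [
    [Chunk("DsTV", {"entity": "ORG"}), Chunk("2019 / 2020", {"entity": "DATE"})],
    [Chunk("Lionel Messi", {"entity": "PER"}), Chunk("June 12", {"entity": "DATE"})],
]

label_counts = count_entity_labels(toy_rows)
print(label_counts.most_common())
```

On real output, the rows would presumably come from something like `[row.ner_chunk for row in result.select("ner_chunk").collect()]`. That relies on each collected annotation exposing `result` and `metadata` attributes, which is an assumption to verify against the installed Spark NLP version.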

In [18]:
# with "xlm_roberta_large_token_classifier_masakhaner" model
ner_masakhaner('xlm_roberta_large_token_classifier_masakhaner', xlm_roberta_text_list)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
xlm_roberta_large_token_classifier_masakhaner download started this may take some time.
Approximate size to download 1.7 GB
[OK!]

*************************  MODEL NAME :  xlm_roberta_large_token_classifier_masakhaner   ***********************


*************************  LANGUAGE_TEXT :  text_list_amharic   ***********************

+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|አህመድ ቫንዳ      |PER      |
|ከ3-10-2000 ጀምሮ|DATE     |
|በአዲስ አበባ      |LOC      |
|ሰማያዊ ፓርቲ      |ORG      |
|ዛሬ            |DATE     |
|በመኢአድ ጽህፈት ቤት |ORG      |
|የ ዓመቱ         |DATE     |
|የዚምባብዌ        |LOC      |
|ኤመርሰን ምናንጋግዋ  |PER      |
|ዶይቸ ቬለ        |ORG      |
|ሊንዳ ማዜሪሬ      |PER      |
|የአህጉሩን


*************************  LANGUAGE_TEXT :  text_list_hausa   ***********************

+-----------------------------------+---------+
|chunk                              |ner_label|
+-----------------------------------+---------+
|Muryar Amurka                      |ORG      |
|Ibrahim Abdul'aziz                 |PER      |
|Najeriya                           |LOC      |
|Kungiyar Ma'aikatan Jami'o'i       |ORG      |
|Juma’a mai zuwa                    |DATE     |
|Muaryar Amurka                     |ORG      |
|Mohammed Jaji                      |PER      |
|Majalisar Dinkin Duniya            |ORG      |
|Aliko Dangote                      |PER      |
|bankin raya Afirka                 |ORG      |
|shekara 2030                       |DATE     |
|Temitope Olatoye Sugar             |PER      |
|Lagelu                             |LOC      |
|Akinyele                           |LOC      |
|Oyo                                |LOC      |
|majalisar wakilan tarayyar Najeriya|


*************************  LANGUAGE_TEXT :  text_list_igbo   ***********************

+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|Naịjirịa                     |LOC      |
|Ike Ekweremadu               |PER      |
|otu nkeji                    |DATE     |
|Naịjirịa                     |LOC      |
|Buhari                       |PER      |
|2015                         |DATE     |
|National Bureau of Statistics|ORG      |
|NBS                          |ORG      |
|afọ 2016                     |DATE     |
|Google                       |ORG      |
|Africa                       |LOC      |
|Twitter                      |ORG      |
|Taa                          |DATE     |
|Stephen Keshi                |PER      |
|Keshi                        |PER      |
|Super Eagles                 |ORG      |
|afọ 2011                     |DATE     |
|Afrika                       |LOC      |
|2013     


*************************  LANGUAGE_TEXT :  text_list_kinyarwanda   ***********************

+-------------------------------------+---------+
|chunk                                |ner_label|
+-------------------------------------+---------+
|w’Umuryango w’Ubumwe bw’u Burayi     |ORG      |
|Rwanda                               |LOC      |
|Nicola Bellomo                       |PER      |
|u Rwanda                             |LOC      |
|Banki y’Isi                          |ORG      |
|kuwa 9 Mata                          |DATE     |
|Afurika yo-munsi y’Ubutayu bwa Sahara|LOC      |
|Twitter                              |ORG      |
|kuwa Kane                            |DATE     |
|Mateke                               |PER      |
|Rwanda                               |LOC      |
|Mateke                               |PER      |
|Ruth Nankabirwa                      |PER      |
|kuwa Gatatu                          |DATE     |
|Nteko Ishinga Amategeko             


*************************  LANGUAGE_TEXT :  text_list_luganda   ***********************

+-----------------------------------+---------+
|chunk                              |ner_label|
+-----------------------------------+---------+
|Phillip Wokorach                   |PER      |
|Justin Kimono                      |PER      |
|Adrian Kisito                      |PER      |
|Uganda                             |LOC      |
|omwaka oguwedde                    |DATE     |
|Zimbabwe                           |LOC      |
|David Mubiru                       |PER      |
|Uganda                             |LOC      |
|November 2016                      |DATE     |
|ekyemyaka ena nemyaka emirala ebiri|DATE     |
|Rakai                              |LOC      |
|Patience Baganzi                   |PER      |
|poliisi ye Katwe                   |ORG      |
|Kampala                            |LOC      |
|Mawokota                           |LOC      |
|Mawokota                           |


*************************  LANGUAGE_TEXT :  text_list_Nigerian   ***********************

+-----------------------------------------+---------+
|chunk                                    |ner_label|
+-----------------------------------------+---------+
|higni 20                                 |DATE     |
|Kevin Omondi                             |PER      |
|Shopie Anyango                           |PER      |
|higni 23                                 |DATE     |
|China                                    |LOC      |
|Ringa                                    |LOC      |
|jahigni 15                               |DATE     |
|Joseph Karanja                           |PER      |
|Noel Adhiambo                            |PER      |
|jahigni 11                               |DATE     |
|skul ma Kosele Community Christian Center|ORG      |
|od bura ma Kasipul                       |ORG      |
|dwee mokalo                              |DATE     |
|State House                        


*************************  LANGUAGE_TEXT :  text_list_Pidgin   ***********************

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|DsTV                   |ORG      |
|2019 / 2020            |DATE     |
|StarTimes              |ORG      |
|Sub - Saharan Africa   |LOC      |
|Bayern                 |ORG      |
|Karl - Heinz Rummenigge|PER      |
|weekend                |DATE     |
|Saturday               |DATE     |
|Spain                  |LOC      |
|Lionel Messi           |PER      |
|June 12                |DATE     |
|LA Lakers              |ORG      |
|Kobe Bryant            |PER      |
|Gianna                 |PER      |
|city of Calabasa       |LOC      |
|California             |LOC      |
|Sunday 26 January      |DATE     |
|Ighalo                 |PER      |
|Chinese Super League   |ORG      |
|2017                   |DATE     |
+-----------------------+---------+
only showing top 20 rows




*************************  LANGUAGE_TEXT :  text_list_Swahilu   ***********************

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|ADF                    |ORG      |
|Alhamisi               |DATE     |
|Mbau kaskazini         |LOC      |
|Beni                   |LOC      |
|Allied Democratic Force|ORG      |
|ADF                    |ORG      |
|November               |DATE     |
|Congo                  |LOC      |
|ADF                    |ORG      |
|Leon Richard Kasonga   |PER      |
|Jumatano               |DATE     |
|White House            |ORG      |
|Jumanne                |DATE     |
|Nancy Pelosi           |PER      |
|Marekani               |LOC      |
|Trump                  |PER      |
|Jumatano               |DATE     |
|Umoja wa Mataifa       |ORG      |
|Saudi Arabia           |LOC      |
|Bezos                  |PER      |
+-----------------------+---------+
only showing top 20 rows




*************************  LANGUAGE_TEXT :  text_list_Wolof   ***********************

+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Farãs             |LOC      |
|Daniel Cohn       |PER      |
|Bendit            |PER      |
|atum 1969         |DATE     |
|Usmaan Sonkoo     |PER      |
|Cees              |LOC      |
|atum 1974         |DATE     |
|Isaa Sàll         |PER      |
|Fatig             |LOC      |
|atiy 1990         |DATE     |
|IRÃ NDAW          |PER      |
|Ndaakaaru         |LOC      |
|Sentv             |ORG      |
|ati 60            |DATE     |
|Groupe de Grenoble|ORG      |
|Asan Silla        |PER      |
|Masàmba Sare      |PER      |
|Saaliyu Kànji     |PER      |
|Ablaay Wàdd       |PER      |
|Senegaal          |LOC      |
+------------------+---------+




*************************  LANGUAGE_TEXT :  text_list_Yorùbá   ***********************

+-----------------------------------------+---------+
|chunk                                    |ner_label|
+-----------------------------------------+---------+
|Ohùn Àgbáyé                              |ORG      |
|Luis Carlos                              |PER      |
|Venezuela                                |LOC      |
|Mohammed Sani Musa                       |PER      |
|Activate Technologies Limited            |ORG      |
|ọdún - un 2019                           |DATE     |
|All Progressives Congress                |ORG      |
|APC                                      |ORG      |
|Ìlà - Oòrùn Niger                        |LOC      |
|Premium Times                            |ORG      |
|Ishaku Elisha Abbo                       |PER      |
|People’s Democratic Party                |ORG      |
|PDP                                      |ORG      |
|Àríwá Adamawa      

In [19]:
# with "distilbert_base_token_classifier_masakhaner" model
ner_masakhaner('distilbert_base_token_classifier_masakhaner', distilbert_text_list)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
distilbert_base_token_classifier_masakhaner download started this may take some time.
Approximate size to download 482.3 MB
[OK!]

*************************  MODEL NAME :  distilbert_base_token_classifier_masakhaner   ***********************


*************************  LANGUAGE_TEXT :  text_list_hausa   ***********************

+-----------------------------------+---------+
|chunk                              |ner_label|
+-----------------------------------+---------+
|Muryar Amurka                      |ORG      |
|Ibrahim Abdul'aziz                 |PER      |
|Najeriya                           |LOC      |
|Kungiyar Ma'aikatan Jami'o'i       |ORG      |
|Muaryar Amurka                     |ORG      |
|Mohammed Jaji                      |PER      |
|Majalisar Dinkin Duniya            |ORG      |
|Aliko Dangote                      |PER      |
|bankin raya Af


*************************  LANGUAGE_TEXT :  text_list_igbo   ***********************

+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|Naịjirịa                     |LOC      |
|Ike Ekweremadu               |PER      |
|Sịnatị                       |PER      |
|otu nkeji                    |DATE     |
|Naịjirịa                     |LOC      |
|Buhari                       |PER      |
|2015                         |DATE     |
|National Bureau of Statistics|ORG      |
|NBS                          |ORG      |
|afọ                          |DATE     |
|2016                         |DATE     |
|Google                       |ORG      |
|Africa                       |LOC      |
|Twitter                      |ORG      |
|Taa                          |DATE     |
|Stephen Keshi                |PER      |
|Keshi                        |PER      |
|Super Eagles                 |ORG      |
|2011   


*************************  LANGUAGE_TEXT :  text_list_kinyarwanda   ***********************

+-------------------------------------+---------+
|chunk                                |ner_label|
+-------------------------------------+---------+
|Rwanda                               |LOC      |
|Nicola Bellomo                       |PER      |
|u Rwanda                             |LOC      |
|kuwa 9 Mata                          |DATE     |
|Afurika yo-munsi y’Ubutayu bwa Sahara|LOC      |
|Twitter                              |ORG      |
|uyu kuwa Kane                        |DATE     |
|Mateke                               |PER      |
|Rwanda                               |LOC      |
|Mateke                               |PER      |
|Ruth Nankabirwa                      |PER      |
|uyu kuwa Gatatu                      |DATE     |
|Nteko Ishinga Amategeko              |ORG      |
|Uganda                               |LOC      |
|Rwanda                               |LOC   


*************************  LANGUAGE_TEXT :  text_list_luganda   ***********************

+-----------------------------------+---------+
|chunk                              |ner_label|
+-----------------------------------+---------+
|Phillip Wokorach                   |PER      |
|Justin Kimono                      |PER      |
|Adrian Kisito                      |PER      |
|Uganda                             |LOC      |
|omwaka oguwedde                    |DATE     |
|Zimbabwe                           |LOC      |
|David Mubiru                       |PER      |
|Uganda                             |LOC      |
|November 2016                      |DATE     |
|ekyemyaka ena nemyaka emirala ebiri|DATE     |
|Rakai                              |LOC      |
|Patience Baganzi                   |PER      |
|poliisi ye Katwe                   |ORG      |
|Kampala                            |LOC      |
|Mawokota                           |LOC      |
|Mawokota                           |


*************************  LANGUAGE_TEXT :  text_list_Nigerian   ***********************

+-----------------------------------------+---------+
|chunk                                    |ner_label|
+-----------------------------------------+---------+
|higni 20                                 |DATE     |
|Kevin Omondi                             |PER      |
|Shopie Anyango                           |PER      |
|higni 23                                 |DATE     |
|China                                    |LOC      |
|Ringa                                    |LOC      |
|jahigni 15                               |DATE     |
|Joseph Karanja                           |PER      |
|Noel Adhiambo                            |PER      |
|jahigni 11                               |DATE     |
|skul ma Kosele Community Christian Center|ORG      |
|od bura ma Kasipul                       |ORG      |
|dwee mokalo                              |DATE     |
|State House                        


*************************  LANGUAGE_TEXT :  text_list_Pidgin   ***********************

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|DsTV                   |ORG      |
|2019 / 2020            |DATE     |
|StarTimes              |ORG      |
|Sub - Saharan Africa   |LOC      |
|Bayern                 |ORG      |
|Karl - Heinz Rummenigge|PER      |
|weekend                |DATE     |
|Saturday               |DATE     |
|Spain                  |LOC      |
|Lionel Messi           |PER      |
|June 12                |DATE     |
|LA Lakers              |ORG      |
|Kobe Bryant            |PER      |
|Gianna                 |PER      |
|city of Calabasa       |LOC      |
|California             |LOC      |
|Sunday 26 January      |DATE     |
|Ighalo                 |PER      |
|Chinese Super League   |ORG      |
|2017                   |DATE     |
+-----------------------+---------+
only showing top 20 rows




*************************  LANGUAGE_TEXT :  text_list_Swahilu   ***********************

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|ADF                    |ORG      |
|Alhamisi               |DATE     |
|Mbau kaskazini         |LOC      |
|Beni                   |LOC      |
|Allied Democratic Force|ORG      |
|ADF                    |ORG      |
|November               |DATE     |
|Congo                  |LOC      |
|ADF                    |ORG      |
|Leon Richard Kasonga   |PER      |
|Jumatano               |DATE     |
|White House            |ORG      |
|Jumanne                |DATE     |
|Nancy Pelosi           |PER      |
|Marekani               |LOC      |
|Trump                  |PER      |
|Jumatano               |DATE     |
|Umoja wa Mataifa       |ORG      |
|Saudi Arabia           |LOC      |
|Bezos                  |PER      |
+-----------------------+---------+
only showing top 20 rows




*************************  LANGUAGE_TEXT :  text_list_Wolof   ***********************

+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Farãs             |LOC      |
|Daniel Cohn       |PER      |
|Bendit            |PER      |
|atum 1969         |DATE     |
|Usmaan Sonkoo     |PER      |
|Cees              |LOC      |
|atum 1974         |DATE     |
|Isaa Sàll         |PER      |
|Fatig             |LOC      |
|atiy 1990         |DATE     |
|IRÃ NDAW          |ORG      |
|Ndaakaaru         |LOC      |
|Sentv             |ORG      |
|ati 60            |DATE     |
|Groupe de Grenoble|ORG      |
|Asan Silla        |PER      |
|Masàmba Sare      |PER      |
|Saaliyu Kànji     |PER      |
|Ablaay Wàdd       |PER      |
|Senegaal          |LOC      |
+------------------+---------+




*************************  LANGUAGE_TEXT :  text_list_Yorùbá   ***********************

+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|Ohùn Àgbáyé                  |ORG      |
|Luis Carlos                  |PER      |
|Venezuela                    |LOC      |
|Mohammed Sani Musa           |PER      |
|Activate Technologies Limited|ORG      |
|Ìdìbò                        |ORG      |
|ọdún - un 2019               |DATE     |
|All Progressives Congress    |ORG      |
|APC                          |ORG      |
|Aṣojú Ìlà - Oòrùn Niger      |LOC      |
|Premium Times                |ORG      |
|Ishaku Elisha Abbo           |PER      |
|People’s Democratic Party    |ORG      |
|PDP                          |ORG      |
|Àríwá Adamawa                |PER      |
|Adamawa                      |LOC      |
|Nàìjíríà                     |LOC      |
|2019                         |DA