### Polyglot

Polyglot is a natural language pipeline that supports massive multilingual applications.


Features

- Tokenization (165 Languages)
- Language detection (196 Languages)
- Named Entity Recognition (40 Languages)
- Part of Speech Tagging (16 Languages)
- Sentiment Analysis (136 Languages)
- Word Embeddings (137 Languages)
- Morphological analysis (135 Languages)
- Transliteration (69 Languages)



### Installation / Dependencies

polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions, you can install these packages by running the following commands:



In [2]:
!sudo apt-get install python-numpy libicu-dev
!pip install polyglot
!pip3 install pyicu
!pip3 install pycld2
!pip3 install morfessor

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-numpy is already the newest version (1:1.13.3-2ubuntu1).
python-numpy set to manually installed.
libicu-dev is already the newest version (60.2-3ubuntu3.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting polyglot
  Downloading https://files.pythonhosted.org/packages/e7/98/e24e2489114c5112b083714277204d92d372f5bbe00d5507acf40370edb9/polyglot-16.7.4.tar.gz (126kB)
     |████████████████████████████████| 133kB 2.8MB/s 
Building wheels for collected packages: polyglot
  Building wheel for polyglot (setup.py) ... done
  Created wheel for polyglot: filename=polyglot-16.7.4-py2.py3-none-any.whl size=52557 sha256=4c2a87a95a8cc9e43b7a0131cbbb370e7ec547cae2b20a46fab7a72d25b84ff7
  Stored in directory: /

### Language Detection

In [3]:
from polyglot.detect import Detector
arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
detector = Detector(arabic_text)
print(detector.language)

name: Arabic      code: ar       confidence:  99.0 read bytes:   907


In [44]:
en_text = "He is a student "
fr_text = "Il est un étudiant"
ru_text = "Он студент"
hn_text = "अंदर आ जाओ।"
detect_en = Detector(en_text)
detect_fr = Detector(fr_text)
detect_ru = Detector(ru_text)
detect_hn = Detector(hn_text)
print(detect_en.language)

print(detect_fr.language)
print(detect_ru.language)
print(detect_hn.language)

Detector is not able to detect the language reliably.
Detector is not able to detect the language reliably.


name: English     code: en       confidence:  94.0 read bytes:   704
name: French      code: fr       confidence:  95.0 read bytes:   870
name: Serbian     code: sr       confidence:  95.0 read bytes:   614
name: Hindi       code: hi       confidence:  96.0 read bytes:   530


#### Mixed Text

If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model's confidence level:

In [4]:
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""


In [5]:
for language in Detector(mixed_text).languages:
  print(language)

name: English     code: en       confidence:  87.0 read bytes:  1154
name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0
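
The `un` row in the output above looks like an unknown/unused slot rather than a detected language. The sketch below shows one way to drop such rows. It assumes each item exposes `code` and `confidence`, as the printed rows suggest, and it uses stand-in objects instead of real polyglot results so that it runs on its own. The helper name `known_languages` and the `min_confidence` threshold are hypothetical.

```python
from types import SimpleNamespace


def known_languages(languages, min_confidence=1.0):
    """Return (code, confidence) pairs, skipping 'un' (unknown) entries."""
    return [
        (lang.code, lang.confidence)
        for lang in languages
        if lang.code != "un" and lang.confidence >= min_confidence
    ]


# Stand-ins mirroring the three rows printed above. With polyglot installed,
# Detector(mixed_text).languages would be passed instead.
rows = [
    SimpleNamespace(code="en", confidence=87.0),
    SimpleNamespace(code="zh_Hant", confidence=5.0),
    SimpleNamespace(code="un", confidence=0.0),
]

print(known_languages(rows))
```

Raising `min_confidence` would also discard weak secondary guesses such as the 5.0 entry.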


Sometimes there is not enough text to make a decision, such as when detecting a language from a single word. This forces the detector to switch to a best-effort strategy: a warning is emitted and the attribute reliable is set to False.



In [6]:
detector = Detector("pizza")
print(detector)

Detector is not able to detect the language reliably.


Prediction is reliable: False
Language 1: name: English     code: en       confidence:  85.0 read bytes:  1194
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0
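
Since a best-effort guess can be wrong, calling code may want to check the reliable attribute before trusting the result. The following is a minimal sketch of that check. It assumes the detector exposes `reliable` and `language.code` as shown in the outputs above, and it uses stand-in objects so that it runs without polyglot. The helper name `language_or_default` and the fallback code `"und"` are my own choices, not part of the library.

```python
from types import SimpleNamespace


def language_or_default(detector, default="und"):
    """Return the detected language code only when the detector trusts it."""
    if detector.reliable:
        return detector.language.code
    return default


# Stand-in for the "pizza" case above: a best-effort English guess that is
# flagged as unreliable.
pizza = SimpleNamespace(
    reliable=False,
    language=SimpleNamespace(code="en", confidence=85.0),
)
# Stand-in for a confident detection such as the Arabic paragraph.
arabic = SimpleNamespace(
    reliable=True,
    language=SimpleNamespace(code="ar", confidence=99.0),
)

print(language_or_default(pizza))
print(language_or_default(arabic))
```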


#### Supported Languages



cld2 can detect up to 165 languages.

In [7]:
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))

  1. Abkhazian                  2. Afar                       3. Afrikaans                
  4. Akan                       5. Albanian                   6. Amharic                  
  7. Arabic                     8. Armenian                   9. Assamese                 
 10. Aymara                    11. Azerbaijani               12. Bashkir                  
 13. Basque                    14. Belarusian                15. Bengali                  
 16. Bihari                    17. Bislama                   18. Bosnian                  
 19. Breton                    20. Bulgarian                 21. Burmese                  
 22. Catalan                   23. Cebuano                   24. Cherokee                 
 25. Nyanja                    26. Corsican                  27. Croatian                 
 28. Croatian                  29. Czech                     30. Chinese                  
 31. Chinese                   32. Chinese                   33. Chinese                  

### Tokenization

Tokenization is the process of identifying the boundaries of words and sentences in text. We can identify sentence boundaries first and then tokenize each sentence into the words that compose it. Alternatively, we can tokenize words first and then segment the token sequence into sentences. Tokenization in polyglot relies on the Unicode Text Segmentation algorithm as implemented by the ICU Project.

In [8]:
# Load packages
import polyglot
from polyglot.text import Text,Word

In [9]:
# Word Tokens
docx = Text(u"He likes reading and painting")

In [10]:
docx.words

WordList(['He', 'likes', 'reading', 'and', 'painting'])

In [11]:
docx2 = Text(u"He exclaimed, 'what're you doing? Reading?'.")

In [12]:
docx2.words

WordList(['He', 'exclaimed', ',', "'", "what're", 'you', 'doing', '?', 'Reading', '?', "'", '.'])

In [13]:
# Sentence tokens
docx3 = Text(u"He likes reading and painting.He exclaimed, 'what're you doing? Reading?'.")

In [14]:
docx3.sentences

[Sentence("He likes reading and painting.He exclaimed, 'what're you doing?"),
 Sentence("Reading?'.")]
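
To make the "sentences first, then words" order concrete, here is a deliberately naive sketch using regular expressions. It is not the ICU algorithm that polyglot uses and will mis-handle many real cases (abbreviations, missing spaces after a period, non-Latin scripts). The function names `split_sentences` and `split_words` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Words (allowing internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


text = "He likes reading. She likes painting!"

# Stage 1: sentence boundaries. Stage 2: word boundaries inside each sentence.
tokens = [split_words(s) for s in split_sentences(text)]
print(tokens)
```

Like the polyglot output above, this rule does not split "painting.He", because no whitespace follows the period.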

### Part of Speech Tagging

The part-of-speech tagging task aims to assign every word/token in plain text a category that identifies the syntactic function of that word occurrence.

Polyglot recognizes 17 parts of speech. This set is called the universal part-of-speech tag set:



- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other


In [15]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))

  1. Italian                    2. French                     3. Spanish; Castilian       
  4. Bulgarian                  5. Slovene                    6. Irish                    
  7. Finnish                    8. Dutch                      9. Swedish                  
 10. Danish                    11. Portuguese                12. English                  
 13. German                    14. Indonesian                15. Czech                    
 16. Hungarian                


#### Download Necessary Models

In [18]:
!polyglot download embeddings2.en pos2.en

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /root/polyglot_data...
[polyglot_data] Downloading package pos2.en to /root/polyglot_data...


In [19]:
docx.pos_tags


[('He', 'PRON'),
 ('likes', 'VERB'),
 ('reading', 'VERB'),
 ('and', 'CONJ'),
 ('painting', 'NOUN')]

In [20]:
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')
text.pos_tags

[('We', 'PRON'),
 ('will', 'AUX'),
 ('meet', 'VERB'),
 ('at', 'ADP'),
 ('eight', 'NUM'),
 ("o'clock", 'NOUN'),
 ('on', 'ADP'),
 ('Thursday', 'PROPN'),
 ('morning', 'NOUN'),
 ('.', 'PUNCT')]
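
Because pos_tags is a plain list of (word, tag) pairs, it can be post-processed with ordinary Python. The sketch below copies the output above into a literal list, so it runs without polyglot, then counts tags and groups words by tag. The variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Copied from the text.pos_tags output above.
pos_tags = [
    ("We", "PRON"), ("will", "AUX"), ("meet", "VERB"), ("at", "ADP"),
    ("eight", "NUM"), ("o'clock", "NOUN"), ("on", "ADP"),
    ("Thursday", "PROPN"), ("morning", "NOUN"), (".", "PUNCT"),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in pos_tags)

# Which words received each tag, in sentence order.
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

print(tag_counts["NOUN"], words_by_tag["NOUN"])
print(tag_counts["ADP"], words_by_tag["ADP"])
```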

### Named Entity Extraction

The named entity extraction task aims to extract phrases from plain text that correspond to entities. Polyglot recognizes 3 categories of entities:

- Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions …
- Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, …
- Persons (Tag: I-PER): politicians, scientists, artists, athletes …


#### Languages Coverage

The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages.

In [22]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))

  1. Italian                    2. Hindi                      3. French                   
  4. Spanish; Castilian         5. Vietnamese                 6. Arabic                   
  7. Bulgarian                  8. Norwegian                  9. Estonian                 
 10. Japanese                  11. Greek, Modern             12. Slovene                  
 13. Korean                    14. Serbian                   15. Finnish                  
 16. Catalan; Valencian        17. Croatian                  18. Dutch                    
 19. Swedish                   20. Tagalog                   21. Danish                   
 22. Latvian                   23. Ukrainian                 24. Romanian, Moldavian, ... 
 25. Persian                   26. Slovak                    27. Portuguese               
 28. English                   29. Malay                     30. Polish                   
 31. German                    32. Indonesian                33. Chinese                  

#### Download Necessary Models

In [24]:
!polyglot download embeddings2.en ner2.en

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /root/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package ner2.en to /root/polyglot_data...


In [25]:
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)

# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')

text.entities

[I-ORG(['Israeli']), I-PER(['Benjamin', 'Netanyahu']), I-LOC(['Iran'])]

Or, we can query entities per sentence:

In [27]:
for sent in text.sentences:
  print(sent, "\n")
  for entity in sent.entities:
    print(entity.tag, entity)

The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world". 

I-ORG ['Israeli']
I-PER ['Benjamin', 'Netanyahu']
I-LOC ['Iran']


Inspecting the second entity, Benjamin Netanyahu, more closely, we can locate its position within the sentence using its start and end attributes.

In [28]:
benjamin = sent.entities[1]
sent.words[benjamin.start: benjamin.end]

WordList(['Benjamin', 'Netanyahu'])
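
A common next step is to group the extracted entities by their tag. The sketch below assumes, based on the outputs above, that each entity behaves like a list of words with a `tag` attribute. `FakeChunk` is a stand-in I define so the example runs without polyglot, and `entities_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict


class FakeChunk(list):
    """Stand-in for a polyglot entity: a list of words plus a tag."""

    def __init__(self, words, tag):
        super().__init__(words)
        self.tag = tag


def entities_by_tag(entities):
    """Map each tag to the list of entity phrases carrying that tag."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.tag].append(" ".join(entity))
    return dict(grouped)


# Mirrors the text.entities output above; with polyglot installed,
# text.entities would be passed instead.
entities = [
    FakeChunk(["Israeli"], "I-ORG"),
    FakeChunk(["Benjamin", "Netanyahu"], "I-PER"),
    FakeChunk(["Iran"], "I-LOC"),
]

print(entities_by_tag(entities))
```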

### Morphological Analysis
 

#### Morphology
+ A morpheme is the smallest grammatical unit in a language.
+ A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.


Morphemes are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. They are important in the automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.



#### Languages Coverage

Using polyglot vocabulary dictionaries, morfessor models are trained on the 50,000 most frequent words of each language.

In [30]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

  1. Kapampangan                2. Italian                    3. Upper Sorbian            
  4. Sakha                      5. Hindi                      6. French                   
  7. Spanish; Castilian         8. Vietnamese                 9. Arabic                   
 10. Macedonian                11. Pashto, Pushto            12. Bosnian-Croatian-Serbian 
 13. Egyptian Arabic           14. Norwegian Nynorsk         15. Sundanese                
 16. Sicilian                  17. Azerbaijani               18. Bulgarian                
 19. Yoruba                    20. Tajik                     21. Georgian                 
 22. Tatar                     23. Galician                  24. Malagasy                 
 25. Uighur, Uyghur            26. Amharic                   27. Venetian                 
 28. Yiddish                   29. Norwegian                 30. Alemannic                
 31. Estonian                  32. West Flemish              33. Divehi; Dhivehi; Mald... 

#### Download Necessary Models

In [31]:
!polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /root/polyglot_data...
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /root/polyglot_data...


In [32]:
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w = Word(w, language="en")
  print("{:<20}{}".format(w, w.morphemes))


preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']
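
The segmentations above can be treated as plain lists. The sketch below copies them into a dictionary, so it runs without polyglot, checks that the morphemes of each word concatenate back to that word, and then finds morphemes shared by more than one word. The variable names are illustrative only.

```python
from collections import defaultdict

# Copied from the w.morphemes output above.
segmentations = {
    "preprocessing": ["pre", "process", "ing"],
    "processor": ["process", "or"],
    "invaluable": ["in", "valuable"],
    "thankful": ["thank", "ful"],
    "crossed": ["cross", "ed"],
}

# In these examples the segmentation only splits the word, so joining the
# pieces restores the original string.
for word, morphemes in segmentations.items():
    assert "".join(morphemes) == word

# Index words by morpheme to see which words share a unit.
words_by_morpheme = defaultdict(list)
for word, morphemes in segmentations.items():
    for morpheme in morphemes:
        words_by_morpheme[morpheme].append(word)

shared = {m: ws for m, ws in words_by_morpheme.items() if len(ws) > 1}
print(shared)
```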


If the text is not tokenized properly, morphological analysis can offer a smart way of splitting the text into its original units. Here is an example:

In [33]:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes

WordList(['We', 'will', 'meet', 'to', 'day', '.'])

### Transliteration

Transliteration is the conversion of a text from one script to another. For instance, a Latin transliteration of the Greek phrase “Ελληνική Δημοκρατία”, usually translated as ‘Hellenic Republic’, is “Ellēnikḗ Dēmokratía”.

In [35]:
from polyglot.transliteration import Transliterator

#### Languages Coverage

In [36]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))

  1. Italian                    2. Hindi                      3. French                   
  4. Spanish; Castilian         5. Vietnamese                 6. Arabic                   
  7. Macedonian                 8. Bosnian-Croatian-Serbian   9. Norwegian Nynorsk        
 10. Azerbaijani               11. Bulgarian                 12. Georgian                 
 13. Galician                  14. Amharic                   15. Yiddish                  
 16. Norwegian                 17. Estonian                  18. Japanese                 
 19. Haitian; Haitian Creole   20. Belarusian                21. Greek, Modern            
 22. Welsh                     23. Albanian                  24. Marathi (Marāṭhī)        
 25. Armenian                  26. Slovene                   27. Korean                   
 28. Irish                     29. Bengali                   30. Serbian                  
 31. Finnish                   32. Catalan; Valencian        33. Croatian                 

#### Downloading Necessary Models

In [37]:
!polyglot download embeddings2.en transliteration2.ar

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /root/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package transliteration2.ar to
[polyglot_data]     /root/polyglot_data...


In [48]:
blob = """We will meet at eight on Thursday morning."""
text = Text(blob)

In [49]:
for x in text.transliterate('ar'):
  print(x)

وي
ويل
ميت
ات
ييايت
ون
ثورسداي
مورنينغ



In [46]:
!polyglot download transliteration2.hi

[polyglot_data] Downloading package transliteration2.hi to
[polyglot_data]     /root/polyglot_data...


In [50]:
for x in text.transliterate('hi'):
  print(x)

वे
विलल
मीत
ात
ेिगहत
ोन
थूरसदे
मोरनिंग
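
In the transliteration outputs above, the final period appears to come back as an empty string. The sketch below keeps the original token in that case. It assumes a transliterator object with a `transliterate(word)` method, such as the `Transliterator` imported earlier. `UpperCaseStub` is a toy stand-in that performs no real transliteration and exists only so the example runs without polyglot or its models. `transliterate_words` is a hypothetical helper name.

```python
def transliterate_words(words, transliterator):
    """Transliterate each word, keeping the original when the result is empty."""
    result = []
    for word in words:
        converted = transliterator.transliterate(word)
        result.append(converted if converted else word)
    return result


class UpperCaseStub:
    """Toy stand-in: upper-cases letters and returns '' for anything else."""

    def transliterate(self, word):
        return word.upper() if word.isalpha() else ""


# With polyglot and the models installed, the stub could be swapped for
# something like Transliterator(source_lang="en", target_lang="ar").
print(transliterate_words(["We", "will", "meet", "."], UpperCaseStub()))
```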


### Sentiment Analysis

Polyglot has polarity lexicons for 136 languages. Word polarity takes one of three values: +1 for positive words, -1 for negative words, and 0 for neutral words.

#### Languages Coverage

In [52]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("sentiment2", 3))

  1. Kapampangan                2. Italian                    3. Upper Sorbian            
  4. Sakha                      5. Hindi                      6. French                   
  7. Spanish; Castilian         8. Vietnamese                 9. Arabic                   
 10. Macedonian                11. Pashto, Pushto            12. Bosnian-Croatian-Serbian 
 13. Egyptian Arabic           14. Norwegian Nynorsk         15. Sundanese                
 16. Sicilian                  17. Azerbaijani               18. Bulgarian                
 19. Yoruba                    20. Tajik                     21. Georgian                 
 22. Tatar                     23. Galician                  24. Malagasy                 
 25. Uighur, Uyghur            26. Amharic                   27. Venetian                 
 28. Yiddish                   29. Norwegian                 30. Alemannic                
 31. Estonian                  32. West Flemish              33. Divehi; Dhivehi; Mald... 

#### Download Package

In [55]:
!polyglot download sentiment2.en

[polyglot_data] Downloading package sentiment2.en to
[polyglot_data]     /root/polyglot_data...


#### Polarity

To query the polarity of a word, we can read its polarity attribute:

In [56]:
text = Text("The movie was really good.")

In [57]:
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
The              0
movie            0
was              0
really           0
good             1
.                0


In [59]:
docx = Text(u"He hates reading and playing")
docy = Text(u"He likes reading and painting")

In [60]:
docx.polarity

-1.0

In [61]:
docy.polarity

1.0
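
One way to read the -1.0 and 1.0 scores above is as the average of the non-neutral word polarities. The sketch below implements that rule with a tiny hand-made lexicon. The averaging rule is an assumption consistent with these outputs rather than something this notebook confirms, and `toy_lexicon` and `text_polarity` are hypothetical names, not polyglot's data or API.

```python
def text_polarity(words, lexicon):
    """Mean polarity over words whose lexicon score is non-zero."""
    scores = [lexicon.get(word.lower(), 0) for word in words]
    non_neutral = [score for score in scores if score != 0]
    if not non_neutral:
        return 0.0
    return sum(non_neutral) / len(non_neutral)


# Hand-made lexicon for illustration only: +1 positive, -1 negative.
toy_lexicon = {"good": 1, "likes": 1, "hates": -1}

print(text_polarity("He hates reading and playing".split(), toy_lexicon))
print(text_polarity("He likes reading and painting".split(), toy_lexicon))
```

Under this rule, neutral words do not dilute the score, which is why a single polar word in a sentence is enough to produce ±1.0.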

In [62]:
blob = ("Barack Obama gave a fantastic speech last night. "
        "Reports indicate he will move next to New Hampshire.")
text = Text(blob)

In [67]:
first_sentence = text.sentences[0]
print(first_sentence)
print(first_sentence.polarity)
first_entity = first_sentence.entities[0]
print(first_entity)
print(first_entity.positive_sentiment)

Barack Obama gave a fantastic speech last night.
1.0
['Barack', 'Obama']
0.9444444444444444
