<img align="left" src="./files/CC_BY.png"><br />

This notebook was created by Gregor Bös and Maximilian Noichl for the 2025 GAP.12 satellite workshop 'Data-Driven Methods for Philosophy'.

It is based on notebooks created by William Mattingly for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/), with adaptations by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

For questions/comments/improvements, email mail@gregorboes.com, m.noichl@uu.nl, zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
____

# MultiLingual NER

This is a notebook on **Named Entity Recognition**. 

**Chapter I: NER, encoding, and multilingual corpora**
**Chapter II: Introduction to spaCy**
**Chapter III: Word Embeddings and Entity Rulers**

**Use case:** For Learning (Detailed explanation, not focused on immediate research application)

**Difficulty:** Intermediate

**Knowledge Required:** 

* Python basics

**Knowledge Recommended:**

* Natural Language Processing

**Data Format**: .txt

**Libraries Used**: spaCy

___

Motivating NER for philosophy: corpus-based methods are good for quantifying historical claims, structuring large amounts of text, and similar tasks.

# Introduction to Named Entity Recognition

## Named Entity Recognition
**Entities** are words in a text that correspond to a specific type of data. For example, we may find the following types of entities in a text.
* numerical, such as cardinal numbers; 
* temporal, such as dates; 
* nominal, such as names of people and places;
* political, such as geopolitical entities (GPE). 

Named entity recognition, or NER, is the process by which a system takes an input of a text and outputs the identification of entities.

### A simple example
Let's use the following sentence as an example.

*Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can't play any longer.*

First, there is "Martha", a person's name. Different NER models may give the label of PERSON or PER to it.

Second, there is "Spain", a country name. It is a GPE, or Geopolitical Entity.

Finally, there is "05 June 2022", a date. It is a DATE entity. 

In this series, we are going to use the spaCy library to do NER. Here is a preview of how spaCy identifies entities in a text string.

In [None]:
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Install your specific packages
packages = [
    'beautifulsoup4==4.12.2',
    'click==8.1.3',
    'gensim==4.3.1',
    'ipympl==0.9.3',
    'jupyter-ai==2.19.1',
    'jupyter-ai-magics==2.19.0',
    'jupyterlab-git==0.50.0',
    'matplotlib==3.8.4',
    'numpy>=1.16',
    'nltk==3.9.1',
    'openai==1.51.0',
    'pandas>=2.0.3',
    'pillow==10.3.0',
    'pyarrow==14.0.1',
    'pyldavis==3.4.1',
    'pytesseract==0.3.10',
    'regex==2023.6.3',
    'requests==2.32.3',
    'scikit-learn==1.5.1',
    'scipy==1.11.1',
    'seaborn==0.12.2',
    'spacy==3.5.4',
    'urllib3==2.2.2',
    'vadersentiment==3.3.2',
    'wordcloud==1.9.2',
    'zipp==3.19.2'
]

for package in packages:
    install(package)

# Additional setup for specific packages that need extra data/models
print("Setting up additional package data...")

# NLTK data downloads
try:
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('vader_lexicon', quiet=True)
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
    print("✓ NLTK data downloaded")
except Exception as e:
    print(f"⚠ NLTK setup issue: {e}")

print("Package installation and setup complete!")

In [None]:
### for running spacy displacy later
from IPython.display import display
sys.modules['IPython.core.display'] = sys.modules['IPython.display']

In [None]:
# download the small English NLP model from spacy
!python3 -m spacy download en_core_web_sm # for English NER

In [None]:
# import spacy
import spacy

In [None]:
# load the small English model
nlp = spacy.load("en_core_web_sm")

# Create a doc object out of the text string
sentence = """Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 
or until she can't play any longer."""
doc = nlp(sentence)

# Get the entities from the doc
for ent in doc.ents:
    print(ent.text, ent.label_)

## Natural Language Processing

Natural Language Processing (NLP) is the process by which a researcher uses a computer system to parse human language and extract important information from texts.

How do we extract information from texts? We do it through a series of pipelines that perform some operations on the data at hand.

<center><img src='./files/NER_NLP_pipeline.png' width=700></center>

Named entity recognition is a branch of natural language processing. From the graph, you may notice that named entity recognition comes later in NLP. This is because it needs to receive a tokenized text and, in some languages, it needs to understand a word's part-of-speech (POS) to perform well.

### Tokenizer
The job of a tokenizer is to break a text into individual tokens. Tokens are items in a text that have some linguistic meaning. They can be words, such as "Martha", but they can also be punctuation marks, such as "," in the appositive phrase ", a senior,". Likewise, "n't" in the contraction "can't" would also be recognized as a token, since "n't" in English corresponds to the word "not".

In [None]:
# Get the tokens in the doc object we created
for token in doc:
    print(token.text)

### POS tagger
A common pipeline after a tokenizer is a POS tagger whose job is to identify the parts-of-speech, or POS, in the text. Let us consider an example sentence:

The boy took the ball to the store.

The nominative (subject), "boy", comes first in the sentence, followed by the verb, "took", then followed by the accusative (object), "ball", and finally the dative (indirect object), "store". The words "the" and "to" also contain vital information. "The" occurs twice and tells the reader that it's not just any ball, it's the ball; likewise, it's not just a store, but the store. The period too tells us something important. This is a statement, not a question. 

In [None]:
# Import the visualizer in spaCy
from spacy import displacy

# Visualize the POS tags and syntactic dependencies using the displacy.render() function
displacy.render(doc, style="dep")

In [None]:
# Get all pos tags
dir(spacy.parts_of_speech)

In [None]:
# Get the meaning of a certain tag
spacy.explain('PROPN')

# Text Encoding

Before we tokenize a text, we have to know what encoding it uses. 

When we say "plain text", we are actually being really sloppy. There is no such thing as "plain text"! It does not make sense to have a text without knowing what encoding it uses. 

Have you ever received an email in which you find the text is unintelligible because there are random question marks in it? Did you wonder why?

## A little bit of history about text encoding

### ASCII (American Standard Code for Information Interchange)
There was a time when the only characters that mattered were good old unaccented English letters. ASCII is a code that represents every such character with a number between 32 and 127 (codes below 32 are called unprintable and are reserved for control characters, e.g. the backspace). For example, the ASCII code of the letter A is 65. Computers use a binary system, so the number 65 is stored as a pattern of bits. ASCII itself needs only 7 bits, but in practice each character is stored in an 8-bit byte.

A $\rightarrow$ 0100 0001

Eight bits allow up to $2^{8}=256$ values, and ASCII uses only 128 of them (the numbers 0-127). That means we can use the numbers 128 to 255 to represent other characters! English, of course, is not the only language that matters. Therefore, people speaking different languages chose to use the numbers 128 to 255 for the characters in their own language. This means that two different characters from two different languages may be represented by the same number in their respective encoding standards. This is no good, because when Americans would send their résumés to Israel they would arrive as rגsumגs.
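The résumé mix-up can be reproduced in a few lines. This is only a sketch: the code pages `cp437` (US) and `cp862` (Hebrew) are my own illustrative choice, since the text does not say which encodings were involved.

```python
# Encode the word with one code page, then decode the same bytes with another.
# cp437 and cp862 are assumed here purely for illustration.
original = "résumés"
raw_bytes = original.encode("cp437")   # "é" becomes the single byte 0x82
garbled = raw_bytes.decode("cp862")    # the same byte means a Hebrew letter here

print(raw_bytes)
print(garbled)
```

The bytes never change between the two steps; only the table used to interpret them does.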


### UTF-8
Can't we have a single character set that includes every reasonable writing system on the planet? Yes we can! 
Here comes the brilliant idea of Unicode and UTF-8. Unicode assigns a number to every character, and UTF-8 is the most common way of storing those numbers as bytes. UTF stands for Unicode Transformation Format. The 8 refers to the 8-bit units it works with: a character takes between one and four bytes.

Every letter in every alphabet is assigned a number written like this: U+0041. It is called a *code point*. The U+ means "Unicode" and the digits are hexadecimal (from 0 to F). In Python, code points up to U+FFFF can be written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form. The English letter A is assigned the number U+0041.

UTF-8 was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII.
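A quick way to see this backward compatibility, and the variable length of UTF-8, is to encode a few characters and count the bytes. The sample characters below are my own choice, not taken from the workshop data.

```python
# Characters chosen for illustration: ASCII letter, Latin-1 letter,
# euro sign, and a musical symbol outside the Basic Multilingual Plane.
samples = ["A", "ø", "€", "\U0001D11E"]

for char in samples:
    encoded = char.encode("utf-8")
    print(char, len(encoded), encoded)

# For ASCII characters, the UTF-8 bytes are identical to the ASCII bytes.
ascii_is_unchanged = "A".encode("utf-8") == "A".encode("ascii")
print(ascii_is_unchanged)
```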

## ord( ) and chr( )

`ord()` and `chr()` are built-in functions in Python.

The `ord()` function takes a single Unicode character and returns its integer Unicode code point value.

In [None]:
# Use ord() to convert a Unicode character string to its integer code point value

ord("ø")

The `chr()` function does the opposite. It takes an integer and returns the corresponding Unicode character. 

In [None]:
# Use chr() to convert a number to a character

chr(248)

## What does all this mean to Named Entity Recognition?
This means that 'plain text' is just a mystery. Computers ultimately only get a sequence of numbers and what characters those numbers translate to depends on the encoding. 

For those who work with multilingual corpora, especially those who work with texts that were created before the modern day, you will encounter at some point corpora that contain multiple encodings. We can use Python, however, to read a different encoding, standardize it into utf-8, and then continue to open that file as a utf-8 file consistently in the future.

In [None]:
# Use Python 3 to read a Polish text with latin2 encoding
# Note: `path` must already hold the location of the latin2-encoded file
with open(path, 'r', encoding='latin2') as f:
    pol_data = f.read()
print(pol_data)

In [None]:
# Write the data with the encoding utf8 to another file
with open('./data/pol_lat2_to_utf8.txt', 'w', encoding='utf8') as f:
    f.write(pol_data)

In [None]:
# Open the new file using utf8
with open('./data/pol_lat2_to_utf8.txt', 'r', encoding='utf8') as f:
    pol_utf8=f.read()
print(pol_utf8)

When we print out the Polish data from the file encoded in latin2 and from the one encoded in utf8, we get two strings on the screen that look exactly the same. However, when we look at the raw bytes stored in the two files, we see that a character that looks the same to the naked eye, e.g. ń, is stored as different bytes in each file. A program that reads those bytes with the wrong encoding will therefore see a different character.
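The difference at the byte level can be checked without any files by encoding a single character both ways. This is a small sketch that uses only the built-in `str.encode()` and `bytes.decode()` methods.

```python
char = "\u0144"  # the Polish letter ń

latin2_bytes = char.encode("latin2")  # one byte in ISO-8859-2
utf8_bytes = char.encode("utf8")      # two bytes in UTF-8
print(latin2_bytes, utf8_bytes)

# Decoding each byte string with its own encoding recovers the same character.
print(latin2_bytes.decode("latin2") == utf8_bytes.decode("utf8"))

# Decoding the UTF-8 bytes as latin2 does not.
print(utf8_bytes.decode("latin2") == char)
```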

Again, for those who work with multilingual corpora, you will encounter at some point corpora that contain multiple encodings. It is best practice to always convert your data to utf8 before proceeding to tokenization and NER. 

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Here is a txt file that is encoded in ISO-8859-15. Can you convert it to a utf-8 file?

In [None]:
# Get the file
file = './files/iso_text.txt'

## Problems within UTF-8

Our problems with encodings, unfortunately, do not end with UTF-8. Once we have encoded our texts in UTF-8, we can still have issues with characters that look the same but are encoded differently. This is particularly true of accented characters.

Here we have two characters that look exactly the same and are also deemed the same by the computer.

In [None]:
# Two characters that look exactly the same are deemed the same
"Ç" == "Ç"

Here, we also have two characters that look exactly the same but this time they are deemed as two different characters by the computer.

In [None]:
# Two characters that look exactly the same are not deemed the same
"Ç" == "Ç"

The two characters are regarded as different by the computer because at the byte level, they are different.

In [None]:
# Print out the unicode strings for the two characters
print("\u00C7", "\u0043\u0327")

One of them is seen as a single character, the accented C. 

In [None]:
# Latin capital letter C with cedilla
accent_c = "\u00C7"
print(accent_c)

The other is seen as a compound character, consisting of two characters, one being the Latin letter C and the other being the 'combining cedilla' character. 

In [None]:
# 'Latin capital letter C' and 'combining cedilla' characters together
compound_c = "\u0043\u0327"
print(compound_c)

The 'combining cedilla' character can be combined with other letters.

In [None]:
# Another example of compound character
"J\u0327"

In [None]:
# Take a look at the two parts of the compound C
"\u0043 \u0327"

Because the compound C and the accented C are different at the byte level, they are not considered the same by the computer. 

In [None]:
# The two strings are made of different code points, so they do not compare equal
"\u00C7" == "\u0043\u0327"

### Unicode normalization

In NER though, we will not want our NER model to interpret the two characters as two different characters. Therefore, we will need to first normalize them to make them the same at the byte level.

In [None]:
import unicodedata

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | *Canonical* decomposition | `Ç` → `C ̧` |
| Form C | NFC | *Canonical* decomposition followed by *canonical* composition | `Ç` → `C ̧` → `Ç` |

```
Source: James Briggs - https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0
```

In [None]:
# compound C and accented C
print(compound_c, accent_c)

In [None]:
print("\u00C7", "\u0043\u0327")

In [None]:
# Decompose the accented character using Normal Form D
nfd_accent = unicodedata.normalize('NFD', accent_c)
print(compound_c == nfd_accent)

In [None]:
# Decompose and then compose using Normal Form C
nfc_compound = unicodedata.normalize('NFC', compound_c)
print(accent_c == nfc_compound)
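
In practice, one might wrap the normalization step in a small helper and apply it to every text before tokenization. The sketch below assumes NFC as the target form; `normalize_text` and the example name are hypothetical and not part of the original notebook.

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Return the text in the given Unicode normal form (NFC by default)."""
    return unicodedata.normalize(form, text)

composed = "Fran\u00E7ois"      # ç stored as a single code point
decomposed = "Franc\u0327ois"   # c followed by a combining cedilla

print(composed == decomposed)                                   # raw strings differ
print(normalize_text(composed) == normalize_text(decomposed))   # normalized strings match
print(len(composed), len(decomposed))                           # number of code points
```

Normalizing both the corpus and any list of names used for matching keeps a name from being missed just because its accent was stored differently.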

# Introduction to spaCy
The spaCy library (note the spelling, with a lowercase s and a capital C) is a robust library for Natural Language Processing. It supports a wide variety of languages with statistical models capable of parsing texts, identifying parts-of-speech, and extracting entities.

Let's see an example of NLP task that spaCy can do for us.

## Tokenization
Recall the diagram of the NLP pipeline that we saw earlier. A pipeline's purpose is to take input data, perform some sort of operations on that input data, and then output some useful information from the data. On a pipeline, we find the pipes. A pipe is an individual component of a pipeline. Different pipes perform different tasks. After we read in the data from a text file, an essential task of NLP is tokenization.

<center><img src='./files/NER_NLP_pipeline.png' width=700></center>

One form of tokenization is **word tokenization**. When we do word tokenization, we break a text up into individual words and punctuation marks. Another form of tokenization is **sentence tokenization**. Sentence tokenization works the same way as word tokenization, except that instead of breaking a text up into individual words and punctuation marks, we break a text up into individual sentences.

If you are an English speaker, you may think you do not need spaCy for sentence tokenization, because in English, the end of a sentence is indicated by a period `.`. Why not just use the built-in string method `split()`, which allows us to split a text string at the period `.`?

This is a legitimate question, but simply splitting a text string at the period `.` sometimes runs into problems, and spaCy is considerably smarter.

In [None]:
!python3 -m spacy download en_core_web_sm # for English NER
!python3 -m spacy download es_core_news_sm # for Spanish NER

In [2]:
# String to be split
text = "Martin J. Thompson is known for his writing skills. He is also good at programming."

In [3]:
# Split the string by period
sents = text.split(".")
print(sents)

['Martin J', ' Thompson is known for his writing skills', ' He is also good at programming', '']


We had the unfortunate result of splitting at Martin J. The reason for this is obvious. In English, it is common convention to indicate abbreviation with the same punctuation mark used to indicate the end of a sentence. 

We can use spaCy, however, to do sentence tokenization. spaCy is smart enough not to break at Martin J.

First, let's import the spaCy library. Then, we need to load an NLP model object. To do this, we use the `spacy.load()` function. Here, we load the small English NLP model trained on written web text that includes vocabulary, syntax and entities.

In [None]:
# Load the small English NLP model
import spacy
nlp = spacy.load("en_core_web_sm")

We can use this English NLP model to parse a text and create a Doc object. If you need a quick refresher on what classes and objects are, you can refer to [Python intermediate 4](../Python-intermediate/python-intermediate-4.ipynb).

In [None]:
# Use the English model to parse the text we created
doc = nlp(text)

There is a lot of data stored in the Doc object. For example, we can iterate over the sentences in the Doc object and print them out.

In [None]:
# Get the sentence tokens in doc
for sent in doc.sents:
    print(sent)

# spaCy's built-in NER

We have seen one example NLP task that spaCy can do for us. Now let's move on to named entity recognition, the NLP task we focus on in this series.

spaCy already has a built-in NER component that we can use off the shelf.

We will iterate over the doc object as we did above, but instead of iterating over `doc.sents`, we will iterate over `doc.ents`. For our purposes right now, we simply want to get each entity's text (the string itself) and its corresponding label (note the underscore `_` after label).

In [None]:
# Print out the entities in the doc object together with their labels
for ent in doc.ents: # iterate over the entities 
    print (ent.text, ent.label_)

As we can see, the small English model has correctly identified that Martin J. Thompson is an entity and given it the correct label PERSON.

Of course we have many different kinds of entities. Here is a list of entity labels used by the small English NLP model we loaded.

In [None]:
# List of labels in the small English model for NER
nlp.get_pipe("ner").labels

If you would like to know the meaning of a label, you can use the `explain` function.

In [None]:
# Get what a label means
spacy.explain('NORP')

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

In [None]:
### Get the .csv file for this exercise

hp_file = './files/NER_Harry_Potter_1.csv'

In [None]:
### Take a look at the first five rows of the table
import pandas as pd
df = pd.read_csv(hp_file, delimiter=';')
df.head()

In this table we find the name of the characters speaking and their speech. 

Can you make two new columns, "Entities" and "Labels", such that each row of the "Entities" column stores a list of entities found in the sentence in the same row and each row of the "Labels" column stores a list of labels for the entities?

# spaCy's EntityRuler

Life would be so easy if we could just grab the ready-to-use built-in NER of spaCy and apply it to the large volume of data we have at hand. However, things are not that easy.

In [None]:
# Another sample text string
text = "Aars is a small town in Denmark. The town was founded in the 14th century."

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

We see that the built-in NER failed to identify Aars as an entity of the GPE type. If we do want to extract 'Aars' from the text and give it a label of GPE, what can we do? 

## Add EntityRuler as a new pipe

Recall that we have talked about the pipes in a pipeline at the beginning of this lesson. In the case of spaCy, there are a few different pipes that perform different tasks. The tokenizer tokenizes the text into individual tokens; the parser parses the text, and the NER identifies entities and labels them accordingly. When we create a Doc object, all of this data is stored in the Doc object.  

In [None]:
# Take a look at the current pipes
nlp.pipe_names

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. In order to extract the target entities and label them successfully, we can create an EntityRuler, give it some instructions, and then add it to the spaCy pipeline as a new pipe. 

In [None]:
# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Aars"}
            ]

ruler.add_patterns(patterns)

After we add the EntityRuler, we can use the new pipeline to do NER. 

In [None]:
# Use the new model to parse the text and create a new Doc object
doc = nlp(text)

# Iterate over the entities and print them out
for ent in doc.ents: 
    print (ent.text, ent.label_)

In [None]:
# Take a look at the pipes in the new pipeline
nlp.pipe_names

## The importance of order

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive.

In [None]:
# Load a fresh copy of the small English model and parse a new text string
text = "Xiong'an is a satellite city of Beijing."
nlp1 = spacy.load("en_core_web_sm")
doc=nlp1(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Xiong'an is the name of a city. We would want to label it as GPE, not ORG.

In [None]:
# Create the EntityRuler
ruler = nlp1.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Xiong'an"}
            ]

ruler.add_patterns(patterns)

# Get the entities
doc = nlp1(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Why do we still mislabel Xiong'an? When we add the EntityRuler as a new pipe, it is added at the end of the pipeline by default. That means the EntityRuler comes after the built-in NER in spaCy. NER is a hard classification task: a span gets exactly one label, and by default the EntityRuler does not overwrite entities that are already set on the Doc. If Xiong'an has already been labeled ORG by the built-in NER, it will not be relabeled by the EntityRuler that comes after. To give the EntityRuler primacy, we have to place it before the built-in NER when we add it.

In [None]:
# Load the model
nlp2 = spacy.load("en_core_web_sm")

# Create the EntityRuler and add it to the model
ruler = nlp2.add_pipe("entity_ruler", before='ner') # specify that the EntityRuler comes before built-in NER

# Add the new patterns to the ruler
patterns = [
                {"label": "GPE", "pattern": "Xiong'an"}
            ]

ruler.add_patterns(patterns)

# Use the new model to parse the text
doc = nlp2(text)

# Get the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
# EntityRuler comes before the built in ner in nlp2
nlp2.pipe_names

So far, we have only added exact strings to our EntityRuler. However, when we talk about patterns, we usually mean more abstract patterns, not fixed strings. In the following, we will see an example where we write a regular expression pattern and add it to the EntityRuler.

## Write a regex pattern

Suppose we have a text written in English, except that the names are written in Latin. 

In [None]:
# English text with Latin names
text = "Marius was a consul in Rome. Marie is the vocative form."

We could write a function that captures the different forms of the name Marius.

In [None]:
# Write a function that captures the pattern for the Latin name Marius
def pattern(root):
    endings = ["us", "i", "o", "um", "e"] # the different endings of the name 
    patterns = [{"label": "PERSON", "pattern": root+ending} for ending in endings]
    return patterns
marius = pattern("Mari")
marius

In [None]:
# Create an empty English NLP model
nlp_latin = spacy.blank("en")

# Add an EntityRuler
nlp_latin_ruler = nlp_latin.add_pipe("entity_ruler")

# add the pattern for the Latin name Marius to the EntityRuler
nlp_latin_ruler.add_patterns(marius)

In [None]:
# Create a Doc object
doc_latin = nlp_latin(text)

# Iterate over the entities in Doc object and print them out
for ent in doc_latin.ents:
    print (ent.text, ent.label_)

We can also use regular expressions (regex) to specify the pattern.

In [None]:
# Write a function which returns the pattern for Latin name Marius
def latin_roots(root):
    return [{"label": "PERSON", "pattern": [{"TEXT": {"REGEX": "^" + root + r"(us|i|o|um|e)$"}}]}]

# Save the pattern to the variable marius2
marius2 = latin_roots("Mari")

# Create a blank English NLP model
nlp_latin2 = spacy.blank("en")

# Add an EntityRuler to the model
nlp_latin_ruler2 = nlp_latin2.add_pipe("entity_ruler")

# Add the pattern for Latin name Marius to the EntityRuler
nlp_latin_ruler2.add_patterns(marius2)

# Text to be parsed
text = "Marius was a consul in Rome. Marie is the vocative form. Caesar was a dictator."

# Create a Doc object using the new model with the regex pattern in EntityRuler
doc_latin2 = nlp_latin2(text)

# Iterate over the entities and print them out
for ent in doc_latin2.ents:
    print(ent.text, ent.label_)
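
To see which tokens the regular expression accepts without loading spaCy, the same expression can be tried with Python's built-in `re` module. This is only a rough stand-in for the EntityRuler: the token list is hypothetical, and `re.escape()` is an extra precaution that the function above does not use.

```python
import re

root = "Mari"
# Same expression as in latin_roots(), built with the standard library
name_regex = re.compile("^" + re.escape(root) + r"(us|i|o|um|e)$")

# Hypothetical tokens, as a tokenizer might produce them
tokens = ["Marius", "Marie", "Mario", "Maria", "Marium", "Caesar"]
matches = [token for token in tokens if name_regex.search(token)]
print(matches)
```

Because the expression is anchored with `^` and `$`, a token such as "Maria" is rejected: its ending is not in the list of alternatives.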

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

You have seen in the first coding challenge that spaCy's off-the-shelf NER mislabeled some entities. For example, "Hagrid", a person's name, is labeled as ORG. Suppose you have a file with all the characters' names in it. Can you make an EntityRuler and add it to the spaCy pipeline so that all the person names are labeled 'PERSON'?

In [None]:
### Get the .csv file for this exercise
hp_file = './files/NER_HarryPotter_Characters.csv'

In [None]:
# Read in the data from the character csv file
chars_df = pd.read_csv(hp_file, delimiter=';')

# Take a look at the first five rows
chars_df.head()

In [None]:
# Get all parts from a character's name
chars_df = chars_df[['Name']].copy()  # work on a copy so that new columns can be added safely
chars_df['split_name'] = chars_df['Name'].str.split(' ')
chars_df

In [None]:
# Get the first names and last names of the characters
chars_df['first_name'] = chars_df['split_name'].str[0]
chars_df['last_name'] = chars_df['split_name'].str[-1]

first_names = chars_df['first_name'].unique().tolist() # Put all unique first names in a list
last_names = chars_df['last_name'].unique().tolist() # Put all unique last names in a list

names = list(set(first_names) | set(last_names)) # the vertical bar | gives us the union of the two sets

names

Create an EntityRuler. In the ruler, add all characters' names as patterns and specify the label for them as "PERSON". Add the ruler as a new pipe. Last, add two new columns to the dataframe you created from the original NER_Harry_Potter_1.csv file, one storing the entities found in each sentence and one storing the labels for the entities. This time, all characters' names should be correctly labeled as "PERSON".

# Detecting languages in texts

When we work with a multilingual corpus, we will first want to know which languages are used in the corpus. There are different approaches to this. In this section, I will introduce the third-party library Lingua for language detection. Currently, 75 languages are supported by Lingua. Lingua is an open-source project, and its GitHub repository is at https://github.com/pemistahl/lingua-py.

## Language detection with Lingua

In [None]:
# Install language detector

!pip3 install lingua-language-detector

In [None]:
# import the language detector builder
from lingua import LanguageDetectorBuilder

In [None]:
# build a language detector
detector = LanguageDetectorBuilder.from_all_languages().build()

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("This is an English text")

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("Este é um outro texto sem idioma especificado")

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("这是一句中文")

Sometimes you may already know the range of languages in your corpus. You just want to identify the language for each document. In this case, you could narrow down the language detector to only a few languages. 

In [None]:
# build a language detector
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

# Use the detector to decide between the given languages 
detector.compute_language_confidence_values("This is an English text")

## Multiple languages in the same file

The examples we went over just now assume that only one language is used in each document. The language detector we built cannot reliably handle a text with multiple languages, because by default it outputs only one language per text. What if our text has multiple languages, such as the example below?

In [None]:
# a text string with multiple languages 
large_text = '''This is a text where the first line is in English.
Maar de tweede regel is in het Nederlands. 
Dies ist ein deutscher Text.'''

In [None]:
# build a language detector
languages = [Language.ENGLISH, Language.DUTCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

If we run the detector over this text, we get the following output.

In [None]:
# Use the detector to decide the language of the text
detector.detect_language_of(large_text)

By default, Lingua returns the most likely language for a given input text. 

In [None]:
# Get the likelihood of the decision
confidence_values = detector.compute_language_confidence_values(large_text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")

But this text has multiple languages. In this example text, each sentence is written in a different language. Therefore, we need to get each sentence string and run the detector over it.

In [None]:
# Create a Doc object 
doc = nlp(large_text)

# Iterate over each sentence and run the detector over it
for sent in doc.sents:
    print(f"Sentence: {sent.text.strip()}")
    print(detector.detect_language_of(sent.text))

# Bring everything together

In [None]:
# A document that has two languages, English and Spanish
multilingual_document = """This is a story about Margaret who speaks Spanish. 
'Juan Miguel es mi amigo y tiene veinte años.' Margaret said to her friend Sarah.
"""

In [None]:
# build a language detector
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

In [None]:
# Load the relevant models
english_nlp = spacy.load("en_core_web_sm") # for English
spanish_nlp = spacy.load("es_core_news_sm") # for Spanish

In [None]:
# Create an NLP model and create a Doc object
multi_nlp = spacy.blank('en')

# Add sentencizer
multi_nlp.add_pipe('sentencizer')

# Create a Doc object
multi_doc = multi_nlp(multilingual_document.strip())


In [None]:
# Switching between languages with conditionals

for sent in multi_doc.sents:
    # Detect the language once per sentence
    language = detector.detect_language_of(sent.text)
    if language == Language.ENGLISH:
        nested_doc = english_nlp(sent.text.strip())
    elif language == Language.SPANISH:
        nested_doc = spanish_nlp(sent.text.strip())
    else:
        # Skip sentences in a language for which no model is loaded
        continue
    for ent in nested_doc.ents:
        print(ent.text, ent.label_)

In [None]:
!python3 -m spacy download en_core_web_sm # for English NER
!python3 -m spacy download en_core_web_md # for showing the word vectors

# Introduction to word embeddings

How do we represent word meanings in NLP? One way we can represent word meanings is to use word vectors. **Word embeddings** are vector representations of words.

## Distributional hypothesis

Word embeddings are inspired by the **distributional hypothesis** proposed by Harris ([1954](https://link.springer.com/chapter/10.1007/978-94-009-8467-7_1)). The hypothesis can be summarized as: words that occur in similar contexts tend to have similar meanings.

What does "context" mean in word embeddings? Basically, "context" means the neighboring words of a target word. 

Consider the following example. If we choose "village" as the target word and choose a fixed size context window of 2, the two words before "village" and the two words after "village" will constitute the context of the target word.

Treblinka is **a small** **<span style="color: blue;">village</span>** **in Poland.**



## Word2Vec

Word2Vec is one of the most popular techniques for learning word embeddings. Google’s pre-trained word2vec model includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset. Each vector has 300 features, which means each of the 3 million vocabulary entries is represented by 300 floating-point numbers.

The training samples are the (target, context) pairs from the text data. For example, suppose your source text is the sentence "The quick brown fox jumps over the lazy dog". If you choose "quick" as your target word and have set a context window of size 2, you will get three training samples for it, i.e. (quick, the), (quick, brown) and (quick, fox).

**McCormick, C**. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://mccormickml.com/
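
The pair generation described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `skipgram_pairs` and the whitespace tokenization are assumptions made here, not part of word2vec itself.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return all (target, context) pairs for a list of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        # The window covers up to `window_size` tokens on each side of the target
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Lowercased, whitespace-tokenized version of the example sentence
tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens)

# Keep only the pairs whose target word is "quick"
quick_pairs = [pair for pair in pairs if pair[0] == "quick"]
print(quick_pairs)
```

With a window of 2, "quick" is paired with "the", "brown" and "fox", matching the three samples listed above.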

The word2vec model is trained to accomplish the following task: given the input word $w_{1}$, estimate for each word $w_{2}$ in the vocabulary how likely it is that $w_{2}$ is a context word of $w_{1}$.

The network is going to learn the statistics from the number of times each (target, context) shows up. So, for example, if you have a text about kings, queens and kingdoms, the network is probably going to get many more training samples of ("King", "Queen") than ("King", "kangaroo"). Therefore, if you give your trained model the word "King" as input, then it will output a much higher probability for "Queen" than it will for "kangaroo".
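
To get a feel for these statistics, the sketch below simply counts (target, context) pairs in a tiny made-up corpus and turns the counts into relative frequencies. This is a count-based stand-in, not the neural network that word2vec actually trains; the corpus and the helper `context_probability` are hypothetical.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus (hypothetical, for illustration only)
corpus = [
    "the king and queen rule the kingdom",
    "the queen and king live in the kingdom",
    "the king likes kangaroo stew",
]
window_size = 2

# context_counts[target][context] = number of (target, context) pairs seen
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                context_counts[target][tokens[j]] += 1

def context_probability(target, context):
    """Relative frequency of `context` among all context words of `target`."""
    total = sum(context_counts[target].values())
    return context_counts[target][context] / total if total else 0.0

print(context_probability("king", "queen"))
print(context_probability("king", "kangaroo"))
```

In this toy corpus, "queen" occurs among the context words of "king" twice and "kangaroo" only once, so the estimate for "queen" is higher.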

## Word vectors in spaCy

We have used the small English model from spaCy in the previous two notebooks. spaCy also provides medium and large English models. Unlike the small model, both ship with pretrained word vectors, trained using the word2vec family of algorithms.

In [None]:
import spacy

# Load the medium size English model from spaCy
nlp = spacy.load('en_core_web_md')

# Get the word vector for the word "King"
nlp("King").vector

In [None]:
# Get the size of the vector
nlp("King").vector.size

In [None]:
# Get the similarity between the two words "King" and "Queen"
nlp("King").similarity(nlp("Queen"))

In [None]:
# Get the similarity between the two words "King" and "kangaroo"
nlp("King").similarity(nlp("kangaroo"))
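
By default, spaCy's `similarity` method compares two vectors by their cosine similarity. Below is a minimal sketch of that computation in plain Python. The three-dimensional vectors are invented for illustration; the real vectors in `en_core_web_md` have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (hypothetical values, not taken from any model)
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
kangaroo = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))
print(cosine_similarity(king, kangaroo))
```

Vectors pointing in similar directions get a value close to 1, while unrelated directions give a value closer to 0.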

# Word2Vec

The Word2Vec model is a machine learning model. This means that it is not based on explicitly defined rules, but derived from statistical data: instead of trying to come up with exceptionless rules, we use data to approximate them.

Let's use a simple example to understand the ML pipeline. Suppose you are interested in the relationship between the size and the price of a house in your neighborhood. Specifically, you would like to use the size of a house to predict its price. You collect information about the recently sold houses in your neighborhood. You note down their size and sale price. You draw a scatter plot like the following to examine the data.

<center><img src='./files/NER_housebuying_scatter.png' width=300></center>

Now, you would like to derive a relationship between house size and house price. Essentially, you fit a line to the data points. The machine learning technique you choose determines which line you end up with; for example, you can use linear regression.

<center><img src='./files/NER_housebuying.png' width=300></center>

The function for this line is y = ax + b (where y is the price and x is the size in square feet). Of course, you would not fit just any line to your data points. You want the line for which the difference between the actual and the predicted house prices is as small as possible (in linear regression, this is usually measured as the sum of squared differences). Our task, then, reduces to finding the values of a and b in y = ax + b that minimize this difference.
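
A minimal sketch of such a fit, assuming the "difference" is measured as the sum of squared errors (ordinary least squares). The house sizes and prices are invented for illustration, and the closed-form solution below is one common way to compute a and b.

```python
# Hypothetical data: house sizes in square feet and sale prices in dollars
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [200000, 290000, 410000, 500000, 600000]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Closed-form least-squares solution for the line y = a*x + b
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
b = mean_y - a * mean_x

def predict(size):
    """Predicted price for a house of the given size."""
    return a * size + b

print(a, b)           # slope and intercept of the fitted line
print(predict(1800))  # predicted price for an 1800 sqft house
```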

The ML method used in word2vec is more sophisticated. It is a shallow neural network with one hidden layer of neurons and one output layer of neurons. Chris McCormick has a very detailed explanation of this model in his blog post http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/. Although this is a 'neural network' model, it is *shallow* rather than a *deep learning* model; in deep models, the resulting behaviour is much harder to interpret.

## Training and evaluation

Trained models should hold more than a memory of seen cases: the purpose is to build useful generalizations that also work for new data. Learning that “Amazon” is a company in one particular sentence is not enough; the model should infer from the context in which the word appears that "Amazon" is probably a company. This is why the selection of training data matters: training a model on Wikipedia will give you far fewer sentences in the first person than training on Twitter. Training on Wikipedia and then trying to analyse data from Twitter can therefore lead to problems.

In practice, to detect 'overfitting' (good results in training but poor generalization), we separate the annotated samples into training data and evaluation data.
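
Such a split can be sketched as follows. The helper `train_eval_split`, the 80/20 ratio and the placeholder samples are assumptions for illustration, not a spaCy function.

```python
import random

def train_eval_split(samples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder annotated samples (hypothetical)
samples = [(f"sentence {i}", {"entities": []}) for i in range(10)]
train_data, eval_data = train_eval_split(samples)

print(len(train_data), len(eval_data))
```

Shuffling before splitting avoids holding out only the last part of the data, which may differ systematically from the rest.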

https://spacy.io/usage/training

The training data is used to fit a statistical model with a predetermined learning algorithm. The model makes guesses about what the proper labels are, checks them against the correct labels, i.e., the annotated labels, and adjusts itself accordingly. Once it has viewed and guessed across all the training data, the first **epoch**, or **iteration** over the data, is finished. At this stage, the model's accuracy is tested against the evaluation data. The training data is then shuffled and given back to the system, and this repeats for a chosen number of epochs.
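
The guess, check and adjust cycle can be illustrated with the house-price line from earlier. The sketch below is not spaCy's training loop: it fits y = ax + b by stochastic gradient descent on invented, rescaled data, and the learning rate and number of epochs are arbitrary choices.

```python
import random

# Hypothetical data: (size in 1000 sqft, price in $100,000)
train_data = [(1.0, 2.0), (1.5, 2.9), (2.0, 4.1), (2.5, 5.0)]
eval_data = [(3.0, 6.0), (1.2, 2.5)]

def mean_squared_error(a, b, data):
    """Average squared difference between predictions and correct values."""
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

a, b = 0.0, 0.0         # initial guess for the line y = a*x + b
learning_rate = 0.05
rng = random.Random(0)  # fixed seed so the run is repeatable

initial_eval_loss = mean_squared_error(a, b, eval_data)

for epoch in range(200):
    rng.shuffle(train_data)  # randomize the order in every epoch
    for x, y in train_data:
        error = (a * x + b) - y             # guess minus correct value
        a -= learning_rate * 2 * error * x  # adjust the slope slightly
        b -= learning_rate * 2 * error      # adjust the intercept slightly
    if epoch % 50 == 0:
        # After an epoch, check the model against the held-out data
        print(epoch, mean_squared_error(a, b, eval_data))

final_eval_loss = mean_squared_error(a, b, eval_data)
print(final_eval_loss)
```

The error on the evaluation data should shrink over the epochs; if it started rising while the training error kept falling, that would be a sign of overfitting.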

# NER with EntityRuler vs. ML NER

In this section, we are going to make two models to do the same NER task, one doing NER with an EntityRuler and the other doing NER with a model trained on labeled examples.

First, let's download the two data files needed for this example. 

The first file stores the information about the spells in Harry Potter. 

In [None]:
import pandas as pd
hp_spells = './files/NER_HarryPotter_Spells.csv'
spells_df = pd.read_csv(hp_spells, sep=";")
spells_df

In the second file, we find the characters speaking and their speech. Notice that there is a column storing the spells found in the sentence if there is one. 

In [None]:
hp_film_spells = './files/NER_HarryPotter_FilmSpells.csv'
film_spells = pd.read_csv(hp_film_spells)
film_spells

Suppose we would like to create a model that can identify spells in a sentence and give them the label 'SPELL'.

## Create an NLP model with an EntityRuler to identify the spells

In the following, we will first create an NLP model with an entity ruler that identifies spells. This section can be seen as a review of what we have learned about EntityRuler in Wednesday's lesson.
Before we create a new EntityRuler, we will do some preprocessing of the data to get the patterns that we will add to the EntityRuler.

### Preprocessing the data

In [None]:
# Fill the NaN cells with an empty string
spells_df['Incantation'] = spells_df['Incantation'].fillna("")

# Get all spells
spells = spells_df['Incantation'].unique().tolist() # Put all strs in the 'Incantation' column in a list
spells = [spell for spell in spells if spell != ''] # Get all non-empty strs from the list, i.e. all the spells

# Take a look at the spells
spells

### Creating the patterns to be added to the EntityRuler

Recall from Wednesday's lesson that the patterns we add to an EntityRuler look like the following.

`patterns = [{"label": "GPE", "pattern": "Aars"}]`

In [None]:
# Write the pattern to be added to the ruler
patterns = [{"label":"SPELL", "pattern":spell} for spell in spells]

Now that we have the patterns ready, we can add them to an EntityRuler and add the ruler as a new pipe. 

In [None]:
# Create an EntityRuler and add the patterns to the ruler
entruler_nlp = spacy.blank('en') # Create a blank English model
ruler = entruler_nlp.add_pipe("entity_ruler") 
ruler.add_patterns(patterns)

In [None]:
test_text = """Ron Weasley: Wingardium Leviosa! Hermione Granger: You're saying it wrong. 
It's Wing-gar-dium Levi-o-sa, make the 'gar' nice and long. 
Ron Weasley: You do it, then, if you're so clever"""
doc = entruler_nlp(test_text)
for ent in doc.ents:
    print('EntRulerModel', ent.text, ent.label_)

In this model, we have basically hard-coded all spell strings in the EntityRuler.

## Train an NLP model using ML to identify the spells

The training data will have the following format. It is a list of tuples. In each tuple, the first element is the text string containing spells and the second element is a dictionary with the single key 'entities'. Its value is a list of lists, one per spell found in the text string, each holding the start character offset, the end character offset (exclusive) and the label.

`[
('Oculus Reparo', {'entities': [[0, 13, 'SPELL']]}),
('Alohomora', {'entities': [[0, 9, 'SPELL']]})
]`
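
To see where the numbers in these tuples come from, here is a small sketch that computes the character offsets with plain string search. The helper `label_spans` and the second example sentence are hypothetical; the notebook itself generates the offsets with the EntityRuler.

```python
def label_spans(text, terms, label="SPELL"):
    """Return (text, {'entities': [[start, end, label], ...]}) for all term matches."""
    entities = []
    for term in terms:
        start = text.find(term)
        while start != -1:
            # The end offset is exclusive, so text[start:end] is the matched term
            entities.append([start, start + len(term), label])
            start = text.find(term, start + len(term))
    return (text, {"entities": sorted(entities)})

known_spells = ["Oculus Reparo", "Alohomora"]

example = label_spans("Oculus Reparo", known_spells)
print(example)

sentence = "Ron shouted Alohomora at the door"
_, annotation = label_spans(sentence, known_spells)
start, end, label = annotation["entities"][0]
print(sentence[start:end], label)
```

Slicing the text with the stored offsets returns the spell itself, which is a quick way to check that an annotation is aligned.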

The text strings we use for the training are from the 'Sentence' column of the film_spells dataframe.

In [None]:
# Take a look at the film_spells df
film_spells

Since we have hard-coded all spell strings in the EntityRuler and given them the label 'SPELL', we can use this model to generate labeled data as our training data and evaluation data.

In [None]:
import nltk

# Download the tokenizer data needed for sentence tokenization
nltk.download('punkt')
nltk.download('punkt_tab')

def generate_labeled_data(ls_sents): # the input will be a list of strings
    text = ' '.join(ls_sents)
    sents = nltk.sent_tokenize(text)
    labeled_data = []
    for sent in sents:
        doc = entruler_nlp(sent) # create a doc object
        if doc.ents: # if there is at least one entity identified
            labeled_data.append((sent, {"entities":[[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]}))
    return labeled_data

# Assign the result from the function to a new variable
training_validation_data = generate_labeled_data(film_spells['Sentence'].tolist())

# Take a look at the labeled data
training_validation_data

spaCy 3 requires that our training data be stored in its binary `.spacy` format. To do that we need to use the `DocBin` class.

In [None]:
from spacy.tokens import DocBin 

db = DocBin() 

for text, annot in training_validation_data[:19*2]: # Get the first 38 tuples as the training data
    doc = entruler_nlp(text) # create a doc object
    doc.ents = [doc.char_span(ent[0], ent[1], label=ent[2]) for ent in annot['entities']]
    db.add(doc)
db.to_disk("./train_spells.spacy")

In [None]:
# Use a fresh DocBin so the validation file does not also contain the training docs
valid_db = DocBin()

for text, annot in training_validation_data[19*2:]: # Get the remaining tuples as the validation data
    doc = entruler_nlp(text)
    doc.ents = [doc.char_span(ent[0], ent[1], label=ent[2]) for ent in annot['entities']]
    valid_db.add(doc)
valid_db.to_disk("./valid_spells.spacy")

Now we can finally start training our model! 

In [None]:
!python3 -m spacy init config --lang en --pipeline ner config.cfg --force

In [None]:
!python3 -m spacy train config.cfg --output ./output/spells-model/ --paths.train ./train_spells.spacy --paths.dev ./valid_spells.spacy

Now let's finally run our model!

In [None]:
# Load the best model
model_best = spacy.load('./output/spells-model/model-best')

In [None]:
# Let's try our model on this long text string
test_text = """53. Imperio - Makes target obey every command But only for really, really funny pranks. 52. Piertotum Locomotor - Animates statues On one hand, this is awesome. On the other, someone would use this to scare me.

51. Aparecium - Make invisible ink appear

Your notes will be so much cooler.

50. Defodio - Carves through stone and steel

Sometimes you need to get the eff out of there.

49. Descendo - Moves objects downward

You'll never have to get a chair to reach for stuff again.

48. Specialis Revelio - Reveals hidden magical properties in an object

I want to know what I'm eating and if it's magical.

47. Meteolojinx Recanto - Ends effects of weather spells

Otherwise, someone could make it sleet in your bedroom forever.

46. Cave Inimicum/Protego Totalum - Strengthens an area's defenses

Helpful, but why are people trying to break into your campsite?

45. Impedimenta - Freezes someone advancing toward you

"Stop running at me! But also, why are you running at me?"

44. Obscuro - Blindfolds target

Finally, we don't have to rely on "No peeking."

43. Reducto - Explodes object

The "raddest" of all spells.

42. Anapneo - Clears someone's airway

This could save a life, but hopefully you won't need it.

41. Locomotor Mortis - Leg-lock curse

Good for footraces and Southwest Airlines flights.

40. Geminio - Creates temporary, worthless duplicate of any object

You could finally live your dream of lying on a bed of marshmallows, and you'd only need one to start.

39. Aguamenti - Shoot water from wand

No need to replace that fire extinguisher you never bought.

38. Avada Kedavra - The Killing Curse

One word: bugs.

37. Repelo Muggletum - Repels Muggles

Sounds elitist, but seriously, Muggles ruin everything. Take it from me, a Muggle.

36. Stupefy - Stuns target

Since this is every other word of the "Deathly Hallows" script, I think it's pretty useful."""

# Create a doc object out of the text string using the trained model
doc = model_best(test_text)

# Find out the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Let's also try the model we created with an EntityRuler, with all spell names hard-coded in it.

In [None]:
# Create a doc object out of the text string using the EntityRuler model
doc = entruler_nlp(test_text)

# Find out the entities
for ent in doc.ents:
    print(ent.text, ent.label_)
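
To compare the two models more systematically than by reading the printed lists, one option is to check each model's predictions against a hand-checked list of spells and compute precision and recall. The sets below are invented placeholders, not the actual output of either model, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted entities against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical hand-checked spells and hypothetical model outputs
gold_spells = {"Imperio", "Aparecium", "Defodio", "Descendo"}
ruler_predictions = {"Imperio", "Aparecium", "Defodio"}
trained_predictions = {"Imperio", "Muggles"}

print("EntityRuler:", precision_recall(ruler_predictions, gold_spells))
print("Trained model:", precision_recall(trained_predictions, gold_spells))
```

Precision is the share of predicted entities that are correct; recall is the share of true entities that were found.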

It seems in this example our EntityRuler model performs better than our trained model. Why do we think that is?

Part of the reason is described by Ines Montani in this Stack Overflow answer: https://stackoverflow.com/questions/50580262/how-to-use-spacy-to-create-a-new-entity-and-learn-only-from-keyword-list/50603247#50603247

"The advantage of training the named entity recognizer to detect SPECIES in your text is that the model won't only be able to recognise your examples, but also generalise and recognise other species in context. If you only want to find a fixed set of terms and not more, a simpler, rule-based approach might work better for you."

# References
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Spolsky, J. (2003, October 8). The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Joel on Software. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/


# Who is discussed in the Stanford Encyclopedia?

## Version 1: Named Entity Recognition
1. Load SEP corpus
2. Recognize Named Entities with spaCy
3. Display results for places, people, and organizations.
4. Split results for people by gender?

## Version 2: Gazetteer
Earlier we wrote that an NER model should not just remember a list of companies, but learn to infer from context what a company is. But if the strings we are looking for are specific enough, other approaches may work. There are many philosophers, but not so many that they cannot be listed. Compiling them all into a list is feasible (unlike, for example, compiling a list of 'names of people in general').