# Named-Entity Recognition with Spacy

Natural Language Processing (NLP) is a subfield of computer science that provides a broad set of tools for extracting information from texts by computational means. In the humanities, the field has proven useful for distant reading, particularly for the study of word frequencies, topic modelling, and sentiment analysis.

There are several libraries for text analysis with Python. The best known are NLTK, Stanford CoreNLP, and Spacy. These libraries provide widely tested toolkits for various NLP tasks.

Using Spacy has many advantages. Its functionality is streamlined to deliver results quickly. By default, it does not overwhelm the user with choices, yet it still allows for modifications. Models can be retrained to improve results, and the pipeline (workflow) can be modified to suit our needs. Moreover, adding new categories to a model is easy. Finally, Spacy provides several models in multiple languages.

In this notebook, we will use the Spacy library to conduct some simple Named-Entity Recognition analysis.

## Spacy Installation and Import

To use Spacy, we need to install the module and the statistical model we wish to use. Everything has been made available for this case study, so no installation is necessary. For future reference, instructions are available here ([spacy installation](https://spacy.io/usage); [model installation](https://spacy.io/usage/models)).

The first thing we will do is import the spacy module:

In [1]:
import spacy

## Understanding the Pipeline 

<div>
<img src="https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg" width="700"/>
</div>

The pipeline is a Python object that acts like a function for processing texts. Conventionally, we will assign it to the variable 'nlp'.

Here, we are using Spacy's [small English model](https://spacy.io/models/en).

In [2]:
nlp = spacy.load('en_core_web_sm')

Within the pipeline, the text passes through a chain of processes. First, it is tokenized: split into individual tokens, that is, words, punctuation marks, and similar units.
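To make the tokenization step concrete, here is a minimal sketch. It assumes a recent Spacy installation and uses spacy.blank("en"), which builds an English pipeline containing only the tokenizer, so no trained model is needed. The variable names `tokenizer_only` and `tokens` are illustrative and not part of this notebook.

```python
import spacy

# A blank English pipeline has a tokenizer but no tagger, parser, or NER.
tokenizer_only = spacy.blank("en")

doc = tokenizer_only("Jacopo lived many years in Rome, working very little.")

# Each Token exposes its text; punctuation marks become tokens of their own.
tokens = [token.text for token in doc]
print(tokens)
```

Note that the comma and the final full stop are separate tokens, which is why a token count usually differs from a simple word count.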

Then, the nlp object applies the remaining components (by default, tagging, parsing, and named-entity recognition) to the text. These components are independent and can be disabled or removed, depending on your needs. New, custom components can be added, and the tokenizer can also be modified.
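The sketch below shows one way to inspect, add, and temporarily disable components. It assumes Spacy version 3, where add_pipe accepts a component name as a string and select_pipes is available (older releases use different calls). To stay runnable without a trained model it starts from a blank pipeline and adds the built-in rule-based "sentencizer"; `nlp_demo`, `active_inside`, and `active_after` are illustrative names.

```python
import spacy

# Start from an empty English pipeline (tokenizer only, no components).
nlp_demo = spacy.blank("en")

# Add a built-in component by name (assumes Spacy v3).
nlp_demo.add_pipe("sentencizer")
print(nlp_demo.pipe_names)

# Temporarily switch a component off for a block of work.
with nlp_demo.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_demo.pipe_names)

# Outside the block the component is active again.
active_after = list(nlp_demo.pipe_names)
print(active_inside, active_after)
```

With a trained model such as the one loaded above, the same calls apply to its own components; printing nlp.pipe_names shows which ones that model contains.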

Calling the nlp object on a text returns a Doc object, which has many attributes and methods and contains the sequence of Tokens along with any named entities that were recognized.


It is worth looking at what tags are available in the small English model:
CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

We can use spacy.explain to get more information on tags whose meaning is not obvious:

In [12]:
spacy.explain('FAC')

'Buildings, airports, highways, bridges, etc.'

In [13]:
spacy.explain('GPE')

'Countries, cities, states'

In [14]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [15]:
spacy.explain('LOC')

'Non-GPE locations, mountain ranges, bodies of water'
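Rather than calling spacy.explain once per tag, all of the labels can be looked up in a single loop. This is a minimal sketch: the `labels` list is copied by hand from the tag list above, and `explanations` is an illustrative name. spacy.explain only consults Spacy's built-in glossary, so no model needs to be loaded.

```python
import spacy

# Entity labels copied from the list above.
labels = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]

# Map each label to its glossary description.
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label:12} {description}")
```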

# NER with Spacy

Let's import some data and test spacy out.

In [3]:
import pandas as pd

VasariData = pd.read_csv('VasariData.csv')

In [4]:
VasariData

| Unnamed: 0 | Artist | Craft | Location | Bio |
|---|---|---|---|---|
| 0 | FILIPPO LIPPI, CALLED FILIPPINO | PAINTER | FLORENCE | There was at this same time in Florence a pain... |
| 1 | BERNARDINO PINTURICCHIO | PAINTER | PERUGIA | Even as many are assisted by fortune without b... |
| 2 | FRANCESCO FRANCIA | GOLDSMITH AND PAINTER | BOLOGNA | Francesco Francia, who was born in Bologna in ... |
| 3 | PIETRO PERUGINO [_PIETRO VANNUCCI, OR PIETRO D... | PAINTER | | How great a benefit poverty may be to men of g... |
| 4 | VITTORE SCARPACCIA (CARPACCIO), AND OTHER VENE... | PAINTERS | LOMBARDY | It is very well known that when some of our cr... |
| 5 | JACOPO, CALLED L'INDACO | PAINTER | | Jacopo, called L'Indaco, who was a disciple of... |
| 6 | LUCA SIGNORELLI CORTONA [_LUCA DA CORTONA_] | PAINTER | | Luca Signorelli, an excellent painter, of whom... |
| 7 | FILIPPO LIPPI, CALLED FILIPPINO | PAINTER | FLORENCE | There was at this same time in Florence a pain... |
| 8 | BERNARDINO PINTURICCHIO | PAINTER | PERUGIA | Even as many are assisted by fortune without b... |
| 9 | FRANCESCO FRANCIA | GOLDSMITH AND PAINTER | BOLOGNA | Francesco Francia, who was born in Bologna in ... |


Above, we have a collection of 21 artist biographies from Vasari's Vitae. We can apply NER to these texts to see what locations, people, etc. are present in them.

# Analyzing a single text

The first thing to note is that we can analyze texts one by one, by passing them through the pipeline individually.

Let's create a variable for an excerpt from one of these biographies, to illustrate.

In [25]:
# Take an excerpt (characters 935 to 1890) of the text in row 5, column 3
Jacopo = VasariData.iloc[5,3][935:1891]
print(Jacopo)

 Jacopo worked for many years in Rome, or, to be more precise, he lived many years in Rome, working very little. By his hand, in that city, is the first chapel on the right hand as one enters the Church of S. Agostino by the door of the façade; on the vaulting of which chapel are the Apostles receiving the Holy Spirit, and on the wall below are two stories of Christ--in one His taking Peter and Andrew from their nets, and in the other the Feast of Simon and the Magdalene, in which there is a ceiling of planks and beams, counterfeited very well. In the panel of the same chapel, which he painted in oil, is a Dead Christ, wrought and executed with much mastery and diligence. In the Trinità at Rome, likewise, there is a little panel by his hand with the Coronation of Our Lady. But what need is there to say more about this man? What more, indeed, is there to say? It is enough that he loved gossiping as much as he always hated working and painting.


To do so, we call nlp as a function on our text and assign the result to a variable (conventionally, we will call this variable 'doc').

In [26]:
doc = nlp(Jacopo)

This creates a Doc object from our processed text. Our text has been tokenized and passed through any other processes in the pipeline, including named-entity recognition.

The Doc object is a sequence of tokens and includes several attributes and methods.

So, for instance, we could list entities within the document using Doc.ents.

In [27]:
print(doc.ents)

(many years, Rome, many years, Rome, first, one, the Church of S. Agostino, Apostles, the Holy Spirit, two, Christ, Peter, Andrew, the Feast of Simon, Magdalene, a Dead Christ, Trinità, Rome, the Coronation of Our Lady)


We could get a bit more sophisticated and include their labels:

In [28]:
for ent in doc.ents:
    print(ent, ent.label_)

many years DATE
Rome GPE
many years DATE
Rome GPE
first ORDINAL
one CARDINAL
the Church of S. Agostino ORG
Apostles GPE
the Holy Spirit WORK_OF_ART
two CARDINAL
Christ ORG
Peter PERSON
Andrew PERSON
the Feast of Simon LOC
Magdalene PERSON
a Dead Christ PERSON
Trinità WORK_OF_ART
Rome GPE
the Coronation of Our Lady ORG


As we can see, without the surrounding text it is still quite difficult to determine whether these labels are correct.
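Grouping the entities by label gives a quicker overview before checking each one in context. The sketch below uses only the standard library and works on (text, label) pairs copied by hand from the output above; with a Doc, the same pairs could be built as [(ent.text, ent.label_) for ent in doc.ents]. The names `entities`, `label_counts`, and `by_label` are illustrative.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the printed output above.
entities = [
    ("many years", "DATE"), ("Rome", "GPE"), ("many years", "DATE"),
    ("Rome", "GPE"), ("first", "ORDINAL"), ("one", "CARDINAL"),
    ("the Church of S. Agostino", "ORG"), ("Apostles", "GPE"),
    ("the Holy Spirit", "WORK_OF_ART"), ("two", "CARDINAL"),
    ("Christ", "ORG"), ("Peter", "PERSON"), ("Andrew", "PERSON"),
    ("the Feast of Simon", "LOC"), ("Magdalene", "PERSON"),
    ("a Dead Christ", "PERSON"), ("Trinità", "WORK_OF_ART"),
    ("Rome", "GPE"), ("the Coronation of Our Lady", "ORG"),
]

# How many entities received each label?
label_counts = Counter(label for _, label in entities)

# Which texts were assigned to each label?
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(label_counts.most_common())
print(by_label["GPE"])
```

Listing the texts per label makes questionable assignments easier to spot, for example 'Apostles' appearing among the geopolitical entities.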

# Displacy: Viewing Labels in Context

An alternative is to view them with Spacy's displacy visualizer, which shows the tags as markup within the text.

In [9]:
from spacy import displacy

In [29]:
displacy.render(doc, style='ent', jupyter=True)

As we can see, some of these tags are correct: Rome is a geopolitical entity, 'many years' does refer to time, etc. Others are not: 'Apostles' is tagged as a geopolitical entity and 'Christ' as an organization.
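When a text contains many entities, it can help to highlight only some labels. displacy.render accepts an options dictionary whose "ents" key restricts the labels that are marked up. The sketch below is a stand-in that runs without a trained model: it builds a short Doc with a blank pipeline and sets two entities by hand. `demo_nlp`, `demo_doc`, `make_span`, and `html` are hypothetical names introduced for illustration.

```python
import spacy
from spacy import displacy

# Stand-in Doc with hand-set entities, so no trained model is required.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Jacopo worked for many years in Rome.")


def make_span(doc, phrase, label):
    """Return a labelled Span covering the first occurrence of phrase."""
    start = doc.text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


demo_doc.ents = [
    make_span(demo_doc, "many years", "DATE"),
    make_span(demo_doc, "Rome", "GPE"),
]

# Highlight only GPE entities; jupyter=False returns the markup as a string.
html = displacy.render(
    demo_doc, style="ent", options={"ents": ["GPE"]}, jupyter=False
)
print(html)
```

In the notebook itself, the same options argument could be passed alongside jupyter=True to display the filtered view of our doc inline.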

Let's learn how to evaluate how well the model is performing on our data, and then modify and train it to improve its results.

# Evaluating Model Performance

