## spaCy Notebook

This notebook performs Named Entity Recognition (NER) with the spaCy library on the data set of publication statements that have not yet been parsed.


### Understanding spaCy
The eventual goal is to use spaCy to find names in the publication statements, but it is worth first seeing what information spaCy provides.
<br><br>
The key pieces are the *entities* and their *labels*. An entity is a span of text that the model recognizes as naming something specific; in the example below: Sebastian Thrun, Google, 2007, etc. A label is the type of an entity; in the example below the labels are PERSON, ORG, DATE, etc. Generic nouns such as "companies" or "CEOs" do not name anything specific, so they are not *Named Entities* and the model does not pick them up.
<br><br>
Example from spaCy website:<br>
text= <br>
("When Sebastian Thrun started working on self-driving cars at "<br>
  "Google in 2007, few people outside of the company took him "<br>
  "seriously. “I can tell you very senior CEOs of major American "<br>
  "car companies would shake my hand and turn away because I wasn’t "<br>
  "worth talking to,” said Thrun, in an interview with Recode earlier "<br>
  "this week.")<br>
  <br>
Entity, Label:
* Sebastian Thrun, PERSON
* Google, ORG
* 2007, DATE
* American, NORP
* Thrun, PERSON
* Recode, ORG
* earlier this week, DATE

In [10]:
# Import Necessary Modules
import pandas as pd
import spacy
# Load the small English spaCy model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Path to the CSV of publication statements that have not been parsed yet
printers_data_file_pubstmt_notparsed = 'data/printers_etc_pubstmt_notparsed.csv'
# Read the CSV into a DataFrame
pubstmt_df = pd.read_csv(printers_data_file_pubstmt_notparsed)

In [15]:
pubstmt_df

Unnamed: 0.1,Unnamed: 0,tcpid,role,role_edited,name,source,title,author,parsedDate,date,place,pubStmt,nameResolved,viafId,namePreprocessed,namedEntities
0,2,A06567,printer.,printer,"Caxton, William, approximately 1422-1491 or 1492,",estc_ep,Stans puer ad mensam,"Lydgate, John, 1370?-1451?",1476.0,1476?],,"Printed by William Caxton, [Westminster : 1476?]",,,,[William Caxton]
1,5,A06543,printer.,printer,"Caxton, William, approximately 1422-1491 or 1492,",estc_ep,[The chorle and the birde],"Lydgate, John, 1370?-1451?",1477.0,1477?],,"W. Caxton, [Westminster : 1477?]",,,,[W. Caxton]
2,6,A06553,printer.,printer,"Caxton, William, approximately 1422-1491 or 1492,",estc_ep,[The horse the ghoos & the sheep],"Lydgate, John, 1370?-1451?",1477.0,1477?],,"By W. Caxton, [Westminster : 1477?]",,,,[W. Caxton]
3,8,A06569,printer.,printer,"Caxton, William, approximately 1422-1491 or 1492,",estc_ep,The temple of glas,"Lydgate, John, 1370?-1451?",1477.0,1477?],,"By William Caxton, [Westminster : 1477?]",,,,[William Caxton]
4,10,A18294,printer.,printer,"Caxton, William, approximately 1422-1491 or 1492,",estc_ep,If it plese ony man spirituel or temporel to b...,"Caxton, William, ca. 1422-1491.",1477.0,1477?],,"W. Caxton, [Westminster : 1477?]",,,,[W. Caxton]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11371,78087,B27318,bookseller,bookseller,H.H.,tcp_ep,A new summons to Horn-Fair: to appear at Cucko...,,1700.0,[1700?],Black-Fryars,"Printed and sold by H.H. in Black-Fryars, [Lon...",,,,[]
11372,78088,B27318,printer,printer,H.H.,tcp_ep,A new summons to Horn-Fair: to appear at Cucko...,,1700.0,[1700?],Black-Fryars,"Printed and sold by H.H. in Black-Fryars, [Lon...",,,,[]
11373,78091,B28403,publisher,publisher,W.H.,tcp_ep,"The proceedings of the Court of Admiralty, by ...",,1700.0,1700,Fleet-Bridge,"Printed for W.H. near Fleet-Bridge, London : 1700",,,,[]
11374,78100,B43600,publisher,publisher,A.M.,tcp_ep,"The cooper of Norfolk, or, A true jest o'th' b...","M. P. (Martin Parker), d. 1656?",1700.0,[1700?],,Printed by and for W.O. for A.M. and sold by t...,,,,[]


### First Steps

A useful first step is to find out which entity labels occur in the publication statements.
<br><br>
The code below prints the (label, text) pairs that spaCy finds in the first 100 statements.

In [11]:
# Return a (label, text) pair for every entity spaCy finds in the text
def return_entity_labels(text):
    doc = nlp(text)
    label_list = []
    for ent in doc.ents:
        label_list.append((ent.label_, ent.text))
    return label_list
# Print the entity pairs for the first 100 rows of pubstmt_df['pubStmt']
for stmt in pubstmt_df['pubStmt'][:100]:
    print(return_entity_labels(stmt))

[('PERSON', 'William Caxton'), ('DATE', '1476')]
[('PERSON', 'W. Caxton'), ('DATE', '1477')]
[('PERSON', 'W. Caxton'), ('DATE', '1477')]
[('PERSON', 'William Caxton'), ('DATE', '1477')]
[('PERSON', 'W. Caxton'), ('DATE', '1477')]
[('PERSON', 'William Caxton')]
[('PERSON', 'W. Caxton'), ('DATE', '1477')]
[('PERSON', 'W. Caxton'), ('DATE', '1477')]
[('PERSON', 'William Caxton')]
[('GPE', 'London')]
[('GPE', 'London')]
[('PERSON', 'W. Caxton'), ('CARDINAL', '1480')]
[('PERSON', 'William Caxton'), ('CARDINAL', '1480')]
[('PERSON', 'William Caxton'), ('DATE', '2 July 1482')]
[('GPE', 'London'), ('DATE', '1482')]
[('PERSON', 'William Caxton'), ('GPE', 'thabbey'), ('GPE', 'london'), ('GPE', 'London'), ('DATE', 'October'), ('GPE', 'CCCC'), ('PERSON', 'Kyng Edward'), ('ORDINAL', 'fourth')]
[('PERSON', 'Caxton'), ('PERSON', 'Septembre'), ('PERSON', 'Kyng Richard'), ('CARDINAL', '1483'), ('GPE', 'CCCC')]
[('PERSON', 'Wilhelmum de Mechlinia'), ('GPE', 'London')]
[('ORG', 'Wylliam Caxton')]
[('PERS

In [5]:
# It is also useful to collect the set of all labels that occur
def get_entity_labels(text):
    doc = nlp(text)
    # A set comprehension keeps each label once per statement
    labels = {ent.label_ for ent in doc.ents}
    return labels

all_labels = set()

# Loop through the 'pubStmt' column and add the entity labels to the set
for stmt in pubstmt_df['pubStmt']:
    labels = get_entity_labels(stmt)
    all_labels.update(labels)

# Print the set of all unique entity labels
print(all_labels)

{'WORK_OF_ART', 'PRODUCT', 'MONEY', 'GPE', 'LAW', 'LANGUAGE', 'DATE', 'FAC', 'QUANTITY', 'ORG', 'CARDINAL', 'NORP', 'PERSON', 'EVENT', 'LOC', 'ORDINAL'}
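
Beyond knowing which labels occur, it can help to know how often each one occurs, since rarely seen labels may deserve a closer look. The following is a minimal sketch, not part of the original notebook: `count_labels` is a hypothetical helper that tallies `(label, text)` pairs in the format returned by `return_entity_labels`, and the sample input is copied from a few of the output lines printed earlier rather than from the full data set.

```python
from collections import Counter

def count_labels(pairs_per_statement):
    """Tally how often each entity label occurs across many statements.

    pairs_per_statement: iterable of lists of (label, text) tuples, the
    format produced by return_entity_labels in the notebook.
    """
    counts = Counter()
    for pairs in pairs_per_statement:
        # Only the label matters here; the entity text is ignored.
        counts.update(label for label, _text in pairs)
    return counts

# Sample input copied from the printed output above (not the full data set).
sample = [
    [('PERSON', 'William Caxton'), ('DATE', '1476')],
    [('PERSON', 'W. Caxton'), ('DATE', '1477')],
    [('GPE', 'London')],
    [('PERSON', 'W. Caxton'), ('CARDINAL', '1480')],
]

label_counts = count_labels(sample)
for label, n in label_counts.most_common():
    print(label, n)
```

On the full data set the same helper could presumably be called as `count_labels(return_entity_labels(s) for s in pubstmt_df['pubStmt'])`, assuming `nlp` and `pubstmt_df` are loaded as in the cells above.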


## Next Step

The first tests show that 16 labels occur in our publication statements: <br>'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERSON', 'PRODUCT', 'QUANTITY', 'WORK_OF_ART'<br>
The model can assign 18 labels in total. The two that never occur in our data, *'PERCENT'* and *'TIME'*, are italicized: <br> 'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', *'PERCENT'*, 'PERSON', 'PRODUCT', 'QUANTITY', *'TIME'*, 'WORK_OF_ART'
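
The comparison between the two lists above can be made explicit with a set difference. This is a small sketch rather than code from the notebook: `observed_labels` is copied from the printed output, and `possible_labels` is typed out from the list above. On a live pipeline the possible labels could presumably be read from `nlp.get_pipe("ner").labels` instead; that is an assumption about the installed model, not something the notebook does.

```python
# Labels found in the publication statements (copied from the printed output).
observed_labels = {
    'WORK_OF_ART', 'PRODUCT', 'MONEY', 'GPE', 'LAW', 'LANGUAGE', 'DATE', 'FAC',
    'QUANTITY', 'ORG', 'CARDINAL', 'NORP', 'PERSON', 'EVENT', 'LOC', 'ORDINAL',
}

# Labels listed above as possible for the model.
possible_labels = {
    'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY',
    'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME',
    'WORK_OF_ART',
}

# Labels the model knows about but never assigned in this data set.
unused_labels = possible_labels - observed_labels
print(sorted(unused_labels))  # prints ['PERCENT', 'TIME']
```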

Although spaCy is mostly accurate, it sometimes assigns the wrong label. For example:
* Antwerp was labeled as a person (PERSON)
* XXIX was labeled as a person (PERSON)
* CCCC was labeled as a geo-political entity (GPE)
* Wylliam Caxton was labeled as an organization (ORG)

These errors will need to be addressed in the future, but the next step is to add a column to the dataframe that holds every named entity that could be the publisher.
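
One possible way to address some of these errors later is a rule-based post-filter applied to the model's PERSON candidates. The sketch below is an assumption about how that might look, not the notebook's method: `looks_like_roman_numeral`, `filter_person_candidates`, and the `KNOWN_PLACES` set are hypothetical names, and the place list holds only the one example mentioned above. It drops candidates made up entirely of Roman-numeral letters (which covers XXIX) and candidates found in the place list (which covers Antwerp).

```python
import re

# Matches strings made up only of Roman-numeral letters. Deliberately loose,
# so that non-standard early forms such as "CCCC" also match.
ROMAN_LETTERS = re.compile(r"[IVXLCDM]+")

# Hypothetical place list; in practice this might come from a gazetteer.
KNOWN_PLACES = {"antwerp"}

def looks_like_roman_numeral(text):
    """Return True if text consists solely of Roman-numeral letters."""
    return ROMAN_LETTERS.fullmatch(text.strip()) is not None

def filter_person_candidates(candidates):
    """Drop PERSON candidates that look like numerals or known places."""
    kept = []
    for name in candidates:
        if looks_like_roman_numeral(name):
            continue
        if name.strip().lower() in KNOWN_PLACES:
            continue
        kept.append(name)
    return kept

print(filter_person_candidates(["William Caxton", "XXIX", "Antwerp", "W. Caxton"]))
# prints ['William Caxton', 'W. Caxton']
```

The rule is a trade-off: a candidate consisting only of these capital letters, such as a lone initial "M", would also be dropped, so the filter would need checking against the real data before being relied on.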

## Create a Helper Function
This function takes:
* a statement to search for names
* a Named Entity Recognition model used to find them

It returns the text of every entity labeled PERSON and will be used to extract the names.

In [12]:
# Define a function for name extraction
def extract_names(statement, ner_model):
    """Return the text of every PERSON entity that ner_model finds in statement."""
    ner_in_statement = ner_model(statement)
    return [ent.text for ent in ner_in_statement.ents if ent.label_ == "PERSON"]

In [14]:
pubstmt_df['namedEntities'] = pubstmt_df['pubStmt'].apply(extract_names, ner_model=nlp)
pubstmt_df['namedEntities']

0        [William Caxton]
1             [W. Caxton]
2             [W. Caxton]
3        [William Caxton]
4             [W. Caxton]
               ...       
11371                  []
11372                  []
11373                  []
11374                  []
11375                  []
Name: namedEntities, Length: 11376, dtype: object
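
The empty lists at the end of this output belong to statements such as "Printed for W.H. near Fleet-Bridge, London : 1700", where the name is given only as initials and no PERSON entity was returned. One way to catch these might be a regular-expression fallback. The sketch below is an assumption, not part of the notebook: `find_initials` and `extract_names_with_fallback` are hypothetical names, and the `extract` argument stands in for a call such as `lambda s: extract_names(s, nlp)`.

```python
import re

# Two or more capital letters, each followed by a period and optionally
# separated by a single whitespace character: "H.H.", "W. O.", "A.M."
INITIALS = re.compile(r"\b[A-Z]\.(?:\s?[A-Z]\.)+")

def find_initials(statement):
    """Return every run of initials found in the statement."""
    return INITIALS.findall(statement)

def extract_names_with_fallback(statement, extract):
    """Use extract(statement); fall back to initials if it finds nothing."""
    names = extract(statement)
    return names if names else find_initials(statement)

print(find_initials("Printed for W.H. near Fleet-Bridge, London : 1700"))
# prints ['W.H.']
```

A single initial followed by a surname, as in "W. Caxton", is not matched by this pattern, so names the model already handles are left to the model.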