# Entity Extraction with spaCy

## Overview

We're going to look for all the people mentioned in a pile of documents.

### Entities

"Entities" in documents are, generally, names -- names of people, places, and things such as companies. Finding out which entities are mentioned in a trove of documents can be pretty helpful, especially when you don't already _know_ that someone or some place is included in the document.

There are services online that do this kind of extraction, including [DocumentCloud](https://www.documentcloud.org/) ([see how here](https://www.documentcloud.org/faq#faq-analyzing-1)), [Amazon Comprehend](https://aws.amazon.com/comprehend/features/) and [Google Natural Language](https://cloud.google.com/natural-language/).

### Using spaCy

We're going to do our entity extraction right here in our notebook using [spaCy](https://spacy.io/), an open-source natural language processing library that comes with pre-trained models. Specifically, we're using the spaCy [large English language model](https://spacy.io/models/en#en_core_web_lg) trained on the [OntoNotes dataset](https://catalog.ldc.upenn.edu/LDC2013T19) -- a trove of "telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, weblogs" that includes nearly 1.5 million English words.

The spaCy project has a lot of great language features. We'll be looking at the [named entities feature](https://spacy.io/usage/linguistic-features#named-entities). Note also that there are [models for several languages](https://spacy.io/models) being developed in spaCy.


## The Plan

- We'll download the spaCy software and the large English language model.
- We'll also download a (smallish) pile of emails released in a court case.
- We'll learn how to use spaCy functions to extract entities.
- We'll use the spaCy functions to scan all the pages of the emails.

## Credits

This notebook was written by John Keefe at [Quartz](https://qz.com) and includes document-processing code from [a blog post](https://qz.ai/discovering-interesting-documents-in-the-mauritius-leaks/) and a [Jupyter notebook](https://github.com/Quartz/aistudio-doc2vec-for-investigative-journalism/blob/master/Doc2vec%20for%20Investigative%20Journalism.ipynb) by Jeremy B. Merrill at Quartz, who used it to help find documents inside a document dump known as the [Mauritius Leaks](https://qz.com/1670632/how-quartz-used-ai-to-help-reporters-search-the-mauritius-leaks/).

-- John Keefe, [Quartz](https://qz.com), October 2019

## Setup

### For those using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes), or if you use one for more than 12 hours.

If you're using Google Colaboratory, be sure to set your runtime to "GPU", which speeds up your notebook for machine learning:

![change runtime](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/change_runtime_2.jpg)
![pick gpu](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/pick_gpu_2.jpg)

### Everybody do this ...

Everyone needs to run the next cell, which initializes the Python libraries we'll use in this notebook.

In [0]:
## *EVERYBODY* SHOULD RUN THIS CELL
## This can take up to 3 minutes ... but that's normal
%cat /usr/local/cuda/version.txt

!pip install -U spacy[cuda100] --quiet
!python -m spacy download en_core_web_lg
!pip install PyPDF2 --quiet

import spacy
spacy.prefer_gpu()

import en_core_web_lg
import PyPDF2
import json
from os.path import exists

## The Data

In this tutorial, we're going to look at some emails from the office of New York City mayor Bill de Blasio that were released under the Freedom of Information Law. 

The emails were part of the ["Agent of the City" hubbub](https://www.ny1.com/nyc/all-boroughs/news/2018/05/24/agents-of-the-city-emails-released), in which 4,000 city emails were released. You can download the [original file here](https://a860-openrecords.nyc.gov/response/120252?token=c784372fd140497081b4bfcff9f0e3a0) -- though we'll be using a file containing just [the first 100 pages](https://qz-aistudio-public.s3.amazonaws.com/workshops/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf) for this exercise. 

**Mount Google Drive** to work from

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [4]:
%cd /content/drive/'My Drive'/Colab_Notebooks/ml_journalists


/content/drive/My Drive/Colab_Notebooks/ml_journalists


In [5]:
# Run this cell to download the data we'll use for this exercise
!wget -N https://qz-aistudio-public.s3.amazonaws.com/workshops/deblasio_emails_data.zip --quiet
!unzip -q deblasio_emails_data.zip
print('Done!')

Done!


Let's look at what we have.

In [6]:
%ls data/

2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf  images/
choppers/                                             not_yet_seen.png
imagenet_class_index.json


## Trying the entity extraction feature

In [0]:
# First we load the model into the notebook
nlp = en_core_web_lg.load()

In [0]:
# Now let's give it a try
doc = nlp(u"Nairobi governor, Mike Sonko got arrested in Voi by robots")


There's [a whole list of entities spaCy can detect](https://spacy.io/api/annotation#named-entities)!

In [0]:
my_story = """Nairobi Governor Mike Sonko has arrived at the Milimani Law Court ahead of his arraignment today over graft charges at City Hall.
Sonko was transferred from the EACC headquarters under tight police security and arrived at the Milimani Law Court at exactly 7.55 am.
A contingent of heavily armed security personnel has been deployed around the court with the roads leading to the area sealed off and traffic re-directed to other adjacent roads. 
"""

doc = nlp(my_story)

In [24]:
for entity in doc.ents:
    print(entity.text, entity.label_, spacy.explain(entity.label_))

Nairobi GPE Countries, cities, states
Mike Sonko PERSON People, including fictional
the Milimani Law Court ORG Companies, agencies, institutions, etc.
today DATE Absolute or relative dates or periods
City Hall FAC Buildings, airports, highways, bridges, etc.
Sonko PERSON People, including fictional
EACC ORG Companies, agencies, institutions, etc.
the Milimani Law Court ORG Companies, agencies, institutions, etc.
7.55 am TIME Times smaller than a day
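
To see at a glance which kinds of entities turned up, the results can be grouped by label. The sketch below is only an illustration: it uses the (text, label) pairs copied from the output above rather than a live `doc`, so it runs without the model, and the names `found`, `group_by_label` and `entities_by_label` are made up for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. With a live spaCy doc,
# the same list could be built as [(ent.text, ent.label_) for ent in doc.ents].
found = [
    ("Nairobi", "GPE"),
    ("Mike Sonko", "PERSON"),
    ("the Milimani Law Court", "ORG"),
    ("today", "DATE"),
    ("City Hall", "FAC"),
    ("Sonko", "PERSON"),
    ("EACC", "ORG"),
    ("the Milimani Law Court", "ORG"),
    ("7.55 am", "TIME"),
]

def group_by_label(pairs):
    """Return {label: [unique entity texts in first-seen order]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        # skip repeats such as the second "the Milimani Law Court"
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_by_label(found)
for label, texts in entities_by_label.items():
    print(label, texts)
```

Grouping this way makes it easy to pull out just one kind of entity, which is what we'll do with `PERSON` further down.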


## Load the emails into a "jsonl" file

JSONL ("JSON Lines") is a plain-text file format in which every line is a complete JSON record of its own, so the file can be read and processed one line at a time.
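
As a rough illustration of the format, here is a minimal sketch that writes two records as JSONL and reads them back. The sample records are hypothetical (they only mimic the `_id` / `_source` shape used in this notebook), and `io.StringIO` stands in for a real file so the example touches nothing on disk.

```python
import io
import json

# Hypothetical sample records, shaped like the ones this notebook writes.
records = [
    {"_id": "p1", "_source": {"content": "first page text"}},
    {"_id": "p2", "_source": {"content": "second page\nwith a line break"}},
]

# Write: one json.dumps() call per record, each followed by a newline.
# io.StringIO stands in for a real file opened with open(path, "w").
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: each line parses on its own with json.loads().
buffer.seek(0)
loaded = [json.loads(line) for line in buffer]

print(len(loaded), loaded[1]["_id"])
```

Note that `json.dumps` escapes the line break inside the second record's text, which is why each record still fits on a single line.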

This next block reads the PDF file and turns it into a JSONL file, which is much easier to work with.

In [0]:
# read the PDF file into a new file called 'nyc_docs.jsonl'
jsonl_file = "nyc_docs.jsonl"
if not exists(jsonl_file):
    pdf_path = 'data/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf'
    # open both files with `with` so they are closed when we're done
    with open(pdf_path, 'rb') as pdf_file, open(jsonl_file, 'w') as f:
        # note: PyPDF2 3.x renames these calls to PdfReader, .pages and .extract_text()
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        for page_num in range(read_pdf.getNumPages()):
            page = read_pdf.getPage(page_num)
            page_content = page.extractText()
            # write one JSON record per page, with ids like "p1", "p2", ...
            f.write(json.dumps({"_source": {"content": page_content}, "_id": f"p{page_num+1}"}) + "\n")

In [28]:
# let's take a look at the first few lines of the file
!head nyc_docs.jsonl

{"_source": {"content": "  THE CITY OF NEW YORK\n OFFICE OF THE MAYOR\n NEW YORK, NY 10007\n  May \n24, 2018  Dear \nRequester\n,  This letter is in\n response to \nprevious \nrequest\ns pursuant to the Freedom of Information Law \nreceived\n by this Office, seeking\n generally\n  Correspondence between \nthe Office of the Mayor and \nJonathan Rosen or \nBerlinRosen.\n  Due to the number of FOIL requests the Mayor\u00d5s Office has received for similar \ncommunications, as a courtesy the documents being disclosed to you today include materials that \nare outside the scope\n of your requests.\n !The responsive records comprise four volumes of material:\n A. Pages 3\n-729: Material previously \nwithheld in full or in part\n pursuant to the inter\n-agency \nexemption \n\u00a487(2\n)(g) within the time range. This volume also includes 73 pages of \nmateria\nl previously withheld in full pursuant to \n\u00a487(2\n)(b).\n  Range: January 1, 2014 to April 3, 2015.\n B. Pages 730\n-2844: Mater

Each line in the JSONL file now represents a single page in the original document. So now we'll step through each line (aka page) and grab all the entities in the text. Then we'll print out the entities that are people's names.

## Finding and listing the names

In [30]:
with open(jsonl_file, 'r') as f:        # open the jsonl file
    for line in f:                      # loop through each line ...
        line = json.loads(line)            # read the line 
        text = line["_source"]["content"]  # grab the text of the email
        page_number = line["_id"]          # grab the page number we're on
        doc = nlp(text)                    # load the text into the nlp model
        for ent in doc.ents:               # loop through each entity in the text...
            if (ent.label_ == "PERSON"):      # if the entity is a person's name ...
                print(page_number, ent.text)  # print the page number and the name

p1 Jonathan Rosen
p1 BerlinRosen
p2 Jimmy Pan
p3 Phil

p3 Henry Goldman
p3 mailto
p3 Phil

p3 Harvey Weinstein
p3 Gwyneth Paltrow
p3 Henry Goldman
p4 Jonathan Rosen
p4 Ragone
p4 Peter
p4 De Blasia's
p4 Ross
p4 Bill de Blasio
p4 Peter Ragone Cc
p4 Jonathan Rosen
p4 Ragone
p4 DeBlasio Clips
p4 DeBlasio
p4 DeBlasio
p4 Bill De Blasia's
p4 John Kanas
p4 oney
p4 Betty Liu
p4 De Blasia
p5 Wilbur Ross
p5 Steve Schwarzman
p6 Klein
p6 Emma Wolfe
p6 Monica
p6 Kenneth Lovett http://www.nydailynews.com/new-york/rejecti ng-nyc
p6 Mayor de Blasia's
p6 Jeffrey Klein
p6 Mayor de Blasia's
p6 Klein
p6 Klein
p6 Dean Skelos
p6 Klein
p6 Skelos
p6 de Blasia's
p6 loor
p6 Blasia
p6 Klein
p6 h
iking the city income tax on
p8 Jonathan Rosen
p8 Ragone
p8 Peter
p8 Wolfe
p8 Emma
p8 Hawkins
p8 George Arzt's
p8 Richard Brodsky's
p8 Richard Brodsky
p8 Hank Sheinkopf
p8 Bill de Blasia
p8 CRAIN
p8 CRAIN
p8 Bill de Blasia
p8 Andrew Cuomo
p8 Cuomo
p9 Cuomo
p9 de Blasio
p9 Heather Briccetti
p9 Cuomo
p9 de Blasio
p9 Sheldon

Really, we want a list of _names_, not pages, right? So let's build a dictionary that maps each name to the pages where it appears.

In [0]:
list_of_names = {}

with open(jsonl_file, 'r') as f:
    for line in f:
        line = json.loads(line)
        text = line["_source"]["content"]
        page_number = line["_id"]
        doc = nlp(text)
        
        # loop through the entities in the page
        for ent in doc.ents:
            
            # if the entity is a person ...
            if (ent.label_ == "PERSON"):
                
                # check if we already have this entity
                if ent.text in list_of_names:
                    
                    # add this page to the entity's list of pages
                    list_of_names[ent.text] += " " + page_number
                    
                else:
                    
                    # otherwise start a list of pages
                    list_of_names[ent.text] = page_number

In [32]:
list_of_names

{'/#&16)16\'85)H76)+"*6)"\')"*1)8&,8,1\'%)+#$$)(,6)!,0\')\'4\'()!$,1\')6': 'p24',
 "4)#-+(!-,!,>>+6!$,&!)#!+?*9&/'8+!,#!-.+!#+?-!:.)/+!,>!-.+!@ABCDE!*)5:)'%#!-,:)//!F)$,6!(+!G9)/',7/!:6+HI": 'p20',
 '4)%/\'&"#0E,+$I\'H#4\'J/04)$I\'0+&/$\'#$6\'+,"/0&\'K&+B/\'0/4#,/6\',+\'(#0$/0I\'&+B/\'+,"/0\')&&G/&\',++L\'#$6\'$+,\'/$+G': 'p67',
 '4,-(&)"(#/0)&(4/+"5': 'p32',
 '48"-5I\n': 'p32',
 'Adam Dickter': 'p33',
 'Adams': 'p13 p13 p13 p14 p14 p14 p14 p14 p15 p15 p15 p15 p15 p18 p18 p18 p18 p18 p19 p19 p19 p19 p20 p32 p32 p32 p32 p32 p32',
 'Alexandra': 'p59 p65',
 'Alexandra Jonathan Rosen': 'p65',
 'Alicia': 'p53',
 'Alicia glen"s': 'p58',
 "Alicia glen's": 'p58 p58',
 'Alison Baumann': 'p75 p75 p75 p78 p80 p80 p80',
 'Alison Baumann\n': 'p75 p75 p78 p78 p80 p81',
 'Alison Novak': 'p83 p87',
 'Amie Gross': 'p84',
 'Amy Boyle': 'p84 p88',
 'Andrew Brent': 'p58',
 'Andrew Cuomo': 'p8 p38',
 'Ann-Asch': 'p34',
 'Anna\n': 'p58 p58',
 'Anne Carson': 'p84 p88',
 'Anthony Marx': 'p39',
 'Arianna Rosen

In [33]:
for name, pages in sorted(list_of_names.items()):
    print(name, "(" + pages + ")" )

/#&16)16'85)H76)+"*6)"')"*1)8&,8,1'%)+#$$)(,6)!,0')'4'()!$,1')6 (p24)
4)#-+(!-,!,>>+6!$,&!)#!+?*9&/'8+!,#!-.+!#+?-!:.)/+!,>!-.+!@ABCDE!*)5:)'%#!-,:)//!F)$,6!(+!G9)/',7/!:6+HI (p20)
4)%/'&"#0E,+$I'H#4'J/04)$I'0+&/$'#$6'+,"/0&'K&+B/'0/4#,/6',+'(#0$/0I'&+B/'+,"/0')&&G/&',++L'#$6'$+,'/$+G (p67)
4,-(&)"(#/0)&(4/+"5 (p32)
48"-5I
 (p32)
Adam Dickter (p33)
Adams (p13 p13 p13 p14 p14 p14 p14 p14 p15 p15 p15 p15 p15 p18 p18 p18 p18 p18 p19 p19 p19 p19 p20 p32 p32 p32 p32 p32 p32)
Alexandra (p59 p65)
Alexandra Jonathan Rosen (p65)
Alicia (p53)
Alicia glen"s (p58)
Alicia glen's (p58 p58)
Alison Baumann (p75 p75 p75 p78 p80 p80 p80)
Alison Baumann
 (p75 p75 p78 p78 p80 p81)
Alison Novak (p83 p87)
Amie Gross (p84)
Amy Boyle (p84 p88)
Andrew Brent (p58)
Andrew Cuomo (p8 p38)
Ann-Asch (p34)
Anna
 (p58 p58)
Anne Carson (p84 p88)
Anthony Marx (p39)
Arianna Rosenberg (p84 p88)
Arnie Gross (p88)
Ashley Thompson (p47)
B. PogrebinMichael (p35)
Barbara Gratz (p85)
Ben (p26 p26 p30 p30 p53)
Ben Furnas (p41 p5

Once you know a name is _there_, you can search for it [in the original document](https://qz-aistudio-public.s3.amazonaws.com/workshops/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf).
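
The name list above has two rough edges: a page is repeated once for every mention (see "Adams"), and a name with a trailing line break (such as "Alison Baumann") is listed separately from the same name without one. Here is a minimal sketch of one way to tidy that up. It is not part of the original notebook; the helper `pages_by_name` is hypothetical, and it runs on a few (page, name) pairs taken from the output above instead of the full file.

```python
from collections import defaultdict

# (page, name) pairs taken from the output above; in the notebook they would
# come from (page_number, ent.text) inside the entity loop.
mentions = [
    ("p75", "Alison Baumann"),
    ("p75", "Alison Baumann\n"),
    ("p75", "Alison Baumann"),
    ("p78", "Alison Baumann\n"),
    ("p8", "Andrew Cuomo"),
    ("p38", "Andrew Cuomo"),
]

def pages_by_name(pairs):
    """Map each whitespace-normalized name to its pages, without repeats."""
    index = defaultdict(list)
    for page, name in pairs:
        clean = " ".join(name.split())   # collapse newlines and extra spaces
        # ignore empty names, and record each page only once per name
        if clean and page not in index[clean]:
            index[clean].append(page)
    return dict(index)

name_index = pages_by_name(mentions)
for name, pages in sorted(name_index.items()):
    print(name, "(" + " ".join(pages) + ")")
```

Keeping the pages in a list rather than one long string also makes it simple to count how many pages mention each person with `len(pages)`.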