# Exploring Named Entities in documents using Spacy

As a brief look into various ways we might slice up and investigate a large group of documents, here we will explore some possibilities for filtering and sifting using named entities and document clusters.

## Imports

First, let's just import all the things we will need. Specifically, we will be using [Spacy](https://spacy.io/) for NLP, and [Scikit-learn](https://scikit-learn.org) for K-means clustering.

In [12]:
from collections import Counter
import os
import spacy
from spacy import displacy

## Load up the document set

We will use a subset of the documents, since Spacy's NLP processing is a bit heavy to run on all the documents during early exploration. Before moving on, use the `tools.subset` tool or other means to create a data subset specified by document titles or other metadata.

Here, we'll define a generator to yield all the documents of the subset so we can avoid too much memory overhead.

In [13]:
DATADIR = '../data/DocumentCloud/subset'

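def documents(datadir=DATADIR):
    """Lazily yield the text of each file in datadir."""
    for fn in os.listdir(datadir):
        # Use a context manager so each file handle is closed promptly
        with open(os.path.join(datadir, fn)) as infile:
            yield infile.read()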

In [14]:
def find_docs(substring, limit=None):
    """A crude document search utility"""
    count = 0
    for doc in documents():
        if substring.lower() in doc.lower():
            count += 1
            yield doc
            if limit is not None and count >= limit:
                break
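
As a quick sanity check of the lazy search, here is a minimal, self-contained sketch that builds a throwaway directory of sample files and searches it. The `find_docs_in` variant, which takes the directory as an explicit argument, and the sample file contents are assumptions for illustration; the notebook's own `find_docs` always reads from `DATADIR`.

```python
import os
import tempfile


def documents(datadir):
    """Yield the text of each file in datadir, one at a time."""
    for fn in sorted(os.listdir(datadir)):
        with open(os.path.join(datadir, fn), encoding='utf-8') as infile:
            yield infile.read()


def find_docs_in(datadir, substring, limit=None):
    """Hypothetical variant of find_docs that takes the directory explicitly."""
    needle = substring.lower()
    count = 0
    for doc in documents(datadir):
        if needle in doc.lower():
            count += 1
            yield doc
            if limit is not None and count >= limit:
                break


def demo(limit=None):
    """Write three made-up sample files and return the matching documents."""
    samples = {
        'a.txt': 'Notice of ADA Compliance review',
        'b.txt': 'Committee on Finance minutes',
        'c.txt': 'ada compliance follow-up items',
    }
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in samples.items():
            with open(os.path.join(tmp, name), 'w', encoding='utf-8') as outfile:
                outfile.write(text)
        # Consume the generator before the temporary directory is removed
        return list(find_docs_in(tmp, 'ADA Compliance', limit=limit))
```

Because matching is case-insensitive, `demo()` picks up both the upper-case and lower-case mentions, while `demo(limit=1)` stops reading files after the first hit.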

## Spacy for NLP

Spacy is a pretty great NLP library that does decent [Named Entity Recognition (NER)](https://spacy.io/usage/linguistic-features#named-entities) out of the box, among [other standard NLP things](https://spacy.io/usage/linguistic-features), and has some pretty slick [builtin visualizations](https://spacy.io/usage/visualizers) as well. You will need to have a Spacy language model installed in your local environment. E.g.:

```
 $ python -m spacy download en
```

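See the model quickstart for details: https://spacy.io/usage/models#quickstart. Note that the `en` shortcut only works in Spacy 2.x; Spacy 3 and later require a full package name such as `en_core_web_sm`, both in the download command and in `spacy.load`.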

In [15]:
nlp = spacy.load('en')

## Explore the entities

Spacy docs for more info: https://spacy.io/usage/linguistic-features#named-entities-101

First, let's just count up all the entities and take a look at the top terms that are not just numbers.

In [18]:
# This could get heavy if you have a large set of documents. You may want to work with a specific subset

counter = Counter()

for doc in documents():
    doc = nlp(doc)
    counter.update([e.text.replace('\n', ' ') for e in doc.ents])
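
Since each entity also carries a label, one option is to count `(text, label)` pairs and keep only the entity types you care about. The sketch below is a hedged illustration rather than part of the original notebook: `count_entities` is a hypothetical helper, and it works on any objects exposing `.ents` with `.text` and `.label_`, as Spacy `Doc` objects do. Stand-in objects are used so it runs without a language model, and the label names `PERSON`, `GPE`, and `CARDINAL` assume the labeling scheme of Spacy's English models.

```python
from collections import Counter
from types import SimpleNamespace


def count_entities(docs, labels=None):
    """Count (text, label) pairs across docs, optionally keeping only some labels.

    `docs` is any iterable of objects with an `.ents` attribute whose items
    expose `.text` and `.label_`.
    """
    counter = Counter()
    for doc in docs:
        for ent in doc.ents:
            if labels is None or ent.label_ in labels:
                counter[(ent.text.replace('\n', ' '), ent.label_)] += 1
    return counter


# Stand-in objects so the sketch runs without a language model;
# with Spacy loaded you would pass something like `nlp.pipe(documents())`.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text='Terry Peterson', label_='PERSON'),
    SimpleNamespace(text='Chicago', label_='GPE'),
    SimpleNamespace(text='five', label_='CARDINAL'),
])

people_and_places = count_entities([fake_doc, fake_doc], labels={'PERSON', 'GPE'})
```

Filtering by label is a more direct way to drop number-like entities than checking `isdigit()`, which misses spelled-out numbers.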

In [19]:
# print the top entities and their counts
'|'.join([f'{t} ({n})' for t, n in counter.most_common(100) if not t.isdigit()])

'CTA (465)|Chicago (392)|Illinois (264)|Committee (253)|Peterson (240)|Chicago Transit Board (237)|\uf0b7 (234)|Silva (167)|the Chicago Transit Authority (153)|GREGORY P. LONGHINI (135)|Irvine (126)|CHICAGO TRANSIT AUTHORITY  (124)|West Lake Street (123)|Fuller (122)|one (115)|five (105)|Second Floor (90)|six (86)|Boardroom (84)|Board (83)|Davis (83)|Meeting Notices, Agendas (82)|THE TRANSIT BOARD (82)|Minutes (81)|ADA (80)|Authority (78)|Terry Peterson (78)|Miller (71)|Alejandro Silva (68)|BUDGET (67)|AUDIT (65)|minutes (63)|Finance (61)|Transit Board Meetings (60)|COMMITTEE ON FINANCE (60)|two (58)|the Chicago Transit Board (54)|Facilitator Serpe (53)|Budget (52)|Amy Serpe (52)|Contract Number (50)|Kevin Irvine (49)|NOTICE OF COMMITTEE (47)|D. (47)|Alva Rosales (46)|ADA Advisory Committee (44)|Bowen (44)|Youngblood (41)|up to 36 months (41)|567 W. Lake Street (39)|the City of Chicago (38)|Employee Retirement Review Committee (38)|THE EMPLOYEE RETIREMENT REVIEW COMMITTEE (37)|Johnny M
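
The output above counts several surface variants of the same name separately, for example `the Chicago Transit Authority` and `CHICAGO TRANSIT AUTHORITY  `. A rough way to merge them, sketched below under the assumption that case, extra whitespace, and a leading "the" carry no meaning, is to normalize the keys before summing. `normalize_entity` and `merge_counts` are hypothetical helpers, not part of the original notebook; the sample counts are copied from the output above.

```python
import re
from collections import Counter


def normalize_entity(text):
    """Collapse whitespace, lowercase, and drop a leading 'the '."""
    text = re.sub(r'\s+', ' ', text).strip().lower()
    if text.startswith('the '):
        text = text[len('the '):]
    return text


def merge_counts(counter):
    """Sum the counts of all keys that normalize to the same string."""
    merged = Counter()
    for text, n in counter.items():
        merged[normalize_entity(text)] += n
    return merged


raw = Counter({
    'the Chicago Transit Authority': 153,
    'CHICAGO TRANSIT AUTHORITY  ': 124,
    'Chicago Transit Board': 237,
    'the Chicago Transit Board': 54,
})
merged = merge_counts(raw)
```

This is deliberately crude: it will not link `CTA` to the full name or `Peterson` to `Terry Peterson`, which would need an alias table or entity linking.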

## Take a look at some documents

Given the possibility that you've found some interesting entities, you might try looking at some relevant documents.

In [24]:
search_string = 'ADA Compliance'
docs = [nlp(doc) for doc in find_docs(search_string)]
len(docs)

51

In [25]:
doc = docs[0]
displacy.render(doc, jupyter=True, style='ent')

To more easily peruse a large number of entity-annotated documents, you could write them out to the filesystem.

In [26]:
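entities_dir = '../data/entities'
# Make sure the output directory exists before writing to it
os.makedirs(entities_dir, exist_ok=True)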

for i, doc in enumerate(docs):
    html = displacy.render(doc, jupyter=False, style='ent')
    with open(os.path.join(entities_dir, f'{i}.html'), 'w') as outfile:
        outfile.write(html)

You can easily navigate the resulting pages by serving them with Python's built-in HTTP server, run from within the output directory:

```
python -m http.server
```
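
To make the served directory a little easier to browse, you could also generate a simple index page that links to each rendered document. This is a minimal sketch, not part of the original workflow: `write_index` is a hypothetical helper, and it assumes the rendered pages are the `.html` files sitting in the output directory.

```python
import html
import os


def write_index(entities_dir):
    """Write an index.html that links to every other .html file in entities_dir."""
    pages = sorted(
        fn for fn in os.listdir(entities_dir)
        if fn.endswith('.html') and fn != 'index.html'
    )
    items = '\n'.join(
        f'<li><a href="{html.escape(fn)}">{html.escape(fn)}</a></li>'
        for fn in pages
    )
    index_path = os.path.join(entities_dir, 'index.html')
    with open(index_path, 'w', encoding='utf-8') as outfile:
        outfile.write(f'<html><body><ul>\n{items}\n</ul></body></html>\n')
    return index_path
```

Python's `http.server` serves `index.html` automatically when a directory is requested, so the list of links becomes the landing page. Note that the file names are sorted as strings, so `10.html` is listed before `2.html`.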