# spaCy Named Entity Recognition Experiments
This notebook is practice with spaCy's named entity recognition. The aim is to work out a process for taking the text extracted from a PDF and recognizing the features that need to be extracted and saved in the database.

## Imports

In [2]:
!python3 -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047106 sha256=6c1055fe53a15f4c07523fd42c0b3ebe3c1b5aac3001697681d49a4c762e3dd3
  Stored in directory: /private/var/folders/vg/jzyp1_n94pq55jdvv6mtktw40000gn/T/pip-ephem-wheel-cache-43mup_6f/wheels/b7/0d/f0/7ecae8427c515065d75410989e15e5785dd3975fe06e795cd9
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [4]:
# Imports for Doc Analysis
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
# nlp = spacy.load('en_core_web_sm')

# Imports for uploading PDF and converting it to text
from PIL import Image
import pytesseract
from pdf2image import convert_from_path, convert_from_bytes

ModuleNotFoundError: No module named 'spacy'

In [3]:
def get_text(path):
    """
    Takes the path to a PDF file, renders each page as a PIL image, and then runs OCR
    on each image to convert it into text.
    
    INPUT: path to the PDF file
    
    RETURNS: str containing the text of all pages
    """
    # Render the PDF as a list of PIL images, one per page
    pil_images = convert_from_path(path)
    
    # Run OCR on each page image
    pages = [str(pytesseract.image_to_string(image)) for image in pil_images]

    # Concatenate the page texts into a single string
    return ''.join(pages)
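
Several of the entities printed later in this notebook contain embedded line breaks (for example `'the United\n\nStates'`), which suggests the OCR layout is splitting entities. One possible way to reduce this is to collapse whitespace in the text before it is passed to `nlp`. The following is only a minimal sketch: `normalize_ocr_text` is a hypothetical helper that is not part of this notebook, and it assumes paragraph boundaries do not need to be preserved.

```python
import re

def normalize_ocr_text(text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Illustrative input resembling the OCR output, not real data from the PDFs
sample = "the United\n\nStates\nBoard of   Immigration Appeals\n"
print(normalize_ocr_text(sample))
```

Whether this actually improves the recognized entities would need to be checked against the real PDFs, since removing line breaks can also merge unrelated lines such as address blocks.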

In [26]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

### Extracting basic entities
The function below extracts entities from a PDF file given its path. The goal for now is to streamline the process from path to entities, so that the entities from different PDFs can be compared for commonalities that help extract the relevant features with a high degree of accuracy.

In [9]:
def get_entities(pdf_path):
    """
    Function that takes in a path to a PDF file, uses the get_text() function to convert 
    the file to text, and then uses spaCy to extract and return relevant entities.
    
    INPUT: path to PDF file
    
    RETURNS: sets of people, orgs, NORPs, and GPEs, as relevant for testing.
    """
    
    # Convert the PDF file to text, and then run it through the spaCy pipeline
    text = get_text(pdf_path)
    doc = nlp(text)
    
    # Set up sets to collect entities in (sets drop exact duplicates)
    people = set()
    orgs = set()
    norps = set()
    gpes = set()
    
    # Iterate through the entities, check each label, and add the entity text to 
    # the matching set
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # print('Name: ', ent.text)  # Printing lines for pattern testing
            people.add(ent.text)
        elif ent.label_ == "ORG":
            # print('ORG: ', ent.text)
            orgs.add(ent.text)
        elif ent.label_ == "NORP":
            # print('NORP: ', ent.text)
            norps.add(ent.text)
        elif ent.label_ == "GPE":
            # print('GPE: ', ent.text)
            gpes.add(ent.text)
            
    # Return the entity sets
    print('Entity Extraction Complete')
    return people, orgs, norps, gpes
    

In [16]:
people, orgs, norps, gpes = get_entities('test.pdf')

Entity Extraction Complete


In [18]:
print(len(gpes))
print(gpes)
print(len(people))
print(people)
print(len(orgs))
print(orgs)
print(len(norps))
print(norps)

9
{'U.S.', 'Virginia', 'Mexico', 'the United States', 'Maryland', 'the United\n\nStates', 'Falls Church', 'Baltimore', 'MD 21201'}
14
{'joWOeIT MMM', 'File Nos', 'Charles K.\n', 'Sweeney', 'Maureen A. Sweeney', 'Ferino Sanchez Seltik', 'Jennifer Piateski', 'Lopez-Mendoza', 'DAVID W. CROSLAND', 'Donna Carr', 'Denna Cane', 'Jennifer BE', 'Maureen A.', 'John'}
31
{'Cite', 'DAVILA', 'I&N', 'Free State Reporting, Inc.', 'Leesburg Pike', 'an Appellate Court', 'Office of the Clerk', 'Contractor', 'Sony', 'Executive Office for Immigration Review\n\nBoard of Immigration Appeals', 'Government', 'DHS/ICE Office of Chief Counsel', 'OARD', 'Section C', 'U.S. Department of Justice Decision', 'Matter of Toro', 'FERINO', 'Executive Office for Immigration Review\n\n', 'the Board of Immigration Appeals', 'Immigration Judges de', 'The Board', 'the Supreme Court', 'the Executive Office', 'Board', 'U.S. Department of Justice', 'the Immigration Court', 'United States Immigration', 'the\n\nImmigration and Na

In [19]:
people, orgs, norps, gpes = get_entities('appeal_test.pdf')

Entity Extraction Complete


In [20]:
print(len(gpes))
print(gpes)
print(len(people))
print(people)
print(len(orgs))
print(orgs)
print(len(norps))
print(norps)

10
{'TX', 'Houston', 'Northpoint Drive', 'Virginia', 'the United States', 'Guatemala', 'remand', 'Falls Church', 'Copan', 'Honduras'}
18
{'Jose', 'Cynthia L. Crosby', 'Sheridan Green Law PLLC', 'Keily Janeth', 'Esquire\n\n', 'ANALYSIS', 'Gonzales', 'AXXX XXX 957', 'Chen', 'Sheridan G. Green', 'Tab D', 'Nimmo Bhagat', 'Sheridan Gary DHSI/ICE Office', 'QJ', 'Gavino Pineda', 'Maura Suyapa Varela-Erazo', 'Tortus', 'Honduras'}
28
{'Office of the Clerk EJ', 'I&N', 'the U.S. Department of State', 'The Fifth Circuit', 'The Department of Homeland Security', 'Executive Office for Immigration Review', 'CREDIBILITY\n\n', 'Bi Department', 'Whenthe-Court', 'AXXX XXX 957', 'Homeland Security', 'United', 'U.S. Department of Justice Decision', 'Department', 'INA Section', 'DHS', 'Board of Immigration Appeals', 'the Board of Immigration Appeals', 'the Country Report', 'GE', 'the United Nations', 'U.S. Department of Justice\n\nExecutive Office for Immigration Review\n\n \n\n', 'Board', '|-589', 'Torture'

In [21]:
people, orgs, norps, gpes = get_entities('test1.pdf')

Entity Extraction Complete


In [22]:
print(len(gpes))
print(gpes)
print(len(people))
print(people)
print(len(orgs))
print(orgs)
print(len(norps))
print(norps)

6
{'MA', 'Boston', 'Virginia', 'Florida', 'Falls Church', 'Denna'}
9
{'LAUDELINO', 'Mark D.', 'P.O. Box 8728', 'Miller', 'Neil P.\n\n', 'Donna Carr', 'Gwendylan Tregerman', 'Mark D. Cooper', 'Esq'}
18
{'Cite', 'Board', 'U.S. Department of Justice Decision', 'U.S. Department of Justice', 'JOAO', 'Executive Office for Immigration Review\n\nBoard of Immigration Appeals', 'SNjay', 'DHS', 'Office of the Clerk\n\n \n\nCooper', 'The Department of Homeland Security', 'Joao Silva Laudelino', 'DHS/ICE Office of Chief Counsel', 'Falls Church', 'the Board of Immigration Appeals', 'Leesburg Pike', 'Enclosure\n\n', 'JOAO SILVA LAUDELINO', 'Executive Office for Immigration Review'}
0
set()
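
To look for commonalities across documents, the sets returned by `get_entities` can be compared directly. The sketch below counts how many documents each entity string appears in. It uses hand-copied subsets of the GPE results printed above rather than live output, and `entity_document_counts` is a hypothetical helper, not part of this notebook.

```python
from collections import Counter

def entity_document_counts(entity_sets):
    """Count, for each entity string, how many documents it appears in."""
    counts = Counter()
    for entities in entity_sets:
        # set() guards against double counting if a list is passed instead of a set
        counts.update(set(entities))
    return counts

# Hand-copied subsets of the GPE sets printed above (not the full output)
gpes_by_doc = [
    {"Virginia", "Falls Church", "Baltimore", "Mexico"},   # test.pdf
    {"Virginia", "Falls Church", "Houston", "Honduras"},   # appeal_test.pdf
    {"Virginia", "Falls Church", "Boston", "Florida"},     # test1.pdf
]

counts = entity_document_counts(gpes_by_doc)
in_every_doc = sorted(e for e, n in counts.items() if n == len(gpes_by_doc))
print(in_every_doc)  # ['Falls Church', 'Virginia']
```

Entities that show up in every document are plausibly boilerplate shared by all the files rather than case-specific features, though that is an assumption to verify against the PDFs themselves.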


### Refining the entities to extract specific information
We now have a straightforward process for extracting sets of entities from the path to a PDF file. However, many of the entities we get are similar in nature, so we need heuristics for deciding which specific entities correspond to the features we want to extract. For example, we can pull out several names, but we need a workable heuristic to reliably determine who is who.
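
One possible first step, before deciding who is who, is to reduce near-duplicate names. The sketch below drops any name whose tokens are all contained in a longer name from the same set, so that `'Maureen A.'` and `'Sweeney'` are absorbed by `'Maureen A. Sweeney'`. `merge_partial_names` is a hypothetical helper, and the token-subset rule is an assumption for illustration, not a tested heuristic: it would wrongly merge two different people who share a surname.

```python
def merge_partial_names(names):
    """Drop names whose tokens are all contained in a longer name in the set."""
    # Collapse embedded line breaks and extra spaces left over from OCR
    cleaned = {" ".join(n.split()) for n in names}
    cleaned.discard("")
    kept = []
    # Visit names with the most tokens first so full names are kept before fragments
    for name in sorted(cleaned, key=lambda n: (-len(n.split()), n)):
        tokens = set(name.lower().split())
        if any(tokens <= set(k.lower().split()) for k in kept):
            continue  # every token already appears in a longer kept name
        kept.append(name)
    return sorted(kept)

# A few PERSON strings copied from the test.pdf output above
people_sample = {"Charles K.\n", "Sweeney", "Maureen A. Sweeney",
                 "Maureen A.", "Donna Carr", "John"}
print(merge_partial_names(people_sample))
# ['Charles K.', 'Donna Carr', 'John', 'Maureen A. Sweeney']
```

This only reduces the candidate list. Assigning roles to the remaining names would still need additional signals, such as the text surrounding each entity.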