# spaCy Named Entity Recognition Experiments
This notebook is for practicing with spaCy's named entity recognition. The aim is to work out a process for taking the text extracted from a PDF, recognizing the features we need, and saving them to the database.

## Imports

In [34]:
!python3 -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [35]:
# Imports for Doc Analysis
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
# nlp = spacy.load('en_core_web_sm')

# Imports for uploading PDF and converting it to text
from PIL import Image
import pytesseract
from pdf2image import convert_from_path, convert_from_bytes

In [51]:
def get_text(path):
    """
    Takes the path to a PDF file, renders each page as a PIL image, and runs OCR
    on each page image to convert it into text.
    
    INPUT: path to the PDF file
    
    RETURNS: str containing the text of every page, concatenated in page order
    """
    # Render the PDF as a list of PIL images, one per page
    pages = convert_from_path(path)
    
    # Run OCR on each page image
    text = [str(pytesseract.image_to_string(page)) for page in pages]

    # Concatenate the page texts into a single string
    return ''.join(text)

In [26]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

### Extracting basic entities
The function below extracts entities from a PDF file given its path. The goal for now is to streamline the process from path to entities, so that we can compare the entities found in different PDFs and look for commonalities that help us extract the relevant features with a high degree of accuracy.

In [52]:
def get_entities(pdf_path):
    """
    Function that takes in a path to a pdf file, uses the get_text() function to convert 
    the file to text, and then uses SpaCy to extract and return relevant entities.
    
    INPUT: path to pdf file
    
    RETURNS: sets of people, orgs, NORPs, and GPEs found in the text.
    """
    
    # Convert the PDF file to text, then run it through the spaCy pipeline
    text = get_text(pdf_path)
    doc = nlp(text)
    
    # Set up sets to collect the unique entities of each type
    people = set()
    orgs = set()
    norps = set()
    gpes = set()
    
    # Iterate through the entities and add each one to the set matching its label
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # print('Name: ', ent.text)  # Printing lines for pattern testing
            people.add(ent.text)
        elif ent.label_ == "ORG":
            # print('ORG: ', ent.text)
            orgs.add(ent.text)
        elif ent.label_ == "NORP":
            # print('NORP: ', ent.text)
            norps.add(ent.text)
        elif ent.label_ == "GPE":
            # print('GPE: ', ent.text)
            gpes.add(ent.text)
            
    # Report completion and return the entity sets
    print('done')
    return people, orgs, norps, gpes
    

In [53]:
people, orgs, norps, gpes = get_entities('test.pdf')

done


In [54]:
print(gpes)
print(people)
print(orgs)
print(norps)

{'Virginia', 'U.S.', 'the United\n\nStates', 'Falls Church', 'MD 21201', 'Maryland', 'the United States', 'Mexico', 'Baltimore'}
{'Lopez-Mendoza', 'Jennifer Piateski', 'File Nos', 'Jennifer BE', 'Denna Cane', 'joWOeIT MMM', 'Donna Carr', 'Charles K.\n', 'John', 'Sweeney', 'DAVID W. CROSLAND', 'Maureen A.', 'Maureen A. Sweeney', 'Ferino Sanchez Seltik'}
{'Court', 'Cite', 'Office of the Clerk', 'U.S. Department of Justice', 'Sony', 'Matter of Toro', 'the Executive Office', 'Executive Office for Immigration Review\n\nBoard of Immigration Appeals', 'Section C', 'the Executive Office for Immigration Review', 'an Appellate Court', 'Government', 'Free State Reporting, Inc.', 'the Immigration Court', 'United States Immigration', 'Leesburg Pike', 'the\n\nImmigration and Nationality Act', 'the Supreme Court', 'FERINO', 'Executive Office for Immigration Review\n\n', 'Board', 'The Board', 'DHS/ICE Office of Chief Counsel', 'I&N', 'DAVILA', 'Contractor', 'OARD', 'Esquire\n\nUniversity of Maryland I

In [55]:
people, orgs, norps, gpes = get_entities('appeal_test.pdf')

done


In [56]:
print(gpes)
print(people)
print(orgs)
print(norps)

{'Virginia', 'Northpoint Drive', 'Copan', 'Falls Church', 'Honduras', 'TX', 'the United States', 'remand', 'Guatemala', 'Houston'}
{'Sheridan Gary DHSI/ICE Office', 'Sheridan Green Law PLLC', 'Esquire\n\n', 'AXXX XXX 957', 'Honduras', 'Nimmo Bhagat', 'Chen', 'QJ', 'Gavino Pineda', 'Cynthia L. Crosby', 'Sheridan G. Green', 'Tab D', 'ANALYSIS', 'Tortus', 'Maura Suyapa Varela-Erazo', 'Gonzales', 'Jose', 'Keily Janeth'}
{'XZ', 'The Fifth Circuit', 'Court', 'the Country Report', 'the U.S. Department of State', '|-589', 'Board of Immigration Appeals', 'Whenthe-Court', 'IJ', 'Torture', 'Department', 'Office of the Clerk EJ', 'AXXX XXX 957', 'INA Section', 'Executive Office for Immigration Review', 'Board', 'DHS', 'U.S. Department of Justice\n\nExecutive Office for Immigration Review\n\n \n\n', 'I&N', 'the Board of Immigration Appeals', 'GE', 'Bi Department', 'CREDIBILITY\n\n', 'U.S. Department of Justice Decision', 'The Department of Homeland Security', 'United', 'Homeland Security', 'the Uni

In [57]:
people, orgs, norps, gpes = get_entities('test1.pdf')

done


In [58]:
print(gpes)
print(people)
print(orgs)
print(norps)

{'Virginia', 'MA', 'Falls Church', 'Denna', 'Florida', 'Boston'}
{'P.O. Box 8728', 'LAUDELINO', 'Gwendylan Tregerman', 'Miller', 'Mark D. Cooper', 'Donna Carr', 'Esq', 'Mark D.', 'Neil P.\n\n'}
{'DHS', 'DHS/ICE Office of Chief Counsel', 'JOAO SILVA LAUDELINO', 'JOAO', 'Cite', 'Leesburg Pike', 'Executive Office for Immigration Review\n\nBoard of Immigration Appeals', 'Enclosure\n\n', 'the Board of Immigration Appeals', 'Falls Church', 'SNjay', 'Office of the Clerk\n\n \n\nCooper', 'U.S. Department of Justice', 'U.S. Department of Justice Decision', 'Executive Office for Immigration Review', 'The Department of Homeland Security', 'Board', 'Joao Silva Laudelino'}
set()
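
The comparison across PDFs mentioned at the start of this section can be sketched with a simple count of how many documents each entity appears in. This is a minimal sketch, not part of the original notebook: `common_entities` and its `min_docs` parameter are hypothetical names, and the three example sets are abbreviated by hand from the GPE outputs printed above.

```python
from collections import Counter

def common_entities(entity_sets, min_docs=2):
    """Return the entities that occur in at least `min_docs` of the given sets.

    Hypothetical helper for illustration; `entity_sets` is any iterable of
    sets such as those returned by get_entities().
    """
    counts = Counter()
    for entities in entity_sets:
        # Count each entity at most once per document
        counts.update(set(entities))
    return {ent for ent, n in counts.items() if n >= min_docs}

# GPE sets abbreviated from the three outputs above
gpes_test = {'Virginia', 'Falls Church', 'Maryland', 'Baltimore'}
gpes_appeal = {'Virginia', 'Falls Church', 'Honduras', 'Houston'}
gpes_test1 = {'Virginia', 'Falls Church', 'Florida', 'Boston'}

shared = common_entities([gpes_test, gpes_appeal, gpes_test1], min_docs=3)
print(sorted(shared))  # ['Falls Church', 'Virginia']
```

Entities that show up in every document are likely letterhead or boilerplate (here, the address in Falls Church, Virginia), which may help separate them from case-specific entities.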


### Refining the entities to extract specific information
We now have a straightforward process for extracting sets of entities from a PDF given its path. However, many of the extracted entities are similar to one another, and we need heuristics for deciding which specific entities correspond to the features we want to extract (e.g., we can pull out several names, but we need a workable heuristic to reliably determine who is who).
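
One possible first step, before any who-is-who heuristic, is to clean up the raw entity strings, since the outputs above contain near-duplicates that differ only by OCR line breaks (`'the United\n\nStates'` vs. `'the United States'`) and short OCR fragments such as `'QJ'`. The following is a minimal sketch under assumptions: `normalize_entity`, `clean_entities`, and the `min_letters` threshold are hypothetical and not part of the original notebook, and the threshold of 3 is an arbitrary starting point.

```python
import re

def normalize_entity(text):
    """Collapse runs of whitespace (including OCR line breaks) into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()

def clean_entities(entities, min_letters=3):
    """Normalize each entity and drop those with fewer than `min_letters` letters.

    Hypothetical helper; the threshold is an assumption and would need tuning.
    """
    cleaned = set()
    for ent in entities:
        norm = normalize_entity(ent)
        # Count alphabetic characters to filter out short OCR fragments
        letters = sum(ch.isalpha() for ch in norm)
        if letters >= min_letters:
            cleaned.add(norm)
    return cleaned

# Sample strings taken from the outputs above
raw = {'the United\n\nStates', 'the United States', 'Charles K.\n', 'QJ', 'XZ'}
print(sorted(clean_entities(raw)))  # ['Charles K.', 'the United States']
```

Note that a letter-count filter also discards genuine two-letter abbreviations such as `'IJ'` or `'TX'`, so it is a trade-off rather than a safe default.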