# AI - Natural Language Processing
## Part 2 - Functionalize NLP for entities


## Step 1. Install Spacy

If this is the first time you are using spaCy on this computer, you must first run either ```!conda install``` or ```!pip install```:

### TURN OFF FOR COLAB
Run for ANACONDA

In [1]:
conda install -c conda-forge spacy

done
Solving environment: done


  current version: 23.1.0
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.
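
Before running an install cell, it can help to check whether the package is already importable in the current kernel. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of spaCy or conda.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current interpreter can locate the package."""
    # find_spec returns None when no top-level module with this name exists
    return importlib.util.find_spec(package_name) is not None

# Check spaCy itself and the small English model package
for name in ("spacy", "en_core_web_sm"):
    status = "already installed" if is_installed(name) else "missing, run the install cell"
    print(f"{name}: {status}")
```

If a package shows as missing right after an install, restarting the kernel, as the conda note suggests, is usually the first thing to try.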


#### Which language model is best for you?
<a href="https://spacy.io/usage/models">https://spacy.io/usage/models</a>

In [2]:
## import libraries
import pandas as pd
import spacy
import glob

## Step 2. Install language model


### ANACONDA ONLY

In [3]:
conda install -c conda-forge spacy-model-en_core_web_sm

done
Solving environment: done


  current version: 23.1.0
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [4]:
## import that language model
import en_core_web_sm

### Place English library into a ```nlp``` pipeline

In [5]:
## build nlp pipeline (calling it on text will tokenize, parse and run NER for us)
nlp = en_core_web_sm.load()

In [6]:
## what type of object is nlp
type(nlp)

spacy.lang.en.English

In [7]:
### Sample English text:
text = u'''\
On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, \
creator of the VoIP service Skype, for $8.5 billion. \
Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. \
Sandeep Junnarkar got this from Wikipedia. \
But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. \
The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." \
Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.
'''

text

'On May 10, 2011, Microsoft announced its acquisition of\xa0Skype Technologies, creator of the\xa0VoIP\xa0service\xa0Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he\'d rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.\n'
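
The echoed string contains `\xa0` characters. These are non-breaking spaces, which often come along when text is copied from a web page. If they are unwanted, one possible cleanup is Unicode NFKC normalization, which maps a non-breaking space to a regular space. This is a sketch using only the standard library, and the helper name `normalize_spaces` is hypothetical.

```python
import unicodedata

def normalize_spaces(raw):
    """Return the text with compatibility characters such as \\xa0 normalized."""
    # NFKC replaces the non-breaking space (U+00A0) with a plain space (U+0020)
    return unicodedata.normalize("NFKC", raw)

sample = "acquisition of\xa0Skype Technologies"
clean = normalize_spaces(sample)
print(repr(clean))
```

NFKC also folds other compatibility characters (ligatures, for example), so `sample.replace("\xa0", " ")` is the narrower choice when only the spaces should change.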

In [8]:
## let's run the nlp function and create a spacy doc
nlp(text)

On May 10, 2011, Microsoft announced its acquisition of Skype Technologies, creator of the VoIP service Skype, for $8.5 billion. Microsoft is headquartered near Seattle Washington while Skype remains in Palo Alto, California. Sandeep Junnarkar got this from Wikipedia. But he'd rather head to Paris, France to see the Mona Lisa at the Louvre. The Hudson River should really be called by its original native name, Mahicantuck, which means "the river that flows two ways." Mahicantuck flows for 315 miles to the Atlantic Ocean from its source at Mt. Mercy, the tallest peak in New York state.

## Import hearings
Download <a href="https://drive.google.com/file/d/1EUYLeHpHAAW2MGsrT6_jov9cJ-IuDLg-/view?usp=sharing">this Senate hearing</a> and turn it into a spaCy doc.

In [21]:
## pull hearing into notebook
judges = glob.glob("*Judicial*")
judges

['Judicial Assignments & Locations _ NYCOURTS.GOV.pdf']
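
Note that `glob.glob` returns a list of matching paths even when only one file matches, so a single path has to be taken out by index (or by looping) before it can be opened. The sketch below shows this in a temporary folder; the file names are made up for the illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create two placeholder files; only one has "Judicial" in its name
    for name in ("Judicial notes.txt", "Senate hearing.txt"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")

    # The pattern matches any file name containing "Judicial"
    matches = glob.glob(os.path.join(folder, "*Judicial*"))
    print(type(matches).__name__, len(matches))

    # Index into the list to get one path string
    first_name = os.path.basename(matches[0])
    print(first_name)
```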

### FUNCTION TO TOKENIZE

In [25]:
## create function to read one file from a globbed list
def tokenize_file(file):
    """Read a text file and return it as a spaCy doc."""
    ## open the file that was passed in (not the global list)
    with open(file, "r", encoding="utf-8") as text:
        ## turn io object into string
        all_text = text.read()
    ## run the string through the pipeline to tokenize, then return the doc
    return nlp(all_text)

In [26]:
## glob returns a list, so pass a single path
tokenize_file(judges[0])

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 10: invalid continuation byte

In [None]:
## save senate hearing as an nlp doc
doc = tokenize_file(judges[0])
doc
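
The `UnicodeDecodeError` above is consistent with the matched file being a PDF (its name ends in `.pdf`). A PDF is a binary format, so opening it in text mode and decoding it as UTF-8 fails. The sketch below assumes the goal is to tokenize plain text and shows one way to fail early with a clearer message. The helper name `read_plain_text` and the sample files are hypothetical. Extracting text from the PDF itself would need a separate PDF library, which is not shown here.

```python
from pathlib import Path
import tempfile

def read_plain_text(path):
    """Return the file's text, refusing files that start with the PDF signature."""
    raw = Path(path).read_bytes()
    # PDF files begin with the bytes "%PDF-"
    if raw.startswith(b"%PDF-"):
        raise ValueError(f"{path} is a PDF; extract its text before tokenizing")
    return raw.decode("utf-8")

with tempfile.TemporaryDirectory() as folder:
    # A made-up plain-text file and a made-up file with a PDF-style header
    text_path = Path(folder) / "hearing.txt"
    text_path.write_text("The committee will come to order.", encoding="utf-8")
    pdf_path = Path(folder) / "judges.pdf"
    pdf_path.write_bytes(b"%PDF-1.4\n%\xd3\xeb\xe9\xe1\n")

    hearing_text = read_plain_text(text_path)
    try:
        read_plain_text(pdf_path)
        pdf_rejected = False
    except ValueError as err:
        pdf_rejected = True
        print(err)
```

The string returned for a plain-text file could then be passed to `nlp(...)` in the same way `tokenize_file` does.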

In [None]:
## write function to tokenize


In [None]:
## tokenize hearing


### Named Entity Recognition (NER)

#### Spacy easily returns the words that matter to us like names of companies, people, places, art works, numbers, etc.

- ```.ents``` ------------> All entities found in the spaCy doc object.

- ```ent.text``` ------------> The entity's actual text.

- ```ent.label``` ------------> An integer ID (hash) for the entity's category.

- ```ent.label_``` ------------> The entity's category as a readable string, such as ```ORG``` or ```PERSON```.

- ```spacy.explain(ent.label_)``` ---------> A description of the category.




In [None]:
## Try these attributes on the first 10 entities in the doc
for word in doc.ents[:10]:
    print(word.text, word.label_, spacy.explain(word.label_))

## Specialized function to capture entity types

In [None]:
## function to return a DataFrame of entities and entity labels
def cat_entities(doc):
    '''
    Return a DataFrame with one row per entity in the doc:
    the entity text, its label, and the label's meaning.
    '''
    ent_list = []
    if doc.ents:
        for word in doc.ents:
            temp_dict = {"word": word.text,
                         "label": word.label_,
                         "meaning": spacy.explain(word.label_)}
            ent_list.append(temp_dict)
    else:
        print("There are no entities in this doc")

    return pd.DataFrame(ent_list)

In [None]:
## test it on the hearing doc (returns every entity, all labels)
df = cat_entities(doc)
df

In [None]:
## search for people only
df.query("label == 'PERSON'")

In [None]:
## search for orgs only
df.query("label == 'ORG'")

In [None]:
df.nunique()
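
`df.nunique()` counts the distinct values in each column. For a tally per label, the same idea can be sketched in plain Python on records shaped like the ones `cat_entities` builds. The records below are made up for illustration (trimmed to `word` and `label`) and are not output from the hearing.

```python
from collections import Counter

# Made-up entity records in the same shape cat_entities collects
records = [
    {"word": "Microsoft", "label": "ORG"},
    {"word": "Skype", "label": "ORG"},
    {"word": "Sandeep Junnarkar", "label": "PERSON"},
    {"word": "Microsoft", "label": "ORG"},
]

# How many entities were found for each label
label_counts = Counter(record["label"] for record in records)

# Same filter as df.query("label == 'PERSON'"), without pandas
people = [record["word"] for record in records if record["label"] == "PERSON"]

# Number of distinct entity texts, like nunique() on the word column
unique_words = len({record["word"] for record in records})

print(label_counts.most_common())
print(people, unique_words)
```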