# ADAPT Pro - Topic 3 - Automation, Visualization and Best Practices

# NLP Demos

**Useful links**
- Stack post on using the tabula package: https://jpmc.stackenterprise.co/questions/56013
- tabula package documentation: https://tabula-py.readthedocs.io/en/latest/

In [3]:
from collections import Counter  # used below to tally entity labels

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

# NLP packages
import spacy
import en_core_web_sm

## Grab some text


In [4]:
# Route both HTTP and HTTPS traffic through the corporate proxy
proxies = {'http': 'http://proxy.jpmchase.net:8443/',
           'https': 'http://proxy.jpmchase.net:8443/'}

In [5]:
url = 'https://www.equedia.com/comp-disclosure/'
response = requests.get(url, proxies=proxies)

In [7]:
# Parse the HTML and collect the text of every <p> element
soup = BeautifulSoup(response.text, 'html5lib')
paragraphs = [p.text for p in soup.find_all('p')]

In [13]:
paragraphs[23]
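
A hard-coded index such as `paragraphs[23]` stops pointing at the right text as soon as the page layout changes. One possible alternative is to locate paragraphs by keyword. The sketch below uses only the standard library; the helper name `find_paragraphs`, the sample list, and the search pattern are illustrative assumptions, not part of the original notebook.

```python
import re

def find_paragraphs(paragraphs, pattern):
    """Return (index, text) pairs for paragraphs matching a regex, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(i, text) for i, text in enumerate(paragraphs) if regex.search(text)]

# Hypothetical stand-in for the scraped `paragraphs` list
sample_paragraphs = [
    "Welcome to our site.",
    "",
    "We were paid a fee of $10,000 by the company.",
    "Contact us for details.",
]

matches = find_paragraphs(sample_paragraphs, r"\bpaid\b|\bcompensation\b")
print(matches)
```

With the real data, the same call would be `find_paragraphs(paragraphs, ...)` with a pattern suited to the text of interest.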

## NLP Processing
- Calling the loaded spaCy pipeline (`nlp`) on a paragraph returns a document whose `ents` attribute holds the named entities it recognized
- For the full list of entity types, see the spaCy documentation

In the example below, spaCy found the following entity types:
- `CARDINAL` - words related to numerical values
- `ORDINAL` - numerical words related to order (e.g. first, second, etc.)
- `DATE` - absolute or relative dates (e.g. March 1, or second quarter)
- `GPE` - countries, cities, etc.
- `MONEY` - words related to monetary values
- `ORG` - company names
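
The meaning of a label can also be looked up from within the notebook using `spacy.explain`, which queries spaCy's built-in glossary. A minimal sketch, assuming spaCy is installed (no language model is needed for the lookup); the variable names are illustrative:

```python
import spacy

# Entity labels seen in the example paragraph
labels_of_interest = ["CARDINAL", "ORDINAL", "DATE", "GPE", "MONEY", "ORG"]

# spacy.explain returns a short description string, or None for unknown terms
descriptions = {label: spacy.explain(label) for label in labels_of_interest}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```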

In [14]:
nlp = en_core_web_sm.load()

In [17]:
nlpData = nlp(paragraphs[23])
labels = [x.label_ for x in nlpData.ents]

In [18]:
Counter(labels)

In [20]:
ent_type = 'DATE'
items = [x.text for x in nlpData.ents if x.label_ == ent_type]
items

In [22]:
ent_type = 'MONEY'
items = [x.text for x in nlpData.ents if x.label_ == ent_type]
items

In [23]:
ent_type = 'ORG'
items = [x.text for x in nlpData.ents if x.label_ == ent_type]
items

In [24]:
# Collect the matched text for every entity label found in the paragraph
keysList = Counter(labels).keys()
results = {}
for key in keysList:
    results[key] = [x.text for x in nlpData.ents if x.label_ == key]
results
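
The loop above rescans the full entity list once per label. For longer texts, the same grouping can be built in a single pass. The sketch below shows that idea with a `defaultdict`; to stay runnable without a model it works on plain `(text, label)` tuples, and both the helper name `group_by_label` and the sample values are hypothetical.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]} in one pass."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Hypothetical stand-in for [(ent.text, ent.label_) for ent in nlpData.ents]
sample_pairs = [
    ("March 1", "DATE"),
    ("$5,000", "MONEY"),
    ("Acme Corp", "ORG"),
    ("second quarter", "DATE"),
]

grouped = group_by_label(sample_pairs)
print(grouped)
```

In the notebook, the input would be built from the real entities, for example `group_by_label((ent.text, ent.label_) for ent in nlpData.ents)`.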

## Extracting parts of speech
Another way to categorize words using NLP is by grammatical role, e.g. noun phrases or verbs.

In [26]:
print("Noun phrases:", [chunk.text for chunk in nlpData.noun_chunks])

In [27]:
print("Verbs:", [token.lemma_ for token in nlpData if token.pos_ == "VERB"])