# Spacy.io NLP stuff

[spaCy](https://spacy.io/) bills itself as "Industrial-Strength Natural Language Processing" (NLP). To install it along with an English model:

```bash
pip install spacy
python -m spacy download en_core_web_sm # downloads the small English model that we load below
```

There are other, non-English [language models](https://spacy.io/usage/models).

Let's download Tesla's IPO filing (Form S-1) from the SEC again and extract its text:

In [2]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html



In [3]:
from bs4 import BeautifulSoup

def html2text(html_text):
    "Strip the HTML tags and return just the text of the document"
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split()) # first 100 characters, split on whitespace

['S-1', '1', 'ds1.htm', 'REGISTRATION', 'STATEMENT', 'ON', 'FORM', 'S-1', 'Registration', 'Statement', 'on', 'Form', 'S-1', 'Table', 'of', 'Co']
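
`get_text()` keeps whatever whitespace the HTML layout contained, which is why newline tokens appear once we tokenize. If that gets in the way, here is a minimal sketch of a variant that collapses whitespace. The name `html2text_compact` and the sample HTML are made up for illustration; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup

def html2text_compact(html_text):
    "Like html2text, but collapse runs of whitespace into single spaces"
    soup = BeautifulSoup(html_text, "html.parser")
    # separator=" " keeps text from adjacent tags from running together
    text = soup.get_text(separator=" ")
    # split() with no args splits on any whitespace; join puts single spaces back
    return " ".join(text.split())

# hypothetical sample, not the real filing
sample = "<html><body><h1>FORM S-1</h1>\n<p>Registration\n Statement</p></body></html>"
print(html2text_compact(sample))
```

The trade-off is that line breaks are lost, so anything that relies on the original layout (such as headings on their own line) can no longer be recovered.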


## Tokenizing with spaCy

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [32]:
doc = nlp(tsla[0:5000]) # process just the first 5000 characters
for token in doc[:30]:
    if len(str(token).strip())>0: # skip whitespace-only tokens
        print(token.text.strip())

S-1
1
ds1.htm
REGISTRATION
STATEMENT
ON
FORM
S-1
Registration
Statement
on
Form
S-1
Table
of
Contents
As
filed
with
the
Securities
and
Exchange
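
Whitespace tokens can also be skipped with the `is_space` token attribute rather than stripping strings. The sketch below assumes a blank English pipeline (tokenizer only, created with `spacy.blank`) so it runs without downloading a model. The names `nlp_blank` and `sample_doc` and the sample text are illustrative only.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough to demonstrate token attributes
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("S-1\n1\nds1.htm\nREGISTRATION STATEMENT ON FORM S-1")

# token.is_space is True for tokens that consist only of whitespace
words = [token.text for token in sample_doc if not token.is_space]
print(words)
```

The same list comprehension should work unchanged on the `doc` built from the Tesla text.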


## Parts of speech

In [6]:
import pandas as pd
winfo = []
for token in doc[:20]:
    winfo.append([token.text, token.pos_, token.is_stop])
    
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

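| | word | part of speech | stop word |
|---|---|---|---|
| 0 | \n | SPACE | False |
| 1 | S-1 | NOUN | False |
| 2 | \n | SPACE | False |
| 3 | 1 | NUM | False |
| 4 | \n | SPACE | False |
| 5 | ds1.htm | NOUN | False |
| 6 | \n | SPACE | False |
| 7 | REGISTRATION | NOUN | False |
| 8 | STATEMENT | NOUN | False |
| 9 | ON | PROPN | True |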
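
The stop word column comes from `token.is_stop`, which can also be used to filter a document down to its content words. A minimal sketch, again assuming a blank English pipeline so no model download is needed (`token.pos_`, by contrast, needs a trained pipeline such as `en_core_web_sm`). The sample sentence and variable names are illustrative.

```python
import spacy

# Assumption: is_stop is a lexical attribute, so a tokenizer-only pipeline suffices
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("As filed with the Securities and Exchange Commission")

# keep tokens that are neither stop words nor whitespace
content_words = [t.text for t in sample_doc if not t.is_stop and not t.is_space]
print(content_words)
```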


## Named entities

In [18]:
winfo = []
for ent in doc.ents[:30]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

| | word | label |
|---|---|---|
| 0 | 1 | CARDINAL |
| 1 | the Securities and Exchange Commission | ORG |
| 2 | January 29, 2010 | DATE |
| 3 | UNITED STATES | ORG |
| 4 | SECURITIES AND | ORG |
| 5 | EXCHANGE | ORG |
| 6 | Washington | GPE |
| 7 | D.C. | GPE |
| 8 | FORM | ORG |
| 9 | STATEMENT | PERSON |


## Visualizing entities in a notebook

In [8]:
from spacy import displacy
displacy.render(doc[:100], style='ent')
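
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3, a blank pipeline, and a single hand-labeled entity, so it runs without a model. In practice you would pass the `doc` produced by `en_core_web_sm`. The names `nlp_blank`, `sample_doc`, and `html` are illustrative.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("As filed with the Securities and Exchange Commission on January 29, 2010")

# Hand-labeled entity for illustration only: tokens 3..7 are
# "the Securities and Exchange Commission"
sample_doc.ents = [Span(sample_doc, 3, 8, label="ORG")]

# jupyter=False forces render() to return the HTML rather than display it;
# page=True wraps the markup in a complete HTML page
html = displacy.render(sample_doc, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can then be written to an `.html` file and opened in a browser.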

## Splitting into sentences

In [24]:
winfo = []
for s in doc.sents:
    winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

| | sentence |
|---|---|
| 0 | \nS-1\n1\nds1.htm\nREGISTRATION STATEMENT ON F... |
| 1 | Registration Statement on Form S-1\n\n\nTable ... |
| 2 | As filed with the Securities and Exchange Comm... |
| 3 | 333- \n UNITED STATES |
| 4 | SECURITIES AND EXCHANGE COMMISSION Washington... |
| 5 | FORM |
| 6 | S-1 \n REGISTRATION STATEMENT |
| 7 | UNDER |
| 8 | THE SECURITIES ACT OF 1933 Tesla Motors, ... |
| 9 | (Exact name of Registrant as\nspecified in its... |
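
The sentence boundaries above come from the loaded pipeline, and the title-page fragments of the filing evidently trip it up (`FORM` and `UNDER` end up as one-word sentences). For ordinary prose, a rule-based alternative is spaCy's `sentencizer` component, which splits on sentence-ending punctuation. This is a sketch of a different technique from the one used in the notebook, assuming the spaCy v3 `add_pipe` API. The sample text and names are illustrative.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by string name
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # rule-based splitting on ., ! and ?

sample_doc = nlp_rules("Tesla designs electric vehicles. It filed a Form S-1 in 2010. The filing is public.")
sentences = [s.text for s in sample_doc.sents]
for s in sentences:
    print(s)
```

Because it only looks at punctuation, the sentencizer would not do any better on headings that lack a period. It is mainly useful when you want fast, predictable splitting without loading a trained model.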


## Exercise

Extract every token in the TSLA doc that spaCy considers a number. See [spaCy 101](https://spacy.io/usage/spacy-101). Your output should look like this (assuming you used `doc = nlp(tsla[0:5000])`):

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See the [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.