# Spacy.io NLP stuff

[spaCy](https://spacy.io/) bills itself as "Industrial-Strength Natural Language Processing" (NLP). Install the library and download the small English model:

```bash
pip install spacy
python -m spacy download en_core_web_sm # downloads the small English model loaded below
```

There are other, non-English [language models](https://spacy.io/usage/models).

Let's load the Tesla IPO filing (Form S-1) again:

In [17]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html




In [18]:
from bs4 import BeautifulSoup

def html2text(html_text):
    """Strip the HTML tags and return only the text of the document."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split())  # words found in the first 100 characters

['S-1', '1', 'ds1.htm', 'REGISTRATION', 'STATEMENT', 'ON', 'FORM', 'S-1', 'Registration', 'Statement', 'on', 'Form', 'S-1', 'Table', 'of', 'Co']


## Tokenizing with Spacy

In [20]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [21]:
doc = nlp(tsla[0:5000])
type(doc)

spacy.tokens.doc.Doc

In [10]:
for token in doc[:30]:
    if token.text.strip():  # skip whitespace-only tokens
        print(token.text.strip())

S-1
1
ds1.htm
REGISTRATION
STATEMENT
ON
FORM
S-1
Registration
Statement
on
Form
S-1
Table
of
Contents
As
filed
with
the
Securities
and
Exchange


## Parts of speech

In [11]:
import pandas as pd
winfo = []
for token in doc[100:120]:
    winfo.append([token.text, token.pos_, token.is_stop])
    
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

| | word | part of speech | stop word |
|---|---|---|---|
| 0 | jurisdiction | NOUN | False |
| 1 | of | ADP | True |
| 2 | incorporation | NOUN | False |
| 3 | or | CCONJ | True |
| 4 | organization | NOUN | False |
| 5 | ) | PUNCT | False |
| 6 | \n \n | SPACE | False |
| 7 | ( | PUNCT | False |
| 8 | Primary | PROPN | False |
| 9 | Standard | PROPN | False |
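
A table like this is often used to drop stop words, punctuation, and whitespace before further analysis. The following is a minimal sketch of that filtering step, not part of the original notebook: it works on plain tuples copied from the output above so it runs without a model, and the `content_words` helper and the `SKIP_POS` set are hypothetical names chosen for illustration. On real spaCy tokens, the same test could use `token.is_stop` and `token.pos_`.

```python
# Rows copied from the table above: (word, part of speech, stop word).
rows = [
    ("jurisdiction", "NOUN", False),
    ("of", "ADP", True),
    ("incorporation", "NOUN", False),
    ("or", "CCONJ", True),
    ("organization", "NOUN", False),
    (")", "PUNCT", False),
    ("\n \n", "SPACE", False),
    ("(", "PUNCT", False),
    ("Primary", "PROPN", False),
    ("Standard", "PROPN", False),
]

# Hypothetical choice of part-of-speech tags to discard.
SKIP_POS = {"PUNCT", "SPACE"}

def content_words(rows):
    """Keep words that are not stop words, punctuation, or whitespace."""
    return [word for word, pos, is_stop in rows
            if not is_stop and pos not in SKIP_POS]

print(content_words(rows))
# ['jurisdiction', 'incorporation', 'organization', 'Primary', 'Standard']
```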


## Named entities

In [14]:
winfo = []
for ent in doc.ents[:20]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

| | word | label |
|---|---|---|
| 0 | 1 | CARDINAL |
| 1 | the Securities and Exchange Commission | ORG |
| 2 | January 29, 2010 | DATE |
| 3 | UNITED STATES | ORG |
| 4 | SECURITIES AND | ORG |
| 5 | EXCHANGE | ORG |
| 6 | Washington | GPE |
| 7 | D.C. | GPE |
| 8 | FORM | ORG |
| 9 | STATEMENT | PERSON |
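
Grouping the entities by label makes it easier to see what the model found, and where it went wrong on this all-caps boilerplate (for example, `STATEMENT` tagged as `PERSON`). Here is a rough sketch, assuming a list of `(text, label)` pairs like the ones built from `doc.ents` above; the pairs are copied from the output, and `group_by_label` is a hypothetical helper, not a spaCy function.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
ents = [
    ("1", "CARDINAL"),
    ("the Securities and Exchange Commission", "ORG"),
    ("January 29, 2010", "DATE"),
    ("UNITED STATES", "ORG"),
    ("SECURITIES AND", "ORG"),
    ("EXCHANGE", "ORG"),
    ("Washington", "GPE"),
    ("D.C.", "GPE"),
    ("FORM", "ORG"),
    ("STATEMENT", "PERSON"),
]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(ents)
print(by_label["GPE"])   # ['Washington', 'D.C.']
print(by_label["DATE"])  # ['January 29, 2010']
```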


## Word vectors

In [16]:
winfo = []
for t in doc[100:110]:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])

| | word | vector |
|---|---|---|
| 0 | jurisdiction | [3.4729223, -1.4104788, -1.2004784, -0.5751643... |
| 1 | of | [-0.5583954, -1.4372051, -5.764125, -3.389831,... |
| 2 | incorporation | [6.166033, -1.3807518, -0.19106722, -1.9463961... |
| 3 | or | [2.0970569, 1.9565225, -4.5255666, -0.60888207... |
| 4 | organization | [7.056058, 0.38515526, 1.7288362, 0.21033919, ... |
| 5 | ) | [2.3581414, 1.1409386, 1.0774156, -0.045503706... |
| 6 | \n \n | [0.3358876, -0.98331153, -4.240477, 0.67844033... |
| 7 | ( | [2.8860693, 0.09956763, -2.7654376, 2.4729605,... |
| 8 | Primary | [-1.6166165, -3.735238, -2.8839834, -2.1430888... |
| 9 | Standard | [0.5131076, -4.1868796, -3.940477, -2.1611662,... |
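
Vectors like these are usually compared rather than read directly, and cosine similarity is a common choice for that. The sketch below is an illustration added here, not something the notebook computes: it uses made-up three-dimensional vectors and a hypothetical `cosine_similarity` helper written in plain Python. spaCy tokens also offer a `similarity()` method; keep in mind that small models such as `en_core_web_sm` do not ship static word vectors, so comparisons based on them may be less meaningful than with a larger model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration (real vectors have many more dimensions).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # very close to 1.0
print(cosine_similarity(a, c))  # 0.0
```

With real tokens, `list(t.vector)` or the NumPy array itself could be passed in place of the toy lists.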


## Visualizing entities in a notebook

In [39]:
from spacy import displacy
displacy.render(doc[100:180], style='ent')

## Splitting into sentences

In [24]:
winfo = []
for s in doc.sents:
    winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

| | sentence |
|---|---|
| 0 | \nS-1\n1\nds1.htm\nREGISTRATION STATEMENT ON F... |
| 1 | Registration Statement on Form S-1\n\n\nTable ... |
| 2 | As filed with the Securities and Exchange Comm... |
| 3 | 333- \n UNITED STATES |
| 4 | SECURITIES AND EXCHANGE COMMISSION Washington... |
| 5 | FORM |
| 6 | S-1 \n REGISTRATION STATEMENT |
| 7 | UNDER |
| 8 | THE SECURITIES ACT OF 1933 Tesla Motors, ... |
| 9 | (Exact name of Registrant as\nspecified in its... |


## Exercise

Extract every token in the TSLA doc that spaCy considers a number. See [spaCy 101](https://spacy.io/usage/spacy-101) for the attributes available on each token. Your output should look like this (assuming you used `doc = nlp(tsla[0:5000])`):

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.