# <center>Other NLP Packages: spaCy, Gensim, and Stanza (Stanford NLP)</center>

References: 
- https://nlpforhackers.io/complete-guide-to-spacy/
- https://radimrehurek.com/gensim/models/phrases.html
- https://stanfordnlp.github.io/stanza/

## 1. spaCy
- spaCy is a relatively new framework in the Python Natural Language Processing ecosystem and is quickly gaining popularity
- Provides models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
<img src='https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg' width = "70%">
- Supports 8 languages out of the box
- Provides easy and beautiful visualizations
- Provides pretrained word vectors
- installation:
  1. `pip install spacy`
  2. `python -m spacy download en` (spaCy v3+ removed shortcuts such as `en`; use `python -m spacy download en_core_web_sm` there)

In [98]:
# Exercise 1.1. Load package and language library

import spacy
nlp = spacy.load('en')

# if you downloaded en_core_web_sm use the following:
#import en_core_web_sm 
#nlp = en_core_web_sm.load()

In [99]:
# Exercise 1.2. Get POS tags, lemmas, and other token attributes in a single pass

doc = nlp("Next week I'll be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(
        token.text,         # original text
        token.lemma_,       # lemma
        token.is_punct,     # is it punctuation?
        token.is_space,     # is it whitespace?
        token.pos_,         # the simple (coarse-grained) part-of-speech tag
        token.tag_          # the detailed (fine-grained) part-of-speech tag
    ))

Next	next	False	False	ADJ	JJ
week	week	False	False	NOUN	NN
I	-PRON-	False	False	PRON	PRP
'll	will	False	False	VERB	MD
be	be	False	False	VERB	VB
in	in	False	False	ADP	IN
Madrid	Madrid	False	False	PROPN	NNP
.	.	True	False	PUNCT	.


In [100]:
# Exercise 1.3. Segment by sentences

doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [102]:
# Exercise 1.4. Entity Recognition

doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, "\t", ent.label_)

2 	 CARDINAL
9 a.m. 	 TIME
30% 	 PERCENT
just 2 days 	 DATE
WSJ 	 ORG


In [103]:
# Exercise 1.5. Visualize named entities

from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)


In [104]:
# Exercise 1.6. Visualize the dependency graph

from spacy import displacy
 
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
 

## 2. Stanza

Stanza is a Python natural language analysis package. 
- Full neural network pipeline for robust text analytics, including:
    - Tokenization and multi-word token (MWT) expansion
    - Lemmatization
    - Part-of-speech (POS) tagging
    - Dependency parsing
    - Named entity recognition
    - Sentiment analysis
- Pretrained neural models supporting 66 (human) languages
- A stable, officially maintained Python interface to Stanford CoreNLP.

<img src='https://stanfordnlp.github.io/stanza/assets/images/pipeline.png' width="50%" >

In [None]:
# Installation (run once)
# !pip install stanza
# import stanza; stanza.download('en')   # download the English models

In [106]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner,pos,lemma')
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")

for sentence in doc.sentences:  # segment into sentences
    for word in sentence.words: # tokenize into words
        print("{0}\t{1}\t{2}".format(
            word.text,       # original text
            word.lemma,      # lemma
            word.upos        # universal part-of-speech tag.
        ))
    
    print("\n")
    print("Entities:")
    for ent in sentence.ents: # Get entities
        print("{0}\t{1}".format(
            ent.text,        # original text
            ent.type         # entity type
        ))

2020-10-14 22:39:07 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| ner       | ontonotes |

2020-10-14 22:39:07 INFO: Use device: cpu
2020-10-14 22:39:07 INFO: Loading: tokenize
2020-10-14 22:39:07 INFO: Loading: pos
2020-10-14 22:39:08 INFO: Loading: lemma
2020-10-14 22:39:08 INFO: Loading: ner
2020-10-14 22:39:08 INFO: Done loading processors!


I	I	PRON
just	just	ADV
bought	buy	VERB
2	2	NUM
shares	share	NOUN
at	at	ADP
9	9	NUM
a.m.	a.m.	NOUN
because	because	SCONJ
the	the	DET
stock	stock	NOUN
went	go	VERB
up	up	ADV
30	30	NUM
%	%	SYM
in	in	ADP
just	just	ADV
2	2	NUM
days	day	NOUN
according	accord	VERB
to	to	ADP
the	the	DET
WSJ	WSJ	PROPN


Entities:
2	CARDINAL
9 a.m.	TIME
30%	PERCENT
just 2 days	DATE
WSJ	ORG


## 3. gensim
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools, including
  - Word2Vec word embeddings
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - `gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)`
        - `sentences`: an iterable of sentences, each given as a list of tokens; a whole document can be passed as one token list
        - `min_count`: Ignore all words and bigrams with a total collected count lower than this value.
        - `threshold`: Score threshold for forming phrases (higher means fewer phrases). A phrase of words $a$ followed by $b$ is accepted if its score is greater than the threshold. The appropriate value depends heavily on the scoring function.
        - `max_vocab_size`: Maximum size (number of tokens) of the vocabulary.
        - `delimiter`: Glue character used to join collocation tokens; should be a byte string (e.g. '\_') in the Gensim 3.x API used here, whereas Gensim 4+ expects a regular string.
        - `scoring`: Specify how potential phrases are scored. 
           - `default` - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - `npmi` - npmi_scorer().

In [107]:
# Read an online text file (Apple's annual disclosure)
import urllib.request
url = "https://www.sec.gov/Archives/edgar/data/320193/000091205700053623/a2032880z10-k.txt"

file = urllib.request.urlopen(url)
text = file.read().decode('utf-8')

In [108]:
# Exercise 2.1. Find bigrams using gensim

import nltk

from gensim.models.phrases import Phrases, Phraser


# Tokenize the text into tokens
pattern=r'\w[\w\',-]*\w'                        
words=nltk.regexp_tokenize(text.lower(), pattern)

# Train phrase model to find phrases using original_scorer
phrases = Phrases([words], min_count=5, threshold=50)

# Get the unique set of phrases, sorted by score in descending order
# (Gensim 3.x API; in Gensim 4+, export_phrases() takes no arguments and returns a dict)
items = sorted(set(phrases.export_phrases([words])), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

b'firmly committed':	802.83
b'legal proceedings':	714.61
b'fred anderson':	677.00
b'lawrence ellison':	635.21
b'property plant':	625.28
b'gareth chang':	625.28
b'probable but':	588.50
b'united states':	584.88
b'jerome york':	577.18
b'nasdaq national':	577.18
b'asia pacific':	574.42
b'valuation allowance':	559.69
b'6,134 5,941':	555.81
b'japanese yen':	476.40
b'g4 cube':	470.80
b'set forth':	461.75
b'arthur levinson':	416.85
b'millard drexler':	416.85
b'sufficient quantities':	389.79
b'public offering':	370.54
b'vice president':	367.51
b'pro forma':	357.30
b'senior vice':	337.81
b'accounts receivable':	336.29
b'in-process research':	333.48
b'jonathan rubinstein':	333.48
b'professionally oriented':	333.48
b'part ii':	332.79
b'mac os':	330.34
b'entered into':	329.37
b'steven jobs':	322.73
b'adversely affected':	317.60
b'gross margin':	303.17
b'william campbell':	303.17
b'agreement dated':	277.90
b'hereby incorporated':	268.44
b'obtain sufficient':	267.98
b'restructuring actions':	246.82
b
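
The tokenization pattern in Exercise 2.1 explains why entries such as `in-process research` and `6,134 5,941` appear as bigrams. The sketch below applies the same pattern with the standard-library `re.findall`, which should behave like `nltk.regexp_tokenize` for a pattern without capture groups (an assumption made to keep the example dependency-free). The sample sentence is made up for illustration.

```python
import re

# Same pattern as in Exercise 2.1: a token starts and ends with a word
# character and may contain apostrophes, commas, and hyphens in between.
pattern = r"\w[\w',-]*\w"

# Hypothetical sample text
sample = "Apple's in-process research cost $6,134 million, up 30%."
tokens = re.findall(pattern, sample.lower())
print(tokens)
# ["apple's", 'in-process', 'research', 'cost', '6,134', 'million', 'up', '30']

# The pattern needs at least two word characters, so one-letter tokens are dropped.
short = re.findall(pattern, "a b cd")
print(short)   # ['cd']
```

Note that trailing punctuation is stripped (`million,` becomes `million`) while internal commas and hyphens are kept, and that single-character words such as "a" never reach the phrase model.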

In [109]:
# Exercise 2.2. Find bigrams by NPMI

# Train a phrase model using NPMI scoring (scores range from -1 to 1)
phrases = Phrases([words], min_count=5, threshold=0.5,
                  scoring='npmi')

# Get the unique set of phrases, sorted by score in descending order
items = sorted(set(phrases.export_phrases([words])), key=lambda item: -item[1])

# Print the top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

b'6,134 5,941':	1.00
b'firmly committed':	1.00
b'gilbert amelio':	1.00
b'british pound':	1.00
b'legal proceedings':	0.98
b'united states':	0.98
b'japanese yen':	0.98
b'matching contributions':	0.98
b'final assembly':	0.98
b'lawrence ellison':	0.97
b'arthur levinson':	0.97
b'millard drexler':	0.97
b'fred anderson':	0.96
b'mac os':	0.96
b'gareth chang':	0.95
b'pro forma':	0.95
b'vice president':	0.95
b'property plant':	0.94
b'nasdaq national':	0.94
b'jerome york':	0.94
b'jonathan rubinstein':	0.94
b'professionally oriented':	0.94
b'probable but':	0.94
b'asia pacific':	0.93
b'valuation allowance':	0.93
b'william campbell':	0.93
b'mitchell mandich':	0.92
b'set forth':	0.92
b'ronald johnson':	0.91
b'g4 cube':	0.91
b'public offering':	0.91
b'form 10-k':	0.90
b'accounts receivable':	0.90
b'sufficient quantities':	0.90
b'part ii':	0.90
b'601 309':	0.89
b'sets forth':	0.89
b'agreement dated':	0.89
b'senior vice':	0.89
b'fair value':	0.88
b'in-process research':	0.88
b'intellectual property':	0.
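
To help read the NPMI scores above, here is a minimal sketch of normalized pointwise mutual information in plain Python. It assumes the standard definition, $\ln\frac{p(a,b)}{p(a)p(b)} / -\ln p(a,b)$, and leaves out the `min_count` check that a real scorer would apply. The function name `npmi` and the counts are hypothetical.

```python
import math


def npmi(count_a, count_b, count_ab, total_words):
    """Normalized pointwise mutual information of the bigram (a, b)."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Two words that only ever occur together score (approximately) 1.
always_together = npmi(count_a=6, count_b=6, count_ab=6, total_words=10000)

# Two words that co-occur exactly as often as chance predicts score about 0.
independent = npmi(count_a=100, count_b=100, count_ab=1, total_words=10000)

print(round(always_together, 2), round(abs(independent), 2))   # 1.0 0.0
```

Under this definition, a score of 1.00 for a phrase such as `firmly committed` suggests that the two words almost never occur apart in the text, and the bounded range makes a `threshold` like 0.5 easier to interpret than with the default scorer.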

In [110]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Initialize phrase tokenizer
bigram = Phraser(phrases)

sent="Improved profitability was driven by the 30% increase in net sales, stable overall gross margins in 2000 as compared to 1999, and a relatively modest increase in operating expenses before special charges of 18%."
print(bigram[nltk.word_tokenize(sent.lower())])

['improved', 'profitability', 'was', 'driven', 'by', 'the', '30', '%', 'increase', 'in', 'net_sales', ',', 'stable', 'overall', 'gross_margins', 'in', '2000', 'as_compared', 'to', '1999', ',', 'and', 'a', 'relatively', 'modest', 'increase', 'in', 'operating_expenses', 'before', 'special_charges', 'of', '18', '%', '.']
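
The effect of the phrase tokenizer can be approximated with a simple left-to-right merge. The sketch below is a simplified illustration, not Gensim's implementation: it assumes the accepted bigrams are already known as a set of word pairs, and the helper `merge_bigrams` is a hypothetical name.

```python
def merge_bigrams(tokens, known_bigrams, delimiter="_"):
    """Greedily join adjacent token pairs that are known phrases."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_bigrams:
            merged.append(delimiter.join(pair))
            i += 2          # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical set of accepted bigrams
known = {("net", "sales"), ("gross", "margins")}
tokens = ["increase", "in", "net", "sales", ",", "stable", "overall", "gross", "margins"]

result = merge_bigrams(tokens, known)
print(result)
# ['increase', 'in', 'net_sales', ',', 'stable', 'overall', 'gross_margins']
```

Each word is consumed at most once, so overlapping candidates are resolved in favor of the pair that starts first.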
