# <center>Other NLP Packages: spaCy, Gensim, and Textacy</center>

References: 
- https://nlpforhackers.io/complete-guide-to-spacy/
- https://radimrehurek.com/gensim/models/phrases.html
- https://stanfordnlp.github.io/stanza/

## 1. spaCy
- spaCy is a relatively new but increasingly popular framework for Natural Language Processing in Python
- Provides models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
<img src='https://spacy.io/images/pipeline.svg' width = "60%">
- Supports 8 languages out of the box
- Provides easy and beautiful visualizations
- Provides pretrained word vectors
- installation:
  1. `pip install spacy`
  2. `python -m spacy download en_core_web_sm` (the `en` shortcut was removed in spaCy v3, so use the full model name)

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
pip install spacy


Note: you may need to restart the kernel to use updated packages.


In [None]:
# Exercise 1.1. Load package and language library

import spacy
nlp = spacy.load('en_core_web_sm')

# if you downloaded en_core_web_sm use the following:
#import en_core_web_sm 
#nlp = en_core_web_sm.load()

In [1]:
# Exercise 1.2. Get POS, lemmatization, and other NLP tasks all in one task

doc = nlp("Next week I'll be in Madrid.")
print("text\tlemma\tpunct?\tspace?\tstopword?\tpos\ttag")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t\t{5}\t{6}".format(
        token.text,         # original text
        token.lemma_,       # lemma
        token.is_punct,     # is it punctuation?
        token.is_space,     # is it whitespace?
        token.is_stop,      # is it a stop word?
        token.pos_,         # coarse-grained part-of-speech tag
        token.tag_          # fine-grained part-of-speech tag
    ))

Segment a document into sentences. Note that spaCy's output has a hierarchical structure:
- Top: the `Doc` object returned by `nlp(...)`
- Level 1: Sentences (`doc.sents`)
- Level 2: Tokens of each sentence
- Level 3: A token has a variety of properties, e.g., `tag_`, `pos_`, `is_punct`.


In [11]:
# Exercise 1.3. Segment by sentences. 
# Note doc.sents contains all sentence objects

doc = nlp("These are apples. These are oranges.")
 
for i, sent in enumerate(doc.sents):
    print(i, sent)
    print("text\tlemma\tpunct?\tspace?\tstopword?\tpos\ttag")
    for token in sent:
        print("{0}\t{1}\t{2}\t{3}\t{4}\t\t{5}\t{6}".format(
            token.text,         # original text
            token.lemma_,       # lemma
            token.is_punct,     # is it punctuation?
            token.is_space,     # is it whitespace?
            token.is_stop,      # is it a stop word?
            token.pos_,         # coarse-grained part-of-speech tag
            token.tag_          # fine-grained part-of-speech tag
        ))
    print("\n")

0 These are apples.
text	lemma	punct?	space?	stopword?	pos	tag
These	these	False	False	True		PRON	DT
are	be	False	False	True		AUX	VBP
apples	apple	False	False	False		NOUN	NNS
.	.	True	False	False		PUNCT	.


1 These are oranges.
text	lemma	punct?	space?	stopword?	pos	tag
These	these	False	False	True		PRON	DT
are	be	False	False	True		AUX	VBP
oranges	orange	False	False	False		NOUN	NNS
.	.	True	False	False		PUNCT	.




In [12]:
# Exercise 1.4. Entity Recognition

doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, "\t\t", ent.label_)

2 		 CARDINAL
9 a.m. 		 TIME
30% 		 PERCENT
just 2 days 		 DATE
WSJ 		 ORG


In [13]:
# Exercise 1.5. Visualize named entities

from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
print(doc)
displacy.render(doc, style='ent', jupyter=True)


In [14]:
# Exercise 1.6. Visualize the dependency graph

from spacy import displacy
 
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
 

## 2. Textacy

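Textacy is a Python library for performing a variety of NLP tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to spaCy, textacy focuses on the tasks that come before and after, such as text preprocessing and information extraction.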

For details, check https://textacy.readthedocs.io/en/latest/index.html

In [12]:
# Installation.
#! pip install textacy


from textacy import preprocessing

In [7]:
text = (
     "Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "
     "much Natural Language Processing research has relied heavily on machine learning. "
     "Formerly, many language-processing tasks typically involved the direct hand coding "
     "of rules, which is not in general robust to natural language variation. "
     "The machine-learning paradigm calls instead for using statistical inference "
     "to automatically learn such rules through the analysis of large corpora "
     "of typical real-world examples."
 )

In [8]:
# remove punctuation

from textacy import preprocessing

text_remove_punct = preprocessing.remove.punctuation(text)
print(text_remove_punct)

# collapse the repeated whitespace left behind by punctuation removal
preprocessing.normalize.whitespace(text_remove_punct)

Since the so called  statistical revolution  in the late 1980s and mid 1990s  much Natural Language Processing research has relied heavily on machine learning  Formerly  many language processing tasks typically involved the direct hand coding of rules  which is not in general robust to natural language variation  The machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real world examples 


'Since the so called statistical revolution in the late 1980s and mid 1990s much Natural Language Processing research has relied heavily on machine learning Formerly many language processing tasks typically involved the direct hand coding of rules which is not in general robust to natural language variation The machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real world examples'
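
The output above suggests that `remove.punctuation` replaces each punctuation mark with a space (hence the double spaces), and that `normalize.whitespace` then collapses those runs. A rough standard-library sketch of the same two steps follows. It assumes ASCII punctuation only, and the helper names `strip_punct` and `squeeze_spaces` are hypothetical, not textacy functions.

```python
import re
import string

# Map every ASCII punctuation character to a single space.
# Assumption: ASCII only; textacy also covers Unicode punctuation.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def strip_punct(text):
    """Replace each punctuation character with a space."""
    return text.translate(PUNCT_TO_SPACE)

def squeeze_spaces(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = 'the so-called "statistical revolution", of rules.'
print(squeeze_spaces(strip_punct(sample)))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words such as "so-called" from fusing into one token.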

In [9]:
# make spacy doc
import textacy

# load English language model
en = textacy.load_spacy_lang("en_core_web_sm")

doc = textacy.make_spacy_doc(text, lang=en)

In [10]:
# extract bigrams and trigrams
list(textacy.extract.ngrams(doc, n=(2, 3), filter_stops=True,
                            filter_punct=True, filter_nums=True))

[statistical revolution,
 late 1980s,
 mid 1990s,
 Natural Language,
 Language Processing,
 Processing research,
 relied heavily,
 machine learning,
 processing tasks,
 tasks typically,
 typically involved,
 direct hand,
 hand coding,
 general robust,
 natural language,
 language variation,
 learning paradigm,
 paradigm calls,
 calls instead,
 statistical inference,
 automatically learn,
 large corpora,
 typical real,
 world examples,
 1980s and mid,
 Natural Language Processing,
 Language Processing research,
 research has relied,
 heavily on machine,
 processing tasks typically,
 tasks typically involved,
 involved the direct,
 direct hand coding,
 coding of rules,
 robust to natural,
 natural language variation,
 learning paradigm calls,
 paradigm calls instead,
 inference to automatically,
 learn such rules,
 analysis of large,
 corpora of typical]
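
Note that results such as "heavily on machine" and "research has relied" still contain stop words in the middle. Stop-word filtering appears to drop only n-grams that start or end with a stop word. The sketch below reproduces that behavior on plain token lists. The tiny `STOP_WORDS` set and the `ngrams` helper are hypothetical stand-ins, not textacy's implementation.

```python
# Hypothetical mini stop list for illustration; spaCy ships a much larger one.
STOP_WORDS = {"the", "of", "on", "to", "and", "has", "is", "in"}

def ngrams(tokens, n):
    """Yield n-grams as tuples, skipping any that start or end with a stop word."""
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram[0].lower() in STOP_WORDS or gram[-1].lower() in STOP_WORDS:
            continue
        yield gram

tokens = "research has relied heavily on machine learning".split()
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(bigrams)   # [('relied', 'heavily'), ('machine', 'learning')]
print(trigrams)  # [('research', 'has', 'relied'), ('heavily', 'on', 'machine')]
```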

In [11]:
# Extract key terms

from textacy.extract import keyterms as kt
kt.textrank(doc, normalize="lemma", 
            include_pos =('NOUN', 'PROPN', 'ADJ'),
            window_size = 3,
            topn=10)

[('Natural Language Processing research', 0.05801429969492685),
 ('natural language variation', 0.047242782528362955),
 ('direct hand coding', 0.03817002256131215),
 ('statistical inference', 0.03359364905765383),
 ('machine learning', 0.03326750227139413),
 ('statistical revolution', 0.030629349149736886),
 ('late 1980', 0.026186881678224555),
 ('general robust', 0.025591268483622656),
 ('processing task', 0.025420643340329566),
 ('typical real', 0.025416316515447297)]

In [13]:
# extract key terms 

# If you get an error related to NetworkX, downgrade NetworkX to version 2.5

from textacy.extract import keyterms as kt
kt.textrank(doc, normalize="lemma", topn=10)

[('Natural Language Processing research', 0.059959246697826624),
 ('natural language variation', 0.04488350959275309),
 ('direct hand coding', 0.037736661821063354),
 ('statistical inference', 0.03432557996664981),
 ('statistical revolution', 0.034007535820683756),
 ('machine learning', 0.03305919655573349),
 ('late 1980', 0.026499549123496648),
 ('processing task', 0.0256684200517989),
 ('general robust', 0.024835834233545625),
 ('typical real', 0.02381966456140927)]
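
For intuition, TextRank scores words by running PageRank over a word co-occurrence graph. The sketch below is a simplified, standard-library version: it ranks single words only, and skips the part-of-speech filtering, lemmatization and phrase merging that textacy performs. The function name `textrank_words` and its parameters are assumptions made for this example.

```python
from collections import defaultdict

def textrank_words(tokens, window_size=2, damping=0.85, iterations=100):
    """Score words with PageRank on an undirected co-occurrence graph."""
    # Edge weight = number of times two different words share a window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window_size]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    nodes = list(weights)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(
                scores[nbr] * w / sum(weights[nbr].values())
                for nbr, w in weights[node].items()
            )
            new_scores[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores

tokens = ["language", "model", "language", "rules", "language", "corpus"]
scores = textrank_words(tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this toy input "language" neighbors every other word, so it receives the highest score. A larger window connects more distant words and tends to smooth the ranking.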

## 3. Gensim
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling, including 
  - Word2Vec word embedding 
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - `gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)`
        - `sentences`: list of sentences or iterables, each of which can be a document
        - `min_count`: Ignore all words and bigrams with total collected count lower than this value.
        - `threshold`: Score threshold for forming phrases (higher means fewer phrases). A phrase of words $a$ followed by $b$ is accepted if its score is greater than the threshold. The appropriate value depends heavily on the scoring function.
        - `max_vocab_size`: Maximum size (number of tokens) of the vocabulary. 
        - `delimiter`: Glue character used to join collocation tokens (e.g. `'_'`). Gensim 4.x expects a regular string; earlier versions expected a byte string.
        - `scoring`: Specify how potential phrases are scored. 
           - `default` - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - `npmi` - npmi_scorer().
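
To see why `threshold` depends so heavily on the scorer, it helps to write both formulas out. As described in the Gensim documentation, the default score is `(bigram_count - min_count) / count_a / count_b * vocab_size`, while NPMI is `ln(p(a,b) / (p(a) p(b))) / -ln(p(a,b))`. The sketch below recomputes both on a toy corpus. The functions `original_score` and `npmi_score` are re-implementations written for this example, not calls into Gensim.

```python
import math
from collections import Counter

def original_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov et al. (2013) style score: unbounded, larger means stronger."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def npmi_score(count_a, count_b, count_ab, total_words):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

words = "new york is big and new york is old and york is new".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

a, b = "new", "york"
default_value = original_score(unigrams[a], unigrams[b], bigrams[(a, b)],
                               vocab_size=len(unigrams), min_count=1)
npmi_value = npmi_score(unigrams[a], unigrams[b], bigrams[(a, b)],
                        total_words=len(words))
print(f"default: {default_value:.3f}  npmi: {npmi_value:.3f}")
```

The default score grows with vocabulary size and has no upper bound, so thresholds in the tens are common. NPMI tops out at 1, which it reaches when two words only ever occur together, so its threshold must lie between -1 and 1.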

In [15]:
# Exercise 2.1. Find bigrams using gensim
import nltk
from gensim.models.phrases import Phrases, Phraser

# Requires the corpus: run nltk.download('inaugural') once if it is missing
words = nltk.corpus.inaugural.words()

# Train a phrase model using the default scorer (original_scorer)
phrases = Phrases([words], min_count=2, threshold=50)

# Collect the detected phrases and sort them by score in descending order
items = sorted(phrases.find_phrases([words]).items(), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

Santo_Domingo:	8549.11
specie_payments:	6411.83
Indian_tribes:	6411.83
Social_Security:	6411.83
Founding_Fathers:	6411.83
Abraham_Lincoln:	6411.83
illegal_liquor:	5129.47
merchant_marine:	4808.88
Western_Hemisphere:	4808.88
Supreme_Court:	4274.56
founding_documents:	4274.56
lock_type:	3847.10
Old_World:	3108.77
inland_frontiers:	2747.93
¡_Xand:	2564.73
coordinate_branches:	2355.37
Thomas_Jefferson:	2331.58
eighteenth_amendment:	2137.28
Chief_Magistrate:	1998.49
extra_session:	1972.87
Great_Britain:	1923.55
George_Washington:	1846.61
silent_prayer:	1810.40
faithfully_executed:	1803.33
entangling_alliances:	1748.68
fervent_supplications:	1709.82
nuclear_weapons:	1648.76
distinguished_guests:	1538.84
Civil_War:	1465.56
onward_march:	1424.85
Chief_Justice:	1373.96
plainly_written:	1373.96
middle_class:	1221.30
fifteenth_amendment:	1068.64
earliest_practicable:	961.78
preceding_term:	961.78
fertile_soil:	961.78
walk_humbly:	874.34
World_War:	814.20
tariff_bill:	744.60
regular_session:	739.8

In [16]:
# Exercise 2.2. Find bigrams by NPMI

# Find phrases using NPMI (scores lie in [-1, 1], hence the much smaller threshold)

phrases = Phrases([words], min_count=2, threshold=0.5,
                  scoring='npmi')

# Collect the detected phrases and sort them by score in descending order
items = sorted(phrases.find_phrases([words]).items(), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

Porto_Rico:	1.00
Panama_Canal:	1.00
reverend_clergy:	1.00
Information_Age:	1.00
Santo_Domingo:	1.00
Rocky_Mountains:	1.00
Philippine_Islands:	1.00
Social_Security:	0.97
Founding_Fathers:	0.97
Indian_tribes:	0.97
'_s:	0.96
specie_payments:	0.96
Abraham_Lincoln:	0.96
illegal_liquor:	0.95
merchant_marine:	0.95
Majority_Leader:	0.94
electors_residing:	0.94
Western_Hemisphere:	0.94
founding_documents:	0.94
Old_World:	0.93
sheet_anchor:	0.93
lock_type:	0.93
Supreme_Court:	0.92
cleaner_environment:	0.92
Middle_East:	0.92
secondary_boycott:	0.92
Dingley_Act:	0.92
start_afresh:	0.92
elective_franchise:	0.90
Permanent_Court:	0.90
inland_frontiers:	0.90
Chief_Magistrate:	0.88
United_States:	0.88
200th_anniversary:	0.88
Thomas_Jefferson:	0.88
¡_Xand:	0.88
coordinate_branches:	0.87
Chief_Justice:	0.87
Pacific_Coast:	0.87
exclusive_metallic:	0.87
extra_session:	0.86
eighteenth_amendment:	0.86
Senator_Dole:	0.86
Senator_Mathias:	0.86
temporary_restraining:	0.86
entangling_alliances:	0.85
fugitive_sla

In [17]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Initialize phrase tokenizer
bigram = Phraser(phrases)


#sent = nltk.corpus.inaugural.raw('2009-Obama.txt')

sent = '''That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.'''

print(sent)

print(bigram[sent.split()])

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.
['That', 'we', 'are', 'in', 'the', 'midst', 'of', 'crisis', 'is', 'now', 'well', 'understood.', 'Our', 'nation', 'is', 'at', 'war,', 'against', 'a', 'far-reaching', 'network', 'of', 'violence', 'and', 'hatred.', 'Our', 'economy', 'is', 'badly', 'weakened,', 'a', 'consequence', 'of', 'greed', 'and', 'irresponsibility', 'on', 'the', 'part', 'of', 'some,', 'but', 'also', 'our', 'collective', 'failure', 'to', 'make', 'hard_choices', 'and', 'prepar
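
In the output above, tokens such as `'understood.'` and `'war,'` keep their punctuation because `sent.split()` splits on whitespace only. A token with punctuation attached cannot match a phrase learned from the NLTK-tokenized corpus. The sketch below separates punctuation with a regular expression, then greedily merges adjacent tokens found in a phrase set. The `tokenize` and `merge_bigrams` helpers and the example phrase set are assumptions for illustration, not Gensim's implementation.

```python
import re

def tokenize(text):
    """Split into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def merge_bigrams(tokens, phrases, delimiter="_"):
    """Greedily join adjacent tokens whose pair appears in `phrases`."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Assumed phrase set for illustration; a trained model supplies the real one.
phrases = {("hard", "choices"), ("health", "care")}
text = "Our health care is too costly; we must make hard choices."
print(merge_bigrams(tokenize(text), phrases))
```

With plain `text.split()`, the last token would be `'choices.'`, so the pair `('hard', 'choices.')` would not match and no merge would happen there.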