## AML7_NLP: Practical illustration of typical NLP steps

In [None]:
%load_ext autoreload
%autoreload 2

### Extract text from HTML

The course repo has a subdirectory called `html` which includes some example HTML files. 

In [None]:
%sx ls html/

['article1.html', 'article2.html', 'article3.html', 'article4.html']

Select one of those files to use as an example, and take a look at its HTML content.

In [None]:
file = "html/article1.html"

# use a context manager so the file handle is closed after reading
with open(file, "r") as f:
    print(f.readlines())

['<!DOCTYPE html>\n', '<html lang="en">\n', '<head>\n', " <title>The current state of machine intelligence 3.0 - O'Reilly Media</title>\n", '</head>\n', '<body>\n', '<div id="article-body">\n', '<p>Almost a year ago, we published our now-annual <a href="https://www.oreilly.com/ideas/the-current-state-of-machine-intelligence-2-0">landscape</a> of machine intelligence companies, and goodness have we seen a lot of activity since then. This year\'s landscape has <em>a third more companies</em> than our first one did two years ago, and it feels even more futile to try to be comprehensive, since this just scratches the surface of all of the activity out there.</p>\n', '\n', '<p>As has been the case for the last couple of years, our fund still obsesses over "problem first" machine intelligence -- we\'ve invested in 35 machine intelligence companies solving 35 meaningful problems in areas from security to recruiting to software development. (Our fund focuses on the future of work, so there are

Next, use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract text out of the HTML. Following the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) structure of the HTML document, select the `<div/>` that encloses the article text, then iterate through the `<p/>` paragraphs to extract the text from each.

In [None]:
# !conda install -c conda-forge beautifulsoup4 --yes

In [None]:
# %sx read -p 'y/n?: '

In [None]:
from bs4 import BeautifulSoup

with open(file) as f:
    soup = BeautifulSoup(f, "html.parser")

    for div in soup.find_all("div", id="article-body"):
        for p in div.find_all("p"):
            print(p.get_text())

Almost a year ago, we published our now-annual landscape of machine intelligence companies, and goodness have we seen a lot of activity since then. This year's landscape has a third more companies than our first one did two years ago, and it feels even more futile to try to be comprehensive, since this just scratches the surface of all of the activity out there.
As has been the case for the last couple of years, our fund still obsesses over "problem first" machine intelligence -- we've invested in 35 machine intelligence companies solving 35 meaningful problems in areas from security to recruiting to software development. (Our fund focuses on the future of work, so there are some machine intelligence domains where we invest more than others.)
At the same time, the hype around machine intelligence methods continues to grow: the words "deep learning" now equally represent a series of meaningful breakthroughs (wonderful) but also a hyped phrase like "big data" (not so good!). We care abou

### Concerns about characters

The following shows examples of how to use [codecs](https://docs.python.org/3/library/codecs.html) and [normalize Unicode](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize). NB: the example text comes from the article "[Metal umlaut](https://en.wikipedia.org/wiki/Metal_umlaut)".

In [None]:
x = "Rinôçérôse screams ﬂow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."
type(x)

str

The variable `x` is a *string* in Python:

In [None]:
repr(x)

'"Rinôçérôse screams ﬂow not unlike an encyclopædia, \'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D\' to Spın̈al Tap."'

Its translation into [ASCII](http://www.asciitable.com/) is unusable by parsers:

In [None]:
ascii(x)

'"Rin\\xf4\\xe7\\xe9r\\xf4se screams \\ufb02ow not unlike an encyclop\\xe6dia, \'TECHNICI\\xc4NS \\xd6F SP\\xc5CE SHIP E\\xc5RTH THIS IS Y\\xd6\\xdcR C\\xc4PT\\xc5IN SPE\\xc4KING Y\\xd6\\xdcR \\xd8\\xc5PT\\xc5IN IS DEA\\u030bD\' to Sp\\u0131n\\u0308al Tap."'

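Encoding as [UTF-8](http://unicode.org/faq/utf_bom.html) doesn't help much, since the result is a `bytes` object rather than text (see [text vs. data](https://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit) in the Python 3.0 release notes):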

In [None]:
x.encode('utf8')  

b"Rin\xc3\xb4\xc3\xa7\xc3\xa9r\xc3\xb4se screams \xef\xac\x82ow not unlike an encyclop\xc3\xa6dia, 'TECHNICI\xc3\x84NS \xc3\x96F SP\xc3\x85CE SHIP E\xc3\x85RTH THIS IS Y\xc3\x96\xc3\x9cR C\xc3\x84PT\xc3\x85IN SPE\xc3\x84KING Y\xc3\x96\xc3\x9cR \xc3\x98\xc3\x85PT\xc3\x85IN IS DEA\xcc\x8bD' to Sp\xc4\xb1n\xcc\x88al Tap."

Ignoring difficult characters is perhaps an even worse strategy:

In [None]:
x.encode('ascii', 'ignore')

b"Rinrse screams ow not unlike an encyclopdia, 'TECHNICINS F SPCE SHIP ERTH THIS IS YR CPTIN SPEKING YR PTIN IS DEAD' to Spnal Tap."
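For a side-by-side comparison, the sketch below runs the standard `errors` handlers of `str.encode` on one short word. The sample string is an assumption chosen for illustration; the handler names are the ones documented for Python's codecs.

```python
# Compare the built-in error handlers of str.encode on a non-ASCII word.
sample = "caf\u00e9"  # hypothetical sample text: "cafe" with an acute accent

for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    # each handler decides what to emit for characters ASCII cannot represent
    print("%-18s %r" % (handler, sample.encode("ascii", handler)))
```

Only `ignore` silently loses information; the other handlers at least leave a visible marker where a character could not be encoded.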

However, one can *normalize* then encode…

In [None]:
import unicodedata

unicodedata.normalize('NFKD', x).encode('ascii','ignore')

b"Rinocerose screams flow not unlike an encyclopdia, 'TECHNICIANS OF SPACE SHIP EARTH THIS IS YOUR CAPTAIN SPEAKING YOUR APTAIN IS DEAD' to Spnal Tap."
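To see why `ﬂow` becomes `flow` while the `æ` and `Ø` characters still disappear, it may help to look at what NFKD does to individual characters. This is a small sketch; the helper name `describe` is hypothetical and the four characters are picked from the example string above.

```python
import unicodedata

def describe(ch):
    """Return the Unicode names of the code points in the NFKD form of ch."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return [unicodedata.name(c) for c in decomposed]

# o-circumflex, fl ligature, ae ligature, O with stroke
for ch in ["\u00f4", "\ufb02", "\u00e6", "\u00d8"]:
    print(repr(ch), "->", describe(ch))
```

Characters that decompose into an ASCII base letter plus a combining mark survive the `encode('ascii', 'ignore')` step as the base letter. Characters with no decomposition at all are dropped entirely.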

Even before this normalization and encoding, you may need to convert some characters explicitly **before** parsing. For example:

In [None]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"
ascii(x)

"'The sky \\u201cabove\\u201d the port \\u2026 was the color of \\u2018cable television\\u2019 \\u2013 tuned to the Weather Channel\\xae'"

Consider the results for that line:

In [None]:
unicodedata.normalize('NFKD', x).encode('ascii', 'ignore')

b'The sky above the port ... was the color of cable television  tuned to the Weather Channel'

...which still drops characters that may be important for parsing a sentence.

So an alternative would be:

In [None]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')
x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('ascii')
print(x)

The sky "above" the port ... was the color of 'cable television' - tuned to the Weather Channel
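The chained `replace` calls can also be folded into one reusable function. The following is a minimal sketch, assuming the same handful of punctuation characters as above; the names `PUNCT_MAP` and `to_ascii` are hypothetical and not part of the course library.

```python
import unicodedata

# Map typographic punctuation to ASCII equivalents before normalizing.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2026": "...", # horizontal ellipsis
    "\u2013": "-",   # en dash
})

def to_ascii(text):
    """Replace known punctuation, then normalize and drop what remains non-ASCII."""
    text = text.translate(PUNCT_MAP)
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(to_ascii("The sky \u201cabove\u201d the port \u2026 \u2013 tuned to the Weather Channel\u00ae"))
```

A translation table makes a single pass over the string, and new characters can be added to the mapping without touching the function body.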


### Statistical parsing

NLP used to focus mostly on transformational grammars and related linguistic theory (Chomsky and others).

Machine learning techniques allow a much simpler approach, called statistical parsing.

<font color='blue'>Probabilistic methods</font> split texts into sentences, annotate words with part-of-speech, chunk noun phrases, resolve named entities, estimate sentiment scores, etc.

Let's start with a simple paragraph:

In [None]:
text = """
Increasingly, customers send text to interact or leave comments, 
which provides a wealth of data for text mining.  That’s a great 
starting point for developing custom search, content recommenders, 
and even AI applications.
"""
repr(text)

"'\\nIncreasingly, customers send text to interact or leave comments, \\nwhich provides a wealth of data for text mining.  That’s a great \\nstarting point for developing custom search, content recommenders, \\nand even AI applications.\\n'"

Notice how there are explicit *line breaks* in the text. Let's write some code to flow the paragraph without any line breaks:

In [None]:
text = " ".join(line.strip() for line in text.split("\n")).strip()
repr(text)

"'Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining.  That’s a great starting point for developing custom search, content recommenders, and even AI applications.'"

#### We’ll use a few popular NLP resources for parsing text:

* [spaCy](https://spacy.io/)  – one of the top NLP libraries in Python 

* [TextBlob](http://textblob.readthedocs.io/) – a Python library that provides a consistent API for leveraging other resources

* [WordNet](https://wordnet.princeton.edu/) – think of it as somewhere between a large thesaurus and a database

One important step is to annotate each word in a sentence with a tag that describes its part of speech: noun, verb, adjective, determiner, adverb, etc.


#### Splitting sentences

Now we can use **spaCy** to **split** the paragraph into sentences:

In [None]:
# !conda install -c conda-forge spacy --yes

# pip install -U spacy
# python -m spacy download en_core_web_sm

In [None]:
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm") 

In [None]:
doc = nlp(text)

for span in doc.sents:
    print("> ", span)

>  Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining.  
>  That’s a great starting point for developing custom search, content recommenders, and even AI applications.
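For contrast, here is a naive rule-based splitter. It is a sketch under the assumption that a sentence ends at `.`, `!` or `?` followed by whitespace; `naive_split` is a hypothetical helper, not something spaCy provides.

```python
import re

def naive_split(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# works on simple prose...
print(naive_split("Customers send text.  That is a great starting point."))

# ...but breaks on abbreviations, which is where a trained model helps
print(naive_split("Dr. Smith reviewed the comments. He approved them."))
```

The second call splits after `Dr.`, which illustrates why sentence boundaries are usually left to a statistical model rather than a fixed rule.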


#### PoS annotation

Next we take a sentence and **annotate** it with part-of-speech (PoS) tags:

In [None]:
for span in doc.sents:
    for i in range(span.start, span.end):
        token = doc[i]
        print(i, token.text, token.tag_, token.pos_)

0 Increasingly RB ADV
1 , , PUNCT
2 customers NNS NOUN
3 send VBP VERB
4 text NN NOUN
5 to IN ADP
6 interact VB VERB
7 or CC CCONJ
8 leave VB VERB
9 comments NNS NOUN
10 , , PUNCT
11 which WDT DET
12 provides VBZ VERB
13 a DT DET
14 wealth NN NOUN
15 of IN ADP
16 data NNS NOUN
17 for IN ADP
18 text NN NOUN
19 mining NN NOUN
20 . . PUNCT
21   _SP SPACE
22 That DT DET
23 ’s VBZ VERB
24 a DT DET
25 great JJ ADJ
26 starting NN NOUN
27 point NN NOUN
28 for IN ADP
29 developing VBG VERB
30 custom NN NOUN
31 search NN NOUN
32 , , PUNCT
33 content NN NOUN
34 recommenders NNS NOUN
35 , , PUNCT
36 and CC CCONJ
37 even RB ADV
38 AI NNP PROPN
39 applications NNS NOUN
40 . . PUNCT


Given these annotations for part-of-speech tags, we can <font color='blue'>**lemmatize**</font> nouns and verbs to get their root forms. This will also singularize the plural nouns:

In [None]:
for span in doc.sents:
    for i in range(span.start, span.end):
        token = doc[i]
        print(i, token.text, token.tag_, token.pos_, token.lemma_)

0 Increasingly RB ADV increasingly
1 , , PUNCT ,
2 customers NNS NOUN customer
3 send VBP VERB send
4 text NN NOUN text
5 to IN ADP to
6 interact VB VERB interact
7 or CC CCONJ or
8 leave VB VERB leave
9 comments NNS NOUN comment
10 , , PUNCT ,
11 which WDT DET which
12 provides VBZ VERB provide
13 a DT DET a
14 wealth NN NOUN wealth
15 of IN ADP of
16 data NNS NOUN datum
17 for IN ADP for
18 text NN NOUN text
19 mining NN NOUN mining
20 . . PUNCT .
21   _SP SPACE  
22 That DT DET that
23 ’s VBZ VERB ’
24 a DT DET a
25 great JJ ADJ great
26 starting NN NOUN starting
27 point NN NOUN point
28 for IN ADP for
29 developing VBG VERB develop
30 custom NN NOUN custom
31 search NN NOUN search
32 , , PUNCT ,
33 content NN NOUN content
34 recommenders NNS NOUN recommender
35 , , PUNCT ,
36 and CC CCONJ and
37 even RB ADV even
38 AI NNP PROPN AI
39 applications NNS NOUN application
40 . . PUNCT .


In [None]:
### Lemmatization - for other languages 

# https://towardsdatascience.com/state-of-the-art-multilingual-lemmatization-f303e8ff1a8

We can also look up synonyms and definitions for each word, using **synsets** from [WordNet](https://wordnet.princeton.edu/).

Keep in mind that `spaCy` is designed to be an **opinionated** API, and it omits support for much of what `WordNet` offers. However, we can use [TextBlob](http://textblob.readthedocs.io/) instead:

In [None]:
# ! conda install -c conda-forge textblob --yes

In [None]:
from textblob import Word

w = Word("comments")

for synset, definition in zip(w.get_synsets(), w.define()):
    print(synset, definition)

Synset('remark.n.01') a statement that expresses a personal opinion or belief or adds information
Synset('comment.n.02') a written explanation or criticism or illustration that is added to a book or other textual material
Synset('gossip.n.02') a report (often malicious) about the behavior of other people
Synset('comment.v.01') make or write a comment on
Synset('comment.v.02') explain or interpret something
Synset('gloss.v.02') provide interlinear explanations for words or phrases


### Noun phrase chunking

Sometimes it's useful to use **noun phrase chunking** to extract key phrases…

In [None]:
text = "That's a great starting point for developing custom search, content recommenders, and even AI applications."
doc = nlp(text)

repr(doc)

"That's a great starting point for developing custom search, content recommenders, and even AI applications."

First let's look at the individual **keywords**:

In [None]:
for token in doc:
    print(token)

That
's
a
great
starting
point
for
developing
custom
search
,
content
recommenders
,
and
even
AI
applications
.


Contrast those results with **noun phrases**:

In [None]:
for np in doc.noun_chunks:
    print(np)

a great starting point
custom search
content recommenders
even AI applications


There's definitely more information in the key phrase `custom search` than there is in the individual keywords `custom` and `search`.

In [None]:
from textblob import TextBlob

blob = TextBlob(text)
print(blob.tags)

[('That', 'DT'), ("'s", 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('starting', 'JJ'), ('point', 'NN'), ('for', 'IN'), ('developing', 'VBG'), ('custom', 'JJ'), ('search', 'NN'), ('content', 'NN'), ('recommenders', 'NNS'), ('and', 'CC'), ('even', 'RB'), ('AI', 'NNP'), ('applications', 'NNS')]


In [None]:
print(blob.noun_phrases)

['custom search', 'content recommenders', 'ai']


### Named entity recognition (NER)

Often you want to identify the proper noun phrases within a text. For that we can use *named-entity recognition* ([NER](https://spacy.io/docs/usage/entity-recognition)):

In [None]:
# import spacy
# nlp = spacy.load("en")

text = "He'd been trained by the best, McCoy Pauley and Bobby Quine, legends in Memphis, and now Chiba City as well."
doc = nlp(text)

repr(doc)

"He'd been trained by the best, McCoy Pauley and Bobby Quine, legends in Memphis, and now Chiba City as well."

Clearly these entities enrich the key phrases extracted from a text:

In [None]:
for entity in doc.ents:
    print(entity.label_, entity.text)

PERSON McCoy Pauley
PERSON Bobby Quine
GPE Memphis
GPE Chiba City


### Store annotated text as JSON files

Much of the preceding NLP code has been worked into a small library, and we'll call functions from that library to help keep these notebooks more readable. Take a look at the source code in `pynlp.py`, and an example usage:

In [None]:
import pynlp

html_file = "html/article1.html"
json_file = "a1.json"

pynlp.full_parse(html_file, json_file)

That extracts text from HTML in the first article, then stores the parsed and annotated text as JSON, one line per sentence. Let's look at the first two sentences:

In [None]:
%sx more a1.json

['[["Almost", "almost", "RB"], ["a", "a", "DT"], ["year", "year", "NN"], ["ago", "ago", "RB"], [",", ",", "."], ["we", "we", "PRP"], ["published", "publish", "VBD"], ["our", "our", "PRP$"], ["now", "now", "RB"], ["-", "-", "."], ["annual", "annual", "JJ"], ["landscape", "landscape", "NN"], ["of", "of", "IN"], ["machine", "machine", "NN"], ["intelligence", "intelligence", "NN"], ["companies", "company", "NNS"], [",", ",", "."], ["and", "and", "CC"], ["goodness", "goodness", "NN"], ["have", "have", "VBP"], ["we", "we", "PRP"], ["seen", "see", "VBN"], ["a", "a", "DT"], ["lot", "lot", "NN"], ["of", "of", "IN"], ["activity", "activity", "NN"], ["since", "since", "IN"], ["then", "then", "RB"], [".", ".", "."]]',
 '[["This", "this", "DT"], ["year", "year", "NN"], ["\'s", "\'s", "POS"], ["landscape", "landscape", "NN"], ["has", "has", "VBZ"], ["a", "a", "DT"], ["third", "third", "RB"], ["more", "more", "JJR"], ["companies", "company", "NNS"], ["than", "than", "IN"], ["our", "our", "PRP$"], ["f

Extract/parse/save-to-JSON for each of the example HTML files:

In [None]:
html_file = "html/article2.html"
json_file = "a2.json"

pynlp.full_parse(html_file, json_file)

In [None]:
html_file = "html/article3.html"
json_file = "a3.json"

pynlp.full_parse(html_file, json_file)

In [None]:
html_file = "html/article4.html"
json_file = "a4.json"

pynlp.full_parse(html_file, json_file)

In [None]:
%sx more a4.json

['[["Learning", "learning", "NN"], ["is", "is", "VBZ"], ["n\'t", "n\'t", "RB"], ["a", "a", "DT"], ["one", "one", "CD"], ["-", "-", "."], ["shot", "shot", "NN"], ["process", "process", "NN"], [":", ":", "."], ["take", "take", "VB"], ["the", "the", "DT"], ["course", "course", "NN"], [",", ",", "."], ["pass", "pass", "VB"], ["an", "an", "DT"], ["exam", "exam", "NN"], [",", ",", "."], ["and", "and", "CC"], ["get", "get", "VB"], ["out", "out", "RP"], [".", ".", "."]]',
 '[["You", "you", "PRP"], ["learn", "learn", "VBP"], ["by", "by", "IN"], ["interacting", "interact", "VBG"], ["with", "with", "IN"], ["instructors", "instructor", "NNS"], ["and", "and", "CC"], ["students", "student", "NNS"], [",", ",", "."], ["by", "by", "IN"], ["assessing", "assess", "VBG"], ["your", "your", "PRP$"], ["progress", "progress", "NN"], [",", ",", "."], ["and", "and", "CC"], ["using", "use", "VBG"], ["that", "that", "DT"], ["to", "to", "TO"], ["plan", "plan", "VB"], ["your", "your", "PRP$"], ["next", "next", "JJ"

### Example: TF-IDF

Here we use the results from our parsing to calculate a <font color='blue'>term frequency - inverse document frequency (TF-IDF)</font> metric and construct **feature vectors** per document. First we'll load a **stopword** list of common words to exclude from the analysis:

In [None]:
import pynlp

stopwords = pynlp.load_stopwords("stop.txt")
print(stopwords)

{'i', 'for', 'about', 'my', 'this', 'some', 'their', 'we', 'handle', 'you', 'since', 'the', 'which', 'come', 'of', 'or', 'one', 'all', 'up', 'find', 'that', 'an', 'each', 'as', 'try', 'now', 'where', 'see', 'such', 'other', 'feel', 'a', 'have', 'it', 'like', 'around', 'next', 'what', 'much', 'with', 'on', 'get', 'not', 'take', 'how', 'at', 'us', 'go', 'but', 'more', 'use', 'few', 'both', "n't", 'by', 'its', 'our', 'there', 'who', 'two', 'write', 'over', 'can', 'when', 'be', 'let', 'out', 'your', 'they', 'in', 'same', 'if', 'and', 'new', 'just', 'to', 'want', 'so', 'then', 'than', 'while', 'from', 'do', 'also'}


Next, we'll use a function from our `pynlp` library to iterate through the keywords for one of the parsed HTML documents:

In [None]:
%sx ls *.json

['a1.json', 'a2.json', 'a3.json', 'a4.json']

In [None]:
json_file = "a1.json"

for lex in pynlp.lex_iter(json_file):
    print(lex)

WordNode(raw='Almost', root='almost', pos='RB')
WordNode(raw='a', root='a', pos='DT')
WordNode(raw='year', root='year', pos='NN')
WordNode(raw='ago', root='ago', pos='RB')
WordNode(raw=',', root=',', pos='.')
WordNode(raw='we', root='we', pos='PRP')
WordNode(raw='published', root='publish', pos='VBD')
WordNode(raw='our', root='our', pos='PRP$')
WordNode(raw='now', root='now', pos='RB')
WordNode(raw='-', root='-', pos='.')
WordNode(raw='annual', root='annual', pos='JJ')
WordNode(raw='landscape', root='landscape', pos='NN')
WordNode(raw='of', root='of', pos='IN')
WordNode(raw='machine', root='machine', pos='NN')
WordNode(raw='intelligence', root='intelligence', pos='NN')
WordNode(raw='companies', root='company', pos='NNS')
WordNode(raw=',', root=',', pos='.')
WordNode(raw='and', root='and', pos='CC')
WordNode(raw='goodness', root='goodness', pos='NN')
WordNode(raw='have', root='have', pos='VBP')
WordNode(raw='we', root='we', pos='PRP')
WordNode(raw='seen', root='see', pos='VBN')
WordNode
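The source of `pynlp.py` is not shown here, so the following is only a sketch of how an iterator like `lex_iter` could work. It assumes the format visible in the output above: one JSON array per line, each element a `[raw, root, pos]` triple. The name `lex_iter_sketch` is hypothetical, and it takes an iterable of lines rather than a file name to stay self-contained.

```python
import json
from collections import namedtuple

# Field names match the WordNode output printed above.
WordNode = namedtuple("WordNode", "raw, root, pos")

def lex_iter_sketch(lines):
    """Yield one WordNode per token from JSON lines of [raw, root, pos] triples."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for raw, root, pos in json.loads(line):
            yield WordNode(raw, root, pos)

sample_lines = ['[["Almost", "almost", "RB"], ["a", "a", "DT"], ["year", "year", "NN"]]']

for lex in lex_iter_sketch(sample_lines):
    print(lex)
```

To read a real file with this sketch, one could pass an open file object, since iterating over it yields lines.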

We need to initialize some data structures for counting keywords. BTW, if you've heard about how Big Data projects use [word count](http://spark.apache.org/examples.html) programs to demonstrate their capabilities, here's a major use case for that. 

Even so, our examples are conceptually simple, built for relatively small files, and are not intended to scale:

In [None]:
from collections import defaultdict

files = ["a4.json", "a3.json", "a2.json", "a1.json"]
files_tf = {}

d = len(files)
df = defaultdict(int)

Iterate through each parsed file, tallying counts for `tf` for each document while also tallying counts for `df` across all documents:

In [None]:
for json_file in files:
    tf = defaultdict(int)

    for lex in pynlp.lex_iter(json_file):
        if (lex.pos != ".") and (lex.root not in stopwords):
            tf[lex.root] += 1

    files_tf[json_file] = tf

    for word in tf.keys():
        df[word] += 1

## print results for just the last file in the sequence
print(json_file, files_tf[json_file])

a1.json defaultdict(<class 'int'>, {'almost': 2, 'year': 5, 'ago': 2, 'publish': 1, 'annual': 1, 'landscape': 2, 'machine': 6, 'intelligence': 6, 'company': 4, 'goodness': 1, 'lot': 1, 'activity': 2, "'s": 2, 'has': 2, 'third': 1, 'first': 2, 'did': 1, 'even': 1, 'futile': 1, 'comprehensive': 1, 'scratch': 1, 'surface': 1, 'been': 1, 'case': 1, 'last': 2, 'couple': 1, 'fund': 2, 'still': 1, 'obsess': 1, 'problem': 3, "'ve": 1, 'invest': 2, '35': 2, 'solve': 2, 'meaningful': 2, 'area': 1, 'security': 1, 'recruit': 1, 'software': 1, 'development': 1, 'focus': 1, 'future': 1, 'work': 1, 'are': 2, 'domain': 1, 'time': 1, 'hype': 2, 'method': 2, 'continue': 1, 'grow': 1, 'word': 1, 'deep': 1, 'learning': 1, 'equally': 1, 'represent': 1, 'series': 1, 'breakthrough': 1, 'wonderful': 1, 'phrase': 1, 'big': 1, 'datum': 1, 'good': 1, 'care': 1, 'whether': 1, 'founder': 2, 'right': 1, 'fanciest': 1, 'favor': 1, 'those': 1, 'apply': 1, 'technology': 1, 'thoughtfully': 1, 'biggest': 1, 'change': 1,
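The same tallying pattern can be written with `collections.Counter`. This is a sketch on toy data: the document names and word lists below are invented for illustration, and they stand in for the filtered roots produced by `pynlp.lex_iter`.

```python
from collections import Counter

# Hypothetical pre-filtered keyword lists, one per document.
docs = {
    "doc_a": ["machine", "intelligence", "machine", "company"],
    "doc_b": ["learning", "company", "student"],
}

# term frequency: how often each word occurs within one document
tf_by_doc = {name: Counter(words) for name, words in docs.items()}

# document frequency: in how many documents each word occurs at least once
df_counts = Counter()
for tf_counts in tf_by_doc.values():
    df_counts.update(tf_counts.keys())

print(tf_by_doc["doc_a"])
print(df_counts)
```

Updating `df_counts` with the keys of each per-document counter adds exactly one per document, regardless of how many times the word appears in it.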

Let's take a look at the `df` results overall. If there are low-information common words in the list that you'd like to filter out, move them to your **stopword** list.

In [None]:
for word, count in sorted(df.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)

learning 4
's 4
has 4
learn 3
been 3
was 3
year 3
time 3
machine 3
is 2
interact 2
student 2
step 2
feedback 2
whether 2
o'reilly 2
company 2
publish 2
book 2
were 2
provide 2
've 2
past 2
training 2
apply 2
people 2
start 2
attendance 2
early 2
deep 2
problem 2
post 2
become 2
thing 2
local 2
are 2
many 2
exist 2
programming 2
solve 2
biggest 2
open 2
first 2
datum 2
work 2
shot 1
process 1
course 1
pass 1
exam 1
instructor 1
assess 1
progress 1
plan 1
ongoing 1
loop 1
involve 1
everyone 1
classroom 1
virtual 1
physical 1
media 1
always 1
begin 1
mid-'80s 1
editorial 1
guidance 1
author 1
friend 1
look 1
reader 1
shoulder 1
wise 1
experienced 1
advice 1
2016 1
're 1
long 1
instructional 1
video 1
conference 1
introduce 1
live 1
online 1
addition 1
person 1
location 1
standard 1
expert 1
seasoned 1
realize 1
way 1
experience 1
analyze 1
participate 1
group 1
50 1
% 1
attend 1
team 1
People 1
hang 1
together 1
during 1
break 1
knowledge 1
particular 1
situation 1
bring 1
back 1
co 1
- 1

Finally, we make a second pass through the data, using the `df` counts to normalize `tf` counts, calculating the `tfidf` metrics for each keyword:

In [None]:
import math

for json_file in files:
    tf = files_tf[json_file]
    keywords = []

    for word, count in tf.items():
        # Add-one smoothing: 1 is added to both d and df[word] inside the log;
        # a word that occurs in every document gets log(1) = 0
        tfidf = float(count) * math.log((d + 1.0) / (df[word] + 1.0))
        keywords.append((json_file, tfidf, word,))

Let's take a look at the results for one of the files:

In [None]:
for json_file, tfidf, word in sorted(keywords, key=lambda x: x[1], reverse=True):
    print("%s\t%7.4f\t%s" % (json_file, tfidf, word))

a1.json	 5.4977	intelligence
a1.json	 2.0433	company
a1.json	 1.8326	almost
a1.json	 1.8326	ago
a1.json	 1.8326	landscape
a1.json	 1.8326	activity
a1.json	 1.8326	last
a1.json	 1.8326	fund
a1.json	 1.8326	invest
a1.json	 1.8326	35
a1.json	 1.8326	meaningful
a1.json	 1.8326	hype
a1.json	 1.8326	method
a1.json	 1.8326	founder
a1.json	 1.8326	mix
a1.json	 1.8326	hear
a1.json	 1.5325	problem
a1.json	 1.3389	machine
a1.json	 1.1157	year
a1.json	 1.0217	first
a1.json	 1.0217	solve
a1.json	 1.0217	are
a1.json	 0.9163	annual
a1.json	 0.9163	goodness
a1.json	 0.9163	lot
a1.json	 0.9163	third
a1.json	 0.9163	did
a1.json	 0.9163	even
a1.json	 0.9163	futile
a1.json	 0.9163	comprehensive
a1.json	 0.9163	scratch
a1.json	 0.9163	surface
a1.json	 0.9163	case
a1.json	 0.9163	couple
a1.json	 0.9163	still
a1.json	 0.9163	obsess
a1.json	 0.9163	area
a1.json	 0.9163	security
a1.json	 0.9163	recruit
a1.json	 0.9163	software
a1.json	 0.9163	development
a1.json	 0.9163	focus
a1.json	 0.9163	future
a1.json	 0.
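As a worked check of the formula used in the loop, the sketch below recomputes three of the scores for `a1.json`. The `tf` and `df` counts are read from the outputs shown earlier, except the `df` of 1 for `intelligence`, which is inferred from its score. The helper name `smoothed_tfidf` is hypothetical.

```python
import math

def smoothed_tfidf(count, num_docs, doc_freq):
    """tf * log((d + 1) / (df + 1)), the same expression as in the loop."""
    return count * math.log((num_docs + 1.0) / (doc_freq + 1.0))

d = 4  # number of parsed documents

# (word, tf in a1.json, df across documents)
for word, tf_count, df_count in [("intelligence", 6, 1), ("company", 4, 2), ("machine", 6, 3)]:
    print("%7.4f\t%s" % (smoothed_tfidf(tf_count, d, df_count), word))

# a word found in all four documents scores zero, whatever its tf
print("%7.4f\t%s" % (smoothed_tfidf(1, d, 4), "learning"))
```

Note how `machine` and `intelligence` have the same `tf`, yet `machine` ranks much lower because it appears in three of the four documents.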

Question: how does that vector of ranked keywords compare with your reading of the text from the HTML file?

### Example: semantic similarity

Cosine similarity between 2 vectors $a, b$ is easy to define:

$$\mathrm{sim}_{\cos}(a,b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$
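A direct translation of that definition into plain Python might look like the sketch below. It assumes dense vectors given as equal-length lists of numbers, and it returns 0.0 for a zero vector, which is a convention chosen here rather than part of the definition.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
```

Because the norms cancel out vector length, two documents with the same keyword proportions score as identical even if one is much longer.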

Another similarity measure is Jaccard similarity: the size of the intersection of two sets divided by the size of their union.

We can speed up similarity computations by using <font color='blue'>MinHash</font> as an approximation of <font color='blue'>Jaccard similarity</font>. We can create a function to calculate a <font color='blue'>MinHash</font> using the `datasketch` package:

In [None]:
### ! conda install -c conda-forge datasketch --yes
# ! pip install datasketch

In [None]:
from datasketch import MinHash

def mh_digest(data):
    mh = MinHash(num_perm=512)

    for d in data:
        mh.update(d.encode('utf8'))

    return mh
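To give some intuition for what a MinHash estimates, here is a from-scratch sketch using only the standard library. It is not how `datasketch` is implemented; it assumes one salted hash function per "permutation", non-empty input sets, and invented toy data. The function names are hypothetical.

```python
import hashlib

def _hash(item, seed):
    """64-bit hash of item, salted with seed to simulate a distinct permutation."""
    digest = hashlib.blake2b(item.encode("utf8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, num_perm=256):
    """For each salted hash function, keep the minimum hash over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

set_a = {"w%d" % i for i in range(0, 100)}
set_b = {"w%d" % i for i in range(50, 150)}

exact = len(set_a & set_b) / len(set_a | set_b)
approx = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print("exact %.4f, estimate %.4f" % (exact, approx))
```

The probability that two sets share the same minimum under a random hash equals their Jaccard similarity, so the fraction of matching positions approximates it. The signatures have a fixed length, which is what makes comparisons cheap for large sets.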

Then we'll iterate through each parsed document, adding the keywords to the MinHash:

In [None]:
import pynlp

files = ["a4.json", "a3.json", "a2.json", "a1.json"]

stopwords = pynlp.load_stopwords("stop.txt")
files_set = {}
files_mh = {}

for json_file in files:
    keywords = set()

    for lex in pynlp.lex_iter(json_file):
        if (lex.pos != ".") and (lex.root not in stopwords):
            keywords.add(lex.root)

    files_set[json_file] = keywords
    files_mh[json_file] = mh_digest(keywords)

    print(json_file, keywords)

a4.json {'learning', 'book', 'classroom', 'group', 'instructor', 'experienced', 'year', 'People', 'situation', 'co', 'progress', 'experience', "mid-'80s", 'is', '2016', 'editorial', 'person', 'always', 'exam', "'re", 'student', 'participate', 'early', 'involve', 'ongoing', 'course', "o'reilly", 'were', 'standard', 'hang', 'was', "'ve", 'wise', 'video', 'been', 'media', 'plan', 'feedback', 'conference', '-', 'attendance', 'together', 'way', 'step', 'whether', 'during', 'author', 'has', 'team', 'analyze', 'training', 'physical', 'learn', 'interact', 'attend', 'everyone', 'past', 'provide', 'location', 'particular', 'assess', 'realize', 'pass', 'process', 'apply', 'people', 'bring', 'individual', 'shoulder', 'look', 'virtual', 'online', '%', 'break', 'worker', 'shot', 'loop', 'seasoned', '50', 'introduce', 'long', 'company', 'live', 'friend', 'expert', 'sum', "'s", 'start', 'publish', 'begin', 'instructional', 'knowledge', 'reader', 'addition', 'back', 'guidance', 'advice'}
a3.json {'lear

Let's compare the HTML documents, using a pairwise <font color='blue'>MinHash</font>:

In [None]:
import itertools

sim = []

for i1, i2 in itertools.combinations(range(len(files)), 2):
    j = files_mh[files[i1]].jaccard(files_mh[files[i2]])
    sim.append((j, files[i1], files[i2],))

for jaccard, file1, file2 in sorted(sim, key=lambda x: x[0], reverse=True):
    print("%0.4f\t%s\t%s" % (jaccard, file1, file2))

0.0762	a4.json	a2.json
0.0723	a4.json	a1.json
0.0684	a4.json	a3.json
0.0664	a3.json	a2.json
0.0566	a3.json	a1.json
0.0410	a2.json	a1.json
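For files this small, the exact Jaccard similarity is also cheap to compute directly from the keyword sets. The sketch below mirrors the pairwise loop on invented toy sets; `jaccard` and `toy_sets` are hypothetical names, and in the notebook one could pass the sets stored in `files_set` instead.

```python
import itertools

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical keyword sets standing in for files_set.
toy_sets = {
    "doc_a": {"learning", "student", "book", "feedback"},
    "doc_b": {"learning", "student", "company"},
    "doc_c": {"machine", "intelligence", "company"},
}

pairs = [
    (jaccard(toy_sets[n1], toy_sets[n2]), n1, n2)
    for n1, n2 in itertools.combinations(sorted(toy_sets), 2)
]

for score, n1, n2 in sorted(pairs, reverse=True):
    print("%0.4f\t%s\t%s" % (score, n1, n2))
```

Comparing exact values against the MinHash estimates is a quick way to judge how much error the chosen `num_perm` introduces.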


Note the top-ranked ("most similar") pair, where both `html/article2.html` and `html/article4.html` are about learning. Take a look at their overlapping keywords:

In [None]:
files_set["a4.json"] & files_set["a2.json"]

{"'s",
 'attendance',
 'book',
 'early',
 'feedback',
 'has',
 'interact',
 'learn',
 'learning',
 "o'reilly",
 'provide',
 'start',
 'student',
 'was',
 'were'}

In [None]:
# 2nd most similar
files_set["a4.json"] & files_set["a1.json"]

{"'s",
 "'ve",
 'apply',
 'been',
 'company',
 'has',
 'learning',
 'people',
 'publish',
 'whether',
 'year'}

### Example: Word2Vec

The data file `terms.tsv` has 10K elements, which is a subset from a **much** larger file.
This represents the keyphrases from 843 unique documents.
Realistically, you want many more documents in a *Word2Vec* model before the results begin to make a lot of sense.

Even so, this is enough to show how to call the functions from [gensim](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
import csv
import gensim
import logging
import sys

model_file = "model.dat"
term_path = "terms.tsv"

Load the parsed keyphrases into a list called `sentences`, where each "sentence" is the list of keyphrases from one document.

In [None]:
sentences = []
sent = []
last_doc = None

with open(term_path) as f:
    for term, doc, rank in csv.reader(f, delimiter="\t"):
        rank = float(rank)

        if doc != last_doc:
            if last_doc:
                sentences.append(sent)
                sent = []

            last_doc = doc

        sent.append(term)

    # handle the dangling last element
    sentences.append(sent)

print(len(sentences))

843
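The same grouping can be expressed with `itertools.groupby`, which also collects consecutive rows that share a document id. This is a sketch on an in-memory sample; the three-column layout (term, doc, rank) is taken from the loop above, while the sample rows themselves are invented.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for terms.tsv: term <TAB> doc <TAB> rank
sample_tsv = "market\tdoc1\t0.9\nhotel\tdoc1\t0.5\ndata\tdoc2\t0.8\n"

rows = csv.reader(io.StringIO(sample_tsv), delimiter="\t")

# group consecutive rows by the doc column, keeping only the term
sentences_sketch = [
    [term for term, _doc, _rank in group]
    for _doc_id, group in groupby(rows, key=itemgetter(1))
]

print(sentences_sketch)
```

Like the explicit loop, this relies on the rows for one document being adjacent in the file; it avoids the separate step for the dangling last element.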


Set up logging (which `gensim` uses to report progress), then train `word2vec` on the list of "sentences" and save the model to the `model.dat` file.

In [None]:
FORMAT = "%(asctime)s : %(levelname)s : %(message)s"
logging.basicConfig(format=FORMAT, level=logging.ERROR)

# NB: in gensim >= 4.0 the `size` parameter is named `vector_size`
model_new = gensim.models.Word2Vec(sentences, min_count=1, size=200)
model_new.save(model_file)

In [None]:
model_new.save?  # show the docstring; the model itself was saved to `model_file` in the working directory

Signature: model_new.save(*args, **kwargs)
Docstring:
Save the model.
This saved model can be loaded again using :func:`~gensim.models.word2vec.Word2Vec.load`, which supports
online training and getting vectors for vocabulary words.

Parameters
----------
fname : str
    Path to the file.
File:      ~/anaconda3/envs/AML_TF/lib/python3.7/site-packages/gensim/models/word2vec.py
Type:      method


In [None]:
# gensim.models.Word2Vec?

If you need to load a trained model, use:
`model = gensim.models.Word2Vec.load(model_file)`

In [None]:
%sx ls -lth model.dat terms.tsv # model_new.dat

['-rw-r--r--  1 iordan  staff   3.0M Mar 17 15:48 model.dat',
 '-rwxr-xr-x@ 1 iordan  staff   371K Nov 15  2017 terms.tsv']

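Here's a helper function, which queries the resulting model for "neighbor" keyphrases:

In [None]:
def get_synset(model, query, topn=2):
    try:
        # query the word vectors via `model.wv`; calling `most_similar` on the model itself is deprecated
        return sorted(model.wv.most_similar(positive=[query], topn=topn), key=lambda x: x[1], reverse=True)
    except KeyError:
        # the query is not in the model's vocabulary
        return []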

Now we can query the model interactively through a mini REPL:

In [None]:
# try: market, hotel

# to stop: EXIT!

In [None]:
NUM_RESULTS = 10

while True:
    query = input("\nquery? ")
    if query == 'EXIT!':
        break
    synset = get_synset(model_new, query, topn=NUM_RESULTS)
    if synset:
        print("most similar to", query, ":", synset)
    else:
        # get_synset returns [] when the query is not in the vocabulary
        print("not found")


query?  market




most similar to market : [('time', 0.9984865188598633), ('has', 0.9984476566314697), ('go', 0.998352587223053), ('home', 0.9982441663742065), ('business', 0.998213529586792), ('make', 0.9981980323791504), ('world', 0.9981462955474854), ('information', 0.9981448650360107), ('investment', 0.9981405735015869), ('data', 0.9981072545051575)]



query?  EXIT!
