In [None]:
%logstop
%logstart -rtq ~/.logs/ML_Natural_Language_Processing.py append
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Natural Language Processing

Natural language processing (NLP) is the field devoted to methods and algorithms that let computers process human (natural) languages. NLP is a vast discipline and an active area of research. In this notebook, we focus on NLP tools and techniques that are useful for machine learning applications, such as sentiment analysis, topic modeling, and language translation. In NLP, the following terms have specific meanings:

* **Corpus**: The body/collection of text being investigated.
* **Document**: The unit of analysis, what is considered a single observation.

Examples of corpora include a collection of reviews or tweets, the text of the _Iliad_, and Wikipedia articles. A document can be whatever you decide; it is what your model will consider a single observation. When the corpus is a collection of reviews or tweets, it is logical to make each review or tweet a document. For the text of the _Iliad_, we can set the document to be a sentence or a paragraph. The choice of document size is influenced by the size of the corpus: if the corpus is large, it may make sense to call each paragraph a document. As is usually the case, some design choices need to be made.

For this notebook, we will build a classifier to discern homonyms, words that are spelled the same but that have different meanings. The exact use case we will explore is to discern if the word "python" refers to the programming language or the animal.

## NLP with spaCy

spaCy is a Python package that bills itself as "industrial-strength" natural language processing. We will use the tools spaCy provides in conjunction with scikit-learn. More about spaCy can be found [here](https://spacy.io/).

In [2]:
import spacy

# load text processing pipeline
nlp = spacy.load('en_core_web_sm')

# nlp accepts a string
doc = nlp("Let's try out spacy. We can easily divide our text into sentences! I've run out of ideas.")

# iterate through each sentence
for sent in doc.sents:
    print(sent)

# index words
print(doc[0])
print(doc[6])

Let's try out spacy.
We can easily divide our text into sentences!
I've run out of ideas.
Let
We


Another nice feature of spaCy is part-of-speech tagging, the process of identifying whether a word is a noun, adjective, adverb, etc. A processed word has the attributes `pos_` and `tag_`; the former identifies the coarse part of speech (e.g., noun) while the latter identifies the more detailed part of speech (e.g., singular proper noun). The meanings of the `tag_` abbreviations are listed [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) or can be revealed by running the `spacy.explain` function.

In [3]:
doc = nlp("The quick brown fox jumped over the lazy dog. Mr. Peanut wears a top hat.")
tags = set()

# reveal part of speech
for word in doc:
    tags.add(word.tag_)
    print((word.text, word.pos_, word.tag_))

# revealing meaning of tags
print()
for tag in tags:
    print(tag, spacy.explain(tag))

('The', 'DET', 'DT')
('quick', 'ADJ', 'JJ')
('brown', 'ADJ', 'JJ')
('fox', 'NOUN', 'NN')
('jumped', 'VERB', 'VBD')
('over', 'ADP', 'IN')
('the', 'DET', 'DT')
('lazy', 'ADJ', 'JJ')
('dog', 'NOUN', 'NN')
('.', 'PUNCT', '.')
('Mr.', 'PROPN', 'NNP')
('Peanut', 'PROPN', 'NNP')
('wears', 'VERB', 'VBZ')
('a', 'DET', 'DT')
('top', 'ADJ', 'JJ')
('hat', 'NOUN', 'NN')
('.', 'PUNCT', '.')

VBZ verb, 3rd person singular present
DT determiner
NN noun, singular or mass
IN conjunction, subordinating or preposition
VBD verb, past tense
NNP noun, proper singular
. punctuation mark, sentence closer
JJ adjective (English), other noun-modifier (Chinese)


## Obtaining a corpus

Before we can move on with our analysis, we need to obtain a corpus. For our intended classifier, we need documents pertaining to python the animal and Python the programming language. Let's use Wikipedia articles to form our corpus. Luckily, there's a Python package called `wikipedia` that makes it easy to fetch articles. We will create documents based on the sentences in the articles. The function below accepts multiple pages when constructing the documents, which lets us prevent one class of documents from dominating the corpus.

In [7]:
import wikipedia

def pages_to_sentences(*pages):
    """Return a list of sentences in Wikipedia articles."""
    sentences = []
    
    for page in pages:
        p = wikipedia.page(page)
        doc = nlp(p.content)
        sentences += [sent.text for sent in doc.sents]
    
    return sentences

animal_sents = pages_to_sentences("Reticulated python", "Ball Python")
language_sents = pages_to_sentences("Python (programming language)")
documents = animal_sents + language_sents

print(language_sents[:5])
print()
print(animal_sents[:5])

['Python is an interpreted high-level general-purpose programming language.', 'Its design philosophy emphasizes code readability with its use of significant indentation.', 'Its language constructs as well as its object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.', 'Python is dynamically-typed and garbage-collected.', 'It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming.']

['The reticulated python (Malayopython reticulatus) is a python species native to South and Southeast Asia.', "It is the world's longest snake, and listed as least concern on the IUCN Red List because of its wide distribution.", 'In several countries in its range, it is hunted for its skin, for use in traditional medicine, and for sale as a pet.', 'It is an excellent swimmer, has been reported far out at sea, and has colonized many small islands within its range.', '\n']


## Bag of words model

Machine learning models need to ingest data in a structured form: a matrix where the rows represent observations and the columns are features/attributes. When working with text data, we need a method to convert this unstructured data into a form that the machine learning model can work with. Consider our motivating example, a classifier that discerns the usage of "python" in a document. We expect documents referring to the programming language to use words such as "integer", "byte", and "error" at a higher frequency than documents that refer to python the animal. The reverse is true for words such as "bite", "snake", and "pet". One technique to _transform_ text data into a matrix is to count the number of appearances of each word in each document. This technique is called the **bag of words** model. The model gets its name because each document is viewed as a bag holding all its words, disregarding word order, context, and grammar. After applying the bag of words model to a corpus, the resulting matrix will exhibit patterns that a machine learning model can exploit. The example below shows the result of applying the bag of words model to a corpus of two documents.

Document 0: "The python is a large snake, although the snake is not venomous." <br>
Document 1: "Python is an interpreted programming language for general purpose programming." <br>
<br>

| although | an | for | general | interpreted | is | language | large | not | programming | purpose | python | snake | the | venomous |
|:--------:|----|-----|---------|-------------|----|----------|-------|-----|-------------|---------|--------|-------|-----|----------|
|     1    | 0  | 0   | 0       | 0           | 2  | 0        | 1     | 1   | 0           | 0       | 1      | 2     | 2   | 1        |
|     0    | 1  | 1   | 1       | 1           | 1  | 1        | 0     | 0   | 2           | 1       | 1      | 0     | 0   | 0        |


### CountVectorizer


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words = CountVectorizer()
bag_of_words.fit(documents)
word_counts = bag_of_words.transform(documents)

print(word_counts)
word_counts

  (0, 230)	1
  (0, 282)	1
  (0, 1363)	1
  (0, 1547)	1
  (0, 1705)	1
  (0, 2034)	2
  (0, 2161)	1
  (0, 2163)	1
  (0, 2383)	1
  (0, 2384)	1
  (0, 2396)	1
  (0, 2564)	1
  (0, 2596)	1
  (1, 230)	1
  (1, 281)	1
  (1, 346)	1
  (1, 602)	1
  (1, 794)	1
  (1, 1363)	1
  (1, 1368)	1
  (1, 1380)	1
  (1, 1382)	1
  (1, 1462)	1
  (1, 1496)	1
  (1, 1497)	1
  :	:
  (853, 2034)	1
  (854, 258)	1
  (855, 68)	1
  (855, 96)	1
  (855, 138)	1
  (855, 1364)	1
  (857, 52)	1
  (857, 1573)	1
  (857, 2495)	1
  (858, 75)	1
  (858, 846)	1
  (858, 1268)	1
  (858, 1990)	1
  (858, 2034)	1
  (859, 169)	1
  (859, 1984)	1
  (859, 2757)	1
  (860, 81)	1
  (860, 123)	1
  (860, 138)	1
  (860, 1364)	1
  (863, 975)	1
  (863, 1493)	1
  (865, 1785)	1
  (865, 2745)	1


<866x2834 sparse matrix of type '<class 'numpy.int64'>'
	with 9107 stored elements in Compressed Sparse Row format>

The `transform` method returns a sparse matrix, a more efficient way of storing a matrix that has mostly zero entries. Rather than storing every element, it stores only the non-zero entries and their locations (row and column). In the printed output above, each line reads `(row, column)  value`.
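
To illustrate the idea of keeping only the non-zero entries, here is a small sketch using plain Python lists. It stores coordinate triples, which is a simplification: the matrix returned by scikit-learn uses the Compressed Sparse Row layout, which organizes the same information differently. The `dense` matrix and the `to_dense` helper are made up for this illustration.

```python
dense = [
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 2],
]

# keep only the non-zero entries as (row, column, value) triples
triples = [(i, j, value)
           for i, row in enumerate(dense)
           for j, value in enumerate(row)
           if value != 0]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the dense matrix from the stored triples."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triples:
        out[i][j] = value
    return out

print(triples)
print(to_dense(triples, 3, 5) == dense)
```

Only three triples are stored for fifteen matrix entries, and the dense form can still be recovered when needed.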

In [11]:
# get word counts
counts_animal = bag_of_words.transform(animal_sents)
counts_language = bag_of_words.transform(language_sents)

# index for "programming"
ind_programming = bag_of_words.vocabulary_['programming']

# total counts across all documents
print(counts_animal.sum(axis=0)[0, ind_programming])
print(counts_language.sum(axis=0)[0, ind_programming])

0
32


## Term frequency-inverse document frequency

**CountVectorizer** creates a feature matrix of raw counts. Using raw counts has two problems: documents vary widely in length, and the counts will be large for common words such as "the" and "is". We need a weighting scheme that accounts for both. The term frequency-inverse document frequency, **tf-idf** for short, is a popular weighting scheme that improves on the simple count-based data from the bag of words model. It is the product of two values, the term frequency and the inverse document frequency. There are several variants; a common one is defined below.

* **Term Frequency:**
$$ \mathrm{tf}(t, d) = \frac{\mathrm{counts}(t, d)}{\sqrt{\sum_{t' \in d} \mathrm{counts}(t', d)^2}}, $$
    where $\mathrm{counts}(t, d)$ is the raw count of term $t$ in document $d$ and the sum runs over all terms $t'$ in document $d$. The normalization gives each document's term-frequency vector unit length.

* **Inverse Document Frequency:**
$$ \mathrm{idf}(t, D) = \ln\left(\frac{\text{number of documents in corpus } D}{1 + \text{number of documents with term } t}\right). $$
    Every counted term $t$ in the corpus will have its own idf weight. The $1+$ in the denominator is to ensure no division by zero if a term does not appear in the corpus. The idf weight is simply the log of the inverse of a term's document frequency.
    
With both $\mathrm{tf}(t, d)$ and $\mathrm{idf}(t, D)$ calculated, the tf-idf weight is

$$ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \mathrm{idf}(t, D).$$

With the idf weighting, words that are very common throughout the documents get weighted down. Conversely, the counts of rare words get weighted up. With the tf-idf weighting scheme, a machine learning model will have an easier time learning patterns that predict the labels correctly.
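
The definitions above can be evaluated by hand on a tiny corpus. The sketch below follows the formulas exactly as written in this section; the three count dictionaries are hypothetical data, and the function names are chosen only for this illustration.

```python
import math

# raw counts for a toy corpus of three documents (hypothetical data)
counts = [
    {"python": 1, "snake": 2, "the": 2},
    {"python": 1, "code": 3, "the": 1},
    {"pet": 1, "the": 1},
]
n_docs = len(counts)

def tf(term, doc):
    """Count of the term divided by the Euclidean length of the document's count vector."""
    norm = math.sqrt(sum(c ** 2 for c in doc.values()))
    return doc.get(term, 0) / norm

def idf(term):
    """ln(N / (1 + number of documents containing the term))."""
    df = sum(1 for doc in counts if term in doc)
    return math.log(n_docs / (1 + df))

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

for term in ["the", "python", "snake"]:
    print(term, round(tfidf(term, counts[0]), 4))
```

In this toy corpus the rare term "snake" receives the largest weight, while "the", which appears in every document, ends up with a weight below zero under this particular variant. That behavior for very common terms is one reason smoothed variants of the idf exist.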

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf_weights = tfidf.fit_transform(word_counts)
print(tfidf_weights)

  (0, 2596)	0.13953740627450906
  (0, 2564)	0.12010123805255166
  (0, 2396)	0.29868630762321047
  (0, 2384)	0.34820978703735816
  (0, 2383)	0.3608325622415421
  (0, 2163)	0.3291762394384941
  (0, 2161)	0.23654005300487888
  (0, 2034)	0.25611491396866004
  (0, 1705)	0.3216226411298091
  (0, 1547)	0.34820978703735816
  (0, 1363)	0.15772873545610774
  (0, 282)	0.3608325622415421
  (0, 230)	0.12557303557114396
  (1, 2799)	0.2590477608040859
  (1, 2776)	0.2590477608040859
  (1, 2564)	0.18415095607188434
  (1, 2358)	0.18511665219431173
  (1, 2090)	0.2669546385877343
  (1, 1793)	0.16797282373031253
  (1, 1780)	0.11301567849150926
  (1, 1516)	0.2669546385877343
  (1, 1497)	0.28910800238573126
  (1, 1496)	0.22898751922244057
  (1, 1462)	0.2669546385877343
  (1, 1382)	0.28910800238573126
  :	:
  (853, 797)	0.765489540789116
  (854, 258)	1.0
  (855, 1364)	0.4492928625542091
  (855, 138)	0.4492928625542091
  (855, 96)	0.5460182448030878
  (855, 68)	0.5460182448030878
  (857, 2495)	0.64087564445505

We no longer have raw counts in our feature matrix. Let's use the `idf_` attribute of the fitted tf-idf transformer to inspect the top idf weights and their corresponding terms.

In [13]:
top_idf_indices = tfidf.idf_.argsort()[:-20:-1]
# scikit-learn >= 1.0; older releases use get_feature_names()
ind_to_word = bag_of_words.get_feature_names_out()

for ind in top_idf_indices:
    print(tfidf.idf_[ind], ind_to_word[ind])

7.071891796220597 zope
7.071891796220597 leone
7.071891796220597 labs
7.071891796220597 lacey
7.071891796220597 ladd
7.071891796220597 lakes
7.071891796220597 land
7.071891796220597 boo
7.071891796220597 bombana
7.071891796220597 lantz
7.071891796220597 laos
7.071891796220597 laptop
7.071891796220597 bohol
7.071891796220597 board
7.071891796220597 blue
7.071891796220597 last
7.071891796220597 blotches
7.071891796220597 blender
7.071891796220597 latin
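
The top idf weight printed above, about 7.07, is larger than $\ln(866/2) \approx 6.07$, the value the idf formula given earlier would produce for a term found in one of the 866 documents. The difference appears to come from the variant that scikit-learn's `TfidfTransformer` uses by default (`smooth_idf=True`), which its documentation describes as $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$. A quick check of that arithmetic, taking the document count from the matrix shape printed earlier and assuming the term occurs in exactly one document:

```python
import math

n_documents = 866        # rows of the count matrix shown earlier
document_frequency = 1   # assumption: the term occurs in a single document

# smoothed idf, as documented for scikit-learn's TfidfTransformer
# with its default smooth_idf=True
idf = math.log((1 + n_documents) / (1 + document_frequency)) + 1
print(idf)
```

The result agrees with the printed weights. Note also that, by default, scikit-learn applies the unit-length normalization to the final tf-idf vector of each document rather than to the term frequencies alone.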




## Additional preprocessing

So far, we have discussed how using tf-idf rather than raw counts can improve the performance of our machine learning model. There are several other approaches that can boost performance; we will discuss techniques that improve the signal in our data set. Note that the following techniques may improve model performance only marginally. It is best to create a baseline model first and then measure the change in performance from each addition.

### Stop words

Words such as "the", "a", and "or" are so common throughout our corpus that they contribute little or no signal to our data set. Further, omitting these words reduces the dimensionality of our already high-dimensional data set. It is best to exclude these words from the features so that they are not counted in the analysis. The words that are excluded from the analysis are called **stop words**.

spaCy provides a set of around 300 commonly used English words that can serve as stop words. When using stop words, it is best to examine the entries in case there are certain words you want to include or exclude. Since the words are provided as a Python set, we can use set methods to modify its entries.
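
As a minimal sketch of how a stop word set can be customized and applied, the example below uses a tiny, made-up set instead of spaCy's list. It adds a corpus-specific word and removes "not" from the set, so that "not" survives the filtering.

```python
# a tiny, hypothetical stop word list; real lists are much longer
stop_words = {"the", "a", "or", "is", "not"}

# add a corpus-specific word and keep "not" as a feature
custom = (stop_words | {"python"}) - {"not"}

tokens = "the python is not a venomous snake".split()
kept = [token for token in tokens if token not in custom]
print(kept)
```

The set operators return new sets, so the original `stop_words` is left unchanged.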

In [14]:
from spacy.lang.en import STOP_WORDS

print(type(STOP_WORDS))
STOP_WORDS_python = STOP_WORDS.union({"python"})
STOP_WORDS_python

<class 'set'>


{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

### Stemming and lemmatization

In our current analysis, words like "python" and "pythons" will be counted as separate words. We understand that they represent the same concept and want them to be treated as the same word. The same applies to words like "run", "runs", "ran", and "running", which all carry the same meaning. **Stemming** is the process of reducing a word to its stem. Note that stemming is not always effective, and sometimes the resulting stem is not an actual word. For example, the popular Porter stemming algorithm applied to "argues" and "arguing" returns **argu**.

**Lemmatization** is the process of reducing a word to its lemma, or the dictionary form of the word. It is a more sophisticated process than stemming as it considers context and part of speech. Further, the resulting lemma is an actual word. spaCy does not have a stemming algorithm but does offer lemmatization. Each word analyzed by spaCy has the attribute **lemma_** which returns the lemma of the word.

In [15]:
print([word.lemma_ for word in nlp('run runs ran running')])
print([word.lemma_ for word in nlp('buy buys buying bought')])
print([word.lemma_ for word in nlp('see saw seen seeing')])

['run', 'run', 'run', 'run']
['buy', 'buy', 'buying', 'buy']
['see', 'saw', 'see', 'see']
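
For contrast with the lemmas above, here is a deliberately crude suffix-stripping function. It is a toy written for this illustration, not the Porter algorithm and not part of spaCy, but it shows both how stems can fail to be real words and why irregular forms need a dictionary-based approach.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["argues", "arguing", "pythons", "runs", "ran", "bought"]:
    print(word, "->", toy_stem(word))
```

The toy rule maps "argues" and "arguing" to the same non-word stem, while "ran" and "bought" pass through untouched because no suffix rule can relate them to "run" and "buy".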


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

def lemmatizer(text):
    return [word.lemma_ for word in nlp(text)]

# we need to generate the lemmas of the stop words
stop_words_str = " ".join(STOP_WORDS) # nlp function needs a string
stop_words_lemma = set(word.lemma_ for word in nlp(stop_words_str))

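# recent scikit-learn releases require stop_words to be a list rather than a set
tfidf_lemma = TfidfVectorizer(max_features=100,
                              stop_words=list(stop_words_lemma.union({"python"})),
                              tokenizer=lemmatizer)

tfidf_lemma.fit(documents)
# scikit-learn >= 1.0; older releases use get_feature_names()
print(tfidf_lemma.get_feature_names_out().tolist())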



['\n', '\n\n', '\n\n\n', ' ', '"', '(', ')', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', ':', ';', '=', 'add', 'allow', 'ball', 'block', 'breed', 'c', 'captivity', 'class', 'code', 'common', 'compile', 'cpython', 'describe', 'design', 'development', 'division', 'e.g.', 'eat', 'egg', 'example', 'expression', 'feature', 'female', 'find', 'ft', 'function', 'human', 'implementation', 'include', 'integer', 'island', 'java', 'kill', 'language', 'large', 'later', 'length', 'library', 'like', 'list', 'long', 'm.', 'male', 'measure', 'method', 'module', 'new', 'number', 'object', 'old', 'operator', 'oz', 'pattern', 'prey', 'program', 'programming', 'propose', 'provide', 'r.', 'range', 'reference', 'release', 'report', 'reticulate', 'reticulated', 'small', 'snake', 'specie', 'standard', 'statement', 'string', 'support', 'syntax', 'time', 'type', 'value', 'variable', 'version', 'write', 'year']


### Tokenization and n-grams

Tokenization refers to dividing a document into pieces to be counted. In our analysis so far, we have only counted single words. However, it may be useful to count sequences of words such as "natural environment" and "virtual environment". Counting these **bigrams** for our word usage classifier may boost performance. More generally, an n-gram is a sequence of n consecutive words. In `scikit-learn`, n-grams can be included by setting `ngram_range=(min_n, max_n)` for the vectorizer, where `min_n` and `max_n` are the lower and upper bounds of the range of n-grams to include. For example, `ngram_range=(1, 2)` will include words and bigrams while `ngram_range=(2, 2)` will only count bigrams. Let's see which bigrams are the most frequent in our corpus.

In [17]:
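# recent scikit-learn releases require stop_words to be a list rather than a set
bigram_counter = CountVectorizer(max_features=20, ngram_range=(2, 2),
                                 stop_words=list(STOP_WORDS.union({"python"})))
bigram_counter.fit(documents)

# scikit-learn >= 1.0; older releases use get_feature_names()
bigram_counter.get_feature_names_out().tolist()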



['23 ft',
 'ball pythons',
 'design philosophy',
 'floating point',
 'ft 10',
 'ft length',
 'isbn 978',
 'lb oz',
 'new features',
 'object oriented',
 'oriented programming',
 'programming language',
 'programming languages',
 'reference implementation',
 'reticulated pythons',
 'scripting language',
 'spam eggs',
 'standard library',
 'van rossum',
 'year old']
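
To make the definition of an n-gram concrete, the sketch below extracts n-grams from a list of tokens with plain Python. The `ngrams` helper is written for this illustration and is not a scikit-learn function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "python is an interpreted programming language".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))

# mimic ngram_range=(1, 2): unigrams followed by bigrams
print(ngrams(tokens, 1) + ngrams(tokens, 2))
```

A document of six tokens yields five bigrams, and a document shorter than n yields none.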

## Document similarity

After we have transformed our corpus into a matrix, we can interpret our data set as a set of vectors in a $p$-dimensional space, where each document is its own vector and $p$ is the number of features. One common analysis is to find similar documents. The cosine similarity is a metric that measures how well aligned two vectors are; it is equal to the cosine of the angle between them. If the vectors are perfectly aligned, meaning they point in the same direction, the angle between them is 0 and the similarity score is 1. If the vectors are orthogonal, forming an angle of 90 degrees, the similarity is 0. Mathematically, the cosine similarity is the dot product of the two vectors divided by the product of their lengths,

$$ \frac{v_1 \cdot v_2}{\|v_1 \|\|v_2 \|}, $$

where $v_1$ and $v_2$ are two document vectors and $\| v_1 \|$ and $\| v_2 \|$ are their lengths.
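
As a small worked example, the sketch below applies the formula to the two count vectors from the bag of words table earlier in this notebook. The `cosine_similarity` function here is a hand-written helper for illustration, not a library call.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# rows of the bag of words table: document 0 (snake) and document 1 (language)
doc0 = [1, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 1, 2, 2, 1]
doc1 = [0, 1, 1, 1, 1, 1, 1, 0, 0, 2, 1, 1, 0, 0, 0]

print(cosine_similarity(doc0, doc1))
print(cosine_similarity(doc0, doc0))
```

The two documents share only the terms "is" and "python", so their similarity is low, while a document compared with itself scores 1 up to floating-point rounding.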

## Word usage classifier

Let's build a word usage classifier with all the techniques we have seen. The model will include:

* tf-idf weighting
* stop words
* words and bigrams
* lemmatization

Applying the above techniques should result in a data set with enough signal for a machine learning model to learn from. For this exercise, we will use the naive Bayes model, a probabilistic model that calculates conditional probabilities using Bayes' theorem. The term naive is applied because the model assumes the features are conditionally independent of each other given the class. You can think of a naive Bayes classifier as deciding which class a document should be assigned to based on the frequencies of its words in the different classes of the training set. Naive Bayes is often used as a benchmark model for NLP as it is quick to train. More about the model in general can be found [here](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and details of the `scikit-learn` implementation are found [here](https://scikit-learn.org/stable/modules/naive_bayes.html). After training our model, we will see how well it performs on a chosen set of sentences.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# create data set and labels
documents = animal_sents + language_sents
labels = ["animal"]*len(animal_sents) + ["language"]*len(language_sents)

# lemma of stop words
stop_words_str = " ".join(STOP_WORDS)
stop_words_lemma = set(word.lemma_ for word in nlp(stop_words_str))

# create and train pipeline (stop_words is passed as a list, which recent scikit-learn releases require)
tfidf = TfidfVectorizer(stop_words=list(stop_words_lemma), tokenizer=lemmatizer, ngram_range=(1, 2))
pipe = Pipeline([('vectorizer', tfidf), ('classifier', MultinomialNB())])
pipe.fit(documents, labels)

print("Training accuracy: {}".format(pipe.score(documents, labels)))

Training accuracy: 0.8995381062355658
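
To make the naive Bayes decision rule more concrete, here is a from-scratch sketch of a multinomial naive Bayes classifier with add-one smoothing. It is a simplified illustration under stated assumptions: the training data is a made-up toy set, it works on raw word counts rather than tf-idf weights, and it is not the `MultinomialNB` implementation used in the pipeline above.

```python
import math
from collections import Counter, defaultdict

# toy training data (hypothetical); each document is a list of tokens
train = [
    (["snake", "bite", "pet"], "animal"),
    (["snake", "large", "egg"], "animal"),
    (["code", "error", "byte"], "language"),
    (["code", "integer", "loop"], "language"),
]

# number of documents per class and word counts per class
class_doc_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocabulary = {word for tokens, _ in train for word in tokens}

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) with add-alpha (Laplace) smoothing."""
    log_prob = math.log(class_doc_counts[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for token in tokens:
        if token not in vocabulary:
            continue  # ignore words never seen in training
        numerator = word_counts[label][token] + alpha
        log_prob += math.log(numerator / (total + alpha * len(vocabulary)))
    return log_prob

def predict(tokens):
    """Pick the class with the highest log posterior."""
    return max(class_doc_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["my", "pet", "snake"]))
print(predict(["error", "in", "the", "code"]))
```

Summing log probabilities over the tokens is where the conditional independence assumption enters: each word contributes to the score on its own, regardless of the words around it.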


In [19]:
test_docs = ["My Python program is only 100 bytes long.",
             "A python's bite is not venomous but still hurts.",
             "I can't find the error in the python code.",
             "Where is my pet python; I can't find her!",
             "I use for and while loops when writing Python.",
             "The python will loop and wrap itself onto me.",
             "I use snake case for naming my variables.",
             "My python has grown to over 10 ft long!",
             "I use virtual environments to manage package versions.",
             "Pythons are the largest snakes in the environment."]

class_labels = ["animal", "language"]
y_proba = pipe.predict_proba(test_docs)
predicted_indices = (y_proba[:, 1] > 0.5).astype(int)

for i, index in enumerate(predicted_indices):
    print(test_docs[i], "--> {} at {:g}%".format(class_labels[index], 100*y_proba[i, index]))

My Python program is only 100 bytes long. --> language at 69.1922%
A python's bite is not venomous but still hurts. --> animal at 51.0419%
I can't find the error in the python code. --> language at 80.4892%
Where is my pet python; I can't find her! --> language at 55.9227%
I use for and while loops when writing Python. --> language at 85.1932%
The python will loop and wrap itself onto me. --> language at 69.4983%
I use snake case for naming my variables. --> language at 59.8769%
My python has grown to over 10 ft long! --> animal at 60.9395%
I use virtual environments to manage package versions. --> language at 80.3619%
Pythons are the largest snakes in the environment. --> animal at 63.807%
