<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing and Speech Recognition**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* **Part 11.1: Getting Started with Spacy in Python** [[Video]](https://www.youtube.com/watch?v=A5BtU9vXzu8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_spacy.ipynb)
* Part 11.2: Word2Vec and Text Classification [[Video]](https://www.youtube.com/watch?v=nWxtRlpObIs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_02_word2vec.ipynb)
* Part 11.3: What are Embedding Layers in Keras [[Video]](https://www.youtube.com/watch?v=OuNH5kT-aD0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_03_embedding.ipynb)
* Part 11.4: Natural Language Processing with Spacy and Keras [[Video]](https://www.youtube.com/watch?v=BKgwjhao5DU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_04_text_nlp.ipynb)
* Part 11.5: Learning English from Scratch with Keras and TensorFlow [[Video]](https://www.youtube.com/watch?v=Y1khuuSjZzc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN&index=58) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_05_english_scratch.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [1]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


# Part 11.1: Getting Started with Spacy in Python

When we apply neural networks to Natural Language Processing (NLP), we must decide whether to operate at the word or character level.  Up to this point, we've worked primarily at the character level, as in the Treasure Island pirate story generator that we saw previously.  In contrast, we used word-level NLP for the image caption generator.  In this module, the focus will be primarily on word-level NLP. Notably, we will examine some of the NLP tools that we can use to process words before we send them to a neural network.  There are two prevalent NLP libraries for Python:

* [NLTK](https://www.nltk.org/)
* [Spacy](https://spacy.io/)

In this course, we will focus on Spacy.  I prefer Spacy because of the object abstraction of sentences that it provides.  However, both libraries are widely used.

### Installing Spacy

You can install Spacy with pip. Spacy was included in the list of packages to install for this course.  You will also need to ensure that you've installed a language model for Spacy.  If you have not, loading the model raises the following error:

```
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem 
to be a shortcut link, a Python package or a valid path to a 
data directory.
```

To install the English model, use the following command:

```
python -m spacy download en_core_web_sm
```
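A minimal sketch of guarding against the missing-model error shown above. The helper name `load_model` is hypothetical and not part of Spacy; it relies only on `spacy.load` raising `OSError` when the model cannot be found, as in the error message.

```python
import spacy


def load_model(name="en_core_web_sm"):
    """Return the loaded pipeline, or None if the model is not installed.

    `load_model` is a hypothetical helper for illustration only.
    """
    try:
        return spacy.load(name)
    except OSError:
        # Spacy raises OSError (E050) when it cannot find the model.
        print(f"Model '{name}' not found; run: python -m spacy download {name}")
        return None


nlp = load_model()
```

If the function returns `None`, run the download command and then call it again.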



### Tokenization

Tokenization is the task of chopping a text up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.  Consider how a program might break up the following sentences into words.

* This is a test.
* Ok, but what about this?
* Is U.S.A. the same as USA.?
* What is the best data-set to use?
* I think I will do this-no wait; I will do that.
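To see why these sentences are tricky, here is a rough sketch of two naive tokenizers built only on the Python standard library. Both function names are hypothetical, and neither reflects how Spacy tokenizes; they only illustrate where simple rules break down on abbreviations and hyphens.

```python
import re

sentences = [
    "This is a test.",
    "Ok, but what about this?",
    "Is U.S.A. the same as USA.?",
    "What is the best data-set to use?",
    "I think I will do this-no wait; I will do that.",
]


def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays glued to words.
    return text.split()


def regex_tokens(text):
    # Runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


for sentence in sentences:
    print(whitespace_tokens(sentence))
    print(regex_tokens(sentence))
```

The whitespace version leaves `test.` as one token, while the regex version shreds `U.S.A.` into six tokens and splits `data-set` in three. A practical tokenizer needs language-specific exceptions for cases like these.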

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
a
U.K.
startup
for
$
1
billion


You can also obtain the part of speech for each word.  Common parts of speech include nouns, verbs, pronouns, and adjectives.

In [3]:
for word in doc:  
    print(word.text,  word.pos_)

Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
a DET
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM


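Spacy tokens also carry attributes that indicate whether a part of a sentence appears to be a number, an acronym, or another kind of entity.  The following code uses `like_num` to find tokens that resemble numbers.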

In [4]:
for word in doc:
    print(f"{word} is like number? {word.like_num}")

Apple is like number? False
is is like number? False
looking is like number? False
at is like number? False
buying is like number? False
a is like number? False
U.K. is like number? False
startup is like number? False
for is like number? False
$ is like number? False
1 is like number? True
billion is like number? True
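As a further sketch, other lexical flags can be read in the same way. This example assumes a blank English pipeline created with `spacy.blank("en")`, which supplies the tokenizer and lexical attributes without a downloaded model. The variable names `nlp_blank` and `doc_blank` are chosen here only to avoid overwriting `nlp` and `doc`.

```python
import spacy

# A blank pipeline has the rule-based tokenizer and lexical attributes,
# but no tagger, parser, or entity recognizer.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc_blank:
    print(
        f"{token.text:10} like_num={token.like_num} "
        f"is_alpha={token.is_alpha} is_currency={token.is_currency}"
    )

# Collect only the tokens that resemble numbers.
numbers = [token.text for token in doc_blank if token.like_num]
print(numbers)
```

These flags come from lexical rules rather than a trained model, so they are cheap to compute and available even when no statistical pipeline is loaded.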


### Sentence Diagramming

For years, grade school children have had to endure "sentence diagramming."  Such diagrams can help students understand sentence structure.  Spacy provides a means to produce similar sentence diagrams.  For example, Figure 11.SPACY shows how Spacy diagrams the sentence "My name is Jeff."

**Figure 11.SPACY: Spacy Sentence Diagram**
![Spacy Diagram](./images/spacy-diagram.png)

I provide the code needed to diagram this sentence below.  This code starts a local web server that hosts a scrollable, interactive display.  Because the server keeps running, you will need to stop this cell in Jupyter to continue.  Also, the code does not show its output until you run this cell, so you will not see the diagram on GitHub or in the printed book.

In [2]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"My name is Jeff.")
displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Note: you will have to manually stop the above cell.**
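As a non-blocking alternative, `displacy.render` returns the diagram markup as a string instead of starting a server. The sketch below uses displaCy's manual mode so that it runs without a model. The words, tags, and arcs in `manual_parse` are hand-written for illustration and are an assumption, not output from Spacy's parser.

```python
from spacy import displacy

# Hand-written parse of "My name is Jeff" (illustrative assumption,
# not produced by a Spacy model).
manual_parse = {
    "words": [
        {"text": "My", "tag": "PRON"},
        {"text": "name", "tag": "NOUN"},
        {"text": "is", "tag": "AUX"},
        {"text": "Jeff", "tag": "PROPN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "poss", "dir": "left"},
        {"start": 1, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# jupyter=False forces the function to return the SVG markup as a string.
svg = displacy.render(manual_parse, style="dep", manual=True, jupyter=False)
print(svg[:60])
```

With a parsed `doc` from a loaded model, the same call without `manual=True` should work, and passing `jupyter=True` displays the diagram inline in a notebook.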

In [3]:
print(doc)

My name is Jeff.


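The following code shows how to reduce words to their lemmas, a process called lemmatization.  Here each word in the sentence is reduced to its most basic form.  For example, "striped" becomes "stripe."  Pronouns appear in the output as the placeholder `-PRON-`, which is how Spacy 2.x lemmatizes them.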

In [9]:
import spacy

# Initialize spacy 'en' model, keeping only tagger 
# component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join

" ".join([token.lemma_ for token in doc])

'the stripe bat be hang on -PRON- foot for good'

### Stop Words

Stop words are words that are filtered out before or after the processing of natural language text. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools. Spacy contains a list of stop words that you may wish to exclude or treat with less importance in your programs.
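One simple way to act on such a list is to drop every word that appears in it. The following is a minimal sketch. The helper `remove_stop_words` is hypothetical, and it splits on whitespace rather than using Spacy's tokenizer.

```python
from spacy.lang.en.stop_words import STOP_WORDS


def remove_stop_words(text):
    # Lowercase and split on whitespace, then keep only the words
    # that are not in Spacy's English stop word set.
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("for a neural network"))
```

Because `STOP_WORDS` is a plain Python `set`, membership tests are fast, and you can add or remove entries to suit your application.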

In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)

{'toward', 'around', 'yours', 'moreover', 'for', 'and', 'many', 'have', 'noone', 'thereby', 'must', 'among', 'beside', 'make', 'might', "'s", 'please', 'three', 'indeed', 'seem', 'ever', 'he', 'ten', 'too', 'already', 'who', 'had', 'wherever', 'back', 'hence', 'when', 'done', 'now', '’m', 'therefore', 'using', 'elsewhere', 'will', 'them', 'being', 'anything', 'twelve', 'part', 'ca', 'nothing', 'get', 'nobody', 'we', 'somehow', 'alone', 'beforehand', 'ourselves', 'show', 'off', 'him', 'amongst', 'although', 'doing', 'a', 'am', 'anyone', 'as', 'something', 'else', 'those', 'if', 'whoever', 'has', 'does', 'this', 'only', 'becomes', 'twenty', 'by', 'became', 'except', 'well', 'any', 'across', 'fifty', 'neither', 'in', 'front', 'formerly', 'perhaps', 'through', 'whence', '‘re', 'again', 'to', 're', 'below', 'whose', 'hers', 'first', 'most', 'did', 'serious', 'may', 'they', 'herein', 'everyone', 'it', 'anyhow', 'whither', 'was', 'his', 'whenever', 'itself', 'call', 'becoming', 'fifteen', 'th