# Machine Learning with PyTorch

## Natural Language Processing with AllenNLP

<font size="+1">What is AllenNLP?</font>
<a href="AllenNLP_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1"><u><b>What is SpaCy?</b></u></font>
<a href="AllenNLP_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">High-Level Interfaces to NLP using PyTorch</font>
<a href="AllenNLP_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Sentiment Analysis</font>
<a href="AllenNLP_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Part-of-Speech Tagging</font> 
<a href="AllenNLP_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

## What is SpaCy?

SpaCy takes us a bit further afield from this tutorial's focus on PyTorch.  In different ways, PyTorch and SpaCy form the main building blocks for AllenNLP.  Although it has a different design philosophy, SpaCy occupies a similar space to the libraries [NLTK (Natural Language Toolkit)](https://github.com/nltk/nltk) and [CoreNLP](https://stanfordnlp.github.io/CoreNLP/).

In SpaCy, you can build your own linguistic models, but it also comes with pre-built models and high-level tools.  Features include:

* Non-destructive tokenization
* Named entity recognition
* Support for 49+ languages
* 16 statistical models for 9 languages
* Pre-trained word vectors
* Easy deep learning integration
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built-in visualizers for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy model packaging and deployment
* Robust, rigorously evaluated accuracy

### Simple examples

These are taken from the SpaCy website and documentation with minimal changes, just for illustration.  Built-in tools already do many of the things for which we demonstrate custom neural network models.  The two sides play well together; the general problems are well solved, but we can build on the basics with PyTorch, SpaCy, and AllenNLP to create more customized models and tools.

The `en_core_web_sm` model builds in many useful analyses of English sentences.  Similar models exist for other languages.  SpaCy has an interesting approach of providing one monolithic core model that serves many purposes.  Hence you often start working with the library by creating a general `nlp` object that can analyze text in multiple ways.

In [None]:
import spacy

# Load English tokenizer, tagger, parser and NER
# (the small model does not ship with static word vectors)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

In [None]:
# Analyze syntax
print("Noun phrases:\n-", "\n- ".join(chunk.text for chunk in doc.noun_chunks))

print("\nVerbs:\n-", "\n- ".join(token.lemma_ for token in doc if token.pos_ == "VERB"))

In [None]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text.ljust(18), "|", entity.label_)

### Lemmatization

Lemmatization is roughly the same thing as *stemming*, but makes better use of contextual clues; it lets us find the root forms of words. Depending on your purpose, it is often useful to reduce the feature set of a vocabulary or "bag of words" in order to simplify modeling.  For example, if you wish to identify the conceptual areas addressed in a text, dealing with many inflected forms of the same word is probably just overhead and noise.  Models of this sort often address *topicality*.

In [None]:
# Load the small English model, disabling the components not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Process the sentence using the loaded model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])
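
To make the point about shrinking a "bag of words" concrete, here is a minimal sketch using only the standard library.  The `TOY_LEMMAS` table and the `bag_of_words` helper are hypothetical names invented for this illustration; the hand-written table merely stands in for the lemmas SpaCy would supply via `token.lemma_`, and is not how SpaCy itself works.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only; in practice
# the lemmas would come from `token.lemma_` on a SpaCy `Doc`.
TOY_LEMMAS = {
    "bats": "bat",
    "are": "be",
    "hanging": "hang",
    "hangs": "hang",
    "hung": "hang",
    "feet": "foot",
}


def bag_of_words(tokens, lemmas=None):
    """Count tokens, optionally mapping each one to its lemma first."""
    lemmas = lemmas or {}
    return Counter(lemmas.get(tok, tok) for tok in tokens)


sentence = "the bats are hanging while one bat hangs where another bat hung"
tokens = sentence.split()

raw_counts = bag_of_words(tokens)                # one feature per surface form
lemma_counts = bag_of_words(tokens, TOY_LEMMAS)  # one feature per lemma

print(len(raw_counts), "distinct surface forms")
print(len(lemma_counts), "distinct lemmas")
print(lemma_counts.most_common(2))
```

Folding inflected forms together leaves fewer features, and the counts for "bat" and "hang" are no longer split across several spellings, which is usually what a topicality model wants.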

### Sentence visualization

In [None]:
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
displacy.render(doc, style="dep")

In [None]:
import pandas as pd

pd.DataFrame([(t.text, t.lemma_, t.pos_, t.tag_, t.dep_,
               t.shape_, t.is_alpha, t.is_stop) for t in doc],
             columns=['Text', 'Lemma', "Pos", "Tag", "Dep", 
                      "Shape", "is_alpha", "is_stop"])