# Natural Language Processing Pipelines

In this lesson, you'll be introduced to some of the steps involved in an NLP pipeline:

1. Text Processing

>* Cleaning
>* Normalization
>* Tokenization
>* Stop Word Removal
>* Part of Speech Tagging
>* Named Entity Recognition
>* Stemming and Lemmatization

2. Feature Extraction

>* Bag of Words
>* TF-IDF
>* Word Embeddings

3. Modeling

## How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing -> Feature Extraction -> Modeling.

1. **Text Processing:** Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.

2. **Feature Extraction:** Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.

3. **Modeling:** Design a statistical or machine learning model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.
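
To make the three stages concrete, here is a minimal sketch of how they might chain together. The function names, the two-word vocabulary, and the hand-set weights are all hypothetical; a real model would learn its weights from training data rather than have them written by hand.

```python
import re
from collections import Counter

def process_text(raw):
    """Stage 1: lowercase, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9]", " ", raw.lower()).split()

def extract_features(tokens, vocabulary):
    """Stage 2: count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def predict(features, weights):
    """Stage 3: a hand-weighted linear scorer standing in for a trained model."""
    score = sum(f * w for f, w in zip(features, weights))
    return "positive" if score > 0 else "negative"

vocabulary = ["great", "awful"]   # hypothetical vocabulary
weights = [1.0, -1.0]             # made-up weights, not learned from data

tokens = process_text("What a GREAT, great movie!")
features = extract_features(tokens, vocabulary)
label = predict(features, weights)

print(tokens)    # ['what', 'a', 'great', 'great', 'movie']
print(features)  # [2, 0]
print(label)     # positive
```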

## Stage 1: Text Processing
The first chunk of this lesson will explore the steps involved in text processing, the first stage of the NLP pipeline.

* **Extracting plain text:** Textual data can come from a wide variety of sources: the web, PDFs, Word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to your task.
* **Reducing complexity:** Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

In this lesson...
> You'll prepare text data from different sources with the following text processing steps:

1. **Cleaning** to remove irrelevant items, such as HTML tags
2. **Normalizing** by converting to all lowercase and removing punctuation
3. Splitting text into words or **tokens**
4. Removing words that are too common, also known as **stop words**
5. Identifying different **parts of speech** and **named entities**
6. Reducing words to their root or dictionary forms, using **stemming and lemmatization**

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.

### Text Processing: Cleaning
Let's walk through an example of cleaning text data from a popular source: the web. You'll be introduced to helpful tools for working with this data, including the `requests` library, **regular expressions**, and `Beautiful Soup`.

**Documentation for Python Libraries:**
* [Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)
* [Regular Expressions](https://docs.python.org/3/library/re.html)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


### EXAMPLE:
```python
# import statements
import requests
from bs4 import BeautifulSoup

# fetch web page
r = requests.get("https://www.udacity.com/courses/all")
r
>>> <Response [200]>

soup = BeautifulSoup(markup=r.text, features="lxml")
soup
>>> <!DOCTYPE html>
>>> <html lang="en-US"><head>
>>> <meta charset="utf-8"/>
>>> <script class="ng-star-inserted" ...
>>> ...
>>> &q;type&q;:&q;category&q;,&q;matchCriteria&q;:{&q;withKey&q;
>>> :&q;VR development&q;}}]}]}</script></body></html>
        
# Find all course summaries
summaries = soup.find_all("div", {"class": "course-summary-card"})
print('Number of Courses:', len(summaries))
>>> Number of Courses: 250

# print the first summary in summaries
print(summaries[0].prettify())
>>> <div _ngcontent-sc154="" class="...">
>>>  <ir-catalog-card _ngcontent-sc154="" _nghost-sc157="">
>>>   <div _ngcontent-sc157="" class="card-wrapper is-collapsed">
>>> ...
>>> </div>

# Extract course title
summaries[0].select_one("h3").get_text()
>>> 'Applying Data Science to Product Management'

# Extract school
summaries[0].select_one("h4").get_text().strip()
>>> 'School of Business'

# append name and school of each summary to courses list
courses = []
for summary in summaries:
    name = summary.select_one("h3").get_text()
    school = summary.select_one("h4").get_text().strip()
    courses.append((school, name))
```
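
Regular expressions are listed among the tools above, so here is a rough sketch of how they could strip markup using only the standard library. The HTML snippet and the `strip_tags` helper are hypothetical, and a regex like this is a crude approximation; a real parser such as Beautiful Soup handles malformed or nested markup far more reliably.

```python
import html
import re

# Hypothetical HTML snippet standing in for a fetched page
raw_html = '<div class="card"><h3>Intro to NLP</h3><p>Learn text &amp; speech processing.</p></div>'

def strip_tags(markup):
    """Crude regex-based cleaner: drop tags, decode entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", markup)      # replace each tag with a space
    decoded = html.unescape(no_tags)               # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()    # squeeze runs of whitespace

clean_text = strip_tags(raw_html)
print(clean_text)  # Intro to NLP Learn text & speech processing.
```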

### Text Processing: Normalization
<img src="nlp_norm_0.png">

* Capitalization is unlikely to change the meaning of a statement
>* It is common to convert all words (and acronyms) to lowercase
```python
text = "WoRds With&CapiTalization!"
text = text.lower()
text
>>> "words with&capitalization!"
```
* Depending on context, punctuation won't change the meaning of the statement either, especially at a high level
>* Replacing punctuation with a space, rather than removing it, prevents adjacent words from being concatenated
>* A common regex pattern: `r"[^a-zA-Z0-9]"`
```python
import re
re.sub(r"[^a-zA-Z0-9]", " ", text)
>>> "words with capitalization "
```
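
The difference between deleting punctuation and replacing it with a space is easiest to see side by side. This is a small sketch on the same sample string; the variable names `removed` and `spaced` are just illustrative.

```python
import re

text = "WoRds With&CapiTalization!"
lowered = text.lower()

# Option 1: delete punctuation outright (spaces are kept by the character class)
removed = re.sub(r"[^a-zA-Z0-9 ]", "", lowered)

# Option 2: replace each punctuation character with a space
spaced = re.sub(r"[^a-zA-Z0-9]", " ", lowered)

print(removed.split())  # ['words', 'withcapitalization']
print(spaced.split())   # ['words', 'with', 'capitalization']
```

With deletion, "with" and "capitalization" fuse into one token; with replacement they stay separate.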


### Text Processing: Tokenization
Token: a unit of text, typically a word or punctuation mark, that holds a meaning

**Example:**
```python
text = "Dogs are the best"
text.lower().split()
>>> ["dogs", "are", "the", "best"]
```

**USING NLTK: Natural Language Toolkit**

NLTK can tokenize text in a more robust way that also handles punctuation and abbreviations such as "Dr.", as in the example below:
```python
from nltk.tokenize import word_tokenize
text = "Dr. Smith graduated from the University of Washington. He started Lux, an analytics firm."
word_tokenize(text)
>>> ["Dr.", "Smith", "graduated", "from", "the", "University", "of", "Washington", ".", "He", "started", "Lux", ",", "an", "analytics", "firm", "."]
```

And for sentences as well:
```python
from nltk.tokenize import sent_tokenize
text = "Dr. Smith graduated from the University of Washington. He started Lux, an analytics firm."
sent_tokenize(text)
>>> ["Dr. Smith graduated from the University of Washington.", "He started Lux, an analytics firm."]
```

* `nltk.tokenize` [package](http://www.nltk.org/api/nltk.tokenize.html)
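
To see why a dedicated tokenizer helps, compare two simple standard-library approaches. This is only a sketch: the regex pattern below is one common choice, not what NLTK uses internally.

```python
import re

text = "Dr. Smith started Lux, an analytics firm."

# Whitespace splitting leaves punctuation glued to the neighbouring word
whitespace_tokens = text.split()

# A simple regex: runs of word characters, or any single punctuation character
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
# ['Dr.', 'Smith', 'started', 'Lux,', 'an', 'analytics', 'firm.']
print(regex_tokens)
# ['Dr', '.', 'Smith', 'started', 'Lux', ',', 'an', 'analytics', 'firm', '.']
```

Neither result is ideal: splitting on whitespace keeps "Lux," and "firm." as single tokens, while the regex breaks the abbreviation "Dr." in two, which the `word_tokenize` output above avoids.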

### Text Processing: Stop Words
Stop Words: Words that are very common and do not add meaning to a sentence (in general)
>* Examples: "is", "are", "at", "the", etc.

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words("english"))
>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
```
Removing stopwords with nltk:
```python
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
words = [w for w in words if w not in stop_words]
```
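
For a version that runs without downloading the NLTK corpus, the same filtering can be sketched with a small hand-picked stop word set. The six words below are an assumption chosen for illustration; each of them also appears in the NLTK English list printed above.

```python
# Hand-picked subset of stop words, for illustration only
stop_words = {"is", "are", "at", "the", "a", "of"}

words = "dogs are the best".split()
filtered = [w for w in words if w not in stop_words]

print(filtered)  # ['dogs', 'best']
```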


### Text Processing: Parts of Speech
Part-of-speech tagging with a predefined grammar is a simple, but limited, solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!

There are other more advanced forms of POS tagging that can learn sentence structures and tags from given data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).

```python
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
from nltk import word_tokenize
from nltk import pos_tag
sentence = word_tokenize("I always lie down to tell a lie.")
pos_tag(sentence)
>>> [('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]
```
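
Once tokens are tagged, the tags can be used to select words by their role. The sketch below hard-codes the tagged output shown above so that it runs without NLTK; the variable names are illustrative.

```python
# Tagged tokens copied from the pos_tag output above
tagged = [("I", "PRP"), ("always", "RB"), ("lie", "VBP"), ("down", "RP"),
          ("to", "TO"), ("tell", "VB"), ("a", "DT"), ("lie", "NN"), (".", ".")]

# Penn Treebank verb tags start with "VB" and noun tags start with "NN"
verbs = [word for word, tag in tagged if tag.startswith("VB")]
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(verbs)  # ['lie', 'tell']
print(nouns)  # ['lie']
```

The word "lie" lands in both lists because the tagger assigned it a different part of speech in each position.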

### Text Processing: Named Entity Recognition
Named Entity: can be thought of as a proper noun
```python
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
text = "Antonio joined Udacity Inc. in California."
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)
tree.draw()
>>>
(S
  (PERSON Antonio/NNP)
  joined/VBD
  (ORGANIZATION Udacity/NNP Inc./NNP)
  in/IN
  (GPE California/NNP)
  ./.)
```
<img src="nlp_tree_0.png">

### Text Processing: Stemming
<img src="nlp_stem_0.png">
Stemming reduces words to a basic form (the stem) by stripping suffixes according to a set of rules.

Because stemming follows fixed rules, it can reduce some words to stems that are not real words, such as {caches, caching, cache} to cach. As long as every instance of the base word cache is reduced to cach, though, the meaning is preserved at the macro level.

```python
from nltk.stem.porter import PorterStemmer
stemmed = [PorterStemmer().stem(w) for w in words]
```
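
The idea of rule-based suffix stripping can be sketched in a few lines. This toy function is an assumption made for illustration and is not the Porter algorithm, which applies a much larger, carefully ordered rule set.

```python
# Toy suffix rules for illustration only; this is NOT the Porter algorithm
SUFFIXES = ("ing", "es", "e", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["caches", "caching", "cache", "branches", "branching", "branch"]
stems = [toy_stem(w) for w in words]

print(stems)  # ['cach', 'cach', 'cach', 'branch', 'branch', 'branch']
```
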
### Text Processing: Lemmatization
<img src="nlp_lemma_0.png">

Lemmatization uses a dictionary to map the variants of a word back to its root form (its lemma).

The default lemmatizer in NLTK uses the WordNet database to reduce words to their lemmatized forms.
```python
from nltk.stem.wordnet import WordNetLemmatizer
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
```

Another option involves overriding the Part-Of-Speech parameter, which defaults to noun, with `v` for verb:
```python
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
```
Notice that `lemmed` in the second code block is initialized in the first code block. The lemmatization procedure has been chained together to account for multiple parts of speech!

When choosing between lemmatization and stemming, note that stemming may be the less memory-intensive option, since it doesn't require a dictionary that maps each input to a predefined outcome.
<img src="nlp_lem_stem_0.png">

It is common to apply both: lemmatization first, and then stemming.
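
The dictionary-based idea behind lemmatization can be sketched with a plain lookup table. The five-entry `LEMMAS` dictionary is hypothetical and stands in for the role WordNet plays in NLTK; it also hints at why a lemmatizer needs more memory than a rule-based stemmer.

```python
# Hypothetical miniature lemma dictionary; WordNet plays this role in NLTK
LEMMAS = {"is": "be", "was": "be", "were": "be", "mice": "mouse", "started": "start"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

words = ["the", "mice", "were", "started"]
lemmed = [toy_lemmatize(w) for w in words]

print(lemmed)  # ['the', 'mouse', 'be', 'start']
```

Irregular forms such as "mice" and "were" cannot be recovered by stripping suffixes, which is where a dictionary earns its cost.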

### Text Processing: Summary
<img src="nlp_proc_sum_0.png">

## Stage 2: Feature Extraction

Once our text has been converted to something usable, clean, and simplified, it may need to be transformed in a way that an algorithm can handle.

Letters and numbers have symbolic representations in ASCII, but one letter cannot be meaningfully compared to another letter or to a number (in most cases). We are generally interested in comparing these values not as individual letters and numbers, but as combinations of them that represent words, and the computer has no way to make sense of those representations on its own.

We must extract some features from the texts to make them meaningful for the machine to interpret!

[WordNet visualization tool](http://mateogianolio.com/wordnet-visualization/)

### Feature Extraction: Bag of Words

* Bag of words model: treats each document to be analyzed as an unordered "bag" of words
* Document: Unit of text to be analyzed
* Corpus: set of documents
* Vocabulary: set of unique words/tokens in the corpus
* Document-Term-Matrix: representation of tokens and frequencies per document in corpus
<img src="nlp_dtm_0.png">
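
As a rough sketch, a document-term matrix can be built with the standard library alone. The three-document corpus is made up for illustration, and tokenization here is a plain whitespace split.

```python
from collections import Counter

# Hypothetical corpus of three short documents
corpus = ["dogs are the best", "cats are the best pets", "dogs chase cats"]

tokenized = [doc.split() for doc in corpus]

# Vocabulary: sorted set of unique tokens across the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# Document-term matrix: one row per document, one column per vocabulary term
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['are', 'best', 'cats', 'chase', 'dogs', 'pets', 'the']
for row in dtm:
    print(row)
# [1, 1, 0, 0, 1, 0, 1]
# [1, 1, 1, 0, 0, 1, 1]
# [0, 0, 1, 1, 1, 0, 0]
```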

Measures of similarity:

1. Dot-Product: $a\cdot b=\sum_{i=0}^{n} a_{i}b_{i}=a_{0}b_{0}+a_{1}b_{1}+...+a_{n}b_{n}$
2. Cosine-Similarity: $\cos\left(\theta\right)=\frac{a\cdot b}{\lVert a \rVert \cdot \lVert b \rVert}$
>* vectors pointing in the same direction = 1
>* vectors pointing in opposite directions = -1
>* orthogonal vectors (no relation) = 0

<img src="nlp_meas_0.png">
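
Both measures are short enough to write out directly. This is a minimal sketch on two hypothetical count vectors; it assumes neither vector is all zeros, since the cosine is undefined in that case.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms (vectors must be nonzero)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two hypothetical term-count vectors over the same vocabulary
a = [1, 1, 0, 0, 1, 0, 1]
b = [1, 1, 1, 0, 0, 1, 1]

print(dot(a, b))                          # 3
print(round(cosine_similarity(a, b), 4))  # 0.6708
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Unlike the raw dot product, cosine similarity is normalized by vector length, so a long document is not scored as more similar merely because its counts are larger.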

### Feature Extraction: Term-Frequency Inverse-Document-Frequency (TF-IDF)

* TF-IDF: a weight for each term in a document that increases with the term's frequency in that document and decreases with the number of documents in the corpus that contain the term

>* A way to assign weights to words that signify their relevance in documents

* **Term Frequency:** raw count of a term t in a document d, divided by the total number of terms in the document d

* **Inverse Document Frequency:** logarithm of the ratio between the total number of documents in the collection D and the number of documents d in D that contain the term t

<img src="nlp_tfidf_0.png">

$$tfidf\left(t = term,\ d = document,\ D = corpus\right) = tf\left(t,\ d \right) \cdot idf\left(t,\ D\right) $$
$$tf\left(t,\ d\right) = count\left(t,\ d\right) \div \vert d \vert$$
$$idf\left(t,\ D\right) = \log\left(\vert D \vert \div \vert \left\{ d \in D : t \in d \right\} \vert \right)$$
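
The three formulas translate almost line for line into code. This sketch uses a made-up tokenized corpus and the natural logarithm (the base is an assumption, since the formulas do not fix one), and it assumes the term occurs in at least one document so the denominator is never zero.

```python
import math

# Hypothetical corpus, already tokenized
corpus = [
    ["dogs", "are", "the", "best"],
    ["cats", "are", "the", "best", "pets"],
    ["dogs", "are", "loyal"],
]

def tf(term, doc):
    """Raw count of the term in the document, divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("are", corpus[0], corpus))             # 0.0
print(round(tfidf("pets", corpus[1], corpus), 4))  # 0.2197
```

A term such as "are" that appears in every document gets an IDF of log(1) = 0, so its weight vanishes, while a term confined to one document, such as "pets", is weighted up.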