
# Natural Language Processing

## Summary

- NLP stands for Natural Language Processing (not to be confused with Neuro-Linguistic Programming).
- It uses programming and machine learning techniques to understand and make use of large amounts of text data.

## Use Cases

- **Voice of Customer Analytics:** Improve products and services by analyzing customer interactions (support emails, social media posts, online comments, telephone transcriptions) to discover which factors drive the most positive and negative experiences. For example, extract key phrases and topics from open-ended survey responses to surface the central ideas that can lead to actionable insights.

- **Semantic Search:**  Provide a better search experience by enabling your search engine to index key phrases, entities, and sentiment. This enables you to focus the search on the intent and the context of the articles instead of basic keywords.

- **Knowledge Management & Discovery:** Organize and categorize your documents by topic for easier discovery (topic modeling). For example, personalize content recommendations by suggesting other articles on the same topic, or help secure documents by flagging those that contain sensitive material.

## Methods

- Text Classification: 

    - Assign tags or categories to text according to its content.
    
    - Applications: sentiment analysis, topic labeling, spam detection, and intent detection.
    
    - Similar to topic modeling, but supervised: the set of possible classes is defined in advance.


- Topic Modeling: 

    - Discover the abstract “topics” that occur in a collection of documents.
    
    - Latent Dirichlet Allocation (LDA) is a commonly used algorithm.
    
    - Similar to text classification, but unsupervised (like clustering): the set of possible topics is not known in advance. The topics are discovered as part of fitting the topic model.


## General Process

![NLP Process Workflow](https://cdn-images-1.medium.com/max/2000/1*BiVCmiQtCBIdBNcaOKjurg.png)

1. Processing & Understanding Text
2. Feature Engineering & Text Representation
3. Supervised Learning Models for Text Data
4. Unsupervised Learning Models for Text Data
5. Advanced Topics

### Wrangle
- Acquire your corpus (your sample, your dataset) of documents (your observations).

- Convert to sentences (if applicable).

- Normalize text (as applicable to your use case).

    - Make all text lowercase.
    
    - Remove accents, special characters, numbers, punctuation.
    
    - Stem or lemmatize words.
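
The normalization steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a complete pipeline: the `normalize` function name is hypothetical, and stemming and lemmatization are left out because they typically rely on a dedicated library.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and keep only letters and single spaces."""
    text = text.lower()
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining
    # accent), then drop everything that is not plain ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace digits, punctuation, and other special characters with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Café #42: It was the BEST of times!"))
# cafe it was the best of times
```

Which of these steps to apply depends on the use case. Removing numbers, for instance, would hurt a task where quantities matter.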

### Tokenize and Prep

- Tokenize: Break text up into tokens, linguistic units such as words. Units could be:

    - Individual words
    
    - n-grams: Contiguous sequences of n words. When computing n-grams, you typically slide the window forward one word at a time, e.g., 'data science' is a bi-gram and 'codeup data science' is a tri-gram.

- Remove Stopwords
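
As a rough sketch of these steps, the snippet below tokenizes on whitespace, builds n-grams with a sliding window, and filters stopwords. The function names and the four-word stopword set are assumptions made for illustration. Real projects usually take a tokenizer and a stopword list from an NLP library.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Split already-normalized text into word tokens on whitespace."""
    return text.split()


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Slide a window of n tokens across the list, one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A tiny, hand-picked stopword set, for illustration only.
STOPWORDS = {"it", "was", "the", "of"}


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenize("it was the best of times")
print(ngrams(tokens, 2))
# [('it', 'was'), ('was', 'the'), ('the', 'best'), ('best', 'of'), ('of', 'times')]
print(remove_stopwords(tokens))
# ['best', 'times']
```

Note that the n-grams here are computed before stopword removal. Doing it afterwards would produce different pairs, such as ('best', 'times').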

### POS Tagging & Chunking (not always used)

- POS (Part-of-Speech) Tagging:

    - Explains how a word is used in a sentence. 
    
    - 8 main parts of speech - nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.


- Chunking:

    - Extracting phrases from unstructured text. 
    
    - Simple tokens may not capture the actual meaning of the text, so it's advisable to treat a phrase such as “South Africa” as a single unit instead of the separate words “South” and “Africa”. In this case the phrase is also a bi-gram (a two-word n-gram).
    
    - Uses POS tags as input and provides chunks as output.
    
    - As with POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).
    
    - Use cases: extracting locations and person names, and named entity extraction in general.
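
To make the "POS tags in, chunks out" idea concrete, here is a toy sketch. It is not a real chunker: the tags are written by hand (using Penn Treebank-style labels such as `NNP` for proper noun, which is an assumption not taken from these notes), and the hypothetical `proper_noun_chunks` function only groups runs of consecutive proper nouns. A real pipeline would get tags from a POS tagger and apply a chunk grammar.

```python
from typing import List, Tuple

# Hand-written (token, POS tag) pairs. A real pipeline would get these
# from a POS tagger instead.
tagged = [
    ("Nelson", "NNP"), ("Mandela", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("South", "NNP"), ("Africa", "NNP"),
]


def proper_noun_chunks(tagged_tokens: List[Tuple[str, str]]) -> List[str]:
    """Group runs of consecutive proper nouns (NNP) into single phrases."""
    chunks, current = [], []
    for token, tag in tagged_tokens:
        if tag == "NNP":
            current.append(token)
        elif current:
            # A non-NNP token ends the current run.
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append(" ".join(current))
    return chunks


print(proper_noun_chunks(tagged))
# ['Nelson Mandela', 'South Africa']
```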


### Vectorization
    
- Treat each sentence as a separate document.
    
- Make a list of all words from all documents.
    
- Create vectors. Vectors convert text into a numeric form that a machine learning algorithm can use.
    
- Input:
    - `“It was the best of times”`
    - `“It was the worst of times”`
    - `“It was the age of wisdom”`
    - `“It was the age of foolishness”`
    
- Output: 
    - Dictionary = `[‘It’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’]`
    - `“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]`
    - `“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]`
    - `“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]`
    - `“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]`
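
The input and output above can be reproduced with a few lines of plain Python. This is a minimal sketch that builds the dictionary in order of first appearance and marks each word as present (1) or absent (0). The variable names are illustrative.

```python
sentences = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

# Build the dictionary in order of first appearance across all sentences.
dictionary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in dictionary:
            dictionary.append(word)

# One vector per sentence: 1 if the dictionary word occurs in it, else 0.
vectors = [
    [1 if word in sentence.split() else 0 for word in dictionary]
    for sentence in sentences
]

print(dictionary)
for sentence, vector in zip(sentences, vectors):
    print(sentence, "=", vector)
```

Each vector has one position per dictionary word, so its length grows with the vocabulary and most entries are 0 in a realistic corpus.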

### Features

- Bag of Words
    
    - It is a way of extracting features from the text for use in machine learning algorithms.
    
    - Using the dictionary of all words, count how many times each word appears in each document.
    
    - The result is a sparse matrix of documents × tokens that holds the token counts.
    
    - scikit-learn's `CountVectorizer()`


- Normalized Count Occurrence
    
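    - Term Frequency (TF): (count of keyword in given document)/(count of all words in given document)
    
    - This is TF-IDF without the IDF.
    
    - `TfidfVectorizer(use_idf=False, norm='l2')`. Note that `norm='l2'` scales each row to unit Euclidean length, while `norm='l1'` gives the count-over-total ratio shown above.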


- TF-IDF (Term Frequency - Inverse Document Frequency)
    
    - Based on the idea that words that are frequent everywhere carry little information, i.e., rarer words receive more weight in the model.
    
    - A word's importance increases with the number of times it occurs within a document and decreases with the number of documents in the corpus that contain it.

    - TF: (count of keyword in given document)/(count of all words in given document)
    
    - IDF: log((number of documents)/(number of documents containing the keyword))

    - TF-IDF = TF × IDF

    - `TfidfVectorizer(use_idf=True, norm='l2')`


- Word Embedding

    - Representation of document vocabulary.
    
    - It can capture the context of a word in a document, such as semantic and syntactic similarity and relations with other words.
    
    - A word embedding is a vector representation of a particular word.
    
    - **Word2Vec** is one of the most popular techniques for learning word embeddings with a shallow neural network. It has two training architectures: (1) Skip-gram and (2) Continuous Bag of Words (CBOW).
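
The count-based features in this section (bag of words, term frequency, and TF-IDF) can be sketched in plain Python as follows. This is a minimal illustration using the textbook definition, where TF-IDF is the product of TF and a log-scaled IDF. The function names are hypothetical, and the numbers will not match `TfidfVectorizer`, which by default applies a smoothed IDF and then normalizes each row.

```python
import math
from collections import Counter

corpus = [
    "it was the best of times".split(),
    "it was the worst of times".split(),
    "it was the age of wisdom".split(),
    "it was the age of foolishness".split(),
]

# Bag of words: raw token counts per document.
counts = [Counter(tokens) for tokens in corpus]


def term_frequency(tokens):
    """Count of each term divided by the number of tokens in the document."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}


def inverse_document_frequency(documents):
    """log(number of documents / number of documents containing the term)."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


tf = [term_frequency(tokens) for tokens in corpus]
idf = inverse_document_frequency(corpus)
# TF-IDF: multiply each term's TF by its IDF.
scores = [{term: value * idf[term] for term, value in doc_tf.items()} for doc_tf in tf]

print(counts[0]["best"])  # 1
print(scores[0]["it"])    # 0.0, because "it" occurs in every document
print(scores[0]["best"] > scores[2]["age"])  # True: "best" is rarer than "age"
```

Words that appear in every document get an IDF of log(1) = 0 and therefore drop out, which is the "rare words carry more weight" behaviour described above.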


## Vocabulary

- Entities: Identify the type of entity extracted, such as a person, place, or organization, using Named Entity Recognition (NER).
- Stemming: Reduce words to their root, or stem, by chopping off suffixes. For example, 'running' and 'runs' become 'run'.
- Lemmatization: Return the base or dictionary form of a word, which is the lemma. For example, 'better' becomes 'good' and 'walking' becomes 'walk'. Lemmatization tries to use context (such as the part of speech) to choose the lemma, whereas stemming just chops the word down to a truncated root form.
- Tokenization: Breaking text up into linguistic units such as words or n-grams.
- Corpus: Set of documents, dataset, sample, etc.
- Document: A single observation, like the body of an email.
- Sentence: What it sounds like. Sometimes you want to treat each sentence as a separate document, and sometimes you want to merge all the sentences into a single document.

https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

![Analysis of a sentence](https://cdn.glitch.com/c1e65908-81db-4c5b-8274-40cc385dfa54%2Fparts_of_speach_diagramming.jpg?v=1565032453346)