# Chapter-1 NLP: A PRIMER

**Natural Language Processing** 

It is an area of computer science that deals with methods to analyze, model, and understand human language.

### Flow of this Chapter

- Overview of applications of NLP in real-world scenarios

- Various tasks that form the basis for building different NLP applications

- Understanding of language from an NLP perspective and why NLP is difficult

- Overview of heuristics, ML & DL

- Introduction to a few commonly used algorithms in NLP

- Walkthrough of an NLP application

- Overview of the rest of the topics in this book

### Organization of Chapters in terms of NLP tasks and applications

1. **Core Tasks** (Chapters 3-7)


- Text Classification
- Information Extraction
- Conversational Agent
- Information Retrieval
- Question Answering Systems

2. **General Applications** (Chapters 4-7)



- Spam Classification
- Calendar Event Extraction
- Personal Assistants
- Search Engines
- Jeopardy!

3. **Industry Specific** (Chapters 8-10)


- Social Media Analysis
- Retail Catalog Extraction
- Health Records Analysis
- Financial Analysis
- Legal Entity Extraction

# NLP in the Real World

### Core Applications

**E-Mail Applications** (Ch-4 & 5)

- GMail, Outlook, etc.
- provide features like spam classification, priority inbox, calendar event extraction, auto-complete, etc.

**Voice-based Assistants** (Ch-6)

- Apple Siri, Google Assistant, Microsoft Cortana, Amazon Alexa
- rely on a range of NLP techniques to interact with the user, understand user commands and respond accordingly

**Modern Search Engines** (Ch-7)


- Google and Bing
- use NLP heavily for various subtasks like query understanding, query expansion, question answering, information retrieval and ranking and grouping of the results


**Machine Translation Services** (Ch-7)


- Google Translate, Bing Microsoft Translator and Amazon Translate
- direct applications of NLP

### Other Applications

- Organizations analyze their **social media feeds** to understand the voice of their customers (Ch-8)

- NLP is widely used to solve use cases on **e-commerce platforms** (Amazon) like extracting relevant information from product descriptions, understanding user reviews, etc. (Ch-9)

- To solve use cases in domains such as **healthcare, finance and law** (Ch-10)

- Companies like Arria use NLP techniques to **automatically generate reports for various domains** (weather forecasting, financial services, etc.)

- NLP forms the backbone of **spelling- and grammar-correction tools** (Grammarly, spell check in Microsoft Word & Google Docs)

- On the popular quiz show ***Jeopardy!***, **Watson AI** won first prize. It was built using NLP techniques.

- NLP is used in a range of **learning and assessment tools and technologies** (automated scoring in GRE, plagiarism detection like Turnitin, intelligent tutoring systems, language learning apps like Duolingo)

- NLP is used to build **large knowledge bases** (Google Knowledge Graph)

# NLP Tasks

Fundamental tasks that appear frequently across various NLP projects:

i. **Language Modeling**


- predicting what the next word in a sentence will be based on history of previous words
- Goal -> to learn probability of a sequence of words appearing in a given language
- Uses -> speech recognition, optical character recognition, handwriting recognition, machine translation & spelling correction
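As a rough illustration of the goal stated above, a language model can be approximated by counting which word follows which. The sketch below is a bigram model with maximum-likelihood estimates; the three-sentence corpus, the `<s>`/`</s>` boundary markers and all function names are assumptions made for this example, and real language models are trained on far larger corpora with smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are hypothetical start/end markers for this sketch.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            follower_counts[prev][curr] += 1
    return follower_counts


def next_word_probability(model, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(model[prev].values())
    return model[prev][word] / total if total else 0.0


def sentence_probability(model, sentence):
    """Probability of the whole word sequence under the bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= next_word_probability(model, prev, curr)
    return prob


# Toy corpus, invented for illustration only.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)

print(next_word_probability(model, "the", "cat"))   # "cat" follows "the" in 2 of 3 cases
print(sentence_probability(model, "the cat sat"))
print(sentence_probability(model, "the dog ran"))   # contains an unseen bigram
```

Without smoothing, any sentence containing an unseen word pair gets probability zero, which is one reason practical language models do not rely on raw counts alone.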



ii. **Text Classification**


- task of bucketing the text into a known set of categories based on its content
- most popular task in NLP
- Uses -> email spam identification, sentiment analysis

iii. **Information Extraction**


- task of extracting relevant information from text
- Uses -> calendar event extraction from emails, extracting names of people mentioned in a social media post, etc.


iv. **Information Retrieval**


- task of finding documents relevant to a user query from a large collection, e.g: Google search

v. **Conversational Agent**


- task of building dialogue systems that can converse in human languages, e.g: Alexa, Siri, etc.


vi. **Text Summarization**


- task of creating short summaries of longer documents while retaining core content and preserving overall meaning of text

vii. **Question Answering**


- task of building a system that can automatically answer questions posed in natural language


viii. **Machine Translation**


- task of converting a piece of text from one language to another, e.g: Google Translate



ix. **Topic Modeling**


- task of uncovering the topical structure of a large collection of documents
- a common text mining tool used in domains like literature, bioinformatics, etc.

# Language from NLP Perspective

- **Language** is a structured system of communication that involves complex combinations of its constituent components like characters, words, sentences, etc.



- **Linguistics** is the systematic study of language.



- To study NLP, it is important to understand some concepts of linguistics.



- Human language can be thought of as composed of four major building blocks:


>   1. **Context** (meaning) : Applications -> Summarization, Topic Modeling, Sentiment Analysis 
>   2. **Syntax** (phrases & sentences) : Applications -> Parsing, Entity Extraction, Relation Extraction
>   3. **Morphemes & Lexemes** (words) : Applications -> Tokenization, Word Embeddings, POS Tagging
>   4. **Phonemes** (speech & sounds) : Applications -> Speech to text, Speaker Identification, Text to Speech

### PHONEMES

- **Phonemes** are the smallest units of sound in a language
- they don't have any meaning by themselves
- they can induce meaning when uttered in combination with other phonemes
- Standard English has 44 phonemes (single letters / combination of letters)

Examples:
- **Consonant phonemes** -> /b/ - bat, /s/-sun, /k/-cat, /sh/-shop, /p/-pen, /ng/-ring
- **Vowel phonemes** -> /a/-ant, /oi/-coin, /e/-egg, /ear/-dear, /oa/-boat, /ow/-cow

### MORPHEMES & LEXEMES

- A **Morpheme** is the smallest unit of language that has a meaning
- It is formed by a combination of phonemes.
- Not all morphemes are words.
- All prefixes and suffixes are morphemes, e.g: in the word `multimedia`, `multi-` is a morpheme

Examples:  
> unbreakable => un + break + able  
> cats => cat + s  
  (*these morphemes are just constituents of full words*)
  
> tumbling => tumble + ing  
> unreliability => un + rely + able + ity  
(*there is some variation when words are broken into morphemes*)

- **Lexemes** are structural variations of morphemes related to one another by meaning, e.g: `run` and `running` belong to the same lexeme form


- **Morphological Analysis** is a foundational block of many NLP tasks such as tokenization, stemming, learning word embeddings & POS tagging.
- It analyzes the structure of a word by studying its morphemes and lexemes.

### SYNTAX

- **Syntax** is a set of rules to construct grammatically correct sentences out of words and phrases in a language.
- In linguistics, there are many ways to represent syntactic structure.
- A common example => Parse Tree

Here,  
**N** = Noun  
**V** = Verb  
**P** = Preposition  
**NP** = Noun Phrase  
**VP** = Verb Phrase  

- A **Parse Tree** captures the hierarchical structure of language => words at the lowest level, followed by Parts-of-Speech tags, followed by phrases, and ending with the sentence at the highest level
- **Parsing** is the NLP task that constructs such trees automatically.
- Other NLP tasks, such as entity extraction and relation extraction, can be built on this knowledge of parsing (Ch-5)
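To make the hierarchy concrete, here is a minimal sketch of a parse tree stored as nested tuples, using the N, V, NP and VP labels listed above plus `S` for the sentence. The example sentence and the helper names are made up for illustration; real parsers produce such trees automatically rather than having them written by hand.

```python
# A parse tree as nested tuples: (label, child, child, ...). Leaves are words.
# Hand-written tree for the invented sentence "Dogs chase cats".
tree = (
    "S",
    ("NP", ("N", "Dogs")),
    ("VP", ("V", "chase"), ("NP", ("N", "cats"))),
)


def leaves(node):
    """Return the words at the lowest level of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def pos_tags(node):
    """Return (word, tag) pairs; a tag is a node whose only child is a word."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(pos_tags(child))
    return pairs


print(leaves(tree))    # ['Dogs', 'chase', 'cats']
print(pos_tags(tree))  # [('Dogs', 'N'), ('chase', 'V'), ('cats', 'N')]
```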

### CONTEXT

- **Context** is how various parts in a language come together to convey a particular meaning.
- It includes:
> - long-term references
> - world knowledge
> - common sense
> - literal meaning of words and phrases
- Meaning of a sentence can change based on context
- Context is composed of two parts:

> 1. Semantics: direct meaning of the words and sentences without external context
> 2. Pragmatics: adds world knowledge & external context of the conversation to enable us to infer implied meaning

# Why is NLP Challenging?

There are two characteristics of human language that make NLP a demanding area to work in:-
- Ambiguity
- Creativity

### AMBIGUITY


- **Ambiguity** means uncertainty of meaning
- e.g: "I made her duck"  

> 1st possible meaning:- I cooked a duck for her.  
> 2nd possible meaning:- I made her bend down to avoid an object.

- Which of the above two meanings applies depends on the context in which the sentence appears.

> Possible context of 1st:- story about a mother and a child.  
> Possible context of 2nd:- a book about sports

- Ambiguity increases even more in the case of figurative language, e.g. idioms

- Examples of ambiguity in language

> The man couldn't lift his son because he was so weak => ***Who was weak?***  
> Joan made sure to thank Susan for all the help she had given => ***Who had given help?***  
> John promised Bill to leave, so an hour later he left => ***Who left an hour later?***

- The above examples are easily disambiguated by a human but cannot be solved by most NLP techniques.

### COMMON KNOWLEDGE

- It is the set of all facts that most humans are aware of.


- e.g: "`Man bit dog`" & "`Dog bit man`"


- We know that the 1st is unlikely to happen and the 2nd is very possible. This common knowledge was not mentioned in either of the two sentences.


- The computer would find it very difficult to differentiate between the two sentences as it lacks common knowledge.


- One of the key challenges of NLP => "***How to encode all the things that are common knowledge to humans in a computational model.***"

### CREATIVITY

- Language is not just rule driven => it also has a creative aspect to it.



- This creativity shows up in the various styles, dialects, genres and variations used in any language, e.g. poems.



- Making machines understand creativity is a hard problem not just in NLP, but AI in general.

### DIVERSITY ACROSS LANGUAGES

- Porting an NLP solution from one language to another is hard.


- A solution that works for one language might not work at all for another language.


- Two possible solutions:

> 1. build a solution that is language agnostic => this is conceptually very hard
> 2. build separate solutions for each language => laborious and time intensive

#### All above issues make NLP a challenging, yet rewarding domain to work in.

# Machine Learning, Deep Learning & NLP: An Overview

- **Artificial Intelligence (AI)** is a branch of computer science that aims to build systems that can perform tasks that require human intelligence.


- **Machine Learning (ML)** is a branch of AI that deals with the development of algorithms that can learn to perform tasks automatically based on a large number of examples, without requiring handcrafted rules.


- **Deep Learning (DL)** is a branch of ML that is based on artificial neural network architectures.


- ML, DL & NLP are all subfields within AI.


- Early NLP applications were based on rules and heuristics.

- In the past few decades, NLP application development has been heavily influenced by methods from ML.

- More recently, DL has also been frequently used to build NLP applications.

### MACHINE LEARNING (ML)

- ***Goal of ML*** => to learn to perform tasks based on examples (`training data`) without explicit instructions. This is done by creating a numeric representation (features) of the training data and using this representation to learn the patterns in those examples.


- ML algorithms can be divided into three categories:-


> 1. **Supervised Learning**: To learn the mapping function from input to output given a large number of examples in the form of input-output pairs (`training data`), e.g: spam e-mail classification
> 2. **Unsupervised Learning**: A set of ML methods that aim to find hidden patterns in given input data without any reference output, e.g: Topic Modeling  
> [**Semi-supervised Learning**: These techniques use both - a small labeled dataset and a large unlabeled dataset - to learn the task at hand.]
> 3. **Reinforcement Learning**: It deals with methods to learn tasks via trial and error. It is characterized by absence of either labeled or unlabeled data in large quantities.

# Approaches to NLP

- The different approaches used to solve NLP problems fall into three categories:-
> 1. Heuristics
> 2. Machine Learning
> 3. Deep Learning

### 1. Heuristics-based NLP

- Building rules for the task at hand.

- **Limitations**
    - developers are required to have some domain expertise
    - such systems require resources like dictionaries and thesauruses
    
- **Example**: Lexicon-based sentiment analysis => it counts positive and negative words in text to deduce the sentiment of text (Ch-4)

- **Knowledge Bases built in NLP**


1. **`WordNet`**: database of words and the semantic relationships between them.   
e.g: *Synonyms*: words with similar meanings  
*Hyponyms*: ***is-type-of*** relationships, e.g: baseball, sumo wrestling & tennis are hyponyms of sports  
*Meronyms*: ***is-part-of*** relationships, e.g: hands & legs are meronyms of body  


2. **`Open Mind Common Sense`**: Knowledge base in which common sense world knowledge has been incorporated


Both of the above knowledge bases are lexical resources based on word-level information.
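The lexicon-based sentiment analysis mentioned earlier can be sketched in a few lines. The two word lists below are tiny, hand-made stand-ins for a real sentiment lexicon, and the function name is an assumption for this example; the approach simply counts positive and negative words and compares the totals.

```python
# Tiny hand-made lexicons, invented for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "poor", "sad"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    # Lowercase each token and strip surrounding punctuation.
    tokens = [token.strip(".,!?;:").lower() for token in text.split()]
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


print(lexicon_sentiment("I love this great phone!"))          # positive
print(lexicon_sentiment("Terrible battery and poor screen."))  # negative
print(lexicon_sentiment("The box arrived today."))             # neutral
```

A heuristic like this ignores negation ("not good") and sarcasm, which is one of the gaps that ML-based approaches try to close.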

- **Rule-based Systems**: They go beyond words and can incorporate other forms of information too.


1. **`Regex (Regular Expressions)`**

    - Great tool for text analysis
    - A set of characters or a pattern that is used to match and find substrings in text.
    - e.g: To find email IDs in a piece of text, we can use a regex such as `([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$` (the trailing `$` anchors the match to the end of the string)
    
    - An NLP toolkit => ***Stanford CoreNLP*** => includes TokensRegex (a framework for defining regexes)
    - **Regexes** are used for deterministic matches (it's either a match or it's not)
    - **Probabilistic Regexes** is a sub-branch that addresses this limitation by including a probability of a match, e.g: software libraries like ***pregex***
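As a minimal sketch of deterministic regex matching, the snippet below uses Python's standard `re` module with a simplified variant of the email pattern. Two assumptions differ from the pattern quoted in these notes: the capture groups are dropped so that `findall` returns whole addresses, and the end-of-string anchor is removed so that addresses anywhere in the text are found. The sample text and names are invented.

```python
import re

# Simplified email pattern: local part, "@", domain, ".", 2-5 letter suffix.
# Note there is no space inside {2,5}; "{2, 5}" would not act as a quantifier.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_\-.]+@[a-zA-Z0-9_\-.]+\.[a-zA-Z]{2,5}")


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(find_emails(sample))
# ['alice@example.com', 'bob.smith@mail.example.org']
```

This pattern is deliberately loose; it is meant to show the mechanics of matching, not to validate addresses against any standard.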
    

2. **`CFG (Context-Free Grammar)`**

    - A type of formal grammar that is used to model natural languages.
    - Invented by Prof. Noam Chomsky (renowned linguist and scientist)
    - CFGs can be used to capture more complex and hierarchical information that a regex might not.
    - The **Earley parser** allows parsing all kinds of CFGs.
    - **JAPE (Java Annotation Patterns Engine)** can model more complex rules and grammars. It has features from both regexes as well as CFGs.
    - **GATE (General Architecture for Text Engineering)** is used for building text extraction for closed and well-defined domains where accuracy and completeness of coverage are more important.
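To give a feel for how a CFG captures hierarchical structure, here is a small recognizer for a toy grammar. It uses the CYK algorithm rather than the Earley parser named above, because CYK is shorter to write; it assumes the grammar is in Chomsky normal form (every rule produces either two nonterminals or one word). The grammar, vocabulary and function name are all invented for this sketch.

```python
# Toy grammar in Chomsky normal form (hypothetical, for illustration):
#   S  -> NP VP
#   VP -> V NP
#   NP -> "dogs" | "cats"
#   V  -> "chase"
BINARY_RULES = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
LEXICAL_RULES = {"dogs": {"NP"}, "cats": {"NP"}, "chase": {"V"}}


def cyk_recognize(words, start="S"):
    """Return True if the grammar can derive the word sequence from start."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL_RULES.get(word, set()))
    for span in range(2, n + 1):            # length of the sub-sequence
        for i in range(n - span + 1):       # start of the sub-sequence
            j = i + span - 1                # end of the sub-sequence
            for k in range(i, j):           # split point between left and right
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


print(cyk_recognize("dogs chase cats".split()))  # True
print(cyk_recognize("chase dogs cats".split()))  # False
```

The second call fails because no rule combines a verb phrase with a trailing noun phrase, which is the kind of structural constraint a flat regex cannot express as naturally.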

- **Importance of Rules & Heuristics**

    - They help to quickly build the first version of the model and get a better understanding of problem at hand. (Ch-4 & 11)
    - Can be useful as features for machine learning-based NLP systems.
    - They are used to plug the gaps in the system (where statistical, ML or DL techniques will make mistakes)

### 2. Machine Learning for NLP

- Supervised ML techniques => classification & regression => are heavily used for various NLP tasks.
- **Example of classification**: to classify news articles into a set of news topics like sports and politics
- **Example of regression**: to estimate price of a stock based on processing the social media discussion about that stock
- Unsupervised ML techniques like clustering can be used to club together text documents.



- 3 common steps of any ML approach to NLP:-  
    i. extracting features from text  
    ii. using the feature representation to learn a model (Ch-3)
    iii. evaluating and improving the model (Ch-2)

- Some commonly used supervised ML methods for NLP (2nd step) are:- Naive Bayes, Support Vector Machine, Hidden Markov Model and Conditional Random Fields

### NAIVE BAYES

- Classic algorithm for classification tasks
- Mainly relies on Bayes' Theorem
- Using Bayes' Theorem, it calculates the probability of observing a class label given the set of features for the input data.
- Important characteristic => it assumes each feature is independent of all other features.  
e.g: In news classification task, we assume that domain-specific words (such as sports-specific or politics-specific) are not correlated to one another.
- Naive Bayes is used as a starting algorithm for text classification => because it is simple to understand and very fast to train and run.
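The points above can be sketched as a small multinomial Naive Bayes classifier written from scratch. The four training sentences, the add-one (Laplace) smoothing and the choice to skip words never seen in training are assumptions made for this example; in practice one would usually reach for an existing library implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log P(class) + sum of log P(word | class)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)          # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing; each word is treated as independent of the others.
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
training_data = [
    ("the team won the match", "sports"),
    ("a great goal in the final match", "sports"),
    ("the senate passed the bill", "politics"),
    ("the minister won the election vote", "politics"),
]
nb_model = train_naive_bayes(training_data)

print(predict(nb_model, "the team scored a goal"))  # sports
print(predict(nb_model, "the senate vote"))         # politics
```

Working in log space avoids multiplying many small probabilities together, and the independence assumption is what lets the per-word terms simply be added.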

### SUPPORT VECTOR MACHINE (SVM)

- Another popular classification algorithm.
- Goal of any classification problem => to learn a decision boundary that acts as a separation between different categories of text => this boundary can be linear or non-linear.
- An SVM can learn both a linear and non-linear decision boundary to separate data points belonging to different classes.
- **Strength of SVM**: Robustness to variation and noise in data
- **Weakness of SVM**: Time taken to train and inability to scale when there are large amounts of training data.

### HIDDEN MARKOV MODEL (HMM)

- HMM is a statistical model.
- It assumes that there is an underlying unobservable process with hidden states that generates the data (we can only observe the data once it is generated).
- An HMM tries to model the hidden states from this data.  
e.g: HMMs are used for POS tagging of text data.   
Here, the underlying unobservable process is => grammar and hidden states are => Parts-of-Speech (POS)
- **`Markov` Assumption**: Each hidden state is dependent on the previous state(s).
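A minimal sketch of how hidden POS tags can be recovered from observed words is the Viterbi algorithm, shown below for a two-tag HMM. All probabilities are invented for illustration (a real tagger estimates them from a tagged corpus), and the first-order Markov assumption is used, i.e. each tag depends only on the previous tag.

```python
# Hypothetical HMM parameters, hand-picked for this sketch.
STATES = ["NOUN", "VERB"]
START_PROB = {"NOUN": 0.8, "VERB": 0.2}
TRANSITION_PROB = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMISSION_PROB = {
    "NOUN": {"dogs": 0.4, "cats": 0.4, "chase": 0.2},
    "VERB": {"dogs": 0.05, "cats": 0.05, "chase": 0.9},
}


def viterbi(words):
    """Return the most probable hidden tag sequence for the observed words."""
    if not words:
        return []
    # best[t][state] = (probability of the best path ending in state, previous state)
    best = [{
        state: (START_PROB[state] * EMISSION_PROB[state].get(words[0], 0.0), None)
        for state in STATES
    }]
    for word in words[1:]:
        column = {}
        for state in STATES:
            # Pick the previous state that maximizes the path probability.
            prob, prev = max(
                (best[-1][p][0] * TRANSITION_PROB[p][state], p) for p in STATES
            )
            column[state] = (prob * EMISSION_PROB[state].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


print(viterbi(["dogs", "chase", "cats"]))  # ['NOUN', 'VERB', 'NOUN']
```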

### CONDITIONAL RANDOM FIELDS (CRF)   (Ch-5, 6 & 9)

- Another algorithm used for sequential data.
- CRF essentially performs a classification task on each element in the sequence.
- **Why CRFs are better?**: Because CRF takes the sequential input and the context of tags into consideration.
- CRFs outperform HMMs for tasks such as POS Tagging.

### 3. Deep Learning for NLP

- In the last few years, we have seen a huge surge in the use of neural networks to deal with complex, unstructured data.
- Following are a few popular deep neural network architectures that have become the status quo in NLP:-

### RECURRENT NEURAL NETWORKS (RNNs)

- Language is inherently sequential.
- A sentence in any language flows from one direction to another.
- A model that can progressively read an input text from one end to another can be very useful for language understanding.
- RNNs are specially designed to keep such sequential processing and learning in mind.
- RNNs have **neural units** that are capable of remembering what they have processed so far.
- This memory is **temporal**, and the information is stored and updated with every time step as the RNN reads the next word in the input.
- RNNs are used in NLP tasks like => Text classification, NER, machine translation, etc.
- RNNs can also be used to generate text where the goal is to read the preceding text and predict the next word or next character.
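The temporal memory described above can be sketched as a single recurrence: at each time step the new hidden state is computed from the current input and the previous hidden state. The snippet below is a plain-Python forward pass of a basic (Elman-style) RNN with hand-picked, untrained weights; the weight values, the 2-dimensional vectors and the function names are assumptions for illustration only.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One time step: h_t = tanh(W_xh . x_t + W_hh . h_prev + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Read the input vectors one by one, carrying the hidden state forward."""
    h = [0.0] * len(b_h)          # the memory starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Hypothetical, untrained parameters.
W_xh = [[0.5, -0.3], [0.1, 0.8]]   # input -> hidden
W_hh = [[0.2, 0.1], [-0.4, 0.6]]   # hidden -> hidden (the "memory" connection)
b_h = [0.0, 0.0]

# Two made-up word vectors, fed in both orders.
word_a, word_b = [1.0, 0.0], [0.0, 1.0]
forward = run_rnn([word_a, word_b], W_xh, W_hh, b_h)
backward = run_rnn([word_b, word_a], W_xh, W_hh, b_h)

print(forward[-1])
print(backward[-1])  # differs from forward[-1]: the final state depends on word order
```

Because the hidden state is fed back into the next step, the same two words in a different order leave the network in a different final state, which is what makes the model sensitive to sequence.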

### LONG SHORT-TERM MEMORY (LSTM)

- **Disadvantages of RNNs**: Forgetful memory => they can't remember longer contexts => so they don't perform well when the input text is long (which is usually the case).
- LSTMs circumvent this problem by letting go of irrelevant context and only remembering the part of the context that is needed to solve the task at hand.
- This relieves the load of remembering very long context in one vector representation.
- **Gated Recurrent Units (GRUs)**: Another variant of RNNs that are used mostly in language generation.
- Specific use of LSTMs in various NLP applications => Ch-4, 5, 6, 9.

### CONVOLUTIONAL NEURAL NETWORK (CNN)

- CNNs are used heavily in computer vision tasks like image classification, video recognition, etc.
- In NLP, CNNs have seen success in **text-classification tasks**.

    - One can replace each word in a sentence with its corresponding word vector. All vectors are of the same size (d).
    - They can be stacked one over another to form a matrix or 2D array of dimension n X d, where n is the number of words in sentence and d is size of word vectors.
    - This matrix can be treated similar to an image and can be modeled by a CNN.

- **Advantage of CNN**: Ability to look at a group of words together using a context window.
- Uses of CNNs for NLP => Ch-4
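As a rough sketch of the context-window idea, the snippet below slides one filter that spans two words at a time down an n X d sentence matrix, applies a ReLU, and then takes the maximum over all positions (max-over-time pooling). The 4 X 3 sentence matrix and the filter weights are invented for illustration; a trained CNN learns many such filters.

```python
def conv1d_over_words(sentence_matrix, kernel, bias=0.0):
    """Slide a (window x d) kernel down an (n x d) sentence matrix."""
    window = len(kernel)
    features = []
    for start in range(len(sentence_matrix) - window + 1):
        total = bias
        for offset in range(window):
            word_vector = sentence_matrix[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vector))
        features.append(max(0.0, total))   # ReLU activation
    return features


def max_over_time(features):
    """Keep only the strongest response of the filter across the sentence."""
    return max(features) if features else 0.0


# Hypothetical sentence of n=4 words, each a d=3 word vector.
sentence_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# One filter with a context window of 2 words (hand-picked weights).
kernel = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

features = conv1d_over_words(sentence_matrix, kernel)
print(features)                 # [2.0, 0.0, 1.0]
print(max_over_time(features))  # 2.0
```

The filter responds most strongly at the first position, where the two-word pattern it encodes appears, regardless of where in the sentence that pattern occurs.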

### TRANSFORMERS

- Transformers have achieved state-of-the-art results in almost all NLP tasks.
- They model the textual context but not in a sequential manner.


- **Working Mechanism**:
    - Given a word in the input, it looks at all the words around it (known as `SELF-ATTENTION`) and represents each word w.r.t. its context.
    - e.g: If the context of the word `bank` talks about finance, then `bank` probably denotes a financial institution. On the other hand, if the context mentions a river, then `bank` probably indicates a bank of the river.


- Due to this higher representation capacity of transformers as compared to other deep networks, transformers are heavily used in NLP.
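The self-attention idea can be sketched with scaled dot-product attention in plain Python. To keep it short, this sketch assumes the queries, keys and values are the word vectors themselves (a real transformer learns separate projection matrices for each), and the 2-dimensional embeddings for `river`, `money` and `bank` are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Represent each word as a weighted mix of all words in the input."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this word to every word, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        context = [
            sum(w * vector[i] for w, vector in zip(weights, vectors))
            for i in range(d)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical embeddings: dimension 0 ~ "nature", dimension 1 ~ "finance".
river, money, bank = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

river_outputs, river_weights = self_attention([river, bank])
money_outputs, money_weights = self_attention([money, bank])

print(river_outputs[1])  # representation of "bank" next to "river"
print(money_outputs[1])  # representation of "bank" next to "money"
```

The same input vector for `bank` ends up with a different contextual representation depending on its neighbor, which mirrors the finance-versus-river example above. Note that nothing in the computation depends on the order in which the words are visited.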


- **Large Transformers**: Recently, large transformers have been used for transfer learning with smaller downstream tasks.


- **Transfer Learning**: A technique in AI where the knowledge gained while solving one problem is applied to a different but related problem.


- **Mechanism of large transformers**: 

    - A very large transformer model is first trained in an unsupervised manner (PRE-TRAINING) to predict a part of a sentence given the rest of the content, so that it can encode the high-level nuances of language.
    - These models are trained on more than 40 GB of textual data, scraped from the whole internet.
    

- **BERT (Bidirectional Encoder Representations from Transformers)**:

    - An example of a large transformer, pre-trained on massive data and open sourced by Google.
    - This pre-trained model is then fine-tuned on downstream NLP tasks, such as text-classification, entity extraction, question answering, etc.
    - BERT works efficiently in transferring the knowledge for downstream tasks due to the sheer amount of pre-trained knowledge.
    - BERT and its applications => Ch-4, 6, 10.

### AUTOENCODERS

- An autoencoder is a different kind of network that is used mainly for learning ***compressed vector representation of the input***.
- **Example**: To represent a text by a vector, we can learn a mapping function from the input text to the vector.

    - To make this mapping function useful, we reconstruct input back from vector representation (Unsupervised learning)
    - After training, we collect the vector representation, which serves as an encoding of the input text as a dense vector.
    
- Autoencoders are used to create feature representations needed for any downstream tasks.
- Some variations of autoencoders like **LSTM autoencoders** can handle specific properties of sequential data like text.

## Why Deep Learning is not yet the Silver Bullet for NLP?

#### 1. Overfitting on small datasets

- DL models tend to have more parameters than traditional ML models.
- Many times, in development phase, sufficient training data is not available to train a complex network.
- In such cases, a simple model should be preferred over a DL model (Occam's Razor)
- DL models overfit on small datasets. This leads to poor generalization capability, which in turn leads to poor performance in production.

#### 2. Few-shot learning and synthetic data generation

- **Few-shot learning**: learning from very few training examples
- DL has made significant strides in few-shot learning and in models that can generate superior-quality images.
- These advances have made it feasible to train DL-based vision models on small amounts of data.
- Therefore, DL is widely adopted in solving problems in industrial settings.
- We have not yet seen similar DL techniques be successfully developed for NLP.

#### 3. Domain adaptation

- If we utilize a large DL model trained on datasets from common domains and apply the trained model to a newer domain that is different from the common domains => it may yield poor performance => this is called **Loss in Generalization**
- This shows that DL models are not always useful.
- e.g: models trained on internet texts and product reviews will not work well when applied to domains such as law, social media or healthcare
- We need specialized models to encode the domain knowledge, which could be as simple as domain-specific, rule-based models.

#### 4. Interpretable models

- Most of the time, DL models work like a black box. So, controllability and interpretability are hard for DL models.
- Businesses often demand more interpretable results that can be explained to the customer or end user => in such cases, traditional techniques might be more useful.
- e.g: A Naive Bayes model for sentiment classification may explain the effect of strong positive and negative words on the final prediction of sentiment, while obtaining such insights from an LSTM-based classification model is difficult.

#### 5. Common sense and world knowledge

- Beyond syntax and semantics, language encompasses knowledge of the world around us.
- Language for communication relies on logical reasoning and common sense regarding events of the world.
- **Example of multistep reasoning**: "If John walks out of the bedroom and goes to the garden, then John is not in the bedroom anymore and his current location is garden."  
This requires multistep reasoning for a machine to identify events and understand their consequences.
- Understanding the common sense and world knowledge that is inherent in language is crucial for any DL model to perform well on various language tasks.
- Current DL models may perform well on standard benchmarks => but are still not capable of common sense understanding and logical reasoning.
- There are some efforts to collect common sense events and logical rules (like 'if-then') => but they are not well integrated yet with ML or DL models.

#### 6. Cost


- Building DL solutions for NLP tasks can be pretty expensive.
- Cost is in terms of both => money and time
- DL models are known as **DATA GUZZLERS** => one has to collect a large dataset and get it labeled
- Training large DL models to achieve desired performance:
    - increases development cycles
    - results in heavy bills for specialized hardware (GPUs)
- Deploying and maintaining DL models can be expensive both in terms of hardware requirements and effort
- As these models are bulky, they may cause **LATENCY ISSUES** during inference time => they may not be useful in cases where low latency is a must.
- One more reason => **TECHNICAL DEBT** arising from building and maintaining a heavy model => it is the cost of rework that arises from prioritizing speedy delivery over good design and implementation.

#### 7. On-device deployment

- For many use cases => NLP solution needs to be deployed on an embedded device rather than in the cloud
- e.g: a machine translation system that helps tourists speak the translated text even without the internet.
- In such cases => the solution must work with limited memory and power => most DL solutions don't fit such constraints

<br>
<br>

- In most industry projects, one or more of the seven points above play out. This leads to longer project cycles and higher costs (hardware, manpower), while the performance is sometimes only comparable to, or even lower than, that of ML methods. The result is a poor return on investment, which often causes an NLP project to fail.
- Thus, `DL is not always the go-to solution for industrial NLP applications`.