# Chapter-1 NLP: A PRIMER

**Natural Language Processing (NLP)** is an area of computer science that deals with methods to analyze, model, and understand human language.

### Flow of this Chapter

- Overview of applications of NLP in real-world scenarios

- Various tasks that form the basis of building different NLP applications

- Understanding of language from an NLP perspective and why NLP is difficult

- Overview of heuristics, ML & DL

- Introduction to a few commonly used algorithms in NLP

- Walkthrough of an NLP application

- Overview of the rest of the topics in this book

### Organization of Chapters in terms of NLP tasks and applications

1. **Core Tasks** (Chapters 3-7)


- Text Classification
- Information Extraction
- Conversational Agent
- Information Retrieval
- Question Answering Systems

2. **General Applications** (Chapters 4-7)



- Spam Classification
- Calendar Event Extraction
- Personal Assistants
- Search Engines
- Jeopardy!

3. **Industry Specific** (Chapters 8-10)


- Social Media Analysis
- Retail Catalog Extraction
- Health Records Analysis
- Financial Analysis
- Legal Entity Extraction

# NLP in the Real World

### Core Applications

**E-Mail Applications** (Ch-4 & 5)

- GMail, Outlook, etc.
- provide features like spam classification, priority inbox, calendar event extraction, auto-complete, etc.

**Voice-based Assistants** (Ch-6)

- Apple Siri, Google Assistant, Microsoft Cortana, Amazon Alexa
- rely on a range of NLP techniques to interact with the user, understand user commands, and respond accordingly

**Modern Search Engines** (Ch-7)


- Google and Bing
- use NLP heavily for various subtasks like query understanding, query expansion, question answering, information retrieval, and ranking and grouping of the results


**Machine Translation Services** (Ch-7)


- Google Translate, Bing Microsoft Translator, and Amazon Translate
- these services are direct applications of NLP

### Other Applications

- Organizations analyze their **social media feeds** to understand the voice of their customers (Ch-8)

- NLP is widely used to solve use cases on **e-commerce platforms** (e.g., Amazon), such as extracting relevant information from product descriptions and understanding user reviews (Ch-9)

- NLP is used to solve use cases in domains such as **healthcare, finance and law** (Ch-10)

- Companies like Arria use NLP techniques to **automatically generate reports for various domains** (weather forecasting, financial services, etc.)

- NLP forms the backbone of **spelling- and grammar-correction tools** (Grammarly, spell check in Microsoft Word & Google Docs)

- **Watson AI**, which was built using NLP techniques, won first prize on the popular quiz show ***Jeopardy!***

- NLP is used in a range of **learning and assessment tools and technologies** (automated scoring in GRE, plagiarism detection like Turnitin, intelligent tutoring systems, language learning apps like Duolingo)

- NLP is used to build **large knowledge bases** (Google Knowledge Graph)

# NLP Tasks

Fundamental tasks that appear frequently across various NLP projects:

i. **Language Modeling**


- task of predicting what the next word in a sentence will be, based on the history of previous words
- Goal -> to learn the probability of a sequence of words appearing in a given language
- Uses -> speech recognition, optical character recognition, handwriting recognition, machine translation & spelling correction
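
The idea of learning sequence probabilities can be sketched with a bigram model, which approximates the probability of each word using only the single word before it. This is a minimal illustration, not a method described in the chapter: the toy `corpus`, the `<s>`/`</s>` boundary markers, and the function names are all assumptions, and no smoothing is applied, so any unseen word pair gets probability 0.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers for sentence start and end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            follow_counts[prev][curr] += 1
    return follow_counts

def next_word_probability(model, prev, word):
    """Estimate P(word | prev) as a relative frequency (no smoothing)."""
    followers = model.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0

def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= next_word_probability(model, prev, curr)
    return prob

# Toy corpus, made up for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
model = train_bigram_model(corpus)

print(next_word_probability(model, "the", "cat"))            # 2 of the 6 words after "the"
print(sentence_probability(model, "the cat sat on the mat"))
print(sentence_probability(model, "the mat sat on the cat"))  # 0.0: "mat sat" never seen
```

Practical language models use far larger corpora, longer histories, and smoothing or neural networks, but the goal is the same: assign higher probability to word sequences that are more likely in the language.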



ii. **Text Classification**


- task of bucketing the text into a known set of categories based on its content
- most popular task in NLP
- Uses -> email spam identification, sentiment analysis, etc.
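
As a rough illustration of bucketing text into known categories, here is a small multinomial Naive Bayes classifier written from scratch. It is only a sketch under stated assumptions: the training examples, the `spam`/`ham` labels, and the function names are invented for this example, tokenization is a plain whitespace split, and add-one smoothing is used so unseen words do not zero out a class.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_texts):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocabulary = set()
    for text, label in labeled_texts:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary

def classify(text, word_counts, doc_counts, vocabulary):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one (Laplace) smoothing over the training vocabulary.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, made up for illustration.
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team on monday", "ham"),
]
params = train_naive_bayes(training_data)

print(classify("free prize now", *params))          # spam
print(classify("team meeting on monday", *params))  # ham
```

Four training sentences are far too few for a usable classifier; the point is only to show the shape of the task: labeled examples in, a category out.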

iii. **Information Extraction**


- task of extracting relevant information from text
- Uses -> calendar event extraction from emails, extracting names of people mentioned in a social media post, etc.
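
A very simple, heuristic take on calendar event extraction is to pull day and time mentions out of an email with regular expressions. This sketch is an assumption-laden illustration rather than how any email product works: the patterns, the sample `email`, and the `extract_event_hints` helper are all hypothetical, and the patterns only cover weekday names and times written like `3:30 pm` or `10am`.

```python
import re

# Weekday names such as "Thursday" (case-insensitive).
DAY_PATTERN = re.compile(
    r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b", re.IGNORECASE
)
# Times such as "3:30 pm", "10am", or "7 PM".
TIME_PATTERN = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)

def extract_event_hints(text):
    """Return the day and time mentions found in the text, in order of appearance."""
    return {
        "days": DAY_PATTERN.findall(text),
        "times": TIME_PATTERN.findall(text),
    }

# Made-up email text for illustration.
email = "Hi team, let's meet on Thursday at 3:30 pm. If that fails, Friday 10am works too."
hints = extract_event_hints(email)
print(hints)
# {'days': ['Thursday', 'Friday'], 'times': ['3:30 pm', '10am']}
```

Rules like these are brittle (they miss "tomorrow", "noon", or dates like "12 March"), which is one reason information extraction is usually approached with learned models.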


iv. **Information Retrieval**


- task of finding documents relevant to a user query from a large collection, e.g., Google search
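
Finding relevant documents for a query can be sketched as scoring each document by how well its words match the query and sorting by that score. The example below uses a basic TF-IDF weighting as an assumption; the chapter does not prescribe a scoring scheme, real search engines are far more elaborate, and the `documents` list and function names are made up for illustration.

```python
import math
from collections import Counter

def build_index(documents):
    """Tokenize each document and count in how many documents each term occurs."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    return tokenized, doc_freq

def rank(query, documents):
    """Return the matching documents, best match first, using a TF-IDF score."""
    tokenized, doc_freq = build_index(documents)
    n_docs = len(documents)
    scores = []
    for doc_id, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in term_freq:
                # Rarer terms get a higher weight; +1 keeps common terms non-zero.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += term_freq[term] * idf
        scores.append((score, doc_id))
    ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
    return [documents[doc_id] for score, doc_id in ranked if score > 0]

# Toy document collection, made up for illustration.
documents = [
    "how to train a neural network",
    "best pizza recipes in town",
    "neural machine translation with attention",
]
results = rank("neural translation", documents)
print(results)
# ['neural machine translation with attention', 'how to train a neural network']
```

The third document ranks first because it matches both query terms, and "translation" is rarer in the collection than "neural", so it carries more weight.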

v. **Conversational Agent**


- task of building dialogue systems that can converse in human languages, e.g., Alexa, Siri, etc.


vi. **Text Summarization**


- task of creating short summaries of longer documents while retaining core content and preserving overall meaning of text

vii. **Question Answering**


- task of building a system that can automatically answer questions posed in natural language


viii. **Machine Translation**


- task of converting a piece of text from one language to another, e.g., Google Translate



ix. **Topic Modeling**


- task of uncovering the topical structure of a large collection of documents
- a common text mining tool used in domains like literature, bioinformatics, etc.

# Language from NLP Perspective

- **Language** is a structured system of communication that involves complex combinations of its constituent components like characters, words, sentences, etc.



- **Linguistics** is the systematic study of language.



- To study NLP, it is important to understand some concepts of linguistics.



- Human language can be thought of as composed of four major building blocks:


>   1. **Context** (meaning) : Applications -> Summarization, Topic Modeling, Sentiment Analysis 
>   2. **Syntax** (phrases & sentences) : Applications -> Parsing, Entity Extraction, Relation Extraction
>   3. **Morphemes & Lexemes** (words) : Applications -> Tokenization, Word Embeddings, POS Tagging
>   4. **Phonemes** (speech & sounds) : Applications -> Speech to text, Speaker Identification, Text to Speech

### PHONEMES

- **Phonemes** are the smallest units of sound in a language
- they don't have any meaning by themselves
- they can induce meaning when uttered in combination with other phonemes
- Standard English has 44 phonemes (single letters / combination of letters)

Examples:
- **Consonant phonemes** -> /b/ - bat, /s/-sun, /k/-cat, /sh/-shop, /p/-pen, /ng/-ring
- **Vowel phonemes** -> /a/-ant, /oi/-coin, /e/-egg, /ear/-dear, /oa/-boat, /ow/-cow

### MORPHEMES & LEXEMES

- A **Morpheme** is the smallest unit of language that has a meaning
- It is formed by a combination of phonemes.
- Not all morphemes are words.
- All prefixes and suffixes are morphemes, e.g., in the word `multimedia`, `multi-` is a morpheme

Examples:  
> unbreakable => un + break + able  
> cats => cat + s  
  (*these morphemes are just constituents of full words*)
  
> tumbling => tumble + ing  
> unreliability => un + rely + able + ity  
(*there is some variation when words are broken into morphemes*)

- **Lexemes** are structural variations of morphemes related to one another by meaning, e.g., `run` and `running` belong to the same lexeme form


- **Morphological Analysis** is a foundational block of many NLP tasks such as tokenization, stemming, learning word embeddings & POS tagging.
- It analyzes the structure of a word by studying its morphemes and lexemes.
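
The morpheme splits in the examples above can be roughly imitated by peeling known prefixes and suffixes off a word. This is a deliberately naive sketch: the `PREFIXES` and `SUFFIXES` lists, the three-character minimum stem length, and the `segment` function are assumptions made for illustration, not a real morphological analyzer.

```python
# Hypothetical affix lists covering only the examples in these notes.
PREFIXES = ("un", "multi")
SUFFIXES = ("able", "ing", "ity", "s")
MIN_STEM_LENGTH = 3  # avoid stripping affixes from very short words

def segment(word):
    """Greedily split a word into [prefix] + stem + [suffixes]."""
    stem = word.lower()
    prefix_parts, suffix_parts = [], []

    # Strip at most one known prefix.
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LENGTH:
            prefix_parts.append(prefix)
            stem = stem[len(prefix):]
            break

    # Repeatedly strip known suffixes from the end.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LENGTH:
                suffix_parts.insert(0, suffix)
                stem = stem[:-len(suffix)]
                stripped = True
                break

    return prefix_parts + [stem] + suffix_parts

print(segment("unbreakable"))    # ['un', 'break', 'able']
print(segment("cats"))           # ['cat', 's']
print(segment("multimedia"))     # ['multi', 'media']
print(segment("tumbling"))       # ['tumbl', 'ing']
print(segment("unreliability"))  # ['un', 'reliabil', 'ity']
```

The last two outputs show the variation noted above: plain string stripping yields `tumbl` and `reliabil` rather than `tumble` and `rely + able`, because spelling changes when morphemes combine. It also over-segments words such as `under`, which is why real morphological analysis needs more than affix lists.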

### SYNTAX

- **Syntax** is a set of rules to construct grammatically correct sentences out of words and phrases in a language.
- In linguistics, there are many ways to represent syntactic structure.
- A common example => Parse Tree

![image.png](attachment:image.png)
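
A parse tree can also be written down directly as a nested data structure. The sketch below encodes a constituency tree as nested tuples and shows two ways to read it. The example sentence, the Penn Treebank style labels (`S`, `NP`, `VP`, `PP`, `DT`, `NN`, `VBD`, `IN`), and the helper names are assumptions chosen for illustration and are not taken from the figure.

```python
# Internal nodes: (label, child, child, ...); leaves: (part_of_speech, word).
tree = (
    "S",
    ("NP", ("DT", "the"), ("NN", "girl")),
    ("VP",
        ("VBD", "laughed"),
        ("PP",
            ("IN", "at"),
            ("NP", ("DT", "the"), ("NN", "monkey")))),
)

def is_leaf(node):
    """A leaf holds exactly one child, and that child is the word itself."""
    return len(node) == 2 and isinstance(node[1], str)

def leaves(node):
    """Collect the words of the sentence from left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in bracketed notation."""
    if is_leaf(node):
        return f"({node[0]} {node[1]})"
    children = " ".join(bracketed(child) for child in node[1:])
    return f"({node[0]} {children})"

print(" ".join(leaves(tree)))
# the girl laughed at the monkey
print(bracketed(tree))
# (S (NP (DT the) (NN girl)) (VP (VBD laughed) (PP (IN at) (NP (DT the) (NN monkey)))))
```

Reading the structure top-down shows the hierarchy: the sentence splits into a noun phrase and a verb phrase, and the verb phrase in turn contains a prepositional phrase with its own noun phrase.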