# Deep Learning & NLP

## Definition

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) concerned with the
computational modelling and analysis of human language.

It aims to enable computers to understand, interpret, and generate natural language in a manner that is
both meaningful and useful.

In essence, NLP bridges the gap between human communication and machine computation.

## Motivation
Human language is inherently ambiguous, context-dependent, and dynamic.
A single lexical item may possess multiple meanings; for instance, the word “bank” may refer either to a
financial institution or to the side of a river.
Similarly, the structure of a sentence can yield different interpretations, as in “I saw the man with the
telescope”, where the phrase “with the telescope” may modify either “man” or “saw”.
Moreover, idiomatic expressions such as “break the ice” convey meanings that cannot be deduced from
the literal definitions of their constituent words.
Humans resolve such ambiguities effortlessly through contextual reasoning and prior knowledge.
Computers, however, require explicit models and algorithms to approximate this ability.
This fundamental challenge constitutes the central motivation for research and development in NLP.

## Objectives of NLP
The principal objective of NLP is to make natural language computationally processable.
Its goals may be summarised as follows:
- **Language Understanding**: To enable machines to extract and represent the meaning of linguistic
input.
- **Language Generation**: To produce coherent and contextually appropriate natural-language
output from structured or unstructured data.
- **Human–Computer Interaction**: To facilitate seamless communication between humans and
computers using everyday language rather than formal programming commands.

Through these objectives, NLP seeks to emulate aspects of human linguistic competence within
computational systems.


## Why Linguistic Foundations Matter in NLP

When humans communicate, we don’t just process words. We interpret sounds, structures, meanings, and context, all at once.
Computers, however, don’t understand any of this natively. NLP is the field that teaches them to.

So, linguistics gives NLP a structured roadmap for understanding human language.

| Level             | Focus                                                                                            | Example                                                                                                  | Why It Matters for NLP                                                                                                                             |
| ----------------- | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Phonology** | The **sound patterns** of language: how words *sound* and are *pronounced*. | “Knight” and “night” sound the same (phonologically identical) even though they are spelled differently. | Used in **speech recognition**, **text-to-speech**, **accent modeling**, and **speech synthesis**. |
| **2. Morphology** | The **internal structure of words**: how roots, prefixes, and suffixes combine to make meaning. | un + happy + ness → *unhappiness* | Important for **tokenization**, **stemming**, **lemmatization**, and **morphological analysis** (like understanding plural/singular, tense, etc.). |
| **3. Syntax** | The **structure of sentences**: how words combine grammatically. | “The cat chased the dog” ≠ “The dog chased the cat” | NLP models use syntax in **parsing**, **part-of-speech tagging**, and **dependency trees** to understand grammatical relationships. |
| **4. Semantics** | The **meaning** of words and sentences. | The word “light” could mean *not heavy* or *illumination*. | Drives **word embeddings (Word2Vec, BERT)** and **semantic similarity**, enabling models to understand meaning and context. |
| **5. Pragmatics** | The **intended meaning** in context: what the speaker *really* means. | “Can you pass the salt?” is not a capability question; it’s a polite *request*. | Crucial for **chatbots**, **sentiment analysis**, **dialogue systems**, and **contextual understanding** (e.g., sarcasm). |

## How these levels connect

Pragmatics  ←  Semantics  ←  Syntax  ←  Morphology  ←  Phonology

- Phonology: What it sounds like
- Morphology: What words are made of
- Syntax: How words fit together
- Semantics: What they mean literally
- Pragmatics: What they mean in context

Each level builds on the one below it, and NLP models today (like GPT or BERT) try to implicitly learn all these layers from massive amounts of text.



## Approaches to NLP
Historically, two principal paradigms have guided NLP research:
- 1. **Rule-Based Systems:**
    - Early NLP systems relied on manually crafted grammatical and lexical rules.
    - They achieved high precision in restricted domains but lacked scalability and adaptability.
- 2. **Statistical and Data-Driven Systems:**
    - With the advent of large corpora and increased computational power, probabilistic and machine-learning approaches became dominant.
    - These systems learn linguistic patterns from data, offering flexibility and improved performance across diverse contexts.

Contemporary NLP integrates both paradigms, combining the interpretability of rules with the
robustness of data-driven learning.

## Challenges in NLP

Despite significant progress, several challenges persist:
- **Ambiguity**: Multiple plausible interpretations at lexical, syntactic, or semantic levels.
- **Context Sensitivity**: Dependence of meaning on situational or pragmatic context.
- **Idiomatic and Figurative Language**: Non-literal expressions resist rule-based interpretation.
- **Resource Scarcity**: Limited availability of annotated corpora for many languages.
- **Domain Adaptation**: Decline in model accuracy when applied outside the domain of training data.

These challenges continue to motivate ongoing research in computational linguistics and machine
learning.


## The NLP Processing Pipeline
Language data undergoes a sequence of computational stages collectively termed the NLP pipeline:
- 1. **Data Acquisition**: Collection of text or speech from relevant sources such as articles, transcripts, or conversations.
- 2. **Data Cleaning**: Removal of extraneous symbols, typographical errors, and non-linguistic artefacts (e.g., HTML tags, emojis).
- 3. **Text Pre-processing**: Standardisation through operations such as tokenisation, normalisation, stemming, and lemmatisation.
- 4. **Feature Extraction**: Transformation of linguistic units into quantitative representations, using methods such as Bag-of-Words, Term Frequency–Inverse Document Frequency (TF–IDF), or word embeddings.
- 5. **Model Construction**: Application of statistical or machine-learning algorithms (e.g., Logistic Regression, Naïve Bayes, neural networks) to detect patterns within the data.
- 6. **Evaluation and Deployment**: Assessment of model performance and integration within practical systems such as chatbots, translation engines, or search interfaces.

This pipeline converts unstructured linguistic data into structured information amenable to computation.
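
Stages 2 to 4 of the pipeline can be illustrated with a minimal sketch that uses only the Python standard library. The function names `preprocess` and `bag_of_words` are hypothetical, and the cleaning rules are deliberately simplistic (HTML-tag removal, lowercasing, alphanumeric tokens); a production system would normally rely on an established NLP library.

```python
import re
from collections import Counter


def preprocess(text: str) -> list[str]:
    """Data cleaning and pre-processing: strip HTML tags, lowercase, tokenise."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Feature extraction: map each document to a vector of term counts."""
    tokenised = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["<p>The cat sat.</p>", "The cat chased the dog!"])
print(vocab)    # ['cat', 'chased', 'dog', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 2]]
```

The resulting count vectors are the kind of structured input that a model-construction stage (for example a Naïve Bayes classifier) would consume.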

## Applications of NLP

| **Domain**                 | **Representative Example**                             | **Function of NLP**                                                                         |
| -------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------------------- |
| **Information Retrieval**  | Web search queries such as “nearest pharmacy open now” | Interprets grammatical structure and intent rather than relying solely on keyword matching. |
| **E-commerce**             | Product review analysis and recommendation             | Identifies synonyms and sentiment to enhance the relevance of recommendations.              |
| **Customer Support**       | Automated chatbots on service websites                 | Resolves lexical ambiguities and interprets user intent.                                    |
| **Healthcare**             | Processing of clinical notes                           | Extracts key medical entities and relationships (e.g., symptoms, diagnoses).                |
| **Machine Translation**    | Systems such as Google Translate                       | Maps linguistic structure and meaning across languages.                                     |
| **Voice-based Assistants** | Siri, Alexa, or Google Assistant                       | Converts speech to text, interprets intent, and generates responses.                        |


----

## Lexical Processing
- Lexical Processing represents the first structured stage in the Natural Language Processing (NLP) pipeline.
- It focuses on processing words and their linguistic forms, ensuring consistency and interpretability before syntactic or semantic analysis.
- The goal is to convert unstructured text into linguistically meaningful and standardised tokens.

Lexical processing typically includes:
- 1. **Text Normalisation** – standardising text forms.
- 2. **Tokenisation** – dividing text into words or segments.
- 3. **Stopword Removal** – filtering uninformative words.
- 4. **Morphological Analysis** – identifying roots and affixes.
- 5. **Spell Correction** – detecting and fixing typographical or contextual errors.
- 6. **Stemming and Lemmatization** – reducing words to base or dictionary forms.
- 7. **Lexical Resource Mapping** – linking words to structured linguistic databases such as WordNet or gazetteers.

Each of these operations enhances the quality of text representation for higher levels of NLP.

----

## Text Normalisation
Language as it appears “in the wild” is messy. Texts gathered from social media, emails, or scanned documents often contain inconsistencies such as irregular capitalisation, misspellings, mixed encodings, emojis, and formatting noise. These inconsistencies fragment the vocabulary and mislead computational models.

Text normalisation is the process of converting such heterogeneous text into a uniform, machine-interpretable form. By enforcing consistency, it ensures that semantically identical expressions (for example “Covid-19”, “covid 19”, and “COVID19”) are recognised as the same token.

Normalisation is therefore a foundational prerequisite for tokenisation, feature extraction, and model training.

| **Operation**                     | **Objective**                                                                                                                    | **Illustration**                                    |
| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- |
| **Case Conversion** | Reduces variation caused by capitalisation. | “Natural Language” → “natural language” |
| **Punctuation & Symbol Handling** | Removes or selectively retains symbols. Punctuation like commas or periods may be dropped unless needed for sentence boundaries. | “C++ programming” → keep “++”; “Hello!!!” → “hello” |
| **Number Standardisation** | Brings all numeric mentions to a common form or token. | “ten kg” → “10 kg” or “<NUM> kg” |
| **Contraction Expansion** | Converts shortened forms to their full equivalents for better parsing. | “don’t” → “do not”; “I’ll” → “I will” |
| **Accent & Diacritic Removal** | Removes language-specific marks to ensure uniform encoding. | “résumé” → “resume”; “café” → “cafe” |
| **Whitespace & Encoding Fixes** | Eliminates redundant spaces and standardises text encoding (e.g., UTF-8). | “data   science” → “data science” |


| **Task**                                | **Objective / Description**                                                                                                                                                  | **Illustration / Example**                                                                                                                                                                                                                                                                                                                                                                                               |
| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **1. Slang and Abbreviation Expansion** | Replace informal or shortened forms with their full equivalents. Particularly useful for **social media text mining** and informal text sources. | “u” → “you”; “btw” → “by the way” |
| **2. Unicode and Emoji Handling** | Manage emojis or special Unicode characters: either remove them or map them to **textual descriptors** depending on the application. | 😊 → “smile”; ❤️ → “heart” |
| **3. Text Standardisation Pipelines** | In large-scale NLP systems, normalisation steps are applied via **rule-based or regex pipelines** to convert raw input into a **clean, canonical form** before tokenisation. | **Example Pipeline:**<br>**Input:** “I LUV Python !! #CodingIsLife ❤️ 2025”<br>**1. Lowercase:** → “i luv python !! #codingislife ❤️ 2025”<br>**2. Expand slang:** → “i love python !! #codingislife ❤️ 2025”<br>**3. Remove excess punctuation:** → “i love python codingislife ❤️ 2025”<br>**4. Handle emoji & number:** → “i love python codingislife heart <NUM>”<br>**Output:** → “i love python codingislife heart <NUM>” |

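The normalisation operations tabulated in this section can be chained into a small rule-based pipeline. The sketch below is one possible arrangement using only Python's `re` and `unicodedata` modules; the lookup tables `SLANG`, `CONTRACTIONS`, and `EMOJI` are hypothetical and intentionally tiny, and the order of the steps is an assumption rather than a fixed standard.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables; real systems use curated lexicons.
SLANG = {"luv": "love", "u": "you", "btw": "by the way"}
CONTRACTIONS = {"don't": "do not", "i'll": "i will"}
EMOJI = {"\u2764": "heart", "\U0001F60A": "smile"}  # red heart, smiling face


def normalise(text: str) -> str:
    """Apply the normalisation steps in a fixed order and return the cleaned string."""
    text = text.lower()                        # case conversion
    text = text.replace("\u2019", "'")         # curly apostrophe -> straight
    text = text.replace("\ufe0f", "")          # drop the emoji variation selector
    for symbol, name in EMOJI.items():         # emoji -> textual descriptor
        text = text.replace(symbol, f" {name} ")
    # Accent removal: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Slang and contraction expansion on whitespace-separated words.
    words = [SLANG.get(w, CONTRACTIONS.get(w, w)) for w in text.split()]
    text = " ".join(words)
    text = re.sub(r"\b\d+\b", "<NUM>", text)   # number standardisation
    text = re.sub(r"[^\w\s<>+]", " ", text)    # drop punctuation, keep "+" and "<NUM>"
    return re.sub(r"\s+", " ", text).strip()   # whitespace fix


print(normalise("I LUV Python !! #CodingIsLife \u2764\ufe0f 2025"))
# i love python codingislife heart <NUM>
```

Because each step is a plain string transformation, individual rules can be switched off or reordered to suit the task, for instance keeping case when named entities matter.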

----

## Tokenisation

- After text is normalised, the next task is to divide it into meaningful linguistic units known as tokens.
- A token usually corresponds to a word, subword, or sentence, depending on the level of analysis.
- Tokenisation is crucial because almost every NLP algorithm, from frequency counts to neural embeddings, operates on tokens rather than raw character streams.
- It acts as the bridge between raw text and structured linguistic data.

**Tokenisation** is the process of segmenting continuous text into smaller components (tokens) that carry meaning and can be
independently analysed.

Mathematically, tokenisation may be viewed as a mapping

$$T : S \rightarrow \{t_1, t_2, \ldots, t_n\}$$

where $S$ is a text string and $t_i$ are the extracted tokens.
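
As a concrete instance of this mapping, the following minimal sketch turns a string into a list of word and punctuation tokens with a single regular expression. The function name `tokenise` and the pattern are illustrative assumptions; library tokenisers handle many more edge cases.

```python
import re


def tokenise(text: str) -> list[str]:
    """Map a string to a list of tokens: words (with internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


print(tokenise("Language models are powerful."))
# ['Language', 'models', 'are', 'powerful', '.']
```

Here punctuation is kept as separate tokens; whether to drop it afterwards depends on the downstream task.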

### Tokenization Levels

| **Level**           | **Description**                                                                                                                               | **Example**                                                         |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Word-Level** | Splits text on spaces and punctuation; the most common form of tokenisation. | “Language models are powerful.” → [language, models, are, powerful] |
| **Subword-Level** | Breaks complex or unseen words into smaller meaningful parts; useful for morphologically rich languages or neural models (like BERT or GPT). | “unhappiness” → [un, happy, ness] |
| **Sentence-Level** | Divides paragraphs into sentences based on punctuation and syntactic cues. | “It rained. Roads flooded.” → [Sentence 1, Sentence 2] |
| **Character-Level** | Treats each character as a token; rare, but used in handwriting recognition or low-resource NLP tasks. | “cat” → [c, a, t] |




### Tokenization Approaches

| **Approach**                                      | **Description**                                                                                                                                 | **Examples / Tools**                                                                                                                  | **Advantages**                                                                                                                                | **Limitations**                                                                                         |
| ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| **1. Rule-Based / Regular-Expression Tokenisers** | Use **predefined rules or delimiters** (spaces, commas, punctuation marks) to split text into tokens. Often rely on handcrafted patterns. | **Example:** NLTK’s *TreebankWordTokenizer* | ✅ Simple, fast, and easy to implement.<br>✅ Works well for structured text with clear boundaries. | ❌ Struggles with exceptions like “U.S.A.” or “e-mail”.<br>❌ Language-specific tuning often required. |
| **2. Statistical Tokenisers** | Learn **probable token boundaries** from data instead of fixed rules. Commonly used for **languages without spaces** (e.g., Chinese, Japanese). | **Example:** Word-boundary prediction using **character n-grams** or **Conditional Random Fields (CRFs)** for Chinese segmentation. | ✅ Learns from actual text distribution.<br>✅ Adapts better to different languages and writing styles. | ❌ Requires annotated data for training.<br>❌ May misidentify rare or ambiguous boundaries. |
| **3. Subword Algorithms (Neural-Era)** | Split words into **smaller, frequent subword units**; useful for neural NLP models. Helps represent unseen or rare words efficiently. | **Examples:** **Byte-Pair Encoding (BPE)** and **WordPiece** (used in BERT).<br>**Illustration:** “unbelievable” → [un, believ, able] | ✅ Handles out-of-vocabulary (OOV) words.<br>✅ Reduces vocabulary size.<br>✅ Improves model generalization for morphologically rich languages. | ❌ Adds complexity to preprocessing.<br>❌ May break intuitive word boundaries (e.g., [##ing], [##tion]). |

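To make the subword approach more concrete, here is a toy sketch of how Byte-Pair Encoding learns its merge rules: it repeatedly finds the most frequent adjacent symbol pair in the corpus and fuses it into a new symbol. The helper names (`merge_pair`, `learn_bpe`, `segment`), the `</w>` end-of-word marker, and the tiny corpus are assumptions made for illustration; production tokenisers differ in many details.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from characters, with an end-of-word marker on each word.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word: str, merges: list[tuple]) -> list[str]:
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=5)
print(segment("lowest", merges))  # ['low', 'est</w>']
```

The word "lowest" never occurs in the toy corpus, yet it is segmented into units learned from other words, which is how subword vocabularies cope with out-of-vocabulary items.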

### Key Tokenization Challenges

| **Challenge**             | **Description**                                                                                                                    | **Example / Issue**                                                                        | **Implication**                                                                         |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------- |
| **Ambiguous Boundaries**  | Some languages (e.g., **Chinese**, **Thai**, **Japanese**) do **not use spaces** to separate words, making segmentation difficult. | Chinese: ‚ÄúÊàëÂñúÊ¨¢Â≠¶‰π†NLP‚Äù ‚Üí ‚ÄúI like studying NLP‚Äù (no clear word boundaries).                    | Requires **statistical** or **neural segmentation models** rather than rule-based ones. |
| **Multiword Expressions** | Phrases like ‚ÄúNew York City‚Äù or ‚Äúmachine learning‚Äù represent **single semantic units** and should not be split.                    | ‚ÄúNew York City‚Äù ‚Üí should be one token, not [New, York, City].                              | Affects **Named Entity Recognition (NER)** and **semantic consistency**.                |
| **Punctuation Ambiguity** | Periods, commas, or symbols can have **multiple roles** (e.g., end of sentence vs. part of abbreviation or number).                | ‚ÄúDr. Smith lives in the U.S.‚Äù or ‚Äú3.14‚Äù ‚Üí periods should *not* trigger sentence splitting. | Requires **context-aware sentence tokenisation**.                                       |
| **Contractions**          | Words like ‚Äúdon‚Äôt‚Äù, ‚ÄúI‚Äôll‚Äù can be tokenised as one or two units depending on task needs.                                           | ‚Äúdon‚Äôt‚Äù ‚Üí [don‚Äôt] or [do, not]; expanded form helps in **sentiment or syntax analysis**.   | Task-specific handling is needed.                                                       |
| **URLs and Emojis**       | These non-standard tokens don‚Äôt follow linguistic rules. Often replaced with placeholders for consistency.                         | ‚ÄúVisit [http://example.com](http://example.com) üòä‚Äù ‚Üí [Visit, <URL>, <EMOJI>]              | Ensures model robustness and uniform vocabulary handling.                               |

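The punctuation-ambiguity problem can be illustrated with a small sentence splitter that consults an abbreviation list before treating a period as a boundary. This is only a sketch: the `ABBREVIATIONS` set and the function name `split_sentences` are hypothetical, and an abbreviation that genuinely ends a sentence would still be mishandled.

```python
# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "u.s."}


def split_sentences(text: str) -> list[str]:
    """Split on ., ! or ? at the end of a token, unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # "3.14" is safe because its final character is a digit, not a period.
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith measured 3.14 metres. Roads flooded."))
# ['Dr. Smith measured 3.14 metres.', 'Roads flooded.']
```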

---

## Stopword Removal

Not all words in a text contribute equally to its meaning. Certain high-frequency function words (such as *the*, *is*, *at*, *of*, *for*, *to*, and *an*) occur frequently across all kinds of text but carry little discriminative value. They mainly serve grammatical functions, helping sentence construction rather than conveying content. For most NLP tasks, these words add computational overhead without significantly improving meaning representation. Stopword removal refers to the process of filtering out such terms so that only content-bearing words remain for further analysis.

**Stopwords** are words that occur very frequently in a language but contribute minimally to the contextual meaning or discriminative power of a document.

Removing them helps focus on semantically informative content such as nouns, verbs, and adjectives.

### Example: Stopword Removal in Action

| **Original Sentence**                         | **After Stopword Removal**     | **Explanation**                                                                              |
| --------------------------------------------- | ------------------------------ | -------------------------------------------------------------------------------------------- |
| “The movie was not very good.” | “movie not good” | Retains only the words conveying meaning: adjectives, nouns, and polarity words like *not*. |
| “Data is the new oil of the digital economy.” | “Data new oil digital economy” | Removes filler words (*is, the, of, the*) while preserving content words. |

✅ Key Idea:
Retain keywords that define meaning → remove filler words that repeat across documents and add little value.
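
The filtering step can be sketched in a few lines of Python. The `STOPWORDS` set below is a hypothetical, hand-picked list rather than any library's official one, and the `keep` parameter is an illustrative way of protecting polarity words such as *not*.

```python
# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"the", "is", "at", "of", "for", "to", "an", "a", "was", "very", "and", "in"}
NEGATIONS = {"not", "never", "no"}


def remove_stopwords(tokens, stopwords=STOPWORDS, keep=NEGATIONS):
    """Drop tokens found in `stopwords`, except those explicitly listed in `keep`."""
    return [t for t in tokens if t.lower() not in stopwords or t.lower() in keep]


print(remove_stopwords("The movie was not very good".split()))
# ['movie', 'not', 'good']

# Treating negations as stopwords flips the apparent sentiment:
print(remove_stopwords("The movie was not very good".split(),
                       stopwords=STOPWORDS | NEGATIONS, keep=set()))
# ['movie', 'good']
```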

### Why Remove Stopwords

| **Purpose**                                    | **Explanation**                                                         | **Impact / Benefit**                                                                                    |
| ---------------------------------------------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| **1. Dimensionality Reduction** | Removes extremely common tokens from the vocabulary. | - Smaller vocabulary → faster computation.<br>- Reduces vector dimensions in models like BoW or TF-IDF. |
| **2. Improved Focus on Meaningful Words** | Keeps only words that help distinguish between documents or sentiments. | Models learn from **discriminative words** (e.g., “excellent”, “terrible”). |
| **3. Reduced Noise in Frequency-Based Models** | Prevents frequent stopwords from dominating token counts. | Improves **relevance and weighting** in BoW / TF-IDF representations. |
| **4. Simplified Matching in Search Systems** | Ignores stopwords (“the”, “of”, “in”) in queries. | Better keyword matching and improved **information retrieval accuracy**. |


### ⚠️ Challenges and Considerations

While stopword removal offers efficiency, it is not universally beneficial.
Its usefulness depends on the task, domain, and language.

#### A. Task Sensitivity

| **Aspect**                                  | **Description**                                                                     | **Example / Risk**                                                                      |
| ------------------------------------------- | ----------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Sentiment Analysis** | Words like *not*, *never*, *no* carry **polarity**; removing them changes meaning. | “The movie was not good.” → Removing *not* changes meaning to “movie good”. |
| **Text Summarisation / Question Answering** | Stopwords may contribute to **syntactic or contextual completeness**. | “Who is the president of France?” → Removing *is* or *of* breaks grammatical structure. |

🟡 Conclusion: Stopwords aren’t always “noise”; some are essential for meaning.

#### B. Domain Dependence

| **Context**                                                                                  | **Issue**                                                                 | **Example**                                                                                       |
| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| **Generic stopword lists** (like NLTK’s English stopwords) may not suit specialized domains. | Words frequent in one field can still be **meaningful**. | In biomedical text, *patient*, *disease*, *study* appear often but carry **semantic importance**. |
| **Solution:** | Curate **domain-specific stopword lists** rather than using generic ones. | ✅ Improves model performance and interpretability. |

#### C. Language Dependence
| **Language** | **Stopword Examples**   | **Implication**                                      |
| ------------ | ----------------------- | ---------------------------------------------------- |
| **English**  | the, is, of, and, in    | Common generic stopwords.                            |
| **French**   | le, la, les, de, des    | Language-specific; differ in function and frequency. |
| **German**   | der, die, das, und, ein | Require distinct stopword lists per language.        |

🟢 Key Takeaway:
Stopword removal must respect language structure; using a one-size-fits-all list can degrade accuracy in multilingual NLP systems.

#### Summary

| **Dimension**            | **Goal**                                      | **Considerations**                            | **Best Practice**                                      |
| ------------------------ | --------------------------------------------- | --------------------------------------------- | ------------------------------------------------------ |
| **Efficiency**           | Reduce vocabulary size and noise.             | Works well for BoW / TF-IDF models.           | Use after lowercasing and normalization.               |
| **Meaning Preservation** | Avoid losing important context.               | Task-specific stopword handling required.     | Keep negations (*not, never, no*) for sentiment tasks. |
| **Domain Adaptation**    | Ensure relevance of token removal.            | Generic lists may omit domain-relevant terms. | Create domain-specific stopword sets.                  |
| **Language Support**     | Adapt to grammar and syntax of each language. | Stopwords vary across languages.              | Use language-tailored stopword lists.                  |



---

## Morphological Analysis

After tokenisation and stopword filtering, the next step in lexical processing is to understand the internal structure of words. Every word in a language can often be broken down into smaller meaning-bearing units. This study of word structure is known as morphology, and the computational procedure of analysing it is called Morphological Analysis. The main goal is to identify roots (stems) and affixes (prefixes, suffixes, infixes) so that all inflected or derived forms of a word can be recognised as related.

**Morphology** is the branch of linguistics that studies the internal structure of words and how they are formed from smaller
meaningful elements called **morphemes**. Morphological Analysis in NLP refers to the process of segmenting a word into its constituent morphemes and determining their grammatical roles.

### Morphemes and Their Types
A morpheme is the smallest unit of meaning or grammatical function in a language. Each word can be formed from one or more morphemes.

#### Types of Morphemes
| **Type**              | **Definition**                                                                                    | **Examples**                | **Explanation / Usage**                                     |
| --------------------- | ------------------------------------------------------------------------------------------------- | --------------------------- | ----------------------------------------------------------- |
| **Free Morpheme**     | Can stand alone as a complete word.                                                               | book, kind, happy           | Represents standalone meaning (does not need another word). |
| **Bound Morpheme**    | Cannot stand alone; must attach to a root.                                                        | un-, -ness, -ed             | Modifies meaning or grammatical role of the root.           |
| **Root / Stem**       | Core part of a word carrying the primary meaning.                                                 | play (in playing)           | Foundation word to which affixes attach.                    |
| **Prefix**            | Added **before** the root to alter meaning. | un- (in unhappy) | Changes meaning, e.g., “happy” → “unhappy”. |
| **Suffix**            | Added **after** the root to indicate tense, form, or quality. | -ing, -ed, -ness | E.g., “walk” → “walked” (past tense). |
| **Infix / Circumfix** | Inserted into or wrapped around the root (rare in English; found in, e.g., Austronesian and Semitic languages). | Tagalog *s-um-ulat* (from *sulat*, “write”) | Alters meaning or grammatical role via insertion. |


### Examples of Word Formation

| **Word**             | **Morphological Breakdown** | **Meaning Change**                       |
| -------------------- | --------------------------- | ---------------------------------------- |
| **Unhappiness**      | un- + happy + -ness         | “Not happy” → “state of not being happy” |
| **Replaying**        | re- + play + -ing           | “Play again”                             |
| **Kindness**         | kind + -ness                | “State or quality of being kind”         |
| **Misunderstanding** | mis- + understand + -ing    | “Act of wrongly understanding”           |

💡 Observation:
- Prefixes usually modify meaning (un-, re-, mis-).
- Suffixes usually modify form or function (-ing, -ness, -ed).

### Functions of Morphological Analysis in NLP

| **Function**                                | **Description**                                                                                  | **Example / Benefit**                                                          |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| **1. Vocabulary Normalisation**             | Groups together word variants to reduce redundancy. | *plays, played, playing* → *play* |
| **2. Part-of-Speech (POS) Tagging Support** | Provides grammatical clues such as tense, number, or degree. | *walked* → past tense; *cats* → plural noun; *happier* → comparative adjective |
| **3. Semantic Clarity**                     | Identifies relationships between words with shared roots. | *create*, *creator*, *creation* share the same root *create*. |
| **4. Downstream Utility**                   | Supports **stemming**, **lemmatisation**, and **spell correction** for consistent text handling. | Improves model robustness and accuracy.                                        |


### Computational Perspective: Morphological Analysers

Morphological analysis in NLP can be done using rule-based or statistical / learned methods.

| **Approach**                                   | **Description**                                                      | **Example**                                                                                      | **Pros**                                                                                                    | **Cons**                                                    |
| ---------------------------------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| **A. Rule-Based Morphological Analysers**      | Use handcrafted linguistic rules or morphological dictionaries. | Rule: *If word ends with “-ing” and root exists in dictionary → classify as present participle.* | ✅ Linguistically accurate<br>✅ Works well for morphologically rich languages (e.g., Hindi, Arabic, Turkish) | ❌ Labour-intensive<br>❌ Language-specific, hard to scale |
| **B. Statistical / Machine-Learned Analysers** | Learn affix patterns and morpheme boundaries from annotated corpora. | Model learns “-ed” → past tense with high probability (e.g., CRF, neural sequence model). | ✅ Scalable and adaptive<br>✅ Works across languages | ❌ Requires labelled data<br>❌ May make probabilistic errors |
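
As a rough illustration of approach A, the sketch below strips affixes and checks the remaining stem against a dictionary. The lexicon, the affix lists, and the function names `restore_root` and `analyse` are all assumptions made for this example; a real analyser would need far larger resources and many more spelling rules.

```python
# Toy lexicon and affix lists; every entry is an illustrative assumption.
LEXICON = {"believe", "happy", "play", "kind", "understand"}
PREFIXES = ["un", "re", "mis"]
SUFFIXES = ["able", "ness", "ing", "ed"]


def restore_root(stem):
    """Return a lexicon root for `stem`, undoing two common spelling changes."""
    candidates = [stem, stem + "e"]           # believ -> believe (dropped final e)
    if stem.endswith("i"):
        candidates.append(stem[:-1] + "y")    # happi -> happy (y changed to i)
    for candidate in candidates:
        if candidate in LEXICON:
            return candidate
    return None


def analyse(word):
    """Split `word` into (prefix, root, suffix); a missing affix is ''."""
    word = word.lower()
    # Try every prefix/suffix combination, including "no affix" ("").
    for prefix in PREFIXES + [""]:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES + [""]:
            if suffix and not word.endswith(suffix):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            root = restore_root(stem)
            if root is not None:
                return prefix, root, suffix
    return "", word, ""   # unknown word: treat it as a single morpheme
```

With these toy resources, `analyse("Unbelievable")` returns `("un", "believe", "able")` and `analyse("unhappiness")` returns `("un", "happy", "ness")`. Words whose root is missing from the lexicon come back unsegmented, which is the scaling limitation noted in the table.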


### Example Workflow: Morphological Analysis

| **Step**               | **Action**                                                             | **Output**                              |
| ---------------------- | ---------------------------------------------------------------------- | --------------------------------------- |
| **Input Word**         | Unbelievable                                                           | -                                       |
| **1. Identify Prefix** | un-                                                                    | Negation                                |
| **2. Identify Root**   | believe                                                                | Base meaning                            |
| **3. Identify Suffix** | -able                                                                  | Expressing ability                      |
| **Final Output**       | [Prefix: un- (negation)] + [Root: believe] + [Suffix: -able (ability)] | **Meaning:** “Not able to be believed.” |

🟩 Interpretation: Morphological decomposition helps derive word meaning, structure, and grammatical role.

### Applications of Morphological Analysis

| **Application**                  | **How Morphology Helps**                                                                        | **Example / Impact**                                  |
| -------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| **Information Retrieval (IR)**   | Groups inflected forms under one search term. | Searching “run” also retrieves “running”, “ran”. |
| **Machine Translation (MT)**     | Preserves tense, number, and agreement when translating between morphologically rich languages. | English “played” → Spanish “jugó”. |
| **Speech Recognition**           | Identifies word variants pronounced differently.                                                | Recognizes *played* and *playing* as forms of *play*. |
| **Text-to-Speech Systems (TTS)** | Ensures correct pronunciation by identifying affix patterns.                                    | Distinguishes *read* (present) vs. *read* (past).     |


### Summary 

| **Aspect**                     | **What It Does**                              | **Why It Matters**                                |
| ------------------------------ | --------------------------------------------- | ------------------------------------------------- |
| **Morphemes**                  | Break words into smallest meaning units.      | Foundation of linguistic understanding.           |
| **Morphological Analysis**     | Identifies roots and affixes computationally. | Enables normalisation, tagging, and translation.  |
| **Rule-Based vs. Statistical** | Two main computational strategies.            | Choice depends on data availability and language. |
| **NLP Applications**           | IR, MT, Speech, TTS, Lemmatization.           | Improves both performance and interpretability.   |


---

## Spell Correction - Edit (Levenshtein) Distance

In spelling correction, we need a way to measure how different two words are. The Edit Distance, also known as the Levenshtein Distance, is one of the most fundamental metrics for this purpose. It quantifies the minimum number of single-character operations required to transform one string into another.

The allowed operations are:
- Insertion (I): Add one character.
- Deletion (D): Remove one character.
- Substitution (S): Replace one character with another.

For example, flaw → lawn:
- Delete “f” (1 operation)
- Insert “n” at the end (1 operation)
- Edit Distance = 2

Similarly, kitten → sitting:
- Substitute k → s
- Substitute e → i
- Insert g at the end
- Total edits = 3

Formal Definition

Let two strings be $a = a_1 a_2 \dots a_m$ and $b = b_1 b_2 \dots b_n$.

Define $M(i, j)$ as the edit distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

The recursive formulation is:

![image.png](attachment:image.png)

Explanation:
- M(i,j) is computed using dynamic programming.
- Each cell represents the minimum cost to align prefixes of the two strings up to positions i and j.
- The algorithm builds the matrix bottom-up.
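
For readers who cannot see the image, the recurrence can be written out in text form. This is a reconstruction from the definitions in this section, assuming unit costs; the bracket $[a_i \neq b_j]$ is shorthand for 1 when the two characters differ and 0 when they match.

$$
M(i, 0) = i, \qquad M(0, j) = j
$$

$$
M(i, j) = \min
\begin{cases}
M(i-1, j) + 1 & \text{(deletion)} \\
M(i, j-1) + 1 & \text{(insertion)} \\
M(i-1, j-1) + [a_i \neq b_j] & \text{(substitution or match)}
\end{cases}
$$

The edit distance between the full strings is then $M(m, n)$.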

### Example Matrix for flaw and lawn

|       |   | l | a | w | n |
| ----- | - | - | - | - | - |
| **#** | 0 | 1 | 2 | 3 | 4 |
| **f** | 1 | 1 | 2 | 3 | 4 |
| **l** | 2 | 1 | 2 | 3 | 4 |
| **a** | 3 | 2 | 1 | 2 | 3 |
| **w** | 4 | 3 | 2 | 1 | 2 |


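```
# a[i] is the row character and b[j] the column character of the current cell
if a[i] == b[j]:  M[i][j] = M[i-1][j-1]                                  # match: copy the diagonal
if a[i] != b[j]:  M[i][j] = 1 + min(M[i-1][j], M[i][j-1], M[i-1][j-1])  # mismatch: cheapest neighbour + 1
```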

![image.png](attachment:image.png)

![image.png](attachment:image.png)

| **Step**                   | **Description**                                                                    | **Example**                                   |
| -------------------------- | ---------------------------------------------------------------------------------- | --------------------------------------------- |
| **1. Initialize Matrix**   | Create matrix of size *(m+1) × (n+1)*, where m and n are lengths of the two words. | For “flaw” (4) and “lawn” (4), matrix is 5×5. |
| **2. Fill Base Cases**     | First row = 0…n, first column = 0…m (cost of inserting/deleting all characters).   | Represents converting from/to empty string.   |
| **3. Recurrence Relation** | For each i,j, choose the minimum cost among insertion, deletion, substitution.     | See formula above.                            |
| **4. Result**              | Bottom-right cell = total minimum edit cost.                                       | M[4][4] = 2 for “flaw” → “lawn”.              |

| **Operation**    | **Symbolic Form**               | **Cost**                       |
| ---------------- | ------------------------------- | ------------------------------ |
| **Insertion**    | $M(i, j-1) + 1$                 | 1                              |
| **Deletion**     | $M(i-1, j) + 1$                 | 1                              |
| **Substitution** | $M(i-1, j-1) + [a_i \neq b_j]$  | 1 if characters differ, else 0 |
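
The four steps and the cost table can be put together in a short Python sketch. The function names `edit_matrix` and `edit_distance` are chosen for this example, and unit costs are assumed for all three operations.

```python
def edit_matrix(a, b):
    """Build the full Levenshtein DP matrix for strings a (rows) and b (columns)."""
    m, n = len(a), len(b)
    # M[i][j] = distance between the first i chars of a and the first j chars of b.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i          # delete all i characters of a
    for j in range(n + 1):
        M[0][j] = j          # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            M[i][j] = min(
                M[i - 1][j] + 1,                       # deletion
                M[i][j - 1] + 1,                       # insertion
                M[i - 1][j - 1] + substitution_cost,   # substitution or match
            )
    return M


def edit_distance(a, b):
    """Return the bottom-right cell of the matrix: the minimum edit cost."""
    return edit_matrix(a, b)[-1][-1]


print(edit_distance("flaw", "lawn"))       # 2
print(edit_distance("kitten", "sitting"))  # 3
```

Printing `edit_matrix("flaw", "lawn")` row by row reproduces the example matrix shown earlier. Time and memory are both proportional to m × n, which is fine for single words but worth keeping in mind for long strings.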


### Examples

| **Source → Target** | **Edit Distance** | **Minimal operations (one optimal sequence)** | **Intermediate strings** |
| ------------------- | ----------------: | --------------------------------------------- | ------------------------ |
| `rain` → `shine`    |             **3** | substitute **r → s**, substitute **a → h**, insert **e** at end (keep `i`, `n`). | **rain** → *sain* → *shin* → *shine* |
| `shine` → `rain`    |             **3** | substitute **s → r**, substitute **h → a**, delete **e** at end (keep `i`, `n`). | **shine** → *rhine* → *raine* → *rain* |
| `shine` → `train`   |             **4** | insert **t** at front, substitute **s → r**, substitute **h → a**, delete **e** at end (keep `i`, `n`). | **shine** → *tshine* → *trhine* → *traine* → *train* |



| **Source → Target**       | **Edit Distance** | **Minimal operations (one possible optimal sequence)** |
| ------------------------- | ----------------: | ------------------------------------------------------ |
| `kitten` → `sitting`      |             **3** | substitute **k→s**, substitute **e→i**, insert **g** at end |
| `flaw` → `lawn`           |             **2** | delete **f**, insert **n** at end |
| `intention` → `execution` |             **5** | substitute **i→e**, **n→x**, **t→e**, **e→c**, **n→u** (the remaining letters *tion* match) |
| `book` → `back`           |             **2** | substitute **o→a**, substitute **o→c** |
| `color` → `colour`        |             **1** | insert **u** |
| `cat` → `cut`             |             **1** | substitute **a→u** |
| `Sunday` → `Saturday`     |             **3** | insert **a**, insert **t**, substitute **n→r** (one optimal path) |
| `abc` → `yabd`            |             **2** | insert **y** at front, substitute **c→d** |


### Applications in NLP

| **Application**                            | **Use of Edit Distance**                              | **Example**                                |
| ------------------------------------------ | ----------------------------------------------------- | ------------------------------------------ |
| **Spell Checking**                         | Suggests corrections based on smallest edit distance. | *“recieve” → “receive”*                    |
| **Plagiarism Detection / Text Similarity** | Measures textual similarity.                          | *“color” vs “colour” → distance = 1*       |
| **Speech Recognition**                     | Compares transcribed output with reference text.      | Accuracy evaluation using word error rate. |
| **DNA / Protein Sequence Analysis**        | Measures sequence similarity in bioinformatics.       | *ATCG* ↔ *ATGC*                            |
| **Chatbots / Fuzzy Matching**              | Matches user input with closest known command.        | *“helo” → “hello”*                         |
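
The fuzzy-matching row can be sketched as follows. The command list, the `max_distance` threshold, and the function name `closest_match` are hypothetical choices for this example; the distance function is a space-saving variant that keeps only one previous row of the matrix.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous row of the DP matrix."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_match(user_input, known_words, max_distance=2):
    """Return the known word nearest to user_input, or None if none is close."""
    if not known_words:
        return None
    # On ties, min() keeps the first word in the list.
    best = min(known_words, key=lambda word: edit_distance(user_input, word))
    return best if edit_distance(user_input, best) <= max_distance else None


COMMANDS = ["hello", "status", "exit"]   # hypothetical command list
print(closest_match("helo", COMMANDS))   # hello
```

The threshold matters: without it, every input would be mapped to some command, however unrelated. Ties are broken by list order here, which is one reason plain edit distance is usually combined with word frequency in practice.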


----

## Probabilistic Spell Correction - The Noisy Channel Model

While edit distance measures how similar two words are, it does not account for word likelihood or context. For example:
- speling → could be spelling or spieling

Both are one edit apart, but spelling is far more common.

Hence, we need a way to combine:
- How likely it is that a word was mistyped (error model), and
- How common or probable the correct word is in the language (language model).

This leads us to the **Noisy Channel Model (NCM)**, a probabilistic framework for choosing the most likely intended word given a misspelling.

When we see a misspelled word (e.g., “speling”), we want to guess what the intended correct word (e.g., “spelling”) was. The Noisy Channel Model treats this as a probabilistic decoding problem.

We aim to find: $w^* = \arg\max_w P(w \mid x)$

By Bayes' rule: $P(w \mid x) \propto P(x \mid w) \times P(w)$

- $P(w)$: how common or likely the word is (Language Model)
- $P(x \mid w)$: how likely we are to mistype or misread $w$ as $x$ (Error Model)
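
The step from Bayes' rule to the proportionality can be spelled out. The denominator $P(x)$ is the same for every candidate, so it does not affect which candidate wins and can be dropped:

$$
w^* = \arg\max_w P(w \mid x)
    = \arg\max_w \frac{P(x \mid w)\, P(w)}{P(x)}
    = \arg\max_w P(x \mid w)\, P(w)
$$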

The noisy channel model treats a typo as a distorted signal and asks which original word is most probable. To answer this, it calculates two values for each candidate:

- The **prior**, which measures how frequent the word is in natural text
- The **likelihood**, which measures how probable it is that the word was transformed into the typo through insertions, deletions, substitutions or swaps of adjacent letters

These are then multiplied to give a score for each candidate.

For example, for the typo *hte*, consider the candidates *the*, *hat*, and *hate*.

- *the* is extremely frequent, and swapping th → ht is a common mistake, giving it a high score.
- *hat* and *hate* are much less frequent, and the errors needed to produce *hte* from them are unlikely, so their scores are quite low.
- So, the model selects *the* as the most probable correction.

This approach is more nuanced than edit distance alone. It uses real language patterns and error statistics, not just letter similarity, which is why it underpins systems such as search suggestions and autocorrect. By combining frequency with error likelihood, the noisy channel model can identify corrections that match both language use and human mistakes.

In the Noisy Channel Model for spelling correction:

$P(w \mid x) \propto P(x \mid w) \times P(w)$

Where:
- w = the intended (correct) word
- x = the observed (possibly misspelled) word

| Term             | Meaning                                                                                                                | Example                                  | Role               |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- | ------------------ |
| **P(w)**         | Prior probability of a word: how common it is in the language (from corpus frequencies). | “spelling” more frequent than “spieling” | **Language Model** |
| **P(x given w)** | Probability that the correct word *w* would produce the observed misspelling *x*; captures likely **error patterns**. | Typing “speling” instead of “spelling” | **Error Model** |
| **P(w given x)** | Posterior probability: most likely intended word given the error. | Model picks “spelling” as correction | **Final Decision** |


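**Find the correct word c, given the observed wrong word w** (here c plays the role of the intended word, written w elsewhere in this section, and the observed wrong word plays the role of x)

wrong word = sitting

candidate correct words = eating, kitten, beating, ...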

![image-2.png](attachment:image-2.png)

### Two Components of the Noisy Channel Model

| **Component**      | **Symbol**      | **Meaning / Description**                                                                                           | **Estimated From**                                                      | **Role in Correction**                                                  | **Example Insight**                                                                                        |
| ------------------ | --------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| **Language Model** | P(w)            | Prior probability of the word *w* appearing in natural text. Represents how *linguistically likely* a word is. | Large text corpora (word frequency lists, n-gram models, dictionaries). | Prefers words that are **common** or **contextually plausible**. | “spelling” occurs far more often than “spieling”. |
| **Error Model**    | P(x given w)    | Likelihood that the intended word *w* produced the observed misspelling *x*. Captures *how likely an error is*. | Confusion matrices from typing, OCR, or speech error data. | Prefers corrections **consistent with typical human or system errors**. | Typing “speling” instead of “spelling” (missing an ‘l’) is common; “speling” instead of “selling” is rare. |


| Observed word (x) | Candidate word (w) | P(w) (from corpus) | P(x given w) (from error model) | P(x given w) × P(w) |
| ----------------- | ------------------ | ------------------ | ------------------------------- | ------------------- |
| speling           | spelling           | 0.0012             | 0.5                             | 0.0006              |
| speling           | spieling           | 0.0001             | 0.5                             | 0.00005             |
| speling           | selling            | 0.0011             | 0.05                            | 0.000055            |

Interpretation

- ✅ “spelling” → frequent word (high P(w)) + likely typo (high P(x given w)) → best candidate.
- ⚠️ “spieling” → plausible typo but very rare → low P(w).
- ⚠️ “selling” → common word but unlikely typo → low P(x given w).

Hence, $w^*$ = spelling
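
The scoring step is small enough to sketch directly. The probabilities below are the values from the table above, treated as given rather than estimated from any real corpus, and `rank_candidates` is a name chosen for this example.

```python
# Probabilities taken from the worked table; not estimated from a real corpus.
CANDIDATES = {
    # candidate w: (P(w), P(x given w)) for the observed word x = "speling"
    "spelling": (0.0012, 0.5),
    "spieling": (0.0001, 0.5),
    "selling": (0.0011, 0.05),
}


def rank_candidates(candidates):
    """Return (word, score) pairs sorted by P(x given w) * P(w), best first."""
    scores = {
        word: prior * likelihood
        for word, (prior, likelihood) in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(CANDIDATES)
best_word, best_score = ranking[0]
print(best_word)   # spelling
```

With real models the probabilities become very small, so implementations commonly add log-probabilities instead of multiplying; the ranking is unchanged because the logarithm is monotonic.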


### Comparison Edit Distance vs Noisy Channel Model

| **Aspect**             | **Edit Distance**                                                                                                        | **Noisy Channel Model**                                                                                                              |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| **Core Idea**          | Measures the *minimum number of edit operations* (insert, delete, substitute) needed to convert one string into another. | Estimates the probability of the intended word given the observed misspelling using **P(w given x) ∝ P(x given w) × P(w)**. |
| **Approach Type**      | **Deterministic / Distance-based**                                                                                       | **Probabilistic / Statistical**                                                                                                      |
| **Main Components**    | Edit operations (Insertion, Deletion, Substitution).                                                                     | Two components: **Language Model (P(w))** and **Error Model (P(x given w))**.                                                        |
| **Output Meaning**     | Produces a *numerical distance*: smaller means more similar. | Produces a *probability score*: higher means the candidate is a more likely correction. |
| **Data Requirement**   | Works purely on string structure; **no corpus or training data needed**. | Requires **language corpus** (for P(w)) and **error statistics** (for P(x given w)). |
| **Context Awareness**  | Context-independent: each word is treated in isolation. | Frequency-aware through P(w); context-aware only when P(w) comes from a contextual language model (e.g., n-grams). |
| **Error Modeling**     | All edits usually have **equal cost** or simple fixed weights. | Can assign **different probabilities** for common vs. rare errors (e.g., “ie” ↔ “ei”). |
| **Mathematical Basis** | Dynamic programming (Levenshtein algorithm). | Bayesian inference: **w\* = arg max [P(x given w) × P(w)]**, i.e., maximize P(w given x). |
| **Example Use Case**   | Fuzzy string matching, DNA sequence comparison, basic spell-check.                                                       | Probabilistic spell correction, OCR or speech recognition error correction.                                                          |
| **Interpretability**   | Simple, transparent, and easy to compute. | Rich but more complex; combines statistical language and error likelihood. |
| **Advantages**         | - Language-agnostic and simple.<br>- No training data required.<br>- Fast for short strings.                             | - Captures both **language frequency (P(w))** and **error likelihood (P(x given w))**.<br>- Produces **more realistic corrections**. |
| **Disadvantages**      | - Treats all errors equally.<br>- Ignores context and word frequency.<br>- Lower real-world accuracy.                    | - Needs **large corpora** and **error data**.<br>- More computationally expensive.<br>- Harder to tune and interpret.                |
| **Example**            | `flaw → lawn` → 2 edits. | `speling → spelling` → picks the candidate with highest **P(x given w) × P(w)**. |


### Summary

| **Dimension**                           | **Edit Distance**        | **Noisy Channel Model**                          |
| --------------------------------------- | ------------------------ | ------------------------------------------------ |
| **Type**                                | Deterministic (distance-based) | Probabilistic                               |
| **Needs Language Data?**                | ❌ No                     | ✅ Yes                                            |
| **Considers Real Error Patterns?**      | ❌ No                     | ✅ Yes                                            |
| **Considers Word Frequency / Context?** | ❌ No                     | ✅ Yes                                            |
| **Best For**                            | Simple string similarity | Realistic spelling correction, OCR/Speech errors |


---

## Stemming and Lemmatization

After text has been normalised, tokenised, and corrected for spelling, different inflected or derived forms of the same word may still appear separately. From a computational perspective, words such as connect, connected, connecting, and connection share the same core meaning and should ideally map to a common base form. Stemming and Lemmatization achieve this by reducing words to their root or lemma, thereby lowering vocabulary size and improving linguistic consistency.

| **Term** | **Definition** | **Example Output** |
|-----------|----------------|--------------------|
| **Stemming** | Rule-based stripping of affixes (suffixes, in the Porter family) to obtain an approximate root. | studies → studi |
| **Lemmatization** | Linguistically informed reduction to canonical dictionary form. | studies → study |
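
To make the contrast concrete, here is a deliberately tiny sketch of both ideas. It is not the Porter algorithm and not a real lemmatizer: the suffix rules, the lemma table, and the names `toy_stem` and `toy_lemmatize` are assumptions invented for illustration.

```python
# Hypothetical mini-resources, for illustration only.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
}


def toy_stem(word):
    """Apply the first matching suffix rule; no dictionary is consulted."""
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two letters of stem so that "is" is not cut to "i".
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word, pos):
    """Look up (word, POS) in the lemma table; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_stem("studies"), toy_lemmatize("studies", "NOUN"))   # studi study
```

The stemmer is fast and needs no dictionary, but it produces non-words such as *studi* and cannot relate *better* to *good*. The lemmatizer returns real dictionary forms, at the cost of needing a lexicon and a part-of-speech tag for each word.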


### Porter Stemmer vs Snowball Stemmer

| **Aspect**                        | **Porter Stemmer**                                                                                                 | **Snowball Stemmer (Porter2)**                                                                                    |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------- |
| **Origin**                        | Original 1980 algorithm written in procedural rules for English only.                                              | Reimplementation in **Snowball language**, offering cleaner syntax, easier maintenance, and multilingual support. |
| **Language Support**              | Only **English**.                                                                                                  | **Multiple languages** (e.g., English, French, German, Spanish, Dutch, Italian, Russian, Portuguese, and more).   |
| **Algorithm Design**              | Hardcoded sequence of conditional rewrite rules divided into 5 steps.                                              | More **generalized and modular**, with consistent design and minor rule refinements (sometimes called *Porter2*). |
| **Stemming Quality**              | Produces correct stems most of the time but can **over-stem** or **under-stem** (e.g., *“universal” → “univers”*). | Produces **cleaner and more consistent stems**, fixing some issues in the original Porter algorithm. |
| **Implementation**                | Implemented manually in many libraries (e.g., NLTK `PorterStemmer`).                                               | Implemented as part of the **Snowball stemmer library**, available in NLTK as `SnowballStemmer`.                  |
| **Readability / Maintainability** | Difficult to modify ‚Äî rule-heavy procedural code.                                                                  | Readable and maintainable ‚Äî uses Snowball scripting language for defining rules declaratively.                    |
| **Speed**                         | Slightly faster (simpler rule set).                                                                                | Slightly slower but more accurate and standardized.                                                               |
| **Output Example**                | “organization” → *organ*<br>“generously” → *gener* | “organization” → *organ*<br>“generously” → *generous* (less aggressive) |
| **Use Case**                      | Quick English-only stemming in search engines or text preprocessing.                                               | Multilingual, standardized stemming in NLP pipelines.                                                             |


### Porter Stemmer vs Lemmatization

| **Aspect**                   | **Stemming (Porter Stemmer, Rule-Based)** | **Lemmatization (Linguistic / POS-Aware)** |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Definition**               | Rule-based removal of suffixes to obtain an approximate root form. | Uses **morphological analysis** and **part-of-speech (POS)** information to identify the dictionary (lemma) form. |
| **Underlying Method**        | Successive rewrite rules (Porter algorithm, 1980) applied in **five steps**; each example shows the effect of that step, not necessarily the final stem:<br>1️⃣ Strip plural and -ed / -ing endings → *“agreed” → “agree”*<br>2️⃣ Reduce double suffixes to single ones → *“relational” → “relate”*<br>3️⃣ Simplify or strip suffixes such as -ness, -ful, -icate → *“sadness” → “sad”*<br>4️⃣ Strip remaining suffixes such as -ive, -ion, -ous when the stem is long enough → *“effective” → “effect”*<br>5️⃣ Drop a terminal -e when the stem is long enough → *“probate” → “probat”* | Uses **linguistic rules + dictionaries + POS tags** to find the correct lemma.<br>Mathematically: *f(word, POS) → lemma*<br>Examples:<br>• *running (Verb) → run*<br>• *studies (Noun) → study*<br>• *better (Adjective) → good*<br>• *was (Verb) → be* |
| **Example (Sentence)**       | “The boys are playing happily and studied hard.” →<br>**Stems:**<br>• boys → boy<br>• playing → play<br>• happily → happili<br>• studied → studi | “The boys are playing happily and studied hard.” →<br>**Lemmas:**<br>• boys → boy<br>• playing → play<br>• happily → happily (adverbs are not reduced to adjectives)<br>• studied → study |
| **Output Validity**          | May yield **non-words** (e.g., *studies → studi*). | Produces **valid dictionary words** (e.g., *studies → study*), provided the word is covered by the lexicon. |
| **Context Awareness**        | ❌ None: does not consider word meaning or POS. | ✅ High: uses POS and grammar to determine the correct lemma. |
| **Handling Irregular Forms** | Poor: cannot handle irregular verbs or adjectives (*went → went*). | Excellent: handles irregular inflections correctly (*went → go*). |
| **Accuracy**                 | Moderate: may over-stem (*universe, university → univers*) or under-stem (*alumnus → alumnu* but *alumni → alumni*). | High: linguistically correct and semantically consistent. |
| **Speed / Efficiency**       | Very fast and lightweight (rule-based). | Slower: requires morphological and POS analysis. |
| **Language Dependence**      | Requires handcrafted rules for each language.                                                                                                                                                                                                                                                                                                                                                    | Requires language-specific lexicons and POS taggers.                                                                                                                                                                                                    |
| **Advantages**               | • Fast, simple, and efficient.<br>• Reduces vocabulary sparsity for search or indexing.<br>• Suitable for Information Retrieval (IR) tasks. | • Produces valid, meaningful lemmas.<br>• Handles irregular and inflected forms.<br>• Context-aware and semantically accurate.<br>• Suitable for advanced NLP tasks. |
| **Disadvantages**            | • May output non-words.<br>• Over- or under-stemming.<br>• Ignores context.<br>• Rule creation is language-specific. | • Computationally expensive.<br>• Slower.<br>• Requires linguistic resources (POS taggers, morphological dictionaries). |
| **Best Use Cases**           | • Search Engines (IR): reduce term variants.<br>• Topic Modelling (LDA): reduce redundancy.<br>• Text Classification: minimize feature space. | • Machine Translation (MT): maintain grammatical agreement.<br>• Question Answering (QA): normalize between Q & A forms.<br>• Semantic Analysis: preserve true meaning.<br>• Text Summarization. |
| **Illustrative Comparison**  | *studies → studi*<br>*relational → relat* | *studies → study*<br>*relational → relational* (derivational suffixes are kept) |
| **Common Tools**             | Porter Stemmer, Snowball Stemmer, Lancaster Stemmer (NLTK).                                                                                                                                                                                                                                                                                                                                      | WordNetLemmatizer (NLTK), spaCy Lemmatizer, Stanza.                                                                                                                                                                                                     |
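
The contrast between rule-based suffix stripping and the mapping *f(word, POS) → lemma* can be sketched without any library. The example below is a toy illustration only: `toy_stem`, `toy_lemmatize`, the suffix list, and the small lexicon are all hypothetical and far simpler than the Porter algorithm or a real lemmatizer.

```python
# Ordered suffix rules: (suffix, replacement). Longer suffixes come first.
# This is a simplified stand-in, NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Toy lexicon keyed by (word, POS); a real lemmatizer uses a full dictionary.
LEXICON = {
    ("studies", "NOUN"): "study",
    ("studied", "VERB"): "study",
    ("went", "VERB"): "go",
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
}


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule, keeping a stem of at least 3 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up (word, POS) in the lexicon; fall back to the word itself."""
    return LEXICON.get((word, pos), word)


for word, pos in [("studies", "NOUN"), ("studied", "VERB"), ("went", "VERB"), ("better", "ADJ")]:
    print(f"{word:>8} ({pos}): stem={toy_stem(word):<7} lemma={toy_lemmatize(word, pos)}")
```

The stemmer needs only the string, so it is fast but blind to irregular forms such as *went*. The lemmatizer needs both a POS tag and a lexicon entry, which is where its extra cost and its extra accuracy come from.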


### Summary

| **Dimension**           | **Stemming**                    | **Lemmatization**             |
| ----------------------- | ------------------------------- | ----------------------------- |
| **Type**                | Rule-based (heuristic suffix stripping) | Dictionary- and POS-based (morphological) |
| **Output**              | Root-like form (may be invalid) | True dictionary lemma         |
| **Speed**               | Fast                            | Slower                        |
| **Context Sensitivity** | None                            | High                          |
| **Accuracy**            | Moderate                        | High                          |
| **Use**                 | IR, indexing, search            | Semantic NLP, translation, QA |


----
----