<a href="https://colab.research.google.com/github/HamdanXI/nlp_adventure/blob/main/exam/prepare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Qualifying Exam Preparation

## Traditional NLP

### Stemming

#### What is Stemming?

Stemming is the process of reducing words to a stem, base, or root form by removing affixes (prefixes and suffixes). The resulting stem is often not a complete word by itself, but it is shared by related words. For instance, a stemmer reduces "running" and "runs" to "run". Irregular forms such as "ran" cannot be reached by stripping affixes and usually stay unchanged; mapping them to "run" is a job for lemmatization.

<br>

#### Why is Stemming Used?

1. **Simplification**: It simplifies textual data, reducing the complexity of subsequent NLP tasks.
2. **Speed**: Stemming is generally faster than more complex methods like lemmatization because it uses simple heuristics.
3. **Efficiency**: It increases the efficiency of information retrieval by linking words with the same roots.

<br>

#### How Does Stemming Work?

Stemming algorithms work by cutting off the end or the beginning of a word, guided by a list of common prefixes and suffixes found in inflected words. This process is fairly crude, so the stems may not be actual words. Two widely used stemmers are:
- **Porter Stemmer**: One of the most common stemmers, and a comparatively gentle one. It is known for its simplicity and speed.
- **Lancaster Stemmer**: More aggressive than Porter. It often produces shorter stems, which raises the risk of over-stemming when accuracy is critical.
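
The suffix-stripping idea described above can be sketched in a few lines. This is a toy illustration only, not the Porter or Lancaster algorithm: the suffix list, its ordering, and the `min_stem_len` parameter are assumptions made for this sketch.

```python
# Toy suffix-stripping stemmer (illustrative; not Porter or Lancaster).
# The suffix list and minimum stem length are assumptions for this sketch.
SUFFIXES = ["ing", "ly", "ed", "es", "s"]  # checked in this order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["running", "eats", "flies", "quickly", "ran"]:
    print(w, "->", toy_stem(w))
# running -> runn
# eats -> eat
# flies -> fli
# quickly -> quick
# ran -> ran
```

Stems such as "runn" and "fli" show why the process is called crude: the rules know nothing about doubled consonants or spelling changes, and the irregular form "ran" is left untouched.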

<br>

#### Challenges with Stemming

- **Over-stemming**: Occurs when two words with different roots are reduced to the same stem. A classic example is "universe" and "university" both becoming "univers". This loses information needed to tell the original words apart.
- **Under-stemming**: Occurs when two words that should be stemmed to the same root are not. This can lead to inconsistent results in search queries and information retrieval.
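
Both failure modes can be made visible by grouping words by their stem. The sketch below uses a hypothetical, deliberately crude stemmer (`crude_stem`, with a made-up suffix list) so that the behaviour is easy to trace by hand; a real stemmer would differ in its details.

```python
from collections import defaultdict

def crude_stem(word):
    """Hypothetical crude stemmer: strip one suffix from a small, made-up list."""
    for suffix in ("ity", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for w in words:
        groups[crude_stem(w)].append(w)
    return dict(groups)

groups = group_by_stem(["universe", "university", "universal", "data", "datum"])
print(groups)
# {'univers': ['universe', 'university', 'universal'], 'data': ['data'], 'datum': ['datum']}
```

The three "univers" words are conflated even though their meanings differ (over-stemming), while "data" and "datum" stay apart even though they are forms of the same word (under-stemming).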

<br>

#### Example in Python

Using the NLTK library, let's look at how to use both the Porter and Lancaster stemmers:

```python
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["running", "eats", "flying", "quickly"]
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

print("Porter Stems:", porter_stems)
print("Lancaster Stems:", lancaster_stems)
```

This script will output the stems of the words using both stemmers:

```plaintext
Porter Stems: ['run', 'eat', 'fli', 'quickli']
Lancaster Stems: ['run', 'eat', 'fly', 'quick']
```

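As the output shows, the two stemmers disagree on "flying" and "quickly": Porter rewrites the final "y" as "i" ("fli", "quickli"), producing stems that are not real words, while Lancaster strips the whole "-ly" ending ("quick"). Lancaster is generally the more aggressive of the two. These stemming techniques are useful in scenarios where the exact form of a word is less important than linking variants of the word to the same base form.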

### Lemmatization

#### What is Lemmatization?

Lemmatization is the process of reducing a word to its base or root form, known as the lemma. Unlike stemming, which crudely chops off word endings to achieve the root form, lemmatization considers the morphological analysis of the word, ensuring that the reduced form is a valid word according to the language's vocabulary. This process involves understanding the context and part of speech of a word in a sentence, as well as its standard form according to a language's rules.

<br>

#### Why is Lemmatization Important?

1. **Reduces Complexity**: It decreases the complexity of text data by reducing variations of the same word to a single form, which improves the performance of various NLP tasks.
2. **Improves Accuracy**: Since lemmatization keeps the semantic meaning of the word intact, it's more accurate than stemming for tasks that need a higher level of understanding, such as semantic analysis.
3. **Facilitates Better Text Analysis**: By converting words to their base forms, it becomes easier to perform tasks like textual comparison and pattern recognition.

<br>

#### How Does Lemmatization Work?

Lemmatization works by using vocabulary and morphological analysis of words. The goal is to remove only inflectional endings and return the base or dictionary form of a word, which is known as the lemma. For instance, the lemma of "was" is "be," and the lemma of "mice" is "mouse."
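
A minimal sketch of the "vocabulary plus morphological rule" idea follows. The lookup table, the POS labels (`"n"`, `"v"`), and the single plural rule are assumptions made for illustration; real lemmatizers rely on much larger lexicons.

```python
# Toy lemmatizer: a lookup table for irregular forms plus one regular rule.
# Table entries and POS labels ("n", "v") are assumptions for this sketch.
IRREGULAR = {
    ("was", "v"): "be",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form of `word`, given its part of speech."""
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    # Regular plural nouns: drop a final "s" (but not "ss", as in "glass").
    if pos == "n" and word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

print(toy_lemmatize("mice"))       # mouse
print(toy_lemmatize("was", "v"))   # be
print(toy_lemmatize("ran"))        # ran  (treated as a noun, so no match)
print(toy_lemmatize("ran", "v"))   # run
```

Because the table is keyed by both word and part of speech, the same surface form can receive different lemmas, which is why supplying the correct POS matters.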

<br>

#### Challenges in Lemmatization

- **Language Dependency**: Lemmatization rules can be complex and highly language-dependent. For example, in English, verbs and nouns are lemmatized differently.
- **Resource Intensive**: It requires more computational resources than stemming, as it needs a complete dictionary of lemmas and morphological analysis.

<br>

#### Example in Python

Using the popular NLP library called NLTK, here is how you can perform lemmatization:

```python
import nltk
nltk.download('wordnet')  # Download the necessary lexicons
nltk.download('omw-1.4')  # Download the additional WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "run", "easily", "fairer"]
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)
```

This code snippet will output the lemmas of the provided words:

```plaintext
['running', 'ran', 'run', 'easily', 'fairer']
```

It's important to note that without specifying the part of speech (POS), the lemmatizer treats every word as a noun, which can lead to incorrect lemmas for verbs and adjectives.

spaCy infers each token's part of speech from context as part of its pipeline, so it can lemmatize a whole sentence without manual POS hints:

```python
import spacy

# Load the spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Function to lemmatize a sentence
def lemmatize_sentence(sentence):
    # Process the sentence using spaCy
    doc = nlp(sentence)
    # Extract the lemma for each token and return them
    lemmas = [token.lemma_ for token in doc]
    return lemmas

# Example usage
sentence = "The striped bats are hanging on their feet for best"
lemmatized_tokens = lemmatize_sentence(sentence)
print("Lemmatized tokens:", lemmatized_tokens)
```

This would output:

```plaintext
Lemmatized tokens: ['the', 'stripe', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good']
```


### Tokenization

#### What is Tokenization?

Tokenization is a fundamental step in natural language processing (NLP) where text is divided into smaller units called tokens. These tokens can be words, numbers, or punctuation marks. The process helps in preparing text for deeper processing like parsing, part of speech tagging, and sentiment analysis.

<br>

#### Why is Tokenization Important?

1. **Simplification**: It simplifies text analysis by breaking down large pieces of text into manageable units.
2. **Standardization**: Tokens become the standard input for most NLP tasks.
3. **Efficiency**: It increases the efficiency of other NLP processes, as they can operate on simplified and standardized data.

<br>

#### Types of Tokenization

1. **Word Tokenization**: Splits text into words. It's the most common form, useful for tasks like frequency analysis and word-level processing.
2. **Sentence Tokenization**: Breaks text into sentences. This is useful for tasks that require understanding the context or meaning of sentences, like summarization.
3. **Subword Tokenization**: Divides words into smaller meaningful units (subwords or morphemes). This is particularly useful in language modeling and machine translation to handle rare words or morphologically rich languages.
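
The three granularities can be sketched with the standard library alone. These are simplified stand-ins, not the NLTK implementations: the regular expressions are assumptions, and the subword function uses greedy longest-match segmentation over a hypothetical hand-picked vocabulary (`VOCAB`) rather than a learned one.

```python
import re

def word_tokenize_simple(text):
    # A run of word characters (optionally with an internal apostrophe),
    # or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sentence_tokenize_simple(text):
    # Split after ., ! or ? when followed by whitespace.
    # Abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation over a fixed vocabulary.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches here
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

VOCAB = {"un", "happi", "ness", "happy"}  # hypothetical subword vocabulary

print(word_tokenize_simple("Hello, how are you doing today?"))
print(sentence_tokenize_simple("It rained. We stayed in! Did you?"))
print(subword_tokenize("unhappiness", VOCAB))  # ['un', 'happi', 'ness']
```

The subword example shows how a word that is absent from the vocabulary as a whole can still be represented by known pieces.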

<br>

#### Challenges

- **Complexity in Different Languages**: Languages with no clear word boundaries (like Chinese or Japanese) require more sophisticated techniques beyond whitespace-based tokenization.
- **Handling Special Cases**: Punctuation, contractions (like "don't"), and special characters can complicate straightforward splits.
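
To see why special cases matter, compare a plain whitespace split with a tokenizer that separates punctuation and the "n't" clitic. The function below is a rough sketch in the spirit of Penn Treebank-style tokenization; the name `split_contractions` and its two regular expressions are assumptions for illustration.

```python
import re

def split_contractions(text):
    # Separate the "n't" clitic from its host word ("don't" -> "do n't").
    text = re.sub(r"(\w+)n't\b", r"\1 n't", text)
    # Then emit clitics, words, and individual punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)

text = "I don't know."
print(text.split())              # ['I', "don't", 'know.']
print(split_contractions(text))  # ['I', 'do', "n't", 'know', '.']
```

The whitespace split leaves the period glued to "know" and treats "don't" as one opaque unit, whereas the second version exposes the negation as its own token.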

<br>

#### Examples in Python

Using a popular NLP library in Python called NLTK, here's how you can perform word tokenization:

```python
import nltk
nltk.download('punkt')  # Download the necessary models

text = "Hello, how are you doing today?"
tokens = nltk.word_tokenize(text)
print(tokens)
```

This would output:

```plaintext
['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
```

### Chunking

#### What is Chunking?
Chunking, also known as shallow parsing, is the process of extracting phrases from unstructured text and grouping together consecutive words into larger units—commonly known as "chunks." Instead of just identifying parts of speech, chunking groups words into meaningful sequences like noun phrases or verb phrases that provide more structure than individual words for processing.

<br>

#### Why is Chunking Important?
1. **Structure Extraction**: Chunking helps extract more structure from text than individual tokenization or POS tagging by identifying the constituents of sentences.
2. **Information Extraction**: It aids in extracting entities (like names or places) and the relations between them, which is crucial for tasks like information extraction and named entity recognition.
3. **Improves Understanding**: By identifying phrases, chunking helps in understanding the context and the syntactic meaning of the text better.

<br>

#### How Does Chunking Work?
Chunking usually works on top of part-of-speech (POS) tagging and uses rules or machine learning models to identify the different chunks. For example, a simple rule might be to group any combination of an optional determiner followed by any number of adjectives and then a noun into a noun phrase (NP).

<br>

#### Rules for Chunking
A common way to perform chunking is to use regular-expression-based rules. For example:
- **NP**: `{<DT>?<JJ>*<NN>}` - This rule states that a noun phrase, NP, might start with an optional determiner (DT), followed by any number of adjectives (JJ), and ends with a noun (NN).
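
To make the rule concrete, here is a small pure-Python sketch that applies the same `<DT>?<JJ>*<NN>` pattern to a list of already tagged tokens. It is not how NLTK implements `RegexpParser`; the helper name `chunk_noun_phrases` and the hand-written tags in the example are assumptions for illustration.

```python
import re

def chunk_noun_phrases(tagged_tokens):
    """Group tokens whose tags match <DT>?<JJ>*<NN> into noun-phrase chunks."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Map character offsets in tag_string back to token indices.
    offsets, pos = {}, 0
    for i, (_, tag) in enumerate(tagged_tokens):
        offsets[pos] = i
        pos += len(tag) + 2  # +2 for the angle brackets
    offsets[pos] = len(tagged_tokens)

    chunks = []
    for m in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        start, end = offsets[m.start()], offsets[m.end()]
        chunks.append([word for word, _ in tagged_tokens[start:end]])
    return chunks

# Hand-written tags for this sketch (no tagger is run here).
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
chunks = chunk_noun_phrases(tagged)
print(chunks)
# [['The', 'quick', 'brown'], ['fox'], ['the', 'lazy', 'dog']]
```

Because the rule ends at the first `<NN>`, two adjacent nouns produce two separate chunks; allowing `<NN>+` at the end of the pattern would merge them.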

<br>

#### Example in Python
Using NLTK, we can implement chunking as follows:

```python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)

# Define your chunk pattern
pattern = 'NP: {<DT>?<JJ>*<NN>}'

# Create a chunk parser
cp = RegexpParser(pattern)
cs = cp.parse(tagged_tokens)

# Display the chunked sentence
print(cs)
cs.draw()
```

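In this example, the script tokenizes the sentence, tags each token with its part of speech, and then applies a chunking pattern to identify noun phrases. The `cs.draw()` method displays the sentence structure graphically, showing which parts of the sentence are grouped as noun phrases. It opens a separate Tkinter window, so it needs a graphical display and will not work in a headless environment such as Colab; `print(cs)` works everywhere.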

<br>

#### Challenges in Chunking

- **Complex Patterns**: Designing rules that accurately capture the intended chunks without being too general or too specific can be challenging.
- **Language Dependence**: Chunking rules can be highly dependent on the language and may not transfer well from one language to another without adjustments.

### PoS Tagging

#### What is PoS Tagging?
Part-of-speech tagging, or PoS tagging, involves assigning a part-of-speech (such as noun, verb, adjective, etc.) to each word in a given text, based on both its definition and its context within a sentence. This is fundamental for syntactic parsing and text analysis.

<br>

#### Why is PoS Tagging Important?
1. **Syntax Analysis**: It helps in understanding the grammatical structure of sentences, which is crucial for higher-level NLP tasks like parsing and entity recognition.
2. **Disambiguation**: Helps in resolving ambiguities in language by clarifying whether a word is used as a noun, verb, or adjective in a particular context.
3. **Improves Machine Translation**: Accurate tagging is crucial for effective translation, as it helps in constructing sentences correctly in the target language.

<br>

#### How Does PoS Tagging Work?

PoS tagging can be performed using different methods:
- **Rule-Based Tagging**: Uses hand-written rules to decide the tag based on the words’ affixes and the context in which they appear.
- **Stochastic Tagging**: Relies on statistical models, often trained on a tagged corpus, to predict the most likely tag based on the word itself and its surrounding context.
- **Machine Learning Approaches**: Modern taggers use more complex models, including Hidden Markov Models (HMM), Conditional Random Fields (CRF), and neural networks, which can learn from large amounts of data and capture more subtle distinctions in how words are used.
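
The simplest stochastic tagger is a most-frequent-tag baseline: for each word, pick the tag it received most often in training. The sketch below trains one on a tiny hand-made corpus (`TRAIN`), which is an assumption for illustration and not a real treebank.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (an assumption for this sketch).
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("bark", "NN"), ("is", "VBZ"), ("rough", "JJ")],
    [("dogs", "NNS"), ("bark", "VBP")],
    [("the", "DT"), ("tree", "NN"), ("bark", "NN")],
]

def train_unigram_tagger(sentences):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NN"):
    """Tag each token with its most frequent tag; unknown words get `default`."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_unigram_tagger(TRAIN)
print(tag(["The", "dogs", "bark"], model))
# [('The', 'DT'), ('dogs', 'NNS'), ('bark', 'NN')]
```

The baseline tags "bark" as a noun even where it is clearly a verb, because it ignores context. Models such as HMMs and CRFs address this by also scoring the sequence of neighbouring tags.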

<br>

#### Example in Python

Using the NLTK library, which provides access to some pre-trained PoS taggers, here's how you can tag a sentence:

```python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)

print(tagged_tokens)
```

This script will output a list of tuples, where each tuple consists of a word and its corresponding part-of-speech tag based on the default English PoS tagger:

```plaintext
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```

<br>

#### Challenges in PoS Tagging

- **Ambiguity**: Some words can represent more than one part of speech based on the context, making PoS tagging non-trivial.
- **Domain-Specific Language**: Specialized vocabularies, such as medical or legal terminologies, can have different usage patterns that standard taggers may not handle well.
- **New Words and Slang**: Languages evolve, and new words, slang, and terminology may not be recognized by existing models.

### Parsing

#### What is Parsing?

Parsing, in the context of NLP, is the process of analyzing text according to the rules of a formal grammar in order to identify its grammatical structure. This involves breaking a text into its component parts and working out how these parts fit together, with the result often represented as a parse tree.

<br>

#### Types of Parsing

1. **Constituency Parsing**: This type focuses on the structure of the sentence as defined by the grammar of the language. It breaks a text into sub-phrases, known as constituents, which can be nested within each other. The result is often visualized as a tree, where each node represents a constituent.
   
2. **Dependency Parsing**: Rather than focusing on sub-phrases, dependency parsing builds relationships based on word dependencies. Each sentence is represented as a directed graph, where nodes are words and edges are dependencies between the words, such as subject or object relationships.
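
The difference between the two representations is easiest to see as data structures. The sketch below writes both analyses of one sentence by hand; the nested-tuple encoding, the helper names, and the relation labels (in the style of Universal Dependencies) are assumptions for illustration, not the output of a parser.

```python
# Constituency parse as nested tuples: (label, child, child, ...); leaves are words.
constituency = (
    "S",
    ("NP", ("D", "the"), ("N", "man")),
    ("VP", ("V", "eats"), ("NP", ("D", "a"), ("N", "fish"))),
)

# Dependency parse as (head, relation, dependent) triples.
dependencies = [
    ("eats", "nsubj", "man"),
    ("man", "det", "the"),
    ("eats", "obj", "fish"),
    ("fish", "det", "a"),
]

def leaves(tree):
    """Return the words of a constituency tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the constituent label
        words.extend(leaves(child))
    return words

def dependents_of(head, triples):
    """Return the words that depend directly on `head`."""
    return [dep for h, _, dep in triples if h == head]

print(leaves(constituency))                 # ['the', 'man', 'eats', 'a', 'fish']
print(dependents_of("eats", dependencies))  # ['man', 'fish']
```

The constituency tree introduces phrase nodes (NP, VP) that do not correspond to any single word, while the dependency graph contains only the words themselves and labelled links between them.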

<br>

#### Why is Parsing Important?

- **Syntax Analysis**: Parsing is crucial for understanding the syntactic structure of sentences, which is fundamental for numerous applications like machine translation, question answering, and speech recognition.
- **Information Extraction**: It aids in extracting structured information from text, such as in legal documents or technical manuals.
- **Improving Language Understanding**: Advanced parsing helps machines understand complex linguistic constructs, enhancing natural language understanding systems.

<br>

#### How Does Parsing Work?

Parsing algorithms can be broadly classified into two categories:
- **Rule-Based Parsing**: Uses a set of predefined grammar rules and attempts to apply these rules to find out the sentence structure. Common algorithms include top-down parsing, bottom-up parsing, and chart parsing.
- **Statistical Parsing**: Employs statistical methods and machine learning models trained on corpora containing sentences and their correct structures. These parsers predict the most likely structure of a new sentence based on learned patterns.
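
As an illustration of rule-based, top-down parsing, here is a minimal recursive-descent parser. The grammar, the lexicon, and the function names are assumptions made for this sketch; it is not NLTK's `ChartParser`, and it would loop forever on a left-recursive rule such as `NP -> NP PP`.

```python
# Toy grammar (an assumption for this sketch): nonterminal -> right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"the": "D", "a": "D", "man": "N", "fish": "N", "eats": "V"}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover tokens from `pos`."""
    if symbol not in GRAMMAR:  # word category: must match the next token
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, pos, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])

def parse_sentence(sentence):
    """Return all complete parses of a whitespace-separated sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

print(parse_sentence("the man eats a fish"))
print(parse_sentence("man the eats"))  # [] (no parse under this grammar)
```

The parser starts from `S` and expands rules until it reaches words, which is the top-down strategy. A bottom-up parser would instead start from the words and combine them into larger constituents.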

<br>

#### Example in Python

Using the popular NLTK library, here’s how you can implement a simple parser:

```python
import nltk
from nltk import CFG
from nltk.parse import ChartParser

# Define a context-free grammar for a small subset of English
grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP
    NP -> D N
    V -> "eats" | "drinks"
    D -> "the" | "a"
    N -> "man" | "fish" | "apple" | "water"
""")

# Create a parser
parser = ChartParser(grammar)

# Parse a sentence
sentence = 'the man eats a fish'.split()
trees = list(parser.parse(sentence))

for tree in trees:
    print(tree)
    tree.draw()
```

This script will print a parse tree showing the structure of the sentence according to the defined grammar. The `tree.draw()` method visualizes this structure in a separate window, which requires a graphical display.

<br>

#### Challenges in Parsing

- **Complexity**: Parsing is computationally expensive, especially for longer sentences with more complex structures.
- **Ambiguity**: Natural language is often ambiguous, making it difficult to identify a single correct parse tree.
- **Coverage**: Rule-based parsers require comprehensive and detailed grammar rules, which are hard to define for all possible sentences in a language.

### Linguistic Ambiguity

#### What is Linguistic Ambiguity?
Linguistic ambiguity refers to the phenomenon where a sentence, phrase, word, or even a sound can be interpreted in multiple ways. Understanding and resolving ambiguity is a key challenge in natural language processing (NLP).

<br>

#### Types of Linguistic Ambiguity

1. **Lexical Ambiguity**: This occurs when a word has multiple meanings. For example, the word "bank" can refer to the side of a river or a financial institution.

2. **Syntactic Ambiguity**: Also known as structural ambiguity, this happens when a sentence can be parsed in more than one way. For example, "I saw the man with a telescope" can mean either that I used a telescope to see the man or that the man I saw had a telescope.

3. **Semantic Ambiguity**: This type of ambiguity arises when the meanings of sentences are unclear or have multiple interpretations beyond individual word meanings or syntactic structure. For instance, "He’s visiting the bank" doesn’t specify which type of bank.

4. **Pragmatic Ambiguity**: Occurs when the context does not provide enough information to clarify the meaning, even though the words and structure are clear. For example, the response "I have" in answer to the question "Do you have the time or the inclination?" is pragmatically ambiguous.

<br>

#### Why is Linguistic Ambiguity Important?

- **Language Understanding**: Ambiguity is inherent in language and understanding it is crucial for effective communication, humor, and language richness.
- **NLP Applications**: Ambiguity presents both a challenge and an opportunity in NLP tasks like machine translation, speech recognition, and sentiment analysis, where different interpretations can lead to different outcomes.

<br>

#### Resolving Ambiguity

Resolving ambiguity is often context-dependent and can involve several strategies:

- **Syntactic Parsing**: Techniques like parsing can help determine the most likely structure of a sentence and clarify syntactic ambiguity.
- **Semantic Analysis**: Tools like word sense disambiguation are used to determine which meaning of a word is being used in a context.
- **Pragmatic Understanding**: Understanding the speaker’s intent and the conversational context can help resolve many ambiguities, particularly pragmatic ambiguity.
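
Word sense disambiguation can be sketched with a simplified, Lesk-style overlap heuristic: choose the sense whose definition shares the most words with the surrounding context. The sense glosses, the stopword list, and the function names below are hand-written assumptions for illustration, not entries from a real lexicon.

```python
# Hand-written sense glosses for "bank" (assumptions for this sketch).
SENSES = {
    "financial institution": "an institution that accepts deposits and lends money",
    "river side": "the sloping land beside a river or stream of water",
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "on", "at", "in", "he", "she"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, senses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    scores = {sense: len(context_words & content_words(gloss))
              for sense, gloss in senses.items()}
    return max(scores, key=scores.get), scores

print(simplified_lesk("he sat on the bank of the river and watched the water", SENSES))
# ('river side', {'financial institution': 0, 'river side': 2})
print(simplified_lesk("she went to the bank to deposit money", SENSES))
# ('financial institution', {'financial institution': 1, 'river side': 0})
```

The heuristic is brittle: "deposit" and "deposits" do not match because no stemming or lemmatization is applied, which is one reason context-sensitive models tend to perform better.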

<br>

#### Challenges

- **High Complexity**: Resolving ambiguity often requires understanding the full context in which communication occurs, which can be highly complex.
- **Computational Difficulty**: Ambiguity resolution is computationally expensive and can be challenging for algorithms to handle accurately, especially in real-time applications.

<br>

#### Example in NLP

Handling ambiguity in NLP typically involves a combination of linguistic data and statistical models. For instance, modern NLP systems like BERT (Bidirectional Encoder Representations from Transformers) use context heavily to determine word meanings, which helps in disambiguating sentences more effectively.

### All-in-One Code

```python
%%capture
# Libraries
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer, LancasterStemmer

# Download the tagger, tokenizer, and WordNet data used below
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
```python
def preprocess(sentence):
    # Tokenization
    tokens = nltk.word_tokenize(sentence)
    print('Tokenization: ', tokens)

    # Lemmatization (default POS is noun)
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    print('Lemmatization: ', lemmas)

    # Stemming
    porter = PorterStemmer()
    lancaster = LancasterStemmer()

    porter_stems = [porter.stem(token) for token in tokens]
    lancaster_stems = [lancaster.stem(token) for token in tokens]

    print("Stemming (Porter Stems): ", porter_stems)
    print("Stemming (Lancaster Stems): ", lancaster_stems)

    # Chunking (works on top of PoS tags)
    tagged_tokens = pos_tag(tokens)

    # Chunk pattern: a noun phrase (NP) is an optional determiner (DT),
    # followed by any number of adjectives (JJ), and ends with a noun (NN).
    pattern = 'NP: {<DT>?<JJ>*<NN>}'

    # Create a chunk parser and apply it
    cp = RegexpParser(pattern)
    cs = cp.parse(tagged_tokens)
    print("Chunking: ", cs)
```

```python
preprocess("The quick brown fox jumps over the lazy dog")
```

```plaintext
Tokenization:  ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Lemmatization:  ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']
Stemming (Porter Stems):  ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog']
Stemming (Lancaster Stems):  ['the', 'quick', 'brown', 'fox', 'jump', 'ov', 'the', 'lazy', 'dog']
Chunking:  (S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))
```


```python
preprocess("We ran the running marathon together.")
```

```plaintext
Tokenization:  ['We', 'ran', 'the', 'running', 'marathon', 'together', '.']
Lemmatization:  ['We', 'ran', 'the', 'running', 'marathon', 'together', '.']
Stemming (Porter Stems):  ['we', 'ran', 'the', 'run', 'marathon', 'togeth', '.']
Stemming (Lancaster Stems):  ['we', 'ran', 'the', 'run', 'marathon', 'togeth', '.']
Chunking:  (S
  We/PRP
  ran/VBD
  the/DT
  running/VBG
  (NP marathon/NN)
  together/RB
  ./.)
```
