<a href="https://colab.research.google.com/github/HamdanXI/nlp_adventure/blob/main/exam/prepare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Qualifying Exam Preparation

## Traditional NLP

### Stemming

#### What is it?
Stemming is the process of reducing words to their stem, base, or root form. The idea is to remove affixes (in practice mostly suffixes) from words in order to obtain a form that is often not a complete word by itself but is representative of related words. For instance, a stemmer reduces "running" and "runs" to "run". Irregular forms such as "ran" are usually left unchanged, because no suffix rule applies to them.

<br>

#### Why is it used?
1. Simplification: It simplifies textual data, reducing the complexity of subsequent NLP tasks.
2. Speed: Stemming is generally faster than more complex methods like lemmatization because it uses simple heuristics.
3. Efficiency: It increases the efficiency of information retrieval by linking words with the same roots.

<br>

#### How does it work?
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This process is fairly crude; the stems may not be actual words. Two widely used stemmers are:
* Porter Stemmer: One of the most widely used stemmers and comparatively gentle. It's known for its simplicity and speed.
* Lancaster Stemmer: A more aggressive stemmer than Porter. It often produces shorter stems and therefore more over-stemming errors, which matters when accuracy is critical.
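
The suffix-cutting idea can be sketched in a few lines of plain Python. This is a toy illustration, not the Porter or Lancaster algorithm: the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions made for the example.

```python
# Toy suffix-stripping stemmer; the suffix list and length guard are illustrative assumptions.
SUFFIXES = ("ing", "ly", "es", "ed", "s")  # checked in this order; first match wins


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["running", "eats", "flying", "quickly", "ran"]:
    print(w, "->", toy_stem(w))
```

The sketch turns "running" into "runn", which shows why the result of stemming is not always a real word. Real stemmers add further rules, for example for doubled consonants, to get from "runn" to "run".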

<br>

#### Challenges with Stemming:
* Over-stemming: Occurs when two words that do not share a root are reduced to the same stem. This loses distinctions that are needed to tell the original words apart.
* Under-stemming: Occurs when two words that should be stemmed to the same root are not. This can lead to inconsistent results in search queries and information retrieval.
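
One way to spot both problems is to group a word list by stem and inspect the groups. The sketch below assumes a deliberately crude, hypothetical stemmer (`crude_stem`) so that it runs without any library. Any stemmer function could be passed in its place.

```python
from collections import defaultdict


def crude_stem(word: str) -> str:
    """Hypothetical stemmer: strip one of a few suffixes, longest first."""
    for suffix in ("ity", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_classes(words, stem=crude_stem):
    """Group words by the stem they are reduced to."""
    classes = defaultdict(list)
    for word in words:
        classes[stem(word)].append(word)
    return dict(classes)


classes = stem_classes(["universe", "university", "mouse", "mice"])
for key, members in classes.items():
    print(key, members)
```

Here "universe" and "university" fall into the same group (over-stemming), while "mouse" and "mice" end up in different groups (under-stemming).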

<br>

#### Example in Python:

In [6]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["running", "eats", "flying", "quickly"]
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

print("Porter Stems:", porter_stems)
print("Lancaster Stems:", lancaster_stems)

Porter Stems: ['run', 'eat', 'fli', 'quickli']
Lancaster Stems: ['run', 'eat', 'fly', 'quick']


### Lemmatization

#### What is it?
Lemmatization is the process of reducing a word to its base or root form, known as the lemma. Unlike stemming, which crudely chops off word endings to achieve the root form, lemmatization considers the morphological analysis of the word, ensuring that the reduced form is a valid word according to the language's vocabulary. This process involves understanding the context and part of speech of a word in a sentence, as well as its standard form according to a language's rules.

<br>

#### Why is it important?
1. Reduces Complexity: It decreases the complexity of text data by reducing variations of the same word to a single form, which improves the performance of various NLP tasks.
2. Improves Accuracy: Since lemmatization keeps the semantic meaning of the word intact, it's more accurate than stemming for tasks that need a higher level of understanding, such as semantic analysis.
3. Facilitates Better Text Analysis: By converting words to their base forms, it becomes easier to perform tasks like textual comparison and pattern recognition.

<br>

#### How Does Lemmatization Work?
Lemmatization works by using vocabulary and morphological analysis of words. The goal is to remove only inflectional endings and return the base or dictionary form of a word, which is known as the lemma. For instance, the lemma of "was" is "be," and the lemma of "mice" is "mouse."
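
The combination of a vocabulary and morphological rules can be sketched as follows. The lexicon, rules, and the `toy_lemmatize` function are hypothetical and far smaller than what a real lemmatizer such as WordNet's uses. The point is only to show the lookup order: irregular forms first, then suffix rules whose result must be a known word.

```python
# Tiny hand-written lexicon for illustration only; a real lemmatizer relies on a full dictionary.
EXCEPTIONS = {
    ("was", "v"): "be",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
}
# Inflectional endings per part of speech: (suffix, replacement).
RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
}
VOCABULARY = {"be", "run", "mouse", "jump", "dog", "city"}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma of word for the given part of speech ('n' or 'v')."""
    word = word.lower()
    # 1. Irregular forms are looked up directly.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # 2. Otherwise try each suffix rule and accept only candidates that are known words.
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    # 3. Nothing applied: return the word unchanged.
    return word


print(toy_lemmatize("was", pos="v"), toy_lemmatize("mice"), toy_lemmatize("cities"))
```

Because the part of speech selects both the exception table and the rule set, the same surface form can yield different lemmas: in this sketch "was" becomes "be" only when it is looked up as a verb.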

<br>

#### Challenges in Lemmatization
* Language Dependency: Lemmatization rules can be complex and highly language-dependent. For example, in English, verbs and nouns are lemmatized differently.
* Resource Intensive: It requires more computational resources than stemming, as it needs a complete dictionary of lemmas and morphological analysis.

<br>

#### Example in Python:

In [7]:
import nltk
nltk.download('wordnet')  # Download the necessary lexicons
nltk.download('omw-1.4')  # Download the additional WordNet data

from nltk.stem import WordNetLemmatizer

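lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "run", "easily", "fairer", "was", "be", "mice"]
# lemmatize() treats every word as a noun unless a part of speech is passed via `pos`,
# which is why the verbs and the adjective below are not reduced ('was' even becomes 'wa').
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)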

['running', 'ran', 'run', 'easily', 'fairer', 'wa', 'be', 'mouse']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Tokenization

#### What is it?
Tokenization is a fundamental step in natural language processing (NLP) where text is divided into smaller units called tokens. These tokens can be words, numbers, or punctuation marks. The process helps in preparing text for deeper processing like parsing, part of speech tagging, and sentiment analysis.

<br>

#### Why is it important?
1. Simplification: It simplifies text analysis by breaking down large pieces of text into manageable units.
2. Standardization: Tokens become the standard input for most NLP tasks.
3. Efficiency: It increases the efficiency of other NLP processes, as they can operate on simplified and standardized data.

<br>

#### Types of Tokenization
1. Word Tokenization: Splits text into words. It's the most common form, useful for tasks like frequency analysis and word-level processing.
2. Sentence Tokenization: Breaks text into sentences. This is useful for tasks that require understanding the context or meaning of sentences, like summarization.
3. Subword Tokenization: Divides words into smaller meaningful units (subwords or morphemes). This is particularly useful in language modeling and machine translation to handle rare words or morphologically rich languages.
4. Character Tokenization: Divides the text into individual characters. This is useful for character-level language models, which never encounter out-of-vocabulary tokens.
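
The four granularities can be compared side by side with the standard library alone. This is a rough sketch: the regular expressions are simplistic, and the subword vocabulary is a hand-picked assumption, whereas real subword tokenizers (BPE, WordPiece) learn their vocabulary from data.

```python
import re

text = "Tokenizers split text. Subwords help with unseen words!"

# Word tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokens: naive split at whitespace that follows ., ! or ?.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Character tokens: one token per character.
chars = list("unseen")


def subword_tokenize(word, vocab):
    """Greedy longest-match-first split; vocab is a hypothetical subword list."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = {"un", "seen", "token", "izer", "s"}
print(words)
print(sentences)
print(chars)
print(subword_tokenize("unseen", vocab), subword_tokenize("tokenizers", vocab))
```

With this vocabulary, "unseen" is split into "un" and "seen" even though the whole word is not in the vocabulary, which is how subword tokenization handles rare words.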

<br>

#### Challenges
* Complexity in Different Languages: Languages with no clear word boundaries (like Chinese or Japanese) require more sophisticated techniques beyond whitespace-based tokenization.
* Handling Special Cases: Punctuation, contractions (like "don't"), and special characters can complicate straightforward splits.
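
The punctuation and contraction problem is easy to see by comparing a plain whitespace split with a slightly smarter pattern. The regular expression below is an illustrative assumption, not the rule set used by any particular library.

```python
import re

text = "I don't know, do you?"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive = text.split()

# Illustrative pattern: a word with an optional internal apostrophe part, or one punctuation mark.
pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
tokens = pattern.findall(text)

print(naive)
print(tokens)
```

Keeping "don't" as one token is only one possible convention. NLTK's `word_tokenize` follows the Penn Treebank style and splits it into "do" and "n't" instead.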

<br>

#### Example in Python:

In [8]:
import nltk
nltk.download('punkt')  # Tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead

text = "Hello, how are you doing today?"
tokens = nltk.word_tokenize(text)
print(tokens)

['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Chunking

#### What is it?
Chunking, also known as shallow parsing, is the process of grouping consecutive words of unstructured text into larger units, commonly known as "chunks." Instead of just identifying parts of speech, chunking groups words into meaningful sequences like noun phrases or verb phrases that provide more structure than individual words for processing.

<br>

#### Why is it important?
1. Structure Extraction: Chunking helps extract more structure from text than individual tokenization or POS tagging by identifying the constituents of sentences.
2. Information Extraction: It aids in extracting entities (like names or places) and the relations between them, which is crucial for tasks like named entity recognition.
3. Improves Understanding: By identifying phrases, chunking helps in understanding the context and the syntactic meaning of the text better.

<br>

#### How Does Chunking Work?
Chunking usually works on top of part-of-speech (POS) tagging and uses rules or machine learning models to identify the different chunks. For example, a simple rule might be to group any combination of an optional determiner followed by any number of adjectives and then a noun into a noun phrase (NP).
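
The rule described above can be written out by hand to make the matching order explicit. This is a minimal sketch that assumes the input is already POS-tagged with Penn Treebank tags; the tags in the example are written by hand, and `chunk_noun_phrases` is a hypothetical helper rather than part of any library.

```python
def chunk_noun_phrases(tagged):
    """Group DT? JJ* NN sequences into ("NP", [...]) chunks; other tokens pass through."""
    result, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                         # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":     # the noun that closes the phrase
            result.append(("NP", tagged[i : j + 1]))
            i = j + 1
        else:                                            # no noun phrase starts at position i
            result.append(tagged[i])
            i += 1
    return result


# Hand-tagged example sentence (the tags are assumptions, not tagger output).
tagged = [("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
          ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
for item in chunk_noun_phrases(tagged):
    print(item)
```

Rule-based chunkers such as NLTK's `RegexpParser` express the same idea declaratively, as a regular expression over tags, instead of as an explicit loop.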

<br>

#### Challenges in Chunking
* Complex Patterns: Designing rules that accurately capture the intended chunks without being too general or too specific can be challenging.
* Language Dependence: Chunking rules can be highly dependent on the language and may not transfer well from one language to another without adjustments.

<br>

#### Example in Python:

In [15]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)

# Define the chunk pattern. A common way to perform chunking is with regular-expression rules over POS tags.
# This rule says a noun phrase (NP) is an optional determiner (DT), followed by any number of
# adjectives (JJ), and ends with a singular common noun (NN). Plural (NNS) and proper (NNP) nouns
# would need a broader tag pattern such as <NN.*>.
pattern = 'NP: {<DT>?<JJ>*<NN>}'

# Create a chunk parser
cp = RegexpParser(pattern)
cs = cp.parse(tagged_tokens)

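# Display the chunked sentence. In the output below the tagger labels 'brown' as a noun (NN),
# so the first chunk ends there and 'fox' lands in a chunk of its own.
print(cs)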

(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### All-in-One Code

In [41]:
%%capture
# Libraries
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer, LancasterStemmer

# Resources needed by the tagger, tokenizer, and lemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [43]:
def preprocess(sentence):
  # Tokenization
  tokens = nltk.word_tokenize(sentence)
  print('Tokenization: ', tokens)

  # Lemmatization (no part of speech is passed, so every token is treated as a noun)
  lemmatizer = WordNetLemmatizer()
  lemmas = [lemmatizer.lemmatize(token) for token in tokens]
  print('Lemmatization: ', lemmas)

  # Stemming
  porter = PorterStemmer()
  lancaster = LancasterStemmer()

  porter_stems = [porter.stem(token) for token in tokens]
  lancaster_stems = [lancaster.stem(token) for token in tokens]

  print("Stemming (Porter Stems): ", porter_stems)
  print("Stemming (Lancaster Stems): ", lancaster_stems)


  # Chunking
  tagged_tokens = pos_tag(tokens)

  # Define the chunk pattern with a regular-expression rule over POS tags:
  # an NP is an optional determiner (DT), any number of adjectives (JJ), then a noun (NN).
  pattern = 'NP: {<DT>?<JJ>*<NN>}'

  # Create a chunk parser
  cp = RegexpParser(pattern)
  cs = cp.parse(tagged_tokens)
  print("Chunking: ", cs)

In [44]:
preprocess("The quick brown fox jumps over the lazy dog")

Tokenization:  ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Lemmatization:  ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']
Stemming (Porter Stems):  ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog']
Stemming (Lancaster Stems):  ['the', 'quick', 'brown', 'fox', 'jump', 'ov', 'the', 'lazy', 'dog']
Chunking:  (S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))


In [45]:
preprocess("We ran the running marathon together.")

Tokenization:  ['We', 'ran', 'the', 'running', 'marathon', 'together', '.']
Lemmatization:  ['We', 'ran', 'the', 'running', 'marathon', 'together', '.']
Stemming (Porter Stems):  ['we', 'ran', 'the', 'run', 'marathon', 'togeth', '.']
Stemming (Lancaster Stems):  ['we', 'ran', 'the', 'run', 'marathon', 'togeth', '.']
Chunking:  (S
  We/PRP
  ran/VBD
  the/DT
  running/VBG
  (NP marathon/NN)
  together/RB
  ./.)
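
In the run above, "ran" and "running" come back unchanged from the lemmatizer because `preprocess` never tells it that they are verbs. One common remedy is to reuse the POS tags that the chunking step already computes. The helper below is a hypothetical sketch (`penn_to_wordnet` is not an NLTK function) that maps Penn Treebank tags to the single-letter codes WordNet uses for parts of speech. It is a simplification: every tag that is not an adjective, verb, or adverb falls back to noun.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter; unknown tags default to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Tags copied from the chunking output above.
tagged = [("We", "PRP"), ("ran", "VBD"), ("the", "DT"),
          ("running", "VBG"), ("marathon", "NN")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each letter can then be passed as the `pos` argument, as in `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, which should let WordNet reduce verb forms such as "ran" to "run".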
