## **Flow of the notebook:**
In natural language processing, text preprocessing involves four major steps, listed below:

1. Tokenization
2. Stop Word Removal
3. Stemming
4. Lemmatization

Now let's start with the first step.

### **Step 01: Tokenization**

**Understanding Tokenization**

Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, sentences, or even subwords. Tokenization is the first step in text preprocessing and is crucial for preparing text data for further processing by NLP algorithms.

**Types of Tokenization**

**1. Word Tokenization**

This involves splitting a text into individual words.
For example, the sentence "Hello there! How are you doing today?" would be tokenized into ["Hello", "there", "!", "How", "are", "you", "doing", "today", "?"].

**2. Sentence Tokenization**

This involves splitting a text into sentences.
For example, the paragraph "Hello there! How are you doing today? I hope you're having a great day." would be tokenized into ["Hello there!", "How are you doing today?", "I hope you're having a great day."].
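
To make the two kinds of splitting concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the regular expressions are a rough approximation, not how NLTK's tokenizers work internally.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(simple_word_tokenize("Hello there! How are you doing today?"))
print(simple_sent_tokenize(
    "Hello there! How are you doing today? I hope you're having a great day."
))
```

A sketch like this breaks on abbreviations such as "Dr." and on contractions such as "you're", which is one reason to prefer a library tokenizer in practice.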

**Importance of Tokenization**

**1. Data Preparation:** It converts text into a format that can be used for further processing and analysis.

**2. Feature Extraction:** Tokens are used as features in various NLP tasks like text classification, sentiment analysis, etc.

**3. Text Normalization:** Helps in cleaning and standardizing the text data.
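
As a small illustration of the points above, the sketch below turns a list of tokens into word counts (a bag of words), one of the simplest feature representations. The token list is hand-written for this example and only the standard library is used.

```python
from collections import Counter

# Tokens as they might come out of a word tokenizer (hand-written for this example)
tokens = ["NLP", "is", "fun", "and", "NLP", "is", "useful", "."]

# Text normalization: lowercase everything and drop non-alphabetic tokens
normalized = [token.lower() for token in tokens if token.isalpha()]

# Feature extraction: count how often each token occurs
bag_of_words = Counter(normalized)
print(bag_of_words)
```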

**Techniques for Tokenization**

Different techniques and tools can be used for tokenization. We'll focus on a popular NLP library called NLTK (Natural Language Toolkit) in Python.

In [3]:
# word tokenization

import nltk
from nltk.tokenize import word_tokenize

text = input("Enter a sentence: ")
result  = word_tokenize(text)

print("Entered sentence is: ", text)
print("The words in the sentence are: ", result)

Entered sentence is:  Hello Everyone. Welcome to the 50 days of NLP challenge.
The words in the sentence are:  ['Hello', 'Everyone', '.', 'Welcome', 'to', 'the', '50', 'days', 'of', 'NLP', 'challenge', '.']


In [4]:
# sentence tokenisation

from nltk.tokenize import sent_tokenize
text = input("Enter a text: ")
result = sent_tokenize(text)

print("The sentence you have entered is: ",text)
print("The sentence tokenisation is: ",result)

The sentence you have entered is:  Hello everyone. My name is Hitarth Mahadevia. Welcome to 50 days of NLP challenge. Hope you have a wonderful experience!
The sentence tokenisation is:  ['Hello everyone.', 'My name is Hitarth Mahadevia.', 'Welcome to 50 days of NLP challenge.', 'Hope you have a wonderful experience!']


In [5]:
# combining both the techniques

paragraph = input("Enter a paragraph: ")

sentences = sent_tokenize(paragraph)

for sentence in sentences:
    words = word_tokenize(sentence)
    print(words)


['Machine', 'learning', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', '.']
['It', 'focuses', 'on', 'building', 'systems', 'that', 'learn', 'from', 'data', '.']
['NLP', 'is', 'a', 'key', 'application', 'of', 'machine', 'learning', '.']


### **Step 02: Stop Word Removal**

**Understanding Stop Words**

Stop words are commonly used words in a language that are usually ignored during text processing because they don't carry significant meaning. These words include articles, prepositions, conjunctions, and some pronouns. 

Examples of stop words in English are "is", "the", "in", "and", "to", "of", etc.

**Why Remove Stop Words?**

1. Reduce Noise: Stop words add noise to the data, which can negatively impact the performance of NLP algorithms.

2. Dimensionality Reduction: Removing stop words helps in reducing the dimensionality of the text data, making it easier to process.

3. Focus on Important Words: By removing stop words, we can focus on the more meaningful words in the text.
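
The dimensionality point can be checked with a quick count. The sketch below compares the vocabulary size before and after filtering, using the example text that appears elsewhere in this notebook. The stop word set `small_stop_words` is a tiny hand-picked set for illustration only; NLTK's English list is much longer.

```python
# Tokens of the example text used in this notebook
tokens = ['Machine', 'learning', 'is', 'a', 'branch', 'of', 'artificial',
          'intelligence', '.', 'It', 'focuses', 'on', 'building', 'systems',
          'that', 'learn', 'from', 'data', '.', 'NLP', 'is', 'a', 'key',
          'application', 'of', 'machine', 'learning', '.']

# Hand-picked stop words, an assumption made for this sketch
small_stop_words = {"is", "a", "of", "it", "on", "that", "from"}

# The vocabulary is the set of distinct lowercase tokens
vocab_before = {token.lower() for token in tokens}
vocab_after = {token for token in vocab_before if token not in small_stop_words}

print("Vocabulary size before:", len(vocab_before))  # 21
print("Vocabulary size after: ", len(vocab_after))   # 14
```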

In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hitar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [8]:
# removing stop words using nltk

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = input("Enter a text: ")
result = word_tokenize(text)
stop_words = set(stopwords.words('english'))

filtered_result = [word for word in result if word.lower() not in stop_words]

print("Tokens without removing stop words:\n")
print(result)
print("\nTokens after removing stop words:\n")
print(filtered_result)

Tokens without removing stop words:

['Machine', 'learning', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', '.', 'It', 'focuses', 'on', 'building', 'systems', 'that', 'learn', 'from', 'data', '.', 'NLP', 'is', 'a', 'key', 'application', 'of', 'machine', 'learning', '.']

Tokens after removing stop words:

['Machine', 'learning', 'branch', 'artificial', 'intelligence', '.', 'focuses', 'building', 'systems', 'learn', 'data', '.', 'NLP', 'key', 'application', 'machine', 'learning', '.']


#### **Custom Stop word adding and removing**

In [13]:
# adding stop words
custom_stop_words = set(stop_words)
custom_stop_words.update(["example", "additional", "words"])


text = "This is an example showing additional stop words."
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in custom_stop_words]
print("After removing many custom stop words: ",filtered_tokens)

#adding a single stop_word
custom_stop_words = set(stop_words)
custom_stop_words.add("example")
filtered_tokens = [word for word in tokens if word.lower() not in custom_stop_words]
print("After removing only one custom stop word: ",filtered_tokens)

After removing many custom stop words:  ['showing', 'stop', '.']
After removing only one custom stop word:  ['showing', 'additional', 'stop', 'words', '.']


In [12]:
# removing a word from the stop word list (here "not", so that negation is kept)
custom_list = set(stopwords.words('english'))
custom_list.discard('not')

original_list = set(stopwords.words('english'))

text = "Kanchan Kumbdi is not a good witch!"
tokens = word_tokenize(text)
# compare in lowercase, as in the earlier cells, so capitalized stop words are caught too
filtered_tokens = [token for token in tokens if token.lower() not in custom_list]
print("Stop word removal using custom list: ",filtered_tokens)

original_tokens = [token for token in tokens if token.lower() not in original_list]
print("Stop word removal using original list: ",original_tokens)

Stop word removal using custom list:  ['Kanchan', 'Kumbdi', 'not', 'good', 'witch', '!']
Stop word removal using original list:  ['Kanchan', 'Kumbdi', 'good', 'witch', '!']


### **Step 03: Stemming**

**Understanding Stemming**

Stemming is the process of reducing words to their root or base form. The root form of the word is called the "stem." Stemming algorithms typically remove suffixes from words to convert them into their base form, which may or may not be a real word.
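
To show what "removing suffixes" means mechanically, here is a deliberately naive sketch. The function `naive_stem` and its suffix list are invented for this illustration; it is not the Porter algorithm, which applies a much larger set of ordered rules and conditions.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["learning", "focuses", "systems", "is", "Machine"]:
    print(w, "->", naive_stem(w))
```

Even this toy version maps "learning" to "learn" and "systems" to "system", but it leaves "machine" untouched, whereas the Porter stemmer used later in this notebook produces "machin".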

**Why Use Stemming?**

1. Dimensionality Reduction: Stemming helps in reducing the number of unique words in the text, which can simplify the data and reduce dimensionality.

2. Improving Generalization: By converting different forms of a word into a single form, stemming helps algorithms generalize better, especially in tasks like text classification and sentiment analysis.


**Common Stemming Algorithms**

1. Porter Stemmer: One of the most widely used stemming algorithms.

2. Lancaster Stemmer: A more aggressive stemming algorithm compared to the Porter Stemmer.

3. Snowball Stemmer: An improved version of the Porter Stemmer that supports multiple languages.

In [15]:
# Step 01: Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Step 02: Input a text
text = input("Enter a text: ")

# Step 03: Tokenize the text
tokens = word_tokenize(text)

# Step 04: Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Step 05: Apply stemmer to each token
stemmed_tokens = [stemmer.stem(word) for word in tokens]

# Step 06: Display the tokens before and after stemming
print("Entered Text: \n", text)
print("\nOriginal Tokens:", tokens)
print("\nStemmed Tokens:", stemmed_tokens)

Entered Text: 
 Machine learning is a branch of artificial intelligence. It focuses on building systems that learn from data. NLP is a key application of machine learning.

Original Tokens: ['Machine', 'learning', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', '.', 'It', 'focuses', 'on', 'building', 'systems', 'that', 'learn', 'from', 'data', '.', 'NLP', 'is', 'a', 'key', 'application', 'of', 'machine', 'learning', '.']

Stemmed Tokens: ['machin', 'learn', 'is', 'a', 'branch', 'of', 'artifici', 'intellig', '.', 'it', 'focus', 'on', 'build', 'system', 'that', 'learn', 'from', 'data', '.', 'nlp', 'is', 'a', 'key', 'applic', 'of', 'machin', 'learn', '.']


#### **Comparing different Stemmers**

In [16]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Initialize the stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# Define a text to process
text = "Machine learning is a branch of artificial intelligence. It focuses on building systems that learn from data. NLP is a key application of machine learning."

# Tokenize the text
tokens = word_tokenize(text)

# Apply each stemmer
porter_stems = [porter.stem(word) for word in tokens]
lancaster_stems = [lancaster.stem(word) for word in tokens]
snowball_stems = [snowball.stem(word) for word in tokens]

# Display the results
print("Original Tokens:", tokens)
print("Porter Stemmed Tokens:", porter_stems)
print("Lancaster Stemmed Tokens:", lancaster_stems)
print("Snowball Stemmed Tokens:", snowball_stems)


Original Tokens: ['Machine', 'learning', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', '.', 'It', 'focuses', 'on', 'building', 'systems', 'that', 'learn', 'from', 'data', '.', 'NLP', 'is', 'a', 'key', 'application', 'of', 'machine', 'learning', '.']
Porter Stemmed Tokens: ['machin', 'learn', 'is', 'a', 'branch', 'of', 'artifici', 'intellig', '.', 'it', 'focus', 'on', 'build', 'system', 'that', 'learn', 'from', 'data', '.', 'nlp', 'is', 'a', 'key', 'applic', 'of', 'machin', 'learn', '.']
Lancaster Stemmed Tokens: ['machin', 'learn', 'is', 'a', 'branch', 'of', 'art', 'intellig', '.', 'it', 'focus', 'on', 'build', 'system', 'that', 'learn', 'from', 'dat', '.', 'nlp', 'is', 'a', 'key', 'apply', 'of', 'machin', 'learn', '.']
Snowball Stemmed Tokens: ['machin', 'learn', 'is', 'a', 'branch', 'of', 'artifici', 'intellig', '.', 'it', 'focus', 'on', 'build', 'system', 'that', 'learn', 'from', 'data', '.', 'nlp', 'is', 'a', 'key', 'applic', 'of', 'machin', 'learn', '.']


### **Step 04: Lemmatization**

#### **Understanding Lemmatization**

Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization considers the context and the part of speech of the word, leading to more accurate reductions that are valid words.

#### **Why Use Lemmatization?**

1. Output Validity: Ensures that the base form of the word is a valid word.

2. Context-Awareness: Considers the context and part of speech, leading to more accurate results.

3. Improving Text Analysis: Useful for tasks that require understanding the meaning and context of the text.

#### **Key Differences Between Stemming and Lemmatization**
| Aspect               | Stemming                                                                                     | Lemmatization                                                                                     |
|----------------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| Output Validity      | Produces base forms that may not be actual words (e.g., "machine" -> "machin", "artificial" -> "artifici"). | Produces base forms that are valid words (e.g., "running" -> "run", "better" -> "good", given the right part of speech). |
| Context-Awareness    | Does not consider the context or part of speech of the word.                                  | Considers the context and part of speech, leading to more accurate results.                       |
| Accuracy             | Can be too aggressive, leading to over-stemming (e.g., "universities" -> "univers").           | More precise, reducing words to their true base form.                                              |
| Usage in NLP         | Useful for tasks where speed and simplicity are more important than accuracy (e.g., search engines). | Preferred for tasks requiring a deeper understanding of the text and more accurate results (e.g., sentiment analysis, text summarization). |


In [19]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hitar\AppData\Roaming\nltk_data...


True

In [22]:
# Step 01: Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Step 02: Define a text to process
text = "The children are playing with the toys. He saw the cats running around."

# Step 03: Tokenize the text
tokens = word_tokenize(text)

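# Step 04: Remove stopwords
# Note: this comparison is case-sensitive, so capitalized stop words such as 'The' and 'He' are kept
stop_words = set(stopwords.words('english'))
tokens_for_lemmatization = [w for w in tokens if w not in stop_words]

# Step 05: Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 06: Apply lemmatization to each token
# (without a POS argument every token is treated as a noun, so 'playing' and 'running' stay unchanged)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_for_lemmatization]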

# Step 07: Display the tokens before and after lemmatization
print("Original Tokens:", tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['The', 'children', 'are', 'playing', 'with', 'the', 'toys', '.', 'He', 'saw', 'the', 'cats', 'running', 'around', '.']
Lemmatized Tokens: ['The', 'child', 'playing', 'toy', '.', 'He', 'saw', 'cat', 'running', 'around', '.']


#### **Here are some common types of algorithms and approaches used for lemmatization:**

1. Dictionary-Based Lemmatization

Dictionary-based lemmatization relies on predefined dictionaries or lexicons that map words to their base forms. These dictionaries contain information about the lemma of each word along with its part of speech (POS). Algorithms using this approach typically involve looking up each word in the dictionary and returning its lemma based on the specified POS.

WordNet: A widely used lexical database for English. It provides a set of hierarchical synsets (groups of synonymous words) with short definitions and usage examples. Lemmatization with WordNet involves mapping words to their corresponding synsets and selecting the lemma based on context.
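
The lookup idea can be sketched with a plain Python dictionary keyed by word and part of speech. The table `LEMMA_DICT` and the function `dictionary_lemmatize` are hypothetical and contain only a handful of entries; a real lexicon such as WordNet is far larger.

```python
# Tiny hand-written lexicon: (word, part of speech) -> lemma
LEMMA_DICT = {
    ("children", "n"): "child",
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def dictionary_lemmatize(word, pos="n"):
    # Return the stored lemma, or the word itself when there is no entry
    return LEMMA_DICT.get((word.lower(), pos), word)

print(dictionary_lemmatize("children"))      # child
print(dictionary_lemmatize("saw", pos="v"))  # see
print(dictionary_lemmatize("saw", pos="n"))  # saw
```

The two results for "saw" show why the part of speech matters: the same surface form maps to different lemmas.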

2. Rule-Based Lemmatization

Rule-based lemmatization utilizes linguistic rules and patterns to derive the lemma of a word. These rules are often language-specific and take into account morphological variations of words based on their part of speech and syntactic context.

Morphological Rules: Rules are crafted based on the morphology of the language. For example, rules might specify how to handle plural forms, verb conjugations, and adjectival inflections.
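
A rule set of this kind might look like the sketch below. The rules in `RULES` and the function `rule_lemmatize` are made up for illustration and cover only a few English patterns.

```python
import re

# Ordered (pattern, replacement) rules per part of speech; the first match wins
RULES = {
    "n": [(r"ies$", "y"),      # cities -> city
          (r"(?<!s)s$", "")],  # toys -> toy, but leave "glass" alone
    "v": [(r"ing$", ""),       # playing -> play
          (r"ed$", "")],       # jumped -> jump
}

def rule_lemmatize(word, pos):
    for pattern, replacement in RULES.get(pos, []):
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(rule_lemmatize("cities", "n"))   # city
print(rule_lemmatize("playing", "v"))  # play
print(rule_lemmatize("running", "v"))  # runn
```

The last result, "runn", shows the limit of simple rules: cases such as doubled consonants need extra rules or an exception list.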

3. Stochastic Lemmatization

Stochastic lemmatization employs statistical models and machine learning techniques to predict the lemma of a word based on large corpora of text. These models learn patterns and relationships between words and their lemmas from data, improving accuracy through statistical inference.

Machine Learning Models: Techniques such as sequence models (e.g., Hidden Markov Models, Conditional Random Fields) or neural networks can be used to predict lemmas based on contextual features and linguistic patterns observed in training data.

4. Hybrid Approaches

Hybrid approaches combine multiple strategies, such as integrating dictionary-based methods with rule-based or statistical techniques. These approaches aim to leverage the strengths of each method to enhance accuracy and coverage in different linguistic contexts.
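
One simple way to combine the two ideas is to consult an exception dictionary first and fall back to suffix rules only when the word is not listed. The sketch below does this for nouns only; `EXCEPTIONS` and `hybrid_noun_lemmatize` are hypothetical names with a few illustrative entries.

```python
# Irregular plurals that suffix rules cannot handle
EXCEPTIONS = {"children": "child", "mice": "mouse", "feet": "foot"}

def hybrid_noun_lemmatize(word):
    word = word.lower()
    # 1) Dictionary lookup for irregular forms
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2) Fallback rules for regular plurals
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for w in ["children", "cities", "toys", "glass"]:
    print(w, "->", hybrid_noun_lemmatize(w))
```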

#### **Example Lemmatization Libraries and Tools**

NLTK (Natural Language Toolkit): Provides a dictionary-based lemmatizer for English built on WordNet (`WordNetLemmatizer`), which can be combined with NLTK's POS tagger as shown in the cells that follow.

SpaCy: A modern NLP library that supports lemmatization using statistical models and rule-based approaches for multiple languages.

Stanford CoreNLP: A suite of NLP tools that includes lemmatization capabilities based on linguistic rules and statistical models.

In [26]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # Download the POS tagger

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hitar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hitar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [27]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Function to perform lemmatization using WordNet with the default POS (noun) for every token
def lemmatize_with_wordnet(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_tokens

# Function to perform lemmatization using WordNet, passing each token's POS tag from nltk.pos_tag
def lemmatize_with_nltk_rules(text):
    lemmatizer = nltk.WordNetLemmatizer()
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    lemmatized_tokens = []
    for word, tag in pos_tags:
        wn_tag = get_wordnet_pos(tag)
        if wn_tag is None:
            lemmatized_tokens.append(word)  # fallback to original word if no suitable tag found
        else:
            lemmatized_tokens.append(lemmatizer.lemmatize(word, pos=wn_tag))
    return lemmatized_tokens

# Helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Sample text
sample_text = "The children are playing with the toys. He saw the cats running around."

# Perform lemmatization using WordNet
lemmatized_wordnet = lemmatize_with_wordnet(sample_text)
print("Lemmatized with WordNet:", lemmatized_wordnet)

# Perform lemmatization using NLTK's built-in rules and POS tagging
lemmatized_nltk_rules = lemmatize_with_nltk_rules(sample_text)
print("Lemmatized with NLTK Rules:", lemmatized_nltk_rules)


Lemmatized with WordNet: ['The', 'child', 'are', 'playing', 'with', 'the', 'toy', '.', 'He', 'saw', 'the', 'cat', 'running', 'around', '.']
Lemmatized with NLTK Rules: ['The', 'child', 'be', 'play', 'with', 'the', 'toy', '.', 'He', 'saw', 'the', 'cat', 'run', 'around', '.']


In [29]:
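# Lemmatization along with POS tagging.
# Note: this cell reuses `lemmatizer` and `word_tokenize` from the earlier cells.

from nltk.corpus import wordnet
from nltk import pos_tag

# Function to get POS tag in a format compatible with WordNetLemmatizer
# (each word is tagged on its own here, so the tagger does not see the sentence context)
def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()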
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)

# Apply lemmatization with POS tagging
lemmatized_tokens_pos = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

# Display the tokens before and after lemmatization
print("Original Tokens:", tokens)
print("Lemmatized Tokens with POS:", lemmatized_tokens_pos)


Original Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Lemmatized Tokens with POS: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']


### **Summing up all the things we have learnt:**

In [30]:
# importing necessary libraries

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Take text

text = input("Enter text: ")

# Tokenize text
tokens = word_tokenize(text)
print("Step 01: Tokenized text",tokens)

# Remove stopwords (compare in lowercase so capitalized stop words are caught too)
stop_words = set(stopwords.words('english'))
words = [w for w in tokens if w.lower() not in stop_words]
print("\nStep 02: Removed stopwords",words)

# Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]
print("\nStep 03: Stemmed words",words)

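# Lemmatization
# Note: stemming and lemmatization are usually alternatives. Applied to stems such as
# 'machin', the lemmatizer finds no dictionary entry and returns the token unchanged.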
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]
print("\nStep 04: Lemmatized words",words)


Step 01: Tokenized text ['Machine', 'learning', 'is', 'a', 'sub-branch', 'of', 'Artificial', 'Intelligence', '.', 'NLP', 'is', 'indid', 'a', 'sub-branch', 'of', 'Machine', 'Learning']

Step 02: Removed stopwords ['Machine', 'learning', 'sub-branch', 'Artificial', 'Intelligence', '.', 'NLP', 'indid', 'sub-branch', 'Machine', 'Learning']

Step 03: Stemmed words ['machin', 'learn', 'sub-branch', 'artifici', 'intellig', '.', 'nlp', 'indid', 'sub-branch', 'machin', 'learn']

Step 04: Lemmatized words ['machin', 'learn', 'sub-branch', 'artifici', 'intellig', '.', 'nlp', 'indid', 'sub-branch', 'machin', 'learn']
