<a href="https://colab.research.google.com/github/Shobana0608/Learnbay-Project/blob/main/Natural_Language_Processing_(NLP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing (NLP) Text Preprocessing Pipeline**

**Table of Contents**

*   Introduction
*   Environment Setup
*   Tokenization
*   Stop Word Removal
*   Part-of-Speech (POS) Tagging
*   Lemmatization and Stemming
*   Named Entity Recognition (NER)
*   Complete Pipeline Demonstration
*   Practical Applications
*   Best Practices and Summary












# **Introduction**

This guide walks through essential text preprocessing techniques in Natural Language Processing (NLP) in the order they are typically applied. These techniques underpin most NLP applications and prepare raw text for machine learning models.

**Prerequisites**

Basic understanding of Python programming

Familiarity with Jupyter notebooks

Introductory knowledge of Natural Language Processing (NLP)

Python 3.x installed and configured

pip (Python package manager) for installing required libraries

**Why Text Preprocessing Matters**

Text preprocessing is crucial because:

Standardization: Converts text to a consistent format

Noise Reduction: Removes irrelevant information

Feature Extraction: Identifies meaningful components

Model Performance: Improves accuracy of downstream tasks

Computational Efficiency: Reduces processing overhead

# **Environment Setup**

**Required Library Installation**

In [1]:
# Core NLP libraries
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install  textblob

# Data manipulation and visualization
!pip install pandas
!pip install matplotlib
!pip install seaborn

# Modern transformer-based tools
!pip install transformers
!pip install huggingface_hub

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 69.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Important:** Restart the kernel after installation to ensure all packages are properly loaded.
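
After restarting, it can help to confirm that every package is actually visible to the kernel. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cell above.
REQUIRED = ["nltk", "spacy", "textblob", "pandas", "matplotlib",
            "seaborn", "transformers", "huggingface_hub"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<18} {version or 'NOT INSTALLED'}")
```

Any package reported as not installed should be reinstalled before continuing.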

# **Tokenization**

**What is Tokenization?**

Tokenization is the process of breaking text into smaller units called tokens, typically words, subwords, or punctuation marks. It is the first step in most text processing pipelines, converting continuous text into discrete units that can be analyzed individually.

**Why Tokenization is Important:**

Text Segmentation: Breaks continuous text into manageable pieces

Language Understanding: Helps algorithms identify word boundaries

Feature Preparation: Creates input features for machine learning models

Standardization: Provides consistent text representation

**Method 1: Modern Tokenization with Transformers (BPE)**

In [1]:
from transformers import GPT2Tokenizer

# Load a pretrained GPT-2 tokenizer (uses Byte Pair Encoding)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Example sentence
sentence = "I live in Mysore, which has a growing population."
print(f"Original sentence: {sentence}")

# Tokenize into subwords
tokens = tokenizer.tokenize(sentence)
print(f"Subword Tokens: {tokens}")

# Convert to token IDs (model input format)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# Decode back to original text
decoded_sentence = tokenizer.decode(token_ids)
print(f"Decoded Sentence: {decoded_sentence}")

Original sentence: I live in Mysore, which has a growing population.
Subword Tokens: ['I', 'Ġlive', 'Ġin', 'ĠM', 'ys', 'ore', ',', 'Ġwhich', 'Ġhas', 'Ġa', 'Ġgrowing', 'Ġpopulation', '.']
Token IDs: [40, 2107, 287, 337, 893, 382, 11, 543, 468, 257, 3957, 3265, 13]
Decoded Sentence: I live in Mysore, which has a growing population.


**Key Concepts:**

BPE (Byte Pair Encoding): Subword tokenization method that handles out-of-vocabulary words

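Subword Tokens: Frequently occurring character sequences, which often line up with word stems and affixes; in the GPT-2 output above, the Ġ prefix marks a token that begins with a space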

Token IDs: Numerical representations that neural networks can process

Reversibility: Ability to reconstruct original text from tokens
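
To make the BPE idea concrete, here is a toy sketch of the training loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It works on characters and a made-up word-frequency table, whereas GPT-2's tokenizer works on bytes and ships with an already-learned merge table, so treat this as an illustration of the principle only. The names `most_frequent_pair` and `merge_pair` are hypothetical helpers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up toy corpus: word (as a tuple of characters) -> frequency.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print("Learned merges:", merges)
print("Segmented words:", list(corpus))
```

Each learned merge becomes a new vocabulary entry, which is why a rare word can still be represented as a sequence of smaller known pieces.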

**Method 2: Traditional Tokenization with NLTK**

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Define sample text
text = "The quick brown fox jumps over the lazy dog in the garden."
print(f"Original text: {text}")

# Tokenize the sentence
tokens = word_tokenize(text)
print(f"NLTK Tokens: {tokens}")

Original text: The quick brown fox jumps over the lazy dog in the garden.
NLTK Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'in', 'the', 'garden', '.']


**NLTK Tokenization Features:**

Word-level tokenization: Splits text into individual words

Punctuation handling: Separates punctuation marks

Language-specific rules: Handles contractions and special cases

Rule-based approach: Uses predefined patterns for segmentation
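
A rule-based tokenizer can be approximated with a single regular expression. The sketch below is not NLTK's algorithm (for example, `word_tokenize` splits "don't" into "do" and "n't", while this pattern keeps it whole); `TOKEN_PATTERN` and `simple_tokenize` are illustrative names.

```python
import re

# One alternative per token type, tried left to right:
#   \w+(?:'\w+)?  a word, optionally with an internal apostrophe (e.g. "don't")
#   [^\w\s]       any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("The quick brown fox jumps over the lazy dog in the garden."))
```

On the sample sentence this yields the same tokens as the NLTK output above, but the two approaches diverge on contractions, abbreviations, and other special cases.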

**Method 3: Tokenization with spaCy**

In [3]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Process the sentence
doc = nlp(text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(f"spaCy Tokens: {spacy_tokens}")

spaCy Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'in', 'the', 'garden', '.']


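**spaCy Tokenization Advantages:**

Rule-based and non-destructive: Splits on whitespace, then applies language-specific prefix, suffix, and exception rules; the original text can be rebuilt from the tokens

Rich token attributes: Provides additional linguistic information

Multi-language support: Handles various languages effectively

Pipeline integration: The resulting Doc object feeds the statistical tagger, parser, and NER components, which do use surrounding context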

# **Stop Word Removal**

**What are Stop Words?**

Stop words are common words that usually do not carry significant meaning and are often removed during text preprocessing. Examples include "is", "and", "the", "of", etc. These words appear frequently in text but typically don't contribute meaningful information for analysis purposes.

**Why Remove Stop Words?**

Noise Reduction: Eliminates low-information content

Focus Enhancement: Emphasizes important terms

Performance Improvement: Reduces computational overhead

Feature Quality: Improves signal-to-noise ratio in text analysis

In [4]:
from nltk.corpus import stopwords

# Download stopwords data
nltk.download('stopwords', quiet=True)

# Sample text
text = "This is a sample sentence demonstrating the removal of stop words."
print(f"Original text: {text}")

# Step 1: Tokenize
words = word_tokenize(text)
print(f"Tokenized words: {words}")

# Step 2: Get English stop words
stop_words = set(stopwords.words('english'))
print(f"Number of English stop words: {len(stop_words)}")
print(f"Sample stop words: {list(stop_words)[:10]}")

# Step 3: Identify stop words in text
stop_words_in_text = [word for word in words if word.lower() in stop_words]
print(f"Stop words found in text: {stop_words_in_text}")

# Step 4: Remove stop words
filtered_sentence = [word for word in words if word.lower() not in stop_words]
print(f"Text without stop words: {' '.join(filtered_sentence)}")

Original text: This is a sample sentence demonstrating the removal of stop words.
Tokenized words: ['This', 'is', 'a', 'sample', 'sentence', 'demonstrating', 'the', 'removal', 'of', 'stop', 'words', '.']
Number of English stop words: 198
Sample stop words: ["she's", 'after', 'and', 'y', "shouldn't", 'for', 'myself', "i've", 'what', "don't"]
Stop words found in text: ['This', 'is', 'a', 'the', 'of']
Text without stop words: sample sentence demonstrating removal stop words .


**Advanced Stop Word Removal with spaCy**

In [5]:
doc = nlp(text)
spacy_filtered = [token.text for token in doc if not token.is_stop and token.is_alpha]
print(f"spaCy filtered text: {' '.join(spacy_filtered)}")

spaCy filtered text: sample sentence demonstrating removal stop words


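**spaCy Advantages:**

Built-in stop word detection: No separate download required

Alphabetic filtering: The token.is_alpha check drops punctuation (and numbers) in the same pass

List-based lookup: token.is_stop is a lexical attribute checked against a fixed word list, so it is no more context-aware than the NLTK approach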
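
Stop word lists are not one-size-fits-all. The NLTK English list, for instance, includes negations such as "not", which a sentiment task usually needs to keep. The sketch below uses a small hand-picked stop list (an assumption for illustration, not NLTK's or spaCy's list) to show how a list can be customized before filtering.

```python
# Small hand-picked stop list for illustration only.
BASE_STOP_WORDS = {"this", "is", "a", "the", "of", "not", "was", "it"}

# Words that flip the meaning of a sentence; sentiment tasks usually keep them.
KEEP = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words):
    """Drop stop words using case-insensitive matching."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "movie", "was", "not", "good"]
print(remove_stop_words(tokens, BASE_STOP_WORDS))         # negation is lost
print(remove_stop_words(tokens, BASE_STOP_WORDS - KEEP))  # negation is kept
```

The same set-difference trick works with the set built from `stopwords.words('english')`.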

# **Part-of-Speech (POS) Tagging**

**What is POS Tagging?**

POS tagging is the process of assigning grammatical labels to words in a sentence based on their context and definition. Each word is tagged with a specific part of speech such as noun, verb, adjective, adverb, preposition, etc.

**Common POS Tags:**

Nouns (NN, NNS, NNP): Person, place, thing, or idea

Verbs (VB, VBD, VBG, VBN, VBP, VBZ): Action or state words

Adjectives (JJ, JJR, JJS): Descriptive words that modify nouns

Adverbs (RB, RBR, RBS): Words that modify verbs, adjectives, or other adverbs

Pronouns (PRP, PRP$): Words that replace nouns

Prepositions (IN): Words that show relationships between other words

Conjunctions (CC): Words that connect words, phrases, or clauses

**Method 1: Basic POS Tagging with NLTK**

In [6]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter

# Download required data
import nltk
nltk.download('averaged_perceptron_tagger_eng', quiet=True) # Explicitly download the English tagger
nltk.download('punkt', quiet=True)

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog in the garden."
tokens = word_tokenize(sentence)

# Perform POS tagging
pos_tags = pos_tag(tokens)
print(f"POS Tags: {pos_tags}")

# Function to explain POS tags
def explain_pos_tags(pos_tags):
    tag_meanings = {
        'DT': 'Determiner',
        'JJ': 'Adjective',
        'NN': 'Noun, singular',
        'NNS': 'Noun, plural',
        'VBZ': 'Verb, 3rd person singular present',
        'VB': 'Verb, base form',
        'IN': 'Preposition or subordinating conjunction',
        'PRP': 'Personal pronoun',
        'RB': 'Adverb',
        'CC': 'Coordinating conjunction',
        '.': 'Punctuation',
        'VBD': 'Verb, past tense',
        'MD': 'Modal',
        'PRP$': 'Possessive pronoun',
        'NFP': 'Superfluous punctuation',
        'POS': 'Possessive ending',
        'WP': 'Wh-pronoun',
        'WRB': 'Wh-adverb'
    }

    print("\nWord\t\tPOS Tag\t\tMeaning")
    print("-" * 50)
    for word, tag in pos_tags:
        meaning = tag_meanings.get(tag, f'Other ({tag})')
        print(f"{word:<12}\t{tag:<8}\t{meaning}")

explain_pos_tags(pos_tags)

POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('in', 'IN'), ('the', 'DT'), ('garden', 'NN'), ('.', '.')]

Word		POS Tag		Meaning
--------------------------------------------------
The         	DT      	Determiner
quick       	JJ      	Adjective
brown       	NN      	Noun, singular
fox         	NN      	Noun, singular
jumps       	VBZ     	Verb, 3rd person singular present
over        	IN      	Preposition or subordinating conjunction
the         	DT      	Determiner
lazy        	JJ      	Adjective
dog         	NN      	Noun, singular
in          	IN      	Preposition or subordinating conjunction
the         	DT      	Determiner
garden      	NN      	Noun, singular
.           	.       	Punctuation


**Method 2: Universal POS Tags**

In [7]:
# Using universal POS tags for simplified categories
import nltk
nltk.download('universal_tagset', quiet=True) # Download the universal tagset

universal_pos_tags = pos_tag(tokens, tagset='universal')
print(f"Universal POS Tags: {universal_pos_tags}")

# Count tag frequencies
universal_tag_freq = Counter([tag for word, tag in universal_pos_tags])
print(f"Tag Frequencies: {universal_tag_freq}")

Universal POS Tags: [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('in', 'ADP'), ('the', 'DET'), ('garden', 'NOUN'), ('.', '.')]
Tag Frequencies: Counter({'NOUN': 4, 'DET': 3, 'ADJ': 2, 'ADP': 2, 'VERB': 1, '.': 1})


**Universal Tags Benefits:**

Language Independence: Consistent across different languages

Simplified Categories: Fewer, more general tag types

Cross-linguistic Analysis: Enables comparison between languages
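
Conceptually, the universal tagset is a many-to-one mapping from fine-grained tags to coarse categories. The sketch below hard-codes a partial mapping covering only the Penn Treebank tags that appear in the example sentence; `PENN_TO_UNIVERSAL` is an illustrative table, not the mapping file NLTK downloads.

```python
from collections import Counter

# Partial Penn Treebank -> universal mapping (only the tags used in this example).
PENN_TO_UNIVERSAL = {
    "DT": "DET", "JJ": "ADJ", "NN": "NOUN", "NNS": "NOUN",
    "VBZ": "VERB", "IN": "ADP", ".": ".",
}

# Penn tags copied from the Method 1 output above.
penn_tags = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
             ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
             ("dog", "NN"), ("in", "IN"), ("the", "DT"), ("garden", "NN"), (".", ".")]

# Unknown tags fall back to "X", the universal tag for "other".
universal = [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in penn_tags]
tag_freq = Counter(tag for _, tag in universal)

print(universal)
print(tag_freq)
```

Because several fine-grained tags collapse into one coarse tag, frequency counts over universal tags are easier to compare across taggers and languages.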

**Method 3: Advanced POS Tagging with spaCy**

In [8]:
sentence = "The quick brown fox jumps over the lazy dog in the garden 4 times."
# Process with spaCy
doc = nlp(sentence)

# Extract detailed POS information
spacy_pos_tags = [(token.text, token.pos_, token.tag_) for token in doc]
print("spaCy POS Tags (word, pos, detailed_tag):")
for word, pos, tag in spacy_pos_tags:
    print(f"{word:<12} {pos:<8} {tag}")

# Detailed token analysis
def analyze_sentence_spacy(sentence):
    doc = nlp(sentence)

    print("\nDetailed POS Analysis:")
    print("-" * 80)
    print(f"{'Word':<12} {'POS':<8} {'Tag':<8} {'Lemma':<12} {'Shape':<10} {'Alpha'}")
    print("-" * 80)

    for token in doc:
        print(f"{token.text:<12} {token.pos_:<8} {token.tag_:<8} {token.lemma_:<12} {token.shape_:<10} {token.is_alpha}")

analyze_sentence_spacy(sentence)

spaCy POS Tags (word, pos, detailed_tag):
The          DET      DT
quick        ADJ      JJ
brown        ADJ      JJ
fox          NOUN     NN
jumps        VERB     VBZ
over         ADP      IN
the          DET      DT
lazy         ADJ      JJ
dog          NOUN     NN
in           ADP      IN
the          DET      DT
garden       NOUN     NN
4            NUM      CD
times        NOUN     NNS
.            PUNCT    .

Detailed POS Analysis:
--------------------------------------------------------------------------------
Word         POS      Tag      Lemma        Shape      Alpha
--------------------------------------------------------------------------------
The          DET      DT       the          Xxx        True
quick        ADJ      JJ       quick        xxxx       True
brown        ADJ      JJ       brown        xxxx       True
fox          NOUN     NN       fox          xxx        True
jumps        VERB     VBZ      jump         xxxx       True
over         ADP      IN       over

**spaCy POS Features:**

Fine-grained tags: Provides both coarse and fine-grained POS information

Additional attributes: Lemma, shape, and other linguistic features

Statistical accuracy: Uses trained models for better performance
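
The Shape column in the table above encodes a token's orthographic pattern. As a rough sketch of how such a feature can be computed (an approximation written for this guide, not spaCy's source code), uppercase letters map to X, lowercase to x, digits to d, and runs of the same symbol are capped at four characters.

```python
def word_shape(text, max_run=4):
    """Approximate a spaCy-style shape string such as 'Xxx' or 'dddd'."""
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation is kept as-is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:  # cap long runs so similar words share a shape
            shape.append(symbol)
    return "".join(shape)

for word in ["The", "quick", "fox", "4", "2011", "California"]:
    print(f"{word:<12} {word_shape(word)}")
```

Shape features let a model generalize over capitalization and digit patterns, which is useful for spotting names, dates, and amounts.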

# **Lemmatization and Stemming**

**Understanding the Difference**

**Stemming** is a text normalization technique that reduces a word to its root form by stripping affixes without considering the word's meaning or context. It's faster but less accurate.

**Lemmatization** reduces a word to its base form by considering the context and part of speech. This approach provides more meaningful and accurate results.


**Key Differences:**

Stemming: Faster, rule-based suffix removal, may produce non-words

Lemmatization: More sophisticated, context-aware, normally produces dictionary words

Example: "studies" → "studi" (stemming) vs "study" (lemmatization)
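
To see why suffix stripping can produce non-words, here is a deliberately tiny stemmer with four hand-written rules. It is a toy and not the Porter algorithm, which has many more rules and conditions; `SUFFIX_RULES` and `toy_stem` are illustrative names.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
MIN_STEM_LENGTH = 3  # avoid reducing short words like "is" to a single letter

def toy_stem(word):
    """Strip one suffix without checking whether the result is a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)] + replacement
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

for word in ["studies", "studying", "studied", "runs", "mice"]:
    print(f"{word:<10} {toy_stem(word)}")
```

The rules never consult a dictionary, so "studies" becomes the non-word "studi" and the irregular plural "mice" is left untouched. Handling those cases correctly is what lemmatization adds.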

**Method 1: Stemming with NLTK PorterStemmer**

In [9]:
from nltk.stem import PorterStemmer

# Initialize stemmer
stemmer = PorterStemmer()

# Test words
test_words = ["running", "runs", "easily", "fairly", "fairness", "studies", "studying", "studied"]

print("Word\t\tStemmed")
print("-" * 30)
for word in test_words:
    stemmed_word = stemmer.stem(word)
    print(f"{word:<12}\t{stemmed_word}")

Word		Stemmed
------------------------------
running     	run
runs        	run
easily      	easili
fairly      	fairli
fairness    	fair
studies     	studi
studying    	studi
studied     	studi


**Porter Stemmer Characteristics:**

Rule-based: Uses predefined suffix removal rules

Fast processing: Suitable for large-scale text

Language-specific: Optimized for English

Aggressive stemming: May over-stem words

**Method 2: Lemmatization with spaCy**

In [10]:
# Test lemmatization
print("Word\t\tLemmatized")
print("-" * 30)
for word in test_words:
    doc = nlp(word)
    lemmatized_word = doc[0].lemma_
    print(f"{word:<12}\t{lemmatized_word}")

Word		Lemmatized
------------------------------
running     	run
runs        	run
easily      	easily
fairly      	fairly
fairness    	fairness
studies     	study
studying    	study
studied     	study


**Method 3: Context-Aware Lemmatization**

In [11]:
# Process sentences for context-aware lemmatization
sentences = [
    "The cats are running in the garden.",
    "He runs every morning for fitness.",
    "The studies show interesting results.",
    "She is studying for her exams."
]

for sentence in sentences:
    print(f"\nOriginal: {sentence}")
    doc = nlp(sentence)

    lemmatized_words = []
    for token in doc:
        if token.is_alpha:  # Only process alphabetic tokens
            lemmatized_words.append(token.lemma_)
        else:
            lemmatized_words.append(token.text)

    print(f"Lemmatized: {' '.join(lemmatized_words)}")


Original: The cats are running in the garden.
Lemmatized: the cat be run in the garden .

Original: He runs every morning for fitness.
Lemmatized: he run every morning for fitness .

Original: The studies show interesting results.
Lemmatized: the study show interesting result .

Original: She is studying for her exams.
Lemmatized: she be study for her exam .


**Context-Aware Benefits:**

POS consideration: Uses part-of-speech for accurate lemmatization

Meaning preservation: Maintains semantic meaning

Disambiguation: Handles words with multiple possible lemmas
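
The role of part of speech in disambiguation can be sketched with a lookup table keyed by both the word and its tag. The mini-lexicon below is hypothetical and hand-written for illustration; real lemmatizers combine much larger tables with rules.

```python
# Hypothetical mini-lexicon keyed by (lowercased word, coarse POS tag).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

def lemmatize(word, pos):
    """Look up the lemma for a word given its POS tag; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())

tagged = [("She", "PRON"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")]
print([lemmatize(word, pos) for word, pos in tagged])
```

The same surface form "saw" maps to two different lemmas, which is only possible because the tag is part of the lookup key.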

**Comparison: Stemming vs Lemmatization**

In [12]:
comparison_words = ["better", "flying", "flies", "dogs", "churches", "mice", "feet"]

print("Word\t\tStemmed\t\tLemmatized")
print("-" * 45)
for word in comparison_words:
    stemmed = stemmer.stem(word)
    doc = nlp(word)
    lemmatized = doc[0].lemma_
    print(f"{word:<12}\t{stemmed:<12}\t{lemmatized}")

Word		Stemmed		Lemmatized
---------------------------------------------
better      	better      	well
flying      	fli         	fly
flies       	fli         	fly
dogs        	dog         	dog
churches    	church      	church
mice        	mice        	mouse
feet        	feet        	foot


# **Named Entity Recognition (NER)**

**What is NER?**

Named Entity Recognition (NER) is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, quantities, monetary values, percentages, etc.


**Common NER Categories:**

PERSON: People, including fictional characters

NORP: Nationalities, religious, or political groups

FAC: Buildings, airports, highways, bridges

ORG: Companies, agencies, institutions

GPE: Countries, cities, states

LOC: Non-GPE locations, mountain ranges, bodies of water

PRODUCT: Objects, vehicles, foods (not services)

EVENT: Named hurricanes, battles, wars, sports events

WORK_OF_ART: Titles of books, songs, etc.

LAW: Named documents made into laws

LANGUAGE: Any named language

DATE: Absolute or relative dates or periods

TIME: Times smaller than a day

PERCENT: Percentage, including "%"

MONEY: Monetary values, including unit

QUANTITY: Measurements, weight, distance

ORDINAL: "first", "second", etc.

CARDINAL: Numerals that don't fall under another type

**Method 1: NER with spaCy**

In [13]:
# Sample text with various named entities
text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976
in Cupertino, California. The company went public on December 12, 1980.
Today, Apple is worth over $2 trillion and employs more than 150,000 people worldwide.
Tim Cook became CEO in 2011, succeeding Steve Jobs.
"""

# Process with spaCy
doc = nlp(text)

# Extract named entities
print("Named Entities Found:")
print("-" * 60)
print(f"{'Entity':<20} {'Label':<12} {'Description'}")
print("-" * 60)

for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<12} {spacy.explain(ent.label_)}")

Named Entities Found:
------------------------------------------------------------
Entity               Label        Description
------------------------------------------------------------
Apple Inc.           ORG          Companies, agencies, institutions, etc.
Steve Jobs           PERSON       People, including fictional
Steve Wozniak        PERSON       People, including fictional
Ronald Wayne         PERSON       People, including fictional
April 1976           DATE         Absolute or relative dates or periods
Cupertino            GPE          Countries, cities, states
California           GPE          Countries, cities, states
December 12, 1980    DATE         Absolute or relative dates or periods
Today                DATE         Absolute or relative dates or periods
Apple                ORG          Companies, agencies, institutions, etc.
over $2 trillion     MONEY        Monetary values, including unit
more than 150,000    CARDINAL     Numerals that do not fall under another 

**Method 2: Detailed NER Analysis**

In [14]:
def analyze_entities(text):
    doc = nlp(text)

    # Group entities by type
    entity_dict = {}
    for ent in doc.ents:
        if ent.label_ not in entity_dict:
            entity_dict[ent.label_] = []
        entity_dict[ent.label_].append(ent.text)

    print("Entities grouped by type:")
    print("-" * 40)
    for label, entities in entity_dict.items():
        print(f"{label} ({spacy.explain(label)}): {entities}")

    return entity_dict

entity_analysis = analyze_entities(text)

Entities grouped by type:
----------------------------------------
ORG (Companies, agencies, institutions, etc.): ['Apple Inc.', 'Apple']
PERSON (People, including fictional): ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne', 'Tim Cook', 'Steve Jobs']
DATE (Absolute or relative dates or periods): ['April 1976', 'December 12, 1980', 'Today', '2011']
GPE (Countries, cities, states): ['Cupertino', 'California']
MONEY (Monetary values, including unit): ['over $2 trillion']
CARDINAL (Numerals that do not fall under another type): ['more than 150,000']


**NER Analysis Benefits:**

Information Extraction: Automatically identifies key information

Knowledge Graph Construction: Enables relationship mapping

Data Organization: Categorizes entities for structured analysis

Semantic Understanding: Provides context about text content
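
Once entities are extracted as (text, label) pairs, ordinary Python containers are enough to organize them. The sketch below reuses the entities from the grouped output above as plain tuples (so it runs without spaCy) and counts repeated mentions per label; `mention_counts` is an illustrative name.

```python
from collections import Counter, defaultdict

# (text, label) pairs taken from the grouped output above.
entities = [
    ("Apple Inc.", "ORG"), ("Apple", "ORG"),
    ("Steve Jobs", "PERSON"), ("Steve Wozniak", "PERSON"),
    ("Ronald Wayne", "PERSON"), ("Tim Cook", "PERSON"), ("Steve Jobs", "PERSON"),
    ("April 1976", "DATE"), ("December 12, 1980", "DATE"),
    ("Today", "DATE"), ("2011", "DATE"),
    ("Cupertino", "GPE"), ("California", "GPE"),
    ("over $2 trillion", "MONEY"), ("more than 150,000", "CARDINAL"),
]

# label -> Counter of entity text, so duplicates are counted instead of repeated.
mention_counts = defaultdict(Counter)
for text, label in entities:
    mention_counts[label][text] += 1

for label, counts in mention_counts.items():
    print(f"{label}: {dict(counts)}")
```

Mention counts like these are a common starting point for deciding which entities become nodes in a knowledge graph.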

**Method 3: Batch NER Processing**

In [15]:
sample_texts = [
    "Barack Obama was the 44th President of the United States from 2009 to 2017.",
    "Google was founded by Larry Page and Sergey Brin in September 1998 in California.",
    "The meeting is scheduled for next Monday at 3:00 PM in New York.",
    "Microsoft acquired LinkedIn for $26.2 billion in June 2016."
]

for i, text in enumerate(sample_texts, 1):
    print(f"\nText {i}: {text}")
    doc = nlp(text)

    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:
        print("Entities:", entities)
    else:
        print("No entities found")


Text 1: Barack Obama was the 44th President of the United States from 2009 to 2017.
Entities: [('Barack Obama', 'PERSON'), ('44th', 'ORDINAL'), ('the United States', 'GPE'), ('2009', 'DATE')]

Text 2: Google was founded by Larry Page and Sergey Brin in September 1998 in California.
Entities: [('Google', 'ORG'), ('Larry Page', 'PERSON'), ('Sergey Brin', 'PERSON'), ('September 1998', 'DATE'), ('California', 'GPE')]

Text 3: The meeting is scheduled for next Monday at 3:00 PM in New York.
Entities: [('next Monday', 'DATE'), ('3:00 PM', 'TIME'), ('New York', 'GPE')]

Text 4: Microsoft acquired LinkedIn for $26.2 billion in June 2016.
Entities: [('Microsoft', 'ORG'), ('LinkedIn', 'ORG'), ('$26.2 billion', 'MONEY'), ('June 2016', 'DATE')]


# **Complete Pipeline Demonstration**
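
**Pipeline Order Importance**

The order of preprocessing steps matters:

1. Tokenization: Break text into tokens
2. Stop Word Removal: Remove low-information words
3. POS Tagging: Identify grammatical roles
4. Lemmatization/Stemming: Normalize word forms
5. Named Entity Recognition: Identify important entities

Taggers and entity recognizers rely on sentence context, so NER is run on the original text rather than on the filtered tokens. For the same reason, POS tags assigned after stop word removal can be less reliable than tags assigned to the full sentence.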
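
A small experiment shows why the order above matters. The sketch below runs the same two steps in both orders; the stop list and the crude `strip_plural_s` normalizer are stand-ins invented for this illustration, not the functions used elsewhere in this notebook.

```python
STOP_WORDS = {"this", "is", "the"}  # tiny illustrative list

def remove_stop_words(tokens):
    """Drop stop words using case-insensitive matching."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def strip_plural_s(tokens):
    """Crude normalizer standing in for a real stemmer."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def run(tokens, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        tokens = step(tokens)
    return tokens

tokens = ["This", "shows", "the", "cats"]
print(run(tokens, [remove_stop_words, strip_plural_s]))  # stop words first
print(run(tokens, [strip_plural_s, remove_stop_words]))  # normalization first
```

When normalization runs first, "This" is shortened to "Thi" and no longer matches the stop list, so it survives the filter. Removing stop words before normalizing avoids this kind of leak.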

# **Complete Pipeline Function**

In [16]:
def complete_nlp_pipeline(text, use_lemmatization=True):
    """
    Complete NLP preprocessing pipeline

    Args:
        text (str): Input text to process
        use_lemmatization (bool): Use lemmatization if True, stemming if False

    Returns:
        dict: Results from each preprocessing step
    """

    results = {'original_text': text}

    print(f"Original Text: {text}")
    print("-" * 80)

    # Step 1: Tokenization
    tokens = word_tokenize(text)
    results['tokens'] = tokens
    print(f"1. Tokenization: {tokens}")

    # Step 2: Stop Word Removal
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word.isalpha()]
    results['filtered_tokens'] = filtered_tokens
    print(f"2. After Stop Word Removal: {filtered_tokens}")

    # Step 3: POS Tagging
    pos_tags = pos_tag(filtered_tokens)
    results['pos_tags'] = pos_tags
    print(f"3. POS Tags: {pos_tags}")

    # Step 4: Lemmatization or Stemming
    if use_lemmatization:
        processed_words = []
        for word in filtered_tokens:
            doc = nlp(word)
            processed_words.append(doc[0].lemma_)
        results['lemmatized'] = processed_words
        print(f"4. Lemmatized: {processed_words}")
    else:
        stemmer = PorterStemmer()
        stemmed_words = [stemmer.stem(word) for word in filtered_tokens]
        results['stemmed'] = stemmed_words
        print(f"4. Stemmed: {stemmed_words}")

    # Step 5: Named Entity Recognition (on original text)
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    results['entities'] = entities
    print(f"5. Named Entities: {entities}")

    return results

# Example usage
sample_text = "Apple Inc. was founded by Steve Jobs in California. The company develops innovative products and services."

print("Using Lemmatization:")
results_lemma = complete_nlp_pipeline(sample_text, use_lemmatization=True)

print("\nUsing Stemming:")
results_stem = complete_nlp_pipeline(sample_text, use_lemmatization=False)

Using Lemmatization:
Original Text: Apple Inc. was founded by Steve Jobs in California. The company develops innovative products and services.
--------------------------------------------------------------------------------
1. Tokenization: ['Apple', 'Inc.', 'was', 'founded', 'by', 'Steve', 'Jobs', 'in', 'California', '.', 'The', 'company', 'develops', 'innovative', 'products', 'and', 'services', '.']
2. After Stop Word Removal: ['Apple', 'founded', 'Steve', 'Jobs', 'California', 'company', 'develops', 'innovative', 'products', 'services']
3. POS Tags: [('Apple', 'NNP'), ('founded', 'VBD'), ('Steve', 'NNP'), ('Jobs', 'NNP'), ('California', 'NNP'), ('company', 'NN'), ('develops', 'VBZ'), ('innovative', 'JJ'), ('products', 'NNS'), ('services', 'NNS')]
4. Lemmatized: ['apple', 'found', 'Steve', 'job', 'California', 'company', 'develop', 'innovative', 'product', 'service']
5. Named Entities: [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('California', 'GPE')]

Using Stemming:
Original 

# **POS Distribution Analysis**

In [17]:
def analyze_pos_distribution(text):
    doc = nlp(text)

    # Count POS frequencies
    pos_counts = Counter([token.pos_ for token in doc if token.is_alpha])

    print(f"\nPOS Distribution for: {text}")
    print("-" * 50)
    for pos, count in pos_counts.most_common():
        print(f"{pos}: {count}")

    return pos_counts

# Analyze sample text
sample_text = """
Natural language processing is a fascinating field of artificial intelligence
that focuses on the interaction between computers and human language. It involves
developing algorithms and models that can understand, interpret, and generate
human language in a meaningful way. Applications include machine translation,
sentiment analysis, chatbots, and voice assistants.
"""

pos_distribution = analyze_pos_distribution(sample_text)


POS Distribution for: 
Natural language processing is a fascinating field of artificial intelligence
that focuses on the interaction between computers and human language. It involves
developing algorithms and models that can understand, interpret, and generate
human language in a meaningful way. Applications include machine translation,
sentiment analysis, chatbots, and voice assistants.

--------------------------------------------------
NOUN: 19
VERB: 7
ADJ: 6
ADP: 4
CCONJ: 4
DET: 3
PRON: 3
AUX: 2
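
Raw counts are easier to compare across texts of different lengths when converted to proportions. The sketch below hard-codes the counts from the output above so that it runs on its own; in the notebook, the `pos_distribution` Counter returned by `analyze_pos_distribution` could be used directly.

```python
from collections import Counter

# Counts copied from the POS distribution output above.
pos_counts = Counter({"NOUN": 19, "VERB": 7, "ADJ": 6, "ADP": 4,
                      "CCONJ": 4, "DET": 3, "PRON": 3, "AUX": 2})

total = sum(pos_counts.values())
print(f"Total alphabetic tokens: {total}")
for pos, count in pos_counts.most_common():
    print(f"{pos:<6} {count:>3}  {count / total:6.1%}")
```

Nouns make up roughly two fifths of the alphabetic tokens in this sample, which is typical of dense, descriptive prose.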


# **Best Practices and Summary**
# **Key Learning Points**
1. Tokenization

    *   Modern tokenization: Uses advanced techniques like Byte Pair Encoding (BPE)
    *   Subword tokenization: Helps handle out-of-vocabulary words effectively
    *   Token IDs: Numerical representations that models can process
    *   Reversibility: Token IDs can be decoded back to the original text

2. Stop Word Removal

    *   Essential preprocessing: Removes common, low-information words
    *   Content focus: Improves focus on meaningful content in text analysis
    *   Case-insensitive matching: Ensures comprehensive removal
    *   Domain considerations: Some applications may need certain stop words

3. POS Tagging

    *   Grammatical categorization: Assigns grammatical categories to words in context
    *   Context dependency: Different taggers may produce different results for ambiguous cases
    *   Universal tags: Provide language-independent categorization
    *   Feature extraction: Creates features for downstream tasks

4. Lemmatization vs Stemming

    *   Stemming: Faster but less accurate, using rule-based suffix removal
    *   Lemmatization: More sophisticated, considering context and part-of-speech
    *   Word validity: Lemmatization normally produces dictionary words, while stemming may not
    *   Accuracy trade-off: Lemmatization is generally preferred for accuracy

5. Named Entity Recognition

    *   Information extraction: Identifies and classifies named entities in predefined categories
    *   Knowledge construction: Crucial for information extraction and knowledge graph construction
    *   Context requirement: Works best with complete, grammatically correct sentences
    *   Domain adaptation: May require fine-tuning for specific domains
