# 📚 Natural Language Processing (NLP) Fundamentals & Text Preprocessing

**Author:** Chigozilai Kejeh  
**Connect:** [LinkedIn](https://www.linkedin.com/in/chigozilai-kejeh-058014143/)  
**Level:** Beginner to Intermediate  
**Duration:** 1-2 hours  

## 🎯 Learning Objectives
By the end of this notebook, you will:
- Understand what NLP is and its core concepts
- Master text preprocessing fundamentals
- Work with NLTK and spaCy libraries
- Implement tokenization, stemming, and POS tagging
- Build a complete sentiment analysis model

---

## 🔧 Setup & Installation

First, let's install the required libraries:

In [None]:
# Install required packages
!pip install nltk spacy scikit-learn pandas numpy matplotlib seaborn wordcloud
!python -m spacy download en_core_web_sm

# Import libraries
import nltk
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

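# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('vader_lexicon')
# Recent NLTK releases load the tokenizer and tagger from these resources instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')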

print("✅ Setup complete!")

---

# 1️⃣ What is Natural Language Processing (NLP)?

## 📖 Theory Overview

**Natural Language Processing (NLP)** is a branch of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language.

### 🧠 Core NLP Concepts:

```
NLP Pipeline:
Raw Text → Preprocessing → Feature Extraction → Model → Results
     ↓           ↓              ↓            ↓        ↓
"I love AI"  → Clean text  → Vectors    → Algorithm → Positive
```
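
The stages in the diagram above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two word sets and the function names (`preprocess`, `extract_features`, `predict`, `run_pipeline`) are invented for this example, and a real model would be learned from data rather than hand-written.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, invented for this sketch only.
POSITIVE_WORDS = {"love", "great", "amazing"}
NEGATIVE_WORDS = {"hate", "terrible", "awful"}

def preprocess(raw_text):
    """Preprocessing: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", raw_text.lower())

def extract_features(tokens):
    """Feature extraction: count how often each token occurs."""
    return Counter(tokens)

def predict(features):
    """'Model': compare counts of positive and negative lexicon words."""
    score = (sum(features[w] for w in POSITIVE_WORDS)
             - sum(features[w] for w in NEGATIVE_WORDS))
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

def run_pipeline(raw_text):
    """Raw text -> preprocessing -> features -> model -> result."""
    return predict(extract_features(preprocess(raw_text)))

print(run_pipeline("I love AI"))
```

Each function corresponds to one arrow in the diagram, so any stage can be swapped out (for example, replacing the lexicon rule with a trained classifier) without touching the others.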

### 🎯 NLP Applications:
- **Sentiment Analysis**: Understanding emotions in text
- **Machine Translation**: Google Translate
- **Chatbots**: Customer service automation
- **Text Summarization**: News article summaries
- **Named Entity Recognition**: Extracting names, places, organizations

### 📚 Further Reading:
- [NLP Courses - Kaggle](https://www.kaggle.com/discussions/questions-and-answers/278543)
- [LLM Course - Hugging Face](https://huggingface.co/learn/llm-course/chapter1/1)
- [NLP Tutorial - GeeksforGeeks](https://www.geeksforgeeks.org/nlp/natural-language-processing-nlp-tutorial/)

---

## 🛠️ Hands-On Practice: Your First NLP Program

In [None]:
# Let's start with a simple example
sample_text = "Natural Language Processing is amazing! I love learning about AI and machine learning."

# Basic text analysis
print("📝 Original Text:")
print(sample_text)
print("\n📊 Basic Statistics:")
print(f"Character count: {len(sample_text)}")
print(f"Word count: {len(sample_text.split())}")
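# Count sentence-ending punctuation; counting only '.' would miss the sentence ending in '!'
print(f"Sentence count: {sum(sample_text.count(p) for p in '.!?')}")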

# Case conversion examples
print("\n🔄 Text Transformations:")
print(f"Lowercase: {sample_text.lower()}")
print(f"Uppercase: {sample_text.upper()}")
print(f"Title Case: {sample_text.title()}")

---

# 2️⃣ Text Preprocessing Fundamentals

## 📖 Theory Overview

**Text Preprocessing** is the crucial first step in any NLP pipeline. Raw text is messy and needs cleaning before analysis. No matter how advanced the model, reliable results start with clean, consistent input.

![Local Image](img/nlp_pipeline.png)

### 🧹 Common Preprocessing Steps:

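```
Raw Text: "Hello World! I'm learning NLP... It's GREAT!!!"
    ↓
Lowercase: "hello world! i'm learning nlp... it's great!!!"
    ↓
Remove Punctuation: "hello world im learning nlp its great"
    ↓
Tokenization: ["hello", "world", "im", "learning", "nlp", "its", "great"]
    ↓
Remove Stopwords: ["hello", "world", "im", "learning", "nlp", "great"]
    ↓
Stemming: ["hello", "world", "im", "learn", "nlp", "great"]
```

Note that "hello" and "world" are not stopwords, so they survive. Stripping punctuation first also turns "i'm" into "im", which common stopword lists such as NLTK's do not contain, so the order of the steps matters.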

### 📚 Further Reading:
- [Text Preprocessing Guide](https://www.geeksforgeeks.org/machine-learning/text-preprocessing-in-python-set-1/)

---

## 🛠️ Hands-On Practice: Text Cleaning Pipeline

In [None]:
import re
import string

# Sample messy text
messy_text = """
    Hello World!!! I'm learning NLP... It's ABSOLUTELY amazing!!! 
    Visit https://example.com for more info. Email: test@email.com
    Phone: +1-234-567-8900. #NLP #MachineLearning @AIExpert
"""

print("📝 Original Messy Text:")
print(repr(messy_text))

def clean_text_step_by_step(text):
    """Demonstrate text cleaning step by step"""
    
    print("\n🧹 Step-by-Step Text Cleaning:")
    
    # Step 1: Remove extra whitespace
    step1 = re.sub(r'\s+', ' ', text.strip())
    print(f"\n1. Remove extra whitespace:\n{repr(step1)}")
    
    # Step 2: Convert to lowercase
    step2 = step1.lower()
    print(f"\n2. Convert to lowercase:\n{step2}")
    
    # Step 3: Remove URLs
    step3 = re.sub(r'https?://\S+|www\.\S+', '', step2)
    print(f"\n3. Remove URLs:\n{step3}")
    
    # Step 4: Remove email addresses
    step4 = re.sub(r'\S+@\S+', '', step3)
    print(f"\n4. Remove emails:\n{step4}")
    
    # Step 5: Remove phone numbers
    step5 = re.sub(r'\+?\d[\d\s\-\(\)]+\d', '', step4)
    print(f"\n5. Remove phone numbers:\n{step5}")
    
    # Step 6: Remove social media mentions and hashtags
    step6 = re.sub(r'[@#]\w+', '', step5)
    print(f"\n6. Remove social media tags:\n{step6}")
    
    # Step 7: Remove punctuation
    step7 = step6.translate(str.maketrans('', '', string.punctuation))
    print(f"\n7. Remove punctuation:\n{step7}")
    
    # Step 8: Remove extra spaces again
    final = re.sub(r'\s+', ' ', step7.strip())
    print(f"\n8. Final cleaned text:\n{final}")
    
    return final

cleaned_text = clean_text_step_by_step(messy_text)

---

# 3️⃣ NLTK Library Deep Dive

## 📖 Theory Overview

**NLTK (Natural Language Toolkit)** is one of the most popular Python libraries for NLP. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

### 🔧 NLTK Key Features:
- **Tokenization**: Breaking text into words/sentences
- **Stemming & Lemmatization**: Reducing words to root forms
- **POS Tagging**: Identifying parts of speech
- **Named Entity Recognition**: Extracting entities
- **Sentiment Analysis**: VADER sentiment analyzer

### 📚 Further Reading:
- [NLTK Official Documentation](https://www.nltk.org/)
- [NLTK Book Online](https://www.nltk.org/book/)

---

## 🛠️ Hands-On Practice: Tokenization

Tokenization is splitting raw text into smaller units called tokens—these might be words, subwords, or individual characters.

![tokenization](img/token.png)

Humans can intuitively parse the text into `generative AI is fascinating and is the future`, but machines require explicit instructions to recognize word boundaries. Tokenization bridges this gap, enabling machines to identify and separate individual words or meaningful subunits within the text.

Let’s implement a basic word tokenizer using Python. This example will split a sentence into words and punctuation marks, demonstrating how tokenization structures raw text.

In [None]:
def simple_tokenize(text):
    tokens = []
    current_word = ""
    for char in text:
        if char.isalnum():
            current_word += char
        else:
            if current_word != "":
                tokens.append(current_word)  # Append the accumulated word.
                current_word = ""
            if char.strip() != "":  # Ignore whitespace.
                tokens.append(char)  # Append punctuation or other non-alphanumeric characters.
    if current_word != "":
        tokens.append(current_word)  # Append any remaining word.
    return tokens

# Example usage
sentence = "Generative AI is fascinating!"
tokens = simple_tokenize(sentence)
print(tokens)

['Generative', 'AI', 'is', 'fascinating', '!']


In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer
from nltk.corpus import stopwords

# Sample text for tokenization
text = """
Natural Language Processing is a fascinating field! It combines linguistics, 
computer science, and artificial intelligence. NLP helps computers understand 
human language. Isn't that amazing?
"""

print("📝 Original Text:")
print(text.strip())

# Sentence Tokenization
sentences = sent_tokenize(text)
print(f"\n📄 Sentence Tokenization ({len(sentences)} sentences):")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence.strip()}")

# Word Tokenization
words = word_tokenize(text)
print(f"\n🔤 Word Tokenization ({len(words)} tokens):")
print(words)

# Custom tokenization: \w+ keeps runs of letters, digits, and underscores and drops punctuation
# (so "isn't" is split into "isn" and "t")
regexp_tokenizer = RegexpTokenizer(r'\w+')
clean_words = regexp_tokenizer.tokenize(text.lower())
print(f"\n🧹 Clean Word Tokens ({len(clean_words)} tokens):")
print(clean_words)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in clean_words if word not in stop_words]
print(f"\n🚫 After Removing Stopwords ({len(filtered_words)} tokens):")
print(filtered_words)

# Show what stopwords were removed
removed_stopwords = [word for word in clean_words if word in stop_words]
print(f"\n🗑️ Removed Stopwords ({len(removed_stopwords)} tokens):")
print(removed_stopwords)

## 🛠️ Hands-On Practice: NLTK Stemming & Lemmatization

Stemming is a rule-based process that truncates words by removing common prefixes or suffixes. It’s quick and computationally simple, making it popular for tasks like document classification and search engine indexing.

![stem](img/stem.png) 

Lemmatization takes a more sophisticated route, mapping words to their base or dictionary form (a lemma). Unlike stemming, lemmatization typically requires knowledge of a word’s part of speech and may rely on morphological analyzers or lexical databases.

![lemma](img/lemma.png)
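
Because lemmatization depends on part of speech, a common pattern is to translate the tags produced by a POS tagger into the single-letter codes that WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a minimal sketch of that mapping; the name `penn_to_wordnet` is hypothetical and the rules cover only the main Penn Treebank tag families.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if penn_tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if penn_tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun

for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With NLTK, the returned letter can be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair returned by `nltk.pos_tag`.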

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Initialize stemmers and lemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Sample words for demonstration
words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying', 
         'cries', 'crying', 'better', 'best', 'feet', 'geese']

print("🔤 Stemming vs Lemmatization Comparison:")
print("="*70)
print(f"{'Original':<12} {'Porter':<12} {'Snowball':<12} {'Lemmatizer':<12}")
print("="*70)

for word in words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    lemma = lemmatizer.lemmatize(word)
    
    print(f"{word:<12} {porter_stem:<12} {snowball_stem:<12} {lemma:<12}")

# Let's also try lemmatization with POS tags for better results
print("\n🎯 Lemmatization with POS Tags (More Accurate):")
print("="*50)

pos_words = [('running', 'v'), ('better', 'a'), ('studies', 'v'), ('studies', 'n')]

for word, pos in pos_words:
    # Convert POS tag to WordNet format
    wordnet_pos = {'n': wordnet.NOUN, 'v': wordnet.VERB, 'a': wordnet.ADJ}.get(pos, wordnet.NOUN)
    lemma = lemmatizer.lemmatize(word, pos=wordnet_pos)
    print(f"{word} ({pos}) → {lemma}")

# Practical example with a sentence
sentence = "The runners were running and studying better strategies"
tokens = word_tokenize(sentence.lower())

print(f"\n📝 Practical Example:")
print(f"Original: {sentence}")
print(f"Tokens: {tokens}")
print(f"Stemmed (Porter): {[porter.stem(token) for token in tokens]}")
print(f"Lemmatized: {[lemmatizer.lemmatize(token) for token in tokens]}")

---

# 4️⃣  spaCy Library Deep Dive

## 📖 Theory Overview

**spaCy** is an industrial-strength NLP library designed for production use. It is generally faster than NLTK and ships with pre-trained models.

### 🚀 spaCy Key Features:
- **Fast & Efficient**: Written in Cython
- **Pre-trained Models**: Ready-to-use language models
- **Advanced NER**: Named Entity Recognition
- **Dependency Parsing**: Understanding sentence structure
- **Built-in Visualizations**: displaCy for visualization

### 📊 NLTK vs spaCy Comparison:

| Feature | NLTK | spaCy |
|---------|------|-------|
| Speed | Slower | Faster |
| Learning Curve | Steeper | Easier |
| Models | Manual setup | Pre-trained |
| Production Ready | Research-focused | Production-ready |

### 📚 Further Reading:
- [spaCy Official Documentation](https://spacy.io/)
- [spaCy 101 Guide](https://spacy.io/usage/spacy-101)

---

## 🛠️ Hands-On Practice: spaCy Basics

In [None]:
# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = """
Apple Inc. is planning to open a new store in New York City next year. 
Tim Cook, the CEO, announced this during a conference in San Francisco. 
The company's stock price increased by 5% after the announcement.
"""

# Process the text
doc = nlp(text)

print("📝 Original Text:")
print(text.strip())

# Basic token information
print("\n🔤 Token Analysis:")
print("="*80)
print(f"{'Token':<15} {'Lemma':<15} {'POS':<10} {'Tag':<10} {'Stop?':<8} {'Alpha?'}")
print("="*80)

for token in doc:
    if not token.is_space:  # Skip whitespace tokens
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {token.tag_:<10} {token.is_stop!s:<8} {token.is_alpha}")

# Sentence segmentation
print(f"\n📄 Sentences ({len(list(doc.sents))}):") 
for i, sent in enumerate(doc.sents, 1):
    print(f"{i}. {sent.text.strip()}")

# Named Entity Recognition (NER)
print(f"\n🏷️ Named Entities ({len(doc.ents)}):")
print("="*50)
for ent in doc.ents:
    print(f"{ent.text:<20} → {ent.label_:<10} ({spacy.explain(ent.label_)})")

# Noun phrases
print(f"\n📝 Noun Phrases ({len(list(doc.noun_chunks))}):")
for chunk in doc.noun_chunks:
    print(f"- {chunk.text} (root: {chunk.root.text})")

---

# 5️⃣ Part-of-Speech (POS) Tagging

## 📖 Theory Overview

**POS Tagging** identifies the grammatical category of each word (noun, verb, adjective, etc.). This is crucial for understanding sentence structure and meaning.

### 🏷️ Common POS Tags:

```
Sentence: "The quick brown fox jumps over the lazy dog."

The     → DT  (Determiner)
quick   → JJ  (Adjective)
brown   → JJ  (Adjective) 
fox     → NN  (Noun)
jumps   → VBZ (Verb, 3rd person singular)
over    → IN  (Preposition)
the     → DT  (Determiner)
lazy    → JJ  (Adjective)
dog     → NN  (Noun)
.       → .   (Punctuation)
```

![POS-Tag](img/POS-Tagging.webp)

### 🎯 Why POS Tagging Matters:
- **Disambiguation**: "book" as a noun ("read a book") vs "book" as a verb ("book a flight")
- **Information Extraction**: Finding all nouns (entities)
- **Grammar Checking**: Ensuring proper sentence structure
- **Machine Translation**: Understanding syntax

### 📚 Further Reading:
- [POS-Tagging](https://www.geeksforgeeks.org/nlp/nlp-part-of-speech-default-tagging/)
- [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)



## 🛠️ Hands-On Practice: POS Tagging Comparison

In [None]:
from collections import Counter

# Sample sentences with different complexities
sentences = [
    "The cat sat on the mat.",
    "I can can a can into a can.",  # Ambiguous 'can'
    "The bank near the river bank offers great banking services.",  # Multiple 'bank'
    "Flying planes can be dangerous.",  # Ambiguous structure
]

for i, sentence in enumerate(sentences, 1):
    print(f"\n{'='*60}")
    print(f"Sentence {i}: {sentence}")
    print(f"{'='*60}")
    
    # NLTK POS Tagging
    nltk_tokens = word_tokenize(sentence)
    nltk_pos = nltk.pos_tag(nltk_tokens)
    
    print("\n🔤 NLTK POS Tagging:")
    for word, pos in nltk_pos:
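        # nltk.help.upenn_tagset() only prints and returns None, so use spacy.explain,
        # which also covers Penn Treebank tags
        print(f"  {word:<12} → {pos:<6} ({spacy.explain(pos) or 'N/A'})")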
    
    # spaCy POS Tagging
    spacy_doc = nlp(sentence)
    
    print("\n🚀 spaCy POS Tagging:")
    for token in spacy_doc:
        if not token.is_space:
            print(f"  {token.text:<12} → {token.pos_:<6} ({spacy.explain(token.pos_) or 'N/A'})")

# POS Distribution Analysis
print("\n\n📊 POS Tag Distribution Analysis:")
print("="*50)

sample_text = """
Natural language processing is a subfield of linguistics, computer science, 
and artificial intelligence concerned with the interactions between computers 
and human language, in particular how to program computers to process and 
analyze large amounts of natural language data.
"""

doc = nlp(sample_text)
pos_counts = Counter([token.pos_ for token in doc if not token.is_space])

print(f"Text: {sample_text.strip()[:100]}...")
print(f"\nPOS Distribution:")
for pos, count in pos_counts.most_common():
    percentage = (count / sum(pos_counts.values())) * 100
    print(f"  {pos:<10} {count:>3} tokens ({percentage:>5.1f}%) - {spacy.explain(pos)}")

---

# 6️⃣ Building a Complete Text Preprocessing Pipeline

## 📖 Theory Overview

A **preprocessing pipeline** chains multiple text cleaning steps together. This ensures consistent, reproducible text processing across your NLP projects.

### 🔄 Pipeline Design Pattern:

```python
def preprocessing_pipeline(text):
    text = clean_text(text)
    text = tokenize(text) 
    text = remove_stopwords(text)
    text = lemmatize(text)
    return text
```

### ⚙️ Pipeline Benefits:
- **Consistency**: Same processing for all texts
- **Modularity**: Easy to add/remove steps
- **Debugging**: Can inspect each step
- **Scalability**: Can process large datasets
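
To make the modularity and debugging benefits concrete, here is a minimal sketch that treats the pipeline as a plain list of step functions. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the step functions, and the `run_steps` helper are not part of NLTK or spaCy.

```python
import re

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"i", "am", "the", "a", "is", "and", "it"}

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

def tokenize(text):
    return text.split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def run_steps(text, steps, verbose=False):
    """Apply each step in order, optionally printing intermediate results."""
    result = text
    for step in steps:
        result = step(result)
        if verbose:
            print(f"{step.__name__:<18} -> {result!r}")
    return result

full_pipeline = [lowercase, strip_punctuation, tokenize, remove_stopwords]
tokens = run_steps("The pipeline is simple, and it works!", full_pipeline, verbose=True)
```

Adding or removing a step is a one-element change to the list, and `verbose=True` exposes the output of every stage, which is usually the quickest way to find where a pipeline goes wrong.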

---

In [None]:
class TextPreprocessor:
    """Complete text preprocessing pipeline"""
    
    def __init__(self, use_spacy=True, remove_stopwords=True, lemmatize=True):
        """Initialize the preprocessor with configuration options"""
        self.use_spacy = use_spacy
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        
        # Initialize tools
        if use_spacy:
            self.nlp = spacy.load('en_core_web_sm')
        else:
            self.stemmer = PorterStemmer()
            self.lemmatizer = WordNetLemmatizer()
            self.stop_words = set(stopwords.words('english'))
    
    def clean_text(self, text):
        """Basic text cleaning"""
        # Convert to lowercase
        text = text.lower()
        
        # Remove URLs, emails, phone numbers
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'\S+@\S+', '', text)
        text = re.sub(r'\+?\d[\d\s\-\(\)]+\d', '', text)
        
        # Remove social media mentions and hashtags
        text = re.sub(r'[@#]\w+', '', text)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text.strip())
        
        return text
    
    def process_with_spacy(self, text):
        """Process text using spaCy"""
        doc = self.nlp(text)
        
        processed_tokens = []
        for token in doc:
            # Skip spaces, punctuation, and stopwords if configured
            if token.is_space or token.is_punct:
                continue
            if self.remove_stopwords and token.is_stop:
                continue
            
            # Use lemma if configured, otherwise use original token
            if self.lemmatize:
                processed_tokens.append(token.lemma_)
            else:
                processed_tokens.append(token.text)
        
        return processed_tokens
    
    def process_with_nltk(self, text):
        """Process text using NLTK"""
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove punctuation and convert to lowercase
        tokens = [token.lower() for token in tokens if token.isalpha()]
        
        # Remove stopwords if configured
        if self.remove_stopwords:
            tokens = [token for token in tokens if token not in self.stop_words]
        
        # Lemmatize if configured (self.stemmer is created in __init__ but not applied here)
        if self.lemmatize:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return tokens
    
    def preprocess(self, text):
        """Main preprocessing function"""
        # Step 1: Clean text
        cleaned_text = self.clean_text(text)
        
        # Step 2: Process tokens
        if self.use_spacy:
            tokens = self.process_with_spacy(cleaned_text)
        else:
            tokens = self.process_with_nltk(cleaned_text)
        
        return tokens
    
    def preprocess_documents(self, documents, show_progress=True):
        """Process multiple documents"""
        processed_docs = []
        
        for i, doc in enumerate(documents):
            if show_progress and i % 10 == 0:
                print(f"Processing document {i+1}/{len(documents)}")
            
            processed_tokens = self.preprocess(doc)
            processed_docs.append(processed_tokens)
        
        return processed_docs

# Demo the preprocessing pipeline
print("🔧 Text Preprocessing Pipeline Demo")
print("="*50)

# Sample documents
sample_docs = [
    "Hello World! I'm learning NLP and it's AMAZING!!! Visit https://example.com",
    "Natural Language Processing helps computers understand human language.",
    "Machine Learning and AI are transforming technology @TechExpert #AI",
    "The quick brown fox jumps over the lazy dog. Email: test@email.com"
]

# Initialize different preprocessors
preprocessors = {
    "spaCy (Full)": TextPreprocessor(use_spacy=True, remove_stopwords=True, lemmatize=True),
    "spaCy (No Stopwords)": TextPreprocessor(use_spacy=True, remove_stopwords=False, lemmatize=True),
    "NLTK (Full)": TextPreprocessor(use_spacy=False, remove_stopwords=True, lemmatize=True),
}

# Compare different preprocessing approaches
for doc_idx, doc in enumerate(sample_docs):
    print(f"\n📄 Document {doc_idx + 1}:")
    print(f"Original: {doc}")
    
    for name, preprocessor in preprocessors.items():
        tokens = preprocessor.preprocess(doc)
        print(f"{name:>20}: {tokens}")

# Performance comparison
print(f"\n⏱️ Performance Comparison:")
print("="*40)

import time

large_text = " ".join(sample_docs * 100)  # Repeat the documents, keeping a space between copies

for name, preprocessor in preprocessors.items():
    start_time = time.perf_counter()  # Monotonic, high-resolution timer for short intervals
    result = preprocessor.preprocess(large_text)
    end_time = time.perf_counter()
    
    processing_time = (end_time - start_time) * 1000  # Convert to milliseconds
    print(f"{name:>15}: {processing_time:.2f}ms ({len(result)} tokens)")

---

# 7️⃣ Advanced NLP Techniques

Beyond basic preprocessing, several techniques convert words into numbers (numerical features) that machine learning models can work with. Choosing a good representation can substantially improve model performance:

### 1. Bag of Words (BoW)

Bag of Words is a simple, popular technique for converting text into numerical features: each document is represented by how many times each vocabulary word appears in it, ignoring word order.

![bow](img/bow.png)
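
The counting idea can be sketched without any library. This is a simplified illustration: the helper names `build_vocabulary` and `bag_of_words` are made up for this example, and tokens are split on whitespace only.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the sorted set of lowercase, whitespace-separated words."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["I love cats", "I hate dogs and I hate rain"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Every document becomes a vector of the same length (the vocabulary size), which is what lets a classifier compare documents of different lengths.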

### 2. TF-IDF (Term Frequency–Inverse Document Frequency)
This is a technique that scores each word in a document based on how often it appears in that document (TF) and how rare it is across all documents (IDF). It improves Bag of Words by reducing the weight of common words (like “the”, “is”) and highlighting more informative, distinctive terms.

![tf-idf](img/tf-idf.png)
![tf-idf2](img/tf-idf2.png)
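
As a rough sketch, the textbook formulation can be written in a few lines, assuming TF is the word count divided by document length and IDF is the natural log of (number of documents / number of documents containing the word). The function name `tf_idf` is hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
for doc, weights in zip(docs, tf_idf(docs)):
    print(doc, {word: round(score, 3) for word, score in weights.items()})
```

In this toy corpus "the" appears in every document, so its IDF is ln(1) = 0 and it contributes nothing, while a word unique to one document (such as "sat") receives the highest weight.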

### 3. N-grams
An n-gram is a contiguous sequence of n words from a text, such as bigrams (2-grams) or trigrams (3-grams), capturing word combinations. It helps by preserving some context and word order, improving models’ ability to understand phrases and local dependencies.

![ngram](img/ngram.png)
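
Generating n-grams amounts to sliding a window of size n over the token list. A minimal sketch follows; the helper name `make_ngrams` is an assumption for this example, and `nltk.util.ngrams` offers the same functionality.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

A text with k tokens yields k - n + 1 n-grams, so the function returns an empty list when the text is shorter than n.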



Other techniques worth knowing:

- **Word Embeddings**: Dense vector representations
- **Dependency Parsing**: Understanding sentence structure

### 📚 Further Reading:
- [Bag of Words](https://www.geeksforgeeks.org/nlp/bag-of-words-bow-model-in-nlp/)
- [N-grams Explained](https://web.stanford.edu/~jurafsky/slp3/3.pdf)
- [TF-IDF Tutorial](https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/)

---

## 🛠️ Hands-On Practice: BoW, N-grams and TF-IDF

In [None]:
# BoW Implementation 

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love cats", "I hate dogs"]
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # Adjusted pattern to include single characters
bow_matrix = vectorizer.fit_transform(sentences)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Vectors:\n", bow_matrix.toarray())

In [None]:
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from collections import Counter
import pandas as pd

# Sample corpus for analysis
corpus = [
    "Natural language processing is amazing and useful for text analysis",
    "Machine learning algorithms can process natural language effectively", 
    "Text analysis and natural language understanding are key AI technologies",
    "Deep learning models excel at natural language processing tasks",
    "Natural language generation is an exciting field in artificial intelligence"
]

print("📚 Sample Corpus:")
for i, doc in enumerate(corpus, 1):
    print(f"{i}. {doc}")

# N-grams Analysis
print("\n\n🔗 N-grams Analysis:")
print("="*50)

# Combine all documents for n-gram analysis
all_text = " ".join(corpus).lower()
tokens = word_tokenize(all_text)
tokens = [token for token in tokens if token.isalpha()]

# Generate different n-grams
for n in [1, 2, 3]:
    n_grams = list(ngrams(tokens, n))
    n_gram_freq = Counter(n_grams)
    
    print(f"\n{n}-grams (Top 5):")
    for gram, freq in n_gram_freq.most_common(5):
        gram_str = " ".join(gram)
        print(f"  '{gram_str}': {freq}")

# TF-IDF Analysis
print("\n\n📊 TF-IDF Analysis:")
print("="*50)

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=20,  # Limit to top 20 features
    stop_words='english',
    ngram_range=(1, 2)  # Include both unigrams and bigrams
)

# Fit and transform corpus
tfidf_matrix = tfidf.fit_transform(corpus)
feature_names = tfidf.get_feature_names_out()

# Create DataFrame for better visualization
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)

print("TF-IDF Scores (Top features per document):")
for idx, row in tfidf_df.iterrows():
    top_features = row.nlargest(3)
    print(f"\n{idx}:")
    for feature, score in top_features.items():
        if score > 0:
            print(f"  {feature}: {score:.3f}")

# Visualize TF-IDF matrix
print("\n📈 TF-IDF Matrix Visualization:")
plt.figure(figsize=(12, 6))
sns.heatmap(tfidf_df.T, annot=True, fmt='.2f', cmap='YlOrRd', cbar=True)
plt.title('TF-IDF Scores Heatmap')
plt.xlabel('Documents')
plt.ylabel('Terms')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Compare with simple word counts
count_vectorizer = CountVectorizer(stop_words='english', max_features=10)
count_matrix = count_vectorizer.fit_transform(corpus)
count_df = pd.DataFrame(
    count_matrix.toarray(),
    columns=count_vectorizer.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)

print("\n📊 Word Count vs TF-IDF Comparison:")
print("Word Counts:")
print(count_df.sum().sort_values(ascending=False))

print("\nAverage TF-IDF Scores:")
common_features = set(count_df.columns) & set(tfidf_df.columns)
for feature in sorted(common_features):
    avg_tfidf = tfidf_df[feature].mean()
    total_count = count_df[feature].sum()
    print(f"{feature}: Count={total_count}, Avg TF-IDF={avg_tfidf:.3f}")

---

# 8️⃣ Mini Project: Sentiment Analysis Model

## 📖 Project Overview

Now we'll combine everything we've learned to build a complete **Sentiment Analysis** model! This project will:

1. **Load and explore** a sentiment dataset
2. **Preprocess** the text data using our pipeline
3. **Extract features** using TF-IDF
4. **Train multiple models** (Naive Bayes, SVM, Logistic Regression)
5. **Evaluate performance** and compare models
6. **Create a prediction interface** for new text

### 🎯 Learning Goals:
- Apply preprocessing techniques to real data
- Understand feature extraction for ML models
- Compare different classification algorithms
- Build an end-to-end NLP application

---

## 🛠️ Step 1: Dataset Creation and Exploration

In [None]:
# Create a sample sentiment dataset
# In practice, you might use datasets like IMDB, Amazon reviews, or Twitter sentiment

sample_data = {
    'text': [
        # Positive examples
        "I absolutely love this product! It's amazing and works perfectly.",
        "Great experience! Highly recommend to everyone.",
        "Fantastic quality and excellent customer service.",
        "This is the best purchase I've made this year!",
        "Outstanding performance and great value for money.",
        "I'm so happy with this product. It exceeded my expectations.",
        "Wonderful design and very user-friendly interface.",
        "Perfect! Exactly what I was looking for.",
        "Excellent build quality and fast delivery.",
        "Amazing features and works flawlessly.",
        
        # Negative examples
        "This product is terrible and doesn't work at all.",
        "Worst purchase ever! Complete waste of money.",
        "Poor quality and breaks easily. Very disappointed.",
        "Horrible customer service and defective product.",
        "I hate this product. It's completely useless.",
        "Terrible experience. Would not recommend to anyone.",
        "Very poor quality and overpriced. Avoid at all costs.",
        "Disappointing performance and many issues.",
        "Awful design and difficult to use.",
        "Complete failure. Doesn't meet basic requirements.",
        
        # Neutral examples
        "The product is okay, nothing special but works fine.",
        "Average quality, meets basic requirements but not impressive.",
        "It's an okay product for the price. Nothing more, nothing less.",
        "Decent build quality but could be better.",
        "Reasonable performance, though there are some minor issues.",
        "The product works as expected, no major complaints.",
        "Fair quality and standard features.",
        "Acceptable product with room for improvement.",
        "It does the job but isn't particularly exciting.",
        "Standard quality product with basic functionality."
    ],
    'sentiment': [
        # Positive labels (1)
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        # Negative labels (0)
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        # Neutral labels (2)
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2
    ]
}

# Create DataFrame
df = pd.DataFrame(sample_data)

# Map numeric labels to text for better understanding
sentiment_map = {0: 'Negative', 1: 'Positive', 2: 'Neutral'}
df['sentiment_text'] = df['sentiment'].map(sentiment_map)

print("📊 Dataset Overview:")
print(f"Total samples: {len(df)}")
print(f"Features: {list(df.columns)}")

print("\n📈 Sentiment Distribution:")
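# Fix the order so the colors below map to Negative=red, Positive=green, Neutral=gray
sentiment_counts = df['sentiment_text'].value_counts().reindex(['Negative', 'Positive', 'Neutral'])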
print(sentiment_counts)

# Visualize sentiment distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sentiment_counts.plot(kind='bar', color=['red', 'green', 'gray'])
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
plt.pie(sentiment_counts.values, labels=sentiment_counts.index, autopct='%1.1f%%', 
        colors=['red', 'green', 'gray'])
plt.title('Sentiment Distribution (%)')

plt.tight_layout()
plt.show()

# Show sample data
print("\n📄 Sample Data:")
for i, row in df.head(3).iterrows():
    print(f"\n{i+1}. Sentiment: {row['sentiment_text']}")
    print(f"   Text: {row['text']}")

# Text length analysis
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

print("\n📏 Text Statistics:")
print(f"Average text length: {df['text_length'].mean():.1f} characters")
print(f"Average word count: {df['word_count'].mean():.1f} words")

# Text length by sentiment
print("\n📊 Text Length by Sentiment:")
length_by_sentiment = df.groupby('sentiment_text')[['text_length', 'word_count']].mean()
print(length_by_sentiment)

## 🛠️ Step 2: Text Preprocessing Pipeline

In [None]:
# Apply our preprocessing pipeline to the dataset
print("🔧 Applying Text Preprocessing Pipeline...")

# Initialize preprocessor
preprocessor = TextPreprocessor(use_spacy=True, remove_stopwords=True, lemmatize=True)

# Preprocess all texts
processed_texts = []
for text in df['text']:
    tokens = preprocessor.preprocess(text)
    processed_text = ' '.join(tokens)  # Join tokens back to string
    processed_texts.append(processed_text)

df['processed_text'] = processed_texts

# Show preprocessing results
print("\n📄 Preprocessing Examples:")
for i in range(3):
    print(f"\n{i+1}. Original: {df.iloc[i]['text']}")
    print(f"   Processed: {df.iloc[i]['processed_text']}")

# Compare text lengths before and after preprocessing
df['processed_length'] = df['processed_text'].str.len()
df['processed_word_count'] = df['processed_text'].str.split().str.len()

print("\n📊 Preprocessing Impact:")
print(f"Average length reduction: {df['text_length'].mean() - df['processed_length'].mean():.1f} characters")
print(f"Average word reduction: {df['word_count'].mean() - df['processed_word_count'].mean():.1f} words")

# Visualize the impact
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.scatter(df['text_length'], df['processed_length'], alpha=0.6)
plt.plot([0, df['text_length'].max()], [0, df['text_length'].max()], 'r--', alpha=0.5)
plt.xlabel('Original Text Length')
plt.ylabel('Processed Text Length')
plt.title('Text Length: Before vs After Preprocessing')

plt.subplot(1, 2, 2)
plt.scatter(df['word_count'], df['processed_word_count'], alpha=0.6)
plt.plot([0, df['word_count'].max()], [0, df['word_count'].max()], 'r--', alpha=0.5)
plt.xlabel('Original Word Count')
plt.ylabel('Processed Word Count')
plt.title('Word Count: Before vs After Preprocessing')

plt.tight_layout()
plt.show()

# Create word cloud for each sentiment
print("\n☁️ Word Clouds by Sentiment:")

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['red', 'green', 'gray']

for i, (sentiment, color) in enumerate(zip(['Negative', 'Positive', 'Neutral'], colors)):
    # Combine all processed texts for this sentiment
    sentiment_texts = ' '.join(df[df['sentiment_text'] == sentiment]['processed_text'])
    
    if sentiment_texts.strip():  # Check if there's text to process
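        wordcloud = WordCloud(
            width=400, height=300,
            background_color='white',
            # WordCloud accepts a colormap name directly; plt.cm.get_cmap is
            # deprecated and no longer available in recent Matplotlib releases
            colormap=('Reds' if sentiment == 'Negative' else
                      'Greens' if sentiment == 'Positive' else 'Greys')
        ).generate(sentiment_texts)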
        
        axes[i].imshow(wordcloud, interpolation='bilinear')
        axes[i].set_title(f'{sentiment} Sentiment', fontsize=14, color=color)
        axes[i].axis('off')
    else:
        axes[i].text(0.5, 0.5, 'No text data', ha='center', va='center', 
                    transform=axes[i].transAxes)
        axes[i].set_title(f'{sentiment} Sentiment', fontsize=14, color=color)
        axes[i].axis('off')  # Hide ticks so the empty panel matches the others

plt.tight_layout()
plt.show()
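
Word clouds give a quick visual impression, but the underlying signal is just token frequency per class. A minimal sketch of the same idea as plain counts, assuming processed texts are whitespace-separated tokens; the `toy_processed` data and the `top_words` helper are made up for illustration and are not part of the notebook's dataset or `TextPreprocessor`:

```python
from collections import Counter

# Toy processed texts grouped by sentiment (hypothetical, for illustration only)
toy_processed = {
    "Positive": ["love product great quality", "great service love"],
    "Negative": ["terrible product waste money", "terrible service"],
}

def top_words(texts, n=3):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter(token for text in texts for token in text.split())
    return counts.most_common(n)

# Print the most frequent tokens for each sentiment class
for sentiment, texts in toy_processed.items():
    print(sentiment, top_words(texts))
```

On the real data, the same helper could be applied to `df[df['sentiment_text'] == sentiment]['processed_text']` to check which tokens dominate each cloud.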

## 🛠️ Step 3: Feature Extraction and Model Training

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Needed by the pipelines below
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

print("🤖 Machine Learning Model Training")
print("="*50)

# Prepare data
X = df['processed_text']
y = df['sentiment']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training sentiment distribution:\n{pd.Series(y_train).value_counts()}")

# Define models with pipelines
models = {
    'Naive Bayes': Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('classifier', MultinomialNB())
    ]),
    
    'Logistic Regression': Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    
    'SVM': Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('classifier', SVC(random_state=42, probability=True))
    ]),
    
    'Random Forest': Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ])
}

# Train and evaluate models
results = {}
trained_models = {}

print("\n🔄 Training Models...")
for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    
    results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
