# Module 00: Introduction to NLP and Text Processing

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 90 minutes  
**Prerequisites**: Deep Learning Fundamentals, Python proficiency

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the fundamental challenges in Natural Language Processing
2. Implement basic text processing pipelines
3. Apply traditional text representation methods (Bag of Words, TF-IDF)
4. Build a simple text classification model using traditional methods
5. Understand the evolution from traditional NLP to modern transformers

## What is Natural Language Processing?

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. Its goal is to enable computers to understand, interpret, and generate human language well enough to power applications such as chatbots, translation systems, and sentiment analysis tools.

### Why is NLP Challenging?

Unlike structured data, human language is:
- **Ambiguous**: "I saw her duck" - Did I see her bend down or her pet bird?
- **Context-dependent**: "That's sick!" can be positive or negative
- **Evolving**: New words and meanings emerge constantly
- **High-dimensional**: Vocabulary sizes can be 50,000+ words
- **Sequential**: Word order matters - "dog bites man" ≠ "man bites dog"

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sklearn for traditional ML
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

print("✓ All libraries imported successfully!")

In [None]:
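# Download required NLTK data
# These resources are needed for tokenization, stop word removal, and lemmatization
nltk.download('punkt', quiet=True)
# Assumption: newer NLTK releases look for 'punkt_tab' instead of 'punkt';
# requesting both is harmless on older versions
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)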

print("✓ NLTK data downloaded successfully!")

## 1. Basic Text Processing

Before we can apply any machine learning algorithm to text, we need to process it. Let's explore the fundamental steps.

### 1.1 Tokenization

**Tokenization** is the process of breaking text into individual units (tokens), typically words or sentences.

**Why it's important**: Machine learning models work with numerical data, so text must first be split into discrete units that can later be counted or mapped to numbers.
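
As a rough illustration of what a tokenizer has to decide, the sketch below compares plain whitespace splitting with a simple regular expression. The regex is an assumption chosen for this example, not the rule set NLTK uses; `word_tokenize` applies more elaborate rules, so its output will differ.

```python
import re

text = "NLP isn't easy! Is it?"

# Naive approach: split on whitespace, so punctuation stays glued to words
whitespace_tokens = text.split()

# Regex approach: words (allowing internal apostrophes) or single punctuation marks
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(whitespace_tokens)  # ['NLP', "isn't", 'easy!', 'Is', 'it?']
print(regex_tokens)       # ['NLP', "isn't", 'easy', '!', 'Is', 'it', '?']
```

With whitespace splitting, "easy!" and "easy" would count as two different words, which is one reason dedicated tokenizers are preferred.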

In [None]:
# Sample text for demonstration
sample_text = """
Natural Language Processing is fascinating! It enables computers to understand human language.
This technology powers chatbots, translation systems, and sentiment analysis tools.
Modern NLP has evolved dramatically with the advent of transformers.
"""

# Sentence tokenization - split into sentences
sentences = sent_tokenize(sample_text)
print("Number of sentences:", len(sentences))
print("\nSentences:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.strip()}")

In [None]:
# Word tokenization - split into words
words = word_tokenize(sample_text)
print("Number of tokens:", len(words))
print("\nFirst 20 tokens:")
print(words[:20])

**Observation**: Notice that punctuation marks are treated as separate tokens. This is intentional because punctuation can carry meaning ("I love this!" vs "I love this.").

### 1.2 Lowercasing and Cleaning

We typically convert all text to lowercase to avoid treating "The" and "the" as different words.

In [None]:
# Convert to lowercase
words_lower = [word.lower() for word in words]

# Keep only purely alphabetic tokens (this drops punctuation, numbers, and hyphenated words)
words_alpha = [word for word in words_lower if word.isalpha()]

print("Original token count:", len(words))
print("After removing punctuation:", len(words_alpha))
print("\nCleaned tokens:")
print(words_alpha[:20])

### 1.3 Stop Words Removal

**Stop words** are common words (like "the", "is", "and") that appear frequently but carry little meaningful information for many NLP tasks.

**When to remove them**: 
- Text classification: Usually beneficial
- Sentiment analysis: Sometimes harmful ("not good" becomes "good")
- Language modeling: Keep them (they're part of natural language)
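
The sentiment-analysis caveat can be sketched with a tiny, hypothetical stop list (`toy_stop_words` is made up for this example; NLTK's English list is much longer and also contains "not"):

```python
# Hypothetical mini stop list for illustration only
toy_stop_words = {"the", "is", "was", "a", "not"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

positive = remove_stop_words("The movie was good", toy_stop_words)
negative = remove_stop_words("The movie was not good", toy_stop_words)

print(positive)  # ['movie', 'good']
print(negative)  # ['movie', 'good']  <- the negation is gone

# One possible fix: keep negation words out of the stop list
negative_kept = remove_stop_words("The movie was not good", toy_stop_words - {"not"})
print(negative_kept)  # ['movie', 'not', 'good']
```

After stop word removal the two reviews become indistinguishable, so for sentiment tasks it is worth checking which words the stop list actually contains.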

In [None]:
# Get English stop words
stop_words = set(stopwords.words('english'))

print("Number of stop words in NLTK:", len(stop_words))
print("\nExample stop words:")
print(sorted(stop_words)[:20])  # sorted, because set ordering is arbitrary

In [None]:
# Remove stop words
words_no_stop = [word for word in words_alpha if word not in stop_words]

print("Tokens before stop word removal:", len(words_alpha))
print("Tokens after stop word removal:", len(words_no_stop))
print("\nRemaining meaningful words:")
print(words_no_stop)

### 1.4 Stemming and Lemmatization

Both reduce words to their base form, but differently:

**Stemming**: Crude chopping of word endings  
- "running" → "run"
- "studies" → "studi" (not a real word!)
- Fast but less accurate

**Lemmatization**: Uses a vocabulary and part-of-speech information to find the dictionary form (lemma)  
- "running" → "run" (when tagged as a verb)
- "studies" → "study"
- Slower but more accurate
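
The contrast between the two approaches can be sketched with toy versions of each. This is not how the Porter stemmer or the WordNet lemmatizer is implemented; `SUFFIXES` and `LEMMA_TABLE` are hypothetical stand-ins for their much larger rule sets and dictionaries.

```python
# Toy illustration only: real stemmers use many ordered rules,
# and real lemmatizers consult a full dictionary.
SUFFIXES = ("ing", "es", "s")      # hypothetical rule list
LEMMA_TABLE = {                    # hypothetical mini dictionary
    "studies": "study",
    "ran": "run",
    "better": "good",
}

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "ran", "better"]:
    print(f"{w:10}{toy_stem(w):10}{toy_lemmatize(w)}")
```

Suffix stripping is cheap but blind to irregular forms such as "ran", while the lookup handles them only as far as its dictionary reaches.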

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

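# Test words paired with a WordNet part-of-speech tag ('v' = verb, 'a' = adjective)
test_words = [('running', 'v'), ('runs', 'v'), ('ran', 'v'), ('studies', 'v'),
              ('studying', 'v'), ('better', 'a'), ('worse', 'a')]

print(f"{'Word':12}{'Stemmed':12}Lemmatized")
print("-" * 50)
for word, pos in test_words:
    stemmed = stemmer.stem(word)
    # The lemmatizer needs the right POS tag; with the wrong one it often returns the word unchanged
    lemmatized = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:12}{stemmed:12}{lemmatized}")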

**Exercise 1**: Create a text processing pipeline

Write a function that takes raw text and applies all preprocessing steps:
1. Tokenization
2. Lowercasing
3. Remove punctuation
4. Remove stop words
5. Lemmatization

In [None]:
def preprocess_text(text):
    """
    Apply standard NLP preprocessing pipeline to text.
    
    Parameters:
    -----------
    text : str
        Raw input text
        
    Returns:
    --------
    list : Cleaned and processed tokens
    """
    # YOUR CODE HERE
    pass

# Test your function
test_text = "The scientists are studying the effects of climate change on polar bears!"
processed = preprocess_text(test_text)
print("Processed tokens:", processed)

## 2. Text Representation: From Words to Numbers

Machine learning models require numerical input. How do we convert text to numbers?

### 2.1 Bag of Words (BoW)

**Bag of Words** represents text as a vector of word counts, ignoring grammar and word order.

**Advantages**: Simple, easy to understand  
**Disadvantages**: Loses word order, high dimensionality, no semantic meaning
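
To make the idea concrete, here is a minimal hand-rolled sketch of the same representation using only the standard library. It assumes whitespace tokenization on already-lowercased text, which is simpler than what `CountVectorizer` does.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Step 1: tokenize and build a sorted vocabulary over all documents
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Step 2: count each vocabulary word in each document
counts = [Counter(doc) for doc in tokenized]
bow = [[c[word] for word in vocab] for c in counts]

print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in bow:
    print(row)
# [1, 0, 1, 0, 1]
# [1, 1, 0, 1, 2]
```

Each document becomes a fixed-length vector whose length equals the vocabulary size, regardless of how long the document is.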

In [None]:
# Sample documents
documents = [
    "I love machine learning",
    "I love natural language processing",
    "Machine learning is part of AI",
    "Deep learning is a subset of machine learning"
]

# Create Bag of Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

print("Vocabulary size:", len(feature_names))
print("\nVocabulary:")
print(feature_names)
print("\nBag of Words matrix shape:", bow_matrix.shape)

In [None]:
# Visualize BoW as DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("Bag of Words representation:")
print(bow_df)

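**Observation**: Each column represents a word in our vocabulary, and each row represents a document. The values are word counts. Note that `CountVectorizer` lowercases the text and, with its default settings, ignores single-character tokens, so "I" and "a" do not appear in the vocabulary.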

### 2.2 TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** weights words by how important they are to a document relative to the entire corpus.

**Formula**: TF-IDF(word, document) = TF(word, document) × IDF(word, corpus)

- **TF (Term Frequency)**: How often does the word appear in this document?
- **IDF (Inverse Document Frequency)**: How rare is the word across all documents?

**Why it's better**: Words that appear in most documents get lower weights, while words concentrated in a few documents (and therefore potentially more informative) get higher weights.
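
The formula can be worked through by hand. The sketch below uses the textbook definition, raw count times `log(N / df)`, on lowercased, whitespace-split versions of the four sample documents. It is an assumption-laden simplification: scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization of each row, so its numbers will not match these.

```python
import math
from collections import Counter

docs = [
    "i love machine learning".split(),
    "i love natural language processing".split(),
    "machine learning is part of ai".split(),
    "deep learning is a subset of machine learning".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    """Textbook TF-IDF: raw count in the document times log(N / df)."""
    tf = doc.count(word)
    idf = math.log(n_docs / df[word])
    return tf * idf

# "love" occurs in 2 of 4 documents, "learning" in 3 of 4
print(round(tf_idf("love", docs[0]), 3))      # 0.693
print(round(tf_idf("learning", docs[0]), 3))  # 0.288
```

Within the first document both words occur once, yet "love" scores higher because it appears in fewer documents overall.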

In [None]:
# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Visualize TF-IDF
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("TF-IDF representation:")
print(tfidf_df.round(3))

In [None]:
# Visualize TF-IDF scores for first document
plt.figure(figsize=(12, 5))
doc1_scores = tfidf_df.iloc[0].sort_values(ascending=False)
doc1_scores[doc1_scores > 0].plot(kind='bar')
plt.title('TF-IDF Scores for Document 1: "I love machine learning"')
plt.xlabel('Word')
plt.ylabel('TF-IDF Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Exercise 2**: Compare BoW and TF-IDF

Given the following documents, create both BoW and TF-IDF representations and explain the differences:

```python
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]
```

Which words get higher TF-IDF scores and why?

In [None]:
# YOUR CODE HERE
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]

# Create BoW

# Create TF-IDF

# Compare and explain

## 3. Building a Simple Text Classifier

Let's put everything together and build a sentiment classifier using traditional methods.

In [None]:
# Create a small sentiment dataset
# In practice, you'd use datasets like IMDB or Twitter sentiment
texts = [
    "I love this product, it's amazing!",
    "Terrible experience, would not recommend",
    "Best purchase ever, highly satisfied",
    "Waste of money, very disappointing",
    "Excellent quality and fast shipping",
    "Poor customer service and defective item",
    "Absolutely fantastic, exceeded expectations",
    "Horrible quality, broke after one use",
    "Great value for money, very pleased",
    "Do not buy, complete garbage",
    "Outstanding product, worth every penny",
    "Regret buying this, total disappointment",
    "Superb quality, would buy again",
    "Awful product, returned immediately",
    "Impressive features, works perfectly",
    "Broke within days, very upset",
    "Highly recommend, exceptional quality",
    "Useless product, wasted my time",
    "Brilliant purchase, very happy",
    "Terrible quality, avoid at all costs"
]

# Labels: 1 for positive, 0 for negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

print(f"Dataset size: {len(texts)} reviews")
print(f"Positive reviews: {sum(labels)}")
print(f"Negative reviews: {len(labels) - sum(labels)}")

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

In [None]:
# Create TF-IDF features
vectorizer = TfidfVectorizer(max_features=100, ngram_range=(1, 2))

# Fit on training data and transform both train and test
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

In [None]:
# Train Naive Bayes classifier
# Multinomial Naive Bayes is a fast, commonly used baseline for text classification
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = classifier.predict(X_test_tfidf)

# Evaluate
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

In [None]:
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
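
The numbers in the classification report can be derived directly from a confusion matrix. The sketch below uses a made-up matrix (`toy_cm` is hypothetical, not the output of the classifier above) laid out the same way as scikit-learn's: rows are true labels, columns are predicted labels, in the order [Negative, Positive].

```python
# Hypothetical confusion matrix for illustration only
toy_cm = [[2, 1],   # true Negative: 2 predicted Negative, 1 predicted Positive
          [0, 3]]   # true Positive: 0 predicted Negative, 3 predicted Positive

tn, fp = toy_cm[0]
fn, tp = toy_cm[1]

# Metrics for the Positive class
precision = tp / (tp + fp)   # of everything predicted Positive, how much was right?
recall = tp / (tp + fn)      # of everything truly Positive, how much was found?
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

With a test set this small, a single misclassified review shifts every metric noticeably, so the scores should be read as illustrative rather than reliable.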

In [None]:
# Test on new examples
new_reviews = [
    "This is absolutely wonderful, best thing ever!",
    "Complete waste of money, very disappointed",
    "Not bad, but could be better"
]

# Transform and predict
new_reviews_tfidf = vectorizer.transform(new_reviews)
predictions = classifier.predict(new_reviews_tfidf)
probabilities = classifier.predict_proba(new_reviews_tfidf)

print("Predictions on new reviews:\n")
for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = max(prob) * 100
    print(f"Review: {review}")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.1f}%)\n")

**Exercise 3**: Feature importance analysis

Extract the top 10 features (words/n-grams) that are most indicative of positive and negative sentiments. Hint: Use the classifier's `feature_log_prob_` attribute.

In [None]:
# YOUR CODE HERE
# Get feature names
feature_names = vectorizer.get_feature_names_out()

# Extract top features for each class
# Hint: classifier.feature_log_prob_[0] = negative class
#       classifier.feature_log_prob_[1] = positive class

## 4. Limitations of Traditional NLP

While traditional methods work reasonably well, they have significant limitations:

### 4.1 Loss of Context and Word Order

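Bag of Words ignores word order, so "not" becomes just another count: "not good" and "good" differ by a single feature, and any reordering of the same words produces an identical vector.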
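
One partial remedy, already used in Section 3 via `ngram_range=(1, 2)`, is to add n-grams as features. The following standard-library sketch (the `ngrams` helper is written for this example) shows how bigrams keep local word order that unigrams discard:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the movie was not good at all".split()
b = "the movie was good not at all".split()

# Unigram counts ignore order, so the two sentences look identical
print(sorted(a) == sorted(b))          # True

# Bigrams keep local order, so "not good" and "good not" are distinct features
bigrams_a = set(ngrams(a, 2))
bigrams_b = set(ngrams(b, 2))
print(("not", "good") in bigrams_a)    # True
print(("not", "good") in bigrams_b)    # False
```

Bigrams only capture adjacent pairs, so longer-range dependencies are still lost, and the vocabulary grows quickly as n increases.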

In [None]:
# Demonstrate order independence
sentences = [
    "The movie was not good at all",
    "The movie was good not at all",  # Ungrammatical but same BoW
    "All good was the not movie at"   # Nonsense but same BoW
]

vec = CountVectorizer()
bow = vec.fit_transform(sentences)

print("All three sentences have identical BoW representations!")
print(pd.DataFrame(
    bow.toarray(),
    columns=vec.get_feature_names_out(),
    index=sentences
))

### 4.2 No Semantic Understanding

Traditional methods treat every word as an independent feature, so "cat" and "feline" (or "king" and "queen") are as unrelated as any other pair of words, even though they're semantically similar.
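
A quick way to quantify this is cosine similarity between count vectors. The sketch below is a minimal standard-library version (the `cosine` helper is written for this example and assumes whitespace tokenization):

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s1 = "the cat is sleeping".split()
s2 = "the feline is resting".split()

# All of the similarity comes from the shared function words "the" and "is"
print(cosine(s1, s2))                                       # 0.5

# With only the content words, the sentences look completely unrelated
print(cosine(["cat", "sleeping"], ["feline", "resting"]))   # 0.0
```

A representation that captured meaning would score these two sentences as close; count vectors cannot, because synonyms occupy separate dimensions.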

In [None]:
# Similar meaning but different representations
similar_sentences = [
    "The cat is sleeping",
    "The feline is resting"
]

vec = CountVectorizer()
bow = vec.fit_transform(similar_sentences)

print("Despite similar meanings, the only overlap is the function words 'the' and 'is':")
print(pd.DataFrame(
    bow.toarray(),
    columns=vec.get_feature_names_out(),
    index=similar_sentences
))

### 4.3 High Dimensionality and Sparsity

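With vocabularies of 50,000+ words, each document uses only a tiny fraction of them, so almost every entry in the feature matrix is zero (sparse). Storing and processing such matrices is wasteful unless sparse data structures are used.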

In [None]:
# Demonstrate sparsity
large_corpus = [
    "This is a simple example with many different words to demonstrate sparsity",
    "Another sentence with completely different vocabulary and terms",
    "Yet more text using unique words not seen before"
]

vec = CountVectorizer()
bow = vec.fit_transform(large_corpus)

# Calculate sparsity
total_elements = bow.shape[0] * bow.shape[1]
non_zero_elements = bow.nnz
sparsity = (1 - non_zero_elements / total_elements) * 100

print(f"Matrix shape: {bow.shape}")
print(f"Total elements: {total_elements}")
print(f"Non-zero elements: {non_zero_elements}")
print(f"Sparsity: {sparsity:.1f}%")

## 5. The Evolution to Modern NLP

To address these limitations, NLP has evolved through several paradigms:

### Historical Timeline:

1. **1990s-2000s**: Rule-based systems and statistical methods
   - Hand-crafted rules
   - Bag of Words, TF-IDF
   - N-grams and language models

2. **2013-2017**: Word Embeddings
   - Word2Vec (2013): Dense vector representations
   - GloVe (2014): Global vectors for word representation
   - FastText (2016): Subword embeddings
   - **Breakthrough**: Words with similar meanings have similar vectors!

3. **2014-2017**: Recurrent Neural Networks
   - LSTM (1997, popular 2014+): Handle sequences better
   - GRU (2014): Simpler alternative to LSTM
   - Seq2Seq (2014): Machine translation breakthrough
   - **Achievement**: Can model word order and context

4. **2015-2017**: Attention Mechanism
   - Attention (2015): Focus on relevant parts of input
   - Transformer (2017): "Attention Is All You Need"
   - **Revolution**: Parallel processing + long-range dependencies

5. **2018-Present**: Transfer Learning Era
   - BERT (2018): Bidirectional transformers
   - GPT (2018, 2019, 2020): Autoregressive generation
   - T5, RoBERTa, ALBERT, etc.
   - **Current**: LLMs (GPT-4, Claude, LLaMA) with billions of parameters

### What Makes Transformers Special?

1. **Self-Attention**: Understand relationships between all words in a sentence
2. **Parallel Processing**: Process entire sequences at once (vs sequential in RNNs)
3. **Transfer Learning**: Pre-train on massive data, fine-tune on specific tasks
4. **Contextual Embeddings**: Same word has different representations based on context

**Example**: In "The bank is closed" vs "The river bank is muddy", transformers understand "bank" differently!
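
The self-attention idea can be sketched in a few lines of plain Python. This is a deliberately simplified version: it sets queries, keys, and values all equal to the input (`Q = K = V = X`), whereas a real transformer layer learns separate projection matrices and uses multiple heads. The 2-dimensional embeddings are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    output, weights = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # New representation: attention-weighted average of all token vectors
        out = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(d)]
        weights.append(w)
        output.append(out)
    return output, weights

# Made-up embeddings for three tokens
X = [[1.0, 0.0],   # "river"
     [0.9, 0.1],   # "bank"
     [0.0, 1.0]]   # "closed"

contextual, attn = self_attention(X)
for row in attn:
    print([round(a, 2) for a in row])
```

Each output vector mixes in information from the other tokens, which is why the representation of "bank" can shift depending on its neighbors. All rows are computed independently, which is what allows the computation to run in parallel.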

**Exercise 4**: Research and compare

Research and write a brief comparison (in the cell below) between:
1. Traditional BoW/TF-IDF approaches
2. Word embeddings (Word2Vec, GloVe)
3. Transformer-based models (BERT, GPT)

Consider: representation, context handling, training requirements, performance.

**Your comparison here:**

...

## Summary

### Key Concepts Covered:

1. **Text Processing Pipeline**:
   - Tokenization (words and sentences)
   - Lowercasing and cleaning
   - Stop words removal
   - Stemming and lemmatization

2. **Text Representation**:
   - Bag of Words: Simple word counts
   - TF-IDF: Weighted by importance
   - Both lose word order and context

3. **Traditional Classification**:
   - TF-IDF + Naive Bayes
   - Works for simple tasks
   - Struggles with nuance and context

4. **Limitations and Evolution**:
   - No semantic understanding
   - Loss of word order
   - High dimensionality and sparsity
   - Evolution: Word Embeddings → RNNs → Transformers

### What's Next?

In **Module 01: Text Preprocessing**, we'll dive deeper into:
- Advanced tokenization strategies (BPE, WordPiece)
- Handling special cases (URLs, mentions, hashtags)
- Regular expressions for text cleaning
- Building production-ready preprocessing pipelines

### Additional Resources:

- **NLTK Book**: [nltk.org/book](https://www.nltk.org/book/)
- **Speech and Language Processing** by Jurafsky & Martin (free online)
- **Scikit-learn Text Processing**: [sklearn documentation](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)