# Lab 1: Introduction to Natural Language Processing

Welcome to NLP! In this lab, we'll explore the fundamentals of text processing and traditional NLP techniques that form the foundation for modern language understanding.

## Learning Objectives

By the end of this lab, you will be able to:
- Preprocess text data effectively
- Understand tokenization strategies
- Implement Bag of Words and TF-IDF
- Build text classifiers with traditional methods
- Perform basic NLP tasks (POS tagging, NER)
- Build a spam detection system

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

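# Download required NLTK data.
# Newer NLTK releases (3.9+) look for the *_tab / *_eng variants, so both the
# old and new resource names are requested here.
for resource in ['punkt', 'punkt_tab', 'stopwords',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                 'wordnet', 'maxent_ne_chunker', 'maxent_ne_chunker_tab', 'words']:
    try:
        nltk.download(resource, quiet=True)
    except Exception as exc:  # e.g. no network access
        print(f"Could not download {resource}: {exc}")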

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Text Preprocessing

Text preprocessing turns raw text into clean, consistent tokens that downstream models can work with. Common steps:

1. **Lowercasing**: Normalize case
2. **Tokenization**: Split into words/sentences
3. **Remove punctuation**: Clean special characters
4. **Remove stop words**: Filter common words
5. **Stemming/Lemmatization**: Reduce to root form

In [None]:
# Sample text
text = """Natural Language Processing (NLP) is a fascinating field of AI! 
It enables computers to understand, interpret, and generate human language. 
Modern NLP systems are incredibly powerful, aren't they?"""

print("Original text:")
print(text)
print("\n" + "="*70 + "\n")

# 1. Lowercasing
text_lower = text.lower()
print("1. Lowercased:")
print(text_lower)
print("\n" + "="*70 + "\n")

# 2. Tokenization
tokens = word_tokenize(text)
print("2. Tokens (words):")
print(tokens[:20])
print(f"Total tokens: {len(tokens)}")
print("\n" + "="*70 + "\n")

# 3. Remove punctuation
tokens_no_punct = [token for token in tokens if token.isalnum()]
print("3. After removing punctuation:")
print(tokens_no_punct)
print("\n" + "="*70 + "\n")

# 4. Remove stop words
stop_words = set(stopwords.words('english'))
tokens_no_stop = [token.lower() for token in tokens_no_punct 
                  if token.lower() not in stop_words]
print("4. After removing stop words:")
print(tokens_no_stop)
print("\n" + "="*70 + "\n")

# 5a. Stemming (aggressive)
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens_no_stop]
print("5a. Stemming:")
print(stems)
print("\n" + "="*70 + "\n")

# 5b. Lemmatization (dictionary-based; treats every token as a noun unless pos= is passed)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens_no_stop]
print("5b. Lemmatization:")
print(lemmas)

In [None]:
# Complete preprocessing function
def preprocess_text(text, remove_stopwords=True, use_lemmatization=True):
    """
    Complete text preprocessing pipeline.
    """
    # Lowercase
    text = text.lower()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Keep only alphanumeric tokens (drops punctuation, but also "n't" and hyphenated words)
    tokens = [token for token in tokens if token.isalnum()]
    
    # Remove stop words
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatization or stemming
    if use_lemmatization:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    else:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
    
    return tokens

# Test
processed = preprocess_text(text)
print("Fully processed text:")
print(processed)
print(f"\nOriginal tokens: {len(word_tokenize(text))}")
print(f"Processed tokens: {len(processed)}")
print(f"Reduction: {(1 - len(processed)/len(word_tokenize(text)))*100:.1f}%")

## Part 2: Bag of Words (BoW)

**Bag of Words** represents text as word occurrence counts, ignoring grammar and order.

### Process:
1. Create vocabulary from all documents
2. Count word occurrences in each document
3. Represent document as vector of counts

### Example:
- Doc1: "I love NLP"
- Doc2: "I love AI"
- Vocabulary: ["I", "love", "NLP", "AI"]
- Doc1 vector: [1, 1, 1, 0]
- Doc2 vector: [1, 1, 0, 1]
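
The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes whitespace tokenization, no lowercasing, and a vocabulary kept in first-seen order, and the helper name `to_vector` is made up for illustration.

```python
from collections import Counter

docs = ["I love NLP", "I love AI"]

# Step 1: build the vocabulary in first-seen order (whitespace tokenization only).
vocabulary = []
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 2-3: count each vocabulary word in a document and emit the counts as a vector.
def to_vector(doc, vocabulary):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['I', 'love', 'NLP', 'AI']
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)  # [1, 1, 1, 0] and [1, 1, 0, 1]
```

Note that a pipeline that removes stop words would drop "I", so its vocabulary and vectors would differ from this toy example.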

In [None]:
# Implement BoW from scratch
class BagOfWords:
    def __init__(self):
        self.vocabulary = {}
        self.vocab_list = []
    
    def fit(self, documents):
        """Build vocabulary from documents."""
        all_words = set()
        for doc in documents:
            tokens = preprocess_text(doc)
            all_words.update(tokens)
        
        self.vocab_list = sorted(list(all_words))
        self.vocabulary = {word: idx for idx, word in enumerate(self.vocab_list)}
    
    def transform(self, documents):
        """Convert documents to BoW vectors."""
        vectors = []
        
        for doc in documents:
            tokens = preprocess_text(doc)
            vector = np.zeros(len(self.vocabulary))
            
            for token in tokens:
                if token in self.vocabulary:
                    vector[self.vocabulary[token]] += 1
            
            vectors.append(vector)
        
        return np.array(vectors)
    
    def fit_transform(self, documents):
        """Fit and transform in one step."""
        self.fit(documents)
        return self.transform(documents)

# Test BoW
documents = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning",
    "Deep learning and machine learning are related"
]

bow = BagOfWords()
bow_vectors = bow.fit_transform(documents)

print("Vocabulary:")
print(bow.vocab_list)
print(f"\nVocabulary size: {len(bow.vocab_list)}")
print(f"\nBoW vectors shape: {bow_vectors.shape}")
print("\nBoW representation:")
bow_df = pd.DataFrame(bow_vectors, columns=bow.vocab_list)
print(bow_df)

In [None]:
# Visualize BoW
plt.figure(figsize=(12, 6))
sns.heatmap(bow_df, annot=True, fmt='.0f', cmap='YlOrRd', cbar_kws={'label': 'Count'})
plt.xlabel('Words')
plt.ylabel('Documents')
plt.title('Bag of Words Representation')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\nBoW Characteristics:")
print("✓ Simple and interpretable")
print("✓ Works well for short texts")
print("✗ Loses word order")
print("✗ High dimensionality")
print("✗ Doesn't capture semantics")

## Part 3: TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF weighs words by importance:
- Common words ("the", "is") get low weight
- Rare but meaningful words get high weight

### Formula:
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where:
- $\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}$
- $\text{IDF}(t) = \log\frac{\text{total documents}}{\text{documents containing } t}$
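
To make the formula concrete, here is a small worked sketch in plain Python. The corpus is made-up toy data, and the function names `tf`, `idf`, and `tf_idf` are chosen for illustration only; the code follows the unsmoothed definitions given above.

```python
import math

# Toy corpus, already tokenized (hypothetical data for illustration).
corpus = [
    ["machine", "learning", "rocks"],
    ["deep", "learning"],
    ["deep", "machine", "learning", "models"],
]

def tf(term, doc):
    # Count of the term in the document divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document; otherwise this divides by zero.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("learning", corpus[0], corpus))         # 0.0: the word is in every document
print(round(tf_idf("rocks", corpus[0], corpus), 3))  # 0.366, i.e. (1/3) * ln(3)
```

Library implementations often use variants of this formula. For example, scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and L2-normalizes each document vector, so its numbers will not match this sketch exactly.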

In [None]:
# Implement TF-IDF from scratch
class TFIDF:
    def __init__(self):
        self.vocabulary = {}
        self.vocab_list = []
        self.idf = None
    
    def fit(self, documents):
        """Calculate IDF values."""
        # Build vocabulary
        all_words = set()
        for doc in documents:
            tokens = preprocess_text(doc)
            all_words.update(tokens)
        
        self.vocab_list = sorted(list(all_words))
        self.vocabulary = {word: idx for idx, word in enumerate(self.vocab_list)}
        
        # Calculate IDF
        n_docs = len(documents)
        doc_counts = np.zeros(len(self.vocabulary))
        
        for doc in documents:
            tokens = set(preprocess_text(doc))
            for token in tokens:
                if token in self.vocabulary:
                    doc_counts[self.vocabulary[token]] += 1
        
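        # Every vocabulary word occurs in at least one document, so doc_counts >= 1
        # and the plain formula log(N / df) is safe (words in all documents get weight 0).
        self.idf = np.log(n_docs / doc_counts)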
    
    def transform(self, documents):
        """Convert documents to TF-IDF vectors."""
        vectors = []
        
        for doc in documents:
            tokens = preprocess_text(doc)
            
            # Calculate TF
            tf = np.zeros(len(self.vocabulary))
            for token in tokens:
                if token in self.vocabulary:
                    tf[self.vocabulary[token]] += 1
            
            if len(tokens) > 0:
                tf = tf / len(tokens)
            
            # Calculate TF-IDF
            tfidf = tf * self.idf
            vectors.append(tfidf)
        
        return np.array(vectors)
    
    def fit_transform(self, documents):
        """Fit and transform in one step."""
        self.fit(documents)
        return self.transform(documents)

# Test TF-IDF
tfidf = TFIDF()
tfidf_vectors = tfidf.fit_transform(documents)

print("TF-IDF vectors shape:", tfidf_vectors.shape)
print("\nTF-IDF representation:")
tfidf_df = pd.DataFrame(tfidf_vectors, columns=tfidf.vocab_list)
print(tfidf_df.round(3))

In [None]:
# Compare BoW vs TF-IDF
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.heatmap(bow_df, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[0])
axes[0].set_title('Bag of Words')
axes[0].set_xlabel('Words')
axes[0].set_ylabel('Documents')

sns.heatmap(tfidf_df, annot=True, fmt='.2f', cmap='YlGnBu', ax=axes[1])
axes[1].set_title('TF-IDF')
axes[1].set_xlabel('Words')
axes[1].set_ylabel('Documents')

plt.tight_layout()
plt.show()

print("\nTF-IDF Benefits:")
print("✓ Reduces weight of common words")
print("✓ Highlights distinctive words")
print("✓ Often better than raw counts for text classification")
print("✓ More informative than raw counts")

## Part 4: Text Classification - Spam Detection

Let's build a spam detector using traditional NLP methods.

In [None]:
# Create sample spam dataset
spam_data = [
    ("Win a free iPhone now! Click here!", "spam"),
    ("Meeting tomorrow at 3pm", "ham"),
    ("Congratulations! You've won $1000!", "spam"),
    ("Can you send me the report?", "ham"),
    ("FREE MONEY!!! Act now!!!", "spam"),
    ("Let's grab lunch tomorrow", "ham"),
    ("You are the lucky winner!", "spam"),
    ("Project deadline is Friday", "ham"),
    ("Click here for amazing deals!!!", "spam"),
    ("Thanks for your help", "ham"),
    ("Limited time offer! Buy now!", "spam"),
    ("See you at the conference", "ham"),
    ("Get rich quick scheme!", "spam"),
    ("Please review the document", "ham"),
    ("Claim your prize immediately!!!", "spam"),
    ("Coffee meeting at 10am?", "ham"),
]

# Add more examples
more_spam = [
    ("Earn money from home!", "spam"),
    ("Weekend plans?", "ham"),
    ("You've been selected for a special offer", "spam"),
    ("Meeting notes attached", "ham"),
]

spam_data.extend(more_spam)

texts = [text for text, label in spam_data]
labels = [label for text, label in spam_data]

print(f"Total messages: {len(texts)}")
print(f"Spam: {labels.count('spam')}")
print(f"Ham (not spam): {labels.count('ham')}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Create TF-IDF features using sklearn
vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Training set: {X_train_tfidf.shape}")
print(f"Test set: {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

In [None]:
# Train Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)

# Train Logistic Regression
lr_classifier = LogisticRegression(max_iter=1000)
lr_classifier.fit(X_train_tfidf, y_train)

# Predictions
nb_pred = nb_classifier.predict(X_test_tfidf)
lr_pred = lr_classifier.predict(X_test_tfidf)

# Evaluate
print("Naive Bayes Results:")
print(classification_report(y_test, nb_pred))

print("\nLogistic Regression Results:")
print(classification_report(y_test, lr_pred))

In [None]:
# Test on new messages
new_messages = [
    "Congratulations! You won a free vacation!",
    "Can we reschedule our meeting?",
    "Click here to claim your reward!!!",
    "Thanks for the update"
]

new_tfidf = vectorizer.transform(new_messages)
predictions = nb_classifier.predict(new_tfidf)
probabilities = nb_classifier.predict_proba(new_tfidf)

print("Predictions on new messages:\n")
for msg, pred, prob in zip(new_messages, predictions, probabilities):
    confidence = prob.max()  # probability of the predicted class
    print(f"Message: {msg}")
    print(f"Prediction: {pred.upper()}")
    print(f"Confidence: {confidence:.2%}")
    print()

## Part 5: POS Tagging and NER

**Part-of-Speech (POS) Tagging**: Identify word types (noun, verb, etc.)

**Named Entity Recognition (NER)**: Identify entities (person, organization, location)

In [None]:
# POS Tagging
sample_text = "Apple Inc. was founded by Steve Jobs in California. The company released the iPhone in 2007."

tokens = word_tokenize(sample_text)
pos_tags = pos_tag(tokens)

print("POS Tagging:")
for word, tag in pos_tags:
    print(f"{word:15s} -> {tag}")

print("\nCommon POS tags:")
print("NN: Noun, VB: Verb, JJ: Adjective")
print("RB: Adverb, DT: Determiner, IN: Preposition")

In [None]:
# Named Entity Recognition
named_entities = ne_chunk(pos_tags)

print("\nNamed Entities:")
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity_text = ' '.join(c[0] for c in chunk)
        entity_label = chunk.label()
        print(f"{entity_text:20s} -> {entity_label}")

print("\nEntity types:")
print("PERSON: People")
print("ORGANIZATION: Companies, agencies")
print("GPE: Geopolitical entities (countries, cities)")

## Key Takeaways

1. **Preprocessing** is crucial for NLP success
2. **Tokenization** converts text to processable units
3. **Stop word** removal reduces noise
4. **Stemming** is fast but crude; **lemmatization** is slower but returns real dictionary words
5. **Bag of Words** is simple but loses word order
6. **TF-IDF** weighs words by importance
7. **Traditional methods** work well for simple tasks
8. **POS tagging** identifies grammatical roles
9. **NER** extracts entities from text
10. **Feature engineering** matters for classic ML

## Exercises

1. **Custom Tokenizer**: Build a tokenizer that handles contractions
2. **N-grams**: Implement bigrams and trigrams for better context
3. **Sentiment Analysis**: Build a movie review classifier
4. **Language Detection**: Classify text by language
5. **Topic Modeling**: Use TF-IDF to find document topics
6. **Text Similarity**: Compute cosine similarity between documents
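
As a possible starting point for exercise 2, the sketch below builds n-grams from a list of tokens. The function name `ngrams` and the sample sentence are illustrative choices, not part of the lab code.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    # Zip the token list against copies of itself shifted by 1, 2, ..., n-1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = "natural language processing is fun".split()

print(ngrams(sample_tokens, 2))  # bigrams, starting with ('natural', 'language')
print(ngrams(sample_tokens, 3))  # trigrams, starting with ('natural', 'language', 'processing')
```

If you prefer to stay within scikit-learn, `CountVectorizer` and `TfidfVectorizer` accept an `ngram_range` parameter, for example `ngram_range=(1, 2)` for unigrams plus bigrams.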

## Next Steps

In Lab 2, we'll explore:
- Word embeddings (Word2Vec, GloVe)
- Dense representations
- Semantic relationships
- Neural language models

Great work! You now understand the foundations of NLP.