# NLP and Text Processing Crash Course for Data Science Assessments

**Last Updated:** 25 January 2026

This notebook covers essential Natural Language Processing (NLP) concepts commonly tested in data science interviews. We focus on practical text processing techniques, from basic preprocessing to TF-IDF and text classification.

## Table of Contents

1. [Introduction and Setup](#1-introduction-and-setup)
2. [Text Preprocessing Fundamentals](#2-text-preprocessing-fundamentals)
3. [Tokenisation](#3-tokenisation)
4. [Stopword Removal](#4-stopword-removal)
5. [Stemming and Lemmatisation](#5-stemming-and-lemmatisation)
6. [Bag of Words (BoW)](#6-bag-of-words-bow)
7. [TF-IDF (Term Frequency-Inverse Document Frequency)](#7-tf-idf-term-frequency-inverse-document-frequency)
8. [N-grams](#8-n-grams)
9. [Word Embeddings Concepts](#9-word-embeddings-concepts)
10. [Text Classification Pipeline](#10-text-classification-pipeline)
11. [Sentiment Analysis](#11-sentiment-analysis)
12. [Practice Questions](#12-practice-questions)
13. [Summary](#13-summary)

---

## 1. Introduction and Setup

NLP enables machines to understand, interpret, and generate human language. In data science interviews, you'll encounter questions about text preprocessing, feature extraction, and building text classification models.

**Key Libraries:**
- **NLTK**: Classic NLP library with comprehensive tools
- **scikit-learn**: TF-IDF vectorisation and text classification
- **re**: Regular expressions for text cleaning

In [None]:
import numpy as np
import pandas as pd
import re
from collections import Counter
from typing import List, Dict, Tuple

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

print("All imports successful!")

### Sample Text Data

We'll use sample text data throughout this notebook to demonstrate NLP concepts.

In [None]:
sample_documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing enables computers to understand text.",
    "Data science combines statistics, programming, and domain knowledge.",
    "Python is widely used for data analysis and machine learning."
]

sample_text = """
Natural Language Processing (NLP) is a field of artificial intelligence 
that focuses on the interaction between computers and humans using natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense 
of human languages in a valuable manner.
"""

print("Sample documents loaded:")
for i, doc in enumerate(sample_documents):
    print(f"  {i+1}. {doc}")

---

## 2. Text Preprocessing Fundamentals

Text preprocessing transforms raw text into a clean, standardised format suitable for analysis. It is often the most time-consuming step in an NLP pipeline, and mistakes made here carry through to every later stage.

**Common Preprocessing Steps:**

| Step | Description | Example |
|------|-------------|--------|
| Lowercasing | Convert to lowercase | "Hello" → "hello" |
| Punctuation Removal | Remove special characters | "Hello!" → "Hello" |
| Number Handling | Remove or normalise numbers | "2024" → "" or "NUM" |
| Whitespace Normalisation | Remove extra spaces | "hello  world" → "hello world" |
| HTML/URL Removal | Strip web artefacts | "<p>text</p>" → "text" |
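
The "NUM" option in the number-handling row can be sketched as a single substitution. This is a minimal illustration rather than part of the notebook's own preprocessing function; the helper name `normalise_numbers` and the `NUM` placeholder are assumptions made for the example.

```python
import re


def normalise_numbers(text: str, placeholder: str = "NUM") -> str:
    """Replace each number (digits with optional . or , groups) by a placeholder."""
    # \d+ matches the leading digits; (?:[.,]\d+)* absorbs parts like ".99" or ",000"
    return re.sub(r'\d+(?:[.,]\d+)*', placeholder, text)


print(normalise_numbers("Price rose from 99.99 to 120 in 2024"))
# Price rose from NUM to NUM in NUM
```

Mapping every number to one token keeps the signal that a number was present while stopping each distinct value from becoming its own vocabulary entry.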

In [None]:
def preprocess_text(
    text: str,
    lowercase: bool = True,
    remove_punctuation: bool = True,
    remove_numbers: bool = False,
    remove_urls: bool = True,
    remove_html: bool = True
) -> str:
    """Clean and preprocess text for NLP tasks.
    
    Args:
        text: Input text string.
        lowercase: Convert text to lowercase.
        remove_punctuation: Remove punctuation marks.
        remove_numbers: Remove numeric characters.
        remove_urls: Remove URLs from text.
        remove_html: Remove HTML tags.
    
    Returns:
        Cleaned text string.
    """
    if remove_html:
        text = re.sub(r'<[^>]+>', '', text)
    
    if remove_urls:
        text = re.sub(r'http\S+|www\.\S+', '', text)
    
    if lowercase:
        text = text.lower()
    
    if remove_punctuation:
        text = re.sub(r'[^\w\s]', '', text)
    
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

In [None]:
messy_text = "<p>Check out https://example.com for MORE info!!! Price: $99.99</p>"

print(f"Original: {messy_text}")
print(f"Cleaned:  {preprocess_text(messy_text)}")
print(f"Without numbers: {preprocess_text(messy_text, remove_numbers=True)}")

---

## 3. Tokenisation

**Tokenisation** is the process of breaking text into smaller units called tokens. Tokens can be words, sentences, or subwords.

**Types of Tokenisation:**
- **Word tokenisation**: Split text into words
- **Sentence tokenisation**: Split text into sentences
- **Subword tokenisation**: Split into subword units (used in modern models like BERT)
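
Subword tokenisation is easiest to picture with a toy example. The sketch below uses greedy longest-match-first splitting with a `##` continuation prefix, in the style of WordPiece; the vocabulary is a tiny hypothetical one invented for illustration, whereas real models learn theirs from a large corpus.

```python
from typing import List, Set


def greedy_subword_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example)
toy_vocab = {"token", "##isation", "##is", "un", "##happy"}

print(greedy_subword_tokenize("tokenisation", toy_vocab))  # ['token', '##isation']
print(greedy_subword_tokenize("unhappy", toy_vocab))       # ['un', '##happy']
print(greedy_subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because rare words are assembled from known pieces, the vocabulary can stay small without most words collapsing into an unknown token.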

In [None]:
def tokenize_text(
    text: str,
    method: str = 'word'
) -> List[str]:
    """Tokenise text into words or sentences.
    
    Args:
        text: Input text string.
        method: Tokenisation method ('word' or 'sentence').
    
    Returns:
        List of tokens.
    """
    if method == 'word':
        return word_tokenize(text)
    elif method == 'sentence':
        return sent_tokenize(text)
    else:
        raise ValueError(f"Unknown method: {method}")

In [None]:
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. It's a great deal!"

print("Word tokens:")
print(tokenize_text(text, 'word'))

print("\nSentence tokens:")
print(tokenize_text(text, 'sentence'))

### Simple Tokenisation with Regular Expressions

For interview questions, you may need to implement tokenisation without NLTK.

In [None]:
def simple_tokenize(text: str) -> List[str]:
    """Simple word tokenisation using regex.
    
    Args:
        text: Input text string.
    
    Returns:
        List of word tokens.
    """
    return re.findall(r'[a-zA-Z]+', text.lower())


print("Simple tokenisation:")
print(simple_tokenize("Hello, World! This is NLP 101."))

---

## 4. Stopword Removal

**Stopwords** are common words that carry little meaningful information (e.g., "the", "is", "at"). Removing them reduces noise and dimensionality.

**When to remove stopwords:**
- Bag of Words / TF-IDF models
- Topic modelling
- Keyword extraction

**When to keep stopwords:**
- Sentiment analysis ("not good" vs "good")
- Named entity recognition
- Language models / deep learning
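
The sentiment case is worth seeing in code: "not" is on NLTK's English stopword list, so blanket removal can flip the meaning of a review. The sketch below uses a small hand-written stopword set (an assumption, so the example runs without NLTK data) and a hypothetical `keep` argument that protects negation words.

```python
from typing import List, Optional, Set

# Hypothetical mini stopword list for illustration only
TOY_STOPWORDS: Set[str] = {"the", "is", "was", "a", "not", "this"}
NEGATIONS: Set[str] = {"not", "no", "never"}


def filter_tokens(
    tokens: List[str],
    stop_words: Set[str],
    keep: Optional[Set[str]] = None
) -> List[str]:
    """Drop stopwords, except any token explicitly listed in `keep`."""
    keep = keep or set()
    return [t for t in tokens if t not in stop_words or t in keep]


tokens = "the film was not good".split()
print(filter_tokens(tokens, TOY_STOPWORDS))             # ['film', 'good']
print(filter_tokens(tokens, TOY_STOPWORDS, NEGATIONS))  # ['film', 'not', 'good']
```

Without the exception list, a negative review is reduced to the same tokens as a positive one.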

In [None]:
english_stopwords = set(stopwords.words('english'))

print(f"Number of English stopwords: {len(english_stopwords)}")
print(f"\nSample stopwords: {sorted(english_stopwords)[:20]}")

In [None]:
def remove_stopwords(
    tokens: List[str],
    stop_words: set = None
) -> List[str]:
    """Remove stopwords from a list of tokens.
    
    Args:
        tokens: List of word tokens.
        stop_words: Set of stopwords to remove.
    
    Returns:
        Filtered list of tokens.
    """
    if stop_words is None:
        stop_words = set(stopwords.words('english'))
    
    return [token for token in tokens if token.lower() not in stop_words]

In [None]:
text = "The quick brown fox jumps over the lazy dog"
tokens = simple_tokenize(text)

print(f"Original tokens: {tokens}")
print(f"After stopword removal: {remove_stopwords(tokens)}")

---

## 5. Stemming and Lemmatisation

Both techniques reduce words to their base form, but they work differently:

| Technique | Method | Example | Result |
|-----------|--------|---------|--------|
| **Stemming** | Rule-based suffix stripping | "running", "runs", "ran" | "run", "run", "ran" |
| **Lemmatisation** | Dictionary-based, considers POS | "running", "runs", "ran" | "run", "run", "run" |

**Stemming**: Faster but can produce non-words ("studies" → "studi")

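**Lemmatisation**: More accurate but slower, and it depends on the part of speech. NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun (`pos='n'`) unless a tag is passed, so "running" and "ran" only reduce to "run" when called with `pos='v'`.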

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def stem_tokens(tokens: List[str]) -> List[str]:
    """Apply Porter stemming to tokens.
    
    Args:
        tokens: List of word tokens.
    
    Returns:
        List of stemmed tokens.
    """
    return [stemmer.stem(token) for token in tokens]


def lemmatize_tokens(tokens: List[str]) -> List[str]:
    """Apply lemmatisation to tokens.
    
    Args:
        tokens: List of word tokens.
    
    Returns:
        List of lemmatised tokens.
    """
    return [lemmatizer.lemmatize(token) for token in tokens]

In [None]:
words = ['running', 'runs', 'ran', 'studies', 'studying', 'better', 'wolves']

print(f"{'Original':<12} {'Stemmed':<12} {'Lemmatised':<12}")
print("-" * 36)
for word in words:
    print(f"{word:<12} {stemmer.stem(word):<12} {lemmatizer.lemmatize(word):<12}")
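
The difference between rule-based stripping and dictionary lookup can be sketched without NLTK. The suffix rules and lookup table below are deliberately tiny, hypothetical stand-ins: they are not the Porter algorithm or WordNet, only an illustration of why a stemmer can emit non-words while a lemmatiser returns dictionary forms.

```python
# Toy rules and table, invented for this example
TOY_SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]
TOY_LEMMA_TABLE = {"running": "run", "runs": "run", "ran": "run", "studies": "study"}


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule, keeping at least 3 characters of stem."""
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in a dictionary of known forms; leave unknown words alone."""
    return TOY_LEMMA_TABLE.get(word, word)


for word in ["running", "runs", "ran", "studies"]:
    print(f"{word:<10} {toy_stem(word):<10} {toy_lemmatize(word)}")
# running    runn       run
# runs       run        run
# ran        ran        run
# studies    studi      study
```

The stemmer never sees that "ran" is related to "run" because no suffix rule applies, whereas the lookup handles irregular forms at the cost of needing a dictionary.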

### Complete Text Preprocessing Pipeline

In [None]:
def preprocess_pipeline(
    text: str,
    remove_stops: bool = True,
    use_lemmatisation: bool = True
) -> List[str]:
    """Complete text preprocessing pipeline.
    
    Args:
        text: Input text string.
        remove_stops: Whether to remove stopwords.
        use_lemmatisation: Use lemmatisation (True) or stemming (False).
    
    Returns:
        List of processed tokens.
    """
    text = preprocess_text(text)
    tokens = simple_tokenize(text)
    
    if remove_stops:
        tokens = remove_stopwords(tokens)
    
    if use_lemmatisation:
        tokens = lemmatize_tokens(tokens)
    else:
        tokens = stem_tokens(tokens)
    
    return tokens

In [None]:
text = "The researchers are studying various machine learning algorithms!"

print(f"Original: {text}")
print(f"Processed: {preprocess_pipeline(text)}")

---

## 6. Bag of Words (BoW)

**Bag of Words** represents text as a vector of word counts, ignoring grammar and word order.

**How it works:**
1. Build vocabulary from all documents
2. For each document, count occurrences of each vocabulary word
3. Result: Document-term matrix

**Limitations:**
- Ignores word order ("dog bites man" = "man bites dog")
- High dimensionality with large vocabularies
- Sparse matrices
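
The word-order limitation is easy to verify directly. In this small sketch (the helper `bow_vector` and the fixed three-word vocabulary are assumptions for the example), two sentences with opposite meanings map to the same count vector.

```python
from collections import Counter
from typing import List


def bow_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]
vec_a = bow_vector("dog bites man", vocabulary)
vec_b = bow_vector("man bites dog", vocabulary)

print(vec_a)           # [1, 1, 1]
print(vec_b)           # [1, 1, 1]
print(vec_a == vec_b)  # True
```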

In [None]:
def create_bow_manually(documents: List[str]) -> Tuple[Dict[str, int], np.ndarray]:
    """Create Bag of Words representation manually.
    
    Args:
        documents: List of text documents.
    
    Returns:
        Tuple of (vocabulary dict, document-term matrix).
    """
    all_tokens = []
    doc_tokens = []
    
    for doc in documents:
        tokens = preprocess_pipeline(doc)
        doc_tokens.append(tokens)
        all_tokens.extend(tokens)
    
    vocabulary = {word: idx for idx, word in enumerate(sorted(set(all_tokens)))}
    
    matrix = np.zeros((len(documents), len(vocabulary)))
    for doc_idx, tokens in enumerate(doc_tokens):
        for token in tokens:
            if token in vocabulary:
                matrix[doc_idx, vocabulary[token]] += 1
    
    return vocabulary, matrix

In [None]:
docs = [
    "I love machine learning",
    "Machine learning is great",
    "I love data science"
]

vocab, bow_matrix = create_bow_manually(docs)

print("Vocabulary:")
print(vocab)
print("\nBoW Matrix:")
print(bow_matrix)

### Using scikit-learn's CountVectorizer

In [None]:
vectorizer = CountVectorizer(stop_words='english')
bow_sklearn = vectorizer.fit_transform(docs)

print("Feature names:")
print(vectorizer.get_feature_names_out())
print("\nBoW Matrix (sklearn):")
print(bow_sklearn.toarray())

---

## 7. TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** weighs terms by their importance in a document relative to the entire corpus. Words that appear frequently in one document but rarely across all documents get higher scores.

**Formula:**

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where:
- **TF (Term Frequency)**: $\frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$
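- **IDF (Inverse Document Frequency)**: $\log\frac{\text{total documents}}{\text{documents containing term } t}$. Implementations often adjust this: the manual version below adds 1 so that a term found in every document keeps a non-zero weight, and scikit-learn's `TfidfVectorizer` by default also smooths the document counts and L2-normalises each document vector, so its scores differ from the textbook formula.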

**Why TF-IDF?**
- Reduces weight of common words
- Increases weight of distinctive words
- Better for information retrieval and document similarity
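
As a quick worked example (numbers chosen purely for illustration, using the natural logarithm and the plain formula above): take a corpus of $N = 3$ documents in which "sat" appears in two documents and "mat" in only one, and let $d$ be the three-token document "cat sat mat". Then:

$$\text{TF-IDF}(\text{mat}, d) = \frac{1}{3} \times \ln\frac{3}{1} \approx 0.333 \times 1.099 \approx 0.366$$

$$\text{TF-IDF}(\text{sat}, d) = \frac{1}{3} \times \ln\frac{3}{2} \approx 0.333 \times 0.405 \approx 0.135$$

Both terms occur once in $d$, but "mat" scores higher because it is rarer across the corpus.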

In [None]:
def compute_tf(document: List[str]) -> Dict[str, float]:
    """Compute term frequency for a document.
    
    Args:
        document: List of tokens.
    
    Returns:
        Dictionary mapping terms to their frequencies.
    """
    word_counts = Counter(document)
    total_words = len(document)
    return {word: count / total_words for word, count in word_counts.items()}


def compute_idf(documents: List[List[str]]) -> Dict[str, float]:
    """Compute inverse document frequency for a corpus.
    
    Args:
        documents: List of tokenised documents.
    
    Returns:
        Dictionary mapping terms to their IDF scores.
    """
    n_docs = len(documents)
    all_words = set(word for doc in documents for word in doc)
    
    idf = {}
    for word in all_words:
        doc_count = sum(1 for doc in documents if word in doc)
        idf[word] = np.log(n_docs / doc_count) + 1
    
    return idf


def compute_tfidf(
    documents: List[str]
) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Compute TF-IDF for a corpus.
    
    Args:
        documents: List of text documents.
    
    Returns:
        Tuple of (list of TF-IDF dicts per document, IDF dict).
    """
    tokenized_docs = [preprocess_pipeline(doc) for doc in documents]
    idf = compute_idf(tokenized_docs)
    
    tfidf_docs = []
    for doc in tokenized_docs:
        tf = compute_tf(doc)
        tfidf = {word: tf[word] * idf[word] for word in tf}
        tfidf_docs.append(tfidf)
    
    return tfidf_docs, idf

In [None]:
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are pets"
]

tfidf_results, idf_scores = compute_tfidf(documents)

print("TF-IDF scores for each document:\n")
for i, tfidf in enumerate(tfidf_results):
    print(f"Document {i+1}: {documents[i]}")
    sorted_terms = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)
    for term, score in sorted_terms:
        print(f"  {term}: {score:.4f}")
    print()

### Using scikit-learn's TfidfVectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    index=[f'Doc {i+1}' for i in range(len(documents))]
)

print("TF-IDF Matrix (sklearn):")
print(tfidf_df.round(3))

---

## 8. N-grams

**N-grams** are contiguous sequences of n items from text. They capture local word order and context that BoW misses.

| Type | n | Example ("I love data science") |
|------|---|-------------------------------|
| Unigram | 1 | "I", "love", "data", "science" |
| Bigram | 2 | "I love", "love data", "data science" |
| Trigram | 3 | "I love data", "love data science" |

In [None]:
def generate_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Generate n-grams from a list of tokens.
    
    Args:
        tokens: List of word tokens.
        n: Size of n-gram.
    
    Returns:
        List of n-gram tuples.
    """
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

In [None]:
text = "I love machine learning and data science"
tokens = text.lower().split()

print(f"Unigrams: {generate_ngrams(tokens, 1)}")
print(f"Bigrams:  {generate_ngrams(tokens, 2)}")
print(f"Trigrams: {generate_ngrams(tokens, 3)}")

In [None]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
bigram_matrix = bigram_vectorizer.fit_transform(sample_documents)

print("Features with bigrams:")
print(bigram_vectorizer.get_feature_names_out())
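
To see concretely what n-grams recover that unigram counts lose, the sketch below compares two sentences containing the same words in a different order. The helper `word_ngrams` is a hypothetical stand-alone version of the slicing idea used above.

```python
from collections import Counter
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Split on whitespace and return all contiguous n-token windows."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a, b = "dog bites man", "man bites dog"

same_unigrams = Counter(word_ngrams(a, 1)) == Counter(word_ngrams(b, 1))
same_bigrams = Counter(word_ngrams(a, 2)) == Counter(word_ngrams(b, 2))

print(same_unigrams)  # True
print(same_bigrams)   # False
```

The price is a larger feature space: every extra value of n adds a new set of columns to the document-term matrix.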

---

## 9. Word Embeddings Concepts

**Word embeddings** represent words as dense vectors in a continuous vector space, capturing semantic relationships.

### Key Concepts

| Method | Description | Key Property |
|--------|-------------|-------------|
| **Word2Vec** | Neural network-based (CBOW, Skip-gram) | "king - man + woman ≈ queen" |
| **GloVe** | Global word co-occurrence statistics | Captures global corpus statistics |
| **FastText** | Subword embeddings | Handles out-of-vocabulary words |
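
The analogy property in the table can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (roughly "royalty" and "gender" axes) and are not learned embeddings; the point is only to show the arithmetic and the nearest-neighbour lookup.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, float]

# Hand-crafted toy vectors (an assumption, not trained embeddings)
toy_vectors: Dict[str, Vector] = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def analogy(a: str, b: str, c: str, vectors: Dict[str, Vector]) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With real embeddings the result is only approximately equal to the target vector, which is why the lookup uses nearest neighbour by cosine similarity rather than exact equality.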

### Word2Vec Architectures

1. **CBOW (Continuous Bag of Words)**: Predicts target word from context
2. **Skip-gram**: Predicts context words from target word
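
The two architectures differ mainly in how training examples are built from a sliding window. The sketch below generates those examples only (no network is trained); the function names and the `window` parameter are assumptions made for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: one (target, context word) pair per neighbour in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


tokens = ["i", "love", "data", "science"]

print(skipgram_pairs(tokens, window=1))
# [('i', 'love'), ('love', 'i'), ('love', 'data'),
#  ('data', 'love'), ('data', 'science'), ('science', 'data')]

print(cbow_examples(tokens, window=1))
# [(['love'], 'i'), (['i', 'data'], 'love'),
#  (['love', 'science'], 'data'), (['data'], 'science')]
```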

**Interview Tip**: Be prepared to explain the difference between sparse (BoW/TF-IDF) and dense (embeddings) representations.

In [None]:
print("Sparse vs Dense Representations:\n")
print("Sparse (BoW/TF-IDF):")
print("  - High dimensional (vocabulary size)")
print("  - Mostly zeros")
print("  - No semantic relationships")
print("  - Example: [0, 0, 1, 0, 0, 2, 0, 0, 1, 0, ...]")
print("\nDense (Word Embeddings):")
print("  - Low dimensional (typically 50-300)")
print("  - Almost all values non-zero")
print("  - Captures semantic similarity")
print("  - Example: [0.25, -0.13, 0.89, 0.02, -0.45, ...]")

### Cosine Similarity

Cosine similarity measures the angle between two vectors, commonly used to compare document or word similarity.

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$$

In [None]:
def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Compute cosine similarity between two vectors.
    
    Args:
        vec1: First vector.
        vec2: Second vector.
    
    Returns:
        Cosine similarity score between -1 and 1.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    if norm1 == 0 or norm2 == 0:
        return 0.0
    
    return dot_product / (norm1 * norm2)

In [None]:
docs_for_similarity = [
    "I love machine learning",
    "I enjoy deep learning",
    "The weather is sunny today"
]

tfidf_vec = TfidfVectorizer()
tfidf_sim = tfidf_vec.fit_transform(docs_for_similarity).toarray()

print("Document Similarity (Cosine):")
print(f"Doc 1 vs Doc 2: {cosine_similarity(tfidf_sim[0], tfidf_sim[1]):.4f}")
print(f"Doc 1 vs Doc 3: {cosine_similarity(tfidf_sim[0], tfidf_sim[2]):.4f}")
print(f"Doc 2 vs Doc 3: {cosine_similarity(tfidf_sim[1], tfidf_sim[2]):.4f}")

---

## 10. Text Classification Pipeline

Text classification assigns predefined categories to text documents. Common applications include spam detection, sentiment analysis, and topic classification.

**Pipeline Steps:**
1. Text preprocessing
2. Feature extraction (TF-IDF)
3. Model training
4. Evaluation
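
Interviewers sometimes ask what the model-training step does under the hood. The sketch below is a from-scratch multinomial Naive Bayes with Laplace smoothing, working on raw word counts rather than TF-IDF; the function names and the four-document training set are hypothetical, and it is meant as an aid to understanding rather than a replacement for scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def train_naive_bayes(texts: List[str], labels: List[int]) -> Dict:
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_doc_counts = Counter(labels)   # number of documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "class_doc_counts": class_doc_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
    }


def predict_naive_bayes(model: Dict, text: str, alpha: float = 1.0) -> int:
    """Return the class with the highest log prior + summed log likelihoods."""
    n_docs = sum(model["class_doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label, doc_count in model["class_doc_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(doc_count / n_docs)  # log prior
        for token in text.lower().split():
            if token in model["vocabulary"]:  # words unseen in training are skipped
                # Laplace smoothing: add alpha so zero counts do not give log(0)
                score += math.log((counts[token] + alpha) / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


toy_texts = ["great product", "love it", "terrible quality", "waste of money"]
toy_labels = [1, 1, 0, 0]
toy_model = train_naive_bayes(toy_texts, toy_labels)

print(predict_naive_bayes(toy_model, "great product"))   # 1
print(predict_naive_bayes(toy_model, "terrible waste"))  # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.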

In [None]:
texts = [
    "Great product, highly recommend!",
    "Terrible quality, waste of money",
    "Amazing experience, will buy again",
    "Disappointed with the purchase",
    "Excellent value for the price",
    "Poor customer service",
    "Love this item, perfect!",
    "Not worth the money",
    "Best purchase I've made",
    "Awful product, don't buy",
    "Fantastic quality and fast shipping",
    "Broken on arrival, very unhappy",
    "Exceeded my expectations",
    "Complete waste of time",
    "Highly satisfied customer",
    "Worst experience ever"
]

labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Stratify so that both classes appear in the small test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
def build_text_classifier(
    model_type: str = 'naive_bayes'
) -> Pipeline:
    """Build a text classification pipeline.
    
    Args:
        model_type: Type of classifier ('naive_bayes' or 'logistic').
    
    Returns:
        Scikit-learn Pipeline object.
    """
    if model_type == 'naive_bayes':
        classifier = MultinomialNB()
    elif model_type == 'logistic':
        classifier = LogisticRegression(max_iter=1000)
    else:
        raise ValueError(f"Unknown model type: {model_type}")
    
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
        ('classifier', classifier)
    ])
    
    return pipeline

In [None]:
nb_pipeline = build_text_classifier('naive_bayes')
nb_pipeline.fit(X_train, y_train)

y_pred = nb_pipeline.predict(X_test)

print("Naive Bayes Classification Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

In [None]:
new_reviews = [
    "This product is absolutely wonderful!",
    "Terrible experience, never again",
    "It's okay, nothing special"
]

predictions = nb_pipeline.predict(new_reviews)
print("Predictions on new reviews:")
for review, pred in zip(new_reviews, predictions):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"  '{review}' -> {sentiment}")

---

## 11. Sentiment Analysis

**Sentiment analysis** determines the emotional tone of text (positive, negative, neutral). It's a specific application of text classification.

**Approaches:**
1. **Lexicon-based**: Use predefined sentiment dictionaries
2. **Machine learning**: Train classifiers on labelled data
3. **Deep learning**: Use neural networks (LSTM, BERT)

In [None]:
positive_words = {'good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic',
                  'love', 'best', 'happy', 'perfect', 'recommend', 'satisfied'}
negative_words = {'bad', 'terrible', 'awful', 'horrible', 'worst', 'hate',
                  'disappointed', 'poor', 'waste', 'broken', 'unhappy', 'never'}


def lexicon_sentiment(text: str) -> Tuple[str, float]:
    """Simple lexicon-based sentiment analysis.
    
    Args:
        text: Input text string.
    
    Returns:
        Tuple of (sentiment label, score).
    """
    tokens = simple_tokenize(text)
    
    pos_count = sum(1 for t in tokens if t in positive_words)
    neg_count = sum(1 for t in tokens if t in negative_words)
    
    total = pos_count + neg_count
    if total == 0:
        return 'Neutral', 0.0
    
    score = (pos_count - neg_count) / total
    
    if score > 0:
        return 'Positive', score
    elif score < 0:
        return 'Negative', score
    else:
        return 'Neutral', score

In [None]:
test_sentences = [
    "This is a great product, I love it!",
    "Terrible experience, worst purchase ever",
    "The product arrived on time",
    "Good quality but poor customer service"
]

print("Lexicon-based Sentiment Analysis:\n")
for sentence in test_sentences:
    sentiment, score = lexicon_sentiment(sentence)
    print(f"Text: '{sentence}'")
    print(f"Sentiment: {sentiment} (score: {score:.2f})\n")
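
A common follow-up question is how a lexicon approach copes with negation such as "not good". One simple option, sketched below under loud assumptions, is to flip the polarity of the next sentiment word after a negator. The word sets and the function `negation_aware_score` are hypothetical, and the rule is crude: the negation stays active until the next sentiment word, however far away it is.

```python
import re

# Tiny illustrative lexicons (assumptions for this example)
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "never", "no"}


def negation_aware_score(text: str) -> int:
    """Sum word polarities, flipping the first sentiment word after a negator."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        polarity = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if polarity:
            score += -polarity if negate else polarity
            negate = False  # the negation is used up by this sentiment word
    return score


print(negation_aware_score("This is good"))      # 1
print(negation_aware_score("This is not good"))  # -1
print(negation_aware_score("Not bad at all"))    # 1
```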

---

## 12. Practice Questions

Test your understanding with these interview-style questions. Try to solve each question in the empty code cell before revealing the answer.

### Question 1: Word Frequency Counter

Write a function that takes a sentence and returns a list of tuples containing each word and its frequency (TF). Remove punctuation, convert to lowercase, and remove English stopwords.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
import re
from collections import Counter
from typing import List, Tuple

from nltk.corpus import stopwords

def word_frequencies(sentence: str) -> List[Tuple[str, float]]:
    """Compute word frequencies (TF) for a sentence.
    
    Args:
        sentence: Input text string.
    
    Returns:
        List of (word, frequency) tuples.
    """
    stop_words = set(stopwords.words('english'))
    
    tokens = re.findall(r'[a-z]+', sentence.lower())
    tokens = [t for t in tokens if t not in stop_words]
    
    counts = Counter(tokens)
    total = sum(counts.values())
    
    return [(word, count / total) for word, count in counts.items()]


# Test
result = word_frequencies("The quick brown fox jumps over the lazy dog")
print(result)
# Six tokens survive ('the' and 'over' are stopwords), each with frequency 1/6 ≈ 0.167:
# quick, brown, fox, jumps, lazy, dog
```

</details>


---

### Question 2: Document Similarity

Write a function that computes the cosine similarity between two documents using TF-IDF vectors.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def document_similarity(doc1: str, doc2: str) -> float:
    """Compute cosine similarity between two documents.
    
    Args:
        doc1: First document.
        doc2: Second document.
    
    Returns:
        Cosine similarity score.
    """
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([doc1, doc2]).toarray()
    
    dot_product = np.dot(tfidf[0], tfidf[1])
    norm1 = np.linalg.norm(tfidf[0])
    norm2 = np.linalg.norm(tfidf[1])
    
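    # Guard against a document with no recognised tokens (zero vector)
    if norm1 == 0 or norm2 == 0:
        return 0.0

    return dot_product / (norm1 * norm2)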


# Test
sim = document_similarity(
    "Machine learning is fascinating",
    "Deep learning is a subset of machine learning"
)
print(f"Similarity: {sim:.4f}")
```

</details>


---

### Question 3: N-gram Generator

Implement a function that generates all n-grams from a given text and returns them with their frequencies.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from collections import Counter
from typing import Dict, Tuple

def ngram_frequencies(text: str, n: int) -> Dict[Tuple[str, ...], int]:
    """Generate n-grams and their frequencies.
    
    Args:
        text: Input text string.
        n: Size of n-gram.
    
    Returns:
        Dictionary mapping n-grams to counts.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(ngrams))


# Test
text = "I love data science and I love machine learning"
print(ngram_frequencies(text, 2))
# {('i', 'love'): 2, ('love', 'data'): 1, ...}
```

</details>


---

### Question 4: Custom Stopwords

Write a function that removes stopwords from text but allows adding custom stopwords to the default list.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from nltk.corpus import stopwords
from typing import List, Optional

def remove_custom_stopwords(
    text: str,
    custom_stopwords: Optional[List[str]] = None
) -> str:
    """Remove stopwords including custom ones.
    
    Args:
        text: Input text string.
        custom_stopwords: Additional stopwords to remove.
    
    Returns:
        Text with stopwords removed.
    """
    stop_words = set(stopwords.words('english'))
    
    if custom_stopwords:
        stop_words.update(word.lower() for word in custom_stopwords)
    
    tokens = text.lower().split()
    filtered = [t for t in tokens if t not in stop_words]
    
    return ' '.join(filtered)


# Test
text = "The data science course is very interesting"
print(remove_custom_stopwords(text, ['data', 'course']))
# "science interesting"
```

</details>


---

### Question 5: Text Classification Pipeline

Build a complete text classification pipeline that preprocesses text, extracts TF-IDF features, and trains a Naive Bayes classifier.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from typing import List, Tuple

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def build_and_train_classifier(
    texts: List[str],
    labels: List[int],
    test_size: float = 0.2
) -> Tuple[Pipeline, float]:
    """Build and train a text classifier.
    
    Args:
        texts: List of text documents.
        labels: List of labels.
        test_size: Proportion for test set.
    
    Returns:
        Tuple of (trained pipeline, accuracy).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=42
    )
    
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', MultinomialNB())
    ])
    
    pipeline.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, pipeline.predict(X_test))
    
    return pipeline, accuracy


# Test
texts = ["great product", "bad quality", "love it", "hate it"] * 10
labels = [1, 0, 1, 0] * 10
model, acc = build_and_train_classifier(texts, labels)
print(f"Accuracy: {acc:.2f}")
```

</details>


---

### Question 6: IDF Calculation

Implement a function that computes IDF scores for all unique words in a corpus.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
import numpy as np
from typing import Dict, List

def compute_idf(documents: List[str]) -> Dict[str, float]:
    """Compute IDF scores for a corpus.
    
    Args:
        documents: List of text documents.
    
    Returns:
        Dictionary mapping words to IDF scores.
    """
    n_docs = len(documents)
    doc_words = [set(doc.lower().split()) for doc in documents]
    
    all_words = set().union(*doc_words)
    
    idf = {}
    for word in all_words:
        doc_count = sum(1 for doc in doc_words if word in doc)
        idf[word] = np.log(n_docs / doc_count) + 1
    
    return idf


# Test
docs = ["cat sat mat", "dog sat log", "cat dog pet"]
idf_scores = compute_idf(docs)
for word, score in sorted(idf_scores.items()):
    print(f"{word}: {score:.3f}")
```

</details>


---

### Question 7: Stemming vs Lemmatisation Comparison

Write a function that compares stemming and lemmatisation outputs for a list of words.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from typing import List, Tuple

def compare_normalisation(words: List[str]) -> List[Tuple[str, str, str]]:
    """Compare stemming and lemmatisation for words.
    
    Args:
        words: List of words to process.
    
    Returns:
        List of (original, stemmed, lemmatised) tuples.
    """
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    
    results = []
    for word in words:
        stemmed = stemmer.stem(word)
        lemmatised = lemmatizer.lemmatize(word)
        results.append((word, stemmed, lemmatised))
    
    return results


# Test
words = ['running', 'runs', 'ran', 'studies', 'wolves']
for orig, stem, lemma in compare_normalisation(words):
    print(f"{orig:12} -> stem: {stem:10} lemma: {lemma}")
```

</details>


---

### Question 8: Most Important Words

Given a corpus and the index of one document in it, find that document's top N words by TF-IDF score.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from typing import List, Tuple

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_tfidf_words(
    documents: List[str],
    doc_index: int,
    n: int = 5
) -> List[Tuple[str, float]]:
    """Find top N words by TF-IDF score in a document.
    
    Args:
        documents: List of text documents.
        doc_index: Index of document to analyse.
        n: Number of top words to return.
    
    Returns:
        List of (word, score) tuples.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    
    feature_names = vectorizer.get_feature_names_out()
    doc_vector = tfidf_matrix[doc_index].toarray()[0]
    
    top_indices = np.argsort(doc_vector)[-n:][::-1]
    
    # Skip zero scores so words absent from the document are never returned
    return [
        (feature_names[i], doc_vector[i])
        for i in top_indices
        if doc_vector[i] > 0
    ]


# Test
docs = [
    "Machine learning is transforming data science",
    "Deep learning uses neural networks",
    "Data science requires statistics knowledge"
]
top_words = top_tfidf_words(docs, 0, 3)
print("Top words in Doc 0:")
for word, score in top_words:
    print(f"  {word}: {score:.4f}")
```

</details>


---

### Question 9: Email Spam Detector

Build a simple rule-based spam detector that checks for common spam indicators.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
import re
from typing import List, Tuple

def detect_spam(email_text: str) -> Tuple[bool, List[str]]:
    """Detect spam using rule-based approach.
    
    Args:
        email_text: Email content to check.
    
    Returns:
        Tuple of (is_spam, list of triggered rules).
    """
    spam_indicators = [
        (r'\bfree\b', 'Contains "free"'),
        (r'\bwinner\b', 'Contains "winner"'),
        (r'\bcongratulations\b', 'Contains "congratulations"'),
        (r'\$\d+', 'Contains money amount'),
        (r'!{2,}', 'Multiple exclamation marks'),
        (r'URGENT', 'Contains "URGENT"'),
        (r'click here', 'Contains "click here"'),
    ]
    
    triggered = []
    text_lower = email_text.lower()
    
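    for pattern, description in spam_indicators:
        # Lowercase patterns are matched against the lowercased text.
        # The upper-case 'URGENT' pattern can only match the original text,
        # so that rule stays case-sensitive.
        if re.search(pattern, text_lower) or re.search(pattern, email_text):
            triggered.append(description)
    
    # Flag as spam when at least two rules fire
    is_spam = len(triggered) >= 2
    return is_spam, triggered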


# Test
email = "CONGRATULATIONS! You're a WINNER!! Click here to claim $1000 FREE!"
is_spam, rules = detect_spam(email)
print(f"Is spam: {is_spam}")
print(f"Triggered rules: {rules}")
```

</details>
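
The answer above treats every indicator as equally strong and flags an email once two rules fire. One possible variant, sketched below, gives each rule a weight and compares the total against a threshold. The weights, the threshold of 2.5, and the names `WEIGHTED_RULES` and `score_spam` are illustrative assumptions, not values from this notebook. Case handling is made explicit with `re.IGNORECASE` per rule instead of searching two versions of the text.

```python
import re
from typing import List, Tuple

# (pattern, regex flags, weight, description); the weights are made-up examples
WEIGHTED_RULES = [
    (r'\bfree\b', re.IGNORECASE, 1.0, 'Contains "free"'),
    (r'\bwinner\b', re.IGNORECASE, 1.5, 'Contains "winner"'),
    (r'\$\d+', 0, 1.0, 'Contains money amount'),
    (r'!{2,}', 0, 0.5, 'Multiple exclamation marks'),
    (r'\bURGENT\b', 0, 1.5, 'Contains upper-case "URGENT"'),
    (r'\bclick here\b', re.IGNORECASE, 2.0, 'Contains "click here"'),
]


def score_spam(
    email_text: str,
    threshold: float = 2.5
) -> Tuple[bool, float, List[str]]:
    """Sum the weights of all triggered rules and compare to a threshold."""
    score = 0.0
    triggered = []
    for pattern, flags, weight, description in WEIGHTED_RULES:
        if re.search(pattern, email_text, flags):
            score += weight
            triggered.append(description)
    return score >= threshold, score, triggered


spam_flag, spam_score, spam_rules = score_spam(
    "You're a WINNER!! Click here to claim $1000 FREE!"
)
ham_flag, ham_score, ham_rules = score_spam(
    "Feel free to call me about the meeting notes."
)
print(spam_flag, spam_score)  # True 6.0
print(ham_flag, ham_score)    # False 1.0
```

A single weak indicator such as "free" in an ordinary sentence no longer comes close to the threshold, while strong indicators can trigger it with fewer matches.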


---

### Question 10: Vocabulary Builder

Create a function that builds a vocabulary from a corpus, keeping only words that meet a minimum frequency threshold and optionally capping the vocabulary size.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from collections import Counter
from typing import Dict, List, Optional

def build_vocabulary(
    documents: List[str],
    min_freq: int = 1,
    max_vocab_size: Optional[int] = None
) -> Dict[str, int]:
    """Build vocabulary with frequency threshold.
    
    Args:
        documents: List of text documents.
        min_freq: Minimum word frequency.
        max_vocab_size: Maximum vocabulary size.
    
    Returns:
        Dictionary mapping words to indices.
    """
    all_words = []
    for doc in documents:
        tokens = doc.lower().split()
        all_words.extend(tokens)
    
    word_counts = Counter(all_words)
    
    filtered = [(w, c) for w, c in word_counts.items() if c >= min_freq]
    filtered.sort(key=lambda x: (-x[1], x[0]))
    
    if max_vocab_size is not None:
        filtered = filtered[:max_vocab_size]
    
    return {word: idx for idx, (word, _) in enumerate(filtered)}


# Test
docs = ["I love data", "I love science", "data science is great"]
vocab = build_vocabulary(docs, min_freq=2)
print(vocab)
# {'data': 0, 'i': 1, 'love': 2, 'science': 3}
# (all four words occur twice, so ties are broken alphabetically)
```

</details>
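
A vocabulary is usually used to turn documents into index sequences, which raises the question of what to do with words that are not in it. The sketch below shows one common convention: reserve an index for an unknown-word token. The function name `encode_document`, the `<unk>` token, and the example vocabulary are assumptions for illustration; `build_vocabulary` above does not reserve such an index.

```python
from typing import Dict, List


def encode_document(
    doc: str,
    vocab: Dict[str, int],
    unk_token: str = "<unk>"
) -> List[int]:
    """Map each token to its vocabulary index; unknown words share one index."""
    unk_index = vocab[unk_token]
    return [vocab.get(token, unk_index) for token in doc.lower().split()]


# Hypothetical vocabulary: index 0 is reserved for unknown words
example_vocab = {"<unk>": 0, "data": 1, "i": 2, "love": 3, "science": 4}

encoded = encode_document("I love big data", example_vocab)
print(encoded)  # [2, 3, 0, 1]
```

Words dropped by `min_freq` or `max_vocab_size` end up on the unknown index, so a stricter threshold trades vocabulary size against information lost at encoding time.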


---

### Question 11: Text Normalisation Pipeline

Create a configurable text normalisation pipeline class.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

class TextNormaliser:
    """Configurable text normalisation pipeline."""
    
    def __init__(
        self,
        lowercase: bool = True,
        remove_punctuation: bool = True,
        remove_stopwords: bool = True,
        use_stemming: bool = False,
        use_lemmatisation: bool = True
    ):
        """Initialise the normaliser.
        
        Args:
            lowercase: Convert to lowercase.
            remove_punctuation: Remove punctuation.
            remove_stopwords: Remove stopwords.
            use_stemming: Apply stemming.
            use_lemmatisation: Apply lemmatisation.
        """
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.remove_stopwords = remove_stopwords
        self.use_stemming = use_stemming
        self.use_lemmatisation = use_lemmatisation
        
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def normalise(self, text: str) -> str:
        """Normalise text according to configuration.
        
        Args:
            text: Input text string.
        
        Returns:
            Normalised text string.
        """
        if self.lowercase:
            text = text.lower()
        
        if self.remove_punctuation:
            text = re.sub(r'[^\w\s]', '', text)
        
        tokens = text.split()
        
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]
        
        # Stemming takes precedence: if both flags are set, only the stemmer runs.
        if self.use_stemming:
            tokens = [self.stemmer.stem(t) for t in tokens]
        elif self.use_lemmatisation:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        
        return ' '.join(tokens)


# Test
normaliser = TextNormaliser(use_lemmatisation=True)
text = "The researchers were studying various algorithms!"
print(normaliser.normalise(text))
# "researcher studying various algorithm"
```

</details>
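
The order of steps in a pipeline like this matters. Stripping punctuation with `re.sub(r'[^\w\s]', '', text)` turns "don't" into "dont", which no longer matches a stopword entry spelled with the apostrophe. The sketch below illustrates the effect with a tiny hand-written stopword set (an assumption, much shorter than NLTK's list) and one possible alternative regex that keeps apostrophes inside words. The helper names are hypothetical.

```python
import re
from typing import List

# Tiny illustrative stopword set (assumption; not NLTK's list)
STOP_WORDS = {"i", "don't", "the"}


def strip_all_punctuation(text: str) -> str:
    """Lowercase and remove every non-word, non-space character."""
    return re.sub(r"[^\w\s]", "", text.lower())


def strip_keep_apostrophes(text: str) -> str:
    """Lowercase and remove punctuation, keeping apostrophes between word characters."""
    return re.sub(r"(?<!\w)'|'(?!\w)|[^\w\s']", "", text.lower())


def remove_stopwords(text: str) -> List[str]:
    return [t for t in text.split() if t not in STOP_WORDS]


sentence = "I don't like the 'new' design!"
plain = remove_stopwords(strip_all_punctuation(sentence))
kept = remove_stopwords(strip_keep_apostrophes(sentence))
print(plain)  # ['dont', 'like', 'new', 'design']
print(kept)   # ['like', 'new', 'design']
```

Whether a contraction such as "don't" should be removed at all depends on the task; the point here is only that the cleaning regex decides whether the stopword filter can see it.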


---

### Question 12: Keyword Extraction

Extract the most important keywords from a document using TF-IDF against a background corpus.

In [None]:
# Write your solution here

<details>
<summary>Click to reveal answer</summary>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from typing import List, Tuple

def extract_keywords(
    target_doc: str,
    background_corpus: List[str],
    n_keywords: int = 5
) -> List[Tuple[str, float]]:
    """Extract keywords from document using TF-IDF.
    
    Args:
        target_doc: Document to extract keywords from.
        background_corpus: Corpus for IDF calculation.
        n_keywords: Number of keywords to extract.
    
    Returns:
        List of (keyword, score) tuples.
    """
    corpus = background_corpus + [target_doc]
    
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(corpus)
    
    feature_names = vectorizer.get_feature_names_out()
    target_vector = tfidf_matrix[-1].toarray()[0]
    
    top_indices = np.argsort(target_vector)[-n_keywords:][::-1]
    
    keywords = [
        (feature_names[i], target_vector[i])
        for i in top_indices
        if target_vector[i] > 0
    ]
    
    return keywords


# Test
background = [
    "The economy is growing steadily",
    "Stock markets reached new highs",
    "Inflation rates remain stable"
]
target = "Machine learning revolutionises artificial intelligence applications"

keywords = extract_keywords(target, background, 3)
print("Keywords:")
for kw, score in keywords:
    print(f"  {kw}: {score:.4f}")
```

</details>
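
The background corpus matters because of the IDF term. With its default `smooth_idf=True`, scikit-learn's `TfidfVectorizer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The standard-library sketch below applies that formula to a toy corpus. The corpus, the function name `smoothed_idf`, and the whitespace tokenisation are simplifying assumptions; the vectoriser's own tokeniser, stopword filtering, and L2 normalisation are left out.

```python
import math
from typing import Dict, List


def smoothed_idf(corpus: List[str]) -> Dict[str, float]:
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq: Dict[str, int] = {}
    for doc in corpus:
        # set() so a term counts once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


toy_background = ["stock markets rise", "stock prices fall"]
toy_target = "neural networks predict stock prices"

idf = smoothed_idf(toy_background + [toy_target])
for term in ("neural", "prices", "stock"):
    print(f"{term}: {idf[term]:.3f}")
# neural: 1.693
# prices: 1.288
# stock: 1.000
```

Terms that also appear in the background documents are pushed down, which is what lets distinctive terms rise to the top. In the test for the answer above, the target shares no terms with the background and each term occurs once, so all target terms receive the same score and the top three are decided by tie order.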


---

## 13. Summary

This notebook covered essential NLP concepts for data science interviews:

1. **Text Preprocessing**: Cleaning text by removing noise, normalising case, and handling special characters
2. **Tokenisation**: Breaking text into words or sentences
3. **Stopword Removal**: Filtering common words that carry little meaning
4. **Stemming and Lemmatisation**: Reducing words to their base forms
5. **Bag of Words**: Representing text as word count vectors
6. **TF-IDF**: Weighting terms by their importance relative to the corpus
7. **N-grams**: Capturing word sequences and local context
8. **Word Embeddings**: Dense vector representations capturing semantics
9. **Text Classification**: Building pipelines for categorising text
10. **Sentiment Analysis**: Determining emotional tone of text

---

### Key Interview Tips

- **Know when to use what**: TF-IDF for traditional ML, embeddings for deep learning
- **Preprocessing matters**: Always clean and normalise text before feature extraction
- **Understand the trade-offs**: Stemming is fast but crude; lemmatisation is accurate but slower
- **Be able to implement from scratch**: Interviewers often ask for manual TF-IDF or tokenisation
- **Consider the use case**: Sentiment analysis may need to keep stopwords such as "not", because negation flips polarity; topic modelling usually does not need them
- **Know cosine similarity**: The standard metric for comparing text vectors
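
Since from-scratch implementations and cosine similarity both come up in the tips above, here is a minimal standard-library sketch of cosine similarity over raw term counts. The function name `cosine_similarity_bow`, the whitespace tokenisation, and returning 0.0 for empty texts are choices made for this example, not a library API.

```python
import math
from collections import Counter


def cosine_similarity_bow(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only terms present in both texts contribute to the dot product
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts

    return dot / (norm_a * norm_b)


sim_close = cosine_similarity_bow("data science is fun", "data science is hard")
sim_far = cosine_similarity_bow("deep learning", "stock markets")
print(round(sim_close, 3))  # 0.75
print(sim_far)              # 0.0
```

The same function works on TF-IDF weights if the `Counter` objects are replaced by term-to-weight dictionaries; because the result is normalised by vector length, a long and a short document on the same topic can still score highly.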