# Advanced Natural Language Processing

This notebook covers advanced NLP techniques including tokenization, stemming, lemmatization, named entity recognition, text preprocessing, and vectorization methods.


## Setup: Downloading NLTK Data 📦

Before we begin, let's ensure all required NLTK data is downloaded.


In [1]:
import nltk

# Map each required NLTK package to the resource path used to check for it
nltk_data_packages = {
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',
    'stopwords': 'corpora/stopwords',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
    'averaged_perceptron_tagger_eng': 'taggers/averaged_perceptron_tagger_eng',
    'wordnet': 'corpora/wordnet',
    'maxent_ne_chunker': 'chunkers/maxent_ne_chunker',
    'maxent_ne_chunker_tab': 'chunkers/maxent_ne_chunker_tab',
    'words': 'corpora/words',
}

# Download only the packages that are not already available locally
for package, resource_path in nltk_data_packages.items():
    try:
        nltk.data.find(resource_path)
        print(f"✓ {package} is already downloaded")
    except LookupError:
        print(f"Downloading {package}...")
        nltk.download(package, quiet=True)
        print(f"✓ {package} downloaded successfully")

print("\nAll required NLTK data is ready!")


✓ punkt is already downloaded
✓ punkt_tab is already downloaded
✓ stopwords is already downloaded
✓ averaged_perceptron_tagger is already downloaded
✓ averaged_perceptron_tagger_eng is already downloaded
Downloading wordnet...
✓ wordnet downloaded successfully
✓ maxent_ne_chunker is already downloaded
Downloading maxent_ne_chunker_tab...
✓ maxent_ne_chunker_tab downloaded successfully
✓ words is already downloaded

All required NLTK data is ready!


# Tokenization

**Tokenization** is the process of breaking down text into smaller units called tokens. These tokens can be words, sentences, or subwords.
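
To make the idea concrete before turning to NLTK, here is a minimal regex-based sketch using only the standard library. The function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are deliberately naive: contractions stay whole (NLTK splits them), and the sentence splitter would break on abbreviations such as "Dr.".

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "don't" together;
    # [^\w\s] emits each remaining punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after ., ! or ? when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Hello world! How are you? I'm doing great."
for sent in simple_sent_tokenize(sample):
    print(sent, "->", simple_word_tokenize(sent))
```

NLTK's tokenizers handle many cases this sketch ignores, which is why the rest of the notebook relies on `sent_tokenize` and `word_tokenize`.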


## Sentence Tokenization

Splitting text into individual sentences.


In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

# Example text with multiple sentences
text = "Natural Language Processing is fascinating. It enables machines to understand human language. NLP has many applications in today's world."

# Sentence tokenization
sentences = sent_tokenize(text)
print("Original text:")
print(text)
print("\nSentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")


Original text:
Natural Language Processing is fascinating. It enables machines to understand human language. NLP has many applications in today's world.

Sentences:
1. Natural Language Processing is fascinating.
2. It enables machines to understand human language.
3. NLP has many applications in today's world.


## Word Tokenization

Splitting text into individual words.


In [3]:
# Word tokenization
text = "Don't judge a book by its cover"

tokens = word_tokenize(text)
print("Original text:", text)
print("Tokens:", tokens)
print(f"\nNumber of tokens: {len(tokens)}")

# Tokenize each sentence
multi_sentence = "Hello world! How are you? I'm doing great."
sentences = sent_tokenize(multi_sentence)
print("\n" + "="*50)
print("Multi-sentence tokenization:")
for sentence in sentences:
    tokens = word_tokenize(sentence)
    print(f"Sentence: {sentence}")
    print(f"Tokens: {tokens}\n")


Original text: Don't judge a book by its cover
Tokens: ['Do', "n't", 'judge', 'a', 'book', 'by', 'its', 'cover']

Number of tokens: 8

Multi-sentence tokenization:
Sentence: Hello world!
Tokens: ['Hello', 'world', '!']

Sentence: How are you?
Tokens: ['How', 'are', 'you', '?']

Sentence: I'm doing great.
Tokens: ['I', "'m", 'doing', 'great', '.']



# Stemming

**Stemming** is the process of reducing words to their root form by removing suffixes. It's a crude heuristic process that chops off the ends of words.
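
The "chopping" heuristic can be sketched in a few lines. This toy stemmer is an illustration only: the suffix list and the three-character minimum are assumptions made for the example, and it is far simpler than the rule sets used by the Porter, Snowball, or Lancaster stemmers.

```python
def toy_stem(word, suffixes=("ness", "ing", "ly", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["running", "runs", "happiness", "quickly", "ran"]:
    print(f"{w:12} -> {toy_stem(w)}")
```

Note that "running" becomes the non-word "runn" and the irregular form "ran" is left untouched, which shows why real stemmers need additional rules and why stems are not guaranteed to be dictionary words.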

## Types of Stemmers

1. **Porter Stemmer**: The most widely used stemmer; applies a fixed sequence of suffix-stripping rules
2. **Snowball Stemmer**: Improved version of Porter, supports multiple languages
3. **Lancaster Stemmer**: More aggressive than Porter


In [4]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Example words
words = ["running", "runs", "runner", "ran", "beautifully", "beautiful", "happily", "happiness", "unhappiness"]

# Porter Stemmer
porter = PorterStemmer()
print("Porter Stemmer:")
print("-" * 50)
for word in words:
    stemmed = porter.stem(word)
    print(f"{word:15} -> {stemmed}")

# Snowball Stemmer (English)
snowball = SnowballStemmer('english')
print("\nSnowball Stemmer:")
print("-" * 50)
for word in words:
    stemmed = snowball.stem(word)
    print(f"{word:15} -> {stemmed}")

# Lancaster Stemmer (more aggressive)
lancaster = LancasterStemmer()
print("\nLancaster Stemmer:")
print("-" * 50)
for word in words:
    stemmed = lancaster.stem(word)
    print(f"{word:15} -> {stemmed}")


Porter Stemmer:
--------------------------------------------------
running         -> run
runs            -> run
runner          -> runner
ran             -> ran
beautifully     -> beauti
beautiful       -> beauti
happily         -> happili
happiness       -> happi
unhappiness     -> unhappi

Snowball Stemmer:
--------------------------------------------------
running         -> run
runs            -> run
runner          -> runner
ran             -> ran
beautifully     -> beauti
beautiful       -> beauti
happily         -> happili
happiness       -> happi
unhappiness     -> unhappi

Lancaster Stemmer:
--------------------------------------------------
running         -> run
runs            -> run
runner          -> run
ran             -> ran
beautifully     -> beauty
beautiful       -> beauty
happily         -> happy
happiness       -> happy
unhappiness     -> unhappy


## Stemming in Practice

Applying stemming to a sentence.


In [5]:
# Stemming a sentence
sentence = "The runners were running quickly and beautifully through the beautiful garden"
tokens = word_tokenize(sentence.lower())

porter = PorterStemmer()
stemmed_tokens = [porter.stem(token) for token in tokens]

print("Original sentence:", sentence)
print("\nOriginal tokens:", tokens)
print("Stemmed tokens:", stemmed_tokens)
print("\nStemmed sentence:", " ".join(stemmed_tokens))


Original sentence: The runners were running quickly and beautifully through the beautiful garden

Original tokens: ['the', 'runners', 'were', 'running', 'quickly', 'and', 'beautifully', 'through', 'the', 'beautiful', 'garden']
Stemmed tokens: ['the', 'runner', 'were', 'run', 'quickli', 'and', 'beauti', 'through', 'the', 'beauti', 'garden']

Stemmed sentence: the runner were run quickli and beauti through the beauti garden


# Lemmatization

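**Lemmatization** is the process of reducing words to their base or dictionary form (lemma). Unlike stemming, lemmatization looks words up in a vocabulary and takes the word's part of speech into account, so the same surface form can map to different lemmas depending on how it is used.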

## Key Differences from Stemming

- **Stemming**: Fast but crude, may produce non-words
- **Lemmatization**: Slower but accurate, produces valid words
- **Example**: "better" → stem: "better", lemma: "good"
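
The role of part of speech can be shown with a tiny lookup table. This is a sketch only: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the four entries stand in for the full WordNet vocabulary that a real lemmatizer consults.

```python
# Hypothetical miniature lexicon: (surface form, POS) -> lemma
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("mice", "n"): "mouse",
    ("was", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("running", "v"))  # verb reading has an entry
print(toy_lemmatize("running", "n"))  # noun reading has none, so it is returned as-is
print(toy_lemmatize("better", "a"))
```

Because the lookup key includes the POS tag, "running" resolves differently as a verb and as a noun, which is the behavior a suffix-stripping stemmer cannot reproduce.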


In [6]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tag import pos_tag

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Example words with different parts of speech
words_examples = [
    ("better", 'a'),  # adjective
    ("running", 'v'),  # verb
    ("running", 'n'),  # noun
    ("mice", 'n'),  # noun (plural)
    ("was", 'v'),  # verb
    ("happily", 'r'),  # adverb
    ("happiness", 'n')  # noun
]

print("Lemmatization Examples:")
print("-" * 60)
print(f"{'Word':15} {'POS':5} {'Lemma':15}")
print("-" * 60)
for word, pos in words_examples:
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:15} {pos:5} {lemma:15}")


Lemmatization Examples:
------------------------------------------------------------
Word            POS   Lemma          
------------------------------------------------------------
better          a     good           
running         v     run            
running         n     running        
mice            n     mouse          
was             v     be             
happily         r     happily        
happiness       n     happiness      


## Automatic POS Tagging for Lemmatization

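`pos_tag` returns Penn Treebank tags (such as `VBG` or `JJR`), while `WordNetLemmatizer` expects WordNet POS constants. Converting between the two lets the lemmatizer pick the right lemma for each token automatically.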


In [7]:
def get_wordnet_pos(treebank_tag):
    """Convert treebank POS tag to WordNet POS tag"""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default

# Example sentence
sentence = "The better runner was running quickly and feeling happier"
tokens = word_tokenize(sentence)

# Get POS tags
pos_tags = pos_tag(tokens)

# Lemmatize with correct POS
lemmatized = []
for word, pos in pos_tags:
    wordnet_pos = get_wordnet_pos(pos)
    lemma = lemmatizer.lemmatize(word, pos=wordnet_pos)
    lemmatized.append(lemma)

print("Original sentence:", sentence)
print("\nTokens with POS tags:")
for word, pos in pos_tags:
    print(f"  {word:15} -> {pos}")
print("\nLemmatized tokens:", lemmatized)
print("Lemmatized sentence:", " ".join(lemmatized))


Original sentence: The better runner was running quickly and feeling happier

Tokens with POS tags:
  The             -> DT
  better          -> JJR
  runner          -> NN
  was             -> VBD
  running         -> VBG
  quickly         -> RB
  and             -> CC
  feeling         -> VBG
  happier         -> NN

Lemmatized tokens: ['The', 'good', 'runner', 'be', 'run', 'quickly', 'and', 'feel', 'happier']
Lemmatized sentence: The good runner be run quickly and feel happier


# Named Entity Recognition (NER)

**Named Entity Recognition** is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
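
As a rough illustration of "identifying and classifying" spans, the sketch below uses a greedy dictionary (gazetteer) lookup. This is not how NLTK's `ne_chunk` works, which uses a trained statistical chunker; the `GAZETTEER` entries and the `toy_ner` function are made up for the example.

```python
# Hypothetical gazetteer: known entity phrases (as token tuples) and their labels
GAZETTEER = {
    ("Barack", "Obama"): "PERSON",
    ("United", "States"): "GPE",
    ("Stanford", "University"): "ORGANIZATION",
    ("Paris",): "GPE",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def toy_ner(tokens):
    """Greedy longest-match lookup; returns (entity text, label) pairs."""
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                entities.append((" ".join(tokens[i:i + n]), label))
                i += n
                break
        else:
            i += 1  # no entity starts at this position
    return entities

print(toy_ner("Barack Obama was the President of the United States".split()))
```

A lookup like this only finds entities it already knows about; statistical and neural NER models exist precisely to generalize to unseen names.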


In [8]:
from nltk import ne_chunk

# Example sentences with named entities
sentences = [
    "Apple Inc. is located in Cupertino, California.",
    "Barack Obama was the President of the United States.",
    "I visited Paris, France last summer.",
    "The conference will be held on March 15, 2024 at Stanford University."
]

for sentence in sentences:
    print(f"\nSentence: {sentence}")
    print("-" * 70)
    
    # Tokenize and POS tag
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    
    # Perform NER
    ner_tree = ne_chunk(pos_tags)
    
    print("Named Entities:")
    for subtree in ner_tree:
        if hasattr(subtree, 'label'):
            entity_name = ' '.join([token for token, pos in subtree.leaves()])
            entity_type = subtree.label()
            print(f"  {entity_name:30} -> {entity_type}")
    print()



Sentence: Apple Inc. is located in Cupertino, California.
----------------------------------------------------------------------
Named Entities:
  Apple                          -> PERSON
  Inc.                           -> ORGANIZATION
  Cupertino                      -> GPE
  California                     -> GPE


Sentence: Barack Obama was the President of the United States.
----------------------------------------------------------------------
Named Entities:
  Barack                         -> PERSON
  Obama                          -> PERSON
  United States                  -> GPE


Sentence: I visited Paris, France last summer.
----------------------------------------------------------------------
Named Entities:
  Paris                          -> GPE
  France                         -> GPE


Sentence: The conference will be held on March 15, 2024 at Stanford University.
----------------------------------------------------------------------
Named Entities:
  Stanford Universi

## Visualizing NER Tree Structure

Displaying the full NER tree structure.


In [9]:
# Detailed NER example
text = "The quick brown fox jumps over the lazy dog."

tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
ner_tree = ne_chunk(pos_tags)

print("Original text:", text)
print("\nTokenized:", tokens)
print("\nPOS Tags:", pos_tags)
print("\nNER Tree Structure:")
print(ner_tree)

# Extract named entities
print("\nExtracted Named Entities:")
entities = []
for subtree in ner_tree:
    if hasattr(subtree, 'label'):
        entity = ' '.join([token for token, pos in subtree.leaves()])
        entities.append((entity, subtree.label()))

if entities:
    for entity, label in entities:
        print(f"  {entity} -> {label}")
else:
    print("  No named entities found in this sentence.")


Original text: The quick brown fox jumps over the lazy dog.

Tokenized: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]

NER Tree Structure:
(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  jumps/VBZ
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)

Extracted Named Entities:
  No named entities found in this sentence.


# Text Preprocessing

**Text Preprocessing** is a crucial step in NLP that involves cleaning and preparing text data for analysis. Common preprocessing steps include:

1. **Lowercasing**: Convert all text to lowercase
2. **Removing Punctuation**: Remove special characters
3. **Removing Stopwords**: Remove common words that don't carry much meaning
4. **Tokenization**: Split text into tokens
5. **Stemming/Lemmatization**: Reduce words to their base forms


In [10]:
import string
from nltk.corpus import stopwords

# Example text
text = "The quick brown fox jumps over the lazy dog! It's a beautiful day in the neighborhood."

print("Original text:")
print(text)
print("\n" + "="*70)

# Step 1: Lowercasing
text_lower = text.lower()
print("\n1. After lowercasing:")
print(text_lower)

# Step 2: Tokenization
tokens = word_tokenize(text_lower)
print("\n2. After tokenization:")
print(tokens)

# Step 3: Removing punctuation
tokens_no_punct = [token for token in tokens if token not in string.punctuation]
print("\n3. After removing punctuation:")
print(tokens_no_punct)

# Step 4: Removing stopwords
stop_words = set(stopwords.words('english'))
tokens_no_stopwords = [token for token in tokens_no_punct if token not in stop_words]
print("\n4. After removing stopwords:")
print(tokens_no_stopwords)
print(f"Removed stopwords: {[w for w in tokens_no_punct if w in stop_words]}")

# Step 5: Stemming (optional)
porter = PorterStemmer()
tokens_stemmed = [porter.stem(token) for token in tokens_no_stopwords]
print("\n5. After stemming:")
print(tokens_stemmed)

print("\n" + "="*70)
print("Final preprocessed text:", " ".join(tokens_stemmed))


Original text:
The quick brown fox jumps over the lazy dog! It's a beautiful day in the neighborhood.


1. After lowercasing:
the quick brown fox jumps over the lazy dog! it's a beautiful day in the neighborhood.

2. After tokenization:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!', 'it', "'s", 'a', 'beautiful', 'day', 'in', 'the', 'neighborhood', '.']

3. After removing punctuation:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'it', "'s", 'a', 'beautiful', 'day', 'in', 'the', 'neighborhood']

4. After removing stopwords:
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', "'s", 'beautiful', 'day', 'neighborhood']
Removed stopwords: ['the', 'over', 'the', 'it', 'a', 'in', 'the']

5. After stemming:
['quick', 'brown', 'fox', 'jump', 'lazi', 'dog', "'s", 'beauti', 'day', 'neighborhood']

Final preprocessed text: quick brown fox jump lazi dog 's beauti day neighborhood


## Complete Preprocessing Function

Creating a reusable preprocessing function.


In [11]:
def preprocess_text(text, remove_stopwords=True, stem=True, lemmatize=False):
    """
    Complete text preprocessing pipeline
    
    Parameters:
    - text: Input text string
    - remove_stopwords: Whether to remove stopwords
    - stem: Whether to apply stemming
    - lemmatize: Whether to apply lemmatization (overrides stem if True)
    
    Returns:
    - List of preprocessed tokens
    """
    # Lowercase
    text = text.lower()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    
    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatization or Stemming
    if lemmatize:
        pos_tags = pos_tag(tokens)
        lemmatized = []
        for word, pos in pos_tags:
            wordnet_pos = get_wordnet_pos(pos)
            lemma = lemmatizer.lemmatize(word, pos=wordnet_pos)
            lemmatized.append(lemma)
        tokens = lemmatized
    elif stem:
        porter = PorterStemmer()
        tokens = [porter.stem(token) for token in tokens]
    
    return tokens

# Test the function
sample_texts = [
    "The quick brown fox jumps over the lazy dog!",
    "Natural Language Processing is amazing and powerful.",
    "I'm learning about text preprocessing techniques."
]

print("Text Preprocessing Examples:")
print("="*70)
for text in sample_texts:
    print(f"\nOriginal: {text}")
    preprocessed = preprocess_text(text, remove_stopwords=True, stem=True)
    print(f"Preprocessed: {' '.join(preprocessed)}")


Text Preprocessing Examples:

Original: The quick brown fox jumps over the lazy dog!
Preprocessed: quick brown fox jump lazi dog

Original: Natural Language Processing is amazing and powerful.
Preprocessed: natur languag process amaz power

Original: I'm learning about text preprocessing techniques.
Preprocessed: 'm learn text preprocess techniqu
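
The outputs above still contain contraction fragments such as `'s` and `'m`: `word_tokenize` splits contractions, and the check `token not in string.punctuation` only removes tokens that happen to be substrings of the punctuation string. One possible stricter filter, sketched here with a hypothetical helper name, keeps only purely alphanumeric tokens. The trade-off is that hyphenated words and tokens like `n't` are dropped as well.

```python
def drop_non_word_tokens(tokens):
    """Keep tokens that are purely alphanumeric, e.g. 'fox' or '2024'."""
    return [token for token in tokens if token.isalnum()]

leftover_tokens = ['quick', 'brown', 'fox', "'s", 'beautiful', 'day', "'m", '...']
print(drop_non_word_tokens(leftover_tokens))
```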


# Bag of Words (BoW)

**Bag of Words** is a simple text representation method that creates a vocabulary of unique words and represents each document as a vector of word counts, ignoring word order and grammar.
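
The representation can be built by hand with the standard library, which may help clarify what a vectorizer does internally. This sketch assumes simple whitespace tokenization, so unlike scikit-learn's default `CountVectorizer` it keeps the one-letter token "i"; the variable names are illustrative.

```python
from collections import Counter

docs = [
    "I love natural language processing",
    "I love machine learning and NLP",
]

# Lowercase and split on whitespace (a simplification of a real tokenizer)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: sorted unique words across all documents
vocab = sorted({word for doc in tokenized for word in doc})

# One count vector per document, aligned with the vocabulary order
bow_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in bow_vectors:
    print(vec)
```

Each row has one entry per vocabulary word, so word order within the document is lost by construction.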


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample documents
documents = [
    "I love natural language processing",
    "Natural language processing is fascinating",
    "I love machine learning and NLP",
    "Machine learning and natural language processing are related"
]

# Create CountVectorizer (Bag of Words)
vectorizer = CountVectorizer()

# Fit and transform documents
bow_matrix = vectorizer.fit_transform(documents)

# Get vocabulary
vocabulary = vectorizer.get_feature_names_out()

print("Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

print("\nVocabulary:", vocabulary)
print(f"\nVocabulary size: {len(vocabulary)}")
print(f"\nBoW Matrix shape: {bow_matrix.shape}")

# Convert to dense array for better visualization
bow_dense = bow_matrix.toarray()

# Create DataFrame for better visualization
df_bow = pd.DataFrame(bow_dense, columns=vocabulary, index=[f"Doc {i+1}" for i in range(len(documents))])
print("\nBag of Words Matrix:")
print(df_bow)


Documents:
1. I love natural language processing
2. Natural language processing is fascinating
3. I love machine learning and NLP
4. Machine learning and natural language processing are related

Vocabulary: ['and' 'are' 'fascinating' 'is' 'language' 'learning' 'love' 'machine'
 'natural' 'nlp' 'processing' 'related']

Vocabulary size: 12

BoW Matrix shape: (4, 12)

Bag of Words Matrix:
       and  are  fascinating  is  language  learning  love  machine  natural  \
Doc 1    0    0            0   0         1         0     1        0        1   
Doc 2    0    0            1   1         1         0     0        0        1   
Doc 3    1    0            0   0         0         1     1        1        0   
Doc 4    1    1            0   0         1         1     0        1        1   

       nlp  processing  related  
Doc 1    0           1        0  
Doc 2    0           1        0  
Doc 3    1           0        0  
Doc 4    0           1        1  


## Bag of Words with Custom Parameters

Using custom parameters like max_features, min_df, and max_df.
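
To see what `min_df` and `max_df` do, the following sketch filters a vocabulary by document frequency by hand. It assumes whitespace tokenization and treats `min_df` as an absolute count and `max_df` as a proportion, matching how the two values are passed in the cell below; the function name is hypothetical, and `max_features` is not covered here.

```python
from collections import Counter

docs = [
    "I love natural language processing",
    "Natural language processing is fascinating",
    "I love machine learning and NLP",
    "Machine learning and natural language processing are related",
]

def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Keep words found in at least min_df documents and at most a max_df share of them."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    # Document frequency: number of documents that contain each word
    df = Counter(word for words in doc_sets for word in words)
    n_docs = len(doc_sets)
    return sorted(w for w, count in df.items()
                  if count >= min_df and count / n_docs <= max_df)

print(filter_by_document_frequency(docs, max_df=0.8))  # "natural" (3 of 4 docs) is kept
print(filter_by_document_frequency(docs, max_df=0.7))  # "natural" is dropped
```

With only four documents, a word in three of them has a document frequency of 0.75, so a threshold of 0.8 keeps it while 0.7 removes it.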


In [13]:
# Custom Bag of Words with preprocessing
vectorizer_custom = CountVectorizer(
    max_features=10,  # Keep only top 10 most frequent words
    min_df=1,  # Minimum document frequency
    max_df=0.8,  # Maximum document frequency (ignore words in >80% of docs)
    stop_words='english',  # Remove English stopwords
    lowercase=True,
    token_pattern=r'\b\w+\b'  # Word token pattern
)

bow_custom = vectorizer_custom.fit_transform(documents)
vocab_custom = vectorizer_custom.get_feature_names_out()

print("Custom BoW with preprocessing:")
print(f"Vocabulary: {vocab_custom}")
print(f"\nBoW Matrix:")
df_custom = pd.DataFrame(bow_custom.toarray(), columns=vocab_custom, 
                         index=[f"Doc {i+1}" for i in range(len(documents))])
print(df_custom)


Custom BoW with preprocessing:
Vocabulary: ['fascinating' 'language' 'learning' 'love' 'machine' 'natural' 'nlp'
 'processing' 'related']

BoW Matrix:
       fascinating  language  learning  love  machine  natural  nlp  \
Doc 1            0         1         0     1        0        1    0   
Doc 2            1         1         0     0        0        1    0   
Doc 3            0         0         1     1        1        0    1   
Doc 4            0         1         1     0        1        1    0   

       processing  related  
Doc 1           1        0  
Doc 2           1        0  
Doc 3           0        0  
Doc 4           1        1  


# TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** is a numerical statistic that reflects how important a word is to a document in a collection of documents. It increases proportionally to the number of times a word appears in a document but is offset by the frequency of the word in the corpus.

## Formula

- **TF (Term Frequency)**: Number of times a term appears in a document
- **IDF (Inverse Document Frequency)**: Logarithmic inverse of the document frequency
- **TF-IDF = TF × IDF**
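
The exact weighting varies between libraries. As a worked example, the sketch below follows scikit-learn's `TfidfVectorizer` defaults: raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names are illustrative, and the tokenizer is an approximation that keeps lowercase words of two or more characters.

```python
import math
import re
from collections import Counter

docs = [
    "I love natural language processing",
    "Natural language processing is fascinating",
    "I love machine learning and NLP",
    "Machine learning and natural language processing are related",
]

def tokenize(doc):
    # Lowercased words of two or more characters (so "I" is dropped)
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)
# Document frequency: number of documents containing each word
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf_vector(doc_tokens):
    """Raw-count TF times smoothed IDF, then L2 normalization."""
    tf = Counter(doc_tokens)
    weights = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

doc1 = tfidf_vector(tokenized[0])
for word, score in sorted(doc1.items(), key=lambda kv: -kv[1]):
    print(f"{word:12} {score:.4f}")
```

In Doc 1, "love" appears in two of the four documents and the other three words in three, so "love" receives the highest weight even though every word occurs exactly once.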


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same documents as before
documents = [
    "I love natural language processing",
    "Natural language processing is fascinating",
    "I love machine learning and NLP",
    "Machine learning and natural language processing are related"
]

# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get vocabulary
vocabulary = tfidf_vectorizer.get_feature_names_out()

print("Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

print(f"\nVocabulary size: {len(vocabulary)}")
print(f"TF-IDF Matrix shape: {tfidf_matrix.shape}")

# Convert to dense array
tfidf_dense = tfidf_matrix.toarray()

# Create DataFrame
df_tfidf = pd.DataFrame(tfidf_dense, columns=vocabulary, 
                        index=[f"Doc {i+1}" for i in range(len(documents))])
print("\nTF-IDF Matrix:")
print(df_tfidf.round(3))


Documents:
1. I love natural language processing
2. Natural language processing is fascinating
3. I love machine learning and NLP
4. Machine learning and natural language processing are related

Vocabulary size: 12
TF-IDF Matrix shape: (4, 12)

TF-IDF Matrix:
         and    are  fascinating     is  language  learning   love  machine  \
Doc 1  0.000  0.000        0.000  0.000     0.470     0.000  0.581    0.000   
Doc 2  0.000  0.000        0.557  0.557     0.356     0.000  0.000    0.000   
Doc 3  0.422  0.000        0.000  0.000     0.000     0.422  0.422    0.422   
Doc 4  0.350  0.443        0.000  0.000     0.283     0.350  0.000    0.350   

       natural    nlp  processing  related  
Doc 1    0.470  0.000       0.470    0.000  
Doc 2    0.356  0.000       0.356    0.000  
Doc 3    0.000  0.536       0.000    0.000  
Doc 4    0.283  0.000       0.283    0.443  


## Understanding TF-IDF Values

Analyzing which words have high TF-IDF scores in each document.


In [15]:
# Find top words for each document
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}: {doc}")
    print("-" * 60)
    
    # Get TF-IDF scores for this document
    scores = tfidf_dense[i]
    
    # Get top 5 words
    top_indices = scores.argsort()[-5:][::-1]
    
    print("Top 5 words by TF-IDF score:")
    for idx in top_indices:
        word = vocabulary[idx]
        score = scores[idx]
        print(f"  {word:20} -> {score:.4f}")



Document 1: I love natural language processing
------------------------------------------------------------
Top 5 words by TF-IDF score:
  love                 -> 0.5806
  processing           -> 0.4701
  natural              -> 0.4701
  language             -> 0.4701
  related              -> 0.0000

Document 2: Natural language processing is fascinating
------------------------------------------------------------
Top 5 words by TF-IDF score:
  is                   -> 0.5571
  fascinating          -> 0.5571
  processing           -> 0.3556
  natural              -> 0.3556
  language             -> 0.3556

Document 3: I love machine learning and NLP
------------------------------------------------------------
Top 5 words by TF-IDF score:
  nlp                  -> 0.5356
  machine              -> 0.4222
  love                 -> 0.4222
  learning             -> 0.4222
  and                  -> 0.4222

Document 4: Machine learning and natural language processing are related
------------

## TF-IDF with Custom Parameters

Using n-grams and custom parameters.
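
An n-gram is a run of `n` consecutive tokens. A minimal sketch of how unigram and bigram features could be generated from a token list (the `ngrams` helper is written for this example and is not part of scikit-learn):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of the first document after stopword removal
doc_tokens = ["love", "natural", "language", "processing"]

# Equivalent in spirit to ngram_range=(1, 2): unigrams plus bigrams
features = ngrams(doc_tokens, 1) + ngrams(doc_tokens, 2)
print(features)
```

Bigrams such as "natural language" let the model treat a phrase as a single feature, at the cost of a larger vocabulary.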


In [16]:
# TF-IDF with bigrams and custom parameters
tfidf_bigram = TfidfVectorizer(
    ngram_range=(1, 2),  # Unigrams and bigrams
    max_features=20,  # Top 20 features
    stop_words='english',
    min_df=1,
    max_df=0.8
)

tfidf_bigram_matrix = tfidf_bigram.fit_transform(documents)
vocab_bigram = tfidf_bigram.get_feature_names_out()

print("TF-IDF with Bigrams:")
print(f"\nVocabulary (first 20 features): {vocab_bigram}")
print(f"\nMatrix shape: {tfidf_bigram_matrix.shape}")

df_bigram = pd.DataFrame(tfidf_bigram_matrix.toarray(), columns=vocab_bigram,
                        index=[f"Doc {i+1}" for i in range(len(documents))])
print("\nTF-IDF Matrix with Bigrams:")
print(df_bigram.round(3))


TF-IDF with Bigrams:

Vocabulary (first 20 features): ['fascinating' 'language' 'language processing' 'learning'
 'learning natural' 'learning nlp' 'love' 'love machine' 'love natural'
 'machine' 'machine learning' 'natural' 'natural language' 'nlp'
 'processing' 'processing fascinating' 'processing related' 'related']

Matrix shape: (4, 18)

TF-IDF Matrix with Bigrams:
       fascinating  language  language processing  learning  learning natural  \
Doc 1        0.000     0.334                0.334     0.000             0.000   
Doc 2        0.498     0.318                0.318     0.000             0.000   
Doc 3        0.000     0.000                0.000     0.337             0.000   
Doc 4        0.000     0.243                0.243     0.300             0.381   

       learning nlp   love  love machine  love natural  machine  \
Doc 1         0.000  0.412         0.000         0.523    0.000   
Doc 2         0.000  0.000         0.000         0.000    0.000   
Doc 3         0.427 

# Word Embeddings

**Word Embeddings** are dense vector representations of words in a continuous vector space. Unlike sparse representations (like BoW or TF-IDF), embeddings capture semantic relationships between words.
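
Semantic closeness between embeddings is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from data and typically have 50 to several hundred dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not taken from any trained model
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(f"cat vs dog: {cosine_similarity(toy_vectors['cat'], toy_vectors['dog']):.3f}")
print(f"cat vs car: {cosine_similarity(toy_vectors['cat'], toy_vectors['car']):.3f}")
```

Vectors pointing in similar directions score close to 1, so "cat" and "dog" come out far more similar than "cat" and "car" in this toy setup.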

## Types of Word Embeddings

1. **Word2Vec**: Predicts words from context (CBOW) or context from words (Skip-gram)
2. **GloVe**: Global Vectors for Word Representation using co-occurrence statistics
3. **FastText**: Extends Word2Vec with subword information
4. **Contextual Embeddings**: BERT, ELMo, GPT (capture context-dependent meanings)


In [17]:
try:
    import gensim
    from gensim.models import Word2Vec
    from gensim.downloader import load
    
    print("Gensim is available. Loading pre-trained GloVe embeddings...")
    
    # Load pre-trained GloVe embeddings (smaller model for demonstration)
    # This will download the model if not already present
    try:
        glove_model = load('glove-wiki-gigaword-50')
        print("✓ GloVe model loaded successfully!")
        print(f"\nVocabulary size: {len(glove_model.key_to_index)}")
        print(f"Vector dimensions: {glove_model.vector_size}")
        
        # Example: Get word vector
        word = "king"
        if word in glove_model:
            vector = glove_model[word]
            print(f"\nVector for '{word}' (first 10 dimensions): {vector[:10]}")
            print(f"Vector shape: {vector.shape}")
        
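    except Exception as e:
        # gensim imported fine at this point, so a failure here is a download or loading problem
        print(f"Error loading GloVe model: {e}")
        print("\nNote: The model is downloaded on first use, so an internet connection is required.")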
        
except ImportError:
    print("Gensim is not installed.")
    print("\nTo use word embeddings, install gensim:")
    print("  pip install gensim")
    print("\nAlternatively, you can use other embedding libraries like:")
    print("  - spaCy (spacy.io)")
    print("  - transformers (Hugging Face)")


Gensim is not installed.

To use word embeddings, install gensim:
  pip install gensim

Alternatively, you can use other embedding libraries like:
  - spaCy (spacy.io)
  - transformers (Hugging Face)


## Finding Similar Words

Using word embeddings to find semantically similar words.


In [18]:
try:
    if 'glove_model' in locals():
        # Find most similar words
        test_words = ["king", "computer", "beautiful", "happy"]
        
        for word in test_words:
            if word in glove_model:
                print(f"\nWords similar to '{word}':")
                similar = glove_model.most_similar(word, topn=5)
                for similar_word, score in similar:
                    print(f"  {similar_word:20} (similarity: {score:.4f})")
            else:
                print(f"'{word}' not found in vocabulary")
    else:
        print("GloVe model not loaded. Please run the previous cell first.")
except NameError:
    print("GloVe model not available. Please install gensim and run the previous cell.")


GloVe model not loaded. Please run the previous cell first.


## Word Analogies

Demonstrating word relationships using embeddings.
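
The arithmetic behind an analogy can be illustrated without a trained model. The 2-dimensional vectors below are invented so that one axis loosely encodes "royalty" and the other "gender"; the `analogy` helper is a sketch, and gensim's `most_similar` may differ in details such as how vectors are normalized.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors: [royalty, gender]
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Solve a - b + c by nearest cosine neighbour, skipping the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy))
```

Subtracting "man" removes the gender component and adding "woman" restores it with the opposite sign, so the result lands on the vector chosen for "queen".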


In [19]:
try:
    if 'glove_model' in locals():
        # Word analogy: king - man + woman = ?
        # This should give us "queen"
        analogy_words = ["king", "man", "woman"]
        
        if all(word in glove_model for word in analogy_words):
            result = glove_model.most_similar(positive=["king", "woman"], 
                                           negative=["man"], topn=5)
            
            print("Word Analogy: king - man + woman = ?")
            print("-" * 50)
            for word, score in result:
                print(f"  {word:20} (score: {score:.4f})")
            
            print("\n" + "="*50)
            
            # Another analogy: computer - machine + human = ?
            analogy_words2 = ["computer", "machine", "human"]
            if all(word in glove_model for word in analogy_words2):
                result2 = glove_model.most_similar(positive=["computer", "human"], 
                                                 negative=["machine"], topn=5)
                
                print("Word Analogy: computer - machine + human = ?")
                print("-" * 50)
                for word, score in result2:
                    print(f"  {word:20} (score: {score:.4f})")
        else:
            print("Some words not found in vocabulary")
    else:
        print("GloVe model not available. Please install gensim and run the previous cell.")
except NameError:
    print("GloVe model not available. Please install gensim and run the previous cell.")


GloVe model not available. Please install gensim and run the previous cell.


## Training Custom Word2Vec Model

Creating a simple Word2Vec model from your own text data.


In [20]:
try:
    from gensim.models import Word2Vec
    
    # Sample sentences (tokenized)
    sentences = [
        ["natural", "language", "processing", "is", "fascinating"],
        ["machine", "learning", "and", "nlp", "are", "related"],
        ["word", "embeddings", "capture", "semantic", "relationships"],
        ["text", "preprocessing", "is", "important", "for", "nlp"],
        ["tokenization", "stemming", "and", "lemmatization", "are", "techniques"],
        ["named", "entity", "recognition", "identifies", "entities"],
        ["bag", "of", "words", "represents", "text", "as", "vectors"],
        ["tf", "idf", "weights", "words", "by", "importance"]
    ]
    
    # Train Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
    print("Custom Word2Vec Model Trained!")
    print(f"Vocabulary size: {len(model.wv.key_to_index)}")
    print(f"Vector dimensions: {model.wv.vector_size}")
    
    # Find similar words
    if "nlp" in model.wv:
        print("\nWords similar to 'nlp':")
        similar = model.wv.most_similar("nlp", topn=5)
        for word, score in similar:
            print(f"  {word:20} (similarity: {score:.4f})")
    
    if "word" in model.wv:
        print("\nWords similar to 'word':")
        similar = model.wv.most_similar("word", topn=5)
        for word, score in similar:
            print(f"  {word:20} (similarity: {score:.4f})")
            
except ImportError:
    print("Gensim is not installed. Install it with: pip install gensim")


Gensim is not installed. Install it with: pip install gensim


# Summary

This notebook covered:

1. **Tokenization**: Breaking text into sentences and words
2. **Stemming**: Reducing words to their root forms
3. **Lemmatization**: More accurate word reduction using POS tags
4. **Named Entity Recognition**: Identifying entities in text
5. **Text Preprocessing**: Complete pipeline for cleaning text
6. **Bag of Words**: Simple word count representation
7. **TF-IDF**: Weighted word importance representation
8. **Word Embeddings**: Dense vector representations capturing semantics

## Next Steps

- Explore transformer models (BERT, GPT)
- Experiment with spaCy for advanced NLP
- Build text classification models
- Work with sequence models (RNNs, LSTMs, Transformers)
