# NLP Course – Session 1

Welcome to the first session of our NLP course! In this notebook, we'll cover:

1. **Characters, Words, and Sentences**  
2. **Text Normalization**  
3. **Tokenization** (Whitespace, Standard, Subword, and Tweet Tokenizer)

Let's get started!


In [1]:
!pip install nltk tokenizers
!pip install transformers


In [2]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/fabian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/fabian/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/fabian/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/fabian/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
# -------------------
# SECTION 1: CHARACTERS, WORDS, SENTENCES
# -------------------

# Basic string operations to illustrate how Python handles text

sample_text = "Hello World! This is an example. Isn't it great to explore NLP?"

print("Original Text:")
print(sample_text)
print()

# 1. Length of the string (number of characters)
print("Length of the text (in characters):", len(sample_text))

# 2. Accessing individual characters
print("First character:", sample_text[0])
print("First 5 characters:", sample_text[:5])
print()

# 3. Splitting into words naively (by whitespace)
words_naive = sample_text.split()
print("Naive split into words by whitespace:")
print(words_naive)
print()

# 4. Splitting into sentences (very naively by '.')
sentences_naive = sample_text.split('.')
print("Naive split into sentences by '.' :")
print(sentences_naive)
print()


Original Text:
Hello World! This is an example. Isn't it great to explore NLP?

Length of the text (in characters): 63
First character: H
First 5 characters: Hello

Naive split into words by whitespace:
['Hello', 'World!', 'This', 'is', 'an', 'example.', "Isn't", 'it', 'great', 'to', 'explore', 'NLP?']

Naive split into sentences by '.' :
['Hello World! This is an example', " Isn't it great to explore NLP?"]
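
The split on `'.'` above drops the period and ignores `!` and `?`. As a slightly better (still naive) sketch, a regular expression can split after any of the three end marks while keeping them. The pattern is an illustrative assumption; it will still break on abbreviations such as "Dr. Smith".

```python
import re

sample_text = "Hello World! This is an example. Isn't it great to explore NLP?"

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
sentences_regex = re.split(r"(?<=[.!?])\s+", sample_text)

print(sentences_regex)
# ['Hello World!', 'This is an example.', "Isn't it great to explore NLP?"]
```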



In [4]:
# -------------------
# SECTION 2: TEXT NORMALIZATION
# -------------------
# We'll demonstrate the use of Porter Stemmer and WordNet Lemmatizer from NLTK.

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Make sure you have the relevant NLTK data downloaded:
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

example_words = ["studies", "studying", "studied", "leaves", "leaving", "better"]

print("Porter Stemmer vs. WordNet Lemmatizer:\n")
for word in example_words:
    stem = porter.stem(word)
    lemma = lemmatizer.lemmatize(word)  # default is noun lemmatization
    print(f"Word: {word:<10} | Stem: {stem:<10} | Lemma: {lemma:<10}")

# Note that the lemmatizer may need part-of-speech tags for full accuracy. 
# For example, 'studying' can lemmatize differently as a verb vs. a noun.


Porter Stemmer vs. WordNet Lemmatizer:

Word: studies    | Stem: studi      | Lemma: study     
Word: studying   | Stem: studi      | Lemma: studying  
Word: studied    | Stem: studi      | Lemma: studied   
Word: leaves     | Stem: leav       | Lemma: leaf      
Word: leaving    | Stem: leav       | Lemma: leaving   
Word: better     | Stem: better     | Lemma: better    
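
Stems such as `studi` and `leav` are not dictionary words because a stemmer only strips suffixes by rule, without looking anything up. The sketch below is a deliberately crude suffix stripper to illustrate that idea. It is not the Porter algorithm; the function name `toy_stem` and its suffix list are assumptions made up for this example.

```python
def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    # Longer suffixes are tried first so that 'ies' wins over 's'.
    for suffix in ("ies", "ied", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


example_words = ["studies", "studying", "studied", "leaves", "leaving", "better"]
toy_stems = [toy_stem(word) for word in example_words]

for word, stem in zip(example_words, toy_stems):
    print(f"Word: {word:<10} | Toy stem: {stem}")
```

Note that this toy version maps `studies` and `studying` to different stems. Real stemmers apply many more rules to bring related forms together.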


In [5]:
# -------------------
# SECTION 3: TOKENIZATION
# -------------------
# We'll look at several different approaches:
# 1. Simple whitespace splitting
# 2. Standard tokenizer (nltk.word_tokenize)
# 3. Subword tokenization (WordPiece, using the Hugging Face BERT tokenizer)
# 4. Twitter tokenizer (nltk.TweetTokenizer)

# We'll create a sample text including emojis, hashtags, etc.

text_for_tokenization = (
    "I'm soooo excited!!! #NLP is awesome. Check this out: https://example.com "
    "😀🔥 #fun @user"
)

print("Sample text for tokenization:")
print(text_for_tokenization)
print()

# --- 3.1 Simple whitespace splitting ---
whitespace_tokens = text_for_tokenization.split()
print("1) Whitespace splitting:")
print(whitespace_tokens)
print()


Sample text for tokenization:
I'm soooo excited!!! #NLP is awesome. Check this out: https://example.com 😀🔥 #fun @user

1) Whitespace splitting:
["I'm", 'soooo', 'excited!!!', '#NLP', 'is', 'awesome.', 'Check', 'this', 'out:', 'https://example.com', '😀🔥', '#fun', '@user']



In [6]:
# --- 3.2 Standard Tokenizer (NLTK) ---
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(text_for_tokenization)
print("2) NLTK's Standard Tokenizer (word_tokenize):")
print(nltk_tokens)
print()


2) NLTK's Standard Tokenizer (word_tokenize):
['I', "'m", 'soooo', 'excited', '!', '!', '!', '#', 'NLP', 'is', 'awesome', '.', 'Check', 'this', 'out', ':', 'https', ':', '//example.com', '😀🔥', '#', 'fun', '@', 'user']



In [7]:
# --- 3.3 Subword Tokenization ---
# Requires: pip install transformers

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = bert_tokenizer.tokenize(text_for_tokenization)

print("3) Subword Tokenization (BERT):")
print(subword_tokens)
print()


3) Subword Tokenization (BERT):
['i', "'", 'm', 'soo', '##oo', 'excited', '!', '!', '!', '#', 'nl', '##p', 'is', 'awesome', '.', 'check', 'this', 'out', ':', 'https', ':', '/', '/', 'example', '.', 'com', '[UNK]', '#', 'fun', '@', 'user']
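
The `##` prefix in the output above marks a piece that continues the current word. BERT's tokenizer produces these pieces with WordPiece, which at tokenization time matches the longest vocabulary entry greedily from left to right. The sketch below shows that matching step on a hand-picked toy vocabulary (an assumption for illustration, not BERT's real vocabulary); the function name `wordpiece` is also hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"soo", "##oo", "nl", "##p", "excited"}

print(wordpiece("soooo", toy_vocab))
print(wordpiece("nlp", toy_vocab))
print(wordpiece("😀🔥", toy_vocab))
```

With this toy vocabulary the emoji pair has no matching piece at all, which is the same reason the real tokenizer emits `[UNK]` for it.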



In [8]:
# --- 3.4 Twitter Tokenizer (NLTK) ---
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(text_for_tokenization)

print("4) NLTK's Twitter Tokenizer (TweetTokenizer):")
print(tweet_tokens)


4) NLTK's Twitter Tokenizer (TweetTokenizer):
["I'm", 'soooo', 'excited', '!', '!', '!', '#NLP', 'is', 'awesome', '.', 'Check', 'this', 'out', ':', 'https://example.com', '😀', '🔥', '#fun', '@user']


## Task 1: Comparing the Different Tokenization Outputs

1. **Whitespace** – Splits only on whitespace, so punctuation stays attached to words and adjacent emojis stay glued together.  
2. **NLTK Standard** – Separates punctuation and splits contractions (`I'm` → `I`, `'m`), but also breaks up hashtags, mentions, and URLs.  
3. **Subword** – Splits into subword units (useful for handling unknown words and morphological variations).  
4. **Tweet Tokenizer** – Specifically designed for social media text, handling hashtags, mentions, emojis, etc.
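
To make the idea behind item 4 concrete, here is a minimal regex sketch that keeps URLs, hashtags, and mentions in one piece and separates every other symbol. It is not how NLTK's `TweetTokenizer` is implemented; the pattern and the name `simple_social_tokenize` are assumptions for illustration only.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+      # URLs up to the next whitespace
  | [#@]\w+           # hashtags and mentions
  | \w+(?:'\w+)*      # words, including contractions such as I'm
  | [^\w\s]           # any other single symbol (punctuation, emoji)
    """,
    re.VERBOSE,
)


def simple_social_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


text = (
    "I'm soooo excited!!! #NLP is awesome. Check this out: https://example.com "
    "😀🔥 #fun @user"
)
simple_tokens = simple_social_tokenize(text)
print(simple_tokens)
```

On this particular sentence the sketch yields the same tokens as the `TweetTokenizer` output shown earlier. It does not handle emoticons such as `:-)` or multi-codepoint emojis.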


In [9]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
from transformers import BertTokenizer

# Make sure you have these downloaded if you haven't already:
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Sample text with hashtags, emojis, URL, etc.
text_for_tokenization = (
    "I'm soooo excited!!! :-) #NLP is awesome. Check this out: https://example.com "
    "😀🔥 #fun @user"
)

# 1) Simple Whitespace Split

# 2) NLTK Standard Tokenizer

# 3) Subword Tokenizer (BertTokenizer)

# 4) NLTK Tweet Tokenizer


# compare the tokenized texts: What is different?


## Task 2: Byte Pair Encoding
To better understand what is happening in BPE, let's implement BPE ourselves.

1. Complete the function byte_pair_encoding.
2. Show the vocabulary of the encoded result.

In [10]:
# Tip: You can use the following code to find the key with the highest value in a dictionary
# (for example, the most frequent pair in a dictionary of pair counts):
sample_dict = {'a': 1, 'b': 2, 'c': 3}
print(max(sample_dict, key=sample_dict.get))


c


In [None]:
from collections import defaultdict
# defaultdict(int) returns 0 for missing keys, so counts can be incremented without initializing them first

def get_pair_stats(tokenized_text):
    """
    Given a tokenized text (a list of token lists, one per input string), return a
    dictionary mapping each adjacent token pair to its frequency (count of occurrences).
    """
    pairs = defaultdict(int)
    for tokens in tokenized_text:
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i+1])
            pairs[pair] += 1
    return pairs

def merge_tokens(pair_to_merge, tokenized_text):
    """
    Merge all occurrences of 'pair_to_merge' in the tokenized_text.
    
    pair_to_merge: a tuple of two tokens to be merged.
    tokenized_text: list of lists of tokens.
    """
    new_token = "".join(pair_to_merge)
    merged_text = []
    for tokens in tokenized_text:
        merged_tokens = []
        skip_next = False
        for i in range(len(tokens)):
            if skip_next:
                skip_next = False
                continue
            
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == pair_to_merge:
                # Merge the two tokens
                merged_tokens.append(new_token)
                skip_next = True
            else:
                merged_tokens.append(tokens[i])
        merged_text.append(merged_tokens)
    return merged_text



def byte_pair_encoding(texts, max_merges=10):
    """
    Performs Byte Pair Encoding (BPE) on a list of text strings.
    
    1. Tokenize each text string initially by splitting into characters
       (you might adjust this to split by subwords, add special symbols, etc.).
    2. Iteratively find and merge the most frequent pair up to max_merges times.
    
    :param texts: List of raw text strings to be encoded.
    :param max_merges: The maximum number of pair merges to perform (stopping criterion).

    :return: (encoded_texts, merges)
       - encoded_texts: Final tokenized text after BPE merges.
       - merges: List of merges performed (in order).
    """

    tokenized_text = []
    merges = []

    # 1. Tokenize the text into characters
    for text in texts:
        # Convert to list of characters; you could also treat whitespace or special tokens differently
        # E.g. text: "hello world" -> ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
        pass  # TODO: replace with your code

    # 2. Find the most frequent pair in the entire corpus, for max_merges iterations

        # Sort pairs by frequency (highest first)

        # 3. Merge that pair throughout the tokenized text

    return tokenized_text, merges



# Example usage:
sample_texts = [
    "Wild wild west is a movie",
    "Kanye West is a rapper",
    "A good movie is a good movie",
    "A Wrapper is a person who wraps gifts",
    "A Burrito is a Mexican wrap per se",
    "Per perdes means by foot"
]

# Perform BPE with up to 10 merges
encoded_result, performed_merges = byte_pair_encoding(sample_texts, max_merges=10)

print("Final Tokenized Text:")
for line in encoded_result:
    print(line)

print("\nMerges Performed:")
print(performed_merges)
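
If you want to check your solution, the following is one possible way to fill in the loop, written as a self-contained sketch. The names `count_pairs`, `merge_pair`, and `bpe_sketch` are hypothetical stand-ins for the functions above, and the tie-breaking (the first pair seen wins when counts are equal) is an assumption that follows from how `max` behaves, not a requirement of BPE.

```python
from collections import Counter


def count_pairs(tokenized_text):
    """Count every adjacent token pair across all token lists."""
    pairs = Counter()
    for tokens in tokenized_text:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs


def merge_pair(pair, tokenized_text):
    """Replace each non-overlapping occurrence of `pair` with the joined token."""
    merged_text = []
    for tokens in tokenized_text:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_text.append(merged)
    return merged_text


def bpe_sketch(texts, max_merges=10):
    """Character-level BPE: repeatedly merge the most frequent adjacent pair."""
    tokenized_text = [list(text) for text in texts]
    merges = []
    for _ in range(max_merges):
        pairs = count_pairs(tokenized_text)
        if not pairs:
            break  # nothing left to merge
        best_pair = max(pairs, key=pairs.get)
        merges.append(best_pair)
        tokenized_text = merge_pair(best_pair, tokenized_text)
    return tokenized_text, merges


encoded, merges = bpe_sketch(["low lower lowest"], max_merges=2)
vocabulary = sorted({token for tokens in encoded for token in tokens})

print(encoded[0])
print(merges)
print(vocabulary)
```

Because spaces are treated as ordinary characters here, merges can cross word boundaries. Many practical BPE implementations split on whitespace first to prevent that.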
