<a href="https://colab.research.google.com/github/Krishnan-Raghavan/Packt/blob/main/DataCleaningAndPreparationChapter12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install required libraries

In [1]:
!pip install transformers==4.42.4
!pip install beautifulsoup4==4.12.3
!pip install langchain-text-splitters==0.2.2
!pip install tiktoken==0.7.0
!pip install langchain==0.2.10
!pip install langchain-experimental==0.0.62
!pip install langchain-huggingface==0.0.3
!pip install presidio_analyzer==2.2.355
!pip install presidio_anonymizer==2.2.355
!pip install rapidfuzz==3.9.4 thefuzz==0.22.1
!pip install stanza==1.8.2
!pip install tf-keras==2.17.0


Text Cleaning

In [3]:
from bs4 import BeautifulSoup
from transformers import BertTokenizer

# Sample user reviews
reviews = [
    "<html>This product    is <b>amazing!</b></html>",
    "The product is good, but it could be better!!!",
    "I've never seen such a terrible      product. 0/10",
    "The product is AWESOME!!! Highly recommended!",
]

# a. Removing HTML tags (special characters are handled in step c)
def clean_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# b. Handling Capitalization and Letter Case
def standardize_case(text):
    return text.lower()

# c. Dealing with Numerical Values and Symbols
def remove_numbers_and_symbols(text):
    return ''.join(e for e in text if e.isalpha() or e.isspace())

# d. Addressing Whitespace and Formatting Issues
def remove_extra_whitespace(text):
    return ' '.join(text.split())


# Applying the text preprocessing pipeline
def preprocess_text(text):
    text = clean_html_tags(text)
    text = standardize_case(text)
    text = remove_numbers_and_symbols(text)
    text = remove_extra_whitespace(text)
    return text

# Preprocess all reviews
preprocessed_reviews = [preprocess_text(review) for review in reviews]

print("Original Reviews:")
for review in reviews:
    print(f"- {review}")

print("\nPreprocessed Reviews:")
for preprocessed_review in preprocessed_reviews:
    print(f"- {preprocessed_review}")

Original Reviews:
- <html>This product    is <b>amazing!</b></html>
- The product is good, but it could be better!!!
- I've never seen such a terrible      product. 0/10
- The product is AWESOME!!! Highly recommended!

Preprocessed Reviews:
- this product is amazing
- the product is good but it could be better
- ive never seen such a terrible product
- the product is awesome highly recommended
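
The pipeline above drops every digit, so the "0/10" rating in the third review is lost. A minimal sketch of one way to keep it, assuming ratings are written as `N/M`: the helper names `extract_rating` and `clean_keep_rating` are hypothetical, and a crude regular expression stands in for BeautifulSoup so the example needs only the standard library.

```python
import re

# Assumption: ratings look like "0/10" or "4 / 5"
RATING_PATTERN = re.compile(r"\b(\d{1,2})\s*/\s*(\d{1,2})\b")

def extract_rating(text):
    """Return (score, scale) if the text contains a rating like '0/10', else None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def clean_keep_rating(text):
    """Apply the same cleaning steps, but pull the rating out before digits are removed."""
    rating = extract_rating(text)
    text = re.sub(r"<[^>]+>", " ", text)  # crude tag removal (stand-in for BeautifulSoup)
    text = text.lower()
    text = "".join(c for c in text if c.isalpha() or c.isspace())
    text = " ".join(text.split())
    return text, rating

cleaned, rating = clean_keep_rating("I've never seen such a terrible      product. 0/10")
print(cleaned)
print(rating)
```

Whether the rating should be kept as a separate feature or discarded depends on the downstream task; the point is only that the order of the steps decides what information survives.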




Punctuation

In [4]:
import string

# Sample text
text = "I love this product!!! It's amazing!!!"


# Option 1: Replace symbols and punctuation
replaced_text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
print("Replaced Text:", replaced_text)

# Option 2: Remove symbols and punctuation
removed_text = "".join(char for char in text if char.isalnum() or char.isspace())
print("Removed Text:", removed_text)

Replaced Text: I love this product    It s amazing   
Removed Text: I love this product Its amazing
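
Both options above discard the apostrophe in "It's". A possible third option, sketched here under the assumption that only repeated marks are noise, collapses runs of the same punctuation character and leaves everything else alone. The function name `collapse_repeated_punctuation` is made up for this example.

```python
import re

def collapse_repeated_punctuation(text):
    """Reduce runs of the same punctuation mark (e.g. '!!!') to a single mark."""
    return re.sub(r"([!?.,])\1+", r"\1", text)

sample = "I love this product!!! It's amazing!!!"
print(collapse_repeated_punctuation(sample))
```

Note that this also turns an ellipsis ("...") into a single period, which may or may not be wanted.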


Personally Identifiable Information

In [5]:
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Sample DataFrame
data = {
    'text': [
        "Hello, my name is John Doe. My email is john.doe@example.com",
        "Contact Jane Smith at jane.smith@work.com",
        "Call her at 987-654-3210.",
        "This is a test message without PII."
    ]
}

df = pd.DataFrame(data)

# Initialize the analyzer and anonymizer engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize_text(text):
    """ Anonymize PII entities in text """
    # Analyze the text to detect PII entities
    analyzer_results = analyzer.analyze(text=text, entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"], language="en")

    # Define the anonymization configuration
    operators = {
        "PERSON": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 4, "from_end": True}),
        "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 5, "from_end": True}),
        "PHONE_NUMBER": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 6, "from_end": True})
    }

    # Anonymize the detected PII entities
    anonymized_result = anonymizer.anonymize(text=text, analyzer_results=analyzer_results, operators=operators)

    return anonymized_result.text

# Apply the anonymization function to the DataFrame
df['anonymized_text'] = df['text'].apply(anonymize_text)

# Display the DataFrame
print(df['anonymized_text'])

0    Hello, my name is John****. My email is john.d...
1            Contact Jane S**** at jane.smith@wor*****
2                            Call her at 987-65******.
3                  This is a test message without PII.
Name: anonymized_text, dtype: object
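
For comparison, a pattern-based fallback can be sketched with the standard library alone. This is not how Presidio works internally; it is a rough illustration, and the two regular expressions and the `redact_with_regex` helper are assumptions made for this example. It handles well-structured entities such as email addresses and phone numbers, but it cannot find names, which is where an NER-backed analyzer earns its keep.

```python
import re

# Assumed patterns: simple email addresses and 3-3-4 digit phone numbers
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_with_regex(text):
    """Replace emails and phone numbers with placeholder labels."""
    text = EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    text = PHONE_RE.sub("<PHONE_NUMBER>", text)
    return text

samples = [
    "Hello, my name is John Doe. My email is john.doe@example.com",
    "Call her at 987-654-3210.",
    "This is a test message without PII.",
]
for sample in samples:
    print(redact_with_regex(sample))
```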


Dealing With Rare Words

In [6]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define a text prompt with a rare word
text = "The quokka, a rare marsupial,"

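# Encode the input text to tensors (input IDs plus attention mask)
inputs = tokenizer(text, return_tensors='pt')

# Generate text until the output length reaches 50 tokens.
# GPT-2 has no pad token, so the EOS token is passed as the pad token ID.
output_text = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tokenizer.eos_token_id,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)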

# Decode the output text
output_text_decoded = tokenizer.decode(output_text[0], skip_special_tokens=True)
print(output_text_decoded)

The quokka, a rare marsupial, is one of the world's most endangered species.

"It's a very rare species, but it's one that has been around for thousands of years," said Dr. John D.


Spell Checker

In [7]:
from transformers import pipeline

def fix_spelling(text):
    # Initialize the spelling correction pipeline
    spell_check = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")

    # Generate the corrected text
    corrected = spell_check(text, max_length=2048)[0]['generated_text']

    return corrected

# Test the function with some sample text containing spelling mistakes
sample_text = "y name si from Grece."
corrected_text = fix_spelling(sample_text)

print("Original text:", sample_text)
print("Corrected text:", corrected_text)

Original text: y name si from Grece.
Corrected text: My name is from Greece.


Fuzzy Matching

In [9]:
!pip install thefuzz


In [10]:
from transformers import pipeline
from thefuzz import process, fuzz

def fix_spelling(text, threshold=80):
    # Initialize the spelling correction pipeline
    spell_check = pipeline("text2text-generation", model="oliverguhr/spelling-correction-english-base")

    # Generate the corrected text
    corrected = spell_check(text, max_length=2048)[0]['generated_text']

    # Split the original and corrected texts into words
    original_words = text.split()
    corrected_words = corrected.split()

    # Create a set of common English words, lowercased to match the lookup below (you can expand this list)
    common_words = {'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 'do', 'at'}

    # Fuzzy match each word
    final_words = []
    for orig, corr in zip(original_words, corrected_words):
        if orig.lower() in common_words:
            final_words.append(orig)  # Keep common words as they are
        else:
            # Use fuzzy matching to find the best match
            matches = process.extractOne(orig, [corr], scorer=fuzz.ratio)
            if matches[1] >= threshold:
                final_words.append(matches[0])
            else:
                final_words.append(orig)  # Keep the original word if no good match found

    return ' '.join(final_words)

# Test the function with some sample text containing spelling mistakes
sample_text = "Lets do a copmarsion of speling mistaks in this sentense."
corrected_text = fix_spelling(sample_text)

print("Original text:", sample_text)
print("Corrected text:", corrected_text)

Original text: Lets do a copmarsion of speling mistaks in this sentense.
Corrected text: Let's do a comparison of speling mistaks in this sentence.
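
The threshold check is the part that decides whether a model suggestion is trusted. A small sketch of that idea, using `difflib.SequenceMatcher` from the standard library in place of `thefuzz` (the two scores are similar in spirit but not identical); `similarity` and `accept_correction` are hypothetical helper names.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale, loosely comparable to a fuzzy ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def accept_correction(original, candidate, threshold=80):
    """Accept the candidate only when it stays close to the original word."""
    return candidate if similarity(original, candidate) >= threshold else original

print(accept_correction("speling", "spelling"))  # close to the original, so accepted
print(accept_correction("Grece", "Germany"))     # too different, so the original is kept
```

A higher threshold makes the function more conservative: fewer wrong substitutions, but more misspellings left untouched.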


Fixed Length Chunking

In [11]:
# Step 1: Load Example Data
reviews = [
    "This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase.",
    "I was disappointed with the laptop's performance. It frequently lags and the battery life is shorter than expected.",
    "The blender works great for making smoothies. It's powerful and easy to clean. Definitely worth the price.",
    "Customer support was unresponsive. I had to wait a long time for a reply, and my issue was not resolved satisfactorily.",
    "The book is a fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all readers."
]

# Step 2: Create the TokenTextSplitter
from langchain_text_splitters import TokenTextSplitter

# Initialize the TokenTextSplitter with a chunk size of 50 tokens and no overlap
text_splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=0)

# Step 3: Join Reviews and Split Text
# Combine the reviews into a single text block for chunking
text_block = " ".join(reviews)

# Split the text into token-based chunks
chunks = text_splitter.split_text(text_block)

# Print the chunks
print("Chunks with 50 tokens each:")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("\n")

# Step 4: Experiment with Different Chunk Sizes
chunk_sizes = [20, 70, 150]

for size in chunk_sizes:
    print(f"Chunk Size: {size}")
    text_splitter = TokenTextSplitter(chunk_size=size, chunk_overlap=0)
    chunks = text_splitter.split_text(text_block)

    for i, chunk in enumerate(chunks):
        print(f"Chunk {i + 1}:")
        print(chunk)
        print("\n")

Chunks with 50 tokens each:
Chunk 1:
This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase. I was disappointed with the laptop's performance. It frequently lags and the battery life is shorter than expected. The blender works


Chunk 2:
 great for making smoothies. It's powerful and easy to clean. Definitely worth the price. Customer support was unresponsive. I had to wait a long time for a reply, and my issue was not resolved satisfactorily. The book is a


Chunk 3:
 fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all readers.


Chunk Size: 20
Chunk 1:
This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very


Chunk 2:
 satisfied with my purchase. I was disappointed with the laptop's performance. It frequently lags and the


Chunk 3:
 battery life is shorter than expected. The blender works great for making s
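
The examples above all use `chunk_overlap=0`. What a non-zero overlap does can be sketched as a sliding window. This is an illustration only: whitespace-separated words stand in for the tokens that `TokenTextSplitter` counts, and `chunk_by_tokens` is a hypothetical helper, not part of LangChain.

```python
def chunk_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size tokens.

    Each window shares chunk_overlap tokens with the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

words = "The blender works great for making smoothies and is easy to clean".split()
chunks = chunk_by_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Overlap repeats some text across chunks, which costs storage but reduces the chance that a sentence is cut exactly where the relevant context sits.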

Recursive Character Chunking

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

reviews = [
    "This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase.",
    "I was disappointed with the laptop's performance. It frequently lags and the battery life is shorter than expected.",
    "The blender works great for making smoothies. It's powerful and easy to clean. Definitely worth the price.",
    "Customer support was unresponsive. I had to wait a long time for a reply, and my issue was not resolved satisfactorily.",
    "The book is a fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all readers."
]

# Combine the reviews into a single text block for chunking
text_block = " ".join(reviews)

# Create a RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=0,
    length_function=len
)

# Split the text into chunks
chunks = text_splitter.split_text(text_block)

# Print the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:")
    print(chunk.strip())
    print("-" * 50)

Chunk 1:
This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase. I was disappointed with the laptop's performance. It frequently lags
--------------------------------------------------
Chunk 2:
and the battery life is shorter than expected. The blender works great for making smoothies. It's powerful and easy to clean. Definitely worth the price. Customer support was unresponsive. I had to
--------------------------------------------------
Chunk 3:
wait a long time for a reply, and my issue was not resolved satisfactorily. The book is a fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all
--------------------------------------------------
Chunk 4:
readers.
--------------------------------------------------
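
The idea behind the `separators` list can be sketched as follows: split on the coarsest separator first, merge neighbouring pieces while they fit, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration written for this notebook, not LangChain's implementation, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Flush what has been collected, then retry this part with finer separators
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, finer, chunk_size))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

result = recursive_split("aaa bbb ccc\n\nddd eee fff ggg", ["\n\n", "\n", " ", ""], chunk_size=7)
print(result)
```

Because the review text contains no newlines, the run above effectively splits on spaces, which is why the chunks end mid-sentence.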


Semantic Chunking

In [13]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
import os

reviews = [
    "This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase.",
    "I was disappointed with the laptop's performance. It frequently lags and the battery life is shorter than expected.",
    "The blender works great for making smoothies. It's powerful and easy to clean. Definitely worth the price.",
    "Customer support was unresponsive. I had to wait a long time for a reply, and my issue was not resolved satisfactorily.",
    "The book is a fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all readers."
]
# Combine the reviews into a single text block for chunking
text_block = " ".join(reviews)

text_splitter = SemanticChunker(HuggingFaceEmbeddings())

docs = text_splitter.create_documents([text_block])

for i, doc in enumerate(docs):
    print(f"Chunk {i + 1}:")
    print(doc.page_content)
    print("\n")

Chunk 1:
This smartphone has an excellent camera. The photos are sharp and the colors are vibrant. Overall, very satisfied with my purchase. I was disappointed with the laptop's performance. It frequently lags and the battery life is shorter than expected. The blender works great for making smoothies. It's powerful and easy to clean.


Chunk 2:
Definitely worth the price. Customer support was unresponsive. I had to wait a long time for a reply, and my issue was not resolved satisfactorily. The book is a fascinating read. The storyline is engaging and the characters are well-developed. Highly recommend to all readers.
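
The core idea of semantic chunking is to start a new chunk where consecutive sentences stop resembling each other. A toy sketch of that breakpoint logic follows. The 2-D vectors are made up for illustration, and the fixed `threshold` is an assumption of this sketch; as far as I understand, `SemanticChunker` derives its threshold from the distribution of distances instead.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def split_at_semantic_breaks(sentences, vectors, threshold):
    """Start a new chunk whenever neighbouring sentences are further apart than threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["The camera is sharp.", "Photos look vibrant.", "The battery dies quickly."]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]  # made-up 2-D "embeddings"
semantic_chunks = split_at_semantic_breaks(sentences, vectors, threshold=0.5)
print(semantic_chunks)
```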




Word Tokenization

In [14]:
import nltk
from nltk.tokenize import word_tokenize

# Download the necessary NLTK data (run this once)
nltk.download('punkt')

# Sample text
text = "The quick brown fox jumps over the lazy dog. It's unaffordable!"

# Perform word tokenization
word_tokens = word_tokenize(text)

print("Word tokens:")
print(word_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word tokens:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'It', "'s", 'unaffordable', '!']
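
For contrast, a rough regular-expression tokenizer can be written in one line. It is only a sketch (the name `simple_word_tokenize` is made up), and it makes a different choice from NLTK: contractions such as "It's" stay together instead of being split into 'It' and "'s".

```python
import re

def simple_word_tokenize(text):
    """Words (optionally with one internal apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

regex_tokens = simple_word_tokenize("The quick brown fox jumps over the lazy dog. It's unaffordable!")
print(regex_tokens)
```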


Byte Pair Encoding

In [15]:
from tokenizers import Tokenizer

# Load the pre-trained GPT-2 BPE tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2")

# Sample text
text = "Tokenization in medical texts can include words like hyperlipidemia.."

# Tokenize the text
encoding = tokenizer.encode(text)

# Print the tokens
print("Tokens:", encoding.tokens)

# Print the token IDs
print("Token IDs:", encoding.ids)

# Decode the token IDs back to text
decoded_text = tokenizer.decode(encoding.ids)
print("Decoded Text:", decoded_text)

Tokens: ['Token', 'ization', 'Ġin', 'Ġmedical', 'Ġtexts', 'Ġcan', 'Ġinclude', 'Ġwords', 'Ġlike', 'Ġhyper', 'lip', 'id', 'emia', '..']
Token IDs: [30642, 1634, 287, 3315, 13399, 460, 2291, 2456, 588, 8718, 40712, 312, 22859, 492]
Decoded Text: Tokenization in medical texts can include words like hyperlipidemia..
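
How such a vocabulary comes about can be illustrated with a toy version of the BPE training loop: count adjacent symbol pairs, merge the most frequent pair, repeat. This sketch works on characters of a handful of words; GPT-2's tokenizer operates on bytes with its own pre-tokenization, so treat `learn_bpe_merges` as a hypothetical teaching aid rather than a reimplementation.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

Frequent fragments become single tokens, while a rare word such as "hyperlipidemia" is left as several smaller pieces, which matches the tokenizer output above.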


Wordpiece Tokenization

In [16]:
from transformers import BertTokenizer

# Load the pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "Tokenization in medical texts can include words like hyperlipidemia."


# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

Tokens: ['token', '##ization', 'in', 'medical', 'texts', 'can', 'include', 'words', 'like', 'hyper', '##lip', '##ide', '##mia', '.']
Input IDs: [19204, 3989, 1999, 2966, 6981, 2064, 2421, 2616, 2066, 23760, 15000, 5178, 10092, 1012]
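
The `##` pieces come from a greedy longest-match-first lookup against the vocabulary. A minimal sketch of that lookup, using a tiny hand-picked vocabulary (an assumption for illustration; the real `bert-base-uncased` vocabulary has roughly 30,000 entries) and a hypothetical `wordpiece_tokenize` function:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"hyper", "##lip", "##ide", "##mia", "token", "##ization"}
print(wordpiece_tokenize("hyperlipidemia", toy_vocab))
print(wordpiece_tokenize("tokenization", toy_vocab))
```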


Specialized Tokenizers

In [17]:
import stanza
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from collections import Counter
import numpy as np
import torch

# Initialize Stanza for biomedical text
stanza.download('en', package='mimic', processors='tokenize')
nlp = stanza.Pipeline('en', package='mimic', processors='tokenize')

# Initialize standard GPT-2 tokenizer
standard_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
standard_tokenizer.pad_token = standard_tokenizer.eos_token  # Set pad_token to eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id  # Set pad_token_id for the model

# Sample medical corpus
corpus = [
    "The patient suffered a myocardial infarction.",
    "Early detection of heart attack is crucial.",
    "Treatment for myocardial infarction includes medication.",
    "Patients with heart conditions require regular check-ups.",
    "Myocardial infarction can lead to severe complications."
]

def stanza_tokenize(text):
    doc = nlp(text)
    tokens = [word.text for sent in doc.sentences for word in sent.words]
    return tokens

def calculate_oov_and_compression(corpus, tokenizer):
    oov_count = 0
    total_tokens = 0
    all_tokens = []

    for sentence in corpus:
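        # Hugging Face tokenizers expose .tokenize(); the Stanza wrapper is a plain function
        tokens = tokenizer.tokenize(sentence) if hasattr(tokenizer, 'tokenize') else stanza_tokenize(sentence)
        all_tokens.extend(tokens)
        total_tokens += len(tokens)
        # Hugging Face tokenizers name their out-of-vocabulary token `unk_token`
        unk_token = getattr(tokenizer, 'unk_token', None)
        oov_count += tokens.count(unk_token) if unk_token is not None else 0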

    oov_rate = (oov_count / total_tokens) * 100 if total_tokens > 0 else 0
    avg_tokens_per_sentence = total_tokens / len(corpus)

    return oov_rate, avg_tokens_per_sentence, all_tokens

def analyze_token_utilization(tokens):
    token_counts = Counter(tokens)
    total_tokens = len(tokens)
    utilization = {token: count / total_tokens for token, count in token_counts.items()}
    return utilization

def calculate_perplexity(tokenizer, model, text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

# Evaluation
for tokenizer_name, tokenizer in [("Standard GPT-2", standard_tokenizer), ("Stanza Medical", stanza_tokenize)]:
    oov_rate, avg_tokens, all_tokens = calculate_oov_and_compression(corpus, tokenizer)
    utilization = analyze_token_utilization(all_tokens)

    print(f"\n{tokenizer_name} Tokenizer:")
    print(f"OOV Rate: {oov_rate:.2f}%")
    print(f"Average Tokens per Sentence: {avg_tokens:.2f}")
    print("Top 5 Most Used Tokens:")
    for token, freq in sorted(utilization.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {token}: {freq:.2%}")


# Example output for "myocardial infarction"
term = "myocardial infarction"
print(f"\nTokenizing '{term}':")
print(f"Standard GPT-2: {standard_tokenizer.tokenize(term)}")
print(f"Stanza Medical: {stanza_tokenize(term)}")

Standard GPT-2 Tokenizer:
OOV Rate: 0.00%
Average Tokens per Sentence: 10.80
Top 5 Most Used Tokens:
  .: 9.26%
  ocard: 5.56%
  ial: 5.56%
  Ġinf: 5.56%
  ar: 5.56%

Stanza Medical Tokenizer:
OOV Rate: 0.00%
Average Tokens per Sentence: 7.60
Top 5 Most Used Tokens:
  .: 13.16%
  infarction: 7.89%
  myocardial: 5.26%
  heart: 5.26%
  The: 2.63%

Tokenizing 'myocardial infarction':
Standard GPT-2: ['my', 'ocard', 'ial', 'Ġinf', 'ar', 'ction']
Stanza Medical: ['myocardial', 'infarction']
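
Both OOV rates are 0% because neither tokenizer ever emits an unknown token in this setup: byte-level BPE can always fall back to smaller pieces, and the Stanza tokenizer only splits text without checking a vocabulary. An OOV rate becomes informative once tokens are compared against a fixed word-level vocabulary. A small sketch, with a made-up `general_vocab` and a hypothetical `oov_rate` helper:

```python
def oov_rate(tokens, vocabulary):
    """Percentage of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in vocabulary)
    return 100.0 * unknown / len(tokens)

# Assumed general-domain vocabulary that lacks the medical terms
general_vocab = {"the", "patient", "suffered", "a", "heart", "attack", "."}
tokens = ["The", "patient", "suffered", "a", "myocardial", "infarction", "."]
print(f"OOV rate: {oov_rate(tokens, general_vocab):.2f}%")
```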


Embedding BERT

In [18]:
# Import necessary libraries
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input sentence
sentence = "BERT embeddings are very useful for natural language processing tasks."

# Tokenize the input sentence
inputs = tokenizer(sentence, return_tensors='pt')

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states (embeddings)
last_hidden_states = outputs.last_hidden_state

# Print the shape of the embeddings tensor
print("Shape of the embeddings tensor:", last_hidden_states.shape)

# Print the embeddings for the first token (CLS token)
cls_embedding = last_hidden_states[0, 0, :].numpy()
print("CLS token embedding:", cls_embedding)

# Print the embeddings for the first word
first_word_embedding = last_hidden_states[0, 1, :].numpy()
print("First word embedding:", first_word_embedding)

Shape of the embeddings tensor: torch.Size([1, 16, 768])
CLS token embedding: [-3.21215332e-01 -2.54888654e-01 -2.41873056e-01 -1.88670367e-01
 -5.91540039e-01 -3.44994932e-01  1.57489255e-01  2.16801450e-01
 -9.57079604e-02 -5.51995486e-02 -2.11107686e-01 -8.69692937e-02
 -2.51640141e-01  1.18354090e-01 -7.69019639e-03  1.99790195e-01
 -2.67407298e-01  7.36691356e-01  1.63221925e-01  2.63329037e-02
 -2.36481667e-01 -5.19480288e-01  1.22939847e-01 -2.96923876e-01
 -9.85415652e-02 -2.40907639e-01  9.38061550e-02 -4.28971559e-01
  1.93395093e-02  6.62799254e-02 -5.05006552e-01  5.39705157e-01
 -5.35447039e-02 -2.60150224e-01  7.31656432e-01  5.03807403e-02
  3.07886809e-01 -1.82147965e-01  4.43979919e-01 -3.25604305e-02
 -2.21726015e-01 -1.86660305e-01  4.38654006e-01 -3.08478484e-05
 -1.77291021e-01 -7.49238804e-02 -3.23982167e+00 -1.89109594e-01
 -5.21501899e-01 -4.00790900e-01 -3.22984934e-01 -5.96076325e-02
  6.03826046e-02  4.64077204e-01  3.60599369e-01  1.48977295e-01
 -2.27633342
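
The cell above reads the [CLS] vector as a summary of the sentence. A common alternative is to average the token vectors while skipping padding. The sketch below shows that masked mean on plain Python lists so it runs without a model; in practice the same arithmetic would be applied to `last_hidden_state` and the tokenizer's attention mask. `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Three token vectors; the last one is padding and must not affect the result
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool(token_vectors, attention_mask)
print(pooled)
```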

BGE Embedding

In [19]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Define the model name and parameters
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

# Initialize the embeddings model
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# Sample sentences to embed
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I love machine learning and natural language processing."
]

# Generate embeddings for each sentence
embeddings = [bge_embeddings.embed_query(sentence) for sentence in sentences]

# Print the embeddings
for i, embedding in enumerate(embeddings):
    print(f"Embedding for sentence {i+1}: {embedding[:5]}...")  # Print the first 5 values for brevity
    print(f"Length of embedding: {len(embedding)}")

Embedding for sentence 1: [-0.07455343753099442, -0.004580758046358824, 0.02168508991599083, 0.0645817294716835, 0.02027861773967743]...
Length of embedding: 384
Embedding for sentence 2: [-0.025911716744303703, 0.0050039575435221195, -0.011821541003882885, -0.020849445834755898, 0.061141155660152435]...
Length of embedding: 384
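
Setting `normalize_embeddings` to `True` matters when embeddings are compared: for unit-length vectors, the dot product equals the cosine similarity, so a plain dot product is enough for ranking. A short check of that identity on made-up 2-D vectors (the helper names are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```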


General Text Embedding

In [20]:
from sentence_transformers import SentenceTransformer

# Load the GTE-base model
model = SentenceTransformer('thenlper/gte-base')

# Sample texts to embed
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "I love machine learning and natural language processing.",
    "Embeddings are useful for many NLP tasks."
]

# Generate embeddings
embeddings = model.encode(texts)

# Print the shape of the embeddings
print(f"Shape of embeddings: {embeddings.shape}")

# Print the first few values of the first embedding
print(f"First few values of the first embedding: {embeddings[0][:5]}")

Shape of embeddings: (3, 768)
First few values of the first embedding: [-0.0237603  -0.04635289  0.02570768  0.01606993  0.05594611]
