<a href="https://colab.research.google.com/github/Hassan2711/B-log/blob/master/JsTokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets tokenizers numpy



### Load Dataset from Hugging Face

Here, we load a dataset of JavaScript code snippets with the Hugging Face `datasets` library. We use the JavaScript subset of the CodeXGLUE code-to-text dataset (`CM/codexglue_code2text_javascript`), which pairs JavaScript functions with their docstrings and serves as the training data for the tokenizer. The `load_dataset()` function downloads it and returns the train, validation, and test splits.


In [2]:
from datasets import load_dataset

dataset = load_dataset("CM/codexglue_code2text_javascript")

dataset



DatasetDict({
    train: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 58025
    })
    validation: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 3885
    })
    test: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 3291
    })
})

### Data Preprocessing

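In this section, we preprocess the JavaScript code to standardize it before tokenization. The preprocessing includes:
1. **Removing comments**: Both single-line (`//`) and multi-line (`/* */`) comments are removed.
2. **Converting camelCase to snake_case**: An underscore is inserted at each lowercase-to-uppercase boundary and the text is then lowercased, so an identifier such as `sortArray` becomes `sort_array`. The lowercasing applies to the whole snippet, so string literals are lowercased too.
3. **Normalizing spaces and newlines**: Runs of whitespace, including newlines, are collapsed into single spaces.
4. **Preserving important symbols**: Brackets, braces, parentheses, semicolons, and commas are kept and padded with spaces so that they stand apart from neighboring identifiers. Operators are left unchanged.

The preprocessing function performs the following steps, in order:
- First, single-line comments are removed with a regular expression that deletes everything from `//` to the end of the line. The pattern does not check whether `//` sits inside a string literal, so a URL such as `"https://..."` is cut short as well.
- Then, multi-line comments are removed with a non-greedy regular expression that matches anything between `/*` and `*/`, across newlines.
- JavaScript keywords are matched at word boundaries and surrounded with spaces.
- Camel case is converted to snake case with a regular expression, and the code is lowercased.
- Whitespace is collapsed and leading and trailing spaces are stripped.
- Finally, the symbols `{ } ( ) [ ] ; ,` are padded with spaces. Because this happens after the whitespace collapse, the result can again contain consecutive spaces.

This gives the tokenizer a uniform, lowercase-only training corpus.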
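
The two regular expressions behind the comment-removal and camelCase steps are short enough to try in isolation. The sketch below is illustrative only: `camel_to_snake` and `strip_line_comments` are hypothetical helper names, not part of the notebook, and the sample strings are made up. It shows how an acronym is handled and how `//` inside a string literal is affected.

```python
import re

def camel_to_snake(identifier: str) -> str:
    # Insert "_" at every lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', identifier).lower()

def strip_line_comments(code: str) -> str:
    # Delete everything from "//" to the end of each line.
    return re.sub(r'//.*', '', code)

print(camel_to_snake("sortArray"))        # sort_array
print(camel_to_snake("getHTTPResponse"))  # get_httpresponse (no split inside the acronym)

# The pattern cannot tell a comment from "//" inside a string literal:
print(strip_line_comments('const url = "https://example.com"; // note'))
# const url = "https:
```

If string literals matter for the downstream task, a real JavaScript lexer would be a safer way to strip comments than a regular expression.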


In [3]:
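import re

def preprocess_code(code):
    # Remove single-line comments. Note: this also cuts any "//" that
    # appears inside a string literal (for example, in a URL).
    code = re.sub(r'//.*', '', code)

    # Remove multi-line comments; DOTALL lets "." match newlines.
    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)

    # Surround JavaScript keywords with spaces (matched at word boundaries).
    keywords = [
        'function', 'return', 'if', 'else', 'for', 'while', 'break', 'continue', 'let', 'const', 'var', 'import', 'export', 'class', 'extends',
        'new', 'try', 'catch', 'finally', 'throw', 'this', 'switch', 'case', 'default', 'void', 'async', 'await', 'instanceof', 'typeof'
    ]
    for keyword in keywords:
        code = re.sub(r'\b' + re.escape(keyword) + r'\b', f' {keyword} ', code)

    # camelCase -> snake_case: add "_" at lowercase/digit -> uppercase
    # boundaries, then lowercase everything (string literals included).
    code = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', code)
    code = code.lower()

    # Collapse runs of whitespace (including newlines) into single spaces.
    code = re.sub(r'\s+', ' ', code).strip()

    # Pad brackets, semicolons, and commas with spaces. This runs after the
    # whitespace collapse, so consecutive spaces can reappear.
    code = re.sub(r'([{}()\[\];,])', r' \1 ', code)

    return code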


### BPE Tokenizer Class

In this section, we define a class called `BPETokenizer` that wraps a Byte-Pair Encoding (BPE) tokenizer from the `tokenizers` library. The class has methods for initializing, training, and applying the tokenizer.

1. **`__init__(self, vocabulary=None)`**:
    - Initializes the tokenizer with an empty `BPE` model. The `vocabulary` argument is accepted but not used.
    - The `ByteLevel` pre-tokenizer splits the input into word-like pieces and maps every byte to a printable character; a leading space shows up as `Ġ` in the tokens.
    - Truncation is enabled with a maximum length of 512, so every later call to `encode` returns at most 512 tokens. This setting affects encoding, not training.

2. **`train(self, dataset, vocabulary_size=50000, min_frequency=2)`**:
    - This method trains the tokenizer on an iterable of strings.
    - It uses `BpeTrainer` with a target vocabulary size and a minimum frequency that a pair must reach to be merged. The vocabulary size is an upper bound, so the learned vocabulary can be smaller.
    - The dataset is passed to `train_from_iterator`, which learns the merges and updates the tokenizer.
    - The final message prints the requested `vocabulary_size`, not the size that was actually learned.

3. **`tokenize(self, text)`**:
    - This method encodes `text` with the trained tokenizer.
    - It returns the list of token strings for the given input.

The `BPETokenizer` class uses default parameters but can be trained with a different vocabulary size or minimum frequency. It converts raw text into token sequences that a model can consume.
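
To make the roles of `vocabulary_size` and `min_frequency` concrete, here is a toy pure-Python sketch of BPE training. It is not the `tokenizers` implementation: `train_toy_bpe`, its tie-breaking rule, and the four-word corpus are assumptions made for illustration. It shows that merging stops early once no adjacent pair occurs at least `min_frequency` times, so the final vocabulary can be smaller than the requested size.

```python
from collections import Counter

def train_toy_bpe(words, vocab_size, min_frequency=2):
    """Learn BPE merges until vocab_size is reached or no pair is frequent enough."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for symbols in corpus for symbol in symbols}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice for this sketch).
        best_pair, best_count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if best_count < min_frequency:
            break  # nothing left that is frequent enough to merge
        merged = best_pair[0] + best_pair[1]
        # Rewrite every word, replacing occurrences of the pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
        vocab.add(merged)
        merges.append(best_pair)
    return vocab, merges

words = ["return", "return", "result", "let"]
vocab, merges = train_toy_bpe(words, vocab_size=50, min_frequency=2)
print(len(vocab))  # 12: 7 characters plus 5 merges, well below the requested 50
```

With `vocab_size=9` the same corpus stops after two merges instead, because the size limit is reached before the frequency limit.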


In [4]:
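from tokenizers import Tokenizer, models, pre_tokenizers, trainers

class BPETokenizer:
    def __init__(self, vocabulary=None):
        # `vocabulary` is currently unused; the BPE model starts empty.
        self.tokenizer = Tokenizer(models.BPE())
        # Byte-level pre-tokenization; a leading space is rendered as "Ġ".
        self.tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        # Cap every encoded sequence at 512 tokens (affects encode(), not training).
        self.tokenizer.enable_truncation(max_length=512)

    def train(self, dataset, vocabulary_size=50000, min_frequency=2):
        # vocabulary_size is an upper bound on the learned vocabulary.
        trainer = trainers.BpeTrainer(vocab_size=vocabulary_size, min_frequency=min_frequency)

        print("Training tokenizer...")
        self.tokenizer.train_from_iterator(dataset, trainer=trainer)

        # This reports the requested size; the learned size is available
        # through self.tokenizer.get_vocab_size().
        print(f"Tokenizer trained. Vocabulary size: {vocabulary_size}")

    def tokenize(self, text):
        # Return the token strings of the encoded text.
        return self.tokenizer.encode(text).tokens

tokenizer = BPETokenizer()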


### Train the Tokenizer on the Dataset

In this section, we initialize the `BPETokenizer` and train it on the cleaned dataset (`train_texts_cleaned`), which is built by applying `preprocess_code` to the `original_string` column of the training split.

1. **Initializing the Tokenizer**:
   - The `BPETokenizer` class is instantiated, which sets up the Byte-Pair Encoding model together with the byte-level pre-tokenizer and the truncation setting.

2. **Training the Tokenizer**:
   - The `train` method is called with the cleaned dataset (`train_texts_cleaned`). It uses `train_from_iterator` to learn the vocabulary and merge rules from the provided data.

3. **Previewing Tokens from the Trained Vocabulary**:
   - After training, we fetch the vocabulary with `get_vocab()`, which returns a dictionary mapping each token string to its integer ID, and print the first 10 keys. The dictionary is not ordered by frequency, so this is an arbitrary sample of learned tokens rather than the most frequent ones.

This step is a quick sanity check that training produced a vocabulary and that the tokenizer is ready to use.
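
Since `get_vocab()` returns a token-to-ID dictionary, one way to get a stable, reproducible preview is to sort it by ID before slicing. The sketch below assumes nothing beyond that dictionary shape: the `vocab` literal is hypothetical sample data with made-up IDs, and `tokens_by_id` is an illustrative helper, not part of the notebook.

```python
# Hypothetical token -> ID mapping, shaped like the dict returned by get_vocab().
# The IDs are invented for illustration.
vocab = {"Ġcommons": 31807, "Ġreturn": 262, "e": 40, "Ġfunction": 270, "re": 128}

def tokens_by_id(vocab, limit=10):
    # Sort by ID so the preview does not depend on dictionary order.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    return [token for token, _ in ordered[:limit]]

print(tokens_by_id(vocab, limit=3))  # ['e', 're', 'Ġreturn']
```

In the notebook this helper could be called as `tokens_by_id(tokenizer.tokenizer.get_vocab())`.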


In [8]:
train_texts_cleaned = [preprocess_code(sample) for sample in dataset['train']['original_string']]

tokenizer = BPETokenizer()

tokenizer.train(train_texts_cleaned)

print("Sample tokens from the trained vocabulary:")

vocab = tokenizer.tokenizer.get_vocab()
sample_tokens = list(vocab.keys())[:10]
print(sample_tokens)


Training tokenizer...
Tokenizer trained. Vocabulary size: 50000
Sample tokens from the trained vocabulary:
['Ġcommons', 'phicon', 'Ġrtcpeer', 'nosu', 'respawn', 'byweek', 'Ġrawquals', "><'", 'Ġbuilder', 'Ġ/\\\\\\\\']


### Tokenization Visualization

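#### 1. **Tokenization Efficiency**
   - **Purpose**: Shows the relationship between the length of a code sample, measured in whitespace-separated words, and its token count. Because encoding truncates at 512 tokens, token counts never exceed 512.
   - **Code**: Tokenize each sample, store its word count and token count, then draw a scatter plot.

#### 2. **Vocabulary Distribution**
   - **Purpose**: Displays 20 entries from the tokenizer's vocabulary, ranked by token ID. `get_vocab()` maps tokens to IDs, not to frequencies, so the chart shows the tokens with the highest IDs rather than the most frequent ones.
   - **Code**: Sort the vocabulary by ID in descending order and plot the top 20 entries.

#### 3. **Sample Tokenization**
   - **Purpose**: Prints the tokenization of the first 5 code samples and of two hand-written snippets.
   - **Code**: Tokenize each sample, then print the original code, the tokens, and the tokens with the leading `Ġ` stripped. The snippets are tokenized without `preprocess_code`, so characters that are absent from the lowercased training text, such as uppercase letters, are missing from the tokens: in the output below, `sortArray` comes out as `sor`, `tr`, `ray`.

#### 4. **Token Distribution**
   - **Purpose**: Shows the 20 most frequent tokens in the dataset after the leading `Ġ` marker is stripped.
   - **Code**: Tokenize the dataset, strip `Ġ` from each token, count occurrences, then plot the top 20 token frequencies.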
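
Stripping `Ġ` turns whitespace-only tokens into empty strings, which can then dominate a frequency chart. The following sketch counts cleaned tokens while skipping the empty ones. It is a minimal illustration: `count_cleaned_tokens` is a hypothetical helper, and `token_lists` is hand-written sample data rather than real tokenizer output.

```python
from collections import Counter

def count_cleaned_tokens(token_lists):
    """Count tokens after removing the leading byte-level space marker."""
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            cleaned = token.lstrip("Ġ")
            if cleaned:  # skip tokens that consisted only of "Ġ"
                counts[cleaned] += 1
    return counts

# Hand-written sample token lists (not real tokenizer output).
token_lists = [
    ["Ġfunction", "Ġcompare", "Ġ(", "Ġa", "Ġ,", "Ġb", "Ġ)", "Ġ{", "Ġ", "Ġ", "Ġreturn", "Ġa"],
    ["Ġreturn", "Ġb", "Ġ;"],
]

counts = count_cleaned_tokens(token_lists)
print(counts["return"], counts["a"], counts["function"])  # 2 2 1
print("" in counts)                                       # False
```

`counts.most_common(20)` then gives the pairs to plot, without an empty-string bar.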


In [23]:
import matplotlib.pyplot as plt
import numpy as np

def plot_tokenization_efficiency(dataset, tokenizer):
    token_counts = []
    sentence_lengths = []

    for text in dataset:
        tokens = tokenizer.tokenize(text)
        token_counts.append(len(tokens))
        sentence_lengths.append(len(text.split()))

    plt.figure(figsize=(10, 6))
    plt.scatter(sentence_lengths, token_counts, alpha=0.5)
    plt.title("Tokenization Efficiency vs. Sentence Length")
    plt.xlabel("Sentence Length (words)")
    plt.ylabel("Token Count")
    plt.show()

def plot_vocab_distribution(tokenizer):
    # get_vocab() maps token -> integer ID, not token -> frequency, so this
    # chart shows the 20 tokens with the highest IDs, not the most frequent ones.
    vocab = tokenizer.tokenizer.get_vocab()
    sorted_tokens = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

    top_tokens = sorted_tokens[:20]
    tokens = [token[0] for token in top_tokens]
    token_ids = [token[1] for token in top_tokens]

    plt.figure(figsize=(10, 6))
    plt.barh(tokens, token_ids, color='skyblue')
    plt.title("20 Tokens with the Highest Vocabulary IDs")
    plt.xlabel("Token ID")
    plt.ylabel("Tokens")
    plt.gca().invert_yaxis()
    plt.show()

def visualize_tokenization(dataset, tokenizer):
    print("\nVisualizing Tokenization for Multiple Samples:")
    for i in range(5):
        sample_code = dataset[i]
        tokens = tokenizer.tokenize(sample_code)
        print(f"\nSample {i+1} Original Code:")
        print(sample_code)
        print(f"\nSample {i+1} Tokenized:")
        print(tokens)
        cleaned_tokens = [token.lstrip('Ġ') for token in tokens]
        print("Cleaned Tokens:", cleaned_tokens)

def plot_token_distribution(dataset, tokenizer):
    token_frequencies = {}

    for text in dataset:
        tokens = tokenizer.tokenize(text)
        # Strip the byte-level space marker from each token.
        cleaned_tokens = [token.lstrip('Ġ') for token in tokens]
        for token in cleaned_tokens:
            # Whitespace-only tokens become empty strings; skip them.
            if not token:
                continue
            token_frequencies[token] = token_frequencies.get(token, 0) + 1

    sorted_tokens = sorted(token_frequencies.items(), key=lambda x: x[1], reverse=True)

    top_tokens = sorted_tokens[:20]
    tokens = [token[0] for token in top_tokens]
    frequencies = [token[1] for token in top_tokens]

    plt.figure(figsize=(10, 6))
    plt.bar(tokens, frequencies, color='lightgreen')
    plt.title("Top 20 Tokens Across Dataset (Ġ Prefix Stripped)")
    plt.xlabel("Tokens")
    plt.ylabel("Frequency")
    plt.xticks(rotation=90)
    plt.show()

# Tokenization for Sample JavaScript code
def tokenize_code(code, tokenizer):
    tokens = tokenizer.tokenize(code)
    cleaned_tokens = [token.lstrip('Ġ') for token in tokens]
    print("Original Code:")
    print(code)
    print("Tokenized Output:")
    print(cleaned_tokens)

sortArray_code = """
function sortArray(arr) {
    if (!arr || arr.length === 0) {
        console.log("Array is empty.");
        return [];
    }

    function compare(a, b) {
        if (a < b) {
            return -1;
        } else if (a > b) {
            return 1;
        }
        return 0;
    }

    let sortedArray = arr.sort(compare);
    return sortedArray;
}
"""

multiply_code = """
function multiply(a, b) {
    return a * b;
}

let result = multiply(2, 3);
"""
tokenize_code(sortArray_code, tokenizer)
tokenize_code(multiply_code, tokenizer)


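# The tokenizer was trained on preprocessed (lowercased) text, so the plots
# use the same representation.
plot_tokenization_efficiency(train_texts_cleaned, tokenizer)
plot_vocab_distribution(tokenizer)
plot_token_distribution(train_texts_cleaned, tokenizer)
# The raw samples are printed as-is; characters unseen during training
# (for example, uppercase letters) do not appear in their tokens.
visualize_tokenization(dataset['train']['original_string'], tokenizer)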


Original Code:

function sortArray(arr) {
    if (!arr || arr.length === 0) {
        console.log("Array is empty.");
        return [];
    }

    function compare(a, b) {
        if (a < b) {
            return -1;
        } else if (a > b) {
            return 1;
        }
        return 0;
    }

    let sortedArray = arr.sort(compare);
    return sortedArray;
}

Tokenized Output:
['', 'function', 'sor', 'tr', 'ray', '(', 'arr', ')', '{', '', '', 'if', '(', '!', 'arr', '||', 'arr', '.', 'length', '===', '0', ')', '{', '', '', '', '', 'console', '.', 'log', '(', '"', 'r', 'ray', 'is', 'empty', '."', ')', ';', '', '', '', '', 'return', '[', ']', ';', '', '', '}', '', '', 'function', 'compare', '(', 'a', ',', 'b', ')', '{', '', '', '', '', 'if', '(', 'a', '<', 'b', ')', '{', '', '', '', '', '', '', 'return', '-', '1', ';', '', '', '', '', '}', 'else', 'if', '(', 'a', '>', 'b', ')', '{', '', '', '', '', '', '', 'return', '1', ';', '', '', '', '', '}', '', '', '', '', 'return', '0', ';'

### Evaluating the Tokenizer

In this section, we evaluate the trained tokenizer with three metrics:

#### 1. **Vocabulary Size**
   - **Purpose**: The vocabulary size is the number of unique tokens the tokenizer learned during training. It can be lower than the requested `vocabulary_size` (37017 versus 50000 in the run below), most likely because training stops once no remaining pair reaches `min_frequency`.
   - **Code**:
     - We calculate the vocabulary size using `len(tokenizer.tokenizer.get_vocab())`, which returns the total number of unique tokens in the tokenizer's vocabulary.
     - This is printed to show how many tokens the tokenizer learned from the dataset.

#### 2. **Tokenization Efficiency**
   - **Purpose**: Tokenization efficiency measures how many tokens the tokenizer needs per code sample (called a "sentence" in the code). Fewer tokens for the same text mean a more compact encoding.
   - **Code**:
     - For each sample in the dataset, the code tokenizes the text and counts the number of tokens.
     - The `calculate_tokenization_efficiency` function divides the total token count by the number of samples to get the average.
     - Because encoding truncates at 512 tokens, a long sample contributes at most 512 to this average.

#### 3. **Out-of-Vocabulary (OOV) Rate**
   - **Purpose**: The OOV rate is the share of tokens in the dataset that are not found in the tokenizer's vocabulary. A lower rate means the vocabulary covers more of the data.
   - **Code**:
     - The `calculate_oov_rate` function tokenizes each sample and compares the resulting tokens with the tokenizer's vocabulary.
     - Each token that is not in the vocabulary is counted as out-of-vocabulary.
     - The total number of OOV tokens is divided by the total number of tokens to get the OOV rate.
     - The tokens being checked are produced by the tokenizer itself and therefore always come from its vocabulary, so this rate is 0 by construction. Input characters that the tokenizer cannot represent are not counted.

These metrics give a first impression of how the tokenizer behaves on the training data.
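
Because `calculate_oov_rate` only inspects tokens that the tokenizer itself produced, a complementary check is to look at input characters that never occurred in the training text. The sketch below is a character-level approximation under stated assumptions: `dropped_character_rate` is a hypothetical helper, the two strings are made-up examples, and it does not model the byte-level mapping that the `tokenizers` library applies.

```python
def dropped_character_rate(training_texts, evaluation_text):
    """Share of characters in evaluation_text that never appear in training_texts."""
    alphabet = set("".join(training_texts))
    unseen = [ch for ch in evaluation_text if ch not in alphabet]
    if not evaluation_text:
        return 0.0, []
    return len(unseen) / len(evaluation_text), sorted(set(unseen))

# Training text is lowercased (as after preprocessing); evaluation text is raw.
training_texts = ["function sort_array ( arr ) { return arr ; }"]
evaluation_text = "function sortArray(arr) { return arr; }"

rate, unseen = dropped_character_rate(training_texts, evaluation_text)
print(unseen)          # ['A']
print(f"{rate:.4f}")   # 0.0256 (1 of 39 characters)
```

A non-zero rate on raw code suggests either preprocessing the input before tokenizing or training on text that keeps its original casing.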



In [18]:
# Number of tokens actually learned (may be below the requested size).
vocabulary_size = len(tokenizer.tokenizer.get_vocab())
print(f"Vocabulary size: {vocabulary_size}")

def calculate_tokenization_efficiency(dataset):
    # Average number of tokens per sample. Each sample yields at most 512
    # tokens because truncation is enabled on the tokenizer.
    token_counts = []
    for text in dataset:
        tokens = tokenizer.tokenize(text)
        token_counts.append(len(tokens))

    average_tokens_per_sentence = sum(token_counts) / len(token_counts)
    return average_tokens_per_sentence

tokenization_efficiency = calculate_tokenization_efficiency(train_texts_cleaned)
print(f"Tokenization efficiency (average tokens per sentence): {tokenization_efficiency:.2f}")

def calculate_oov_rate(dataset):
    # Note: the tokens checked here come from the tokenizer itself, so they
    # are always in its vocabulary and the rate is 0 by construction.
    vocab_set = set(tokenizer.tokenizer.get_vocab().keys())
    oov_count = 0
    total_tokens = 0

    for text in dataset:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        oov_count += sum(1 for token in tokens if token not in vocab_set)

    oov_rate = oov_count / total_tokens
    return oov_rate

oov_rate = calculate_oov_rate(train_texts_cleaned)
print(f"Out-of-Vocabulary (OOV) rate: {oov_rate:.4f}")


Vocabulary size: 37017
Tokenization efficiency (average tokens per sentence): 183.13
Out-of-Vocabulary (OOV) rate: 0.0000
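
The tokenizer is configured with `enable_truncation(max_length=512)`, so the average above is computed over capped token counts. The small sketch below shows how such a cap pulls an average down. The lengths are hypothetical numbers chosen for illustration, and `mean_token_count` is not part of the notebook.

```python
def mean_token_count(lengths, max_length=None):
    """Average sequence length, optionally capping each length at max_length."""
    if max_length is not None:
        lengths = [min(n, max_length) for n in lengths]
    return sum(lengths) / len(lengths)

# Hypothetical untruncated token counts for three samples.
true_lengths = [100, 300, 2000]

print(mean_token_count(true_lengths))                  # 800.0
print(mean_token_count(true_lengths, max_length=512))  # 304.0
```

To measure uncapped lengths in the notebook, one option would be to call `tokenizer.tokenizer.no_truncation()` before computing the metric.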
