Dataset Link: https://statso.io/next-word-prediction-case-study/

The following command installs Hugging Face's `transformers` library, which gives access to pre-trained NLP models such as BERT and DistilBERT. Hugging Face provides tools, libraries, and models for NLP tasks like text classification, translation, summarization, and more.

In [None]:
!pip install transformers



Here's a brief explanation for each of these import statements:

1. **`import numpy as np`**: Imports the `numpy` library, commonly used for numerical and array-based operations. It is imported here for array handling, although the code in this notebook does not call it directly.

2. **`import tensorflow as tf`**: Imports the `tensorflow` library, a popular framework for building and training machine learning models. TensorFlow provides deep learning functionalities, and the alias `tf` makes it easier to call functions within the library.

3. **`from transformers import DistilBertTokenizer, TFDistilBertForMaskedLM`**:
   - `DistilBertTokenizer` is a tokenizer that converts text into token IDs, making it suitable for processing by the DistilBERT model.
   - `TFDistilBertForMaskedLM` is the TensorFlow version of DistilBERT with a masked language modeling (MLM) head. MLM is the task the checkpoint was pre-trained on: the model predicts masked words in a sentence.

4. **`import re`**: Imports Python’s `re` module for regular expression operations. This helps with tasks like cleaning text or pattern matching, useful for preprocessing text data before feeding it into models.

In [None]:
import numpy as np
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForMaskedLM
import re

This code loads a pre-trained DistilBERT tokenizer and model. The tokenizer converts text into tokens, while the model is used for Masked Language Modeling (MLM), predicting missing words in a sentence. Both are loaded from the `distilbert-base-uncased` version, which does not differentiate between uppercase and lowercase letters.

In [None]:
# Load the DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

All PyTorch model weights were used when initializing TFDistilBertForMaskedLM.

All the weights of TFDistilBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.
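To make the tokenizer's role concrete, here is a toy sketch of how text becomes a list of token IDs and how the position of `[MASK]` is located afterwards. It is a simplification under stated assumptions: the vocabulary, the IDs, and the helper `toy_encode` are hypothetical, and the real DistilBERT tokenizer uses WordPiece subwords rather than whitespace splitting.

```python
# Toy vocabulary; these IDs are made up for illustration and are not DistilBERT's.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
             "hi": 4, "my": 5, "name": 6, "is": 7}

def toy_encode(text):
    """Lowercase like an uncased model, split on whitespace,
    wrap the tokens in [CLS] ... [SEP], and map each token to an ID."""
    tokens = []
    for piece in text.split():
        # Keep the special mask token as-is; lowercase everything else
        tokens.append(piece if piece == "[MASK]" else piece.lower())
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]

ids = toy_encode("Hi my NAME is [MASK]")
mask_position = ids.index(TOY_VOCAB["[MASK]"])

print(ids)            # [1, 4, 5, 6, 7, 3, 2]
print(mask_position)  # 5
```

Note that the mask is followed by the end marker, so a model sees it as the last word of the sequence rather than as an open-ended continuation.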


This code reads the contents of the text file at `/content/sherlock-holm.es_stories_plain-text_advs.txt` and stores them in the variable `text`.

In [None]:
# Read and preprocess the text
with open('/content/sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

This code splits the text on newlines, so each non-blank line is treated as a "sentence". Each line is converted to lowercase, and every character other than letters, digits, and whitespace is removed. Blank lines are skipped.

In [None]:
# Cleaning and preparing sentences
sentences = text.split('\n')
sentences = [re.sub(r'[^a-zA-Z0-9\s]', '', sentence.lower()) for sentence in sentences if sentence.strip()]
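The effect of the cleaning step is easiest to see on a small input. The sketch below applies the same regular expression to a short sample string; the helper name `clean_lines` and the sample text are illustrative assumptions. Unlike the cell above, it also drops lines that become empty after cleaning (for example, a line containing only punctuation) and trims surrounding whitespace.

```python
import re

def clean_lines(raw_text):
    """Lowercase each non-blank line and keep only letters, digits, and whitespace."""
    cleaned = []
    for line in raw_text.split('\n'):
        if not line.strip():
            continue  # skip blank lines
        stripped = re.sub(r'[^a-zA-Z0-9\s]', '', line.lower()).strip()
        if stripped:  # a punctuation-only line is empty after cleaning
            cleaned.append(stripped)
    return cleaned

sample = "To Sherlock Holmes she is always THE woman.\n\n***\nI had seen little of Holmes lately."
print(clean_lines(sample))
# ['to sherlock holmes she is always the woman', 'i had seen little of holmes lately']
```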


In [None]:
# Prediction function: greedy pick from the top-k candidates, with a per-word repetition cap
def predict_next_words(seed_text, next_words=5, temperature=1.0, top_k=10, max_repeats=2):
    generated_text = seed_text
    repeat_tracker = {}  # Dictionary to track word repetition

    for _ in range(next_words):
        # Add a mask token at the end of the input text
        input_text = generated_text + " [MASK]"
        input_ids = tokenizer.encode(input_text, return_tensors="tf")

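        # Score every vocabulary token at the masked position and scale the logits by temperature
        # (dividing by a positive temperature does not change the ranking used below)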
        predictions = model(input_ids).logits
        mask_token_index = tf.where(input_ids == tokenizer.mask_token_id)[0, 1]
        logits = predictions[0, mask_token_index] / temperature

        # Keep the top-k highest-scoring token IDs, best first (no random sampling happens here)
        sorted_indices = tf.argsort(logits, direction='DESCENDING')[:top_k]

        # Filter out words that have been used repeatedly
        predicted_word = None
        for token_id in sorted_indices:
            word = tokenizer.decode([token_id])
            if word.isalpha():
                # Avoid over-repeating the same word
                if repeat_tracker.get(word, 0) < max_repeats:
                    predicted_word = word
                    repeat_tracker[word] = repeat_tracker.get(word, 0) + 1
                    break

        # Use fallback word if no valid word is found
        if not predicted_word:
            predicted_word = "..."

        # Append the predicted word to the generated text
        generated_text += " " + predicted_word

    return generated_text
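The function above ranks the candidates and always takes the best admissible one, so no random draw occurs and `temperature` does not change the result. If random top-k sampling is wanted, one possible approach is sketched below in plain Python. The function `sample_top_k` and the toy logits are hypothetical and not part of the notebook; this is a minimal illustration, not a drop-in replacement.

```python
import math
import random

def sample_top_k(logits, top_k=10, temperature=1.0, rng=None):
    """Sample one index from the top_k highest logits after temperature scaling."""
    if rng is None:
        rng = random.Random()
    # Indices of the top_k largest logits, highest first
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the kept candidates
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
picks = [sample_top_k(toy_logits, top_k=3, temperature=0.7, rng=rng) for _ in range(1000)]
print({i: picks.count(i) for i in sorted(set(picks))})
```

With this kind of sampling, a lower temperature concentrates the draws on the highest-scoring candidate, while a higher temperature spreads them more evenly across the top-k set.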

In [None]:
# Example usage
seed_text = "Hi my name is"
next_words = 3
predicted_text = predict_next_words(seed_text, next_words)
print("Predicted text:", predicted_text)

Predicted text: Hi my name is daddy yankee daddy
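The output repeats "daddy" because `max_repeats=2` allows each generated word to appear up to twice before it is skipped. The toy sketch below isolates that rule. It assumes a fixed, made-up candidate ranking, whereas the real function re-ranks candidates at every step; `pick_with_repeat_cap` is a hypothetical helper.

```python
def pick_with_repeat_cap(ranked_candidates, repeat_tracker, max_repeats=2):
    """Return the first alphabetic candidate used fewer than max_repeats times."""
    for word in ranked_candidates:
        if word.isalpha() and repeat_tracker.get(word, 0) < max_repeats:
            repeat_tracker[word] = repeat_tracker.get(word, 0) + 1
            return word
    return "..."  # fallback when every candidate is rejected

tracker = {}
ranked = ["daddy", "##s", "yankee"]  # hypothetical ranking, best first
picked = [pick_with_repeat_cap(ranked, tracker) for _ in range(5)]
print(picked)  # ['daddy', 'daddy', 'yankee', 'yankee', '...']
```

The subword piece `##s` is never chosen because it fails the `isalpha()` check, and the fallback `"..."` appears once all candidates have reached the cap.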
