<a href="https://colab.research.google.com/github/FalineRezvani/simpleInsightTools/blob/main/countBasedLanguageModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Count-Based Language Model

2025-04-09

Count-based language modeling predicts the next word from conditional probabilities, which are estimated by counting how often each word follows a given context. This notebook implements the count-based language model from The Hundred-Page Language Model Book by Andriy Burkov, trains it on data scraped from the GeeksforGeeks website, and uses it to make predictions.
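
As a minimal sketch of that idea, the conditional probability of a word given the previous word can be estimated as a ratio of counts. The toy sentence and the helper name `next_word_probability` are illustrative assumptions, not taken from the book or the scraped data.

```python
from collections import Counter

# Toy corpus, assumed for illustration only
tokens = "the cat sat on the mat the cat ran".split()

# Count each adjacent pair (bigram) and each word that has a successor
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])


def next_word_probability(context, word):
    """Estimate P(word | context) as C(context, word) / C(context)."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]


# 2 of the 3 occurrences of "the" are followed by "cat"
print(next_word_probability("the", "cat"))
```

The model in this notebook applies the same ratio to longer contexts of up to n-1 tokens.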

Language model code and original results can be found in the book [here](https://www.thelmbook.com/).

Code for web scraping can be found in my repo [here](https://github.com/FalineRezvani/simpleInsightTools/blob/main/2025-03-20/companyCulture.py).

Importing Libraries

In [1]:
import re # Regular expressions for text processing
import math # Python's built-in module for mathematical operations (log, exp)
import random # Python's built-in module for random number generation
from collections import defaultdict # Efficient dictionary operations
import pickle, os # Saving and loading the model
import pandas as pd

In [2]:
# set_seed wraps random.seed from Python's random module so that runs are reproducible
def set_seed(seed):
    random.seed(seed)

Bringing in CSV File

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Read the web-scraped data, saved as a CSV file on Google Drive, into a DataFrame
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv(r'/content/drive/MyDrive/geeksRLDescriptions.csv')

# Inspect dataframe
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,description
0,Reinforcement learning - GeeksforGeeks\r\nSep ...
1,Machine Learning Tutorial - GeeksforGeeks\r\n5...
2,A Beginner's Guide to Deep Reinforcement Learn...
3,Upper Confidence Bound Algorithm in Reinforcem...
4,On-policy vs off-policy methods Reinforcement ...


In [5]:
# Creating corpus from dataframe
corpus = []

# Clean the first nine rows
for i in range(0, 9):
  description = re.compile('<.*?>').sub(repl=' ', string=df.iloc[:,0][i]) # Replace HTML tags with a space
  description = re.compile('[.\r\n,-]').sub(' ', description) # Replace periods, carriage returns, newlines, commas, and hyphens with a space
  description = description.lower() # Converting to lowercase
  corpus.append(description) # Placing in empty list

# Creating single string out of list of strings
corpus = ''.join(corpus)

# Verify
corpus[:100]

'reinforcement learning   geeksforgeeks  sep 4  2024     reinforcement learning (rl) is a branch of m'
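
The cleaning steps above can be sketched on a single string. The `raw` value below is a hypothetical stand-in for one scraped description, not a row from the actual CSV.

```python
import re

# Hypothetical raw snippet standing in for one scraped description
raw = "<p>Q-learning, explained.</p>\r\nSep 4, 2024"

# Non-greedy match so that each tag is replaced separately
no_tags = re.sub(r"<.*?>", " ", raw)
# Replace periods, carriage returns, newlines, commas, and hyphens with a space
no_symbols = re.sub(r"[.\r\n,-]", " ", no_tags)
cleaned = no_symbols.lower()

# Splitting on whitespace shows the words that survive the cleaning
print(cleaned.split())
```

Because periods are replaced along with the other symbols, text cleaned this way contains no sentence-ending periods for a later tokenizer to find.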

Implementing a Count-Based Language Model

In [6]:
class CountLanguageModel:
  def __init__(self, n):
    self.n = n
    self.ngram_counts = [{} for _ in range(n)]
    self.total_unigrams = 0

  def predict_next_token(self, context):
    for n in range(self.n, 1, -1):
      if len(context) >= n - 1:
        context_n = tuple(context[-(n - 1):])
        counts = self.ngram_counts[n - 1].get(context_n)
        if counts:
          return max(counts.items(), key = lambda x: x[1])[0]
    unigram_counts = self.ngram_counts[0].get(())
    if unigram_counts:
      return max(unigram_counts.items(), key = lambda x: x[1])[0]
    return None

  def get_probability(self, token, context):
    # Back off from the longest available context to shorter ones
    for n in range(self.n, 1, -1):
      if len(context) >= n - 1:
        context_n = tuple(context[-(n - 1):])
        counts = self.ngram_counts[n - 1].get(context_n)
        if counts:
          total = sum(counts.values())
          count = counts.get(token, 0)
          if count > 0:
            return count / total
    # Fall back to the unigram estimate with add-one (Laplace) smoothing
    unigram_counts = self.ngram_counts[0].get(())
    count = unigram_counts.get(token, 0)
    V = len(unigram_counts)
    return (count + 1) / (self.total_unigrams + V)
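
The backoff logic in `get_probability` can be illustrated with a smaller, bigram-only sketch. The toy tokens and the function name `backoff_probability` are assumptions for illustration; this is not the class above, only the same pattern of trying a longer context first and falling back to a smoothed unigram estimate.

```python
from collections import Counter, defaultdict

# Toy training tokens, assumed for illustration only
tokens = "the cat sat on the mat".split()

# Bigram table: context word -> counts of the words that follow it
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_counts[prev][nxt] += 1

unigram_counts = Counter(tokens)
total_tokens = len(tokens)
vocab_size = len(unigram_counts)


def backoff_probability(token, prev):
    """Use the bigram estimate when the pair was seen, else a smoothed unigram."""
    followers = bigram_counts.get(prev)
    if followers and followers[token] > 0:
        return followers[token] / sum(followers.values())
    # Add-one (Laplace) smoothing keeps unseen tokens above zero probability
    return (unigram_counts[token] + 1) / (total_tokens + vocab_size)


print(backoff_probability("cat", "the"))  # seen bigram: 1 of 2 followers of "the"
print(backoff_probability("sat", "the"))  # unseen bigram: (1 + 1) / (6 + 5)
print(backoff_probability("dog", "the"))  # unseen word: (0 + 1) / (6 + 5)
```

One consequence of this simple scheme is that the higher-order estimates are not discounted, so the values returned for a single context can sum to more than 1 across the vocabulary.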

Functions to Train the Model, Generate Text, Compute Perplexity, Tokenize, Prepare Data, and Set Hyperparameters

In [14]:
def train(model, tokens):
    # Train models for each n-gram size from 1 to n
    for n in range(1, model.n + 1):
        counts = model.ngram_counts[n - 1]
        # Slide a window of size n over the corpus
        for i in range(len(tokens) - n + 1):
            # Split into context (n-1 tokens) and next token
            context = tuple(tokens[i:i + n - 1])
            next_token = tokens[i + n - 1]

            # Initialize counts dictionary for this context if needed
            if context not in counts:
                counts[context] = defaultdict(int)

            # Increment count for this context-token pair
            counts[context][next_token] += 1

    # Store total number of tokens for unigram probability calculations
    model.total_unigrams = len(tokens)


def generate_text(model, context, num_tokens):
    # Start with the provided context
    generated = list(context)

    # Generate new tokens until we reach the desired length
    while len(generated) - len(context) < num_tokens:
        # Use the last n-1 tokens as context for prediction
        next_token = model.predict_next_token(generated[-(model.n-1):])
        # An untrained model returns None; stop so that join() below receives only strings
        if next_token is None:
            break
        generated.append(next_token)

        # Stop if we've generated enough tokens AND found a period
        # Note: the while condition already ends the loop at num_tokens,
        # so this check does not extend generation to the end of a sentence
        if len(generated) - len(context) >= num_tokens and next_token == '.':
            break

    # Join tokens with spaces to create readable text
    return ' '.join(generated)


def compute_perplexity(model, tokens, context_size):
    # Handle empty token list
    if not tokens:
        return float('inf')

    # Initialize log likelihood accumulator
    total_log_likelihood = 0
    num_tokens = len(tokens)

    # Calculate probability for each token given its context
    for i in range(num_tokens):
        # Get appropriate context window, handling start of sequence
        context_start = max(0, i - context_size)
        context = tuple(tokens[context_start:i])
        token = tokens[i]

        # Get probability of this token given its context
        probability = model.get_probability(token, context)

        # Add log probability to total (using log for numerical stability)
        total_log_likelihood += math.log(probability)

    # Calculate average log likelihood
    average_log_likelihood = total_log_likelihood / num_tokens

    # Convert to perplexity: exp(-average_log_likelihood)
    # Lower perplexity indicates better model performance
    perplexity = math.exp(-average_log_likelihood)
    return perplexity


def tokenize(text):
    # Match runs of letters delimited by word boundaries, or a single period
    return re.findall(r"\b[a-zA-Z]+\b|[.]", text)


def download_and_prepare_data(corpus):
    # Convert text to tokens
    tokens = tokenize(corpus)

    # Split into training (90%) and test (10%) sets
    # This reserves data on which to test the model's predictions
    split_index = int(len(tokens) * 0.9)
    train_corpus = tokens[:split_index]
    test_corpus = tokens[split_index:]

    return train_corpus, test_corpus


def get_hyperparameters():
    # Size of n-grams to use in the model
    n = 5
    return n
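
To make the tokenizing and sliding-window steps concrete, here is a small sketch that splits a toy string into tokens and lists the (context, next token) pairs that a trigram pass would count. The text and the variable names are illustrative assumptions, and the pattern is a letters-or-period regex of the same kind as the one in `tokenize`.

```python
import re

# Toy text, assumed for illustration only
text = "agents learn by trial and error. agents maximize rewards."

# Letters-only words, or a standalone period
tokens = re.findall(r"\b[a-zA-Z]+\b|[.]", text)

n = 3  # trigram window: 2 context tokens plus the token that follows
pairs = [
    (tuple(tokens[i:i + n - 1]), tokens[i + n - 1])
    for i in range(len(tokens) - n + 1)
]

print(tokens[:4])
print(pairs[0])
print(len(pairs))  # one pair per window position: len(tokens) - n + 1
```

Each pair corresponds to one increment of `counts[context][next_token]` in `train`, repeated for every n-gram size from 1 to n.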

Saving/Loading a Model

In [8]:
def save_model(model, model_name):
    # Create models directory if it doesn't exist
    os.makedirs('models', exist_ok=True)

    # Construct file path
    model_path = os.path.join('models', f'{model_name}.pkl')

    try:
        print(f"Saving model to {model_path}...")
        with open(model_path, 'wb') as f:
            pickle.dump({
                'n': model.n,
                'ngram_counts': model.ngram_counts,
                'total_unigrams': model.total_unigrams
            }, f)
        print("Model saved successfully.")
        return model_path
    except IOError as e:
        print(f"Error saving model: {e}")
        raise

def load_model(model_name):
    model_path = os.path.join('models', f'{model_name}.pkl')

    try:
        print(f"Loading model from {model_path}...")
        with open(model_path, 'rb') as f:
            model_data = pickle.load(f)

        # Create new model instance
        model = CountLanguageModel(model_data['n'])

        # Restore model state
        model.ngram_counts = model_data['ngram_counts']
        model.total_unigrams = model_data['total_unigrams']

        print("Model loaded successfully.")
        return model
    except FileNotFoundError:
        print(f"Model file not found: {model_path}")
        raise
    except IOError as e:
        print(f"Error loading model: {e}")
        raise
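
The save and load steps amount to a pickle round trip of a plain dictionary. The sketch below uses a hypothetical `state` dictionary with the same three keys and writes to a temporary directory, so it does not touch the `models` folder.

```python
import os
import pickle
import tempfile

# Hypothetical model state with the same three keys that save_model writes
state = {
    "n": 2,
    "ngram_counts": [{(): {"a": 2, "b": 1}}, {("a",): {"b": 1}}],
    "total_unigrams": 3,
}

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Tuple keys and nested dictionaries survive the round trip unchanged
print(restored == state)
```

Unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.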

## Training the Model

In [15]:
# Main model training block
if __name__ == "__main__":
    # Initialize random seeds for reproducibility
    set_seed(42)
    n = get_hyperparameters()
    model_name = "count_model"

    train_corpus, test_corpus = download_and_prepare_data(corpus)

    # Train the model and evaluate its performance
    print("\nTraining the model...")
    model = CountLanguageModel(n)
    train(model, train_corpus)
    print("\nModel training complete.")

    perplexity = compute_perplexity(model, test_corpus, n)
    print(f"\nPerplexity on test corpus: {perplexity:.2f}")

    save_model(model, model_name)


Training the model...

Model training complete.

Perplexity on test corpus: 21.19
Saving model to models/count_model.pkl...
Model saved successfully.


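The training output reports the __perplexity__, which measures how uncertain the model is about the next word (token) in the test set. A perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens, so the value of 21.19 here corresponds to a choice among roughly 21 equally likely tokens. A lower number means the model assigns higher probability to the tokens that actually occur.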
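
A small worked example may help with that interpretation. The helper name `perplexity` and the probability lists below are assumptions for illustration; the calculation is the same exponential of the negative average log-probability used in `compute_perplexity`.

```python
import math


def perplexity(probabilities):
    """exp of the negative mean log-probability of the observed tokens."""
    mean_log_prob = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log_prob)


# Assigning 1/4 to every observed token is like a fair 4-way choice,
# so the perplexity comes out at approximately 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Higher probabilities on the observed tokens give a lower perplexity
print(perplexity([0.9, 0.8, 0.95]))
```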

## Testing the Model

In [16]:
# Main model testing block
if __name__ == "__main__":

    model = load_model(model_name)

    # These contexts give the model a chance to finish sentences
    contexts = [
        "i will build a",
        "the best place to",
        "she was riding a"
    ]

    # Generate completions for each context
    for context in contexts:
        tokens = tokenize(context)
        next_token = model.predict_next_token(tokens)
        print(f"\nContext: {context}")
        print(f"Next token: {next_token}")
        print(f"Generated text: {generate_text(model, tokens, 10)}")

Loading model from models/count_model.pkl...
Model loaded successfully.

Context: i will build a
Next token: branch
Generated text: i will build a branch of machine learning focused on making decisions to maximize

Context: the best place to
Next token: maximize
Generated text: the best place to maximize cumulative rewards in a given situation machine learning tutorial

Context: she was riding a
Next token: branch
Generated text: she was riding a branch of machine learning focused on making decisions to maximize
