# **N-grams in Language Modeling**


An **N-gram** is a contiguous sequence of *n* words from a given text or speech corpus. N-gram models are widely used in **Natural Language Processing (NLP)** to estimate the probability of a word given the words that precede it.

### **How N-grams Work**

*   N-grams capture word sequences and help in probabilistic language modeling.
    
*   They assume that the probability of a word in a sentence depends only on the previous n−1 words (Markov assumption).
    
*   This helps in applications such as **speech recognition, text generation, machine translation, and autocomplete features**.
    

#### **N-gram Types**

1.  **Unigram (n = 1):** Each word is treated independently.
    
    *   Example: "I", "love", "machine", "learning"
        
2.  **Bigram (n = 2):** Probability of a word depends on the previous one.
    
    *   Example: "I love", "love machine", "machine learning"
        
3.  **Trigram (n = 3):** Probability of a word depends on the last two words.
    
    *   Example: "I love machine", "love machine learning"
        
4.  **4-gram, 5-gram, etc.:** Consider longer contexts.
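
The n-gram types listed above can be produced with a few lines of plain Python. This is a minimal sketch; `make_ngrams` is a hypothetical helper name, and splitting on whitespace is a simplifying assumption.

```python
def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization is an assumption made for this illustration
tokens = "I love machine learning".split()

for n in (1, 2, 3):
    print(f"{n}-grams:", make_ngrams(tokens, n))
```

A sentence with fewer than n tokens yields an empty list, because no window of that size fits.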
    

### **Trigram (3-gram) Model**

A **trigram model** uses the previous two words to predict the next word in a sequence.

#### **Example:**

Consider the sentence:

> "I love machine learning."

If we are using a trigram model and have already seen "I love machine", the model conditions only on the last two words, "love machine", to predict the most probable next word.

**How does the model predict the next word?**

*   Based on a corpus (large text dataset), the model calculates the probability of possible next words.
    
*   It uses **Conditional Probability**:
    

$$P(w_3 \mid w_1, w_2) = \frac{\mathrm{Count}(w_1\, w_2\, w_3)}{\mathrm{Count}(w_1\, w_2)}$$

If "learning" has the highest probability, the model predicts:

> **"I love machine learning"**

This means:

*   The probability of "learning" following "love machine" is estimated by counting how often "love machine learning" appears in a corpus and dividing it by how often "love machine" appears.
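
The count-and-divide estimate can be tried on a toy corpus. The sketch below is illustrative only: the corpus and the names `ngram_counts` and `trigram_prob` are made up for this example.

```python
from collections import Counter

# Tiny made-up corpus, already lowercased and tokenized
corpus = ("i love machine learning . i love machine translation . "
          "we love machine learning").split()

def ngram_counts(tokens, n):
    """Count every contiguous n-token window."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

trigram_counts = ngram_counts(corpus, 3)
bigram_counts = ngram_counts(corpus, 2)

def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2) as Count(w1 w2 w3) / Count(w1 w2)."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return trigram_counts[(w1, w2, w3)] / context

print(trigram_prob("love", "machine", "learning"))     # 2 of the 3 occurrences
print(trigram_prob("love", "machine", "translation"))  # 1 of the 3 occurrences
```

Here "love machine" occurs three times and is followed by "learning" twice, so "learning" is the predicted continuation.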
    

### **Applications of N-gram Models**

1.  **Speech Recognition:** Predict the most probable next word based on past words.
    
2.  **Autocomplete & Text Prediction:** Used in Google search and mobile keyboards.
    
3.  **Spelling Correction:** Identifies probable words from misspelled text.
    
4.  **Machine Translation:** Helps translate text based on learned word sequences.
    
5.  **Sentiment Analysis:** Can capture phrase-based sentiments (e.g., "not very good" vs. "very good").
    

### **Limitations of N-gram Models**

1.  **Data Sparsity:** Large datasets are needed to get reliable probabilities.
    
2.  **Lack of Long-Term Context:** A trigram model only considers the last two words, ignoring long-range dependencies.
    
3.  **High Computational Cost:** Storing probabilities for large n-grams can be memory-intensive.
    
4.  **Smoothing Required:** Some word combinations might never appear in the training data and would otherwise receive zero probability, requiring techniques like **Laplace Smoothing** to handle unseen combinations.
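
As a rough illustration of Laplace (add-one) smoothing for a bigram model, the sketch below adds a pseudo-count `k` to every possible bigram. The toy corpus and the function name `laplace_bigram_prob` are assumptions made for this example.

```python
from collections import Counter

tokens = "data science is fun and data science is useful".split()
vocab = sorted(set(tokens))
V = len(vocab)  # vocabulary size

bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
# How often each word appears as the first element of a bigram
context_counts = Counter(prev for prev, _ in bigrams)

def laplace_bigram_prob(prev, word, k=1):
    """P(word | prev) with add-k smoothing: (count + k) / (context + k * V)."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * V)

print(laplace_bigram_prob("data", "science"))  # seen bigram
print(laplace_bigram_prob("data", "fun"))      # unseen bigram, still non-zero
```

Seen bigrams give up a little probability mass so that unseen ones are no longer impossible, and the probabilities for a given context still sum to 1.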
    


In [23]:
import nltk
from nltk.util import ngrams
from collections import Counter

# Long sentence as input
text = """
Data science is a rapidly growing field that combines statistics, machine learning, and programming to extract insights from data.
Many companies use data science to improve decision-making, automate processes, and create intelligent systems.
The demand for skilled data scientists continues to increase as more industries recognize the value of data-driven strategies.
"""

# Tokenize text (convert to lowercase for consistency)
words = text.lower().split()

print("Tokenized Words:", words)


Tokenized Words: ['data', 'science', 'is', 'a', 'rapidly', 'growing', 'field', 'that', 'combines', 'statistics,', 'machine', 'learning,', 'and', 'programming', 'to', 'extract', 'insights', 'from', 'data.', 'many', 'companies', 'use', 'data', 'science', 'to', 'improve', 'decision-making,', 'automate', 'processes,', 'and', 'create', 'intelligent', 'systems.', 'the', 'demand', 'for', 'skilled', 'data', 'scientists', 'continues', 'to', 'increase', 'as', 'more', 'industries', 'recognize', 'the', 'value', 'of', 'data-driven', 'strategies.']
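
The output above shows punctuation still attached to tokens such as `'statistics,'` and `'data.'`, so `'data'` and `'data.'` are counted as different words. One possible alternative, sketched here with the standard `re` module, keeps only word characters plus internal hyphens and apostrophes. The pattern and the helper name `tokenize` are illustrative choices, not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs joined by hyphens or apostrophes."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

sample = ("Data science combines statistics, machine learning, and programming. "
          "It supports data-driven decision-making.")
print(tokenize(sample))
```

Switching to a tokenizer like this would change the n-gram counts shown in the following cells, since the punctuation variants collapse into one token.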


In [25]:
# Generate bigrams (2-grams) and trigrams (3-grams)
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))

print("\nExample Bigrams:")
for bigram in bigrams:
    print(bigram)

print("\nExample Trigrams:")
for trigram in trigrams:
    print(trigram)





Example Bigrams:
('data', 'science')
('science', 'is')
('is', 'a')
('a', 'rapidly')
('rapidly', 'growing')
('growing', 'field')
('field', 'that')
('that', 'combines')
('combines', 'statistics,')
('statistics,', 'machine')
('machine', 'learning,')
('learning,', 'and')
('and', 'programming')
('programming', 'to')
('to', 'extract')
('extract', 'insights')
('insights', 'from')
('from', 'data.')
('data.', 'many')
('many', 'companies')
('companies', 'use')
('use', 'data')
('data', 'science')
('science', 'to')
('to', 'improve')
('improve', 'decision-making,')
('decision-making,', 'automate')
('automate', 'processes,')
('processes,', 'and')
('and', 'create')
('create', 'intelligent')
('intelligent', 'systems.')
('systems.', 'the')
('the', 'demand')
('demand', 'for')
('for', 'skilled')
('skilled', 'data')
('data', 'scientists')
('scientists', 'continues')
('continues', 'to')
('to', 'increase')
('increase', 'as')
('as', 'more')
('more', 'industries')
('industries', 'recognize')
('recognize', 't

In [27]:
# Count occurrences of each bigram and trigram
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

print("\nMost Common Bigrams:", bigram_freq.most_common(3))
print("\nMost Common Trigrams:", trigram_freq.most_common(3))



Most Common Bigrams: [(('data', 'science'), 2), (('science', 'is'), 1), (('is', 'a'), 1)]

Most Common Trigrams: [(('data', 'science', 'is'), 1), (('science', 'is', 'a'), 1), (('is', 'a', 'rapidly'), 1)]


In [29]:
def predict_next_word(bigram):
    """Predict the most probable next word given a bigram"""
    possible_trigrams = {trigram: count for trigram, count in trigram_freq.items() if trigram[:2] == bigram}
    
    if not possible_trigrams:
        return None  # No prediction possible
    
    # Return the most frequent next word
    return max(possible_trigrams, key=possible_trigrams.get)[2]

# Test prediction with a bigram
bigram_input = ('science', 'powerful')  # Example bigram
predicted_word = predict_next_word(bigram_input)

print(f"\nGiven Bigram: {bigram_input}")
print(f"Predicted Next Word: {predicted_word}")



Given Bigram: ('science', 'powerful')
Predicted Next Word: None
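
The `None` result above is what happens when the two-word context never occurs in the corpus. A common remedy is to back off to a shorter context. The sketch below is a simplified illustration of that idea on a toy corpus; `most_frequent_continuation` and `predict_with_backoff` are hypothetical names, and it only picks the most frequent continuation without discounting probabilities.

```python
from collections import Counter

tokens = ("data science is fun and data science is useful "
          "and data is everywhere").split()

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def most_frequent_continuation(counts, context):
    """Return the most frequent word following `context`, or None if unseen."""
    candidates = {ngram[-1]: c for ngram, c in counts.items()
                  if ngram[:-1] == context}
    return max(candidates, key=candidates.get) if candidates else None

def predict_with_backoff(w1, w2):
    """Try the two-word context first, then fall back to the last word only."""
    word = most_frequent_continuation(trigram_counts, (w1, w2))
    if word is None:
        word = most_frequent_continuation(bigram_counts, (w2,))
    return word

print(predict_with_backoff("data", "science"))      # trigram context is known
print(predict_with_backoff("powerful", "science"))  # falls back to "science"
print(predict_with_backoff("science", "powerful"))  # nothing known about "powerful"
```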


In [33]:
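# Note: this redefines predict_next_word from the previous cell with a one-word context
def predict_next_word(word):
    """Predict the most probable next word given only the previous word (bigram model)"""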
    possible_bigrams = {bigram: count for bigram, count in bigram_freq.items() if bigram[0] == word}
    
    if not possible_bigrams:
        return None  # No prediction possible
    
    # Return the most frequent next word
    return max(possible_bigrams, key=possible_bigrams.get)[1]

# Test prediction
input_word = 'growing'  # Example previous word
predicted_word = predict_next_word(input_word)

print(f"\nGiven Word: {input_word}")
print(f"Predicted Next Word: {predicted_word}")



Given Word: growing
Predicted Next Word: field


**Advantages of N-grams**
-------------------------

✅ **Simple & Fast**

✅ **Works Reasonably on Small Datasets** (for small n, such as unigrams and bigrams)

**Limitations of N-grams**
--------------------------

❌ **No Long-Term Context** (Only considers limited previous words)

❌ **Data Sparsity** (Rare sequences are hard to predict)

❌ **Fixed Window** (N-gram size limits the model's ability to learn complex patterns)

## 1. Importing Libraries

- `tensorflow.keras.models.Sequential`: Defines a sequential neural network model where layers are added one after another.
- `tensorflow.keras.layers.Embedding`: Converts words into dense vector representations (word embeddings) that capture semantic meaning.
- `tensorflow.keras.layers.LSTM`: A type of recurrent neural network (RNN) used for processing sequential data, capable of learning long-term dependencies.
- `tensorflow.keras.layers.Dense`: Fully connected layer that applies transformations to the input.
- `tensorflow.keras.preprocessing.text.Tokenizer`: Tokenizes (splits) text into individual words (or characters, with `char_level=True`) and converts them into integer indices.
- `tensorflow.keras.preprocessing.sequence.pad_sequences`: Ensures that all sequences have the same length by padding shorter sequences.
- `numpy`: Used for numerical computations.
- `regex`: A library for handling advanced regular expressions, which can be used for text preprocessing.


In [48]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import regex as re


# **Text Tokenization and N-gram Sequence Generation in TensorFlow**

This section explains how to preprocess a text file into structured data for training an **LSTM-based language model**. The process includes **text tokenization, n-gram sequence creation, padding, and target encoding**.

---

## **1. Reading and Splitting the Text File**

### **Function: `file_to_sentence_list(file_path)`**
- Opens the text file and reads its content.
- Splits the text into **sentences** using delimiters like `.` (period), `?` (question mark), and `!` (exclamation mark).
- Uses **regular expressions (`regex`)** to ensure sentences are correctly identified.
- Returns a **list of cleaned sentences**.
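
A minimal sketch of the sentence-splitting step, using the standard `re` module rather than the third-party `regex` package (an assumption made so the example has no extra dependencies). The helper name `split_sentences` and the sample text are invented for illustration.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', dropping empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]

sample = "Pizza is popular. Do you like it? I do!\nMe too."
print(split_sentences(sample))
```

The lookbehind keeps the punctuation mark with its sentence. A pattern this simple will also split after abbreviations such as "Dr.", so it is best suited to clean text.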

---

## **2. Tokenization**

### **Using `Tokenizer` from Keras**
- The **Tokenizer** converts sentences into numerical sequences.
- It assigns a **unique integer** to each word in the text.
- The `word_index` dictionary stores the mapping of words to integers.
- The `total_words` variable represents the vocabulary size (number of unique words + 1 for padding).
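
To make the word-to-integer mapping concrete, here is a pure-Python imitation of the idea. It is not the Keras implementation: `build_word_index` and `texts_to_ids` are hypothetical helpers, and the only preprocessing assumed is lowercasing and whitespace splitting.

```python
from collections import Counter

sentences = ["I love pizza", "I love pasta", "Pizza is great"]

def build_word_index(texts):
    """Map each lowercased word to an integer starting at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(text, word_index):
    """Convert a sentence to indices, skipping words outside the vocabulary."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

word_index = build_word_index(sentences)
total_words = len(word_index) + 1  # index 0 stays free for padding

print(word_index)
print(texts_to_ids("I love pizza", word_index), total_words)
```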

---

## **3. Generating Input Sequences (N-grams)**

### **Steps:**
- Iterate through each sentence.
- Convert the sentence into a sequence of word indices.
- Generate **n-gram sequences** by progressively adding words.

**Example:**  
For a sentence `"I love pizza"`:
- Tokens: `[1, 2, 3]`
- Generated sequences:
  - `[1, 2]`
  - `[1, 2, 3]`

Each sequence helps the model learn contextual word predictions.
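
The prefix expansion in the example can be written as a one-line helper. This is a small sketch; `prefix_sequences` is a hypothetical name.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, so each one ends in a word to predict."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

print(prefix_sequences([1, 2, 3]))
print(prefix_sequences([4, 5, 6, 7]))
```

A sentence of k tokens contributes k − 1 training sequences, and a one-word sentence contributes none.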

---

## **4. Padding Sequences and Splitting Data**

- Sentences vary in length, so we **pad sequences** to make them uniform.
- `pad_sequences()` ensures that all sequences have the **same length** by adding zeros at the beginning (`padding='pre'`).
- The **maximum sequence length** (`max_sequence_len`) is determined by the longest n-gram.
- The dataset is split into:
  - `X`: **Predictors** (all but the last word in the sequence).
  - `y`: **Labels** (the last word in the sequence).
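
The padding and the `X`/`y` split can be followed by hand on a few short sequences. The sketch below re-implements left-padding in plain Python purely for illustration; `pad_pre` is a hypothetical stand-in, not the Keras function.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to maxlen, keeping the last maxlen items."""
    padded = []
    for seq in sequences:
        tail = list(seq[-maxlen:])
        padded.append([0] * (maxlen - len(tail)) + tail)
    return padded

sequences = [[1, 2], [1, 2, 3], [4, 5, 6, 7]]
max_sequence_len = max(len(seq) for seq in sequences)

padded = pad_pre(sequences, max_sequence_len)
X = [row[:-1] for row in padded]  # everything except the last token
y = [row[-1] for row in padded]   # the last token is the label

print(padded)
print(X)
print(y)
```

Padding at the front keeps the label in the last column of every row, which is why a single slice separates predictors from labels.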

---

## **5. Converting Targets to One-Hot Encoding**

- The output `y` needs to be **categorical**, representing words as probability distributions.
- `tf.keras.utils.to_categorical()` converts numerical labels into a **one-hot encoded format**.
- The number of classes is set to `total_words` (the vocabulary size).
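
One-hot encoding itself is simple enough to sketch without any library. The helper `one_hot` and the tiny label list below are illustrative assumptions.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

labels = [2, 3, 1]
num_classes = 5  # plays the role of total_words here

for row in one_hot(labels, num_classes):
    print(row)
```

Each row has `num_classes` entries, so memory grows with the vocabulary size. Keras also provides the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly.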

---

## **Final Output**
At the end of this preprocessing pipeline, we have:
- **`X`**: A matrix of padded n-gram sequences.
- **`y`**: One-hot encoded target words.
- This structured data is ready for training an **LSTM-based language model**.


In [50]:
def file_to_sentence_list(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:  # assumes a UTF-8 text file
        text = file.read()

    # Splitting the text into sentences using
    # delimiters like '.', '?', and '!'
    sentences = [sentence.strip() for sentence in re.split(
        r'(?<=[.!?])\s+', text) if sentence.strip()]

    return sentences

file_path = 'pizza.txt'
text_data = file_to_sentence_list(file_path)

# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
total_words = len(tokenizer.word_index) + 1

# Convert each sentence into a list of token indices, then build n-gram
# sequences: for every position after the first, keep the prefix that ends
# at that token and append it to input_sequences.
input_sequences = []
for line in text_data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences and split into predictors and label:
# - max_sequence_len is the length of the longest sequence in input_sequences.
# - pad_sequences() with padding='pre' adds zeros at the beginning of shorter sequences.
# - X holds all tokens except the last ([:, :-1]); y holds the last token ([:, -1]).
#
# Why do we use padding? RNNs, LSTMs, and Transformers are trained on batches
# stored as rectangular arrays, so every input sequence in a batch must have
# the same length.
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(
    input_sequences, maxlen=max_sequence_len, padding='pre'))
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# Convert target data to one-hot encoding
y = tf.keras.utils.to_categorical(y, num_classes=total_words)


# **Building an LSTM-based Language Model with TensorFlow**

In this section, we define and compile a **Sequential LSTM model** for text generation or next-word prediction.

---

## **1. Model Architecture**

### **1.1 Embedding Layer**

`model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))`

*   **Purpose**: Converts integer-encoded words into dense vector representations.
    
*   `total_words`: Vocabulary size (number of unique words plus one for the reserved padding index 0).
    
*   `10`: The size of the embedding vector (each word is represented by a 10-dimensional vector).
    
*   `input_length=max_sequence_len-1`: The input length (one less than the maximum sequence length, because the last token of each sequence is used as the label).
    

### **1.2 LSTM Layer**

`model.add(LSTM(128))`

*   **Purpose**: Captures sequential dependencies in text data.
    
*   128: Number of LSTM units (neurons) that process the sequential data.
    
*   Uses **Long Short-Term Memory (LSTM)** to remember long-term dependencies in the text.
    

### **1.3 Dense (Output) Layer**

`model.add(Dense(total_words, activation='softmax'))`

*   **Purpose**: Produces the probability distribution over all possible next words.
    
*   `total_words`: The number of output classes (the vocabulary size, including the padding index).
    
*   `activation='softmax'`: Ensures that the outputs are non-negative and sum to 1, making the layer suitable for multi-class classification.
    

## **2. Model Compilation**

`model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])`

*   **Loss Function**: `categorical_crossentropy`
    
    *   Used for multi-class classification.
        
    *   Measures how well the model predicts the next word.
        
*   **Optimizer**: `adam`
    
    *   Adaptive learning rate optimization algorithm.
        
*   **Metric**: `accuracy`
    
    *   Used to measure the proportion of correctly predicted words.
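
The softmax output and the categorical cross-entropy loss described above can be checked by hand on a three-word toy vocabulary. This is an illustrative sketch in plain Python; the scores and function names are made up and nothing here calls Keras.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate words
probs = softmax(logits)
target = [1.0, 0.0, 0.0]   # the true next word is class 0

loss = categorical_crossentropy(target, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward 0 as the probability of the true word approaches 1, and grows when the model puts its mass on the wrong word.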
        

## **Summary**

*   The model takes a sequence of words as input.
    
*   The **Embedding Layer** converts words into dense vectors.
    
*   The **LSTM Layer** captures contextual dependencies.
    
*   The **Dense Layer** predicts the next word.
    
*   The model is compiled with categorical crossentropy loss and the Adam optimizer.
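
To get a feel for the size of this architecture, the trainable parameters of each layer can be counted by hand. The sketch assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and one bias vector) and a hypothetical vocabulary of 200 words; the real value of `total_words` depends on the text file.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary index."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim

total_words = 200  # hypothetical vocabulary size, for illustration only
embed_dim, units = 10, 128

counts = {
    "embedding": embedding_params(total_words, embed_dim),
    "lstm": lstm_params(embed_dim, units),
    "dense": dense_params(units, total_words),
}
print(counts, sum(counts.values()))
```

Under these assumptions the LSTM layer holds most of the parameters, while the embedding and output layers scale with the vocabulary size.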

In [52]:
# Define the model
model = Sequential()
model.add(Embedding(total_words, 10,
                    input_length=max_sequence_len-1))
model.add(LSTM(128))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])


In [58]:
# Train the model
model.fit(X, y, epochs=500, verbose=1)


Epoch 1/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.9605 - loss: 0.1368
Epoch 2/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.9611 - loss: 0.1303
Epoch 3/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9649 - loss: 0.1198
Epoch 4/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9728 - loss: 0.1066
Epoch 5/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - accuracy: 0.9660 - loss: 0.1266
Epoch 6/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9567 - loss: 0.1388
Epoch 7/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9611 - loss: 0.1303
Epoch 8/500
51/51 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.9729 - loss: 0.1158
Epoch 9/500
51/51 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x35218a030>

# **Generating Next Word Predictions with the LSTM Model**

This section describes how to generate **next-word predictions** using the trained LSTM model.

## **1. Input Seed Text**

`seed_text = "i like "`  
`next_words = 10`

*   The `seed_text` is the **starting phrase** for prediction.
    
*   `next_words = 10` specifies the number of words to generate.
    

## **2. Word Tokenization & Padding**

`token_list = tokenizer.texts_to_sequences([seed_text])[0]`  
`token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')`

*   The `Tokenizer` converts the `seed_text` into a numerical sequence.
    
*   Since this model expects **fixed-length inputs** of length `max_sequence_len-1`, `pad_sequences()` ensures a uniform input size.
    
*   `padding='pre'` pads the sequence at the beginning with zeros.
    

## **3. Predicting the Next Word**

`predicted_probs = model.predict(token_list)`  
`predicted_word = tokenizer.index_word[np.argmax(predicted_probs)]`

*   **Model Prediction**:
    
    *   The LSTM model predicts the **probability distribution** over all words in the vocabulary.
        
    *   The output is a **softmax probability array**, where each value represents the likelihood of a word being the next word.
        
*   **Selecting the Next Word**:
    
    *   `np.argmax(predicted_probs)`: Finds the index of the word with the highest probability.
        
    *   `tokenizer.index_word[...]`: Converts the index back to the actual word.
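
The index-to-word step can be illustrated without a trained model. In the sketch below the probability row and the `index_word` dictionary are invented, and `decode_next_word` is a hypothetical helper. It uses `dict.get` so that the padding index 0, which has no entry in the word mapping, yields `None` instead of raising a `KeyError`.

```python
# Hypothetical mapping; index 0 is reserved for padding and has no word
index_word = {1: "pizza", 2: "is", 3: "great"}

# One row of made-up probabilities, shaped like a single model prediction
predicted_probs = [[0.05, 0.15, 0.10, 0.70]]

def decode_next_word(probs_row, index_word):
    """Pick the most probable index and map it back to a word (None for padding)."""
    best = max(range(len(probs_row)), key=probs_row.__getitem__)
    return index_word.get(best)

print(decode_next_word(predicted_probs[0], index_word))
```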
        

## **4. Iterative Word Generation**

`seed_text += " " + predicted_word`

*   The predicted word is appended to `seed_text`, allowing the model to generate **longer sequences**.
    
*   The loop continues for `next_words` iterations, generating a sentence word by word.
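
The loop structure is independent of the model behind it. The sketch below shows the same greedy, word-by-word generation with a toy lookup table standing in for the LSTM; `generate`, `toy_table`, and `toy_predict` are all hypothetical names used only for illustration.

```python
def generate(seed_text, next_words, predict_next):
    """Append up to next_words words, one greedy prediction at a time."""
    words = seed_text.split()
    for _ in range(next_words):
        word = predict_next(words)
        if word is None:  # stop early when the predictor has no suggestion
            break
        words.append(word)
    return " ".join(words)

# Toy stand-in for the trained model: it looks only at the last word
toy_table = {"i": "like", "like": "pizza", "pizza": "with", "with": "cheese"}

def toy_predict(words):
    return toy_table.get(words[-1])

print(generate("i like", 3, toy_predict))
```

Keeping the words in a list and joining them at the end avoids doubled spaces when the seed text already ends with a space.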
    

## **5. Output**

`print("Next predicted words:", seed_text)`

*   Displays the **generated sentence** after `next_words` iterations.

In [64]:
# Generate next word predictions
seed_text = "i like "
next_words = 10

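for _ in range(next_words):
    # Encode the text generated so far as word indices
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Left-pad to the input length the model was trained on
    token_list = pad_sequences(
        [token_list], maxlen=max_sequence_len-1, padding='pre')
    # Probability distribution over the vocabulary for the next word
    predicted_probs = model.predict(token_list)
    # Greedy choice: take the most probable index and map it back to a word
    predicted_word = tokenizer.index_word[np.argmax(predicted_probs)]
    seed_text += " " + predicted_word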

print("Next predicted words:", seed_text)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step
Next predicted words: i like  its soft and chewy neapolitan crust topped with the perfect
