<h1>Task 02: N-gram Models</h1>
<p>N-gram models are a type of probabilistic language model used in natural language processing and speech recognition. They estimate the probability of a word given the previous N-1 words in a sequence. In this task, we will implement a simple N-gram model and apply it to a text dataset.</p>

In [1]:
from typing import List, Tuple, Dict, Optional


Data Preparation:
<ul>
    <li>Load the text dataset provided. Dataset Link: </li>
    <li>Preprocess the text by converting it to lowercase and removing punctuation.</li>
    <li>Split the text into tokens (words).</li>
</ul>
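
As a rough illustration of the two preprocessing steps above, here is a minimal sketch. The names `preprocess_text_sketch` and `generate_tokens_sketch` are hypothetical (chosen so they do not clash with the functions you are asked to write), and the sketch assumes that "punctuation" means the ASCII characters in `string.punctuation` and that tokens are separated by whitespace.

```python
import string
from typing import List


def preprocess_text_sketch(text: str) -> str:
    """Lowercase the text and drop ASCII punctuation (assumed definition)."""
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes it.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)


def generate_tokens_sketch(text: str) -> List[str]:
    """Split on runs of whitespace; the text is otherwise left unchanged."""
    return text.split()


print(preprocess_text_sketch("Hello World!"))      # hello world
print(generate_tokens_sketch("This is a test."))   # ['This', 'is', 'a', 'test.']
```

Note that tokenizing does not strip punctuation by itself, which is why the second call keeps the trailing period on `test.`.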

In [None]:
def preprocess_text(text: str) -> str:
    """
    Preprocesses the input text by converting it to lowercase and removing punctuation.

    Args:
        text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text.
    """
    pass


In [None]:
def generate_tokens(text: str) -> List[str]:
    """
    Generates tokens (words) from the input text.

    Args:
        text (str): The input text.

    Returns:
        List[str]: List of tokens (words).
    """
    pass



Building N-gram Model:
<ul>
    <li>Implement a function to generate N-grams from a list of tokens for a given N.</li>
    <li>Calculate the frequency of each N-gram in the dataset.</li>
</ul>

To implement the generate_ngrams function, start from its signature: it takes a list of tokens (tokens) and an integer N, the desired length of each N-gram, and returns a list of N-grams as tuples. Begin by creating an empty list to store the generated N-grams. Then loop over the start positions in the token list. At each position, build an N-gram from the current token and the next N-1 tokens, and skip any position where fewer than N tokens remain so that you never index out of bounds. Append each N-gram to the list. Finally, handle edge cases, such as a token list shorter than N (which should produce an empty list), and test your implementation with various inputs.
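
The sliding-window idea described above could look roughly like this. It is a sketch rather than the required solution; the name `generate_ngrams_sketch` is hypothetical, and raising `ValueError` for a non-positive N is an assumption the task does not specify.

```python
from typing import List, Tuple


def generate_ngrams_sketch(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens as a tuple."""
    if n <= 0:
        raise ValueError("n must be a positive integer")  # assumed behaviour
    ngrams: List[Tuple[str, ...]] = []
    # The last valid start index is len(tokens) - n. When len(tokens) < n
    # the range is empty, so the function returns [].
    for start in range(len(tokens) - n + 1):
        ngrams.append(tuple(tokens[start:start + n]))
    return ngrams


print(generate_ngrams_sketch(["this", "is", "a", "test"], 2))
# [('this', 'is'), ('is', 'a'), ('a', 'test')]
print(generate_ngrams_sketch(["short"], 3))
# []
```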

In [None]:
def generate_ngrams(tokens: List[str], N: int) -> List[Tuple[str, ...]]:
    """
    Generates N-grams from the list of tokens.

    Args:
        tokens (List[str]): List of tokens (words).
        N (int): The value of N for N-grams.

    Returns:
        List[Tuple[str, ...]]: List of N-grams, each a tuple of N tokens.
    """
    pass


To implement the build_ngram_model function, start from its signature: it takes a string of raw text (text) and an integer N, the desired length of each N-gram, and returns the N-gram model as a dictionary whose keys are N-grams (tuples of N words) and whose values are the number of times each N-gram occurs in the text. First preprocess the input text by converting it to lowercase and removing punctuation. Then tokenize the preprocessed text into words. Next, generate N-grams from the tokens using the generate_ngrams function implemented earlier. Iterate through the N-grams, count how often each one occurs, and store the counts in the dictionary. Finally, return the dictionary. Test your implementation with different texts and values of N.
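
One possible way to do the counting is with `collections.Counter`, sketched below. The function name `build_ngram_model_sketch` is hypothetical, and the preprocessing is inlined (lowercase, ASCII punctuation removed, whitespace split) only so the example runs on its own; in the notebook you would call your own helper functions instead.

```python
import string
from collections import Counter
from typing import Dict, Tuple


def build_ngram_model_sketch(text: str, n: int) -> Dict[Tuple[str, ...], int]:
    """Count how often each n-gram occurs in the preprocessed text."""
    # Inlined preprocessing and tokenizing (assumed definitions).
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Sliding window over the tokens; empty if there are fewer than n tokens.
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(Counter(ngrams))


model = build_ngram_model_sketch("The cat sat. The cat ran.", 2)
print(model[("the", "cat")])  # 2
```

`Counter` does the "iterate and increment" step in one call; a plain dictionary with `model[ngram] = model.get(ngram, 0) + 1` inside a loop gives the same counts.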

In [None]:
def build_ngram_model(text: str, N: int) -> Dict[Tuple[str, ...], int]:
    """
    Builds the N-gram model from the input text.

    Args:
        text (str): The input text.
        N (int): The value of N for N-grams.

    Returns:
        Dict[Tuple[str, ...], int]: The N-gram model represented as a dictionary where keys are N-grams and values are frequencies.
    """
    pass



Predictive Model:
<ul>
    <li>Implement a function to predict the next word given the previous N-1 words.</li>
    <li>Test the predictive model on sample input sequences.</li>
</ul>


To implement the predict_next_word function, start from its signature: it takes the N-gram model (model), represented as a dictionary, and a tuple of the previous N-1 words (prev_words), and predicts the next word from the N-gram frequencies in the model. First, find the N-grams in the model whose first N-1 words match prev_words. If any match, collect their final words as the candidate next words and randomly select one of them as the prediction. If nothing matches, for example because the previous words never occur in the training text, return None. Test your implementation with different N-gram models and input word sequences.
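
The matching-and-sampling logic might be sketched as follows. This is one interpretation, not the only valid one: it weights the random choice by each N-gram's count, whereas a uniform choice among the candidates would also fit the description. The function name and the optional `rng` parameter (added for reproducible tests) are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


def predict_next_word_sketch(
    model: Dict[Tuple[str, ...], int],
    prev_words: Tuple[str, ...],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Sample a next word from the n-grams whose prefix equals prev_words."""
    if rng is None:
        rng = random.Random()
    context = tuple(prev_words)
    # Map each candidate next word to the count of its matching n-gram.
    candidates = {
        ngram[-1]: count
        for ngram, count in model.items()
        if ngram[:-1] == context
    }
    if not candidates:
        return None  # context never seen in the training text
    words = list(candidates)
    weights = [candidates[word] for word in words]
    # Assumption: frequency-weighted sampling; use rng.choice(words) for uniform.
    return rng.choices(words, weights=weights, k=1)[0]


toy_model = {("quick", "brown", "fox"): 3, ("brown", "fox", "jumps"): 1}
print(predict_next_word_sketch(toy_model, ("quick", "brown")))  # fox
print(predict_next_word_sketch(toy_model, ("lazy", "dog")))     # None
```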

In [None]:
def predict_next_word(model: Dict[Tuple[str, ...], int], prev_words: Tuple[str, ...]) -> Optional[str]:
    """
    Predicts the next word given the previous words and the N-gram model.

    Args:
        model (Dict[Tuple[str, ...], int]): The N-gram model.
        prev_words (Tuple[str, ...]): Tuple of the previous N-1 words.

    Returns:
        Optional[str]: The predicted next word, or None if no next word is found.
    """
    pass


Evaluating N-gram Models <br>
<h3>What Is Perplexity?</h3>
Perplexity is a metric for evaluating how well a language model, such as an N-gram model, predicts a sequence of words. It is typically computed on held-out text that the model was not trained on. A lower perplexity means the model assigns a higher probability to that text, which suggests better generalization.<br>
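
For reference, the quantity can be written out as follows for a test text that yields $M$ N-grams. The symbols $M$ and $w_i$ are notation introduced here for illustration, and $P(\cdot \mid \cdot)$ stands for whatever conditional probability estimate the model provides.

```latex
\mathrm{PP}
  = \exp\!\left(-\frac{1}{M}\sum_{i=1}^{M} \ln P\left(w_i \mid w_{i-N+1}, \dots, w_{i-1}\right)\right)
  = 2^{-\frac{1}{M}\sum_{i=1}^{M} \log_2 P\left(w_i \mid w_{i-N+1}, \dots, w_{i-1}\right)}
```

The two forms are equal, so natural logs with `exp` and base-2 logs with `2 **` give the same perplexity as long as the base is used consistently.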

### 🔢 Steps to Implement Perplexity Calculation

To compute the **perplexity** of a test dataset using a trained N-gram model, follow these steps:

1. **Preprocess the test text**  
   Apply the same preprocessing steps (e.g., lowercasing, punctuation removal) that were used during training.

2. **Tokenize the test text**  
   Split the preprocessed text into individual word tokens.

3. **Generate N-grams**  
   Use the same N value (e.g., bigram, trigram) as in the trained model to create N-grams from the tokenized test text.

4. **Calculate log probabilities**  
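   For each N-gram in the test set, estimate its conditional probability from the counts in the trained model (the N-gram's count divided by the total count of N-grams that share its first N-1 words) and take the logarithm. An N-gram that never occurs in the training data has probability zero, whose logarithm is undefined, so decide how to handle it (for example, with smoothing).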

5. **Compute the average log probability**  
   Sum all the log probabilities and divide by the number of N-grams to get the average.

6. **Calculate perplexity**  
   Raise the logarithm base (e for natural logs, 2 for base-2 logs) to the power of the negative average log probability to obtain the perplexity score.
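
The six steps above can be sketched end to end as shown below. Several choices here are assumptions and not part of the task statement: the name `evaluate_perplexity_sketch`, add-one (Laplace) smoothing to avoid taking the log of zero for unseen N-grams, a vocabulary taken from the words in the model's keys, natural logarithms, and returning infinity when the test text is too short to contain any N-gram. Preprocessing is inlined only so the example runs on its own.

```python
import math
import string
from collections import Counter
from typing import Dict, Tuple


def evaluate_perplexity_sketch(
    model: Dict[Tuple[str, ...], int], test_text: str, n: int
) -> float:
    """Perplexity of a count-based n-gram model with add-one smoothing."""
    # Steps 1-3: preprocess, tokenize, build test n-grams (assumed definitions).
    cleaned = test_text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    test_ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not test_ngrams:
        return float("inf")  # assumed convention for an empty test set

    # Total count of n-grams sharing each (n-1)-word context, plus vocabulary.
    context_counts: Counter = Counter()
    vocab = set()
    for ngram, count in model.items():
        context_counts[ngram[:-1]] += count
        vocab.update(ngram)
    vocab_size = max(len(vocab), 1)

    # Steps 4-5: sum natural-log probabilities with add-one smoothing.
    total_log_prob = 0.0
    for ngram in test_ngrams:
        numerator = model.get(ngram, 0) + 1
        denominator = context_counts[ngram[:-1]] + vocab_size
        total_log_prob += math.log(numerator / denominator)

    # Step 6: exponentiate the negative average log probability.
    return math.exp(-total_log_prob / len(test_ngrams))


toy_model = {("a", "b"): 1}
print(evaluate_perplexity_sketch(toy_model, "a b", 2))
```

In the toy call, the only test bigram has smoothed probability (1 + 1) / (1 + 2) = 2/3, so the result is approximately 1.5. Without smoothing, any unseen N-gram would make the perplexity infinite.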


In [None]:
def evaluate_perplexity(model: Dict[Tuple[str, ...], int], test_text: str, N: int) -> float:
    """
    Evaluates the perplexity of the N-gram model on the test text.

    Args:
        model (Dict[Tuple[str, ...], int]): The N-gram model.
        test_text (str): The test text.
        N (int): The value of N for N-grams.

    Returns:
        float: The perplexity value.
    """
    # Hint (when using base-2 logs):
    # perplexity = 2 ** (-total_log_probability / len(test_ngrams))
    pass


<h2>Test Cases</h2>

In [None]:
# Test cases for preprocess_text function
preprocess_text_test_cases = [
    ("This is a Test.", "this is a test"),
    ("Hello World!", "hello world"),
    ("A B C D E F G", "a b c d e f g")
]

# Test cases for generate_tokens function
generate_tokens_test_cases = [
    ("This is a test.", ["This", "is", "a", "test."]),
    ("Another example.", ["Another", "example."]),
    ("123 456 789", ["123", "456", "789"])
]

# Test cases for generate_ngrams function
generate_ngrams_test_cases = [
    (["This", "is", "a", "test"], 2, [("This", "is"), ("is", "a"), ("a", "test")]),
    (["Another", "example"], 2, [("Another", "example")]),
    (["123", "456", "789"], 2, [("123", "456"), ("456", "789")])
]

In [None]:
# Run test cases
print("\nRunning Test Cases:")
print("-" * 20)

# Test preprocess_text function
print("\nTesting preprocess_text function:")
for test_case in preprocess_text_test_cases:
    input_text, expected_output = test_case
    print(expected_output)
    print(preprocess_text(input_text))
    assert preprocess_text(input_text) == expected_output
print("All preprocess_text test cases passed.")

# Test generate_tokens function
print("\nTesting generate_tokens function:")
for test_case in generate_tokens_test_cases:
    input_text, expected_output = test_case
    assert generate_tokens(input_text) == expected_output
print("All generate_tokens test cases passed.")

# Test generate_ngrams function
print("\nTesting generate_ngrams function:")
for test_case in generate_ngrams_test_cases:
    input_tokens, n, expected_output = test_case
    assert generate_ngrams(input_tokens, n) == expected_output
print("All generate_ngrams test cases passed.")



In this task, you are required to evaluate the model on the provided dataset (TextFile.txt).

In [None]:
if __name__ == "__main__":
    # Read text from file
    with open("TextFile.txt", "r", encoding="utf-8") as file:
        text = file.read()

    # Preprocess the text
    preprocessed_text = preprocess_text(text)
    print("Preprocessed Text:", preprocessed_text)

    # Generate tokens
    tokens = generate_tokens(preprocessed_text)
    print("Tokens:", tokens)

    # Generate N-grams
    N = 3
    ngrams = generate_ngrams(tokens, N)
    print(f"{N}-grams:", ngrams)

    # Build N-gram model (build_ngram_model expects the raw text and
    # performs preprocessing and tokenization itself)
    ngram_model = build_ngram_model(text, N)
    print(f"{N}-gram Model:", ngram_model)

    # Predict next word from the previous N-1 words
    prev_words = ('quick', 'brown')
    next_word = predict_next_word(ngram_model, prev_words)
    print("Predicted Next Word:", next_word)

    # Evaluate perplexity
    test_text = "This is a test."
    perplexity = evaluate_perplexity(ngram_model, test_text, N)
    print("Perplexity:", perplexity)


# Deliverables

Submit the .ipynb notebook with your name and registration number clearly written at the top. There is no need to submit the text data along with it. Please strictly follow the submission guidelines.