Type: Discussion  
Max score: 100  
Start: Aug 29, 7:00 am  
~~Due: Sep 6, 10:00 am **(<t:1725588000:R>)**~~  
Due: Sep 18, 10:00 am **(<t:1726624800:R>)**

Allow late submissions: ✅

-----

# Exploring Text Generation with N-Grams in Python

Overview: This activity introduces students to n-gram text generation using Python, allowing them to experiment with different sample texts and n-values. By implementing and modifying an n-gram model, students will observe how variations in input text and n-value affect the generated output. This hands-on experience will enhance their understanding of probabilistic text generation and the role of n-grams in natural language processing.

## Objectives:

1. To explore the concept of n-grams and their application in text generation.
2. To implement an n-gram model in Python with different sample texts and n-values.
3. To analyze and compare the results of text generation with various inputs and configurations.

## Learning Outcomes:

1. Students will be able to explain how n-grams are used in generating text.
2. Students will gain practical experience in coding and using n-gram models.
3. Students will understand how input text and n-value influence the generated text.

**Background:** In natural language processing (NLP), n-grams are used to model sequences of words and predict subsequent words based on previous context. N-gram models are a fundamental technique in text generation, where the goal is to create new text that resembles a given sample. Understanding how different n-gram sizes and input texts affect the generated output is crucial for developing effective text generation models.
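
As a rough illustration of the idea above, the sketch below extracts bigrams from a short sentence and turns their counts into next-word relative frequencies. The sample sentence and the helper name `extract_ngrams` are assumptions made for this example only; they are not part of the assignment.

```python
from collections import Counter

def extract_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample sentence, chosen so that "the" has more than one follower.
sample = "the cat sat on the mat and the cat slept"
bigrams = extract_ngrams(sample, 2)
bigram_counts = Counter(bigrams)

# Count which words follow "the", then normalise the counts to relative frequencies.
following_the = Counter(second for first, second in bigrams if first == "the")
total = sum(following_the.values())
probabilities = {word: count / total for word, count in following_the.items()}

print(bigram_counts.most_common(3))
print(probabilities)
```

A generator that samples the next word in proportion to these frequencies is what makes the output probabilistic rather than a fixed copy of the input.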

**Problem Statement:** You are tasked with exploring the behavior of n-gram models in text generation. Specifically, you need to:

1. **Implement an n-gram Model:** Build a Python program that constructs an n-gram model from a given sample text. The program should handle various values of n to create models ranging from bigrams to higher-order n-grams.

2. **Generate Text with Different Inputs:** Use the n-gram model to generate text based on different sample inputs. Compare the generated text for various n-values and observe how the choice of n and input text affects the output.

3. **Analyze the Results:** Examine the coherence and relevance of the generated text. Identify any patterns or issues related to different n-gram sizes and sample texts. Assess how well the generated text reflects the structure and style of the input text.

## Tasks:

### 1. Create an n-gram Model:
- Implement a Python script that builds an n-gram model from a provided text. The script should handle varying sizes of n-grams (e.g., bigrams, trigrams, and 4-grams).
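
One possible shape for such a script is sketched below. It assumes the common convention that an n-gram model predicts a word from the previous n-1 words, so `n=2` is a bigram model with a one-word context. The function name `build_model` is hypothetical.

```python
from collections import defaultdict

def build_model(text, n):
    """Map each (n-1)-word context to the list of words seen right after it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        # Repeated followers are kept so that sampling reflects their frequency.
        model[context].append(words[i + n - 1])
    return dict(model)

text = "Data science is an interdisciplinary field that uses scientific methods."
for n in (2, 3, 4):
    model = build_model(text, n)
    print(f"n={n}: {len(model)} distinct contexts")
```

Storing followers in a list rather than a set keeps duplicates, which lets `random.choice` act as a frequency-weighted draw without any extra bookkeeping.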

### 2. Generate and Compare Texts:
- Use the model to generate text sequences for different input texts. Experiment with different n-values and analyze how these affect the text output.
- Example inputs could be:
  - Text 1: "The quick brown fox jumps over the lazy dog."
  - Text 2: "Data science is an interdisciplinary field that uses scientific methods."
- Generate text sequences with various n-values (e.g., trigram and 4-gram) for each input text.
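
The comparison described above could be run with a loop like the following sketch. The helper names `build_model` and `generate`, the `max_words` limit, and the fixed `seed` are assumptions for illustration; the same (n-1)-word-context convention is assumed.

```python
import random
from collections import defaultdict

def build_model(text, n):
    """Map each (n-1)-word context to the words that follow it (n >= 2)."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        model[tuple(words[i:i + n - 1])].append(words[i + n - 1])
    return dict(model)

def generate(model, n, max_words=12, seed=0):
    """Start from a random context and extend it one sampled word at a time."""
    rng = random.Random(seed)  # fixed seed so runs can be compared
    output = list(rng.choice(sorted(model)))
    while len(output) < max_words:
        followers = model.get(tuple(output[-(n - 1):]))
        if not followers:  # dead end: this context never appeared with a follower
            break
        output.append(rng.choice(followers))
    return " ".join(output)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Data science is an interdisciplinary field that uses scientific methods.",
]
for sample in texts:
    for n in (3, 4):
        print(f"n={n}: {generate(build_model(sample, n), n)}")
```

Because no word repeats in either sample sentence, every context has exactly one follower, so the output can only be a contiguous slice of the input. Longer inputs with repeated words are needed before the choice of n makes a visible difference.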

### 3. Evaluate and Discuss:
- Compare the quality of the generated text across different n-values. Consider aspects such as coherence, relevance, and adherence to the input text's style.
- Discuss any observed patterns or anomalies in the generated text and suggest potential improvements to the n-gram model or text generation process.
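
One way to put a number on these observations is to count how many contexts offer more than one possible next word; when that count is zero, the generator can only replay the input. The sketch below is one such measure under the (n-1)-word-context convention. The function name `branching_contexts` and the `corpus` sentence are hypothetical.

```python
from collections import defaultdict

def branching_contexts(text, n):
    """Return (contexts with more than one distinct follower, total contexts)."""
    words = text.lower().split()
    followers = defaultdict(set)
    for i in range(len(words) - n + 1):
        followers[tuple(words[i:i + n - 1])].add(words[i + n - 1])
    branching = sum(1 for options in followers.values() if len(options) > 1)
    return branching, len(followers)

# Hypothetical corpus with repeated words, plus one of the assignment's sample texts.
corpus = "the cat sat on the mat and the dog sat on the rug"
assignment_text = "Data science is an interdisciplinary field that uses scientific methods."

for n in (2, 3, 5):
    print(f"corpus, n={n}: {branching_contexts(corpus, n)}")
print(f"assignment text, n=2: {branching_contexts(assignment_text, 2)}")
```

As n grows, contexts become more specific and fewer of them branch, which is why higher-order models tend to copy their source verbatim unless the training text is large.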

Take screenshots of your code and its output, and write up your insights.

-----

First 5 submissions without errors - 100  
Next 5 submissions without errors - 90  
All remaining on-time submissions without errors - 85  
Late submissions without errors - 75  
No submission, or a submission with errors - 0

In [16]:
import random
from collections import defaultdict

## Sample text for n-gram
text = "Data science is an interdisciplinary field that uses scientific methods."

## Repeat the prompt-and-generate cycle until the user chooses to stop
while True:
  ## Create a multi-value n-gram model
  n = int(input("Enter the n-value: "))
  ngrams = defaultdict(list)
  words = text.split()

  ## Each key is an n-word context; each value lists the words seen right after it.
  ## Because the key itself holds n words, n=2 conditions on two words (a trigram model).
  for i in range(len(words) - n):
    gram = tuple(words[i:i+n])
    ngrams[gram].append(words[i + n])

  ### User input
  user_input = input("Enter a word or phrase: ")
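  query = user_input.split()

  ### Find grams whose last word equals the last word typed (case-sensitive);
  ### empty input matches nothing instead of raising an IndexError
  matching_grams = [gram for gram in ngrams if query and gram[-1] == query[-1]]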

  if matching_grams:
    current_gram = random.choice(matching_grams)
    result = list(current_gram)

    #### Append up to 10 more words, each sampled from the followers of the current gram
    for _ in range(10):
      if current_gram in ngrams and ngrams[current_gram]:
        next_word = random.choice(ngrams[current_gram])
        result.append(next_word)
        current_gram = tuple(result[-n:])
      else:
        break

    print("Autocomplete suggestion: ", ' '.join(result))

  else:
    print("No matching grams found.")

  ### Ask for continuation of loop
  print("")
  if input("Do you wish to try again? (y/n): ").strip().lower() not in ["y", "yes"]:
    print("The program has ended.")
    break
  else:
    print("")

Enter the n-value: 3
Enter a word or phrase: data science
No matching grams found.

Do you wish to try again? (y/n): y

Enter the n-value: 2
Enter a word or phrase: data science
Autocomplete suggestion:  Data science is an interdisciplinary field that uses scientific methods.

Do you wish to try again? (y/n): y

Enter the n-value: 2
Enter a word or phrase: interdisciplinary
Autocomplete suggestion:  an interdisciplinary field that uses scientific methods.

Do you wish to try again? (y/n): y

Enter the n-value: 4
Enter a word or phrase: interdisciplinary
Autocomplete suggestion:  science is an interdisciplinary field that uses scientific methods.

Do you wish to try again? (y/n): n
The program has ended.
