# Text Generator Using Markov Chains

## Assignment 1 - GenAI

This notebook implements a simple text generator using Markov chains. A Markov chain is a stochastic model of a sequence of events in which the probability of each event depends only on the state reached by the previous event. Here, a state is the most recent word (or the last few words), and an event is the choice of the next word.

### How it works:
1. **Training**: Analyze input text to build a probability model
2. **State Transitions**: Track which words follow other words
3. **Generation**: Use the model to generate new text based on learned patterns
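
As a minimal sketch of steps 1 and 2, the snippet below builds a word-to-followers table from a toy sentence. The helper name `build_transitions` and the toy text are illustrative assumptions and are not part of the class defined later in this notebook.

```python
from collections import defaultdict

def build_transitions(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    transitions = defaultdict(list)
    # Pair every word with its successor: (w0, w1), (w1, w2), ...
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return dict(transitions)

toy_text = "the cat sat on the mat the cat ran"
table = build_transitions(toy_text)
for word, followers in table.items():
    print(f"{word!r} -> {followers}")
```

Repeated followers are kept on purpose: a word that follows "the" twice is twice as likely to be drawn when the table is sampled uniformly.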

## 1. Import Required Libraries

In [1]:
import random
import re
from collections import defaultdict, Counter

## 2. Markov Chain Text Generator Class

The `MarkovChain` class implements:
- **Training**: Builds transition probabilities from input text
- **Text Generation**: Creates new text based on learned patterns
- **Order**: Sets the context size, i.e. how many preceding words form a state (default is 1, which gives a bigram model over word pairs)
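
To make the effect of the order concrete, here is a small sketch that lists the (state, next word) pairs for two different orders. The function `states_and_next` is a hypothetical helper; unlike the class below, it always represents a state as a tuple, even for order 1.

```python
def states_and_next(words, order):
    """Yield (state, next_word) pairs; the state is a tuple of `order` words."""
    for i in range(len(words) - order):
        yield tuple(words[i:i + order]), words[i + order]

words = "deep learning is a subset of machine learning".split()

pairs_order1 = list(states_and_next(words, 1))
pairs_order2 = list(states_and_next(words, 2))

print("Order 1:", pairs_order1[:3])
print("Order 2:", pairs_order2[:3])
```

Each extra word of context removes one usable position from the end of the text, so a higher order yields slightly fewer transitions from the same corpus.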

In [2]:
class MarkovChain:
    """
    A simple Markov Chain text generator.
    
    Parameters:
    -----------
    order : int
        The order of the Markov chain, i.e. the number of preceding words
        that form a state. Default is 1 (a bigram model).
    """
    
    def __init__(self, order=1):
        self.order = order
        self.model = defaultdict(list)
        self.start_words = []
        
    def tokenize(self, text):
        """Tokenize text into words."""
        # Remove extra whitespace and split into words
        text = re.sub(r'\s+', ' ', text.strip())
        words = text.split()
        return words
    
    def train(self, text):
        """
        Train the Markov chain on input text.
        
        Parameters:
        -----------
        text : str
            The training text corpus
        """
        words = self.tokenize(text)
        
        if len(words) < self.order + 1:
            raise ValueError(f"Text too short for order {self.order}")
        
        # Build the model
        for i in range(len(words) - self.order):
            # Create state (current word(s))
            if self.order == 1:
                state = words[i]
            else:
                state = tuple(words[i:i + self.order])
            
            # Next word
            next_word = words[i + self.order]
            
            # Add transition
            self.model[state].append(next_word)
        
        # Store possible starting states
        if self.order == 1:
            self.start_words = [words[i] for i in range(len(words) - self.order)]
        else:
            self.start_words = [tuple(words[i:i + self.order]) 
                               for i in range(len(words) - self.order)]
    
    def generate(self, length=50, start_word=None):
        """
        Generate text using the trained Markov chain.
        
        Parameters:
        -----------
        length : int
            Number of words to generate
        start_word : str or tuple
            Starting state: a single word for order 1, or a tuple of
            `order` words for higher orders. If None, chosen randomly.
            
        Returns:
        --------
        str : Generated text
        """
        if not self.model:
            raise ValueError("Model not trained. Call train() first.")
        
        # Choose starting state
        if start_word is None:
            current_state = random.choice(self.start_words)
        else:
            # Use the given state as-is (a str for order 1, a tuple otherwise)
            current_state = start_word
        
        # Initialize result
        if self.order == 1:
            result = [current_state]
        else:
            result = list(current_state)
        
        # Generate words
        for _ in range(length - self.order):
            if current_state not in self.model:
                # If we hit a dead end, start from a random state
                current_state = random.choice(self.start_words)
                if self.order == 1:
                    result.append(current_state)
                else:
                    result.extend(list(current_state))
            
            # Choose next word based on current state
            next_word = random.choice(self.model[current_state])
            result.append(next_word)
            
            # Update state
            if self.order == 1:
                current_state = next_word
            else:
                current_state = tuple(list(current_state)[1:] + [next_word])
        
        # A dead-end restart appends extra words, so trim to the requested length
        return ' '.join(result[:length])
    
    def get_statistics(self):
        """Return statistics about the trained model."""
        total_states = len(self.model)
        total_transitions = sum(len(v) for v in self.model.values())
        avg_transitions = total_transitions / total_states if total_states > 0 else 0
        
        return {
            'total_states': total_states,
            'total_transitions': total_transitions,
            'avg_transitions_per_state': avg_transitions,
            'order': self.order
        }

## 3. Sample Training Text

Let's use some sample text to train our model. You can replace this with any text corpus.

In [3]:
sample_text = """
Artificial intelligence is transforming the world. Machine learning algorithms can learn from data.
Deep learning is a subset of machine learning. Neural networks are inspired by the human brain.
Natural language processing helps computers understand human language. Computer vision enables machines to see.
Artificial intelligence can solve complex problems. Machine learning models need training data.
Deep learning models require large amounts of data. Neural networks consist of layers of neurons.
Natural language processing is used in chatbots. Computer vision is used in autonomous vehicles.
The future of artificial intelligence is bright. Machine learning is revolutionizing many industries.
Deep learning has achieved remarkable results. Neural networks can recognize patterns in data.
Natural language processing improves human-computer interaction. Computer vision technology is advancing rapidly.
Artificial intelligence will continue to evolve. Machine learning techniques are becoming more sophisticated.
Deep learning frameworks make development easier. Neural networks can be trained on GPUs.
Natural language processing models understand context. Computer vision systems can detect objects.
"""

print("Training text loaded!")
print(f"Text length: {len(sample_text)} characters")
print(f"Word count: {len(sample_text.split())} words")

Training text loaded!
Text length: 1210 characters
Word count: 160 words


## 4. Example 1: Order-1 Markov Chain (Bigram)

In a first-order Markov chain, each word depends only on the previous word.
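
The dependence on the previous word can be expressed as a table of estimated transition probabilities. The sketch below computes them as relative frequencies; `transition_probabilities` and the toy sentence are assumptions made for illustration. Drawing uniformly from a follower list that keeps duplicates, as the class above does with `random.choice`, is equivalent to sampling from these frequencies.

```python
from collections import Counter, defaultdict

def transition_probabilities(text):
    """Estimate P(next | current) as relative frequencies of observed word pairs."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {w: c / total for w, c in followers.items()}
    return probabilities

probs = transition_probabilities(
    "learning is fun and learning is hard but learning pays"
)
for nxt, p in probs["learning"].items():
    print(f"P({nxt!r} | 'learning') = {p:.2f}")
```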

In [4]:
# Create and train Order-1 Markov Chain
markov_order1 = MarkovChain(order=1)
markov_order1.train(sample_text)

# Display statistics
stats = markov_order1.get_statistics()
print("Model Statistics:")
print(f"  Order: {stats['order']}")
print(f"  Unique states: {stats['total_states']}")
print(f"  Total transitions: {stats['total_transitions']}")
print(f"  Average transitions per state: {stats['avg_transitions_per_state']:.2f}")

print("\n" + "="*60)
print("Generated Text (Order-1):")
print("="*60)

# Generate multiple examples
for i in range(3):
    generated = markov_order1.generate(length=30)
    print(f"\nExample {i+1}:")
    print(generated)

Model Statistics:
  Order: 1
  Unique states: 93
  Total transitions: 159
  Average transitions per state: 1.71

Generated Text (Order-1):

Example 1:
problems. Machine learning models understand context. Computer vision is advancing rapidly. Artificial intelligence is used in chatbots. Computer vision enables machines to evolve. Machine learning models need training data. Deep

Example 2:
data. Deep learning techniques are becoming more sophisticated. Deep learning techniques are inspired by the human brain. Natural language processing is used in autonomous vehicles. The future of layers of

Example 3:
development easier. Neural networks consist of data. Deep learning models need training data. Deep learning techniques are becoming more sophisticated. Deep learning is transforming the world. Machine learning has achieved


## 5. Example 2: Order-2 Markov Chain (Trigram)

In a second-order Markov chain, each word depends on the previous two words, which usually yields more coherent text.
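
Longer contexts are more coherent partly because each state has fewer possible continuations. The following sketch, using a hypothetical helper `average_branching` and a made-up toy corpus, computes the same "average transitions per state" figure as `get_statistics()` for several orders.

```python
from collections import defaultdict

def average_branching(text, order):
    """Return (number of states, average transitions per state) for an order."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    if not model:
        return 0, 0.0
    total_transitions = sum(len(v) for v in model.values())
    return len(model), total_transitions / len(model)

toy_corpus = (
    "Machine learning is fun. Deep learning is hard. "
    "Machine learning models need data. Deep learning models need GPUs."
)
for order in (1, 2, 3):
    states, avg = average_branching(toy_corpus, order)
    print(f"order={order}: {states} states, {avg:.2f} transitions per state")
```

As the average approaches 1, most states have a single continuation, so generation increasingly replays the training text.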

In [5]:
# Create and train Order-2 Markov Chain
markov_order2 = MarkovChain(order=2)
markov_order2.train(sample_text)

# Display statistics
stats = markov_order2.get_statistics()
print("Model Statistics:")
print(f"  Order: {stats['order']}")
print(f"  Unique states: {stats['total_states']}")
print(f"  Total transitions: {stats['total_transitions']}")
print(f"  Average transitions per state: {stats['avg_transitions_per_state']:.2f}")

print("\n" + "="*60)
print("Generated Text (Order-2):")
print("="*60)

# Generate multiple examples
for i in range(3):
    generated = markov_order2.generate(length=30)
    print(f"\nExample {i+1}:")
    print(generated)

Model Statistics:
  Order: 2
  Unique states: 131
  Total transitions: 158
  Average transitions per state: 1.21

Generated Text (Order-2):

Example 1:
revolutionizing many industries. Deep learning is revolutionizing many industries. Deep learning has achieved remarkable results. Neural networks can be trained on GPUs. Natural language processing is used in chatbots. Computer

Example 2:
understand context. Computer vision technology is advancing rapidly. Artificial intelligence can solve complex problems. Machine learning techniques are becoming more sophisticated. Deep learning frameworks make development easier. Neural networks can

Example 3:
language processing improves human-computer interaction. Computer vision technology is advancing rapidly. Artificial intelligence is transforming the world. Machine learning algorithms can learn from data. Deep learning has achieved remarkable results.


## 6. Interactive Text Generation

Generate text with custom parameters:

In [6]:
# Customize these parameters
WORD_COUNT = 40
START_WORD = "Artificial"  # or None for random start

print(f"Generating {WORD_COUNT} words starting with '{START_WORD}'...\n")

# Generate with Order-1
print("Order-1 Generation:")
print("-" * 60)
text1 = markov_order1.generate(length=WORD_COUNT, start_word=START_WORD)
print(text1)

print("\n" + "="*60 + "\n")

# Generate with Order-2
print("Order-2 Generation:")
print("-" * 60)
# For order-2, we need a tuple of 2 words
if START_WORD:
    # Find a valid starting bigram that starts with START_WORD
    valid_starts = [state for state in markov_order2.start_words 
                   if state[0] == START_WORD]
    if valid_starts:
        start_state = random.choice(valid_starts)
        text2 = markov_order2.generate(length=WORD_COUNT, start_word=start_state)
    else:
        text2 = markov_order2.generate(length=WORD_COUNT)
else:
    text2 = markov_order2.generate(length=WORD_COUNT)
print(text2)

Generating 40 words starting with 'Artificial'...

Order-1 Generation:
------------------------------------------------------------
Artificial intelligence is used in chatbots. Computer vision enables machines to evolve. Machine learning models understand human brain. Natural language processing models require large amounts of machine learning. Neural networks are becoming more sophisticated. Deep learning is used in chatbots.


Order-2 Generation:
------------------------------------------------------------
Artificial intelligence will continue to evolve. Machine learning models require large amounts of data. Neural networks consist of layers of neurons. Natural language processing improves human-computer interaction. Computer vision is used in chatbots. Computer vision technology is advancing rapidly. Artificial


## 7. Train on Your Own Text

You can train the model on any text you want. Replace the text below with your own corpus:

In [7]:
# Example: Train on a different text corpus
custom_text = """
Python is a high-level programming language. Python is easy to learn and powerful.
Many developers love Python for its simplicity. Python has a vast ecosystem of libraries.
Data science professionals use Python extensively. Python supports multiple programming paradigms.
Web developers build applications with Python frameworks. Python code is readable and maintainable.
Machine learning engineers prefer Python for AI projects. Python has excellent community support.
"""

# Create new model and train
custom_markov = MarkovChain(order=1)
custom_markov.train(custom_text)

print("Custom Model Statistics:")
print(custom_markov.get_statistics())

print("\n" + "="*60)
print("Generated Text from Custom Model:")
print("="*60)

for i in range(2):
    print(f"\nExample {i+1}:")
    print(custom_markov.generate(length=25))

Custom Model Statistics:
{'total_states': 46, 'total_transitions': 63, 'avg_transitions_per_state': 1.3695652173913044, 'order': 1}

Generated Text from Custom Model:

Example 1:
developers love Python has excellent community support. AI projects. Python for its simplicity. Python is a high-level programming paradigms. Web developers build applications with Python has

Example 2:
and maintainable. Machine learning engineers prefer Python for AI projects. Python for AI projects. Python is easy to learn and powerful. Many developers love Python


## 8. Conclusion

### Key Observations:

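1. **Order-1 (Bigram)**:
   - Predicts each word from the single preceding word
   - More varied output (1.71 transitions per state on average for the sample corpus)
   - Often splices unrelated sentences together, e.g. "The future of layers of"

2. **Order-2 (Trigram)**:
   - Predicts each word from the two preceding words
   - More coherent, sentence-like output
   - With only 1.21 transitions per state on average, it mostly replays training sentences on this small corpus

3. **Trade-offs**:
   - Higher order → More coherent, but needs more training data to avoid copying the corpus
   - Lower order → More varied, but less grammatical structure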

### Applications:
- Text generation and completion
- Creative writing assistance
- Chatbot responses
- Data augmentation
- Style mimicry

### Limitations:
- No long-term context memory
- Requires substantial training data
- May reproduce training data verbatim
- No semantic understanding
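
One rough way to check the "reproduces training data verbatim" limitation is to measure how many n-grams of a generated text also occur in the training text. This is only a sketch: `verbatim_ratio`, its default `n=4`, and the toy strings are assumptions chosen for illustration.

```python
def ngrams(words, n):
    """Return the list of consecutive n-word tuples in `words`."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def verbatim_ratio(generated, training, n=4):
    """Fraction of n-grams in `generated` that also appear in `training`."""
    generated_ngrams = ngrams(generated.split(), n)
    if not generated_ngrams:
        return 0.0
    training_ngrams = set(ngrams(training.split(), n))
    copied = sum(1 for g in generated_ngrams if g in training_ngrams)
    return copied / len(generated_ngrams)

training = "neural networks can recognize patterns in data"
print(verbatim_ratio("networks can recognize patterns", training))
print(verbatim_ratio("networks can recognize objects quickly", training))
```

A ratio close to 1.0 suggests the generator is copying long stretches of the corpus rather than recombining it.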

## 9. Extensions and Improvements

Try these enhancements:

1. **Load text from files**: Train on books, articles, or documents
2. **Sentence-aware generation**: Start/end at sentence boundaries
3. **Temperature parameter**: Control randomness in word selection
4. **Character-level models**: Generate text character by character
5. **Visualization**: Show state transition graphs
6. **Save/Load models**: Persist trained models for reuse
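
As one possible take on the temperature idea (item 3), the sketch below reshapes follower counts before sampling. It is not part of the `MarkovChain` class; the names `temperature_weights` and `sample_next`, and the choice of raising counts to the power 1/T, are assumptions for illustration.

```python
import random
from collections import Counter

def temperature_weights(followers, temperature=1.0):
    """Convert follower counts into probabilities reshaped by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    counts = Counter(followers)
    # Exponent 1/T sharpens the distribution for T < 1 and flattens it for T > 1
    scaled = {w: c ** (1.0 / temperature) for w, c in counts.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def sample_next(followers, temperature=1.0, rng=random):
    """Draw one follower according to the temperature-adjusted probabilities."""
    weights = temperature_weights(followers, temperature)
    words = list(weights)
    return rng.choices(words, weights=[weights[w] for w in words], k=1)[0]

followers = ["is"] * 6 + ["models"] * 3 + ["has"]
for t in (0.5, 1.0, 2.0):
    rounded = {w: round(p, 3) for w, p in temperature_weights(followers, t).items()}
    print(f"T={t}: {rounded}")
print("Sampled:", sample_next(followers, temperature=0.5))
```

At `temperature=1.0` the probabilities equal the plain relative frequencies, so the behaviour matches sampling with `random.choice` from the follower list.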

In [8]:
# Example extension: list the states with the most outgoing transitions
print("Most common word transitions (Order-1 Model):")
print("="*60)

# Get top 10 states with most transitions
state_counts = [(state, len(transitions)) 
                for state, transitions in markov_order1.model.items()]
state_counts.sort(key=lambda x: x[1], reverse=True)

for i, (state, count) in enumerate(state_counts[:10], 1):
    next_words = Counter(markov_order1.model[state])
    top_next = next_words.most_common(3)
    print(f"{i}. '{state}' → {count} transitions")
    print(f"   Most common next words: {top_next}")
    print()

Most common word transitions (Order-1 Model):
1. 'learning' → 8 transitions
   Most common next words: [('is', 2), ('models', 2), ('algorithms', 1)]

2. 'is' → 7 transitions
   Most common next words: [('used', 2), ('transforming', 1), ('a', 1)]

3. 'can' → 5 transitions
   Most common next words: [('learn', 1), ('solve', 1), ('recognize', 1)]

4. 'of' → 5 transitions
   Most common next words: [('machine', 1), ('data.', 1), ('layers', 1)]

5. 'intelligence' → 4 transitions
   Most common next words: [('is', 2), ('can', 1), ('will', 1)]

6. 'Machine' → 4 transitions
   Most common next words: [('learning', 4)]

7. 'data.' → 4 transitions
   Most common next words: [('Deep', 2), ('Neural', 1), ('Natural', 1)]

8. 'Deep' → 4 transitions
   Most common next words: [('learning', 4)]

9. 'Neural' → 4 transitions
   Most common next words: [('networks', 4)]

10. 'networks' → 4 transitions
   Most common next words: [('can', 2), ('are', 1), ('consist', 1)]
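
The generated examples earlier often begin mid-sentence (for instance with "problems." or "data."). For the sentence-aware extension (item 2), one simple approach is to restrict starting states to words that open a sentence. The helper `sentence_starts` and its punctuation rule are assumptions for this sketch, not part of the class above.

```python
def sentence_starts(words, terminators=(".", "!", "?")):
    """Return the words that begin a sentence: the first word, plus every
    word that follows a token ending in sentence-final punctuation."""
    starts = [words[0]] if words else []
    for previous, current in zip(words, words[1:]):
        if previous.endswith(terminators):
            starts.append(current)
    return starts

tokens = "Deep learning is fun. Neural networks learn. Computer vision sees.".split()
starts = sentence_starts(tokens)
print(starts)
```

For an order-1 model, a list like this could replace the contents of `start_words` so that `generate()` picks its random starting word from sentence openers only.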

