
# Task
Implement a simple text generation algorithm using Markov chains. This task involves creating a statistical model that estimates the probability of the next character or word given the previous one(s).

## Prepare the training data

### Subtask:
Load a text corpus and preprocess it by cleaning the text and splitting it into a sequence of tokens (characters or words).


**Reasoning**:
Load the text data, convert it to lowercase, remove every character other than the letters a-z and whitespace (this drops punctuation and digits), and split the result into words.



In [None]:
import re

file_path = '/tmp/text_corpus.txt'
with open(file_path, 'r') as f:
    text = f.read()

# Normalize case, keep only the letters a-z and whitespace, then split on whitespace.
text = text.lower()
text = re.sub(r'[^a-z\s]', '', text)
words = text.split()

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/text_corpus.txt'
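
One way to guard against this error is to check for the file before reading it. The following is a minimal sketch, not part of the notebook: the helper name `load_words` and the in-memory fallback string are hypothetical, chosen only for illustration.

```python
import os
import re

# Hypothetical fallback corpus, used only when the file is missing.
FALLBACK_TEXT = "this is a test text this is another test"

def load_words(path, fallback=FALLBACK_TEXT):
    """Return lowercase alphabetic tokens read from `path`, or from `fallback`."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    else:
        text = fallback
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # same cleaning rule as the cell above
    return text.split()

print(load_words("/no/such/dir/missing_corpus.txt"))
```

Falling back to an in-memory string avoids writing to disk, at the cost of silently hiding a wrong path, so a real pipeline might prefer to log a warning in the `else` branch.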

## Build the Markov chain model

### Subtask:
Create a dictionary or similar data structure to store the transition probabilities of tokens based on the preceding token(s).


**Reasoning**:
Since the previous subtask failed because the file was missing, I first create a small dummy text file so that this subtask can proceed. I then initialize the Markov chain dictionary, count the word transitions in the dummy text, and convert those counts to probabilities.



In [None]:
import re

file_path = '/tmp/text_corpus.txt'
dummy_text = "this is a test text this is another test"
with open(file_path, 'w') as f:
    f.write(dummy_text)

with open(file_path, 'r') as f:
    text = f.read()

text = text.lower()
text = re.sub(r'[^a-z\s]', '', text)
words = text.split()

markov_chain = {}

for i in range(len(words)):
    current_word = words[i]
    if i > 0:
        prev_word = words[i-1]
        if prev_word not in markov_chain:
            markov_chain[prev_word] = {}
        if current_word not in markov_chain[prev_word]:
            markov_chain[prev_word][current_word] = 0
        markov_chain[prev_word][current_word] += 1
    else:
        # First word of the text: record it as a successor of the None start marker
        if None not in markov_chain:
            markov_chain[None] = {}
        if current_word not in markov_chain[None]:
            markov_chain[None][current_word] = 0
        markov_chain[None][current_word] += 1

# Convert counts to probabilities
for prev_word, next_words in markov_chain.items():
    total_count = sum(next_words.values())
    for next_word, count in next_words.items():
        markov_chain[prev_word][next_word] = count / total_count

print(markov_chain)

{None: {'this': 1.0}, 'this': {'is': 1.0}, 'is': {'a': 0.5, 'another': 0.5}, 'a': {'test': 1.0}, 'test': {'text': 1.0}, 'text': {'this': 1.0}, 'another': {'test': 1.0}}
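
The counting and normalization steps can also be written more compactly. The sketch below assumes a hypothetical function name `build_chain` (not used elsewhere in the notebook) and relies on `collections.defaultdict` and `collections.Counter`; for the same input it should produce the same dictionary as the cell above.

```python
from collections import Counter, defaultdict

def build_chain(words):
    """Build a first-order chain: previous word -> {next word: probability}."""
    counts = defaultdict(Counter)
    # Pair each word with its predecessor; None marks the start of the text.
    for prev_word, next_word in zip([None] + words[:-1], words):
        counts[prev_word][next_word] += 1
    chain = {}
    for prev_word, next_counts in counts.items():
        total = sum(next_counts.values())
        chain[prev_word] = {w: c / total for w, c in next_counts.items()}
    return chain

chain = build_chain("this is a test text this is another test".split())
print(chain["is"])  # {'a': 0.5, 'another': 0.5}
```

Building a new dictionary for the probabilities, rather than overwriting the counts in place, keeps the raw counts available if the model is later extended with more text.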


## Generate text

### Subtask:
Start with an initial token and iteratively generate subsequent tokens based on the probabilities in the Markov chain model until a desired length is reached.


**Reasoning**:
Implement the `generate_text` function to generate text using the Markov chain and then call the function and print the result.



In [None]:
import random

def generate_text(markov_chain, length):
    """Generate up to `length` words by sampling from `markov_chain`, starting at the None state."""
    current_word = random.choices(list(markov_chain[None].keys()), weights=list(markov_chain[None].values()), k=1)[0]
    generated_text = [current_word]

    for _ in range(length - 1):
        if current_word in markov_chain:
            next_words = markov_chain[current_word]
            if next_words:
                current_word = random.choices(list(next_words.keys()), weights=list(next_words.values()), k=1)[0]
                generated_text.append(current_word)
            else:
                # If there are no next words, stop or choose a new starting word
                # For simplicity, let's stop here
                break
        else:
            # If the current word is not in the markov chain, stop or choose a new starting word
            # For simplicity, let's stop here
            break

    return " ".join(generated_text)

generated_text_output = generate_text(markov_chain, 20)
print(generated_text_output)

this is a test text this is a test text this is a test text this is another test text
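
In this chain, "is" is the only word with more than one successor, so it is the only point where the random choice affects the output. The sketch below draws many successors of "is" with `random.choices` and tallies them; the seed and the number of draws are arbitrary choices made for illustration, and the split should come out roughly even.

```python
import random
from collections import Counter

# Transition probabilities for "is", copied from the printed model above.
next_words = {"a": 0.5, "another": 0.5}

rng = random.Random(0)  # arbitrary seed, used only for reproducibility
draws = rng.choices(list(next_words), weights=list(next_words.values()), k=10_000)
tally = Counter(draws)
print(tally)
```

Passing a dedicated `random.Random` instance rather than seeding the global generator keeps the check from affecting other cells that use `random`.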


## Summary:

### Data Analysis Key Findings

*   The initial attempt to load the text corpus resulted in a `FileNotFoundError` because the file was not present at the specified path (`/tmp/text_corpus.txt`).
*   A dummy text file with the content "this is a test text this is another test" was created to enable further steps.
*   The text was preprocessed by converting it to lowercase, removing every character other than the letters a-z and whitespace, and splitting it into a list of words.
*   A first-order Markov chain model was built as a dictionary. This dictionary stores the transition probabilities between words, using `None` to represent the start of the text.
*   The transition counts were successfully converted into probabilities by dividing the count of each next word by the total count for the preceding word.
*   A function `generate_text` was implemented to generate text of a specified length using the built Markov chain model.
*   The text generation starts with a word chosen probabilistically from those that can follow `None` (the start of the text).
*   Subsequent words are chosen probabilistically based on the preceding word's transitions defined in the Markov chain.
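*   The generated 20-word output, "this is a test text this is a test text this is a test text this is another test text", follows the learned transitions. Because "is" is the only word with more than one successor in the dummy corpus, the output can vary only at that point, so it largely cycles through the training text.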
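
The count-to-probability conversion described in the list above corresponds to the maximum-likelihood estimate of each transition probability. Writing $c(w_i, w_j)$ for the number of times word $w_j$ directly follows word $w_i$ (notation introduced here for illustration, not taken from the notebook):

```latex
P(w_j \mid w_i) = \frac{c(w_i, w_j)}{\sum_{k} c(w_i, w_k)}
```

For the dummy corpus, $c(\text{is}, \text{a}) = c(\text{is}, \text{another}) = 1$, so $P(\text{a} \mid \text{is}) = 1/2$, which matches the printed model.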

### Insights or Next Steps

*   The current model is a simple first-order Markov chain. To generate more complex and coherent text, consider implementing a higher-order Markov chain (considering two or more preceding words).
*   The model currently handles only lowercase alphabetical words. Future work could involve handling punctuation, capitalization, and a wider range of characters to improve the naturalness of the generated text.
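
The higher-order idea from the first point can be sketched as follows. This is a minimal second-order example under stated assumptions, not the notebook's implementation: the names `build_chain_order2` and `generate_order2` are hypothetical, the state is a tuple of the two preceding words, and raw counts are passed directly as sampling weights instead of being normalized first.

```python
import random
from collections import Counter, defaultdict

def build_chain_order2(words):
    """Map each pair of consecutive words to counts of the word that follows."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def generate_order2(counts, seed_pair, length, rng=random):
    """Generate up to `length` words, starting from the two words in `seed_pair`."""
    output = list(seed_pair)
    while len(output) < length:
        state = (output[-2], output[-1])
        if state not in counts:
            break  # dead end: this pair was never followed by anything
        followers = counts[state]
        next_word = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        output.append(next_word)
    return " ".join(output)

words = "this is a test text this is another test".split()
counts = build_chain_order2(words)
print(generate_order2(counts, ("this", "is"), 8))
```

With a corpus this small, most two-word states have a single successor and the pair ("another", "test") has none, so generation can stop early; a longer corpus is needed before the extra context pays off.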
