**Important! Please do not remove any cells, including the test cells, even if they appear empty. They contain hidden tests, and deleting them could result in a loss of points, as the exercises are graded automatically. Only edit the cells where you are instructed to write your solution.**  

# Exercise 4: Text Generation using LSTM

### Objective
In this assignment, you will implement a character-level text generation model using Long Short-Term Memory (LSTM) networks in PyTorch. The goal is to understand how LSTMs work for sequential data and how to train them effectively to generate new text based on an input sequence.

You will follow the steps below:
1. Load and preprocess a text dataset
2. Character-level encoding by constructing the vocabulary and dictionary (2 points)
3. Batch generation for training (6 points)
4. Defining the character-level LSTM model (6 points)
5. Training loop (3 points)
6. Text generation using the trained model (3 points)


**IMPORTANT NOTE**: Kindly remove the line "raise NotImplementedError()" from all cells wherever present.

**Deliverables**:

Submit the completed notebook (ex4.ipynb) and your trained model (best_model.pth). Do not rename the notebook file; doing so may result in 0 points for this exercise.

In [1]:
skip_training = False   # You can set it to True if you want to run inference on your trained model.

In [2]:
# Do not delete this cell

### Import the necessary libraries

In [3]:
import random
import re

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

In [4]:
if torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### 1. Load and Preprocess the Text Dataset

We will be using *Alice's Adventures in Wonderland* by Lewis Carroll as our dataset. You can download it from [Project Gutenberg](https://www.gutenberg.org/):

[Alice's Adventures in Wonderland by Lewis Carroll (Project Gutenberg Page)](https://www.gutenberg.org/ebooks/11) \
[Direct Text File Download](https://www.gutenberg.org/files/11/11-0.txt)

We chose *Alice's Adventures in Wonderland* because it is a relatively small text, which keeps training manageable even on a CPU. However, you are highly encouraged to explore other texts from Project Gutenberg or other public domain sources.

This section contains the following steps:
1. Load the dataset into Python
2. Remove metadata to focus on the main part of the text
3. Clean the text by removing special characters and converting it to lowercase
   
The goal is to preprocess the dataset by filtering out any metadata that is not part of the text, converting the text to lowercase, and removing unnecessary punctuation. We will also build a dictionary to map each unique character to a unique integer.

#### 1.1. Load the Dataset

We start by loading the text dataset into Python. The dataset should be a plain text file. We then inspect a small portion of the raw text to understand its structure and to identify any unwanted metadata or special characters that should be removed during preprocessing.

In [5]:
txt_path = '/content/alice.txt' # replace 'alice.txt' with your txt path

In [6]:
# Do not delete this cell

In [7]:
with open(txt_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()

print('===First 1500 characters before any processing:\n\n')
print(raw_text[:1500])

print('\n\n\n===Ending characters before any processing:\n')
print(raw_text[-19000:-17000])

===First 1500 characters before any processing:


﻿The Project Gutenberg eBook of Alice's Adventures in Wonderland
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Alice's Adventures in Wonderland

Author: Lewis Carroll

Release date: June 27, 2008 [eBook #11]
                Most recently updated: October 21, 2024

Language: English

Credits: Arthur DiBianca and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     D

#### 1.2. Remove Metadata and Focus on the Main Text

Text files may contain introductory or closing metadata such as copyright information. We want to focus only on the main body of the text. For Alice's Adventures in Wonderland, we remove everything before the first chapter and everything from the Project Gutenberg closing marker onwards.
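The trimming described above relies on `str.find`, which returns `-1` when a marker is missing, so a slice built from it can silently return the wrong text. Below is a minimal sketch of the same slicing with a guard. The helper name `trim_between` and the sample string are illustrative assumptions, not part of the assignment.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"Start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"End marker not found: {end_marker!r}")
    return text[start:end]


# Hypothetical miniature of a Project Gutenberg file
sample = "header\nCHAPTER I.\nbody text\n*** END OF THE PROJECT ***\nlicense"
print(trim_between(sample, "CHAPTER I.", "*** END"))
```

Raising an error here is a design choice: a missing marker usually means the file layout changed, and it is easier to notice that at load time than after training on the wrong text.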

In [8]:
# For this example, we are removing everything before 'CHAPTER I.\nDown the Rabbit-Hole'
# and after the end marker
start_index = raw_text.find('CHAPTER I.\nDown the Rabbit-Hole')

end_index = raw_text.find('*** END OF THE PROJECT GUTENBERG') # closing markers of Project Gutenberg

trimmed_text = raw_text[start_index:end_index]

print('===Text after removing metadata:\n')
print(trimmed_text[:1500])

===Text after removing metadata:

CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh
dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the
time it all seemed quite natural); but when the Rabbit a

#### 1.3. Clean the Text

Next, we preprocess the text by removing any special characters, leaving only alphanumeric characters, and normalizing spaces. We also convert all text to lowercase to standardize the format. This helps the model learn without case sensitivity or irrelevant symbols.

##### Steps to follow:
##### 1. Convert text to lowercase:
First, convert the text to lowercase to avoid treating uppercase and lowercase letters as different characters.
##### 2. Remove special characters:
Then, you need to remove any character that is not a letter `(a-z)`, a number `(0-9)`, or a space `\s`.
##### 3. Handling double spaces:
After removing characters, there may be extra spaces in the text. Make sure that sequences of multiple spaces are reduced to just a single space.

**Hint:** You can use the `re.sub()` function from the regular expression library `re` to replace patterns. In this case, you will replace non-alphanumeric characters (`[^a-z0-9\s]`) with spaces, and then replace whitespace sequences (`\s+`) with a single space.
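To see what the two patterns from the hint do, here is a small sketch on a single sentence. The sample string and variable names are made up for illustration, and the final `strip()` is an optional extra that the hint does not mention.

```python
import re

sample = "Oh dear!  Oh dear! I shall be late!"

lowered = sample.lower()
# Every character that is not a-z, 0-9 or whitespace becomes a space
no_specials = re.sub(r'[^a-z0-9\s]', ' ', lowered)
# Runs of whitespace (including the spaces just introduced) collapse to one space
single_spaced = re.sub(r'\s+', ' ', no_specials).strip()

print(single_spaced)  # oh dear oh dear i shall be late
```

The order matters: lowercasing must come first because the character class only lists lowercase letters, and the whitespace pass must come last because the substitution step creates new runs of spaces.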

In [9]:
def preprocess_text(text):
    """
    Preprocesses the input text by i. converting it to lowercase,
    ii. removing non-alphanumeric characters (except spaces),
    iii. and normalizing spaces.

    Args:
    text -- The raw input text as a string

    Returns:
    cleaned_text -- The processed text where all the preprocessing steps are applied
    """
    # 1. Convert text to lowercase
    text = text.lower()

    # 2. Replace any character that is NOT a letter, number, or whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)

    # 3. Replace one or more whitespace characters with a single space
    text = re.sub(r'\s+', ' ', text)

    # Optional: strip leading/trailing spaces (good practice)
    cleaned_text = text.strip()

    return cleaned_text

cleaned_text = preprocess_text(trimmed_text)
print('Text after cleaning and converting to lowercase:\n')
print(cleaned_text[:1000])


Text after cleaning and converting to lowercase:

chapter i down the rabbit hole alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought alice without pictures or conversations so she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid whether the pleasure of making a daisy chain would be worth the trouble of getting up and picking the daisies when suddenly a white rabbit with pink eyes ran close by her there was nothing so very remarkable in that nor did alice think it so very much out of the way to hear the rabbit say to itself oh dear oh dear i shall be late when she thought it over afterwards it occurred to her that she ought to have wondered at this but at the time it all seemed quite natural but when the rabbit actually took a watch 

In [10]:
def test_text_cleaning(text, cleaned_text):
    """
    Visible test case for verifying text cleaning function.
    Collects all errors instead of stopping at the first failure.
    """

    errors = []
    all_tests_successful = True

    # Test 1: Check if the text length is reduced
    if not (len(cleaned_text) < len(text)):
        all_tests_successful = False
        errors.append("Task 1: Visible test: Text cleaning: The cleaned text should be shorter than the original raw text.")

    # Test 2: All characters should be lowercase
    if not cleaned_text.islower():
        all_tests_successful = False
        errors.append("Text cleaning: The cleaned text is not fully lowercase.")

    # Test 3: Ensure all special characters are removed
    if not all(char.isalnum() or char == ' ' for char in cleaned_text):
        all_tests_successful = False
        errors.append("Text cleaning: Special characters are still present in the cleaned text.")

    # Test 4: Ensure no consecutive spaces exist
    if "  " in cleaned_text:
        all_tests_successful = False
        errors.append("Text cleaning: Multiple consecutive spaces detected in the cleaned text.")

    # --- Report results ---
    if errors:
        feedback_txt.append("Task 1: Visible test: ")
        feedback_txt.extend(errors)
        raise AssertionError("\n".join(errors))
    elif all_tests_successful:
        print("Text cleaning test passed successfully!")

# Run visible test
test_text_cleaning(raw_text, cleaned_text)

Text cleaning test passed successfully!


### 2. Character-Level Encoding (2 points)

In this step, we convert the cleaned text into a format that the model can understand. Since we are working with character-level encoding, each individual character will be treated as a token. This allows the LSTM to learn patterns at the character level and generate text one character at a time.



#### 2.1 Character-Level Vocabulary

In this section, you will create a vocabulary of unique characters from the cleaned text and map each character to a unique integer.

##### Steps to follow:
##### 1. Create a vocabulary of unique characters:
First, you need to extract all the unique characters from the cleaned text. You can use Python's built-in `set()` function to find unique characters. For consistency and easier debugging, you should also sort the unique characters.
##### 2. Construct mapping between characters and integers:
Once you have the unique characters, create a dictionary `char_to_int` that maps each character to a unique integer to represent each character during training. You should also create a reverse mapping `int_to_char` that maps integers back to characters to be used when decoding the text later.
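As a quick illustration of the two steps, the sketch below builds both mappings for a toy string and checks that encoding followed by decoding returns the original text. The `toy_*` names are hypothetical and chosen so they do not clash with the variables used in the exercise.

```python
toy_text = "hello world"

# Step 1: sorted list of unique characters
toy_vocab = sorted(set(toy_text))

# Step 2: forward and reverse mappings
toy_char_to_int = {ch: i for i, ch in enumerate(toy_vocab)}
toy_int_to_char = {i: ch for ch, i in toy_char_to_int.items()}

# Round trip: text -> integers -> text
toy_encoded = [toy_char_to_int[ch] for ch in toy_text]
toy_decoded = ''.join(toy_int_to_char[i] for i in toy_encoded)

print(toy_vocab)
print(toy_encoded)
print(toy_decoded == toy_text)
```

Sorting the characters makes the integer assignment deterministic, since iteration order over a `set` is not guaranteed to be the same between runs.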


In [11]:
def create_char_mappings(cleaned_text):
    """
    Creates character-to-integer and integer-to-character mappings from the cleaned text.

    Args:
    cleaned_text -- The cleaned input text as a string

    Returns:
    char_to_int -- A dictionary mapping each unique character to an integer
    int_to_char -- A dictionary mapping each integer back to its corresponding character
    """
    # Get all unique characters in the text
    unique_chars = sorted(set(cleaned_text))

    # Create character → integer mapping
    char_to_int = {char: idx for idx, char in enumerate(unique_chars)}

    # Create integer → character mapping
    int_to_char = {idx: char for idx, char in enumerate(unique_chars)}

    return char_to_int, int_to_char

char_to_int, int_to_char = create_char_mappings(cleaned_text)
print('Character to Integer Mapping:')
for char, idx in list(char_to_int.items()):
    print(f"'{char}' : {idx}")

Character to Integer Mapping:
' ' : 0
'a' : 1
'b' : 2
'c' : 3
'd' : 4
'e' : 5
'f' : 6
'g' : 7
'h' : 8
'i' : 9
'j' : 10
'k' : 11
'l' : 12
'm' : 13
'n' : 14
'o' : 15
'p' : 16
'q' : 17
'r' : 18
's' : 19
't' : 20
'u' : 21
'v' : 22
'w' : 23
'x' : 24
'y' : 25
'z' : 26


#### 2.2 Encode the Text into Integers

During training, the model will use the encoded representation of the cleaned text as its input. In this section, you need to convert each character in the cleaned text to its corresponding integer using the `char_to_int` dictionary.

In [13]:
def encode_text(cleaned_text, char_to_int):
    """
    Encodes the cleaned text into an array of integers.

    Args:
    cleaned_text -- The cleaned input text as a string
    char_to_int -- Characters to integer mapping

    Returns:
    encoded_chars -- Numpy array of integers representing the encoded characters from the text
    """
    # Convert each character in the text to its corresponding integer
    encoded_list = [char_to_int[char] for char in cleaned_text]

    # Convert the Python list to a NumPy array
    encoded_chars = np.array(encoded_list, dtype=np.int64)

    return encoded_chars

encoded_chars = encode_text(cleaned_text, char_to_int)
print(f"Total encoded characters: {len(encoded_chars)}")
print('First 100 encoded characters:')
print(encoded_chars[:100])

Total encoded characters: 135001
First 100 encoded characters:
[ 3  8  1 16 20  5 18  0  9  0  4 15 23 14  0 20  8  5  0 18  1  2  2  9
 20  0  8 15 12  5  0  1 12  9  3  5  0 23  1 19  0  2  5  7  9 14 14  9
 14  7  0 20 15  0  7  5 20  0 22  5 18 25  0 20  9 18  5  4  0 15  6  0
 19  9 20 20  9 14  7  0  2 25  0  8  5 18  0 19  9 19 20  5 18  0 15 14
  0 20  8  5]


In [14]:
### BEGIN VISIBLE TESTS
def test_character_encoding_length(cleaned_text, encoded_chars, char_to_int, int_to_char):
    """
    Visible test case for verifying character encoding length and consistency.
    Collects all errors and writes them to feedback_txt.
    """

    errors = []
    all_tests_successful = True

    # ---- Test 1: Dictionary size match ----
    if len(char_to_int) != len(int_to_char):
        all_tests_successful = False
        errors.append(
            "Character encoding: char_to_int and int_to_char dictionaries should have the same length."
        )

    # ---- Test 2: Encoded text length match ----
    if len(encoded_chars) != len(cleaned_text):
        all_tests_successful = False
        errors.append(
            "Character encoding: The length of encoded_chars should match the length of cleaned_text."
        )

    # ---- Test 3: Round-trip decoding correctness ----
    try:
        decoded_text = ''.join([int_to_char[i] for i in encoded_chars])
        if decoded_text != cleaned_text:
            all_tests_successful = False
            errors.append(
                "Character encoding: Decoded text does not match the original cleaned text."
            )
    except Exception as e:
        all_tests_successful = False
        errors.append(
            f"Character encoding: Error during decoding — {str(e)}"
        )

    # ---- Report results ----
    if errors:
        feedback_txt.append("Task 2: Visible test: -1 points from the following:")
        feedback_txt.extend(errors)
        raise AssertionError("\n".join(errors))
    elif all_tests_successful:
        print("Character encoding length test passed successfully!")

# Run visible test
test_character_encoding_length(cleaned_text, encoded_chars, char_to_int, int_to_char)
### END VISIBLE TESTS


Character encoding length test passed successfully!


### 3. Batch Generation for Training (6 points)
In this step, you will implement the function `get_batches()` that splits the encoded data into smaller batches for training. Each batch will have input sequences `x` and target sequences `y` where `y` is `x` shifted by one position.  This means that the model is trained to generate the next character in the sequence based on the previous ones.

##### Steps to follow:
##### 1. Handle step_size:
The step size determines how much the window moves across the data after each sequence is generated. If `step_size` is not provided, it is set to `seq_length`. This means the sequences will not overlap. A smaller `step_size` allows for overlapping sequences.
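The effect of `step_size` is easiest to see on a short string. The sketch below lists the window start positions for a non-overlapping and an overlapping setting. The helper `window_starts` is a hypothetical name, and it assumes that every window needs `seq_length + 1` tokens so that a shifted target exists for the last input position.

```python
def window_starts(n_tokens, seq_length, step_size=None):
    """Start indices of all windows that still have a next-token target."""
    if step_size is None:
        step_size = seq_length  # default: windows do not overlap
    last_start = n_tokens - (seq_length + 1)
    return list(range(0, last_start + 1, step_size))


text = "abcdefghijkl"  # 12 tokens
for step in (None, 2):
    starts = window_starts(len(text), seq_length=4, step_size=step)
    print(step, [text[s:s + 4] for s in starts])
# None ['abcd', 'efgh']
# 2 ['abcd', 'cdef', 'efgh', 'ghij']
```

With `step_size=2`, each window shares two characters with the previous one, which doubles the number of sequences drawn from the same text.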

##### 2. Calculate the number of batches:
When calculating how many batches you can generate from the input data, there are some key factors to consider:
1. Sequence Length: Each input sequence in a batch will contain a specific number of tokens (representing the characters). The longer your sequence length is, the fewer total sequences you can generate from the input data.
2. Step Size: A smaller step size results in more overlap between sequences and it allows you to generate more sequences from the same input. If the step size is larger, there will be less overlap (or none at all if step size equals sequence length), leading to fewer sequences in total.
3. Batch Size: Once you generate sequences, you need to group them into batches for efficient training. The batch size defines how many sequences are grouped in each batch. A larger batch size means fewer batches because more sequences are grouped together in each batch.

Make sure to generate only full batches: any leftover sequences that do not fill a complete batch should be discarded.
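The three factors above can be combined into a short calculation. The sketch below is one way to count full batches, assuming each sequence needs `seq_length + 1` tokens (the input plus one extra token for the shifted target). The function name `count_full_batches` is illustrative. The token count 135001 is the length of the encoded text reported earlier in this notebook.

```python
def count_full_batches(n_tokens, batch_size, seq_length, step_size=None):
    """Number of complete batches that can be cut from n_tokens."""
    if step_size is None:
        step_size = seq_length
    # Largest valid start index is n_tokens - (seq_length + 1)
    usable = n_tokens - (seq_length + 1)
    n_sequences = usable // step_size + 1 if usable >= 0 else 0
    # Integer division drops the incomplete final batch
    return n_sequences // batch_size


print(count_full_batches(135001, 64, 100))      # 21  (1350 sequences)
print(count_full_batches(135001, 64, 100, 50))  # 42  (2699 sequences)
print(count_full_batches(135001, 64, 100, 3))   # 702 (44967 sequences)
```

Halving the step size roughly doubles the number of sequences, and therefore the number of batches, without changing the input text.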

##### 3. Trim the input array:
If the input data does not perfectly divide into batches, trim the array so it contains only full batches. Avoid having incomplete sequences at the end.

##### 4. Generate batches:
Use nested loops to generate batches:
- `x` will be the input sequence of length `seq_length`.
- `y` will be the target sequence, which is `x` shifted by one position (token).

##### 5. Store and return batches:
Store the input and target sequences in separate arrays (`x_batches` and `y_batches`) and return them as NumPy arrays to be used in training.

**Expected Shape**:
- Each batch in `x_batches` and `y_batches` should have the shape `(batch_size, seq_length)`.
- The returned `x_batches` and `y_batches` should be NumPy arrays with shapes `(num_batches, batch_size, seq_length)`.

**Important Notes**:
- Support for both overlapping and non-overlapping sequences using `step_size`.
- Handle edge cases where the data does not fit perfectly into full batches.
- Think about how you are generating both the `x` and `y` sequences. Their size should match but `y` should always be one token ahead of `x`.
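For comparison with the nested-loop approach described in the steps, the same batches can also be built with NumPy index arithmetic. This is an alternative sketch, not the method the exercise asks for. The name `get_batches_vectorized` is hypothetical, and it again assumes each sequence needs `seq_length + 1` tokens.

```python
import numpy as np


def get_batches_vectorized(arr, batch_size, seq_length, step_size=None):
    """Build (num_batches, batch_size, seq_length) input/target arrays without loops."""
    if step_size is None:
        step_size = seq_length
    usable = len(arr) - (seq_length + 1)
    n_sequences = usable // step_size + 1 if usable >= 0 else 0
    n_sequences = (n_sequences // batch_size) * batch_size  # keep full batches only
    if n_sequences == 0:
        raise ValueError("Not enough data for one full batch.")

    starts = np.arange(n_sequences) * step_size             # (n_sequences,)
    idx = starts[:, None] + np.arange(seq_length)[None, :]  # (n_sequences, seq_length)
    x = arr[idx]
    y = arr[idx + 1]                                        # targets are one token ahead
    return (x.reshape(-1, batch_size, seq_length),
            y.reshape(-1, batch_size, seq_length))


toy = np.arange(50)  # token i has value i, so y should equal x + 1
xb, yb = get_batches_vectorized(toy, batch_size=2, seq_length=5, step_size=3)
print(xb.shape)                    # (7, 2, 5)
print(np.array_equal(yb, xb + 1))  # True
```

Using `np.arange` as toy data makes the shift easy to check, because each target must be exactly one larger than its input.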



In [15]:
import numpy as np

def get_batches(encoded_chars, batch_size, seq_length, step_size=None):
    """
    Generates batches of input (x) and target (y) sequences from encoded text.

    Args:
        encoded_chars: 1D numpy array of encoded integers (the full text)
        batch_size:    Number of sequences per batch
        seq_length:    Length of each sequence
        step_size:     How many positions to move forward for the next sequence.
                       If None, defaults to seq_length (non-overlapping)

    Returns:
        x_batches: np.array of shape (num_batches, batch_size, seq_length)
        y_batches: np.array of shape (num_batches, batch_size, seq_length)
    """
    if step_size is None:
        step_size = seq_length  # Default: no overlap

    # Total length needed: we need seq_length + 1 because y is shifted by 1
    total_length_needed = seq_length + 1

    # Find how many complete sequences we can make with the given step_size
    n_sequences = 0
    pos = 0
    while pos + total_length_needed <= len(encoded_chars):
        n_sequences += 1
        pos += step_size

    # Total number of sequences we can generate
    if n_sequences == 0:
        raise ValueError("Text is too short for the given seq_length and step_size.")

    # Trim to only full batches
    total_sequences_in_full_batches = (n_sequences // batch_size) * batch_size
    if total_sequences_in_full_batches == 0:
        raise ValueError(f"Cannot form even one full batch with batch_size={batch_size}. "
                         f"Only {n_sequences} sequences available.")

    # We'll collect all valid sequences first
    x_list = []
    y_list = []

    pos = 0
    count = 0
    while count < total_sequences_in_full_batches:
        if pos + total_length_needed <= len(encoded_chars):
            x_seq = encoded_chars[pos : pos + seq_length]
            y_seq = encoded_chars[pos + 1 : pos + seq_length + 1]

            x_list.append(x_seq)
            y_list.append(y_seq)

            count += 1
        pos += step_size

    # Convert to numpy arrays
    x = np.array(x_list)  # Shape: (total_sequences_in_full_batches, seq_length)
    y = np.array(y_list)

    # Reshape into batches
    num_batches = total_sequences_in_full_batches // batch_size

    x_batches = x.reshape(num_batches, batch_size, seq_length)
    y_batches = y.reshape(num_batches, batch_size, seq_length)

    return x_batches, y_batches


# === Example Usage ===
batch_size = 64
seq_length = 100
step_size = 3  # Small step → lots of overlap → more training data (good!)

x_batches, y_batches = get_batches(encoded_chars, batch_size, seq_length, step_size)

print(f"Number of batches: {x_batches.shape[0]}")
print(f"Batch shape: (batch_size={x_batches.shape[1]}, seq_length={x_batches.shape[2]})")
print(f"x_batches shape: {x_batches.shape}")
print(f"y_batches shape: {y_batches.shape}")

# Verify that y is indeed x shifted by one
print("\nFirst sequence check:")
print("x[0]:", x_batches[0, 0, :10], "→ ... →", x_batches[0, 0, -5:])
print("y[0]:", y_batches[0, 0, :10], "→ ... →", y_batches[0, 0, -5:])
# np.roll wraps the first element around to the end, so compare only the overlapping part
print("Correct shift?", np.array_equal(y_batches[0, 0, :-1], x_batches[0, 0, 1:]))

Number of batches: 702
Batch shape: (batch_size=64, seq_length=100)
x_batches shape: (702, 64, 100)
y_batches shape: (702, 64, 100)

First sequence check:
x[0]: [ 3  8  1 16 20  5 18  0  9  0] → ... → [14  0 20  8  5]
y[0]: [ 8  1 16 20  5 18  0  9  0  4] → ... → [ 0 20  8  5  0]
Correct shift? True


In [16]:
### BEGIN VISIBLE TESTS
def test_batch_generation_shape_no_overlap(encoded_chars):
    """
    Visible test case for verifying batch generation when there is no overlap between sequences.
    Collects all errors and writes them to feedback_txt.
    """

    errors = []
    all_tests_successful = True

    try:
        x_batches, y_batches = get_batches(encoded_chars, batch_size=64, seq_length=100)
    except Exception as e:
        errors.append(f"Error: get_batches() raised an exception — {str(e)}")
        all_tests_successful = False
        x_batches, y_batches = None, None

    #  Test 1: Matching batch counts
    if x_batches is not None and y_batches is not None:
        if len(x_batches) != len(y_batches):
            all_tests_successful = False
            errors.append(
                "Batch generation: The number of x_batches and y_batches should be the same."
            )

    # Test 2: Check shapes
    expected_shape = (21, 64, 100)
    if x_batches is not None:
        if x_batches.shape != expected_shape:
            all_tests_successful = False
            errors.append(
                f"Batch generation: Expected x_batches shape {expected_shape}, but got {x_batches.shape}."
            )
    if y_batches is not None:
        if y_batches.shape != expected_shape:
            all_tests_successful = False
            errors.append(
                f"Batch generation: Expected y_batches shape {expected_shape}, but got {y_batches.shape}."
            )

    # Report results
    if errors:
        feedback_txt.append("Task 3: Visible test: -2 points from the following:")
        feedback_txt.extend(errors)
        raise AssertionError("\n".join(errors))
    elif all_tests_successful:
        print("All visible batch generation tests passed successfully!")

# Run the visible test
test_batch_generation_shape_no_overlap(encoded_chars)
### END VISIBLE TESTS


All visible batch generation tests passed successfully!


In [17]:
### BEGIN VISIBLE TESTS
def test_batch_generation_shape_overlap(encoded_chars):
    """
    Visible test case for verifying batch generation when sequences overlap (step_size != seq_length).
    Collects all assertion failures into feedback_txt.
    """
    errors = []
    all_tests_successful = True

    # ---- Run the student's batch generator ----
    try:
        x_batches, y_batches = get_batches(encoded_chars, batch_size=64, seq_length=100, step_size=50)
    except Exception as e:
        errors.append(f"Error: get_batches() raised an exception — {str(e)}")
        all_tests_successful = False
        x_batches, y_batches = None, None

    # ---- Test 1: Ensure same number of x/y batches ----
    if x_batches is not None and y_batches is not None:
        if len(x_batches) != len(y_batches):
            all_tests_successful = False
            errors.append(
                "Batch generation (overlap): The number of x_batches and y_batches should be the same."
            )

    # ---- Test 2: Validate shapes ----
    expected_shape = (42, 64, 100)
    if x_batches is not None:
        if x_batches.shape != expected_shape:
            all_tests_successful = False
            errors.append(
                f"Batch generation (overlap): Expected x_batches shape {expected_shape}, but got {x_batches.shape}."
            )
    if y_batches is not None:
        if y_batches.shape != expected_shape:
            all_tests_successful = False
            errors.append(
                f"Batch generation (overlap): Expected y_batches shape {expected_shape}, but got {y_batches.shape}."
            )

    # ---- Report results ----
    if errors:
        feedback_txt.append("Task 3: Visible test: -2 points from the following:")
        feedback_txt.extend(errors)
        raise AssertionError("\n".join(errors))
    elif all_tests_successful:
        print("All visible batch generation (overlap) tests passed successfully!")

# Run the visible test
test_batch_generation_shape_overlap(encoded_chars)
### END VISIBLE TESTS


All visible batch generation (overlap) tests passed successfully!


In [18]:
# Display the one-token shift between x and y and the effect of step_size
def display_batch_generation(arr, char_to_int, int_to_char):
    batch_size, seq_length, step_size = 8, 10, 5  # Setting step_size for overlap between sequences

    x_batches, y_batches = get_batches(arr, batch_size, seq_length, step_size)

    # Display every sequence of the batch at index 10

    print('='*50)
    print('Displaying a Single Batch')
    print('='*50)
    for i in range(batch_size):
        x_chars = ''.join([int_to_char[idx] for idx in x_batches[10][i]])
        y_chars = ''.join([int_to_char[idx] for idx in y_batches[10][i]])

        print(f"[{x_chars}]  -->  [{y_chars}]")
    print('='*50)
display_batch_generation(encoded_chars, char_to_int, int_to_char )

Displaying a Single Batch
[made her f]  -->  [ade her fe]
[her feel v]  -->  [er feel ve]
[eel very s]  -->  [el very sl]
[ery sleepy]  -->  [ry sleepy ]
[leepy and ]  -->  [eepy and s]
[ and stupi]  -->  [and stupid]
[stupid whe]  -->  [tupid whet]
[d whether ]  -->  [ whether t]


### 4. Define the Character-Level LSTM Model (6 points)
In this step, you will implement the CharLSTM class, which processes sequences of characters and predicts the next character in the sequence. The model will learn sequential patterns in the data and store information over time using hidden states.

##### Key Components:
##### 1. Single Multi-Layer LSTM (see [nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)):
- Use a single `nn.LSTM` module configured as a multi-layer LSTM by setting the `num_layers` parameter to specify the number of stacked layers within this LSTM.
- The `dropout` argument of `nn.LSTM` applies dropout to the outputs of every LSTM layer except the last one (e.g., between the 1st and 2nd layers if `num_layers=2`). It therefore does not affect the output of the final LSTM layer.

##### 2. Dropout Layer (see [nn.Dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)):
- Define an additional dropout layer to be applied after the final LSTM layer. This helps to prevent overfitting.

##### 3. Fully Connected Layer (see [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)):
- After the LSTM layers, a fully connected layer maps each hidden state to a vector of logits over the vocabulary. These logits define the probability distribution used to predict the next character in the sequence.

##### 4. Hidden State Initialization:
- The LSTM's hidden and cell states will be initialized with zero values before each batch is processed and will be updated as the model processes the sequence.
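The shapes involved in zero initialization can be checked with a standalone `nn.LSTM`. The sketch below is not the `CharLSTM` solution, and the sizes are arbitrary illustrative values (27 matches the vocabulary size printed earlier). Note that the hidden and cell states keep the shape `(num_layers, batch_size, hidden_dim)` even when `batch_first=True`.

```python
import torch
import torch.nn as nn

num_layers, batch_size, seq_length, vocab_size, hidden_dim = 2, 4, 10, 27, 16

lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_dim,
               num_layers=num_layers, batch_first=True)

# Create the zero states on the same device as the LSTM parameters
device = next(lstm.parameters()).device
h0 = torch.zeros(num_layers, batch_size, hidden_dim, device=device)
c0 = torch.zeros(num_layers, batch_size, hidden_dim, device=device)

# Dummy input with the batch dimension first
x = torch.zeros(batch_size, seq_length, vocab_size, device=device)
out, (h_n, c_n) = lstm(x, (h0, c0))

print(out.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)  # torch.Size([2, 4, 16])
print(c_n.shape)  # torch.Size([2, 4, 16])
```

Only the input and output tensors follow `batch_first`. The state tensors always put the layer dimension first.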

##### Steps to follow:
- You need to implement the following methods in the `CharLSTM` class:
   - `__init__()` to define the architecture.
   - `forward()` to handle the forward pass.
   - `init_hidden()` to initialize the hidden and cell states before each batch.
- The model architecture should follow this flow:
    - One `nn.LSTM`: Define a single multilayer LSTM with `num_layers=num_layers`, output dimensionality `hidden_dim`, and internal dropout of `dropout_prob` between stacked layers.
    - Dropout after LSTM: Apply a separate dropout layer `nn.Dropout` with the probability of `dropout_prob` to the LSTM’s output after all layers.
    - Fully Connected Layer: Define a fully connected layer to output logits for each character in the sequence.

**Important Notes**:
- Ensure that you use the provided parameters (e.g., num_layers, hidden_dim) when defining the model architecture. Avoid hardcoding values (like `num_layers=2`)
- One-hot encoding will be applied in the training loop before the data is passed to the model. This will adjust the shape of x for each batch to `(batch_size, seq_length, input_dim)`, where input_dim is the vocabulary size.  You do not need to handle this encoding within the CharLSTM class.
- We will use cross-entropy loss as the loss function. The cross-entropy loss combines the softmax operation and the negative log-likelihood loss in a single step. The loss function takes the raw outputs (logits) from the fully connected layer and internally converts them to a probability distribution. Therefore, you do not need to apply Softmax separately.
- Ensure the hidden states are initialized on the same device as the model parameters to avoid device mismatch errors.
- Make sure to configure the LSTM as one `nn.LSTM` module with `num_layers` rather than separate layers. This approach is critical for testing.
- Do not confuse stacked LSTM (configured with `num_layers`) with bidirectional LSTM.
- Carefully check LSTM parameters and make sure that input and output shapes are correct. Pay special attention to the difference between batched and unbatched input shapes.
- Consider how `batch_first` parameter of `nn.LSTM` aligns with the shape of your data after one-hot encoding and think about how the batch dimension should be treated within the LSTM.
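The notes on one-hot encoding and cross-entropy can be made concrete with a small sketch. The training loop is not shown in this section, so the one-hot step below is an assumption about how it might be done, and the random `logits` tensor is only a stand-in for the model output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, batch_size, seq_length = 27, 4, 10

# Integer-encoded inputs and targets, as produced by batch generation
x_idx = torch.randint(0, vocab_size, (batch_size, seq_length))
y_idx = torch.randint(0, vocab_size, (batch_size, seq_length))

# One-hot encoding adds the input_dim axis: (batch_size, seq_length, vocab_size)
x_onehot = F.one_hot(x_idx, num_classes=vocab_size).float()

# Stand-in for the raw logits returned by the fully connected layer
logits = torch.randn(batch_size, seq_length, vocab_size)

# CrossEntropyLoss expects (N, C) logits and (N,) class indices
criterion = nn.CrossEntropyLoss()
loss = criterion(logits.reshape(-1, vocab_size), y_idx.reshape(-1))

# Same value computed as log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits.reshape(-1, vocab_size), dim=1),
                    y_idx.reshape(-1))
print(x_onehot.shape, torch.allclose(loss, manual))
```

Because the loss applies log-softmax internally, adding a softmax layer to the model would normalize the outputs twice.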

In [19]:
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """
    Character-Level Multi-Layer LSTM Model

    This model processes sequences of characters and predicts the next character in the sequence.
    """

    def __init__(self, num_layers, input_dim, hidden_dim, output_dim, dropout_prob):
        """
        Initializes the CharLSTM model with the specified parameters.

        Args:
            num_layers (int): Number of LSTM layers
            input_dim (int): Dimensionality of the input (vocab size for one-hot)
            hidden_dim (int): Dimensionality of the LSTM hidden layer.
            output_dim (int): Dimensionality of the output (should equal vocab size)
            dropout_prob (float): Dropout probability
        """
        super(CharLSTM, self).__init__()

        # Save for init_hidden
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # Single multi-layer LSTM
        # batch_first=True → input shape: (batch_size, seq_length, input_dim)
        # dropout=dropout_prob → applied between internal LSTM layers (not after last)
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout_prob  # internal dropout between stacked layers
        )

        # Separate dropout applied AFTER the final LSTM layer output
        self.dropout = nn.Dropout(p=dropout_prob)

        # Final fully-connected layer: maps hidden state → vocab logits
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, hidden):
        """
        Forward pass.

        Args:
            x: (batch_size, seq_length, input_dim) — one-hot encoded
            hidden: tuple (h0, c0), each of shape (num_layers, batch_size, hidden_dim)

        Returns:
            out: (batch_size, seq_length, output_dim) — raw logits
            (h_n, c_n): updated hidden and cell states
        """
        # lstm_out: (batch_size, seq_length, hidden_dim)
        # hidden states: h_n, c_n → each (num_layers, batch_size, hidden_dim)
        lstm_out, (h_n, c_n) = self.lstm(x, hidden)

        # Apply dropout to the entire sequence output
        lstm_out = self.dropout(lstm_out)

        # Apply the linear layer to every time step
        # (nn.Linear acts on the last dimension, so no reshaping or .contiguous() is needed)
        out = self.fc(lstm_out)

        # out shape: (batch_size, seq_length, output_dim)
        return out, (h_n, c_n)

    def init_hidden(self, batch_size):
        """
        Initializes hidden and cell states to zeros.

        Args:
            batch_size (int): Current batch size

        Returns:
            (h0, c0): tuple of zeros tensors of shape (num_layers, batch_size, hidden_dim)
        """
        device = next(self.parameters()).device

        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device)

        return (h0, c0)

In [20]:
def test_dropout_effect():
    model = CharLSTM(num_layers=2, input_dim=10, hidden_dim=100, output_dim=40, dropout_prob=0.1)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    hidden = model.init_hidden(64)
    input_seq = torch.rand(64, 50, 10).to(device)

    all_tests_successful = True
    errors = []

    model.train()
    output1, _ = model(input_seq, hidden)
    output2, _ = model(input_seq, hidden)
    try:
        assert not torch.equal(output1, output2), 'Dropout has no effect in training mode.'
    except AssertionError as e:
        errors.append(str(e))
        all_tests_successful = False

    model.eval()
    output3, _ = model(input_seq, hidden)
    output4, _ = model(input_seq, hidden)
    try:
        assert torch.equal(output3, output4), 'Outputs should be consistent in evaluation mode.'
    except AssertionError as e:
        errors.append(str(e))
        all_tests_successful = False

    if all_tests_successful:
        print('Dropout test passed successfully!')
    else:
        feedback_txt.append('Task 4: Visible test: -1 points from the following:')
        feedback_txt.extend(errors)
        raise AssertionError('\n'.join(errors))

test_dropout_effect()

Dropout test passed successfully!


In [21]:
### BEGIN VISIBLE TESTS
def test_lstm_model():
    model = CharLSTM(num_layers=2, input_dim=10, hidden_dim=100, output_dim=40, dropout_prob=0.1)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    hidden = model.init_hidden(64)
    input_seq = torch.rand(64, 50, 10).to(device)
    all_tests_successful = True
    errors = []

    try:
        output, hidden = model(input_seq, hidden)

        try:
            assert output.shape == (64, 50, 40), f"Expected output shape (64, 50, 40), but got {output.shape}."
        except AssertionError as e:
            errors.append(str(e))
            all_tests_successful = False

        (h, c) = hidden
        try:
            assert h.shape == (2, 64, 100), f"Expected h shape (2, 64, 100), but got {h.shape}."
        except AssertionError as e:
            errors.append(str(e))
            all_tests_successful = False

        try:
            assert c.shape == (2, 64, 100), f"Expected c shape (2, 64, 100), but got {c.shape}."
        except AssertionError as e:
            errors.append(str(e))
            all_tests_successful = False

    except RuntimeError as e:
        errors.append(f"RuntimeError: Check if batch_first=True is set properly. Details: {e}")
        all_tests_successful = False

    if all_tests_successful:
        print("LSTM shape test passed successfully!")
    else:
        feedback_txt.append('Task 4: Visible test: -2 points from the following:')
        feedback_txt.extend(errors)
        raise AssertionError('\n'.join(errors))

test_lstm_model()
### END VISIBLE TESTS

LSTM shape test passed successfully!


#### 5. Train the Model (4 points)

In this task, you will implement the training loop for the CharLSTM model.

In the train() function you are given:
- Optimizer and loss function: The [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer and [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) are used. Do not change the optimizer and loss function.
- Preparing the Batches: The `get_batches()` function is used to divide the encoded text data into smaller input (x) and target (y) sequences.

##### Steps to Follow:
Outside of the iteration, initialize the hidden states using the `init_hidden()` method.

For each iteration:
##### 1. Encoding the input sequence:
- For each batch, convert the input to one-hot representations (see [F.one_hot](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html))
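
As a minimal sketch of this step, the snippet below one-hot encodes a toy batch. The vocabulary size and index values are made up for illustration. `F.one_hot` expects integer (`long`) indices and returns integers, so the result is cast to float before it can be fed to an LSTM. Passing `num_classes` explicitly matters because otherwise the size is inferred from the largest index in the batch.

```python
import torch
import torch.nn.functional as F

vocab_size = 5                               # hypothetical tiny vocabulary
x = torch.tensor([[0, 2, 4], [1, 1, 3]])     # (batch_size=2, seq_length=3), dtype int64

# (batch_size, seq_length) -> (batch_size, seq_length, vocab_size)
x_one_hot = F.one_hot(x, num_classes=vocab_size).float()

print(x_one_hot.shape)   # torch.Size([2, 3, 5])
print(x_one_hot[0, 1])   # tensor([0., 0., 1., 0., 0.])
```
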

##### 2. Detach Hidden States:
- Detach the hidden states after each batch so that gradients are not backpropagated through previous batches. This keeps training efficient.
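
The effect of detaching can be seen in a small sketch with a toy LSTM (the sizes are arbitrary and unrelated to the assignment). The state returned by the LSTM is still attached to the graph of the batch that produced it. After `detach()`, the values are kept but the history is dropped, so the backward pass of the next batch stops at that state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)   # toy sizes
hidden = (torch.zeros(1, 2, 4), torch.zeros(1, 2, 4))

# First batch
_, hidden = lstm(torch.rand(2, 5, 3), hidden)
attached = hidden[0].requires_grad      # h_n is still part of this batch's graph

# Keep the values, drop the history
hidden = tuple(h.detach() for h in hidden)
detached = hidden[0].requires_grad

# Second batch: the backward pass stops at the detached state
out, hidden = lstm(torch.rand(2, 5, 3), hidden)
out.sum().backward()

print(attached, detached)   # True False
```
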

##### 3. Forward Pass and Loss Calculation:
- For each batch, perform forward pass by passing the input x to the model.
- The model outputs logits (unnormalized scores for the next character at each position). The cross-entropy loss converts them into a probability distribution internally.
- Use cross-entropy loss to compare the predicted output to the target y and calculate the error for the current batch.

##### 4. Backpropagation and Parameter Update:
- After calculating the loss, compute the gradients with a backward pass.
- Update the model parameters using the optimizer.


**Hints**:
- Do not forget to zero the gradients.
- Ensure that the logits and target tensors are reshaped appropriately to match the expected size for the loss function specified  [here](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). \
Logits should have shape (batch_size * seq_length, vocab_size) and targets should have shape (batch_size * seq_length).
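
The reshaping in the hint can be sketched as follows, using random tensors in place of real model output (the sizes are illustrative). The same snippet also checks, on these toy tensors, that `nn.CrossEntropyLoss` gives the same value as log-softmax followed by the negative log-likelihood loss, which is why no separate softmax is needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_length, vocab_size = 2, 3, 5     # illustrative sizes

logits = torch.randn(batch_size, seq_length, vocab_size)           # stand-in for model output
targets = torch.randint(0, vocab_size, (batch_size, seq_length))   # stand-in for y

flat_logits = logits.reshape(-1, vocab_size)   # (batch_size * seq_length, vocab_size)
flat_targets = targets.reshape(-1)             # (batch_size * seq_length,)

loss = nn.CrossEntropyLoss()(flat_logits, flat_targets)

# Same value computed in two steps: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(flat_logits, dim=-1), flat_targets)

print(flat_logits.shape, flat_targets.shape)
print(torch.allclose(loss, manual))   # True
```
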

In [30]:
from unittest.mock import patch, MagicMock
from functools import partialmethod

def train(model, encoded_chars, vocab_size, num_epochs, batch_size,
          seq_length, step_size, learning_rate, save_path=None, verbose=True):
    """
    Train the CharLSTM model on encoded text data.
    """
    model.train()  # Set model to training mode
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    # Prepare batches once (outside the epoch loop)
    x_batches, y_batches = get_batches(encoded_chars, batch_size, seq_length, step_size)
    num_batches = len(x_batches)

    for epoch in range(num_epochs):
        total_loss = 0.0

        # Initialize hidden state at the start of each epoch
        hidden = model.init_hidden(batch_size)

        # Progress bar
        batch_loader = tqdm(zip(x_batches, y_batches), total=num_batches,
                            leave=True, desc=f'Epoch {epoch+1}/{num_epochs}')

        for x_batch, y_batch in batch_loader:
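            # Move data to device as long tensors (required by one_hot).
            # as_tensor accepts both NumPy arrays and tensors without a copy-construct warning.
            x = torch.as_tensor(x_batch, dtype=torch.long, device=device)
            y = torch.as_tensor(y_batch, dtype=torch.long, device=device)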

            # One-hot encode the input: (batch_size, seq_length) → (batch_size, seq_length, vocab_size)
            x_one_hot = F.one_hot(x, num_classes=vocab_size).float()  # shape: (B, L, V)

            # Zero gradients
            optimizer.zero_grad()

            # Forward pass: get predictions and new hidden state
            logits, hidden = model(x_one_hot, hidden)

            # Detach hidden states to prevent backprop through entire history
            hidden = (hidden[0].detach(), hidden[1].detach())

            # Reshape for CrossEntropyLoss: (B, L, V) → (B * L, V)
            logits = logits.view(-1, vocab_size)   # (batch_size * seq_length, vocab_size)
            targets = y.view(-1)                   # (batch_size * seq_length)

            # Compute loss
            loss = criterion(logits, targets)

            # Backward pass and optimization step
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

            # Update progress bar
            batch_loader.set_postfix(loss=loss.item())

        avg_loss = total_loss / num_batches
        if verbose:
            print(f'Epoch {epoch + 1}/{num_epochs}, Average Loss: {avg_loss:.4f}')

        # Save model after each epoch
        if save_path:
            torch.save(model.state_dict(), save_path)
            print(f'Model saved to {save_path} after epoch {epoch+1}')

    return avg_loss

In [31]:
def test_model_forward_called():

    vocab_size, hidden_dim, dropout_prob = 50, 12, 0.2
    batch_size, seq_length, num_epochs = 2, 3, 1
    test_chars = np.arange(vocab_size)
    model = CharLSTM(2, vocab_size, hidden_dim, vocab_size, dropout_prob=dropout_prob)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)


    with patch.object(model, 'forward', wraps=model.forward) as mock_forward, \
         patch('torch.optim.Adam'), \
         patch('torch.nn.CrossEntropyLoss'), \
         patch('tqdm.tqdm.__init__', partialmethod(tqdm.__init__, disable=True)):

        try:
            # Run the train function
            train(model, test_chars, vocab_size, num_epochs=num_epochs, batch_size=batch_size, seq_length=seq_length, step_size=seq_length, learning_rate=0.001, verbose=False)
        except Exception as e:
            feedback_txt.append(f"Runtime: Exception occurred during training - {str(e)}")

        # Check if forward() was called
        if not mock_forward.called:
            feedback_txt.append("Task 5: Visible test: [model_forward_called] Training check: Expected model.forward() to be called at least once, but it was not.")
        else:
            print("Test passed: model.forward() was called successfully during training!")


test_model_forward_called()


Test passed: model.forward() was called successfully during training!


In [32]:
def test_input_shape():

    vocab_size, hidden_dim, dropout_prob = 50, 12, 0.2
    batch_size, seq_length, num_epochs = 2, 3, 1
    test_chars = np.arange(vocab_size)
    model = CharLSTM(2, vocab_size, hidden_dim, vocab_size, dropout_prob=0.1)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    errors = []

    def forward_spy(x, hidden):
        # Check input shape correctness
        if x.shape != (batch_size, seq_length, vocab_size):
            errors.append(f"Input shape mismatch: Expected {(batch_size, seq_length, vocab_size)}, but got {x.shape}")
        return model.__class__.forward(model, x, hidden)

    with patch.object(model, 'forward', wraps=forward_spy), \
         patch('torch.optim.Adam'), \
         patch('torch.nn.CrossEntropyLoss'), \
         patch('tqdm.tqdm.__init__', partialmethod(tqdm.__init__, disable=True)):

        try:
            train(model, test_chars, vocab_size, num_epochs=num_epochs, batch_size=batch_size, seq_length=seq_length, step_size=seq_length, learning_rate=0.001, verbose=False)
        except Exception as e:
            errors.append(f"Task 5: Visible test: Runtime: Exception occurred during training - {str(e)}")

    if errors:
        feedback_txt.extend(errors)
        raise AssertionError(" Errors found in input shape check.")

    print("Test passed: Input tensor x has correct shape during training!")

test_input_shape()

Test passed: Input tensor x has correct shape during training!


In [33]:
def test_hidden_state_requires_grad():

    vocab_size, hidden_dim, dropout_prob = 50, 12, 0.2
    batch_size, seq_length, num_epochs = 2, 3, 1
    test_chars = np.arange(vocab_size)
    model = CharLSTM(2, vocab_size, hidden_dim, vocab_size, dropout_prob=dropout_prob)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    errors = []

    def forward_spy(x, hidden):
        h, c = hidden
        if h.requires_grad:
            errors.append("Hidden state h should be detached (requires_grad=False).")
        if c.requires_grad:
            errors.append("Hidden state c should be detached (requires_grad=False).")
        return model.__class__.forward(model, x, hidden)

    with patch.object(model, 'forward', wraps=forward_spy), \
         patch('torch.optim.Adam'), \
         patch('torch.nn.CrossEntropyLoss'), \
         patch('tqdm.tqdm.__init__', partialmethod(tqdm.__init__, disable=True)):

        try:
            train(model, test_chars, vocab_size, num_epochs=num_epochs, batch_size=batch_size, seq_length=seq_length, step_size=seq_length, learning_rate=0.001, verbose=False)
        except Exception as e:
            errors.append(f" Task 5: Visible test: Runtime: Exception occurred during training - {str(e)}")

    if errors:
        feedback_txt.extend(errors)
        raise AssertionError("Hidden state detachment errors detected.")

    print("Test passed: Hidden states are properly detached from computation graph!")

test_hidden_state_requires_grad()

Test passed: Hidden states are properly detached from computation graph!


In [34]:
def test_criterion_argument_shapes():

    vocab_size, hidden_dim, dropout_prob = 50, 12, 0.2
    batch_size, seq_length, num_epochs = 2, 3, 1
    test_chars = np.arange(vocab_size)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    mock_model = CharLSTM(2, vocab_size, hidden_dim, vocab_size, dropout_prob=0.1).to(device)
    errors = []

    with patch("tqdm.tqdm.__init__", partialmethod(tqdm.__init__, disable=True)), \
         patch("torch.optim.Adam"):

        def criterion_side_effect(output, y):
            if output.shape != (batch_size * seq_length, vocab_size):
                errors.append(f"Criterion argument mismatch: Expected model output shape ({batch_size * seq_length}, {vocab_size}), but got {output.shape}.")
            if y.shape != (batch_size * seq_length,):
                errors.append(f"Criterion argument mismatch: Expected target (y) shape ({batch_size * seq_length},), but got {y.shape}.")
            return torch.tensor(0.0, requires_grad=True, device=device)

        with patch("torch.nn.CrossEntropyLoss", return_value=criterion_side_effect), \
             patch("__main__.get_batches", return_value=(
                 torch.randint(0, 50, (11, 2, 3), device=device),
                 torch.randint(0, 50, (11, 2, 3), device=device)
             )):

            try:
                train(mock_model, test_chars, vocab_size, num_epochs=num_epochs, batch_size=batch_size, seq_length=seq_length, step_size=seq_length, learning_rate=0.001, verbose=False)
            except Exception as e:
                errors.append(f" Task 5: Visible test: Runtime: Exception occurred during training - {str(e)}")

    if errors:
        feedback_txt.extend(errors)
        raise AssertionError("Criterion argument shape errors detected.")

    print("Test passed: Criterion arguments have expected shapes!")

test_criterion_argument_shapes()

Test passed: Criterion arguments have expected shapes!




#### Model Initialization

We suggest using two LSTM layers (num_layers=2) with a hidden dimension of 400 and dropout of 0.1. In the following cells, you will find the recommended setup for model initialization and training. Feel free to experiment with different parameters. However, before submission, ensure that the parameters you set below match those of your trained model. This is essential for your code to run correctly in inference mode. Aim for a training loss of less than 0.99.

In [35]:
hidden_dim = 400
dropout_prob=0.1
num_layers=2
vocab_size = len(char_to_int)
model = CharLSTM(num_layers, vocab_size, hidden_dim, vocab_size, dropout_prob)
model = model.to(device)
print(model)

CharLSTM(
  (lstm): LSTM(27, 400, num_layers=2, batch_first=True, dropout=0.1)
  (dropout): Dropout(p=0.1, inplace=False)
  (fc): Linear(in_features=400, out_features=27, bias=True)
)


In [36]:
num_epochs = 50 # Train for *at least* 50 epochs to meet the min loss of 0.99
batch_size = 50
seq_length=100
step_size=100
learning_rate=0.001


In [37]:
if not skip_training:
    loss = train(
        model=model,
        encoded_chars=encoded_chars,
        vocab_size=vocab_size,
        num_epochs=num_epochs,
        batch_size=batch_size,
        seq_length=seq_length,
        step_size=step_size,
        learning_rate=learning_rate,
        save_path='best_model.pth'
    )
else:
    model.load_state_dict(torch.load('best_model.pth', weights_only=False, map_location=device))
    print('Loaded weights from your saved model successfully!')

Epoch 1/50: 100%|██████████| 27/27 [00:01<00:00, 22.38it/s, loss=2.81]


Epoch 1/50, Average Loss: 2.8956
Model saved to best_model.pth after epoch 1


Epoch 2/50: 100%|██████████| 27/27 [00:00<00:00, 33.38it/s, loss=2.79]


Epoch 2/50, Average Loss: 2.8011
Model saved to best_model.pth after epoch 2


Epoch 3/50: 100%|██████████| 27/27 [00:00<00:00, 33.33it/s, loss=2.6]


Epoch 3/50, Average Loss: 2.7217
Model saved to best_model.pth after epoch 3


Epoch 4/50: 100%|██████████| 27/27 [00:00<00:00, 33.11it/s, loss=2.31]


Epoch 4/50, Average Loss: 2.4561
Model saved to best_model.pth after epoch 4


Epoch 5/50: 100%|██████████| 27/27 [00:00<00:00, 33.14it/s, loss=2.18]


Epoch 5/50, Average Loss: 2.2760
Model saved to best_model.pth after epoch 5


Epoch 6/50: 100%|██████████| 27/27 [00:00<00:00, 33.27it/s, loss=2.09]


Epoch 6/50, Average Loss: 2.1722
Model saved to best_model.pth after epoch 6


Epoch 7/50: 100%|██████████| 27/27 [00:00<00:00, 32.69it/s, loss=2]


Epoch 7/50, Average Loss: 2.0857
Model saved to best_model.pth after epoch 7


Epoch 8/50: 100%|██████████| 27/27 [00:00<00:00, 32.83it/s, loss=1.94]


Epoch 8/50, Average Loss: 2.0112
Model saved to best_model.pth after epoch 8


Epoch 9/50: 100%|██████████| 27/27 [00:00<00:00, 32.64it/s, loss=1.89]


Epoch 9/50, Average Loss: 1.9445
Model saved to best_model.pth after epoch 9


Epoch 10/50: 100%|██████████| 27/27 [00:00<00:00, 32.48it/s, loss=1.83]


Epoch 10/50, Average Loss: 1.8832
Model saved to best_model.pth after epoch 10


Epoch 11/50: 100%|██████████| 27/27 [00:00<00:00, 32.68it/s, loss=1.78]


Epoch 11/50, Average Loss: 1.8253
Model saved to best_model.pth after epoch 11


Epoch 12/50: 100%|██████████| 27/27 [00:00<00:00, 32.12it/s, loss=1.73]


Epoch 12/50, Average Loss: 1.7718
Model saved to best_model.pth after epoch 12


Epoch 13/50: 100%|██████████| 27/27 [00:00<00:00, 31.89it/s, loss=1.68]


Epoch 13/50, Average Loss: 1.7201
Model saved to best_model.pth after epoch 13


Epoch 14/50: 100%|██████████| 27/27 [00:00<00:00, 32.15it/s, loss=1.63]


Epoch 14/50, Average Loss: 1.6714
Model saved to best_model.pth after epoch 14


Epoch 15/50: 100%|██████████| 27/27 [00:00<00:00, 31.89it/s, loss=1.61]


Epoch 15/50, Average Loss: 1.6289
Model saved to best_model.pth after epoch 15


Epoch 16/50: 100%|██████████| 27/27 [00:00<00:00, 31.44it/s, loss=1.55]


Epoch 16/50, Average Loss: 1.5860
Model saved to best_model.pth after epoch 16


Epoch 17/50: 100%|██████████| 27/27 [00:00<00:00, 31.78it/s, loss=1.53]


Epoch 17/50, Average Loss: 1.5452
Model saved to best_model.pth after epoch 17


Epoch 18/50: 100%|██████████| 27/27 [00:00<00:00, 31.99it/s, loss=1.49]


Epoch 18/50, Average Loss: 1.5088
Model saved to best_model.pth after epoch 18


Epoch 19/50: 100%|██████████| 27/27 [00:00<00:00, 31.28it/s, loss=1.46]


Epoch 19/50, Average Loss: 1.4765
Model saved to best_model.pth after epoch 19


Epoch 20/50: 100%|██████████| 27/27 [00:00<00:00, 31.19it/s, loss=1.44]


Epoch 20/50, Average Loss: 1.4453
Model saved to best_model.pth after epoch 20


Epoch 21/50: 100%|██████████| 27/27 [00:00<00:00, 31.21it/s, loss=1.41]


Epoch 21/50, Average Loss: 1.4125
Model saved to best_model.pth after epoch 21


Epoch 22/50: 100%|██████████| 27/27 [00:00<00:00, 30.96it/s, loss=1.39]


Epoch 22/50, Average Loss: 1.3845
Model saved to best_model.pth after epoch 22


Epoch 23/50: 100%|██████████| 27/27 [00:00<00:00, 31.40it/s, loss=1.37]


Epoch 23/50, Average Loss: 1.3582
Model saved to best_model.pth after epoch 23


Epoch 24/50: 100%|██████████| 27/27 [00:00<00:00, 30.68it/s, loss=1.34]


Epoch 24/50, Average Loss: 1.3350
Model saved to best_model.pth after epoch 24


Epoch 25/50: 100%|██████████| 27/27 [00:00<00:00, 30.75it/s, loss=1.32]


Epoch 25/50, Average Loss: 1.3099
Model saved to best_model.pth after epoch 25


Epoch 26/50: 100%|██████████| 27/27 [00:00<00:00, 30.28it/s, loss=1.3]


Epoch 26/50, Average Loss: 1.2867
Model saved to best_model.pth after epoch 26


Epoch 27/50: 100%|██████████| 27/27 [00:00<00:00, 30.80it/s, loss=1.28]


Epoch 27/50, Average Loss: 1.2652
Model saved to best_model.pth after epoch 27


Epoch 28/50: 100%|██████████| 27/27 [00:00<00:00, 30.52it/s, loss=1.26]


Epoch 28/50, Average Loss: 1.2426
Model saved to best_model.pth after epoch 28


Epoch 29/50: 100%|██████████| 27/27 [00:00<00:00, 29.90it/s, loss=1.24]


Epoch 29/50, Average Loss: 1.2185
Model saved to best_model.pth after epoch 29


Epoch 30/50: 100%|██████████| 27/27 [00:00<00:00, 30.49it/s, loss=1.22]


Epoch 30/50, Average Loss: 1.1998
Model saved to best_model.pth after epoch 30


Epoch 31/50: 100%|██████████| 27/27 [00:00<00:00, 30.22it/s, loss=1.19]


Epoch 31/50, Average Loss: 1.1785
Model saved to best_model.pth after epoch 31


Epoch 32/50: 100%|██████████| 27/27 [00:00<00:00, 30.53it/s, loss=1.17]


Epoch 32/50, Average Loss: 1.1569
Model saved to best_model.pth after epoch 32


Epoch 33/50: 100%|██████████| 27/27 [00:00<00:00, 30.18it/s, loss=1.15]


Epoch 33/50, Average Loss: 1.1368
Model saved to best_model.pth after epoch 33


Epoch 34/50: 100%|██████████| 27/27 [00:00<00:00, 30.21it/s, loss=1.12]


Epoch 34/50, Average Loss: 1.1151
Model saved to best_model.pth after epoch 34


Epoch 35/50: 100%|██████████| 27/27 [00:00<00:00, 30.29it/s, loss=1.1]


Epoch 35/50, Average Loss: 1.0930
Model saved to best_model.pth after epoch 35


Epoch 36/50: 100%|██████████| 27/27 [00:00<00:00, 30.98it/s, loss=1.08]


Epoch 36/50, Average Loss: 1.0692
Model saved to best_model.pth after epoch 36


Epoch 37/50: 100%|██████████| 27/27 [00:00<00:00, 30.78it/s, loss=1.07]


Epoch 37/50, Average Loss: 1.0496
Model saved to best_model.pth after epoch 37


Epoch 38/50: 100%|██████████| 27/27 [00:00<00:00, 30.24it/s, loss=1.06]


Epoch 38/50, Average Loss: 1.0279
Model saved to best_model.pth after epoch 38


Epoch 39/50: 100%|██████████| 27/27 [00:00<00:00, 30.73it/s, loss=1.02]


Epoch 39/50, Average Loss: 1.0071
Model saved to best_model.pth after epoch 39


Epoch 40/50: 100%|██████████| 27/27 [00:00<00:00, 30.40it/s, loss=1.01]


Epoch 40/50, Average Loss: 0.9888
Model saved to best_model.pth after epoch 40


Epoch 41/50: 100%|██████████| 27/27 [00:00<00:00, 31.05it/s, loss=0.987]


Epoch 41/50, Average Loss: 0.9680
Model saved to best_model.pth after epoch 41


Epoch 42/50: 100%|██████████| 27/27 [00:00<00:00, 30.93it/s, loss=0.977]


Epoch 42/50, Average Loss: 0.9495
Model saved to best_model.pth after epoch 42


Epoch 43/50: 100%|██████████| 27/27 [00:00<00:00, 30.93it/s, loss=0.959]


Epoch 43/50, Average Loss: 0.9331
Model saved to best_model.pth after epoch 43


Epoch 44/50: 100%|██████████| 27/27 [00:00<00:00, 31.69it/s, loss=0.941]


Epoch 44/50, Average Loss: 0.9135
Model saved to best_model.pth after epoch 44


Epoch 45/50: 100%|██████████| 27/27 [00:00<00:00, 31.02it/s, loss=0.902]


Epoch 45/50, Average Loss: 0.8925
Model saved to best_model.pth after epoch 45


Epoch 46/50: 100%|██████████| 27/27 [00:00<00:00, 31.91it/s, loss=0.888]


Epoch 46/50, Average Loss: 0.8700
Model saved to best_model.pth after epoch 46


Epoch 47/50: 100%|██████████| 27/27 [00:00<00:00, 31.71it/s, loss=0.865]


Epoch 47/50, Average Loss: 0.8512
Model saved to best_model.pth after epoch 47


Epoch 48/50: 100%|██████████| 27/27 [00:00<00:00, 31.56it/s, loss=0.833]


Epoch 48/50, Average Loss: 0.8274
Model saved to best_model.pth after epoch 48


Epoch 49/50: 100%|██████████| 27/27 [00:00<00:00, 31.96it/s, loss=0.814]


Epoch 49/50, Average Loss: 0.8023
Model saved to best_model.pth after epoch 49


Epoch 50/50: 100%|██████████| 27/27 [00:00<00:00, 32.34it/s, loss=0.773]


Epoch 50/50, Average Loss: 0.7786
Model saved to best_model.pth after epoch 50


In [None]:
# Do not delete this cell

#### 6. Text Generation (2 points)
The `generate_text()` function will use a trained model to create new text based on a given starting string.

In this process, the model becomes autoregressive by using its own predictions as inputs for the next steps. Starting with an initial string, the model produces characters one by one, feeding each newly generated character back into itself as input.


1. Input start string:
The function begins with a starting string, which is converted into a sequence of integers using the character-to-integer mapping.
2. Generate text:
The model will take this sequence as input, predict the next character, and add it to the text. This process is repeated for a specified number of characters `predict_len`.
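3. Output probabilities:
The logits of the last time step are converted into a probability distribution with softmax, and the next character is sampled from that distribution.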

There are a few steps in the function for you to implement.

##### Steps to follow:
1. One-Hot encoding: Apply one-hot encoding to the input sequence as you did in the training loop.

2. Forward pass: Feed the one-hot encoded input through the model to obtain the output logits and the updated hidden state.

3. Extract the last output: You only need the output from the **last time step** to predict the next character. Slice the output and get the last element along the sequence dimension. If you did the training well, the generated text should mostly include meaningful words.

4. Temperature scaling: To control the randomness of the predictions, divide the output logits by the temperature parameter before applying softmax. The temperature should be in the range (0, 1]. Higher temperatures produce more random text, while lower temperatures produce more predictable results. You can observe the variations in the generated text by experimenting with different temperature values.

You can try generating text after training for just one epoch to observe the model's initial behavior. Depending on your temperature setting, the generated text might be a repetition of a single character or random sequences.
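
To make the effect of the temperature concrete, here is a small sketch with made-up logits for a three-character vocabulary. Dividing by a smaller temperature sharpens the distribution, so the most likely character is picked more often. Sampling with `torch.multinomial` is one option assumed here for illustration. Any sampler that draws from the softmax distribution serves the same purpose.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])   # hypothetical logits for a 3-character vocabulary

max_probs = []
for temperature in (1.0, 0.5, 0.1):
    probs = F.softmax(logits / temperature, dim=-1)
    max_probs.append(probs.max().item())
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Sample the next character index from the scaled distribution.
probs = F.softmax(logits / 0.5, dim=-1)
next_index = torch.multinomial(probs, num_samples=1).item()
print(next_index)
```

With these logits, the probability of the top character grows as the temperature drops, which is consistent with very low temperatures producing repetitive text.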

In [38]:
def generate_text(model, start_str, char_to_int, int_to_char, vocab_size, predict_len=100, temperature=1.0):
    """
    Generate text using the trained model.
    """
    model.eval()  # Set model to evaluation mode

    # Encode the starting string
    input_seq = [char_to_int[char] for char in start_str]
    input_seq = torch.tensor(input_seq, dtype=torch.long).to(device).unsqueeze(0)  # (1, len)

    # Initialize hidden state for batch_size=1
    hidden = model.init_hidden(1)

    generated_text = start_str

    with torch.no_grad():
        # The first input is the whole start string. Afterwards only the newly
        # sampled character is fed, because `hidden` already carries the context.
        # (Feeding the last start character again after a warm-up pass over the
        # full string would make the LSTM consume that character twice.)
        current_input = input_seq  # (1, len(start_str))

        for _ in range(predict_len):
            # One-hot encode the current input
            x_one_hot = F.one_hot(current_input, num_classes=vocab_size).float()  # (1, T, V)

            # Forward pass: get logits for next character
            logits, hidden = model(x_one_hot, hidden)

            # Extract logits from the last (only) time step
            logits = logits[:, -1, :]  # (1, vocab_size)

            # Apply temperature scaling
            logits = logits / temperature

            # Convert to probabilities
            probabilities = F.softmax(logits, dim=-1)  # (1, vocab_size)
            probabilities = probabilities.squeeze(0).cpu().numpy()  # (vocab_size,)

            # Sample next character index
            next_char_index = np.random.choice(vocab_size, p=probabilities)

            # Decode and append
            next_char = int_to_char[next_char_index]
            generated_text += next_char

            # Prepare next input: feed the predicted character back
            current_input = torch.tensor([[next_char_index]], dtype=torch.long).to(device)

    return generated_text

In [39]:
start_str = 'we re all '
predict_len = 1000
temperature = 0.5
generated_text = generate_text(model,
                               start_str,
                               char_to_int,
                               int_to_char,
                               vocab_size,
                               predict_len=predict_len,
                               temperature=temperature)
print(generated_text)

we re all and the march hare said to herself alice replied said the king the mock turtle s remark the three gardeners only and round the court with a sme thought and i m sure i mand and she had never eat of this side of the beginn with a cat latten the duchess and the same thing and the knave was thinking over the duchess s got to the queen was some of the e e evening beautiful beautiful soup beautiful beautiful soup chapter with a sigh she had poor little thing bottle stapt she did not like that word but she was too much for its meant with the time the next witness was the march hare things was the fan and gloves the flower wood the queen she went on said alice why the arches close behand it said alice i don t know what the rest of the hatter went on in a mouse that s the mock turtle s beginning as all the window as she could the gryphon the words down with said the gryphon it s all the lobsters and the mock turtle s confusion which she said the gryphon what alice did not like the dor

In [40]:
# Do not delete this cell

In [41]:
# Do not delete this cell


In [42]:
# Do not delete this cell

In [43]:
# Do not delete this cell

**Closing remarks**:

1. Consider experimenting with more complex architectures by adding additional LSTM layers or increasing the hidden dimension size. Keep in mind that even with GPU resources this can take a while.

2. In this task, we used one-hot encoding to represent inputs. However, you can experiment with the `nn.Embedding` module in PyTorch, which creates better representations for input characters and may improve model performance.

3. For more complex models, you do not need to remove special characters, like punctuation and new lines, during preprocessing. Keeping these characters is helpful especially if you want to generate text in different styles, such as Shakespearean sonnets, where line breaks and punctuation are important for preserving the text style.
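
Regarding the second remark, the following sketch shows how `nn.Embedding` relates to one-hot encoding. The embedding dimension of 8 is an arbitrary choice for illustration. The vocabulary size of 27 matches the model printout earlier in this notebook. An embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, but it skips building the one-hot tensor and lets the model learn a dense representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 27, 8     # embed_dim is a hypothetical choice

embedding = nn.Embedding(vocab_size, embed_dim)
x = torch.randint(0, vocab_size, (4, 10))    # (batch_size, seq_length) integer indices

embedded = embedding(x)                      # (batch_size, seq_length, embed_dim)

# Equivalent computation via one-hot vectors and a matrix product.
one_hot = F.one_hot(x, num_classes=vocab_size).float()
same = torch.allclose(embedded, one_hot @ embedding.weight)

print(embedded.shape, same)
```

If you try this, the LSTM's `input_size` would be `embed_dim` instead of the vocabulary size, and the model would receive integer indices rather than one-hot tensors.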
