## Task 1
In this notebook, we will build a dense neural network to perform the task of adding two numbers. While in professor's notebook, recurrent neural networks (RNNs) were used, here we will try to achieve the same result using a dense (fully connected) neural network. The goal is to train this dense network using a dataset of randomly generated addition problems and then evaluate its performance on unseen examples.

1. **Generate a dataset of random two-number addition problems (up to 3 digits).**
2. **`Preprocess the data`: Encode both the input (addition problem) and the output (result) into a format the neural network can understand.**
3. **Define a `dense neural network` using `TensorFlow/Keras`.**
4. **`Train` the model on the `generated dataset`.**
5. **`Evaluate` the model's performance on `validation data`.**
6. **`Test` the model with new, `unseen numbers` to check its ability to perform addition.**

#### Import necessary libraries

In [1]:
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Reshape, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

### Step 1: Data Generation

First, we generate a dataset of random addition problems. Each problem consists of two numbers (with up to 3 digits) and the result of their addition. We will generate 50,000 such problems.

In [2]:
# Helper class to encode/decode the data
class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one-hot integer representation
    + Decode the one-hot or integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars):
        """Initialize character table.

        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One-hot encode given string C.

        # Arguments
            C: string, to be encoded.
            num_rows: Number of rows in the returned one-hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        """Decode the given vector or 2D array to their character output.

        # Arguments
            x: A vector or a 2D array of probabilities or one-hot representations;
                or a vector of character indices (used with `calc_argmax=False`).
            calc_argmax: Whether to find the character index with maximum
                probability, defaults to `True`.
        """
        if calc_argmax:
            # This will convert each one-hot encoded vector into an index
            x = np.argmax(x, axis=-1)  
        # Ensure that we iterate over each index and map it to the corresponding character
        return ''.join(self.indices_char[int(i)] for i in x)

The `CharacterTable` class provides essential functionality to handle character-level encoding and decoding, crucial for converting addition problems and their results into a format usable by the neural network.

- `Character Mapping`:
  - The class creates mappings between characters (digits, '+', and space for padding) and their respective indices using two dictionaries:
    - char_indices: Maps each character (e.g., '1', '2', '+') to a unique index (e.g., '1' → 0, '2' → 1).
    - indices_char: Maps each index back to its corresponding character (e.g., 0 → '1', 1 → '2').
- `encode method`:
    - Takes a string (e.g., "12+34") and converts it into a one-hot encoded matrix. Each row of the matrix corresponds to a character, and the columns represent whether that character is present using a binary vector (1 or 0).
    - The num_rows parameter defines the number of rows in the one-hot encoding, ensuring that all encoded strings have the same number of rows.
- `decode method`:
    - Converts the one-hot encoded vectors or probability distributions back into their corresponding characters.
    - If calc_argmax=True, it finds the character with the maximum probability by applying np.argmax to get the most likely index for each position, then maps those indices back to their corresponding characters.
    - If calc_argmax=False, it directly maps indices without applying the argmax function, useful when working with already-decoded indices.

In [3]:
# Parameters for the model and dataset.
TRAINING_SIZE = 50000
DIGITS = 3
REVERSE = False

# Maximum length of input is 'int + int' (e.g., '34+78  '). Maximum length of
# int is DIGITS.
MAXLEN = DIGITS + 1 + DIGITS

# All the numbers, plus sign and space for padding.
chars = '0123456789+ '
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()
print('Generating data...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789'))
                    for i in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()
    
    # Skip any addition questions we've already seen
    # Also skip any such that x+Y == Y+x (hence the sorting).
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    
    # Pad the data with spaces such that it is always MAXLEN.
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    
    # Answers can be of maximum size DIGITS + 1.
    ans += ' ' * (DIGITS + 1 - len(ans))
    if REVERSE:
        # Reverse the query, e.g., '12+345  ' becomes '  543+21'. (Note the
        # space used for padding.)
        query = query[::-1]
    
    # store
    questions.append(query)
    expected.append(ans)
    print(query, ans)
    
print('Total addition questions:', len(questions))

Generating data...
189+6   195 
837+1   838 
60+98   158 
493+711 1204
76+6    82  
539+2   541 
882+759 1641
5+4     9   
29+9    38  
6+738   744 
5+5     10  
460+75  535 
808+5   813 
0+9     9   
4+6     10  
9+9     18  
4+84    88  
984+854 1838
0+1     1   
7+52    59  
6+0     6   
63+8    71  
469+286 755 
62+60   122 
653+5   658 
31+41   72  
539+1   540 
62+188  250 
1+3     4   
3+97    100 
710+3   713 
45+451  496 
458+23  481 
45+2    47  
584+7   591 
913+5   918 
734+791 1525
4+4     8   
30+7    37  
67+9    76  
8+657   665 
329+95  424 
1+6     7   
7+5     12  
29+32   61  
77+9    86  
1+1     2   
5+2     7   
888+5   893 
370+14  384 
4+81    85  
96+48   144 
88+839  927 
14+1    15  
6+195   201 
7+635   642 
42+7    49  
495+99  594 
1+9     10  
202+805 1007
574+95  669 
939+2   941 
819+387 1206
8+978   986 
644+242 886 
796+9   805 
38+1    39  
4+3     7   
2+23    25  
813+19  832 
3+13    16  
436+392 828 
652+433 1085
24+1    25  
22+8    30  
68+60 

- `Random Number Generation`:
    - We randomly generate two numbers, a and b, each with up to 3 digits.
    - To ensure variety in the dataset, the function generates numbers with a random number of digits between 1 and DIGITS.
- `Avoiding Duplicates`:
    - We maintain a set (seen) to track previously generated problems. This ensures that each problem in the dataset is unique.
    - We exclude commutative problems such as "3 + 4" and "4 + 3" by sorting the numbers before adding them to the seen set.
- `Padding`:
    - Padding is applied to both the input (problem) and output (result) to ensure that they have consistent lengths. This helps facilitate easy batching during training.
    - For example, "12+34" becomes "12+34 ", and the result "46" becomes "046".
- `Reversing Input`:
    - If REVERSE is set to True, the input query will be reversed (e.g., "12+34" becomes "43+21"). This can sometimes help the model learn better by providing diverse patterns, though it is not used here.

In [4]:
questions[0]

'189+6  '

In [5]:
expected[0]

'195 '

### Step 2: Data Preprocessing

Now that we have the addition problems and their results, we need to encode these into one-hot vectors. Additionally, since we are using a dense network (not an RNN), we will flatten the input and output data.

In [6]:
# Vectorize the data into one-hot encoded format
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool)

- x has the shape (len(questions), MAXLEN, len(chars)):
  - len(questions) is the total number of addition problems (50,000 in this case).
  - MAXLEN is the length of the input string (e.g., 34+78 with padding, which is 7 characters).
  - len(chars) is the number of unique characters in our character set (0123456789+ ).
- y has the shape (len(questions), DIGITS + 1, len(chars)):
  - DIGITS + 1 represents the length of the result (since the sum of two 3-digit numbers can be 4 digits long).

In [7]:
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, MAXLEN)

for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, DIGITS + 1)

- We iterate over each sentence (addition problem) in the questions list and encode it into a one-hot representation using the encode method of the CharacterTable class.
- Similarly, we iterate over each sentence (the expected result of the addition) in the expected list and encode it into a one-hot representation.

In [8]:
# Flatten the input and output for use in a dense network
x = x.reshape((x.shape[0], -1))  # Flatten for dense input
y = y.reshape((y.shape[0], -1))  # Flatten the output

**Since a dense neural network expects a 2D input, we flatten both the input (x) and output (y) arrays into 2D representations.**

In [9]:
# Shuffle the data
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

In [10]:
# Split data into train (80%), validation (15%), and test (5%)
split_at_1 = int(0.8 * len(x))  # 80% for training
split_at_2 = int(0.95 * len(x))  # 15% for validation, 5% for testing

x_train, x_val, x_test = x[:split_at_1], x[split_at_1:split_at_2], x[split_at_2:]
y_train, y_val, y_test = y[:split_at_1], y[split_at_1:split_at_2], y[split_at_2:]

In [11]:
print('Training Data:')
print(x_train.shape)
print(y_train.shape)
print()

print('Validation Data:')
print(x_val.shape)
print(y_val.shape)
print()

print('Test Data:')
print(x_test.shape)
print(y_test.shape)
print()

print('Example:')
print('The first row of input data is encoded internally as:')
print(x_train[0])
print()
print('The first row of output data is encoded internally as:')
print(y_train[0])
print()


# Reshape back to 2D form (MAXLEN, len(chars)) before decoding
reshaped_x_train = x_train[0].reshape(MAXLEN, len(chars))
reshaped_y_train = y_train[0].reshape(DIGITS + 1, len(chars))

# Decoding the reshaped one-hot encoded input and output
print('These internal representations represent these signals:')
print(ctable.decode(reshaped_x_train))
print(ctable.decode(reshaped_y_train))

Training Data:
(40000, 84)
(40000, 48)

Validation Data:
(7500, 84)
(7500, 48)

Test Data:
(2500, 84)
(2500, 48)

Example:
The first row of input data is encoded internally as:
[False False False False False False False False False False False  True
 False False  True False False False False False False False False False
 False  True False False False False False False False False False False
 False False False  True False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False  True False False False False False False
  True False False False False False False False False False False False]

The first row of output data is encoded internally as:
[False False False False  True False False False False False False False
 False False False False False False False False  True False False False
 False False False False False  True False False False False False False
  True False False False False False

### Step 3: Building the Dense Neural Network

We will now define the architecture of our dense neural network. This will consist of a few fully connected (dense) layers to learn the mapping from addition problems to their results.

In [12]:
# Model parameters
HIDDEN_SIZE = 256  # Number of neurons in the hidden layers
BATCH_SIZE = 128   # Batch size for training
LAYERS = 2         # Number of dense layers

# Number of output digits (this depends on the format of the output)
num_digits = DIGITS + 1  # Adjust based on your output format

# Build the Dense Neural Network model
model = Sequential()

# Input layer, first hidden layer, and Batch Normalization
model.add(Dense(HIDDEN_SIZE, input_shape=(x_train.shape[1],), kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())  # Add Batch Normalization here
model.add(Dense(HIDDEN_SIZE, activation='relu'))  # Activation after normalization

# Additional hidden layers with Batch Normalization
for _ in range(LAYERS - 1):
    model.add(Dense(HIDDEN_SIZE, kernel_regularizer=l2(0.01)))  # Hidden layer with regularization
    model.add(BatchNormalization())  # Batch Normalization after hidden layer
    model.add(Dense(HIDDEN_SIZE, activation='relu'))  # Activation after normalization

# Output layer: predict each digit of the result independently
model.add(Dense(num_digits * len(chars), activation='softmax'))
model.add(Reshape((num_digits, len(chars))))  # Reshape to have one softmax per digit

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001, clipnorm=1.0),
              metrics=['accuracy'])

# Display the model summary
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 256)               21760     
                                                                 
 batch_normalization (Batch  (None, 256)               1024      
 Normalization)                                                  
                                                                 
 dense_1 (Dense)             (None, 256)               65792     
                                                                 
 dense_2 (Dense)             (None, 256)               65792     
                                                                 
 batch_normalization_1 (Bat  (None, 256)               1024      
 chNormalization)                                                
                                                                 
 dense_3 (Dense)             (None, 256)               6

- `HIDDEN_SIZE`: Refers to the number of neurons in each hidden layer. Each hidden layer will contain 256 units that process the input data.
- `BATCH_SIZE`: A batch size of 128 means that the network will process 128 addition problems before updating the weights.
- `LAYERS`: We start with 2 layers to avoid over-complicating the network.
- `Dense() Layers`: The first dense layer accepts an input with shape (MAXLEN * len(chars)), which is the flattened one-hot encoded vector representing the input sequence.
   - `Kernel Regularization (L2)`: Prevents overfitting by penalizing large weight values, encouraging the network to learn smoother patterns.
   - `Batch Normalization`: Applied after each hidden layer to normalize the activations, leading to faster convergence and more stable training.
   - `Activation (ReLU)`: ReLU is used for non-linearity, allowing the model to learn complex functions effectively.
   - `Output Layer & Softmax`: The output layer predicts each digit of the result independently. Softmax ensures each prediction is a probability distribution over possible characters (0-9, +, space).
   - `Optimizer (Adam)`: Adam with a learning rate of 0.001 is used for efficient and adaptive gradient-based optimization, with clipnorm=1.0 to avoid large gradient updates.

### Step 4: Training the Model

Let’s now train the model for 35 epochs and evaluate it on the validation set. We will shuffle the data and monitor the validation accuracy at each step.

In [13]:
class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'

y_train = y_train.reshape((-1, DIGITS + 1, len(chars)))  # Reshape y_train to 3D
y_val = y_val.reshape((-1, DIGITS + 1, len(chars)))      # Reshape y_val to 3D

# Train the model for 35 epochs or stop early if loss < 0.1 or if loss is NaN
previous_model_weights = None  # To store the previous best model weights

for iteration in range(1, 36):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    # Train the model on the training data for one epoch
    history = model.fit(x_train, y_train,
                        batch_size=BATCH_SIZE,
                        epochs=1,
                        validation_data=(x_val, y_val))

    # Extract training loss from history
    loss = history.history['loss'][0]
    print(f'Training loss: {loss}')

    # Stop training early if loss is below 0.1
    if loss < 0.1:
        print(f"Stopping early as loss is below 0.1 in iteration {iteration}")
        break

    # Check if the loss is NaN
    if np.isnan(loss):
        print(f"Loss became NaN at iteration {iteration}. Restoring previous weights and stopping training.")
        # Restore the previous weights if NaN occurred
        model.set_weights(previous_model_weights)
        break

    # Save the model's weights after each successful epoch (before potential NaN)
    previous_model_weights = model.get_weights()

    # Select 10 samples from the validation set to visualize predictions and errors
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]

        # Predict the result of the validation sample
        preds = np.argmax(model.predict(rowx), axis=-1)

        # Decode the input and correct output
        q = ctable.decode(rowx[0].reshape(MAXLEN, len(chars)))  # Reshape back to 2D form for decoding
        correct = ctable.decode(rowy[0].reshape(DIGITS + 1, len(chars)))  # Reshape for decoding

        # Decode the predicted output
        try:
            guess = ctable.decode(preds[0], calc_argmax=False)
        except KeyError:
            guess = "Invalid Prediction"

        # Display the input, correct answer, and prediction
        print('Q', q[::-1] if REVERSE else q, end=' ')  # The input query
        print('T', correct, end=' ')  # The correct output
        if correct == guess:
            print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
        else:
            print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
        print(guess)


--------------------------------------------------
Iteration 1
Training loss: 2.328892946243286
Q 4+4     T 8    [91m☒[0m 790 
Q 865+5   T 870  [91m☒[0m 877 
Q 8+894   T 902  [91m☒[0m 800 
Q 85+123  T 208  [91m☒[0m 787 
Q 47+852  T 899  [91m☒[0m 880 
Q 0+15    T 15   [91m☒[0m 790 
Q 714+823 T 1537 [91m☒[0m 103 
Q 8+0     T 8    [91m☒[0m 890 
Q 640+12  T 652  [91m☒[0m 767 
Q 110+69  T 179  [91m☒[0m 780 

--------------------------------------------------
Iteration 2
Training loss: 1.0612726211547852
Q 976+96  T 1072 [91m☒[0m 1032
Q 91+940  T 1031 [91m☒[0m 1001
Q 77+602  T 679  [92m☑[0m 679 
Q 11+745  T 756  [91m☒[0m 765 
Q 481+431 T 912  [91m☒[0m 821 
Q 112+26  T 138  [91m☒[0m 148 
Q 535+206 T 741  [91m☒[0m 875 
Q 48+650  T 698  [91m☒[0m 798 
Q 32+571  T 603  [91m☒[0m 694 
Q 386+29  T 415  [91m☒[0m 525 

--------------------------------------------------
Iteration 3
Training loss: 0.743332028388977
Q 47+462  T 509  [91m☒[0m 519 
Q 32+107  T 139

In this block, we train the model for 35 epochs or stop early if the training loss drops below 0.1. The model is trained on the reshaped 3D data, which represents each digit in the output separately. After each epoch, we print the training loss and monitor performance on the validation set.

- `Reshaping to 3D`: Before training, we reshape y_train and y_val back to 3D (with shape (-1, DIGITS + 1, len(chars))). This is because each digit in the result is predicted separately, so the network must output multiple classes per example (one class per digit). The 3D structure allows the model to manage and evaluate these multiple class predictions effectively.
- `Early Stopping`: If the training loss goes below 0.1, the loop will break early, saving computation if the model has already achieved a good result.
- `Handling NaN Loss`: If the loss becomes NaN at any point (which can happen due to unstable training), the model will stop training, and the best weights from the previous epoch will be restored to prevent further degradation of the model’s performance.
- `Saving Model Weights`: After every successful epoch, the model's current weights are saved. If any issue (like NaN loss) occurs, the model can revert to the last valid weights, preventing issues from propagating
- `Prediction and Visualization`: After each epoch, we randomly select 10 samples from the validation set to visualize the model’s predictions and compare them to the correct answers. The inputs and predictions are decoded from one-hot encoded vectors back to human-readable format.

### Step 5: Model Evaluation

After training, we will evaluate the model’s accuracy on the validation set to see how well it learned to add two numbers.

In [14]:
# evaluate the keras model
_, accuracy = model.evaluate(x_val, y_val)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 97.12


This high accuracy shows that the model has learned to generalize well from the training data

### Step 6: Testing with Out-of-Bag Numbers

#### `Evaluate on the Test Set`
Once training is complete, you can evaluate the model on the test set (x_test, y_test) and visualize the input/output using the colors (☑ or ☒) just like before:

In [15]:
y_test = y_test.reshape((-1, DIGITS + 1, len(chars)))  # Reshape y_test to 3D

# Evaluate the model on the test set
_, test_accuracy = model.evaluate(x_test, y_test)
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))

# Select samples from the test set and visualize predictions
for i in range(10):  # Show 10 test examples
    ind = np.random.randint(0, len(x_test))
    rowx, rowy = x_test[np.array([ind])], y_test[np.array([ind])]

    # Predict the result on the test sample
    preds = np.argmax(model.predict(rowx), axis=-1)

    # Decode the input, correct output, and predicted output
    q = ctable.decode(rowx[0].reshape(MAXLEN, len(chars)))
    correct = ctable.decode(rowy[0].reshape(DIGITS + 1, len(chars)))
    
    try:
        guess = ctable.decode(preds[0], calc_argmax=False)
    except KeyError:
        guess = "Invalid Prediction"
    
    # Print the input, correct answer, and model's prediction with colors
    print('Q', q[::-1] if REVERSE else q, end=' ')  # Input question
    print('T', correct, end=' ')  # Correct answer
    if correct == guess:
        print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
    else:
        print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
    print(guess)

Test Accuracy: 97.36%
Q 98+713  T 811  [92m☑[0m 811 
Q 273+604 T 877  [92m☑[0m 877 
Q 848+65  T 913  [92m☑[0m 913 
Q 71+934  T 1005 [92m☑[0m 1005
Q 308+737 T 1045 [92m☑[0m 1045
Q 645+805 T 1450 [92m☑[0m 1450
Q 268+31  T 299  [92m☑[0m 299 
Q 22+327  T 349  [91m☒[0m 359 
Q 377+574 T 951  [92m☑[0m 951 
Q 7+678   T 685  [92m☑[0m 685 


In [16]:
# Function to encode a new addition problem and test it
def test_addition(num1, num2):
    # Encode the input addition problem
    q = '{}+{}'.format(num1, num2)
    q_padded = q + ' ' * (MAXLEN - len(q))  # Padding
    encoded_input = ctable.encode(q_padded, MAXLEN).reshape(1, -1)
    
    # Predict the sum using the trained model
    pred = model.predict(encoded_input)
    
    # Print the shape of the prediction for debugging
    print(f'Prediction shape: {pred.shape}')  # Check the shape of pred
    
    # Ensure we are reshaping the prediction correctly
    pred_reshaped = np.argmax(pred, axis=-1).reshape(1, DIGITS + 1)
    decoded_output = ctable.decode(pred_reshaped[0], calc_argmax=False)
    
    # Print results
    print(f'Adding {num1} + {num2}')
    print(f'Prediction: {decoded_output.strip()}')
    print(f'Correct Answer: {num1 + num2}')

# Test with a new random example
test_addition(123, 789)

Prediction shape: (1, 4, 12)
Adding 123 + 789
Prediction: 912
Correct Answer: 912


In [17]:
test_addition(100, 23)

Prediction shape: (1, 4, 12)
Adding 100 + 23
Prediction: 123
Correct Answer: 123


### Conclusion

1. **`Training and Validation Performance`**:
   - The model was trained on **80% of the data**, with **15%** used for validation and the rest **5%** used for testing.
   - During training, we monitored the loss to stop early when the training loss fell below 0.1 or when NaN loss occurred.
   - Throughout the training process, the model converged well, showing high performance on the validation set, indicating that the network successfully learned to add two numbers.
   - The use of batch normalization and L2 regularization helped stabilize the training process, preventing overfitting despite the simplicity of the task and the network architecture.

2. **`Test Set Evaluation`**:
   - After training, the model achieved a **test accuracy of 97.36%**, which indicates it generalized well to unseen data. 
   - The model's ability to predict correct sums on the test set demonstrates that even with a simple dense architecture, the network was able to effectively solve the problem of addition.
   - The test set predictions, visualized with the input/output comparison, confirmed that the model mostly predicted the correct sum, with only a few minor errors.

3. **`Dense Network Capability`**:
   - Dense neural networks are not traditionally used for sequence tasks like addition, as recurrent neural networks (RNNs) typically handle sequence prediction more efficiently. However, by properly encoding the input and output data (flattening the one-hot encoded sequences), the dense network was able to learn the relationships between input digits and the result.
   - The reshaping of the output to 3D (with dimensions for the number of digits and the possible character classes) allowed the dense network to handle the multi-digit prediction as separate classification tasks for each digit.
   - This demonstrates that dense networks, despite their limitations in sequence processing, can effectively perform such tasks if the data is appropriately preprocessed and the network is well-structured.

4. **`Data Preprocessing and Its Impact`**:
   - By flattening the input into a 2D structure and encoding each character (digit or operator) as a one-hot vector, the dense network was provided a clear way to learn patterns.
   - Reshaping the output back into 3D for digit-level prediction enabled the model to predict the sum digit by digit. This allowed the dense network to handle the complexity of multi-digit addition even without the sequential memory capabilities of an RNN.
  
In conclusion, the model effectively learned to add two numbers with high accuracy on training, validation, and test sets, proving that dense neural networks can perform well on tasks like addition when supported by correct data preprocessing and regularization techniques.

---


## Task 2: Can a Dense Network Add Two Numbers with More Digits than It Was Trained On?
In this task, we will explore whether a dense neural network, trained to add two numbers of up to a certain number of digits (e.g., 3 digits), can generalize and correctly add two numbers with one more digit during inference mode.

The challenge here is to determine whether a dense neural network can effectively handle inputs that are outside of the range it was trained on (in this case, 4-digit numbers after training on 3-digit numbers).

1. **Generate a Dataset**: Create random addition problems with up to 3-digit numbers.
2. **Preprocess the Data**: Encode the data into one-hot vectors and flatten it to fit a dense network architecture.
3. **Train the Network**: Train the dense neural network to predict the sum of two 3-digit numbers.
4. **Test the Network on 3-Digit Numbers**: Evaluate the model’s performance on unseen 3-digit test data.
5. **Inference on 4-Digit Numbers**: Generate new addition problems with 4-digit numbers and observe how the trained model performs on this task.
6. **Identify Limitations**: Analyze why the model cannot handle inputs with more digits without adjustments.
7. **Adjust the Model to Accept One More Digit**: Modify the data and model architecture to accept inputs with an extra digit.
8. **Conclusion**: Summarize findings and understand whether the network can generalize to larger inputs after adjustments.

### Step 1: Data Generation
We will generate two sets of data:
- Training data: Addition problems with up to 3-digit numbers (e.g., 321 + 456).
- Inference data: Addition problems with 4-digit numbers (e.g., 1234 + 5678), which the network hasn’t seen during training.

In [18]:
# Parameters for the model and dataset.
TRAINING_SIZE = 50000
DIGITS = 3  # Train on 3-digit numbers
REVERSE = False

# Maximum length of input for 3-digit numbers is 'int + int' (e.g., '123+456 '). Max length of int is DIGITS.
MAXLEN = DIGITS + 1 + DIGITS

# All the numbers, plus sign and space for padding.
chars = '0123456789+ '
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()

# Data Generation for 3-digit numbers
print('Generating 3-digit addition problems...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789')) for i in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()
    
    # Skip any addition questions we've already seen
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    
    # Format question and answer
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    ans += ' ' * (DIGITS + 1 - len(ans))  # Ensure answers are padded to the right length
    
    questions.append(query)
    expected.append(ans)

print(f'Total 3-digit addition problems: {len(questions)}')

Generating 3-digit addition problems...
Total 3-digit addition problems: 50000


In [19]:
questions[0], expected[0]

('1+79   ', '80  ')

In [20]:
# Ensure the character set includes all necessary characters
chars = '0123456789+ '  # Digits 0-9, plus sign, and space for padding
ctable = CharacterTable(chars)

# Data generation for 4-digit numbers (for inference)
TEST_SIZE = 1000
# Define the new max length for 4-digit problems
DIGITS_4 = 4
MAXLEN_4_DIGIT = DIGITS_4 + 1 + DIGITS_4  # e.g., '1234+5678 '

# Generate some 4-digit test questions and answers
questions_4_digit = []
expected_4_digit = []

while len(questions_4_digit) < TEST_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789')) for _ in range(DIGITS_4)))
    a, b = f(), f()
    
    # Ensure that the first digit is not zero
    if str(a)[0] == '0' or str(b)[0] == '0':
        continue  # Skip numbers that start with zero
    
    q = '{}+{}'.format(a, b)
    query = q  # No additional padding needed
    ans = str(a + b).rjust(DIGITS_4 + 1)  # Pad answer to length 5
    
    questions_4_digit.append(query)
    expected_4_digit.append(ans)

print(f'Total 4-digit addition problems: {len(questions_4_digit)}')

Total 4-digit addition problems: 1000


In [21]:
questions_4_digit[0], expected_4_digit[0]

('8935+8529', '17464')

### Step 2: Preprocessing Data
We will now encode the questions and answers into one-hot vectors, flatten the input, and prepare it for training the dense network.

In [22]:
# Vectorizing 3-digit data
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool)

for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, MAXLEN)

for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, DIGITS + 1)

# Flatten the input and output for the dense network
x = x.reshape((x.shape[0], -1))
y = y.reshape((y.shape[0], -1))

# Shuffle the data
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

# Split data into train, validation, test sets
split_at_1 = int(0.8 * len(x))  # 80% training
split_at_2 = int(0.95 * len(x))  # 15% validation, 5% testing

x_train, x_val, x_test = x[:split_at_1], x[split_at_1:split_at_2], x[split_at_2:]
y_train, y_val, y_test = y[:split_at_1], y[split_at_1:split_at_2], y[split_at_2:]

In [23]:
print('Training Data:')
print(x_train.shape)
print(y_train.shape)
print()

print('Validation Data:')
print(x_val.shape)
print(y_val.shape)
print()

print('Test Data:')
print(x_test.shape)
print(y_test.shape)
print()

print('Example:')
print('The first row of input data is encoded internally as:')
print(x_train[0])
print()
print('The first row of output data is encoded internally as:')
print(y_train[0])
print()


# Reshape back to 2D form (MAXLEN, len(chars)) before decoding
reshaped_x_train = x_train[0].reshape(MAXLEN, len(chars))
reshaped_y_train = y_train[0].reshape(DIGITS + 1, len(chars))

# Decoding the reshaped one-hot encoded input and output
print('These internal representations represent these signals:')
print(ctable.decode(reshaped_x_train))
print(ctable.decode(reshaped_y_train))

Training Data:
(40000, 84)
(40000, 48)

Validation Data:
(7500, 84)
(7500, 48)

Test Data:
(2500, 84)
(2500, 48)

Example:
The first row of input data is encoded internally as:
[False False False False False False False False False  True False False
 False False False False False False False  True False False False False
 False  True False False False False False False False False False False
 False False False  True False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False  True False False False False False False
  True False False False False False False False False False False False]

The first row of output data is encoded internally as:
[False False False False  True False False False False False False False
 False False False False False  True False False False False False False
 False False False False False False False False False False  True False
  True False False False False False

In [24]:
# Vectorizing 4-digit test data (for inference)
x_test_4_digit = np.zeros((len(questions_4_digit), MAXLEN_4_DIGIT, len(chars)), dtype=bool)
y_test_4_digit = np.zeros((len(questions_4_digit), DIGITS_4 + 1, len(chars)), dtype=bool)

for i, sentence in enumerate(questions_4_digit):
    x_test_4_digit[i] = ctable.encode(sentence, MAXLEN_4_DIGIT)

for i, sentence in enumerate(expected_4_digit):
    y_test_4_digit[i] = ctable.encode(sentence, DIGITS_4 + 1)

# Flattening the 4-digit input for the dense network
x_test_4_digit = x_test_4_digit.reshape((x_test_4_digit.shape[0], -1))
y_test_4_digit = y_test_4_digit.reshape((y_test_4_digit.shape[0], -1))

In [25]:
print('4 Digit Test Data:')
print(x_test_4_digit.shape)
print(y_test_4_digit.shape)
print()

4 Digit Test Data:
(1000, 108)
(1000, 60)



### Step 3: Building the Dense Neural Network


In [26]:
# Model parameters
HIDDEN_SIZE = 256  # Number of neurons in the hidden layers
BATCH_SIZE = 128   # Batch size for training
LAYERS = 2         # Number of dense layers

# Number of output digits (3-digit + 1 for carry over) for training
num_digits_train = DIGITS + 1  

# Build the Dense Neural Network model
model = Sequential()

# Input layer, first hidden layer, and Batch Normalization
model.add(Dense(HIDDEN_SIZE, input_shape=(x_train.shape[1],), kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())  # Add Batch Normalization
model.add(Dense(HIDDEN_SIZE, activation='relu'))  # Activation after normalization

# Additional hidden layers with Batch Normalization
for _ in range(LAYERS - 1):
    model.add(Dense(HIDDEN_SIZE, kernel_regularizer=l2(0.01)))  # Hidden layer with regularization
    model.add(BatchNormalization())  # Batch Normalization after hidden layer
    model.add(Dense(HIDDEN_SIZE, activation='relu'))  # Activation after normalization

# Output layer: predict each digit of the result independently
model.add(Dense(num_digits_train * len(chars), activation='softmax'))  # Softmax for multi-class classification
model.add(Reshape((num_digits_train, len(chars))))  # Reshape to have one softmax per digit

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001, clipnorm=1.0),
              metrics=['accuracy'])

# Display the model summary
model.summary()




Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_5 (Dense)             (None, 256)               21760     
                                                                 
 batch_normalization_2 (Bat  (None, 256)               1024      
 chNormalization)                                                
                                                                 
 dense_6 (Dense)             (None, 256)               65792     
                                                                 
 dense_7 (Dense)             (None, 256)               65792     
                                                                 
 batch_normalization_3 (Bat  (None, 256)               1024      
 chNormalization)                                                
                                                                 
 dense_8 (Dense)             (None, 256)              

### Step 4: Train the Model on 3-Digit Data
The training setup is similar to the previous task, using 3-digit numbers.

In [27]:
class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'

# Reshape y_train and y_val for 3-digit training
y_train = y_train.reshape((-1, DIGITS + 1, len(chars)))  # Reshape y_train to 3D for 3-digit training
y_val = y_val.reshape((-1, DIGITS + 1, len(chars)))      # Reshape y_val to 3D for validation

# Train the model for 35 epochs or stop early if loss < 0.1 or if loss is NaN
previous_model_weights = None  # To store the previous best model weights

for iteration in range(1, 36):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    # Train the model on the training data for one epoch
    history = model.fit(x_train, y_train,
                        batch_size=BATCH_SIZE,
                        epochs=1,
                        validation_data=(x_val, y_val))

    # Extract training loss from history
    loss = history.history['loss'][0]
    print(f'Training loss: {loss}')

    # Stop training early if loss is below 0.1
    if loss < 0.1:
        print(f"Stopping early as loss is below 0.1 in iteration {iteration}")
        break

    # Check if the loss is NaN
    if np.isnan(loss):
        print(f"Loss became NaN at iteration {iteration}. Restoring previous weights and stopping training.")
        # Restore the previous weights if NaN occurred
        model.set_weights(previous_model_weights)
        break

    # Save the model's weights after each successful epoch (before potential NaN)
    previous_model_weights = model.get_weights()

    # Select 10 samples from the validation set to visualize predictions and errors
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]

        # Predict the result of the validation sample
        preds = np.argmax(model.predict(rowx), axis=-1)

        # Decode the input and correct output
        q = ctable.decode(rowx[0].reshape(MAXLEN, len(chars)))  # Reshape back to 2D form for decoding
        correct = ctable.decode(rowy[0].reshape(DIGITS + 1, len(chars)))  # Reshape for decoding

        # Decode the predicted output
        try:
            guess = ctable.decode(preds[0], calc_argmax=False)
        except KeyError:
            guess = "Invalid Prediction"

        # Display the input, correct answer, and prediction
        print('Q', q[::-1] if REVERSE else q, end=' ')  # The input query
        print('T', correct, end=' ')  # The correct output
        if correct == guess:
            print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
        else:
            print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
        print(guess)


--------------------------------------------------
Iteration 1
Training loss: 2.3538877964019775
Q 57+364  T 421  [91m☒[0m 490 
Q 561+811 T 1372 [91m☒[0m 902 
Q 99+235  T 334  [91m☒[0m 391 
Q 861+634 T 1495 [91m☒[0m 900 
Q 702+2   T 704  [91m☒[0m 277 
Q 6+597   T 603  [91m☒[0m 491 
Q 682+7   T 689  [92m☑[0m 689 
Q 494+202 T 696  [91m☒[0m 671 
Q 63+915  T 978  [91m☒[0m 970 
Q 859+0   T 859  [91m☒[0m 377 

--------------------------------------------------
Iteration 2
Training loss: 1.071484923362732
Q 84+135  T 219  [92m☑[0m 219 
Q 928+386 T 1314 [91m☒[0m 1211
Q 88+85   T 173  [91m☒[0m 143 
Q 382+8   T 390  [91m☒[0m 400 
Q 146+832 T 978  [91m☒[0m 908 
Q 89+802  T 891  [91m☒[0m 890 
Q 214+3   T 217  [91m☒[0m 228 
Q 78+42   T 120  [91m☒[0m 10  
Q 52+118  T 170  [92m☑[0m 170 
Q 563+167 T 730  [91m☒[0m 770 

--------------------------------------------------
Iteration 3
Training loss: 0.7443254590034485
Q 167+53  T 220  [92m☑[0m 220 
Q 97+773  T 87

### Step 5: Evaluate on 3-Digit Test Set
Once the model is trained, we evaluate it on the 3-digit test set to check its performance.

In [28]:
# Reshape the test data back to 3D
y_test = y_test.reshape((-1, DIGITS + 1, len(chars)))

# Evaluate on 3-digit test set
_, test_accuracy = model.evaluate(x_test, y_test)
print(f'3-Digit Test Accuracy: {test_accuracy * 100:.2f}%')

3-Digit Test Accuracy: 96.65%


### Step 6: Test on 4-Digit Data (Inference Mode)
Now we test the trained model on the 4-digit test data to see if the model generalizes to numbers with one more digit.

In [29]:
# Reshape y_test_4_digit to 3D for 4-digit testing (DIGITS + 2 for the extra digit in the result)
y_test_4_digit = y_test_4_digit.reshape((-1, DIGITS + 2, len(chars)))

# Evaluate the model on the 4-digit test set
_, test_accuracy = model.evaluate(x_test_4_digit, y_test_4_digit)
print('Test Accuracy on 4-digit numbers: {:.2f}%'.format(test_accuracy * 100))

# Select 10 random samples from the 4-digit test set and visualize predictions
for i in range(10):  # Show 10 test examples
    ind = np.random.randint(0, len(x_test_4_digit))
    rowx, rowy = x_test_4_digit[np.array([ind])], y_test_4_digit[np.array([ind])]

    # Predict the result on the 4-digit test sample
    preds = np.argmax(model.predict(rowx), axis=-1)

    # Decode the input, correct output, and predicted output
    q = ctable.decode(rowx[0].reshape(MAXLEN_4_DIGIT, len(chars)))  # For 4-digit numbers, use MAXLEN_4_DIGIT
    correct = ctable.decode(rowy[0].reshape(DIGITS + 2, len(chars)))  # Adjust for the 5-digit result
    
    try:
        guess = ctable.decode(preds[0], calc_argmax=False)
    except KeyError:
        guess = "Invalid Prediction"
    
    # Print the input, correct answer, and model's prediction with colors
    print('Q', q[::-1] if REVERSE else q, end=' ')  # Input question
    print('T', correct, end=' ')  # Correct answer
    if correct == guess:
        print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
    else:
        print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
    print(guess)



InvalidArgumentError: Graph execution error:

Detected at node sequential_1/dense_5/BiasAdd defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module>

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/traitlets/config/application.py", line 1043, in launch_instance

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 725, in start

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 215, in start

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 607, in run_forever

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1919, in _run_once

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/events.py", line 80, in _run

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 502, in process_one

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 729, in execute_request

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 422, in do_execute

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 540, in run_cell

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code

  File "/var/folders/01/7z6mdkx15c5cyc_f0f_6ly1w0000gn/T/ipykernel_6087/2450687021.py", line 5, in <module>

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 2272, in evaluate

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 4079, in run_step

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 2042, in test_function

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 2025, in step_function

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 2013, in run_step

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 1893, in test_step

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/training.py", line 589, in __call__

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/sequential.py", line 398, in call

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/functional.py", line 515, in call

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/functional.py", line 672, in _run_internal_graph

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/keras/src/layers/core/dense.py", line 252, in call

Matrix size-incompatible: In[0]: [32,108], In[1]: [84,256]
	 [[{{node sequential_1/dense_5/BiasAdd}}]] [Op:__inference_test_function_142199]

##### `Why do we see this error?`:

- `Matrix Size Incompatibility`: The error indicates that there is a mismatch in the dimensions of the matrices involved in the computation during the evaluation.
- `Input Tensor Shape (In[0]: [32,108])`: This represents the batch of input data x_test_4_digit being fed into the model:
   - 32 is the batch size (number of samples processed in parallel).
   - 108 is the size of each input sample after flattening (since each 4-digit addition problem is represented by MAXLEN_4_DIGIT = 9 characters, and each character is one-hot encoded over len(chars) = 12 possible characters: 9 * 12 = 108).
- `Weight Matrix Shape (In[1]: [84,256])`: This represents the weights of the first dense layer in the model:
   - 84 is the expected input size for each sample (from the training phase, where inputs had MAXLEN = 7 characters: 7 * 12 = 84).
   - 256 is the number of neurons in the first hidden layer.
- `Cause of the Error`:
   - The model was trained on input vectors of size 84, corresponding to 3-digit addition problems.
   - During evaluation, we provided input vectors of size 108, corresponding to 4-digit addition problems.
   - This mismatch leads to an incompatibility in the matrix multiplication operation in the first dense layer, as the input dimensions do not align with the expected weight dimensions.
- `Result`: The model cannot process input data with dimensions different from what it was trained on.

##### `Solution: Adjust the Model to Accept Inputs with One More Digit`

To make the network able to accept numbers with one more digit, we need to:

1. **Adjust the Input Dimensions**:
    - Modify the MAXLEN to accommodate the larger input size.
    - Update the input shape in the model accordingly.
2. **Adjust the Model Architecture**:
    - Ensure the model's input layer can accept the new input size.
    - Modify the output layer to produce outputs with the correct dimensions.
3. **Test the Model on 4-Digit Numbers**:
    - After adjusting the model, evaluate its performance on 4-digit addition problems.

### Step 7: Adjusting the Model to Accept Inputs with One More Digit

`1. Adjust MAXLEN and Define DIGITS_4`
- `DIGITS`: We define DIGITS as 3, representing the maximum number of digits in the numbers used for training (up to 3-digit numbers).
- `DIGITS_4`: We create DIGITS_4 as DIGITS + 1, which equals 4. This represents the number of digits in the numbers we'll use for testing (4-digit numbers).
- `MAXLEN`: We adjust MAXLEN to accommodate the larger input size. It is calculated as DIGITS_4 + 1 + DIGITS_4, accounting for two numbers of length DIGITS_4 and the '+' sign in between. So, MAXLEN = 4 + 1 + 4 = 9.

`2. Generate Training Data with Zero Padding`
We pad each number a and b with leading zeros to match the length of 4-digit numbers (DIGITS_4).
For example, if a = 47, a_str = '0047'
- We pad with zeros instead of spaces to maintain the numeric structure and positional significance of each digit, ensuring consistent and meaningful input for the model.


In [30]:
# Adjust MAXLEN and DIGITS_4
DIGITS = 3
DIGITS_4 = DIGITS + 1
MAXLEN = DIGITS_4 + 1 + DIGITS_4

# Re-initialize character table if needed
chars = '0123456789+ '
ctable = CharacterTable(chars)

# Generate training data with padding
questions = []
expected = []
seen = set()
TRAINING_SIZE = 50000

# Modify Data Generation for 3-digit numbers with zero padding
print('Generating 3-digit addition problems with zero padding...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789')) 
                            for _ in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()

    # Skip any addition questions we've already seen
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)

    # Pad numbers with leading zeros to match DIGITS_4 length
    a_str = str(a).zfill(DIGITS_4)  # Pad to length DIGITS_4 with zeros
    b_str = str(b).zfill(DIGITS_4)  # Pad to length DIGITS_4 with zeros

    # Format question and answer
    q = '{}+{}'.format(a_str, b_str)
    query = q  # Length should be MAXLEN
    ans = str(a + b).zfill(DIGITS_4 + 1)  # Pad answer to length DIGITS_4 + 1

    # # Pad numbers with leading spaces to match MAXLEN
    # q = '{}+{}'.format(str(a).rjust(DIGITS_4), str(b).rjust(DIGITS_4))
    # query = q  # Length should be MAXLEN
    # ans = str(a + b).rjust(DIGITS_4 + 1)  # Pad answer to length DIGITS_4 + 1

    questions.append(query)
    expected.append(ans)

print(f'Total 3-digit addition problems: {len(questions)}')

Generating 3-digit addition problems with zero padding...
Total 3-digit addition problems: 50000


In [31]:
questions[0]

'0005+0007'

In [32]:
expected[0]

'00012'

`3. Vectorize the Training Data`

In [33]:
# Vectorize the training data
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(questions), DIGITS_4 + 1, len(chars)), dtype=bool)

for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, MAXLEN)

for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, DIGITS_4 + 1)

# Flatten the input and output
x = x.reshape((x.shape[0], -1))
y = y.reshape((y.shape[0], -1))

# Shuffle and split the data
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

split_at_1 = int(0.8 * len(x))  # 80% training
split_at_2 = int(0.95 * len(x))  # 15% validation, 5% testing

x_train, x_val, x_test = x[:split_at_1], x[split_at_1:split_at_2], x[split_at_2:]
y_train, y_val, y_test = y[:split_at_1], y[split_at_1:split_at_2], y[split_at_2:]

In [34]:
print('Training Data:')
print(x_train.shape)
print(y_train.shape)
print()

print('Validation Data:')
print(x_val.shape)
print(y_val.shape)
print()

print('Test Data:')
print(x_test.shape)
print(y_test.shape)
print()

print('Example:')
print('The first row of input data is encoded internally as:')
print(x_train[0])
print()
print('The first row of output data is encoded internally as:')
print(y_train[0])
print()


# Reshape back to 2D form (MAXLEN, len(chars)) before decoding
reshaped_x_train = x_train[0].reshape(MAXLEN, len(chars))
reshaped_y_train = y_train[0].reshape(DIGITS_4 + 1, len(chars))

# Decoding the reshaped one-hot encoded input and output
print('These internal representations represent these signals:')
print(ctable.decode(reshaped_x_train))
print(ctable.decode(reshaped_y_train))

Training Data:
(40000, 108)
(40000, 60)

Validation Data:
(7500, 108)
(7500, 60)

Test Data:
(2500, 108)
(2500, 60)

Example:
The first row of input data is encoded internally as:
[False False  True False False False False False False False False False
 False False  True False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False  True False
 False  True False False False False False False False False False False
 False False  True False False False False False False False False False
 False False False False False False False False  True False False False
 False False False  True False False False False False False False False
 False False False False False False False False False False False  True]

The first row of output data is encoded internally as:
[False False  True False False False False False False False False False
 False False  True False False Fa

`4. Build and Compile the Model`

In [35]:
# Reshape outputs for training
num_digits_output = DIGITS_4 + 1
y_train = y_train.reshape((-1, num_digits_output, len(chars)))
y_val = y_val.reshape((-1, num_digits_output, len(chars)))

# Build the model
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 2

model = Sequential()
model.add(Dense(HIDDEN_SIZE, input_shape=(MAXLEN * len(chars),), kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())
model.add(Dense(HIDDEN_SIZE, activation='relu'))

for _ in range(LAYERS - 1):
    model.add(Dense(HIDDEN_SIZE, kernel_regularizer=l2(0.01)))
    model.add(BatchNormalization())
    model.add(Dense(HIDDEN_SIZE, activation='relu'))

model.add(Dense(num_digits_output * len(chars), activation='softmax'))
model.add(Reshape((num_digits_output, len(chars))))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001, clipnorm=1.0),
              metrics=['accuracy'])

model.summary()



Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_10 (Dense)            (None, 128)               13952     
                                                                 
 batch_normalization_4 (Bat  (None, 128)               512       
 chNormalization)                                                
                                                                 
 dense_11 (Dense)            (None, 128)               16512     
                                                                 
 dense_12 (Dense)            (None, 128)               16512     
                                                                 
 batch_normalization_5 (Bat  (None, 128)               512       
 chNormalization)                                                
                                                                 
 dense_13 (Dense)            (None, 128)              

`5. Train the Model`



In [36]:
class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'

# Train the model for 35 epochs or stop early if loss < 0.1 or if loss is NaN
previous_model_weights = None  # To store the previous best model weights

for iteration in range(1, 36):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    # Train the model on the training data for one epoch
    history = model.fit(x_train, y_train,
                        batch_size=BATCH_SIZE,
                        epochs=1,
                        validation_data=(x_val, y_val))

    # Extract training loss from history
    loss = history.history['loss'][0]
    print(f'Training loss: {loss}')

    # Stop training early if loss is below 0.1
    if loss < 0.1:
        print(f"Stopping early as loss is below 0.1 in iteration {iteration}")
        break

    # Check if the loss is NaN
    if np.isnan(loss):
        print(f"Loss became NaN at iteration {iteration}. Restoring previous weights and stopping training.")
        # Restore the previous weights if NaN occurred
        model.set_weights(previous_model_weights)
        break

    # Save the model's weights after each successful epoch (before potential NaN)
    previous_model_weights = model.get_weights()

    # Select 10 samples from the validation set to visualize predictions and errors
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]

        # Predict the result of the validation sample
        preds = np.argmax(model.predict(rowx), axis=-1)

        # Decode the input and correct output
        q = ctable.decode(rowx[0].reshape(MAXLEN, len(chars)))  # Reshape back to 2D form for decoding
        correct = ctable.decode(rowy[0].reshape(DIGITS_4 + 1, len(chars)))  # Reshape for decoding

        # Decode the predicted output
        try:
            guess = ctable.decode(preds[0], calc_argmax=False)
        except KeyError:
            guess = "Invalid Prediction"

        # Display the input, correct answer, and prediction
        print('Q', q[::-1] if REVERSE else q, end=' ')  # The input query
        print('T', correct, end=' ')  # The correct output
        if correct == guess:
            print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
        else:
            print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
        print(guess)


--------------------------------------------------
Iteration 1
Training loss: 1.6623128652572632
Q 0722+0020 T 00742 [91m☒[0m 00759
Q 0077+0595 T 00672 [91m☒[0m 00699
Q 0758+0330 T 01088 [91m☒[0m 00199
Q 0656+0031 T 00687 [91m☒[0m 00195
Q 0784+0267 T 01051 [91m☒[0m 00035
Q 0630+0013 T 00643 [91m☒[0m 00753
Q 0418+0529 T 00947 [91m☒[0m 00033
Q 0007+0021 T 00028 [91m☒[0m 00125
Q 0527+0620 T 01147 [91m☒[0m 00159
Q 0002+0306 T 00308 [91m☒[0m 00103

--------------------------------------------------
Iteration 2
Training loss: 0.3386910557746887
Q 0054+0558 T 00612 [91m☒[0m 00602
Q 0009+0972 T 00981 [92m☑[0m 00981
Q 0033+0058 T 00091 [92m☑[0m 00091
Q 0522+0369 T 00891 [91m☒[0m 00991
Q 0564+0007 T 00571 [92m☑[0m 00571
Q 0023+0782 T 00805 [92m☑[0m 00805
Q 0807+0716 T 01523 [91m☒[0m 01613
Q 0641+0038 T 00679 [92m☑[0m 00679
Q 0037+0851 T 00888 [92m☑[0m 00888
Q 0652+0044 T 00696 [92m☑[0m 00696

--------------------------------------------------
Iteration 3


`6. Evaluate on 3-Digit Test Set`

In [37]:
# Reshape the test data back to 3D
y_test = y_test.reshape((-1, DIGITS_4 + 1, len(chars)))

# Evaluate on 3-digit test set
_, test_accuracy = model.evaluate(x_test, y_test)
print(f'3-Digit Test Accuracy: {test_accuracy * 100:.2f}%')

3-Digit Test Accuracy: 99.55%


The test accuracy on 3-digit numbers is 99.55%, indicating excellent performance just like earlier.

`7. Evaluate on 4-Digit Test Set`

In [38]:
# Reshape outputs before evaluation
y_test_4_digit = y_test_4_digit.reshape((-1, num_digits_output, len(chars)))

# Evaluate the model on the 4-digit test set
_, test_accuracy = model.evaluate(x_test_4_digit, y_test_4_digit)
print(f'Test Accuracy on 4-digit numbers: {test_accuracy * 100:.2f}%')

# Visualize predictions on 4-digit test data
print('\nSample predictions on 4-digit test data:')
for i in range(10):  # Show 10 test examples
    ind = np.random.randint(0, len(x_test_4_digit))
    rowx, rowy = x_test_4_digit[np.array([ind])], y_test_4_digit[np.array([ind])]

    # Predict the result on the 4-digit test sample
    preds = np.argmax(model.predict(rowx), axis=-1)

    # Decode the input, correct output, and predicted output
    q = ctable.decode(rowx[0].reshape(MAXLEN, len(chars)))
    correct = ctable.decode(rowy[0].reshape(num_digits_output, len(chars)))
    guess = ctable.decode(preds[0], calc_argmax=False)

    # Print the input, correct answer, and model's prediction
    print('Q', q, end=' ')  # Input question
    print('T', correct, end=' ')  # Correct answer
    if correct.strip() == guess.strip():
        print(colors.ok + '☑' + colors.close, end=' ')  # Correct prediction
    else:
        print(colors.fail + '☒' + colors.close, end=' ')  # Incorrect prediction
    print(guess)

Test Accuracy on 4-digit numbers: 50.36%

Sample predictions on 4-digit test data:
Q 3505+2363 T  5868 [91m☒[0m 00868
Q 7712+4465 T 12177 [91m☒[0m 01177
Q 8268+504  T  8772 [91m☒[0m 00318
Q 8503+7800 T 16303 [91m☒[0m 01303
Q 5750+510  T  6260 [91m☒[0m 00851
Q 1510+6742 T  8252 [91m☒[0m 01252
Q 969+3693  T  4662 [91m☒[0m 01629
Q 6088+8105 T 14193 [91m☒[0m 00193
Q 6836+5635 T 12471 [91m☒[0m 01971
Q 9459+7822 T 17281 [91m☒[0m 01281


- The test accuracy on 4-digit numbers is 50.36%.
- This accuracy represents per-character accuracy, not sequence accuracy.
- The model predicts some digits correctly, usually the last three digits.

`8. Compute Sequence Accuracy on 4-Digit Numbers`

In [39]:
# Compute sequence accuracy on 4-digit test data
correct_sequences = 0
total_sequences = len(x_test_4_digit)

for i in range(total_sequences):
    rowx = x_test_4_digit[np.array([i])]
    rowy = y_test_4_digit[np.array([i])]
    preds = np.argmax(model.predict(rowx, verbose=0), axis=-1)
    correct = ctable.decode(rowy[0])
    guess = ctable.decode(preds[0], calc_argmax=False)
    if correct == guess:
        correct_sequences += 1

sequence_accuracy = correct_sequences / total_sequences * 100
print(f'Sequence Accuracy on 4-digit numbers: {sequence_accuracy:.2f}%')

Sequence Accuracy on 4-digit numbers: 0.00%


- `Sequence Accuracy`:
    - Measures the percentage of samples where the entire output sequence is predicted correctly.
    - This is a stricter metric than per-character accuracy.
- `Results`:
    - The sequence accuracy on 4-digit numbers is 0.00%.
    - This means the model did not predict any 4-digit addition problems entirely correctly.
    - It highlights the model's inability to generalize to numbers with more digits than it was trained on.

### Step 8: **Conclusion**

#### **Can the Network Correctly Add Two Numbers with More Digits Than It Was Trained On?**

`No`, even after adjusting the model to accept larger inputs, the dense network cannot correctly add two numbers with more digits than it was trained on.

#### **Why Not?**

1. **`Lack of Generalization`:**
   - The model has learned to add numbers within the digit length it was trained on (3-digit numbers).
   - It fails to generalize the addition operation to numbers with more digits because it hasn't learned to handle carry-over and positional significance in unseen digit positions.

2. **`Positional Dependence`:**
   - Dense networks treat input features independently and lack the ability to capture sequential dependencies.
   - The model's weights are tuned to specific input positions corresponding to the training data.

3. **`Fixed Patterns`:**
   - The model relies on patterns seen during training.
   - When faced with inputs of a different structure (e.g., 4-digit numbers), it cannot expand beyond its learned patterns.

### **Observations from Results**

- **`Per-Character Accuracy`:**
  - The model achieves around **50.36%** per-character accuracy on 4-digit numbers because it correctly predicts some digits (often the last few digits).

- **`Sequence Accuracy`:**
  - The sequence accuracy is **0%**, indicating that the model does not predict any 4-digit sums entirely correctly.

- **`Sample Predictions`:**
  - The model often predicts the last few digits correctly but fails on the leading digits.
  - This suggests that it is applying learned patterns to familiar positions but cannot handle new digit positions.

#### **Implications**

- **`Limitations of Dense Networks`:**
  - Dense networks are not well-suited for tasks requiring an understanding of sequence and positional relationships.
  - They cannot handle variable-length inputs or generalize to longer sequences without architectural changes.

- **`Need for Sequence Models`:**
  - Models like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Transformers are better suited for sequence tasks.
  - These models can capture dependencies across different positions and handle variable-length inputs.

#### **Key Takeaways**

- **`Training on Larger Numbers`:**
  - To enable the model to add numbers with more digits, it should be trained on data that includes those digit lengths.
  - Including 4-digit numbers in the training set may help the model learn to handle larger numbers.

- **`Model Architecture Matters`:**
  - Selecting an appropriate model architecture is crucial for tasks involving sequences and generalization to larger inputs.
---

## Task 3 - Diference between Dense Network and Recurrent networks
 
In **dense networks** (also called fully connected networks), data flows through the network from one layer to the next in a straight line. Each neuron (or node) in one layer is connected to every neuron in the next layer. The network doesn’t "remember" anything from previous data inputs—it treats every new piece of information separately.

When data is passed through the network, the output is calculated using an equation like this:

`output = activation_function(W * input + b)`

Here, `W` stands for weights (the strength of the connections between neurons), `input` is the data you're feeding in, and `b` is a bias value that helps adjust the output. The `activation_function` decides whether or not to pass the signal along, helping the network make decisions.

After this, the network checks how far the output is from the correct answer. It uses **backward propagation** to calculate how much each weight contributed to the error. The gradient (or slope) of the error with respect to the weights is calculated:

`gradient = dL/dW`

The weights are then updated to reduce future errors using:

`new_weight = weight - learning_rate * gradient`

In **recurrent neural networks (RNNs)**, the network has a memory—it can remember what it learned from previous inputs. This is useful for tasks like speech or text, where the order of the inputs matters. For example, understanding the next word in a sentence depends on knowing what came before it.

In RNNs, the current output depends on both the present input and the hidden state (which contains information from previous inputs). The hidden state is calculated as:

`hidden_state_t = activation_function(W_h * hidden_state_t-1 + W_x * input_t + b)`

Here, `W_h` represents the weights for the hidden state from the previous step, and `W_x` represents the weights for the current input. The hidden state is updated after each step, meaning the network "remembers" past inputs as it processes new ones.

During training, RNNs also use **backward propagation**, but they apply a special technique called **Backpropagation Through Time (BPTT)**. This allows the network to calculate the error and adjust the weights across multiple time steps, so it can learn from both the current and past data.

So, dense networks process data in isolation, while RNNs remember past data, making them ideal for handling sequences of information like text or time series data.