# GRU-based Encoder–Decoder for Machine Translation (NumPy)

In this project, I created a machine translation model using an encoder–decoder architecture built entirely from scratch in NumPy, without using any deep learning frameworks. The model translates from English to Japanese using a GRU (Gated Recurrent Unit).
- Dataset: English Japanese sentence pairs selected from the Tatoeba Project
- Architecture: Encoder–Decoder with GRU
- Objective: Generate Japanese translations using only NumPy operations



## Architecture Overview
- **Embedding**: English and Japanese character tokens are first mapped to integer indices. Each index is embedded into a dense vector using separate embedding matrices.
- **Encoder**: A GRU processes the full embedded English input sequence, the final hidden state is returned and represents the context of the entire input sentence. 
- **Decoder**: The decoder is another GRU that predicts the Japanese output one character at a time. It takes the embedded previous character and the hidden state from the previous step. The initial hidden state is the encoder's final hidden state.
- **Output**: Immediately after running the decoder GRU on a time step, apply an FC layer to the hidden state at that time step to get output logits. Store those logits and at the end apply softmax.

![Encoder Decoder Architecture](figures/ED_arch.png)

Image source: Zhang, Aston and Lipton, Zachary C. and Li, Mu and Smola, Alexander J. - https://github.com/d2l-ai/d2l-en

## 1. Loading and Filtering Sentence Pairs

Read English Japanese sentence pairs from the `jpn.txt` file. Because training a model from scratch using only NumPy means we can't use a gpu, we need to make the dataset smaller. We only use sentences with 3 or less words. lowercase the English text, and clean both languages using regular expressions. 

For English, we remove all non-alphanumeric characters. For Japanese, we preserve Hiragana, Katakana, Kanji, and a few key punctuation symbols (like 。 and 、). Duplicate English sentences are also removed to avoid redundancy.

In [2]:
import numpy as np
import re

# Lists for cleaned sentences
eng_sentences = []
jpn_sentences = []

# To avoid duplicates
seen = set()

with open('jpn.txt', "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split("\t")

        # Extract English and Japanese parts
        eng = parts[0].strip().lower()
        jpn = parts[1].strip().lower()

        # Clean English: keep lowercase letters, numbers, and spaces
        eng = re.sub(r"[^a-z0-9\s]", "", eng)

        # Clean Japanese: keep the punctuation
        jpn = re.sub(r"[^\u3040-\u30ff\u4e00-\u9fff。、！？\s]", "", jpn)
        # Filtering out sentences, max 3 words
        if len(eng.split()) <= 3 and len(jpn.split()) <= 3 and eng not in seen:
            eng_sentences.append(eng)
            jpn_sentences.append(jpn)
            seen.add(eng)

print(f"English Sentences: {eng_sentences[0:10]}")
print(f"English Sentences Length: {len(eng_sentences)}")
print(f"Japanese Sentences: {jpn_sentences[0:10]}")
print(f"Japanese Sentences Length: {len(jpn_sentences)}")

English Sentences: ['go', 'hi', 'run', 'who', 'wow', 'duck', 'fire', 'help', 'hide', 'jump']
English Sentences Length: 6627
Japanese Sentences: ['行け。', 'こんにちは。', '走れ。', '誰？', 'すごい！', '頭を下げろ！', '火事だ！', '助けて！', '隠れろ。', '飛び越えろ！']
Japanese Sentences Length: 6627


## Vocabulary Construction

We now build vocabularies for both English and Japanese.

- English is tokenized at word level. We assign indices to all words that appear more than twice.
- Japanese is tokenized at the character level, since words are not separated by spaces. Characters appearing more than twice are included in the vocabulary.

We also add special tokens:

| Token | Meaning         |
|-------|------------------|
| `<pad>` | Padding token     |
| `<unk>` | Unknown token     |
| `<bos>` | Beginning of sequence |
| `<eos>` | End of sequence     |


In [3]:
from collections import Counter

counter = Counter()

# Count word frequency
for sent in eng_sentences:
    counter.update(sent.split())

# English to index mapping
eng_to_ind = {'<pad>': 0, '<unk>':1}
ind = 2

# Include words that appear 3+ times.
for word in counter:
    if counter[word] > 2:
        eng_to_ind[word] = ind
        ind += 1
        
print("English Vocabulary Size:", len(eng_to_ind))

jcounter = Counter()

# Count character frequency across Japanese sentences
for sent in jpn_sentences:
    jcounter.update(list(sent))

# Japanese to index and index to Japanese mapping
jpn_to_ind = {'<pad>': 0, '<unk>': 1, '<bos>': 2, '<eos>': 3}
ind_to_jpn = {0: '<pad>', 1: '<unk>', 2: '<bos>', 3: '<eos>'}
ind = 4

# Include characters that appear 3+ times
for char, freq in jcounter.most_common():
    if freq > 2:
        jpn_to_ind[char] = ind
        ind_to_jpn[ind] = char
        ind += 1

print("Japanese Vocabulary Size:", len(jpn_to_ind))

English Vocabulary Size: 971
Japanese Vocabulary Size: 847


## Encoding Sentences

Now convert each sentence into a sequence of integer indices.

- English Encoding: 
  - Tokenized at the word level.
  - Each sentence is padded to 3 tokens using the `<pad>` (index 0).
  - Unknown words are mapped to `<unk>` (index 1).

- Japanese Encoding:
  - Tokenized at the character level.
  - Each sentence starts with `<bos>` (index 2) and ends with `<eos>` (index 3).
  - Limit to 12 tokens total to include both `<bos>` and `<eos>`.
  - Padding (`<pad>`) is added at the end if needed.

In [4]:
eng_encoded = []

# Encode each English sentence at word level
# Maximum of 3 words per sentence
for sentence in eng_sentences:
    s = []
    for word in sentence.split():
        if word in eng_to_ind:
            s.append(eng_to_ind[word])
        else:
            s.append(1)
    while(len(s) < 3):
        s.append(0) # Padding
    eng_encoded.append(s)

jpn_encoded = []

# Encode each Japanese sentence character level
# Maximum of twelve characters per sentence including <bos> and <eos>
for sentence in jpn_sentences:
    s = [2] # 2 is <bos>
    for ch in sentence:
        if len(s) > 10:
            break # We need room to fit <eos> token.
        if ch in jpn_to_ind:
            s.append(jpn_to_ind[ch])
        else:
            s.append(1)   # 1 is <unk>
    s.append(3)           # 3 is <eos>
    while(len(s) < 12):
        s.append(0)       # 0 is <pad>
    jpn_encoded.append(s)

print(f"English Sentences Encoded: {eng_encoded[0:10]}")
print(f"English Sentences Encoded Length: {len(eng_encoded)}")
print(f"Japanese Sentences Encoded: {jpn_encoded[0:5]}")
print(f"Japanese Sentences Encoded Length: {len(jpn_encoded)}")

English Sentences Encoded: [[2, 0, 0], [3, 0, 0], [4, 0, 0], [5, 0, 0], [6, 0, 0], [1, 0, 0], [7, 0, 0], [8, 0, 0], [9, 0, 0], [10, 0, 0]]
English Sentences Encoded Length: 6627
Japanese Sentences Encoded: [[2, 71, 47, 4, 3, 0, 0, 0, 0, 0, 0, 0], [2, 29, 23, 27, 42, 6, 4, 3, 0, 0, 0, 0], [2, 213, 28, 4, 3, 0, 0, 0, 0, 0, 0, 0], [2, 65, 26, 3, 0, 0, 0, 0, 0, 0, 0, 0], [2, 12, 79, 5, 53, 3, 0, 0, 0, 0, 0, 0]]
Japanese Sentences Encoded Length: 6627


## Creating Embedding Matrices

- Vocabulary Size: Approximately 900 for each language.
- Embedding Dimension: I estimated using int(min(600, 1.6 * vocab_size ** 0.56)) = 72  
  I wanted to use a power of 2, so I rounded down to 64.

Each word or character will be embedded into a 64 dimensional dense vector.


In [5]:
embedding_size = 64
V_eng = len(eng_to_ind)  # English vocab size
V_jpn = len(jpn_to_ind)  # Japanese vocab size

eng_embedding = np.random.randn(V_eng, embedding_size) * 0.1
jpn_embedding = np.random.randn(V_jpn, embedding_size) * 0.1

print(f"English embedding shape: {eng_embedding.shape}")
print(f"Japanese embedding shape: {jpn_embedding.shape}")

# Example
eng_sentence = [5, 23, 8]
embedded_sentence = eng_embedding[eng_sentence]  # Shape (3, 64)
print(f"English input shape: {embedded_sentence.shape}")  # (seq_len, embedding_dim)

English embedding shape: (971, 64)
Japanese embedding shape: (847, 64)
English input shape: (3, 64)


## Activation and Loss Functions

- Sigmoid: Used in GRU gates.
- Tanh: Used for the candidate hidden state.
- Softmax: Converts decoder logits into probabilities.
- Cross Entropy Loss: Measures prediction error between predicted softmax outputs and true labels.

Documentation generated by ChatGPT

In [7]:
def sigmoid(x):
    """
    Element-wise Sigmoid activation function.
    Used in GRU for update/reset gates.
    
    Clips input to avoid overflow in exp.
    """
    x = np.clip(x, -50, 50)  # Prevent overflow
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    """
    Derivative of the Sigmoid function.
    Used during GRU backpropagation.
    """
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):
    """
    Element-wise hyperbolic tangent activation.
    Used in GRU for candidate hidden state.
    """
    return np.tanh(x)

def tanh_deriv(x):
    """
    Derivative of tanh function.
    Used in GRU backpropagation.
    """
    return 1.0 - np.tanh(x) ** 2

def softmax(Z: np.ndarray) -> np.ndarray:
    """
    Apply softmax over last dimension (class logits).
    
    Parameters:
    - Z: np.ndarray of shape (batch_size, num_classes)
      Raw logits for classification.
    
    Returns:
    - np.ndarray of same shape: softmax probabilities.
    """
    Z = Z - np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

def CrossEntropy(yhat: np.ndarray, y: np.ndarray, eps: float = 1e-15) -> float:
    """
    Compute mean cross-entropy loss.

    Parameters:
    - yhat: np.ndarray of shape (batch_size, num_classes)
        Predicted probabilities (after softmax)
    - y: np.ndarray of shape (batch_size,)
        True class indices
    - eps: float
        Small constant to avoid log(0)

    Returns:
    - float: mean loss across the batch
    """
    yhat = np.clip(yhat, eps, 1 - eps)  # Prevent log(0)
    correct_probs = yhat[np.arange(len(y)), y]
    return -np.mean(np.log(correct_probs))


## GRU Architecture and Implementation

The Gated Recurrent Unit (GRU) is an RNN cell designed to assist with the vanishing/exploding gradient problem and capture long range dependencies more efficiently than a vanilla RNN. Below is a overview of the implementation.

- Reset Gate (R): Determines how much of the past hidden state to forget.
- Update Gate (Z): Determines how much of the new candidate state to use versus retaining the past hidden state.

## GRU Architecture and Implementation

The Gated Recurrent Unit (GRU) is an RNN cell designed to assist with the vanishing/exploding gradient problem and capture long range dependencies more efficiently than a vanilla RNN. Below is a overview of the implementation.

- Reset Gate (R): Determines how much of the past hidden state to forget.
- Update Gate (Z): Determines how much of the new candidate state to use versus retaining the past hidden state.
Rₜ = σ(xₜ @ W_r.T + hₜ₋₁ @ U_r + b_r)
### Forward Pass (per time step \(t\))
1. Compute gates
    - Rₜ = σ(xₜ @ W_r.T + hₜ₋₁ @ U_r + b_r)
    - Zₜ = σ(xₜ @ W_z.T + hₜ₋₁ @ U_z + b_z)
2. Candidate state 
    - Cₜ = tanh(xₜ @ W_c.T + (Rₜ ⊙ hₜ₋₁) @ U_c + b_c)
3. Hidden state update
    - xₜ = zₜ ⊙ hₜ₋₁ + (1 - Zₜ) ⊙ Cₜ

Store (xₜ, hₜ₋₁, Rₜ, Zₜ, Cₜ, hₜ) at each step to use during backpropagation.

![GRU Architecture](figures/GRU_arch.png)

### Gradient Calculations

Calculating the gradients is a tedious process, below are my calculations to get this model to work. There were a lot of problems.

![Backpropagation](figures/Backprop.png)

### Integration into Encoder–Decoder
- Encoder GRU: Processes the input sequence and returns the final hidden state \(h_{enc}\).  
- Decoder GRU: Initialized with \(h_{enc}\), generates the target sequence by feeding its own previous output back as input.

In [8]:
class GRU:
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Xavier Normal std
        std = np.sqrt(2 / (input_size + hidden_size))

        # Reset gate (R)
        self.r_w = np.random.normal(0, std, size=(hidden_size, input_size))
        self.r_u = np.random.normal(0, std, size=(hidden_size, hidden_size))
        self.r_b = np.zeros(hidden_size)

        # Update gate (Z)
        self.z_w = np.random.normal(0, std, size=(hidden_size, input_size))
        self.z_u = np.random.normal(0, std, size=(hidden_size, hidden_size))
        self.z_b = np.zeros(hidden_size)

        # Candidate hidden state (C)
        self.c_w = np.random.normal(0, std, size=(hidden_size, input_size))
        self.c_u = np.random.normal(0, std, size=(hidden_size, hidden_size))
        self.c_b = np.zeros(hidden_size)
    
    def forward(self, X, H=None):
        batch_size, seq_len, embed_dim = X.shape # (N, T, D)
        if H is None:
            H = np.zeros(shape=(batch_size, self.hidden_size)) # (N, H)

        hidden_states = []
        store = [] # For backpropagation

        for i in range(seq_len):
            x_t = X[:, i, :]  # (N, D)
            R = sigmoid(x_t @ self.r_w.T + H @ self.r_u + self.r_b)  # (N, H)
            Z = sigmoid(x_t @ self.z_w.T + H @ self.z_u + self.z_b)  # (N, H)
            C = tanh(x_t @ self.c_w.T + (R * H) @ self.c_u + self.c_b)  # (N, H)

            H_new = Z * H + (1 - Z) * C  # (N, H)

            store.append((x_t, H, R, Z, C, H_new))
            H = H_new
            hidden_states.append(H[:, np.newaxis, :])
            

        return H, np.concatenate(hidden_states, axis=1), store

    def backward(self, dH_all, store, lr):

        batch, seq_len, _ = dH_all.shape

        # Initialize accumulators for parameter gradients to zeros
        dr_w = np.zeros_like(self.r_w)  # ∂L/∂R_w, shape (H, D)
        dr_u = np.zeros_like(self.r_u)  # ∂L/∂R_u, shape (H, H)
        dr_b = np.zeros_like(self.r_b)  # ∂L/∂R_b,  shape (H,)

        dz_w = np.zeros_like(self.z_w)  # ∂L/∂Z_w
        dz_u = np.zeros_like(self.z_u)  # ∂L/∂Z_u
        dz_b = np.zeros_like(self.z_b)  # ∂L/∂Z_b

        dc_w = np.zeros_like(self.c_w)  # ∂L/∂C_w
        dc_u = np.zeros_like(self.c_u)  # ∂L/∂C_u
        dc_b = np.zeros_like(self.c_b)  # ∂L/∂C_b

        # dh accumulates the gradient flowing from future time steps
        dH = np.zeros((batch, self.hidden_size)) # (N, H)
        
        # Will hold gradient w.r.t. the input embeddings at each time step
        dx = np.zeros((batch, seq_len, self.input_size))  # (N, T, D)

        # Loop backwards through time: t = seq_len-1 ... 0
        for t in reversed(range(seq_len)):
            # Unpack stored values from forward pass
            # x_t:   (N, D) input at time t
            # h_prev:(N, H) previous hidden state
            # r_t:   (N, H) reset gate
            # z_t:   (N, H) update gate
            # c_t:   (N, H) candidate hidden state
            # h_t:   (N, H) final hidden state at time t
            x_t, h_prev, r_t, z_t, c_t, h_t = store[t]

            # dh: (N, H) total gradient wrt current hidden state h_t
            # Includes gradient from output loss and from next time step dH
            dh = dH_all[:, t, :] + dH
            
            # Gradients through update gate z_t and candidate c_t
            # ∂L/∂z_t = dh ⊙ (h_prev - c_t)
            dz = dh * (h_prev - c_t) # (N, H)
            # ∂L/∂c_t = dh ⊙ (1 - z_t)
            dc = dh * (1 - z_t)      # (N, H)
            # ∂L/∂h_prev = dh ⊙ z_t, one part of ∂L/∂h_{t-1}
            dh_prev = dh * z_t       # (N, H)

            # Backprop through tanh: ∂L/∂c_t_preact = dc ⊙ (1 - c_t²)
            dc_preact = dc * (1 - c_t**2)   # (N, H)
            
            # Accumulate parameter gradients for candidate weights
            # ∂L/∂C_w = dc_preact.T @ x_t
            dc_w += dc_preact.T @ x_t              # (H, D)
            # ∂L/∂C_u = (r_t ⊙ h_prev).T @ dc_preact
            dc_u += (r_t * h_prev).T @ dc_preact   # (H, H)
            # ∂L/∂C_b = sum over batch of dc_preact
            dc_b += np.sum(dc_preact, axis=0)      # (H,)

            # Reset gate gradient from C preact's hidden side
            # ∂L/∂r_t = ∂L/(∂C preact) ⊙ ∂(C preact)/∂r_t
            # ∂L/∂r_t = (dc_preact @ C_u.T) ⊙ h_prev
            dr = (dc_preact @ self.c_u.T) * h_prev   # (N, H)
            # Backprop through sigmoid: ∂L/∂r_preact = dr ⊙ r_t ⊙ (1 - r_t)
            dr_preact = dr * (r_t * (1 - r_t))       # (N, H)
            # Accumulate Reset gate gradients
            dr_w += dr_preact.T @ x_t                # (H, D)
            dr_u += h_prev.T @ dr_preact             # (H, H)
            dr_b += np.sum(dr_preact, axis=0)        # (H,)

            # Backprop through update gate sigmoid
            dz_preact = dz * z_t * (1 - z_t)         # (N, H)

            # Accumulate update gate gradients
            dz_w += dz_preact.T @ x_t                # (H, D)
            dz_u += h_prev.T @ dz_preact             # (H, H)
            dz_b += np.sum(dz_preact, axis=0)        # (H,)

            # Add to total gradient wrt h_prev
            # dh_prev = sum of all partials wrt h_prev from C, R, Z
            dh_prev += (dc_preact @ self.c_u.T) * r_t        # (N, H)
            dh_prev += dr_preact @ self.r_u.T                # (N, H)
            dh_prev += dz_preact @ self.z_u.T                # (N, H)
            

            # Set dH for next iteration (t-1)
            dH = dh_prev  # (N, H)

            # Gradient wrt. input x_t combines contributions from each gate
            # Paths: dr_preact, dz_preact, dc_preact
            dx[:, t, :] = (
                dr_preact @ self.r_w +  # via reset gate input weights
                dz_preact @ self.z_w +  # via update gate input weights
                dc_preact @ self.c_w    # via candidate input weights
            )  # shape (N, D)


        grads = [dc_w, dc_u, dc_b, dr_w, dr_u, dr_b, dz_w, dz_u, dz_b]

        for i in range(len(grads)):
            np.clip(grads[i], -1.0, 1.0, out=grads[i])  # Gradient Clipping

        self.c_w -= lr * dc_w
        self.c_u -= lr * dc_u
        self.c_b -= lr * dc_b

        self.r_w -= lr * dr_w
        self.r_u -= lr * dr_u
        self.r_b -= lr * dr_b

        self.z_w -= lr * dz_w
        self.z_u -= lr * dz_u
        self.z_b -= lr * dz_b

        return dH, dx

Saving parameters function

In [7]:
def save_parameters(path, Encoder_GRU, Decoder_GRU, eng_embedding, jpn_embedding, o_w, o_b):

    np.savez(
        path,
        eng_embedding=eng_embedding,
        jpn_embedding=jpn_embedding,
        o_w=o_w, o_b=o_b,
        enc_r_w=Encoder_GRU.r_w, enc_r_u=Encoder_GRU.r_u, enc_r_b=Encoder_GRU.r_b,
        enc_z_w=Encoder_GRU.z_w, enc_z_u=Encoder_GRU.z_u, enc_z_b=Encoder_GRU.z_b,
        enc_c_w=Encoder_GRU.c_w, enc_c_u=Encoder_GRU.c_u, enc_c_b=Encoder_GRU.c_b,
        dec_r_w=Decoder_GRU.r_w, dec_r_u=Decoder_GRU.r_u, dec_r_b=Decoder_GRU.r_b,
        dec_z_w=Decoder_GRU.z_w, dec_z_u=Decoder_GRU.z_u, dec_z_b=Decoder_GRU.z_b,
        dec_c_w=Decoder_GRU.c_w, dec_c_u=Decoder_GRU.c_u, dec_c_b=Decoder_GRU.c_b
    )
    print("Saved Parameters")

In [8]:
batch_size = 8
epochs = 3000
lr = 0.008
encoder_len = 3        # English length
target_len = 12        # Japanese length
hidden_size = 128

bos_id, pad_id, eos_id = 2, 0, 3

Encoder_GRU = GRU(64, hidden_size)
Decoder_GRU = GRU(64, hidden_size)

X = np.array(eng_encoded, dtype=int)
y = np.array(jpn_encoded, dtype=int)

V_jpn = jpn_embedding.shape[0]
o_w = np.random.randn(V_jpn, hidden_size) * 0.01
o_b = np.zeros(V_jpn)

past_loss = []

for epoch in range(epochs):
    total_loss = 0

    perm = np.random.permutation(X.shape[0])
    X_train = X[perm]
    y_train = y[perm]

    for batch in range(0, X.shape[0], batch_size):
        end = min(batch + batch_size, X_train.shape[0])
        X_ids = X_train[batch:end]
        y_train_batch = y_train[batch:end]
        batch_len = X_ids.shape[0]
        
        # Encoder forward
        X_emb = eng_embedding[X_ids]
        h_enc, enc_h_all, enc_store = Encoder_GRU.forward(X_emb)

        decoder_input = np.concatenate(
            [np.full((batch_len, 1), bos_id, dtype=int), y_train_batch[:, :-1]],
            axis=1
        )
        decoder_target = y_train_batch

        dec_in_emb = jpn_embedding[decoder_input]        # (N, T, D)
        _, h_dec_all, dec_store = Decoder_GRU.forward(dec_in_emb, h_enc)

        logits_flat = (h_dec_all @ o_w.T + o_b).reshape(-1, V_jpn) # (bs*seq_len, V_jpn)
        targets_flat = decoder_target.reshape(-1)

        mask = (targets_flat != pad_id)
        probs = softmax(logits_flat)
        probs = np.clip(probs, 1e-12, 1 - 1e-12)
        
        loss = -np.sum(np.log(probs[np.where(mask)[0], targets_flat[mask]])) / mask.sum()
        total_loss += loss * batch_len

        # Backprop through softmax + cross entropy
        dlogits = probs
        dlogits[np.where(mask)[0], targets_flat[mask]] -= 1
        dlogits /= mask.sum()  # average over non pad tokens

        # Final linear layer grad
        h_dec_all_flat = h_dec_all.reshape(-1, hidden_size)
        do_w = dlogits.T @ h_dec_all_flat    # (V_jpn, H)
        do_b = np.sum(dlogits, axis=0)       # (V_jpn,)
        dh_dec = dlogits @ o_w
        dh_dec = dh_dec.reshape(batch_len, target_len, hidden_size)

        # Decoder GRU backward
        dH_enc_from_dec, dx_dec = Decoder_GRU.backward(dh_dec, dec_store, lr)

        # Update decoder embeddings and clip gradients
        for b in range(batch_len):
            for t in range(target_len):
                idx_tok = decoder_input[b, t]
                grad = np.clip(dx_dec[b, t], -1.0, 1.0)
                jpn_embedding[idx_tok] -= lr * grad

        # Encoder GRU backward
        # Build gradient tensor with only last time-step having gradient
        dH_enc_all = np.zeros_like(enc_h_all)           # (B, encoder_len, H)
        dH_enc_all[:, -1, :] = dH_enc_from_dec          # gradient only on final encoder state
        _, dx_enc = Encoder_GRU.backward(dH_enc_all, enc_store, lr)

        # Update encoder embeddings and clip gradients
        for b in range(batch_len):
            for t in range(encoder_len):
                tok = X_ids[b, t]
                grad = np.clip(dx_enc[b, t], -1.0, 1.0)
                eng_embedding[tok] -= lr * grad

        # Clip output gradients
        np.clip(do_w, -1.0, 1.0, out=do_w)
        np.clip(do_b, -1.0, 1.0, out=do_b)

        # Update output layer
        o_w -= lr * do_w
        o_b -= lr * do_b

    tl = total_loss / X_train.shape[0]
    print(f"Epoch {epoch+1} | Loss: {tl:.4f}")

    past_loss.append(tl)

    if len(past_loss) > 2:
        if past_loss[-1] > past_loss[-2] and past_loss[-2] > past_loss[-3] and past_loss[-3] > past_loss[-4] and past_loss[-4] > past_loss[-5]:
            break

save_parameters("model_params3.npz", Encoder_GRU, Decoder_GRU, eng_embedding, jpn_embedding, o_w, o_b)

Epoch 1 | Loss: 6.6236
Epoch 2 | Loss: 6.0039
Epoch 3 | Loss: 4.7982
Epoch 4 | Loss: 4.3895
Epoch 5 | Loss: 4.1994
Epoch 6 | Loss: 4.0809
Epoch 7 | Loss: 3.9969
Epoch 8 | Loss: 3.9323
Epoch 9 | Loss: 3.8843
Epoch 10 | Loss: 3.8459
Epoch 11 | Loss: 3.8140
Epoch 12 | Loss: 3.7884
Epoch 13 | Loss: 3.7655
Epoch 14 | Loss: 3.7436
Epoch 15 | Loss: 3.7236
Epoch 16 | Loss: 3.7042
Epoch 17 | Loss: 3.6830
Epoch 18 | Loss: 3.6615
Epoch 19 | Loss: 3.6393
Epoch 20 | Loss: 3.6176
Epoch 21 | Loss: 3.5949
Epoch 22 | Loss: 3.5727
Epoch 23 | Loss: 3.5515
Epoch 24 | Loss: 3.5317
Epoch 25 | Loss: 3.5126
Epoch 26 | Loss: 3.4945
Epoch 27 | Loss: 3.4772
Epoch 28 | Loss: 3.4611
Epoch 29 | Loss: 3.4461
Epoch 30 | Loss: 3.4297
Epoch 31 | Loss: 3.4143
Epoch 32 | Loss: 3.3996
Epoch 33 | Loss: 3.3868
Epoch 34 | Loss: 3.3737
Epoch 35 | Loss: 3.3615
Epoch 36 | Loss: 3.3493
Epoch 37 | Loss: 3.3386
Epoch 38 | Loss: 3.3274
Epoch 39 | Loss: 3.3168
Epoch 40 | Loss: 3.3071
Epoch 41 | Loss: 3.2976
Epoch 42 | Loss: 3.2887
E

In [9]:
def predict(Encoder_GRU, Decoder_GRU, input_seq,
            eng_to_ind, eng_embedding,
            jpn_embedding, o_w, o_b,
            pad_id=0, unk_id=1, bos_id=2, eos_id=3,
            max_len=20, encoder_len=3):
    """
    Greedy decode.

    Returns list of predicted token ids excluding <eos>.
    """
    # Preprocess encoder input
    import re, numpy as np
    seq = re.sub(r"[^a-z0-9\s]", "", input_seq.strip().lower())
    src_tokens = [eng_to_ind.get(w, unk_id) for w in seq.split()]
    if not src_tokens:
        src_tokens = [pad_id]
    while len(src_tokens) < encoder_len:
        src_tokens.append(pad_id)
    src_tokens = src_tokens[:encoder_len]

    enc_embed = eng_embedding[src_tokens][None, ...]  # (1,T,D)
    h_enc, _, _ = Encoder_GRU.forward(enc_embed)      # (1,H)

    h = h_enc
    prev_token = bos_id
    preds = []

    for step in range(max_len):
        # Embed previous token
        y_embed = jpn_embedding[prev_token][None, :]   # (1,D)

        r = sigmoid(y_embed @ Decoder_GRU.r_w.T + h @ Decoder_GRU.r_u + Decoder_GRU.r_b)
        z = sigmoid(y_embed @ Decoder_GRU.z_w.T + h @ Decoder_GRU.z_u + Decoder_GRU.z_b)
        c = tanh(y_embed @ Decoder_GRU.c_w.T + (r * h) @ Decoder_GRU.c_u + Decoder_GRU.c_b)
        h = z * h + (1 - z) * c # (1,H)

        logits = h @ o_w.T + o_b # (1,V_jpn)

        # Prevent <pad> and <bos> from being chosen
        logits[0, pad_id] = -1e9
        logits[0, bos_id] = -1e9

        next_token = int(np.argmax(logits, axis=1)[0])

        if next_token == eos_id:
            break

        preds.append(next_token)
        prev_token = next_token

    return preds


In [10]:
def load_model(path, embedding_dim=64, hidden_size=128):
    params = np.load(path)

    # Initialize GRUs
    Encoder_GRU = GRU(embedding_dim, hidden_size)
    Decoder_GRU = GRU(embedding_dim, hidden_size)

    # Load Encoder weights
    Encoder_GRU.r_w = params["enc_r_w"]
    Encoder_GRU.r_u = params["enc_r_u"]
    Encoder_GRU.r_b = params["enc_r_b"]
    Encoder_GRU.z_w = params["enc_z_w"]
    Encoder_GRU.z_u = params["enc_z_u"]
    Encoder_GRU.z_b = params["enc_z_b"]
    Encoder_GRU.c_w = params["enc_c_w"]
    Encoder_GRU.c_u = params["enc_c_u"]
    Encoder_GRU.c_b = params["enc_c_b"]

    # Load Decoder weights
    Decoder_GRU.r_w = params["dec_r_w"]
    Decoder_GRU.r_u = params["dec_r_u"]
    Decoder_GRU.r_b = params["dec_r_b"]
    Decoder_GRU.z_w = params["dec_z_w"]
    Decoder_GRU.z_u = params["dec_z_u"]
    Decoder_GRU.z_b = params["dec_z_b"]
    Decoder_GRU.c_w = params["dec_c_w"]
    Decoder_GRU.c_u = params["dec_c_u"]
    Decoder_GRU.c_b = params["dec_c_b"]

    # Load embeddings and output layer
    eng_embedding = params["eng_embedding"]
    jpn_embedding = params["jpn_embedding"]
    o_w = params["o_w"]
    o_b = params["o_b"]

    return Encoder_GRU, Decoder_GRU, eng_embedding, jpn_embedding, o_w, o_b

In [11]:
Encoder_GRU, Decoder_GRU, eng_embedding, jpn_embedding, o_w, o_b = load_model("model_params3.npz")

In [26]:
sentence = "Go on"
translation_ids = predict(Encoder_GRU, Decoder_GRU, sentence,
                          eng_to_ind, eng_embedding,
                          jpn_embedding, o_w, o_b)

translation = "".join([ind_to_jpn[i] for i in translation_ids])
print("Translation:", translation)

Translation: これは行し。
