**Encoder-Decoder Implementation Analysis**

## 1. Application or Use Case
- Uses an LSTM-based Encoder-Decoder model to translate text.
- Applied to translation from the Toraja language into Indonesian (Bahasa Indonesia).


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Dropout
from tensorflow.keras.models import Model
import numpy as np
import pandas as pd

# Dataset: (Toraja source, Indonesian target) pairs, each wrapped in <start>/<end> markers
dataset = [
    ("<start> Kombai <end>", "<start> Apa kabar? <end>"),
    ("<start> Manta'de <end>", "<start> Baik-baik saja <end>"),
    ("<start> Pa'tabe' <end>", "<start> Permisi <end>"),
    ("<start> Tammu' langi' <end>", "<start> Selamat pagi <end>"),
    ("<start> Kumande'ko? <end>", "<start> Apakah kamu sudah makan? <end>"),
    ("<start> Umbain? <end>", "<start> Apa itu? <end>"),
    ("<start> Sule? <end>", "<start> Di mana? <end>"),
    ("<start> Narombo? <end>", "<start> Kenapa? <end>"),
    ("<start> Indan muane? <end>", "<start> Siapa laki-laki itu? <end>"),
    ("<start> Indan baine? <end>", "<start> Siapa perempuan itu? <end>"),
    ("<start> Umba' ko? <end>", "<start> Kamu dari mana? <end>"),
    ("<start> Tammu' ko? <end>", "<start> Apakah kamu baik-baik saja? <end>"),
    ("<start> Sikamali' <end>", "<start> Terima kasih <end>"),
    ("<start> Mangrara banua <end>", "<start> Membangun rumah <end>"),
    ("<start> Pumate' <end>", "<start> Meninggal <end>"),
    ("<start> Malleke' <end>", "<start> Pergi <end>"),
    ("<start> Manapa'ko? <end>", "<start> Apa yang kamu lakukan? <end>"),
    ("<start> Bassi'ka <end>", "<start> Hujan <end>"),
    ("<start> Tallangko' <end>", "<start> Jatuh <end>"),
    ("<start> Allo tau? <end>", "<start> Hari ini apa? <end>"),
    ("<start> Kema'na ko? <end>", "<start> Kamu mau ke mana? <end>"),
    ("<start> Tabe' <end>", "<start> Maaf <end>"),
    ("<start> Tammu' to dolo' <end>", "<start> Orang dulu bilang <end>"),
    ("<start> Kema'na inai? <end>", "<start> Ibunya ke mana? <end>"),
    ("<start> Nasang tani' <end>", "<start> Tidak tahu <end>"),
    ("<start> Umbai'na mako? <end>", "<start> Apa yang dia katakan? <end>"),
    ("<start> Na'ala tu? <end>", "<start> Sudah diambil? <end>"),
    ("<start> Tammu' rara <end>", "<start> Selamat sore <end>"),
    ("<start> Nai' to dolota' <end>", "<start> Ini cerita orang tua <end>"),
    ("<start> Kema'na sangngambu? <end>", "<start> Ke mana anak kecil itu? <end>"),
    ("<start> Pa'rapo <end>", "<start> Menunggu <end>"),
    ("<start> Na'bangka <end>", "<start> Meninggalkan sesuatu <end>"),
    ("<start> Umbai' sangpuru' <end>", "<start> Di mana tempat sembahyang? <end>"),
    ("<start> Tamali'ko? <end>", "<start> Bagaimana kabarmu? <end>"),
    ("<start> Umbai' untu' mu? <end>", "<start> Di mana gigimu? <end>"),
    ("<start> Unni' tau untu' <end>", "<start> Anak yang kehilangan gigi <end>"),
    ("<start> Bangko' duka'na? <end>", "<start> Kenapa kamu sedih? <end>"),
    ("<start> Ledo'? <end>", "<start> Lapar? <end>"),
    ("<start> Uru'na sala <end>", "<start> Kesalahan pertama <end>"),
    ("<start> Nasang tau <end>", "<start> Tidak ada orang <end>"),
    ("<start> Tallangko' sangka' <end>", "<start> Jatuh ke bawah <end>"),
    ("<start> Masambo <end>", "<start> Beristirahat <end>"),
    ("<start> Ira' lisu'? <end>", "<start> Kapan kamu pulang? <end>"),
    ("<start> Pa'pada <end>", "<start> Sama-sama <end>"),
    ("<start> Makka' biang <end>", "<start> Menyapu halaman <end>"),
    ("<start> Tamali' allo <end>", "<start> Hari ini cerah <end>"),
    ("<start> Na'mu kalua? <end>", "<start> Sudah selesai? <end>"),
    ("<start> Tallang sang rampa' <end>", "<start> Jatuh di depan <end>"),
    ("<start> Bangko' lako' <end>", "<start> Mau pergi ke mana? <end>"),
    ("<start> Tammu' tallu <end>", "<start> Tiga orang berkumpul <end>")
]

## 2. Implemented Encoder-Decoder Architecture

- The model consists of an **LSTM-based Encoder and Decoder**.
- The encoder compresses the input sequence into a context vector (its final hidden and cell states).
- The decoder generates the target-language output from that context vector.
- **Word embeddings** are used to represent words.
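
The role of the context vector can be written compactly. The notation here is introduced only for illustration and does not appear in the notebook: $x_1, \dots, x_T$ are the source tokens, $y_1, \dots, y_{T'}$ the target tokens, and $(h_T, c_T)$ the encoder's final hidden and cell states.

$$
(h_T, c_T) = \mathrm{LSTM}_{\mathrm{enc}}(x_1, \dots, x_T), \qquad
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p\bigl(y_t \mid y_1, \dots, y_{t-1}, h_T, c_T\bigr)
$$

In this formulation the decoder sees the source sentence only through $(h_T, c_T)$, which is why a single fixed-size vector can become a bottleneck for longer inputs.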

In [None]:
# Preprocessing Data
input_texts, target_texts = zip(*dataset)

# filters='' keeps punctuation and the <start>/<end> markers as part of the tokens
input_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
target_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')

input_tokenizer.fit_on_texts(input_texts)
target_tokenizer.fit_on_texts(target_texts)

input_sequences = input_tokenizer.texts_to_sequences(input_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)

max_encoder_seq_length = max(len(seq) for seq in input_sequences)
max_decoder_seq_length = max(len(seq) for seq in target_sequences)

# Pad every sequence with zeros at the end so all rows have the same length
encoder_input_data = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
decoder_input_data = tf.keras.preprocessing.sequence.pad_sequences(target_sequences, maxlen=max_decoder_seq_length, padding='post')

# Teacher forcing: the target at step t is the decoder input at step t + 1
decoder_target_data = np.zeros_like(decoder_input_data)
decoder_target_data[:, :-1] = decoder_input_data[:, 1:]

# sparse_categorical_crossentropy expects integer targets with a trailing axis
decoder_target_data = np.expand_dims(decoder_target_data, -1)

In [None]:
# Model Encoder-Decoder
latent_dim = 256

encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(len(input_tokenizer.word_index) + 1, latent_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True, dropout=0.2, recurrent_dropout=0.2)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(len(target_tokenizer.word_index) + 1, latent_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.2, recurrent_dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(len(target_tokenizer.word_index) + 1, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
# Train Model
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=32, epochs=300, validation_split=0.2)

Epoch 1/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 10s 1s/step - accuracy: 0.0667 - loss: 4.3948 - val_accuracy: 0.5000 - val_loss: 4.2814
Epoch 2/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 136ms/step - accuracy: 0.5024 - loss: 4.2599 - val_accuracy: 0.5000 - val_loss: 4.0779
Epoch 3/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 159ms/step - accuracy: 0.5083 - loss: 4.0253 - val_accuracy: 0.5000 - val_loss: 3.6325
Epoch 4/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 137ms/step - accuracy: 0.5009 - loss: 3.5057 - val_accuracy: 0.5000 - val_loss: 2.7257
Epoch 5/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 135ms/step - accuracy: 0.4994 - loss: 2.5857 - val_accuracy: 0.5000 - val_loss: 2.1917
Epoch 6/300
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 137ms/step - accuracy: 0.4979 - loss: 2.2482 - val_accuracy: 0.5000 - val_loss: 2.3658
Epoch 7/300
2/2 ━━━━━━━

In [None]:
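# Inference Model
# Encoder: maps an input sequence to its final LSTM states
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder: runs one step at a time, taking the previous states as explicit inputs
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Reuse the trained decoder embedding layer; a freshly created Embedding would
# have random, untrained weights at inference time.
decoder_embedding = next(
    layer for layer in model.layers
    if isinstance(layer, Embedding) and layer.output is dec_emb
)
dec_emb2 = decoder_embedding(decoder_inputs)
decoder_outputs, state_h, state_c = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)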

In [None]:
encoder_model.summary()

In [None]:
decoder_model.summary()

## 3. Training Process and Dataset Used
- The dataset consists of text pairs in the form "<start> Input <end>" and "<start> Output <end>".
- The data is processed with tokenization, padding, and embedding.
- Padding is used to make all sequences the same length.
- The model is trained with **sparse categorical cross-entropy loss** and the **Adam optimizer**.
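
The tokenization, padding, and target-shifting steps listed above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch; the helper names (`build_vocab`, `encode`, `pad_post`) are hypothetical, and the indices differ from those of the Keras `Tokenizer`, which orders words by frequency.

```python
PAD = 0  # index 0 is reserved for padding, as in the notebook

def build_vocab(texts):
    """Map each lowercased, whitespace-separated token to an index starting at 1."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    """Convert a sentence into a list of token indices."""
    return [vocab[token] for token in text.lower().split()]

def pad_post(seq, maxlen):
    """Append padding indices until the sequence has length maxlen."""
    return seq + [PAD] * (maxlen - len(seq))

targets = ["<start> Apa kabar? <end>", "<start> Permisi <end>"]
vocab = build_vocab(targets)
sequences = [encode(t, vocab) for t in targets]
maxlen = max(len(s) for s in sequences)

decoder_input = [pad_post(s, maxlen) for s in sequences]
# Teacher forcing: the target at step t is the decoder input at step t + 1.
decoder_target = [row[1:] + [PAD] for row in decoder_input]

for row_in, row_out in zip(decoder_input, decoder_target):
    print(row_in, "->", row_out)
```

Each target row is the corresponding input row shifted left by one position, so at every step the decoder learns to predict the next word given the words before it.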

In [None]:
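# "Beam Search" Decoder
# Note: despite its name, this function performs top-k sampling. It keeps no
# beam of partial hypotheses and returns a single token index.
def beam_search_decoder(predictions, beam_width=3):
    # Indices of the beam_width most probable tokens at the last time step
    top_k_indices = np.argsort(predictions[0, -1, :])[-beam_width:]
    # Pick one of them uniformly at random, so repeated runs can differ
    sampled_token_index = np.random.choice(top_k_indices)
    return sampled_token_index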
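
For comparison, the sketch below shows what a conventional beam search could look like. It is not the notebook's method: `beam_search`, `toy_step`, and the `TOY_MODEL` probabilities are made up for illustration, with `toy_step` standing in for a call to the decoder model. Length normalization is left out for brevity.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict {token: probability} for the next token.
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished hypotheses are carried over
                continue
            for token, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the beam_width best hypotheses
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

# Toy next-token distributions (hypothetical values, not model output).
TOY_MODEL = {
    "<start>": {"apa": 0.6, "selamat": 0.4},
    "apa": {"kabar?": 0.5, "itu?": 0.5},
    "selamat": {"pagi": 0.9, "sore": 0.1},
    "kabar?": {"<end>": 1.0},
    "itu?": {"<end>": 1.0},
    "pagi": {"<end>": 1.0},
    "sore": {"<end>": 1.0},
}

def toy_step(prefix):
    """Look up the next-token distribution from the last token only."""
    return TOY_MODEL[prefix[-1]]

best = beam_search(toy_step, "<start>", "<end>", beam_width=3)
greedy = beam_search(toy_step, "<start>", "<end>", beam_width=1)
print("beam:  ", " ".join(best))
print("greedy:", " ".join(greedy))
```

In this toy setup, a beam width of 1 (greedy decoding) commits to "apa" because it is the most likely first word, while a wider beam finds that "selamat pagi" has the higher overall probability (0.36 versus 0.30).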

In [None]:
# Function for Translation
def translate(input_text):
    # Tokenize and pad the input exactly as during training
    input_seq = tf.keras.preprocessing.sequence.pad_sequences(
        input_tokenizer.texts_to_sequences([input_text]), maxlen=max_encoder_seq_length, padding='post')
    # Encode the input sentence into the initial decoder states
    states_value = encoder_model.predict(input_seq)
    # Start decoding from the <start> token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']
    decoded_sentence = []
    stop_condition = False
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        sampled_token_index = beam_search_decoder(output_tokens, beam_width=3)
        # Index 0 (padding) has no word, so it is rendered as '?'
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '?')
        # Stop at <end> or once the output exceeds the longest training target
        if sampled_word == '<end>' or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True
        else:
            decoded_sentence.append(sampled_word)
        # Feed the sampled token and the updated states into the next step
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]
    return ' '.join(decoded_sentence)

## 4. Results and Model Evaluation
- **Training Accuracy**: **1.0000**
- **Validation Accuracy**: **0.6857**
- **Sample Model Predictions**:
  - **Input**: "<start> Kombai <end>" → **Output**: "pergi kabar? ?"
  - **Input**: "<start> Manta'de <end>" → **Output**: "baik-baik"
  - **Input**: "<start> Pa'tabe' <end>" → **Output**: "permisi permisi ?"
  - **Input**: "<start> Tammu' langi' <end>" → **Output**: "pagi pagi pagi"
  - **Input**: "<start> Kumande'ko? <end>" → **Output**: "bagaimana kamu sudah ?"
- **Analysis of Results**:
  - The model **repeats words** in several translations (for example, "pagi pagi pagi").
  - Several outputs are imperfect but still contain words from the expected translation (for example, "kabar?", "permisi", "pagi", and "baik-baik").
  - The model reaches a **validation accuracy of 68.57%** against a training accuracy of 100%. The gap suggests overfitting on this 50-pair dataset and leaves room for improvement with more data or techniques such as an attention mechanism.
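
One caveat when reading these accuracy figures: the notebook's `Embedding` layers do not set `mask_zero=True`, so the Keras `accuracy` metric also counts padded positions. The sketch below, using made-up token indices and a hypothetical `token_accuracy` helper, shows how a model that predicts only padding can still score well when padding is counted.

```python
def token_accuracy(y_true, y_pred, pad_id=0, ignore_padding=False):
    """Fraction of positions where the prediction matches the target."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if ignore_padding and t == pad_id:
                continue  # skip padded target positions
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Made-up targets: two real tokens followed by padding (index 0).
y_true = [[5, 2, 0, 0, 0, 0], [7, 2, 0, 0, 0, 0]]
# A degenerate "model" that always predicts padding.
y_pred = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]

unmasked = token_accuracy(y_true, y_pred)
masked = token_accuracy(y_true, y_pred, ignore_padding=True)
print(f"Accuracy including padding: {unmasked:.4f}")
print(f"Accuracy excluding padding: {masked:.4f}")
```

The validation accuracy of exactly 0.5000 in the first training epochs is consistent with this effect, although the log alone does not confirm it. An accuracy computed only over non-padding tokens would likely be a more informative measure of translation quality.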


In [None]:
# Test Translation
sample_sentences = [
    "<start> Kombai <end>",
    "<start> Manta'de <end>",
    "<start> Pa'tabe' <end>",
    "<start> Tammu' langi' <end>",
    "<start> Kumande'ko? <end>"
]

for sentence in sample_sentences:
    print(f"Input: {sentence}")
    print(f"Output: {translate(sentence)}")
    print("-")

Input: <start> Kombai <end>
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 368ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 360ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
Output: pergi kabar? ?
-
Input: <start> Manta'de <end>
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
Output: baik-baik
-
Input: <start> Pa'tabe' <end>
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step


In [None]:
# Display Model Accuracy
train_acc = history.history['accuracy'][-1]
val_acc = history.history['val_accuracy'][-1]
print(f"Final Training Accuracy: {train_acc:.4f}")
print(f"Final Validation Accuracy: {val_acc:.4f}")

Final Training Accuracy: 1.0000
Final Validation Accuracy: 0.6857

## 5. Conclusion
The LSTM-based encoder-decoder in this project was successfully implemented for the task of translating the regional Toraja language into Indonesian. The model can be developed further with a larger dataset and an **attention** mechanism to improve performance.
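
To make the attention suggestion concrete, here is a minimal pure-Python sketch of dot-product attention for a single decoder step. It is an illustration only and not part of the notebook's model: the function names and all vector values are invented, and in Keras the encoder LSTM would additionally need `return_sequences=True` to expose one state per source token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, encoder_states):
    """Return (context vector, attention weights) for one decoder step."""
    # Score each encoder state by its dot product with the decoder state
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights

# Made-up 2-dimensional encoder states for a three-token source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]  # hypothetical current decoder hidden state

context, weights = dot_product_attention(decoder_state, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Instead of relying on one fixed context vector, the decoder recomputes a context at every step, weighting the source tokens most similar to its current state more heavily.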

