# Martian language decipher ranking



This notebook aims to solve the "Martian language" puzzle posed in the competition. Initial analysis shows that this is not a foreign language at all but **Indonesian whose words have been scrambled (anagrams)** following a specific encryption pattern.

**Main Goals:**
1.  **Decrypt the pattern**: Identify and implement the logic needed to solve the anagrams.
2.  **Prepare the data**: Convert `queries.txt` and `unk500.txt` from their encrypted form into readable Indonesian.
3.  **Save the results**: Store the decrypted text for the next stage, which trains a ranking model to match it against `eng_collection.txt`.

---
## **1. Initialization and Data Download**

The first step is to set up the working environment. We install the `gdown` library to download data from Google Drive, then import the libraries we need. After that, we download all files relevant to the competition.

Downloaded files:
* `queries.txt`: 50 "Martian language" documents whose translations must be found.
* `eng_collection.txt`: 1459 English documents that serve as translation candidates.
* `eng500.txt` & `unk500.txt`: A parallel corpus for training the model.
* `list_1.0.0.txt`: An Indonesian word list (KBBI) that is the key to solving the anagrams.

In [None]:
!pip install gdown
import gdown
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re



In [None]:
urls = {
    "queries.txt": "1csCrWhjczjvYVibFfV8tq2vlmXW4cCct",
    "eng500.txt": "1AZ913Q_YbM7K3azRzFYTtCIxBK6411Lc",
    "eng_collection.txt": "1nb9MooQQZvquRVp3azd3AfPRfItg_jyh",
    "unk500.txt": "1mTVAEgqhys17dD3LeaHVSecAzKHk88l2",
    "list_1.0.0.txt":"1F6wTRmKCrqvRSpGLpRPh8tDA7D6A9A4V" # Daftar kata KBBI
}

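import os

# Download each file only if it is not already present locally
print("Memeriksa dan mengunduh file...")
for filename, file_id in urls.items():
    if os.path.exists(filename):
        continue  # already downloaded in an earlier run
    print(f"Mengunduh {filename}...")
    gdown.download(id=file_id, output=filename, quiet=False)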

print("\nSemua file siap digunakan.")


Memeriksa dan mengunduh file...
Mengunduh queries.txt...


Downloading...
From: https://drive.google.com/uc?id=1csCrWhjczjvYVibFfV8tq2vlmXW4cCct
To: /kaggle/working/queries.txt
100%|██████████| 15.3k/15.3k [00:00<00:00, 17.1MB/s]


Mengunduh eng500.txt...


Downloading...
From: https://drive.google.com/uc?id=1AZ913Q_YbM7K3azRzFYTtCIxBK6411Lc
To: /kaggle/working/eng500.txt
100%|██████████| 65.8k/65.8k [00:00<00:00, 40.8MB/s]


Mengunduh eng_collection.txt...


Downloading...
From: https://drive.google.com/uc?id=1nb9MooQQZvquRVp3azd3AfPRfItg_jyh
To: /kaggle/working/eng_collection.txt
100%|██████████| 196k/196k [00:00<00:00, 84.2MB/s]


Mengunduh unk500.txt...


Downloading...
From: https://drive.google.com/uc?id=1mTVAEgqhys17dD3LeaHVSecAzKHk88l2
To: /kaggle/working/unk500.txt
100%|██████████| 151k/151k [00:00<00:00, 59.7MB/s]


Mengunduh list_1.0.0.txt...


Downloading...
From: https://drive.google.com/uc?id=1F6wTRmKCrqvRSpGLpRPh8tDA7D6A9A4V
To: /kaggle/working/list_1.0.0.txt
100%|██████████| 1.29M/1.29M [00:00<00:00, 110MB/s]


Semua file siap digunakan.





---
## **2. Initial Approach: A Simple `decode_anagram` Function**

The function below is the first and most basic implementation for decoding a single encrypted word. Working through it, we identified the encryption pattern:

**Pattern:** `{original_last_letter}zk0{word_anagram}xv{word_length}{original_first_letter}`

**How It Works:**
1.  **Parsing**: Split the encrypted text on the *noise* markers `zk0` and `xv` to recover the hints (first/last letter, length) and the anagram part.
2.  **Loading KBBI**: Read every word from `list_1.0.0.txt`.
3.  **Iteration & Checking**: Iterate over every KBBI word and validate it against the length, first-letter, and last-letter hints. Finally, check whether the word is an anagram of the scrambled part.

This method confirms our hypothesis, but it is **very inefficient** because it scans the entire KBBI list for every word to be decoded.
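
To make the pattern concrete, here is a minimal sketch that builds a token from a plain word and parses it back with a single regular expression. The names `encode_word`, `parse_token`, and `TOKEN_RE` are hypothetical helpers introduced for illustration, and the encoder is only our reconstruction of the scheme inferred above, not the organizers' actual generator.

```python
import random
import re

# Assumed token layout, inferred from the pattern above:
# <last letter> zk0 <shuffled letters> xv <length> <first letter>
TOKEN_RE = re.compile(r"^([a-z])zk0([a-z]+)xv(\d+)([a-z])$")


def encode_word(word, rng=random):
    """Build a token for `word` following the inferred pattern (illustrative only)."""
    letters = list(word)
    rng.shuffle(letters)
    return f"{word[-1]}zk0{''.join(letters)}xv{len(word)}{word[0]}"


def parse_token(token):
    """Return (last_letter, anagram, length, first_letter), or None if there is no match."""
    match = TOKEN_RE.match(token)
    if match is None:
        return None
    last_letter, anagram, length, first_letter = match.groups()
    return last_letter, anagram, int(length), first_letter


print(parse_token("tzk0ltpoixv5p"))  # ('t', 'ltpoi', 5, 'p')
```

Note that the first character of the token carries the *last* letter of the original word and the final character carries the *first* letter, which is easy to mix up when reading the code that follows.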

In [None]:
def decode_anagram(encrypted_text, kbbi_file_path = 'list_1.0.0.txt'):
    
    def parse_encrypted_text(text):
        """Parse encrypted text berdasarkan pola yang diberikan"""
        # Cari posisi noise "zk0" dan "xv"
        noise1_pos = text.find("zk0")
        noise2_pos = text.find("xv")
        
        if noise1_pos == -1 or noise2_pos == -1:
            return None
            
        # Extract komponen-komponen
        hint_start = text[0]  # huruf terakhir kata asli
        anagram_part = text[noise1_pos + 3:noise2_pos]  # bagian anagram
        length_part = text[noise2_pos + 2:-1]  # panjang kata
        hint_end = text[-1]  # huruf pertama kata asli
        
        try:
            word_length = int(length_part)
        except ValueError:
            return None
            
        return {
            'hint_start': hint_start,
            'anagram': anagram_part,
            'length': word_length,
            'hint_end': hint_end
        }
    
    def is_anagram(word1, word2):
        """Check apakah dua kata adalah anagram"""
        return sorted(word1.lower()) == sorted(word2.lower())
    
    def clean_word(word):
        """Bersihkan kata dari tanda baca"""
        return word.rstrip('.,;!?')
    
    def load_kbbi_words(file_path):
        """Load kata-kata dari file KBBI"""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                words = [line.strip() for line in file if line.strip()]
            return words
        except FileNotFoundError:
            print(f"File {file_path} tidak ditemukan!")
            return []
        except Exception as e:
            print(f"Error reading file: {e}")
            return []
    
    # Parse encrypted text
    parsed = parse_encrypted_text(encrypted_text)
    if not parsed:
        print("Format encrypted text tidak valid!")
        return []
    
    print(f"Parsing hasil:")
    print(f"- Hint start (huruf terakhir): {parsed['hint_start']}")
    print(f"- Anagram part: {parsed['anagram']}")
    print(f"- Length: {parsed['length']}")
    print(f"- Hint end (huruf pertama): {parsed['hint_end']}")
    print()
    
    # Load KBBI words
    kbbi_words = load_kbbi_words(kbbi_file_path)
    if not kbbi_words:
        return []
    
    print(f"Loaded {len(kbbi_words)} kata dari KBBI")
    print()
    
    # Cari kata yang cocok
    candidates = []
    
    for word in kbbi_words:
        clean = clean_word(word)
        
        # Check panjang kata
        if len(clean) != parsed['length']:
            continue
            
        # Check huruf pertama dan terakhir
        if len(clean) > 0:
            first_char = clean[0].lower()
            last_char = clean[-1].lower()
            
            if (first_char == parsed['hint_end'].lower() and 
                last_char == parsed['hint_start'].lower()):
                
                # Check apakah anagram
                if is_anagram(clean, parsed['anagram']):
                    candidates.append(word)
    
    return candidates


hasil = decode_anagram("tzk0ltpoixv5p")
print(hasil)

Parsing hasil:
- Hint start (huruf terakhir): t
- Anagram part: ltpoi
- Length: 5
- Hint end (huruf pertama): p

Loaded 112651 kata dari KBBI

['pilot']



---
## **3. Advanced Approach: An Optimized Decoder**

To address the performance problem, we build a set of more sophisticated and efficient functions. This approach is designed to process entire files quickly.

The key to the optimization is **pre-processing** the KBBI data by building **lookup tables** (fast-search dictionaries).

---
### **3.1. Optimized Helper Functions**

* **`_build_lookup_tables`**: This is the heart of the optimization. It runs only once and maps every KBBI word into dictionaries that support average constant-time (O(1)) lookups, which removes the need to re-scan the word list for each token.
* **`_generate_variations`**: Indonesian is rich in affixes (`me-`, `di-`, `-kan`) and reduplication (repeated words). This function proactively derives variants of each KBBI entry, stripping common prefixes and suffixes and adding reduplicated forms, to increase the chance of finding a match.
* **`_parse_encrypted_text`**: A more robust parser that handles punctuation attached to an encrypted word.
* **`_find_candidates_fast`**: Uses the prebuilt *lookup tables* to find candidate words very quickly.
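
The core idea behind the lookup tables can be sketched in a few lines: two words are anagrams exactly when their sorted letters are equal, so the sorted letters can serve as a dictionary key. The tiny word list and the names `signature`, `build_signature_index`, and `lookup` are made up for illustration; the notebook's real tables are built from the full KBBI list.

```python
from collections import defaultdict


def signature(word):
    """Canonical anagram key: the word's letters in sorted order."""
    return "".join(sorted(word.lower()))


def build_signature_index(words):
    """Map each signature to the words sharing it (built once, reused for every token)."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index


def lookup(index, anagram, first_letter, last_letter):
    """Return words matching the anagram and both letter hints."""
    return [
        w for w in index.get(signature(anagram), [])
        if w[0] == first_letter and w[-1] == last_letter
    ]


# Toy word list standing in for KBBI; "buku" and "kubu" share a signature
index = build_signature_index(["pilot", "topi", "buku", "kubu"])
print(lookup(index, "ukub", first_letter="b", last_letter="u"))  # ['buku']
```

The example also shows why the letter hints matter: the signature alone cannot tell `buku` from `kubu`, but the first-letter hint can.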

In [None]:
import re
from collections import defaultdict

# Global cache untuk lookup tables (dibangun sekali saja)
_ANAGRAM_LOOKUP = None
_LENGTH_LOOKUP = None
_FIRST_LAST_LOOKUP = None
_VARIATION_CACHE = {}
_KBBI_PROCESSED = None

def _clean_word(word):
    """Clean word dengan regex yang dioptimasi"""
    return re.sub(r'[^a-zA-Z]', '', word)

def _generate_variations(word):
    """Generate variations dengan algoritma yang dioptimasi"""
    if word in _VARIATION_CACHE:
        return _VARIATION_CACHE[word]
    
    variations = {word}  # Use set untuk avoid duplicates
    word_lower = word.lower()
    
    # Optimized prefix/suffix removal
    prefixes = ['me', 'men', 'meng', 'mem', 'meny', 'ber', 'ter', 'ke', 'se', 'di', 'pe', 'pen', 'peng', 'per']
    suffixes = ['kan', 'an', 'i', 'nya', 'mu', 'ku', 'lah', 'kah', 'tah']
    
    # Single pass untuk prefix removal
    for prefix in prefixes:
        if word_lower.startswith(prefix) and len(word) > len(prefix) + 2:  # Minimal 3 char root
            root = word[len(prefix):]
            variations.add(root)
    
    # Single pass untuk suffix removal
    for suffix in suffixes:
        if word_lower.endswith(suffix) and len(word) > len(suffix) + 2:
            root = word[:-len(suffix)]
            variations.add(root)
    
    # Combined prefix+suffix removal (hanya untuk kata panjang)
    if len(word) > 8:  # Optimasi: hanya untuk kata yang cukup panjang
        for prefix in prefixes[:5]:  # Hanya prefix yang paling umum
            for suffix in suffixes[:5]:  # Hanya suffix yang paling umum
                if (word_lower.startswith(prefix) and word_lower.endswith(suffix) and 
                    len(word) > len(prefix) + len(suffix) + 2):
                    root = word[len(prefix):-len(suffix)]
                    variations.add(root)
    
    # Generate reduplikasi hanya untuk kata yang reasonable
    base_variations = list(variations)
    for variation in base_variations:
        if 3 <= len(variation) <= 6:  # Reasonable length untuk reduplikasi
            variations.add(f"{variation}-{variation}")
            
            # Reduplikasi dengan prefix terpilih
            for prefix in ['me', 'ber', 'ter']:  # Hanya prefix yang umum untuk reduplikasi
                variations.add(f"{prefix}{variation}-{variation}")
    
    result = list(variations)
    _VARIATION_CACHE[word] = result
    return result

def _build_lookup_tables(kbbi_words):
    """Pre-build lookup tables untuk pencarian O(1)"""
    global _ANAGRAM_LOOKUP, _LENGTH_LOOKUP, _FIRST_LAST_LOOKUP, _KBBI_PROCESSED
    
    # Skip jika sudah dibangun untuk KBBI words yang sama
    if _KBBI_PROCESSED == id(kbbi_words):
        return
    
    print("Building optimized lookup tables...")
    
    # Reset lookup tables
    _ANAGRAM_LOOKUP = defaultdict(list)
    _LENGTH_LOOKUP = defaultdict(list)
    _FIRST_LAST_LOOKUP = defaultdict(list)
    
    # Process setiap kata KBBI
    for word in kbbi_words:
        clean_word = _clean_word(word)
        if not clean_word or len(clean_word) < 2:
            continue
        
        # Generate semua variations untuk kata ini
        variations = _generate_variations(clean_word)
        
        for variation in variations:
            if len(variation) < 2:
                continue
                
            # Anagram signature (sorted chars)
            signature = ''.join(sorted(variation.lower()))
            _ANAGRAM_LOOKUP[signature].append((word, variation))
            
            # Length lookup
            _LENGTH_LOOKUP[len(variation)].append((word, variation))
            
            # First-last char lookup
            first_last_key = f"{variation[0].lower()}_{variation[-1].lower()}"
            _FIRST_LAST_LOOKUP[first_last_key].append((word, variation))
    
    _KBBI_PROCESSED = id(kbbi_words)
    print(f"Lookup tables built: {len(_ANAGRAM_LOOKUP)} anagram signatures")

def _parse_encrypted_text(text):
    """Parse encrypted text dengan algoritma yang dioptimasi"""
    # Fast pattern matching
    noise1_pos = text.find("zk0")
    noise2_pos = text.find("xv")
    
    if noise1_pos == -1 or noise2_pos == -1:
        return None
    
    # Extract parts dalam satu operasi
    prefix_part = text[:noise1_pos]
    anagram_part = text[noise1_pos + 3:noise2_pos]
    length_and_hint = text[noise2_pos + 2:]
    
    # Parse components efficiently
    punctuation_start = ''
    punctuation_end = ''
    hint_start = ''
    hint_end = ''
    
    # Process prefix
    if prefix_part:
        if prefix_part[0] in '.,;!?()":/-':
            punctuation_start = prefix_part[0]
            # Find first alphanumeric after punctuation
            for char in prefix_part[1:]:
                if char.isalnum():
                    hint_start = char
                    break
        else:
            # Find first alphanumeric
            for char in prefix_part:
                if char.isalnum():
                    hint_start = char
                    break
    
    # Process suffix
    if length_and_hint:
        last_char = length_and_hint[-1]
        if last_char in '.,;!?()":/-':
            punctuation_end = last_char
            # Find last alphanumeric before punctuation
            for i in range(len(length_and_hint) - 2, -1, -1):
                if length_and_hint[i].isalnum():
                    hint_end = length_and_hint[i]
                    break
        else:
            hint_end = last_char
    
    # Clean anagram part efficiently
    clean_anagram_part = re.sub(r'[^a-zA-Z]', '', anagram_part)
    
    return {
        'hint_start': hint_start.lower() if hint_start else '',
        'anagram': clean_anagram_part.lower(),
        'length': len(clean_anagram_part),
        'hint_end': hint_end.lower() if hint_end else '',
        'punctuation_start': punctuation_start,
        'punctuation_end': punctuation_end,
        'original': text
    }

def _find_candidates_fast(parsed):
    """Find candidates menggunakan lookup tables untuk kecepatan maksimal"""
    candidates = []
    
    # Level 1: Exact match menggunakan anagram signature
    anagram_signature = ''.join(sorted(parsed['anagram']))
    exact_matches = _ANAGRAM_LOOKUP.get(anagram_signature, [])
    
    for original_word, variation in exact_matches:
        # Quick length check
        if len(variation) != parsed['length']:
            continue
        
        # Quick hint checks (jika ada)
        if parsed['hint_start'] and parsed['hint_start'].isalpha():
            if variation[-1].lower() != parsed['hint_start']:
                continue
        
        if parsed['hint_end'] and parsed['hint_end'].isalpha():
            if variation[0].lower() != parsed['hint_end']:
                continue
        
        # Found exact match
        result = parsed['punctuation_start'] + variation + parsed['punctuation_end']
        candidates.append((result, 'exact'))
        
        if len(candidates) >= 3:  # Limit hasil untuk efisiensi
            break
    
    # Level 2: Relaxed search jika exact match tidak ditemukan
    if not candidates:
        length_matches = _LENGTH_LOOKUP.get(parsed['length'], [])
        
        for original_word, variation in length_matches:
            if ''.join(sorted(variation.lower())) == anagram_signature:
                result = parsed['punctuation_start'] + variation + parsed['punctuation_end']
                candidates.append((result, 'relaxed'))
                
                if len(candidates) >= 3:
                    break
    
    # Level 3: Reduplication check untuk kata yang tidak ditemukan
    if not candidates and parsed['length'] % 2 == 0:
        half_length = parsed['length'] // 2
        
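        # Quick reduplication check: in a doubled word every letter occurs an
        # even number of times, so in the sorted list the even-indexed and
        # odd-indexed letters must be identical. (Comparing the two halves of
        # the sorted list would wrongly reject e.g. "bukubuku" -> "bbkkuuuu".)
        anagram_chars = sorted(parsed['anagram'])
        
        if anagram_chars[::2] == anagram_chars[1::2]:
            # Possible reduplication; every other letter gives the base word's signature
            single_word_chars = ''.join(anagram_chars[::2])
            single_matches = _ANAGRAM_LOOKUP.get(single_word_chars, [])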
            
            for original_word, variation in single_matches:
                if len(variation) == half_length:
                    redupe_result = f"{variation}-{variation}"
                    result = parsed['punctuation_start'] + redupe_result + parsed['punctuation_end']
                    candidates.append((result, 'reduplication'))
                    break
    
    return candidates


---
### **3.2. Main Decryption Functions**

* **`decode_anagram_single`**: Orchestrates the helper functions to decrypt a single encrypted word efficiently.
* **`decode_sentence_anagrams`**: The top-level function that runs the whole workflow:
    1.  Read the input file (for example, `queries.txt`).
    2.  Use a *regular expression* to find every encrypted-word pattern in a sentence.
    3.  Call `decode_anagram_single` for each pattern found.
    4.  Replace the encrypted word with its decrypted form.
    5.  Save the cleaned sentences to the output file.
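
Steps 2 to 4 can be sketched with `re.sub` and a callback, which replaces each token in place. This is a simplified illustration, not the notebook's implementation: the pattern `TOKEN`, the function `decode_sentence_sketch`, and the three-word dictionary are assumptions, and punctuation handling and affix variants are left out.

```python
import re

# Simplified token pattern (assumed): hint letter, zk0, letters, xv, digits, hint letter
TOKEN = re.compile(r"([a-z])zk0([a-z]+)xv(\d+)([a-z])")

# Toy dictionary keyed by sorted letters; a stand-in for the KBBI lookup tables
TOY_INDEX = {"".join(sorted(w)): w for w in ["pilot", "itu", "tidur"]}


def decode_sentence_sketch(sentence):
    """Replace every decodable token; leave unknown tokens untouched."""
    def replace(match):
        last, anagram, length, first = match.groups()
        word = TOY_INDEX.get("".join(sorted(anagram)))
        hints_ok = (
            word is not None
            and len(word) == int(length)
            and word[0] == first
            and word[-1] == last
        )
        return word if hints_ok else match.group(0)

    return TOKEN.sub(replace, sentence)


print(decode_sentence_sketch("tzk0ltpoixv5p uzk0tiuxv3i rzk0rdutixv5t."))
# pilot itu tidur.
```

One possible advantage of a callback over `str.replace` is that each occurrence is substituted at its own position, so a token that happens to be a substring of another token cannot be rewritten by accident.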

In [None]:

def decode_anagram_single(encrypted_text, kbbi_words):
    """Main decode function dengan optimasi maksimal"""
    # Build lookup tables jika belum ada
    _build_lookup_tables(kbbi_words)
    
    # Parse input
    parsed = _parse_encrypted_text(encrypted_text)
    if not parsed:
        return encrypted_text
    
    # Debug info (optional)
    # print(f"Debug parsing '{encrypted_text}':")
    # print(f"  - Hint start (huruf terakhir): '{parsed['hint_start']}'")
    # print(f"  - Anagram part (cleaned): '{parsed['anagram']}'")
    # print(f"  - Length: {parsed['length']}")
    # print(f"  - Hint end (huruf pertama): '{parsed['hint_end']}'")
    # print(f"  - Punctuation start: '{parsed['punctuation_start']}'")
    # print(f"  - Punctuation end: '{parsed['punctuation_end']}'")
    
    # Find candidates
    candidates = _find_candidates_fast(parsed)
    
    if candidates:
        # Return best candidate (prioritas: exact > relaxed > reduplication)
        best_candidate = min(candidates, key=lambda x: {'exact': 0, 'relaxed': 1, 'reduplication': 2}[x[1]])
        # print(f"  - Found match: {best_candidate[0]} (type: {best_candidate[1]})")
        return best_candidate[0]
    
    # print(f"  - No match found, returning original")
    return encrypted_text




In [None]:
def decode_sentence_anagrams(sentence_file_path, kbbi_file_path='/kaggle/input/indawg/list_1.0.0.txt', output_file_path='output_hasil.txt'):
    """
    Decode anagram dalam kalimat dari file txt
    """
    
    def load_kbbi_words(file_path):
        """Load kata-kata dari file KBBI"""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                words = [line.strip() for line in file if line.strip()]
            return words
        except FileNotFoundError:
            print(f"File {file_path} tidak ditemukan!")
            return []
        except Exception as e:
            print(f"Error reading KBBI file: {e}")
            return []
    
    def load_sentences(file_path):
        """Load kalimat dari file txt"""
        try:
            # FIX: Tambahkan errors='replace' untuk menangani karakter tidak valid
            with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
                sentences = [line.strip() for line in file if line.strip()]
            return sentences
        except FileNotFoundError:
            print(f"File {file_path} tidak ditemukan!")
            return []
        except Exception as e:
            print(f"Error reading sentence file: {e}")
            return []
    
    def find_anagram_pattern(text):
        """Find anagram pattern dalam text dengan berbagai variasi"""
        # Pattern yang lebih komprehensif untuk menangkap berbagai format
        patterns = [
            r'[.,;!?()":/-]?[a-zA-Z0-9]*zk0[a-zA-Z0-9.,;!?()":-]*xv[a-zA-Z0-9.,;!?()":-]*[.,;!?()":/-]?',
            r'\b[a-zA-Z0-9.,;!?()":-]*zk0[a-zA-Z0-9.,;!?()":-]*xv[a-zA-Z0-9.,;!?()":-]*\b'
        ]
        
        matches = []
        for pattern in patterns:
            matches.extend(re.findall(pattern, text))
        
        # Remove duplicates dan filter yang valid
        unique_matches = []
        for match in matches:
            if match not in unique_matches and 'zk0' in match and 'xv' in match:
                unique_matches.append(match)
        
        return unique_matches
    
    def decode_sentence(sentence, kbbi_words):
        """Decode semua anagram dalam satu kalimat"""
        anagram_patterns = find_anagram_pattern(sentence)
        
        decoded_sentence = sentence
        
        for pattern in anagram_patterns:
            # Decode anagram
            decoded_word = decode_anagram_single(pattern, kbbi_words)
            
            if decoded_word and decoded_word != pattern:
                # Replace dalam kalimat
                decoded_sentence = decoded_sentence.replace(pattern, decoded_word)
                # print(f"  {pattern} -> {decoded_word}")
            # else:
                # print(f"  {pattern} -> [TIDAK DITEMUKAN]")
        
        return decoded_sentence
    
    # Load data
    print("Loading KBBI words...")
    kbbi_words = load_kbbi_words(kbbi_file_path)
    if not kbbi_words:
        return []
    
    print(f"Loaded {len(kbbi_words)} kata dari KBBI")
    print()
    
    print("Loading sentences...")
    sentences = load_sentences(sentence_file_path)
    if not sentences:
        return []
    
    print(f"Loaded {len(sentences)} kalimat")
    print()
    
    # Decode semua kalimat
    decoded_sentences = []
    
    for i, sentence in enumerate(sentences, 1):
        # print(f"Kalimat {i}: {sentence}")
        
        decoded = decode_sentence(sentence, kbbi_words)
        decoded_sentences.append(decoded)
        
        # print(f"Hasil   : {decoded}")
        # print("-" * 80)
    
    # Simpan hasil jika diminta
    if output_file_path:
        try:
            with open(output_file_path, 'w', encoding='utf-8') as f:
                for decoded in decoded_sentences:
                    f.write(decoded + '\n')
            print(f"Hasil disimpan ke: {output_file_path}")
        except Exception as e:
            print(f"Error saving output: {e}")
    
    return decoded_sentences



---
## **4. Running on the Competition Data & Saving Results**

We now apply `decode_sentence_anagrams` to the actual data files.

In [None]:
def decode_from_file(input_file_path, kbbi_file_path='list_1.0.0.txt'):
    """Wrapper function untuk decode dari file dengan path yang sudah diketahui"""
    return decode_sentence_anagrams(input_file_path, kbbi_file_path, 
                                   input_file_path.replace('.txt', '_decoded.txt'))

---
### **4.1. Decrypting `unk500.txt`**

We process `unk500.txt` to obtain the Indonesian side of the parallel corpus. `decode_from_file` writes its result to `unk500_decoded.txt`, and the cell below also saves a copy as `output_hasil_unk.txt`. Paired with `eng500.txt`, this file becomes the training data for our ranker model.
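
Once the decoded file exists, pairing it with `eng500.txt` might look like the sketch below. It assumes that line *i* of one file translates line *i* of the other, which the notebook implies but does not verify; `load_parallel_pairs` is a hypothetical helper.

```python
def load_parallel_pairs(indonesian_path, english_path):
    """Read two line-aligned files and return (indonesian, english) pairs."""
    with open(indonesian_path, encoding="utf-8", errors="replace") as f:
        indonesian = [line.strip() for line in f if line.strip()]
    with open(english_path, encoding="utf-8", errors="replace") as f:
        english = [line.strip() for line in f if line.strip()]

    # The decoder drops blank lines, so a length mismatch would mean the
    # two files are no longer aligned line by line.
    if len(indonesian) != len(english):
        raise ValueError(
            f"Line count mismatch: {len(indonesian)} vs {len(english)}"
        )
    return list(zip(indonesian, english))


# Example (file names taken from this notebook):
# pairs = load_parallel_pairs("output_hasil_unk.txt", "eng500.txt")
# print(len(pairs), pairs[0])
```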

In [None]:
hasil = decode_from_file('unk500.txt')
nama_file = 'output_hasil_unk.txt'

# Open the file in 'w' (write) mode;
# the 'with' block closes the file automatically when done
with open(nama_file, 'w', encoding='utf-8') as f:
    # Loop over every decoded line in the list `hasil`
    for baris in hasil:
        # Write each line to the file, followed by '\n' to start a new line
        f.write(baris + '\n')

Loading KBBI words...
Loaded 112651 kata dari KBBI

Loading sentences...
Loaded 500 kalimat

Building optimized lookup tables...
Lookup tables built: 245605 anagram signatures
Hasil disimpan ke: unk500_decoded.txt


---
### **4.2. Decrypting `queries.txt`**

Next, we process `queries.txt`, the 50 documents whose English counterparts we need to find. `decode_from_file` writes the decrypted Indonesian text to `queries_decoded.txt`, and the cell below also saves it as `output_hasil_query.txt`, which becomes the input to our ranker model in the next stage.

In [None]:
hasil = decode_from_file('queries.txt')
nama_file = 'output_hasil_query.txt'

# Open the file in 'w' (write) mode;
# the 'with' block closes the file automatically when done
with open(nama_file, 'w', encoding='utf-8') as f:
    # Loop over every decoded line in the list `hasil`
    for baris in hasil:
        # Write each line to the file, followed by '\n' to start a new line
        f.write(baris + '\n')

Loading KBBI words...
Loaded 112651 kata dari KBBI

Loading sentences...
Loaded 50 kalimat

Building optimized lookup tables...
Lookup tables built: 245605 anagram signatures
Hasil disimpan ke: queries_decoded.txt


## **5. Conclusion and Next Steps**

This notebook has completed the **first and most crucial stage**: solving the "Martian language" puzzle and converting the raw data into processable Indonesian text, although it may still contain a small amount of leftover decryption *noise*.

The main output files are:
* `output_hasil_query.txt`: The 50 queries in Indonesian.
* `output_hasil_unk.txt`: The 500 parallel-corpus sentences in Indonesian.
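
Because the decoder returns a token unchanged when no dictionary word matches, the amount of leftover noise can be estimated by counting the tokens that still contain the `zk0` and `xv` markers. The sketch below is a sanity check added for illustration rather than part of the original pipeline; `residual_noise_rate` and `LEFTOVER` are hypothetical names.

```python
import re

# Any whitespace-delimited token that still carries both noise markers
LEFTOVER = re.compile(r"\S*zk0\S*xv\S*")


def residual_noise_rate(lines):
    """Return (undecoded_tokens, total_tokens, fraction) for decoded lines."""
    total = 0
    undecoded = 0
    for line in lines:
        for token in line.split():
            total += 1
            if LEFTOVER.fullmatch(token):
                undecoded += 1
    fraction = undecoded / total if total else 0.0
    return undecoded, total, fraction


sample = ["pilot itu tzk0abcdxv4a", "buku baru"]
print(residual_noise_rate(sample))  # (1, 5, 0.2)
```

To run it on a real output, pass an open file object, for example `residual_noise_rate(open("output_hasil_query.txt", encoding="utf-8"))`.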

### **Next Steps ➡️**

With the data ready, we can move on to **building the ranker model**. The output files from this notebook serve as **input** to two different solution approaches:
1.  **Solution Approach A (`labse`, covered in this notebook):** Uses *fine-tuning*. It achieved a score of 1 on the leaderboard.
2.  **Solution Approach B (`qwen`):** Uses a *zero-shot* approach. It also achieved a score of 1 on the leaderboard (not shown here because of GPU limitations, but the source code is available here: [link](https://colab.research.google.com/drive/11Pb1EdLoBdb5_2R6hNVDl4yyS8GhZfJ1)).

You can choose either solution to run next.

---
# **Section 2**
# **Solution A: Translating the Martian Language to English with LaBSE**

## **Problem Overview**
This task is a *Cross-Lingual Information Retrieval* (CLIR) challenge. We are given 50 documents in the "Martian language" (an artificial language) and 1459 candidate translation documents in English. The goal is to find the single correct translation of each Martian document in the English collection.

---
## **A Distinctive Approach: Anagram Decryption & Noise Handling**

The "Martian language" in this challenge turns out to be a cipher, or anagram, of Indonesian. The most effective strategy is therefore a two-step approach:

1.  **Decryption (offline):** First, the "Martian language" is decrypted. This process is **imperfect**: it yields text that is mostly Indonesian but **still contains leftover 'Martian' codes** (`...zk0...xv...`) that could not be decoded. We treat these leftovers as *noise*.
2.  **Cross-lingual retrieval & noise-robust training (notebook):** Second, we use an AI model for the cross-lingual retrieval task. The main challenge is training the model to be **robust to the noise** present in the decrypted text.

This notebook focuses on the second step: training a model that can find the matching English sentence even when the input is a mix of Indonesian and leftover decryption *noise*.

---
### **Solution Strategy (Approach A: Fine-Tuning a Bi-Encoder)**

This notebook demonstrates **Approach A**: **fine-tuning** a *bi-encoder* model. The aim is to adapt and specialize an already strong model (LaBSE) so that it performs as well as possible on this competition's dataset.

The alternative, **Approach B: Zero-Shot Retrieval**, is in a separate notebook (https://colab.research.google.com/drive/11Pb1EdLoBdb5_2R6hNVDl4yyS8GhZfJ1).

The workflow for **Approach A** is as follows:
1.  **Fine-tuning the bi-encoder:** We take a multilingual Transformer model (LaBSE) and fine-tune it on the **Indonesian-English** parallel corpus. The goal is for the model to map an **Indonesian** sentence and its English translation to very similar vectors.
2.  **Triplet loss & hard negative mining:** To train the model effectively, we use `TripletLoss`, which pulls correct **Indonesian-English** sentence pairs together and pushes incorrect pairs apart. *Hard negative mining* looks for the most "deceptive" English examples to improve training quality.
3.  **Indexing with FAISS:** After fine-tuning, we use the model to encode all 1459 official **English** documents as vectors and store them in a FAISS index for fast search.
4.  **Ranking & submission:** For each **Indonesian** *query* (the decryption output), we compute its embedding and use the FAISS index to rank all English documents by similarity.
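
To make step 2 concrete, the sketch below computes a triplet loss and picks a hard negative on toy 2-D vectors in plain Python. It only illustrates the idea: the Euclidean distance, the margin values, and the helper names (`triplet_loss`, `hardest_negative`) are assumptions, and real training would operate on model embeddings inside the `sentence-transformers` training loop.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hardest_negative(anchor, candidates, positive_index):
    """Index of the wrong candidate closest to the anchor (the most 'deceptive' one)."""
    negatives = [i for i in range(len(candidates)) if i != positive_index]
    return min(negatives, key=lambda i: euclidean(anchor, candidates[i]))


# Toy embeddings: one Indonesian query and three English candidates
query = [0.0, 0.0]
english = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]  # index 1 is the true translation

hard = hardest_negative(query, english, positive_index=1)
print(hard)                                                        # 0
print(triplet_loss(query, english[1], english[hard]))              # 0.0
print(triplet_loss(query, english[1], english[hard], margin=5.0))  # 1.0
```

With a margin of 1 the positive is already far enough ahead of the hard negative, so the loss is zero; raising the margin to 5 makes the same triplet contribute a loss of 1, which is what drives further separation during training.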

---
### **Why LaBSE?**

LaBSE (*Language-Agnostic BERT Sentence Embedding*) is a strong choice for this task because it is designed specifically to capture sentence meaning **regardless of language**.

Here are the main reasons LaBSE fits well:

1.  **Shared Multilingual Embedding Space**
    LaBSE is trained on **109+ languages** jointly. The goal is a universal "map of meaning" in which sentences with the same meaning are placed close together, whatever their language. This is the design described in the original paper by **Feng et al. (2020)**. For example, "Saya suka membaca buku" in Indonesian and "I love to read books" in English should produce vector representations that lie very close to each other. This lets us compare Indonesian queries and English documents directly using simple math (*cosine similarity*).

2.  **Training Targeted at Translation Matching**
    The model is trained with a *dual-encoder* architecture on millions of translated sentence pairs. In other words, LaBSE explicitly "learns" to recognize whether one sentence is a correct translation of another. This makes it well suited to the core task of this competition: finding the right translation pair.

In short, LaBSE lets us turn a complex translation-matching problem into a simple, fast mathematical comparison between vectors.
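
The comparison itself is easy to sketch. The example below ranks toy English vectors against a query by cosine similarity in plain Python; the vectors and the function names are invented for illustration. In the actual pipeline the vectors would come from the model, and an inner-product search over L2-normalized vectors gives the same ordering.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for sentence embeddings
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

print(rank_documents(query_vec, doc_vecs))  # [1, 2, 0]
```

Document 1 ranks first even though it is twice as long as the query vector, because cosine similarity depends only on direction, not magnitude.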

---
### **Setup: Installing Libraries**
The first step is to install all the libraries we need.

* `sentence-transformers`: The main framework for using and training bi-encoder models.
* `faiss-cpu`: A library from Facebook AI for efficient similarity search over high-dimensional vectors.
* `gdown`: A utility for downloading files from Google Drive.
* `transformers`, `datasets`, `torch`: Supporting libraries for models and tensor operations.

In [1]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.3/31.3 MB 58.5 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1


In [2]:
!pip install sentence-transformers transformers datasets torch

Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia

In [3]:
!pip install gdown



---
### **Step 1: Loading and Preparing the Data**
In this part we download all the required files. Following the anagram decryption strategy, we use the processed files, which **still contain noise**.

* **`output_hasil_query.txt`**: Contains the 50 decrypted *queries*. The text is a **mix of Indonesian and leftover 'Martian' code that could not be solved.** This file is our search input.
* **`eng_collection.txt`**: The official collection of 1459 clean English documents, which is the search target.
* **`output_hasil_unk.txt`**: Contains the 500 decrypted corpus sentences, which also **contain noise**.
* **`eng500.txt`**: Contains 500 clean English sentences that form the other half of the parallel corpus.
* **`unk500.txt` and `queries.txt`**: The original "Martian language" files, which are not used directly in this notebook.

In [4]:
import os
import gdown
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
from torch.optim import AdamW
from torch.nn.utils import clip_grad_norm_
from tqdm.auto import tqdm
import numpy as np



In [5]:
urls = {
    "queries.txt": "1csCrWhjczjvYVibFfV8tq2vlmXW4cCct",
    "eng500.txt": "1AZ913Q_YbM7K3azRzFYTtCIxBK6411Lc",
    "eng_collection.txt": "1nb9MooQQZvquRVp3azd3AfPRfItg_jyh",
    "unk500.txt": "1mTVAEgqhys17dD3LeaHVSecAzKHk88l2",
    "output_hasil_unk.txt": "1mi9LemOL82FOgJXRQDSwIvb-JO7gE_yd",
    "output_hasil_query.txt": "1uOsS07qRFuLXIU-0imAKDe2zll1XCsQC"
}

# Download each file only if it is not already present
print("Memeriksa dan mengunduh file...")
for filename, file_id in urls.items():
    if os.path.exists(filename):
        continue  # already downloaded in a previous run
    print(f"Mengunduh {filename}...")
    gdown.download(id=file_id, output=filename, quiet=False)

print("\nSemua file siap digunakan.")


Memeriksa dan mengunduh file...
Mengunduh queries.txt...


Downloading...
From: https://drive.google.com/uc?id=1csCrWhjczjvYVibFfV8tq2vlmXW4cCct
To: /kaggle/working/queries.txt
100%|██████████| 15.3k/15.3k [00:00<00:00, 20.6MB/s]


Mengunduh eng500.txt...


Downloading...
From: https://drive.google.com/uc?id=1AZ913Q_YbM7K3azRzFYTtCIxBK6411Lc
To: /kaggle/working/eng500.txt
100%|██████████| 65.8k/65.8k [00:00<00:00, 51.4MB/s]


Mengunduh eng_collection.txt...


Downloading...
From: https://drive.google.com/uc?id=1nb9MooQQZvquRVp3azd3AfPRfItg_jyh
To: /kaggle/working/eng_collection.txt
100%|██████████| 196k/196k [00:00<00:00, 79.7MB/s]


Mengunduh unk500.txt...


Downloading...
From: https://drive.google.com/uc?id=1mTVAEgqhys17dD3LeaHVSecAzKHk88l2
To: /kaggle/working/unk500.txt
100%|██████████| 151k/151k [00:00<00:00, 68.4MB/s]


Mengunduh output_hasil_unk.txt...


Downloading...
From: https://drive.google.com/uc?id=1mi9LemOL82FOgJXRQDSwIvb-JO7gE_yd
To: /kaggle/working/output_hasil_unk.txt
100%|██████████| 78.5k/78.5k [00:00<00:00, 55.9MB/s]


Mengunduh output_hasil_query.txt...


Downloading...
From: https://drive.google.com/uc?id=1uOsS07qRFuLXIU-0imAKDe2zll1XCsQC
To: /kaggle/working/output_hasil_query.txt
100%|██████████| 8.05k/8.05k [00:00<00:00, 13.9MB/s]


Semua file siap digunakan.





In [6]:
import os
import csv
from sentence_transformers import InputExample
from torch.utils.data import DataLoader

def load_queries(file_path="output_hasil_query.txt"):
    """Loads the decrypted queries (ID and text separated by a double tab)."""
    queries = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t\t')
            if len(parts) == 2:
                queries[parts[0]] = parts[1]
    print(f"Successfully loaded {len(queries)} queries.")
    return queries

def load_english_collection(file_path="eng_collection.txt"):
    """Loads the English document collection."""
    collection = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t\t')
            if len(parts) == 2:
                # We use "D" + ID to match the submission format later
                collection[f"D{parts[0]}"] = parts[1]
    print(f"Successfully loaded {len(collection)} English documents.")
    return collection

def load_parallel_corpus(mars_path="output_hasil_unk.txt", eng_path="eng500.txt"):
    """Loads the parallel corpus for fine-tuning into InputExample format."""
    train_examples = []
    # Note: both files are read as latin-1, unlike the UTF-8 loaders above.
    with open(mars_path, 'r', encoding='latin-1') as f_mars, open(eng_path, 'r', encoding='latin-1') as f_eng:
        mars_lines = f_mars.readlines()
        eng_lines = f_eng.readlines()
    # zip() silently truncates to the shorter file, so flag any mismatch.
    if len(mars_lines) != len(eng_lines):
        print(f"Warning: line count mismatch ({len(mars_lines)} vs {len(eng_lines)}).")
    # Pair line i of the decrypted corpus with line i of the English corpus.
    for mars_line, eng_line in zip(mars_lines, eng_lines):
        train_examples.append(InputExample(texts=[mars_line.strip(), eng_line.strip()]))
    print(f"Successfully loaded {len(train_examples)} parallel sentences for training.")
    return train_examples

# Execute the loading functions
queries = load_queries()
eng_collection = load_english_collection()
train_examples = load_parallel_corpus()

Successfully loaded 50 queries.
Successfully loaded 1459 English documents.
Successfully loaded 500 parallel sentences for training.
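
The query and collection loaders keep only lines that split into exactly two parts on a double tab, and drop everything else without a message. The sketch below shows that parsing rule on an in-memory sample and also reports which lines were dropped. The helper name `parse_id_text_lines`, the IDs `Q1` to `Q3`, and the sample text are all hypothetical and not taken from the competition files.

```python
def parse_id_text_lines(lines, sep="\t\t"):
    """Split lines of the form '<id><sep><text>'.

    Returns a dict of id -> text plus the 1-based numbers of skipped lines.
    """
    records, skipped = {}, []
    for line_number, line in enumerate(lines, start=1):
        parts = line.strip().split(sep)
        if len(parts) == 2:
            records[parts[0]] = parts[1]
        else:
            skipped.append(line_number)
    return records, skipped

# Made-up sample lines, for illustration only.
sample = [
    "Q1\t\tsaya suka membaca buku\n",
    "Q2\tonly a single tab here\n",   # malformed: one tab instead of two
    "\n",                             # blank line
    "Q3\t\tteks dengan zk0 noise\n",
]

records, skipped = parse_id_text_lines(sample)
print(records)
print("Skipped lines:", skipped)
```

Comparing the number of skipped lines against zero is a quick way to confirm that the reported counts (50 queries, 1459 documents) cover every line in the files.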


-----

## **Step 2: Fine-Tuning the Bi-Encoder Model**

This is the core of our solution. We will train the model to capture the relationship between the Martian language and English.

### **2.1. Training Configuration**

We set the main parameters for training: the base model (`sentence-transformers/labse`), the *batch* size, the number of *epochs*, and the *learning rate*. LaBSE (Language-Agnostic BERT Sentence Embedding) is a good choice because it is designed to produce comparable embeddings for sentences with the same meaning across languages, even for languages it has not seen explicitly (zero-shot).

In [7]:
print("\n--- BAGIAN 2: KONFIGURASI TRAINING ---")
MODEL_NAME = "sentence-transformers/labse"
BATCH_SIZE = 8
EPOCHS = 1
LR = 1e-5
SAVE_DIR = "./model_triplet_final"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
TOP_K_MINING = 5

print(f"Model: {MODEL_NAME}")
print(f"Device: {DEVICE}")


--- BAGIAN 2: KONFIGURASI TRAINING ---
Model: sentence-transformers/labse
Device: cuda


### **2.2. Mining Hard Negatives & Training a Noise-Robust Model**

To use `TripletLoss`, we need data in the form `(anchor, positive, negative)`.

* **Anchor:** A sentence from the parallel corpus (`output_hasil_unk.txt`). Note that this *anchor* is **not pure Indonesian**; it is decrypted text that still contains *noise* (`...zk0...xv...`).
* **Positive:** The correct English translation (`eng500.txt`).
* **Negative:** An incorrect English translation, which we look for in the official document collection.
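
To make the `(anchor, positive, negative)` format concrete, here is a minimal sketch of the standard triplet objective, `max(0, d(a, p) - d(a, n) + margin)`, on hand-written 2-dimensional vectors. The Euclidean distance, the margin of `1.0`, the vectors, and the function names are assumptions chosen for illustration; they are not the settings used by the notebook's training run.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy vectors (assumed values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [5.0, 0.0]    # distance 5.0 from the anchor
near_negative = [0.0, 1.5]   # distance 1.5 from the anchor

loss_far = triplet_loss(anchor, positive, far_negative)    # 1.0 - 5.0 + 1.0 < 0, clipped to 0
loss_near = triplet_loss(anchor, positive, near_negative)  # 1.0 - 1.5 + 1.0 = 0.5
print(loss_far, loss_near)
```

A negative that is already far from the anchor contributes no loss and therefore no gradient, while a negative that sits close to the anchor still does.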

Picking *negative* examples at random is often too easy for the model. **Hard Negative Mining** instead picks the most "difficult" *negatives*: English documents that the current model considers most similar to the *anchor* even though they are wrong.
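
The selection rule can be sketched without any model: rank all documents by similarity to the anchor, take the top few, and drop the true translation. The vectors, document IDs, and the helper `mine_hard_negatives` below are hypothetical. This sketch excludes the positive by ID, whereas the notebook cell excludes it by comparing document text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_hard_negatives(anchor_vec, positive_id, doc_vecs, top_k=2):
    """Return the top_k documents most similar to the anchor, minus the positive."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(anchor_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return [doc_id for doc_id in ranked[:top_k] if doc_id != positive_id]

# Toy vectors (assumed values, for illustration only).
anchor_vec = [1.0, 0.0]
doc_vecs = {
    "D1": [0.9, 0.1],   # the true translation (positive)
    "D2": [0.8, 0.3],   # similar but wrong: a hard negative
    "D3": [0.0, 1.0],   # unrelated: an easy negative
    "D4": [-1.0, 0.0],  # opposite direction: a very easy negative
}

hard_top2 = mine_hard_negatives(anchor_vec, "D1", doc_vecs, top_k=2)
hard_top3 = mine_hard_negatives(anchor_vec, "D1", doc_vecs, top_k=3)
print(hard_top2, hard_top3)
```

Because the positive is filtered out after the top-k cut, an anchor whose true translation ranks inside the top k yields one fewer negative than `top_k`.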

By training on *anchors* that contain *noise*, we push the model to **learn to ignore the leftover 'Martian' code** and focus on the relevant Indonesian parts. This is the key to making the model *robust* to the imperfections in our data.

In [8]:
print("\n--- BAGIAN 3: MINING HARD NEGATIVES ---")
print("Memuat model dasar untuk mining...")
untuned_model = SentenceTransformer(MODEL_NAME, device=DEVICE)

print("Mengubah seluruh koleksi dokumen Inggris menjadi vektor...")
corpus_docs = list(eng_collection.values())
corpus_embeddings = untuned_model.encode(
    corpus_docs, convert_to_tensor=True, show_progress_bar=True, batch_size=BATCH_SIZE
)

triplet_train_examples = []
print(f"Memulai mining untuk {len(train_examples)} pasangan data...")

# ADJUSTMENT: loop directly over the 'train_examples' built earlier
for example in tqdm(train_examples, desc="Mining Hard Negatives"):
    query, positive = example.texts[0], example.texts[1] # query & positive from the InputExample

    # Embed the noisy anchor and retrieve its TOP_K_MINING nearest English documents
    query_embedding = untuned_model.encode(query, convert_to_tensor=True)
    search_results = util.semantic_search(
        query_embedding, corpus_embeddings, top_k=TOP_K_MINING
    )[0]

    # Every retrieved document that is not the true translation becomes a hard negative
    # (exact string comparison, so the positive must match the collection text exactly)
    for result in search_results:
        doc_index = result['corpus_id']
        found_doc = corpus_docs[doc_index]
        if found_doc != positive:
            negative = found_doc
            triplet_train_examples.append(InputExample(texts=[query, positive, negative]))

print(f"Mining selesai. Ditemukan {len(triplet_train_examples)} contoh triplet.")


--- BAGIAN 3: MINING HARD NEGATIVES ---
Memuat model dasar untuk mining...


modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/804 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/397 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/114 [00:00<?, ?B/s]

2_Dense/model.safetensors:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

2_Dense/pytorch_model.bin:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

Mengubah seluruh koleksi dokumen Inggris menjadi vektor...


Batches:   0%|          | 0/183 [00:00<?, ?it/s]

Memulai mining untuk 500 pasangan data...


Mining Hard Negatives:   0%|          | 0/500 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Mining selesai. Ditemukan 2155 contoh triplet.


### **2.3. Final Model Setup and Training Loop**
With the triplet dataset in place, we now set up the model architecture, the `DataLoader`, and the loss function to start fine-tuning.

* **Model**: A `Transformer` module (LaBSE) followed by a `Pooling` layer that produces a single vector (embedding) per sentence.
* **DataLoader**: Groups the triplets into batches for training.
* **Loss Function**: `TripletLoss` pulls the *anchor* towards the *positive* and pushes it away from the *negative* until the negative is at least a margin farther from the anchor than the positive.
* **Validation**: At the end of every epoch we measure accuracy on the validation set. Accuracy here is the fraction of triplets for which the cosine similarity of `(anchor, positive)` is higher than that of `(anchor, negative)`. The model with the best validation accuracy is saved.
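
The loss and the validation metric described above can be illustrated on toy vectors. This is a minimal sketch, not the library implementation: the helper names, the Euclidean distance in the loss, and the margin of 1.0 are assumptions chosen for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly by cosine similarity."""
    if not triplets:
        return 0.0
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy 2-D embeddings (hypothetical values)
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # far from the anchor: contributes no loss
hard_negative = [0.8, 0.2]   # close to the anchor: still contributes loss

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
accuracy = triplet_accuracy([
    (anchor, positive, easy_negative),   # ranked correctly
    (anchor, easy_negative, positive),   # deliberately swapped, ranked incorrectly
])
print(easy_loss, hard_loss, accuracy)
```

The easy negative already satisfies the margin, so only hard negatives produce a gradient, which is why the triplet mining step matters.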

In [9]:
train_set, val_set = train_test_split(triplet_train_examples, test_size=0.1, random_state=42)
print(f"Data Triplet: Train {len(train_set)} | Val {len(val_set)}")

word_embedding_model = models.Transformer(MODEL_NAME, max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device=DEVICE)

train_loader = DataLoader(
    train_set,
    shuffle=True,
    batch_size=BATCH_SIZE,
    collate_fn=model.smart_batching_collate 
)

loss_fn = losses.TripletLoss(model=model)
optimizer = AdamW(model.parameters(), lr=LR)

Data Triplet: Train 1939 | Val 216


In [10]:
def validate_triplet_accuracy(model, val_set, device):
    """Fraction of triplets whose positive is more similar to the anchor than the negative."""
    if not val_set:
        return 0.0
    model.eval()
    correct_predictions = 0
    with torch.no_grad():
        for example in val_set:
            query, pos, neg = example.texts
            embeddings = model.encode([query, pos, neg], convert_to_tensor=True, device=device, show_progress_bar=False)
            query_emb, pos_emb, neg_emb = embeddings[0], embeddings[1], embeddings[2]
            # util.cos_sim returns a similarity (higher = closer), not a distance
            pos_sim = util.cos_sim(query_emb, pos_emb).item()
            neg_sim = util.cos_sim(query_emb, neg_emb).item()
            if pos_sim > neg_sim:
                correct_predictions += 1
    model.train()
    return correct_predictions / len(val_set)

In [11]:
print("\n--- BAGIAN 5: MEMULAI TRAINING LOOP ---")
best_val_score = -1.0

for epoch in range(1, EPOCHS + 1):
    model.train()
    pbar = tqdm(train_loader, desc=f"Epoch {epoch}/{EPOCHS}")
    
    for sentence_features, labels in pbar:
        # Move every tokenized input (anchor, positive, negative) to the training device
        for features in sentence_features:
            for key in features:
                features[key] = features[key].to(DEVICE)

        labels = labels.to(DEVICE)
        loss = loss_fn(sentence_features, labels)

        loss.backward()
        clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
        pbar.set_postfix(loss=f"{loss.item():.4f}")

    # --- VALIDATION ---
    val_score = validate_triplet_accuracy(model, val_set, DEVICE)
    print(f"Epoch {epoch} | 🧪 Validation Accuracy: {val_score:.4f}")

    if val_score > best_val_score:
        best_val_score = val_score
        os.makedirs(SAVE_DIR, exist_ok=True)
        model.save(SAVE_DIR)
        print(f"📦 Model disimpan (best Val Accuracy: {best_val_score:.4f})")

print(f"\n✅ Selesai. Model terbaik ada di: {SAVE_DIR}")


--- BAGIAN 5: MEMULAI TRAINING LOOP ---


Epoch 1/1:   0%|          | 0/243 [00:00<?, ?it/s]

Epoch 1 | 🧪 Validation Accuracy: 1.0000
📦 Model disimpan (best Val Accuracy: 1.0000)

✅ Selesai. Model terbaik ada di: ./model_triplet_final


---
## **3. Indexing the English Document Collection**
With the model trained, we need a way to search the 1459 English documents. Comparing each query against the documents one at a time in a Python loop would be slow, so we use FAISS. Note that `IndexFlatIP` still performs an exact, exhaustive search; it simply runs it as one optimized batch operation.

The steps are:

1.  Use the fine-tuned model to turn every document in `eng_collection` into a vector (embedding).
2.  L2-normalize all vectors. On unit-length vectors, the inner product computed by `IndexFlatIP` equals the cosine similarity.
3.  Build a FAISS `IndexFlatIP` (*Index Flat Inner Product*) and add all document vectors to it.
4.  Save the index to disk so that later runs can load it instead of rebuilding it.
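
Step 2 relies on the fact that the inner product of unit-length vectors is the cosine similarity of the original vectors. The following sketch checks this on toy values in plain Python; the helper names and the vectors are hypothetical and do not use FAISS.

```python
import math

def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Cosine similarity computed directly from the definition."""
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )

toy_doc = [3.0, 4.0]     # length 5, deliberately not normalized
toy_query = [1.0, 0.0]

ip_raw = inner_product(toy_query, toy_doc)                                # depends on vector length
ip_unit = inner_product(l2_normalize(toy_query), l2_normalize(toy_doc))   # length-independent
cos = cosine_similarity(toy_query, toy_doc)
print(ip_raw, ip_unit, cos)
```

Without normalization, long vectors would score higher regardless of direction, so an inner-product index would not rank by cosine similarity.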

In [12]:
model.eval()

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [13]:
import faiss
FAISS_INDEX_PATH = 'faiss_index.bin'
# Get document texts and IDs in a consistent order
doc_ids = list(eng_collection.keys())
doc_texts = list(eng_collection.values())

# Stop before encoding if the collection is empty
if len(doc_texts) == 0:
    raise ValueError("eng_collection is empty – nothing to index.")

# The os.path.exists(FAISS_INDEX_PATH) check is disabled, so the index is rebuilt on every run
print("FAISS index not found. Creating embeddings and building the index...")

# Encode all documents into vectors
corpus_embeddings = model.encode(doc_texts,
                                 convert_to_tensor=True,
                                 show_progress_bar=True)

# Move embeddings to CPU and convert once to a float32 numpy array, as FAISS requires
corpus_embeddings_np = corpus_embeddings.cpu().numpy().astype('float32')

# Make sure the array is 2-D, even for a single vector
if corpus_embeddings_np.ndim == 1:
    corpus_embeddings_np = corpus_embeddings_np.reshape(1, -1)

# Debugging: check the array shape
print(f"Shape of embeddings array: {corpus_embeddings_np.shape}")

# Normalize vectors to unit length for cosine similarity calculation using dot product
faiss.normalize_L2(corpus_embeddings_np)

# Build the FAISS index
embedding_dim = corpus_embeddings_np.shape[1]
# IndexFlatIP does an exact inner-product search (cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(embedding_dim)
index.add(corpus_embeddings_np)

# Save the index to disk
faiss.write_index(index, FAISS_INDEX_PATH)
print(f"FAISS index built and saved to '{FAISS_INDEX_PATH}'")

doc_index = faiss.read_index(FAISS_INDEX_PATH)

FAISS index not found. Creating embeddings and building the index...


Batches:   0%|          | 0/46 [00:00<?, ?it/s]

Shape of embeddings array: (1459, 768)
FAISS index built and saved to 'faiss_index.bin'


---
## **4. Validation and Ranking**
With the model and index ready, we can rank the documents for every *query* and run a final validation to estimate our DCG score.

### **4.1 Ranking for Submission**

Here we iterate over all 50 **decrypted queries, which still contain noise**. For each *query*:

1.  Create its *embedding* with the *fine-tuned* model.
2.  L2-normalize the *query* vector.
3.  Use FAISS `doc_index.search()` to retrieve the English documents ordered from most to least similar.
4.  Store the ranking for each *query* in the `all_query_rankings` dictionary.
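
What the search in step 3 returns can be reproduced by brute force: score every document by its inner product with the query and sort in descending order. This is a minimal sketch in plain Python on hypothetical unit-length vectors; the function and the `toy_` names are illustrative and are not part of the notebook.

```python
def rank_documents(query_vec, doc_vecs, doc_ids):
    """Return doc_ids ordered from highest to lowest inner product with the query."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    # Sort row positions by score; positions refer to the order the vectors were stored in
    order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    return [doc_ids[i] for i in order]

# Toy index: three unit-length document vectors (hypothetical values)
toy_doc_ids = ["D0", "D1", "D2"]
toy_doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query_vec = [0.8, 0.6]   # already unit length

toy_ranking = rank_documents(toy_query_vec, toy_doc_vecs, toy_doc_ids)
print(toy_ranking)
```

The positions returned by a search refer to the order in which vectors were added to the index, so they must be mapped back through the same ID list that was used when building it.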

In [14]:
# Official document IDs sorted numerically: D0..D1458
official_doc_ids = sorted([d for d in doc_ids if d.startswith("D")],
                          key=lambda x: int(x[1:]))
assert len(official_doc_ids) == 1459     # or 1499, depending on the final corpus
official_doc_id_set = set(official_doc_ids)  # O(1) membership checks

# ---------------- Build a ranking per query ---------------------
all_query_rankings = {}

for q_id, q_text in queries.items():
    q_vec = model.encode(q_text, convert_to_tensor=True
           ).cpu().numpy().reshape(1, -1).astype("float32")
    faiss.normalize_L2(q_vec)

    # FAISS returns row positions in the order the vectors were added, i.e. positions in doc_ids
    _, idx = doc_index.search(q_vec, len(doc_ids))
    ranked_ids = [doc_ids[i] for i in idx[0]
                  if doc_ids[i] in official_doc_id_set]    # drop non-official AUTO_ IDs

    # ranked_ids now holds exactly the 1459 official IDs
    all_query_rankings[q_id] = ranked_ids

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

### **4.2 Validating Model Performance on Noisy Data**

This validation measures how well the model works on realistic (noisy) data rather than on ideal data, and it mirrors the actual *retrieval* process.

1.  Take the *query* (Indonesian + *noise*) and its correct English translation (*ground truth*) from `val_set`.
2.  Look up the document ID (e.g. D123) that matches the *ground truth* text.
3.  Search the whole English collection with FAISS to obtain the full ranking for the *query*.
4.  Find the position of the *ground truth* document ID in that ranking. That position is its *rank*.
5.  Compute the DCG score from that *rank* as `1 / log2(rank + 1)`.

The resulting mean DCG estimates how well the model **handles imperfect input** when searching for the correct translation.
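
As a worked example of the last two steps, the sketch below converts a few hypothetical ranks into DCG scores with the same `1 / log2(rank + 1)` formula and averages them. The ranks are made-up values chosen so the logarithms are exact.

```python
import math

def dcg_for_rank(rank):
    """DCG of a query with a single relevant document found at 1-based position `rank`."""
    return 1.0 / math.log2(rank + 1)

# Hypothetical ranks of the ground-truth document for three queries
toy_ranks = [1, 3, 7]
toy_scores = [dcg_for_rank(r) for r in toy_ranks]   # log2(2)=1, log2(4)=2, log2(8)=3
toy_mean_dcg = sum(toy_scores) / len(toy_scores)
print(toy_scores, toy_mean_dcg)
```

A mean of exactly 1.0 is only possible when every ground-truth document is ranked first, since any lower position contributes less than 1.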

In [15]:
import math
import numpy as np
from tqdm.autonotebook import tqdm

# --- Setup ---
# Build a reverse mapping from English text to its document ID for fast lookup.
# { "document 1 text...": "D0", "document 2 text...": "D1", ... }
eng_text_to_id = {text: doc_id for doc_id, text in eng_collection.items()}

dcg_scores = []

print("===== Memulai Validasi (Menggunakan FAISS) =====")
print(f"Menguji {len(val_set)} sampel langsung pada indeks FAISS.")

# Loop over the validation set
for example in tqdm(val_set, desc="Mengevaluasi dengan FAISS…"):
    mars_query = example.texts[0]
    ground_truth_english = example.texts[1]

    # --- 1) Find the document ID of the ground truth ---
    # Look up the 'Dxxx' ID that matches the English text in the validation set
    ground_truth_doc_id = eng_text_to_id.get(ground_truth_english)

    # Skip samples whose ground-truth text is not in the main collection
    if not ground_truth_doc_id:
        continue

    # --- 2) Search with FAISS ---
    # Encode the query, convert it to numpy, and normalize it
    q_vec = model.encode(mars_query, convert_to_tensor=True).cpu().numpy().reshape(1, -1).astype("float32")
    faiss.normalize_L2(q_vec)

    # Request K = number of indexed documents to get the full ranking
    _, ranked_indices = doc_index.search(q_vec, len(doc_ids))

    # FAISS row positions follow the order in which vectors were added (doc_ids),
    # so map them through doc_ids rather than the sorted official_doc_ids
    ranked_ids = [doc_ids[i] for i in ranked_indices[0]]

    # --- 3) Determine the rank ---
    try:
        # Position of the ground-truth ID in the ranking
        # Add 1 because list indices start at 0 while ranks start at 1
        rank = ranked_ids.index(ground_truth_doc_id) + 1
    except ValueError:
        # The ground-truth ID is missing from the ranking
        # (should never happen when K = total number of documents)
        rank = len(doc_ids)  # penalize with the worst rank

    # --- 4) Compute DCG ---
    # Uses the common log base 2 formula
    dcg = 1 / math.log2(rank + 1)
    dcg_scores.append(dcg)


===== Memulai Validasi (Menggunakan FAISS) =====
Menguji 216 sampel langsung pada indeks FAISS.


Mengevaluasi dengan FAISS…:   0%|          | 0/216 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [16]:
if dcg_scores:
    mean_dcg = np.mean(dcg_scores)
    print("\n===== Hasil Validasi (FAISS) =====")
    print(f"Mean DCG berdasarkan peringkat dari FAISS: {mean_dcg:.4f}")
else:
    print("\nValidasi tidak dapat dijalankan. Pastikan `val_set` tidak kosong.")


===== Hasil Validasi (FAISS) =====
Mean DCG berdasarkan peringkat dari FAISS: 1.0000


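The model reaches a mean DCG of **1.0** on the validation set. Because each validation query has exactly one relevant document, the ideal DCG is also 1, so this equals an **nDCG (Normalized Discounted Cumulative Gain) of 1.0**: every ground-truth document is ranked first, after only 1 epoch of fine-tuning.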

---
## **5. Creating the Submission File**
The last step is to write the rankings stored in `all_query_rankings` to a CSV file in the format required by the competition.

Each row has the form `QID-DID,rank`.
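
To make the row format concrete, the sketch below writes a toy ranking to an in-memory buffer instead of a file. The `que_doc` and `rank` column names follow the notebook; the helper function, the `Q0` query ID, and the three document IDs are hypothetical.

```python
import csv
import io

def write_submission(rankings, doc_ids_in_order, out):
    """Write one `QID-DID,rank` row per (query, document) pair, documents in fixed ID order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["que_doc", "rank"])
    for q_id in sorted(rankings, key=lambda x: int(x[1:])):
        # Map each document ID to its 1-based position in this query's ranking
        rank_map = {doc_id: r for r, doc_id in enumerate(rankings[q_id], 1)}
        for doc_id in doc_ids_in_order:
            writer.writerow([f"{q_id}-{doc_id}", rank_map[doc_id]])

toy_rankings = {"Q0": ["D2", "D0", "D1"]}   # D2 is the best match for Q0
toy_buffer = io.StringIO()
write_submission(toy_rankings, ["D0", "D1", "D2"], toy_buffer)
submission_text = toy_buffer.getvalue()
print(submission_text)
```

Rows are emitted in document-ID order rather than rank order, so the rank column, not the row position, carries the ranking.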

In [17]:
import csv

SUBMISSION_FILE = "recreate-2-bi-encoder-only-22-1212.csv"

with open(SUBMISSION_FILE, "w", newline="", encoding="utf-8") as f:
    wr = csv.writer(f)
    wr.writerow(["que_doc", "rank"])

    # Process queries in numeric order of their ID
    for q_id in sorted(all_query_rankings.keys(), key=lambda x: int(x[1:])):
        # Map each document ID to its 1-based rank for this query
        rank_map = {doc_id: r for r, doc_id in enumerate(all_query_rankings[q_id], 1)}
        for doc_id in official_doc_ids:                     # D0..D1458 in order
            wr.writerow([f"{q_id}-{doc_id}", rank_map[doc_id]])

print(f"✅  Submission '{SUBMISSION_FILE}' dibuat tanpa ID 'AUTO_…'.")

✅  Submission 'recreate-2-bi-encoder-only-22-1212.csv' dibuat tanpa ID 'AUTO_…'.


---

## **6. Results Analysis and Final Conclusions**

Having run the whole workflow, from data preparation to building the submission file, we can analyze the results and draw conclusions about the approach.

### **6.1. Performance Summary**

The *fine-tuned* model scores consistently across every evaluation stage:

* **Local Validation Score:** Validating `val_set` against the FAISS index yields a **Mean DCG of 1.0000**.
* **Leaderboard Score:** The final score on the competition's *public leaderboard* is also **1.0**, a perfect score.

The agreement between the local validation score and the *leaderboard* score suggests that our validation set and evaluation methodology reflect the actual competition task well.

### **6.2. Analysis of Success Factors**

The perfect score can be attributed to a combination of key factors in our strategy:

1.  **The Right Two-Step Strategy:** Decrypting the "Martian language" into readable Indonesian first, even though some noise remains, was the most important step. It turns an ambiguous puzzle into a more standard *Cross-Lingual Information Retrieval* (CLIR) problem, where established techniques can be applied effectively.

2.  **A Strong Base Model (`LaBSE`):** Starting from LaBSE provides a solid foundation. Because it is pretrained on more than 100 languages, LaBSE already captures cross-lingual semantics out of the box, so *fine-tuning* only has to adapt it to the specifics of the Indonesian-English pair.

3.  **The Power of *Fine-Tuning* with *Hard Negative Mining*:** This is the main differentiator that moves the model from "good" to "perfect". By using `TripletLoss` and actively mining the hardest *negative* examples (English sentences that are similar but wrong), we force the model to learn very fine semantic distinctions. The model learns not only to match general meaning but also to separate the correct answer from the most convincing distractors.
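
The idea of a hard negative in the last point can be sketched in a few lines: for each anchor, pick the wrong candidate that scores highest. This is an illustration only, assuming inner-product scoring on toy vectors; the function name and data are hypothetical, and it is not necessarily how the mining step in this notebook selects its triplets.

```python
def mine_hard_negatives(anchor_vecs, candidate_vecs, positive_index):
    """For each anchor, return the index of the most similar candidate that is not its positive."""
    hard = []
    for a_idx, anchor in enumerate(anchor_vecs):
        best_idx, best_score = None, float("-inf")
        for c_idx, cand in enumerate(candidate_vecs):
            if c_idx == positive_index[a_idx]:
                continue  # the true translation can never be a negative
            score = sum(x * y for x, y in zip(anchor, cand))
            if score > best_score:
                best_idx, best_score = c_idx, score
        hard.append(best_idx)
    return hard

# Toy embeddings (hypothetical): two anchors, four candidate documents
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_candidates = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [-1.0, 0.0]]
toy_positive_index = [0, 2]   # candidate 0 translates anchor 0, candidate 2 translates anchor 1

hard_negatives = mine_hard_negatives(toy_anchors, toy_candidates, toy_positive_index)
print(hard_negatives)
```

In this toy setup candidate 1 is the hardest negative for both anchors, because it is the closest wrong answer in each case.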

### **6.3. Final Conclusion**

*Fine-tuning* the **LaBSE Bi-Encoder** with **Triplet Loss and Hard Negative Mining** proved highly effective for this challenge. The combination of careful data preparation, a suitable base model, a targeted training technique, and validation that mirrors the retrieval task produced a system that reaches a perfect score.

This result reflects a solid end-to-end *machine learning* workflow in which each component of the strategy supports the others.

### **References:**

* Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). *Language-agnostic BERT sentence embedding*. arXiv preprint arXiv:2007.01852. [https://arxiv.org/abs/2007.01852](https://arxiv.org/abs/2007.01852)