Nama : Suwandi Ramadhan


# Project 3 - Voice Recognition

## Data & Algoritma Understanding

### Data Understanding

| Nama Dataaset | MINDS-14 (Multilingual Intent Navigation and Discovery in 14 languages) |
|---------------|-------------------------------------------------------------------------|
| Pembuat | PolyAI |
| Deskripsi Singkat | Ini adalah dataset audio yang sangat populer untuk melatih dan mengevaluasi model Spoken Language Understanding (SLU), terutama untuk tugas klasifikasi niat (intent classification). Dataset ini berisi rekaman suara orang-orang yang memberikan perintah atau pertanyaan terkait domain perbankan online. |
| Isi Dataset | - File audio dalam format .wav <br> - Transkripsi teks dari setiap file audio. <br> - Label niat (intent) untuk setiap rekaman. Contoh niatnya seperti pay_bill (bayar tagihan), transfer (transfer uang), balance (cek saldo), dll. |
| Fitur Utama | Multilingual |

### Daftar Bahasa dan Kode Konfigurasi

Kode bahasa (`name`) :

- `cs-CZ` (Czech)
- `de-DE` (German)
- `en-AU` (English, Australia)
- `en-GB` (English, UK)
- `en-US` (English, US)
- `es-ES` (Spanish)
- `fr-FR` (French)
- `it-IT` (Italian)
- `ko-KR` (Korean)
- `nl-NL` (Dutch)
- `pl-PL` (Polish)
- `pt-PT` (Portuguese)
- `ru-RU` (Russian)
- `zh-CN` (Chinese, Mandarin)

### Data Fields

| Nama | Tipe | Deskripsi |
|------|------|-----------|
| `path` | string | Path to the audio file |
| `audio` | dict | Audio object including loaded audio array, sampling rate and path ot audio |
| `transcription` | string | Transcription of the audio file |
| `english_transcription` | string | English transcription of the audio file |
| `intent_class` | integer | Class id of intent |
| `lang_id` | integer | Id of language | 

### Citation Information

`author`     : Daniela Gerz and Pei{-}Hao Su and Razvan Kusztos and Avishek Mondal and Michal Lis and Eshan Singhal and Nikola Mrksic and Tsung{-}Hsien Wen and Ivan Vulic<br> 
`title`      : Multilingual and Cross-Lingual Intent Detection from Spoken Data<br>
`journal`    : CoRR<br>
`volume`     : abs/2104.08524<br>
`year`       : 2021<br>
`url`        : https://arxiv.org/abs/2104.08524<br>
`eprinttype` : arXiv<br>
`eprint`     : 2104.08524<br>
`timestamp`  : Mon, 26 Apr 2021 17:25:10 +0200<br>
`biburl`     : https://dblp.org/rec/journals/corr/abs-2104-08524.bib<br>
`bibsource`  : dblp computer science bibliography, https://dblp.org

## Algoritma Understanding 

# Model Training & Evaluation

#### Import Library

In [1]:
import pandas as pd
import numpy as np
import random
import torch
import transformers

# Library dari Hugging Face
from datasets import load_dataset, Audio
import evaluate
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

print("Versi Transformers:", transformers.__version__)
print("Versi PyTorch:", torch.__version__)
print("Semua library berhasil di-import.")

  from .autonotebook import tqdm as notebook_tqdm

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\0. Bootcamp AI\Repository\Bootcamp-AI\Project 3\Project 3\.venv\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "c:\0. Bootcamp AI\Repository\Bootcamp-AI\Project 3\Project 3\.venv\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "c:\0. Bootcamp AI\Reposi

Versi Transformers: 4.41.2
Versi PyTorch: 2.3.1+cpu
Semua library berhasil di-import.


### Load Dataset

In [2]:
print("\n--- Memuat Dataset MINDS-14 ---")

# Muat dataset untuk bahasa Inggris (AS)
dataset = load_dataset("PolyAI/minds14", name="en-US", trust_remote_code=True)

# Pastikan semua audio memiliki sampling rate 16kHz, standar untuk model Wav2Vec2
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

print("\nDataset berhasil dimuat:")
print(dataset)

# Membuat pemetaan dari ID ke label (dan sebaliknya) untuk memudahkan interpretasi
labels = dataset["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

print("\nDaftar Niat (Intent):", labels)


--- Memuat Dataset MINDS-14 ---

Dataset berhasil dimuat:
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 563
    })
})

Daftar Niat (Intent): ['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']


In [3]:
print("\n--- Memuat Model Pre-trained Wav2Vec2 ---")
model_checkpoint = "facebook/wav2vec2-base"

# Feature extractor akan memproses sinyal audio mentah menjadi format yang dimengerti model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)

# Muat model dengan kepala klasifikasi di atasnya.
# Kepala klasifikasi ini masih acak dan perlu dilatih (fine-tune).
model = AutoModelForAudioClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
)
print("\nModel dan Feature Extractor berhasil dimuat.")


--- Memuat Model Pre-trained Wav2Vec2 ---


Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model dan Feature Extractor berhasil dimuat.


In [4]:
print("\n--- Melakukan Pre-processing Data ---")

def preprocess_function(examples):
    """
    Fungsi untuk mengubah data audio mentah menjadi 'input_values' yang siap
    digunakan oleh model.
    """
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=16000 * 5,  # Batasi durasi audio menjadi 5 detik
        truncation=True,
    )
    return inputs

# Terapkan fungsi preprocessing ke seluruh dataset
# Kita juga menghapus kolom yang tidak diperlukan untuk training
encoded_dataset = dataset.map(
    preprocess_function,
    remove_columns=["audio", "transcription", "english_transcription", "lang_id"],
    batched=True
)

# Ganti nama kolom 'intent_class' menjadi 'label' karena Trainer API mencarinya
encoded_dataset = encoded_dataset.rename_column("intent_class", "label")
print("\nPre-processing selesai.")


--- Melakukan Pre-processing Data ---

Pre-processing selesai.


In [5]:
print("\n--- Memisahkan data menjadi Train dan Eval ---")
# Ini adalah praktik terbaik untuk mendapatkan evaluasi performa model yang jujur
# Kita gunakan 20% data untuk validasi
splits = encoded_dataset["train"].train_test_split(test_size=0.2, stratify_by_column="label")

train_dataset = splits['train']
eval_dataset = splits['test']

print("\nJumlah data Latihan:", len(train_dataset))
print("Jumlah data Validasi:", len(eval_dataset))


--- Memisahkan data menjadi Train dan Eval ---


ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

In [None]:
print("\n--- Mengkonfigurasi Proses Training ---")
# Definisikan metrik untuk evaluasi
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Menghitung akurasi selama evaluasi."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=eval_pred.label_ids)

# Tentukan argumen-argumen untuk training
training_args = TrainingArguments(
    output_dir="./voice_recognition_model",
    evaluation_strategy="epoch",      # Evaluasi model di setiap akhir epoch
    save_strategy="epoch",            # Simpan model di setiap akhir epoch
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,               # Jumlah epoch (bisa dimulai dengan 3 untuk percobaan cepat)
    weight_decay=0.01,
    load_best_model_at_end=True,      # Di akhir, model terbaik akan dimuat secara otomatis
    metric_for_best_model="accuracy",
    push_to_hub=False,
)


--- Mengkonfigurasi Proses Training ---


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=feature_extractor, # Feature extractor diteruskan sebagai tokenizer
    compute_metrics=compute_metrics,
)

print("\n--- Memulai Proses Training ---")
# Mulai proses fine-tuning!
trainer.train()
print("\n--- Proses Training Selesai ---")

In [None]:
print("\n--- Menyimpan dan Menguji Model Final ---")

# Simpan model terbaik yang sudah dilatih
NAMA_MODEL_FINAL = "model_voice_recognition_final"
trainer.save_model(NAMA_MODEL_FINAL)
print(f"Model final disimpan di folder: {NAMA_MODEL_FINAL}")

# Gunakan 'pipeline' untuk cara termudah melakukan prediksi
pipe = pipeline("audio-classification", model=NAMA_MODEL_FINAL)

# Pilih sampel acak dari data validasi (yang tidak pernah dilihat saat training)
random_sample = random.choice(eval_dataset)
file_audio_untuk_tes = random_sample["path"]
label_asli_id = random_sample["label"]
label_asli_nama = id2label[str(label_asli_id)]

print(f"\nMelakukan prediksi pada file audio: {file_audio_untuk_tes}")

# Lakukan prediksi
hasil_prediksi = pipe(file_audio_untuk_tes)

# Tampilkan hasil
print("\n--- HASIL PREDIKSI ---")
print(f"Label Asli        : {label_asli_nama}")
# Prediksi teratas adalah yang pertama di dalam list hasil
prediksi_teratas = hasil_prediksi[0]
print(f"Prediksi Model    : {prediksi_teratas['label']} (dengan skor: {prediksi_teratas['score']:.4f})")

print("\nSemua hasil prediksi:")
print(hasil_prediksi)