# Speech Recognition Model (ASR) for Cook Islands Māori
Rolando Coto-Solano (rolando.a.coto.solano@dartmouth.edu)<br>
Sally Akevai Nicholas (s.nicholas@massey.ac.nz)<br>
Last modification: 2022-01-11

This code loads a speech recognition model trained using [XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2) and uses it to transcribe recordings in Cook Islands Māori (CIM). The model was trained on 4 hours of transcribed recordings from the [Paradisec CIM collection](https://catalog.paradisec.org.au/collections/SN1). The code is based on [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) by [Patrick von Platen](https://huggingface.co/patrickvonplaten).

The CIM model is licensed under the [Kaitiakitanga license](https://github.com/TeHikuMedia/Kaitiakitanga-License), created by [Te Hiku Media](https://tehiku.nz/). You can use the model for non-profit purposes, but you must contact the authors to reuse it. Unless you are a member of the Cook Islands Māori community, please do not attempt to extract the data from the model.

# 1. Installation

## Installation of XLSR

In [1]:
%%capture
!pip install datasets
!pip install transformers==4.4.0
!pip install librosa
!pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html  > /dev/null

from transformers import Wav2Vec2ForCTC
from transformers import Wav2Vec2Processor

## Download model

In [None]:
!mkdir -p cim-checkpoint-lrec/checkpoint

%cd cim-checkpoint-lrec
!gdown --id 1SzP7ZhzZjkoq_LBdgHTBBtWd0dzMxd3d
!tar -xf cim-asr-meta.tar.gz
!rm cim-asr-meta.tar.gz

%cd checkpoint
!gdown --id 1O9FE1_KdgBgEBfgYWL-L2Bbqq1ZK5trd
!tar -xf cim-asr-meta2.tar.gz
!rm cim-asr-meta2.tar.gz
!gdown --id 1-erDEc9uiSQsUUsq8KT3cHa4zNvGMejQ
!gdown --id 1-xnsbUAQ8W9Gy8wpWpnL2KvfnyaBQvsL
%cd /content/

## Download sample CIM files

In [None]:
!wget -O sample-cim.wav "https://github.com/Akevai/CIM-ASR-Models/blob/main/sample-cim.wav?raw=true"
!wget https://raw.githubusercontent.com/Akevai/CIM-ASR-Models/main/sample-cim.csv
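The decoding cell further down reads `sample-cim.csv` with pandas and looks up two columns, `path` and `sentence`. To transcribe your own recordings you would need a CSV with the same layout. The following is a minimal sketch, assuming those two columns are all that is required; the function name `write_manifest`, the file names and the placeholder transcription are hypothetical and are not part of the original notebook.

```python
import csv

def write_manifest(rows, csv_path):
    """Write (audio_path, transcription) pairs to a CSV with 'path' and 'sentence' columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "sentence"])
        writer.writeheader()
        for audio_path, sentence in rows:
            writer.writerow({"path": audio_path, "sentence": sentence})

if __name__ == "__main__":
    # Hypothetical entries: replace with your own files and transcriptions
    example_rows = [("my-recording-1.wav", "placeholder transcription")]
    write_manifest(example_rows, "my-files.csv")
```

The resulting file could then be passed to `pd.read_csv` in place of `sample-cim.csv`.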

# 2. Audio Decoding

## Downsampling audio file

In [None]:
!ffmpeg -y -i sample-cim.wav -ac 1 -ar 16000 temp-sample-cim.wav
!rm sample-cim.wav
!mv temp-sample-cim.wav sample-cim.wav
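The `ffmpeg` call above converts the recording to one channel (`-ac 1`) at 16000 Hz (`-ar 16000`). A quick way to confirm that a file has this format before decoding it is to read its header. This is a small sketch using Python's standard `wave` module, which only handles uncompressed PCM WAV files; the helper names `wav_format` and `is_model_ready` are hypothetical.

```python
import wave

def wav_format(path):
    """Return (channels, sample_rate_hz) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate()

def is_model_ready(path, expected_rate=16000):
    """Return True if the file is mono and sampled at expected_rate."""
    channels, rate = wav_format(path)
    return channels == 1 and rate == expected_rate

# Example usage after the downsampling cell has run:
# print(is_model_ready("sample-cim.wav"))
```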

## Preparing XLSR-Wav2Vec2 and decoding the audio files

In [None]:
import torch
import torchaudio
import pandas as pd
from datasets import Dataset
from transformers import Wav2Vec2FeatureExtractor

# Load models
pathCheckpoint = "cim-checkpoint-lrec/checkpoint"
model = Wav2Vec2ForCTC.from_pretrained(pathCheckpoint).to("cuda")
processor = Wav2Vec2Processor.from_pretrained("cim-checkpoint-lrec")

# Convert audio file to array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = speech_array[0].numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["sentence"]
    return batch

# Prepare batch processing of files
def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (
        len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0]).input_values
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

# Load CSV file with audio files to be transcribed
dataTest = pd.read_csv("sample-cim.csv") 
common_voice_test = Dataset.from_pandas(dataTest)

# Extract features from audio files
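# Note: this standalone extractor is not used below; the processor loaded above already provides one (processor.feature_extractor)
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)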
common_voice_test = common_voice_test.map(speech_file_to_array_fn, remove_columns=common_voice_test.column_names)
common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names, batch_size=8, num_proc=4, batched=True)

# Run the model on the first audio file (gradients are not needed for inference)
input_dict = processor(common_voice_test[0]["input_values"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(input_dict.input_values.to("cuda")).logits
pred_ids = torch.argmax(logits, dim=-1)[0]

# Decode audio files
predictedText = processor.decode(pred_ids)

# Convert the model's ASCII output into CIM orthography (long vowels and glottal stop)
orthOrigin = ['ax', 'ex', 'ix', 'ox', 'ux', 'q']
orthTarget = ['ā', 'ē', 'ī', 'ō', 'ū', 'ꞌ']
for origin, target in zip(orthOrigin, orthTarget):
    predictedText = predictedText.replace(origin, target)
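The last steps of the cell above turn frame-level predictions into readable text: the most likely token is chosen for each frame, the token sequence is decoded, and the ASCII spellings are mapped to CIM orthography. The sketch below illustrates the idea on toy data without loading the model. It is a simplified illustration of greedy CTC post-processing, not the implementation inside `processor.decode`; the toy vocabulary, the token ids, and the assumption that `<pad>` acts as the blank token and `|` as the word delimiter are all hypothetical.

```python
from itertools import groupby

# ASCII spellings and their CIM orthographic equivalents (taken from the cell above)
ORTH_MAP = {"ax": "ā", "ex": "ē", "ix": "ī", "ox": "ō", "ux": "ū", "q": "ꞌ"}

def greedy_ctc_collapse(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Merge consecutive repeated ids, drop blank tokens, and join the rest into text."""
    collapsed = [key for key, _ in groupby(ids)]
    tokens = [id_to_token[i] for i in collapsed]
    tokens = [t for t in tokens if t != blank]
    return "".join(tokens).replace(word_delimiter, " ").strip()

def to_cim_orthography(text, mapping=ORTH_MAP):
    """Apply the ASCII-to-CIM replacements in the order they are listed."""
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text

# Hypothetical toy vocabulary and per-frame predictions
toy_vocab = {0: "<pad>", 1: "|", 2: "k", 3: "i", 4: "a", 5: "o", 6: "r", 7: "n", 8: "x"}
toy_ids = [2, 2, 3, 0, 4, 1, 1, 5, 0, 6, 4, 4, 8, 7, 4, 0]

ascii_text = greedy_ctc_collapse(toy_ids, toy_vocab)
print(ascii_text)                      # kia oraxna
print(to_cim_orthography(ascii_text))  # kia orāna
```

Keeping the replacements in a single dictionary makes it straightforward to reuse the same mapping elsewhere, for example on reference transcriptions.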

# 3. Check CIM transcription output

In [None]:
print("Prediction:")
print(predictedText)
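The notebook itself only prints the prediction. If a reference transcription is available, such as the `sentence` column of the CSV file, one common way to check the output is the word error rate: the word-level edit distance divided by the number of reference words. This is a minimal sketch and not part of the original notebook; the function names are hypothetical, and it assumes that both strings use the same orthography (apply the same replacements to the reference first if it uses the ASCII spellings).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of words in the reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Example usage with a reference string of your own:
# print(word_error_rate(referenceText, predictedText))
```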

In [None]:
# Play the recording in the notebook (IPython's audio widget works on Google Colab)
import IPython.display
IPython.display.Audio('sample-cim.wav')