# Multi-Lingual Audio Classification based on wav2vec2

#### Dataset language: Gujarati
#### Model: Wav2vec
#### Size: Base
#### Framework: HuggingFace
#### Final accuracy: 99.2%


Final Model: https://huggingface.co/manthan40/wav2vec2-base-finetuned-manthan_base

In [None]:
model_checkpoint = "facebook/wav2vec2-base"
batch_size = 32

In [78]:
%%capture
!pip install datasets==1.14
!pip install transformers==4.19.0
!pip install librosa

In [79]:
# from huggingface_hub import notebook_login
# notebook_login()


In [80]:
%%capture
!apt install git-lfs

In [81]:
from datasets import load_dataset, load_metric
metric = load_metric("accuracy")

# Step 1: Multi-lingual Audio (Gujarati digits)

#### Cite: https://link.springer.com/chapter/10.1007/978-981-15-4828-4_18
#### Ref: https://github.com/Nikunj1729/free-spoken-gujarati-digit-dataset


In [88]:
!git clone https://github.com/Nikunj1729/free-spoken-gujarati-digit-dataset
!mv free-spoken-gujarati-digit-dataset load_guj
!git clone https://github.com/manthanthakker/AI.git

Cloning into 'free-spoken-gujarati-digit-dataset'...
remote: Enumerating objects: 3333, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 3333 (delta 10), reused 0 (delta 0), pack-reused 3312
Receiving objects: 100% (3333/3333), 281.99 MiB | 29.99 MiB/s, done.
Resolving deltas: 100% (1316/1316), done.
Checking out files: 100% (1956/1956), done.
mv: cannot move 'free-spoken-gujarati-digit-dataset' to 'load_guj/free-spoken-gujarati-digit-dataset': Directory not empty
fatal: destination path 'AI' already exists and is not an empty directory.


### Custom Dataloader for Gujarati Dataset

In [None]:
!cp /content/AI/examples/load_guj.py load_guj/

In [116]:
import glob
import json

from sklearn.model_selection import train_test_split

# Collect every .wav file in the cloned dataset, including nested folders.
all_wavs = glob.glob("/content/load_guj/**/*.wav", recursive=True)

# Hold out 20% of the files for evaluation.
train_files, test_files = train_test_split(all_wavs, test_size=0.2)

# Write one JSON-encoded file path per line for each split.
with open("/content/load_guj/train.jsonl", "w") as outfile:
    for entry in train_files:
        json.dump(entry, outfile)
        outfile.write("\n")
with open("/content/load_guj/test.jsonl", "w") as outfile:
    for entry in test_files:
        json.dump(entry, outfile)
        outfile.write("\n")

# The dev split is a plain copy of the test split.
!cp /content/load_guj/test.jsonl /content/load_guj/dev.jsonl
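The manifest format used above is one JSON-encoded file path per line. The following is a minimal sketch of writing and reading that format, so the round trip can be checked without the dataset. The helper names `write_jsonl` and `read_jsonl` and the second example path are hypothetical and are not part of the notebook or of `load_guj.py`.

```python
import json
import os
import tempfile


def write_jsonl(paths, filename):
    """Write one JSON-encoded string per line."""
    with open(filename, "w") as handle:
        for path in paths:
            handle.write(json.dumps(path) + "\n")


def read_jsonl(filename):
    """Read the manifest back into a list of paths, skipping blank lines."""
    with open(filename) as handle:
        return [json.loads(line) for line in handle if line.strip()]


# The first path appears in the notebook output; the second is made up.
example_paths = [
    "/content/load_guj/R3 - South Zone/S3/R3S3T7D4.wav",
    "/content/load_guj/hypothetical/R1S1T1D0.wav",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    manifest = os.path.join(tmp_dir, "train.jsonl")
    write_jsonl(example_paths, manifest)
    restored = read_jsonl(manifest)

print(restored == example_paths)
```

Using `json.dumps` rather than writing the raw path keeps paths with spaces or quotes (such as `R3 - South Zone`) unambiguous on a single line.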

#### Load Dataset

In [91]:
from datasets import load_dataset, load_metric
train_data = load_dataset("/content/load_guj/load_guj.py", split="train")
val_data = load_dataset("/content/load_guj/load_guj.py", split="test")
# There is no separate held-out set: the test split doubles as validation.
test_data = val_data

No config specified, defaulting to: new_dataset/train
Reusing dataset new_dataset (/root/.cache/huggingface/datasets/new_dataset/train/1.1.0/78ee31d629cffeb08c29d8df707ee423a81b3d01b5a03e23006849a5418ea5fd)
No config specified, defaulting to: new_dataset/train
Reusing dataset new_dataset (/root/.cache/huggingface/datasets/new_dataset/train/1.1.0/78ee31d629cffeb08c29d8df707ee423a81b3d01b5a03e23006849a5418ea5fd)


In [92]:
from datasets.dataset_dict import DatasetDict
dataset={}
dataset["train"]=train_data
dataset["test"]=test_data
dataset=DatasetDict(dataset)

In [93]:
labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

id2label["9"]

'9'

In [94]:
label2id

{'0': '0',
 '1': '1',
 '2': '2',
 '3': '3',
 '4': '4',
 '5': '5',
 '6': '6',
 '7': '7',
 '8': '8',
 '9': '9',
 '_silence_': '10',
 '_unknown_': '11'}

## Sample Files

In [95]:
import random
from IPython.display import Audio, display

for _ in range(5):
    rand_idx = random.randint(0, len(dataset["train"])-1)
    example = dataset["train"][rand_idx]
    audio = example["audio"]

    print(f'Label: {id2label[str(example["label"])]}')
    print(f'Shape: {audio["array"].shape}, sampling rate: {audio["sampling_rate"]}')
    display(Audio(audio["array"], rate=audio["sampling_rate"]))
    print()

Label: 8
Shape: (11883,), sampling rate: 16000



Label: 4
Shape: (13036,), sampling rate: 16000



Label: 6
Shape: (14159,), sampling rate: 16000



Label: 6
Shape: (10672,), sampling rate: 16000



Label: 9
Shape: (12245,), sampling rate: 16000





In [96]:
dataset["train"][2]

{'audio': {'array': array([-0.00018244, -0.00060319, -0.00058527, ..., -0.00145032,
         -0.00036994,  0.        ], dtype=float32),
  'path': '/content/load_guj/R3 - South Zone/S3/R3S3T7D4.wav',
  'sampling_rate': 16000},
 'file': '/content/load_guj/R3 - South Zone/S3/R3S3T7D4.wav',
 'label': 4}
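In the example above, the file `R3S3T7D4.wav` carries label 4, which suggests that the digit follows the `D` in the file name. The sketch below parses a name under that assumption. The pattern `R<region>S<speaker>T<take>D<digit>` is inferred from this single example and from the folder names, and the field meanings are guesses. How `load_guj.py` actually derives labels is not shown in this notebook.

```python
import os
import re

# Assumed naming scheme, inferred from one sample path only.
_NAME_PATTERN = re.compile(r"R(\d+)S(\d+)T(\d+)D(\d)\.wav$")


def parse_name(path):
    """Return the fields encoded in the file name, or None if it does not match."""
    match = _NAME_PATTERN.search(os.path.basename(path))
    if match is None:
        return None
    region, speaker, take, digit = (int(group) for group in match.groups())
    return {"region": region, "speaker": speaker, "take": take, "digit": digit}


sample = parse_name("/content/load_guj/R3 - South Zone/S3/R3S3T7D4.wav")
print(sample)
```

Files that do not follow the pattern return `None`, so a caller can decide whether to skip them or map them to another class.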

# Step 2: Preprocessing

In [97]:
max_duration = 1.0  # seconds

### Feature Extractor

In [98]:
from transformers import AutoFeatureExtractor
# model_checkpoint = "facebook/wav2vec2-base"
batch_size = 32
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
feature_extractor

loading feature extractor configuration file https://huggingface.co/facebook/wav2vec2-base/resolve/main/preprocessor_config.json from cache at /root/.cache/huggingface/transformers/d4583dd9e59eb6295f8fe8b18833ae54d963a122d69aa1df7ecce6caafe18c8f.bc3155ca0bae3a39fc37fc6d64829c6a765f46480894658bb21c08db6155358d
loading configuration file https://huggingface.co/facebook/wav2vec2-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c7746642f045322fd01afa31271dd490e677ea11999e68660a92619ec7c892b4.ce1f96bfaf3d7475cb8187b9668c7f19437ade45fb9ceb78d2b06a2cec198015
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token

Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
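The configuration above has `"do_normalize": true`, which, as far as I understand, means each waveform is rescaled to zero mean and unit variance before it reaches the model. The following is a pure-Python sketch of that operation. The function name and the `eps` value are assumptions for illustration, and the real extractor works on NumPy arrays.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to (roughly) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    # eps guards against division by zero on silent clips.
    scale = math.sqrt(variance + eps)
    return [(s - mean) / scale for s in samples]


# A tiny made-up waveform; real clips have thousands of samples.
normalized = zero_mean_unit_var([0.0, 0.5, -0.5, 1.0])
new_mean = sum(normalized) / len(normalized)
new_variance = sum((s - new_mean) ** 2 for s in normalized) / len(normalized)
print(new_mean, new_variance)
```

Normalizing per clip makes the input insensitive to recording volume, which differs between speakers.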

In [99]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, 
        sampling_rate=feature_extractor.sampling_rate, 
        max_length=int(feature_extractor.sampling_rate * max_duration), 
        truncation=True, 
    )
    return inputs
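With `max_duration = 1.0` and a 16 kHz sampling rate, `max_length` comes out to 16000 samples, so `truncation=True` only shortens clips longer than one second. The sketch below illustrates the effect on clip lengths using plain lists. The first two lengths come from the sample output earlier in the notebook, and the 20000-sample clip is hypothetical.

```python
SAMPLING_RATE = 16000  # Hz, as reported by the feature extractor
MAX_DURATION = 1.0     # seconds, as set in the notebook

max_length = int(SAMPLING_RATE * MAX_DURATION)


def truncate(samples, limit):
    """Keep at most `limit` samples from the start of the clip."""
    return samples[:limit]


# 11883 and 13036 appear in the notebook output; 20000 is a made-up long clip.
clip_lengths = [11883, 13036, 20000]
kept_lengths = [len(truncate([0.0] * n, max_length)) for n in clip_lengths]

for original, kept in zip(clip_lengths, kept_lengths):
    print(f"{original} samples -> {kept} samples ({kept / SAMPLING_RATE:.3f} s)")
```

Clips shorter than the limit pass through unchanged, so the encoded examples still have different lengths at this stage.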

In [100]:
preprocess_function(dataset['train'][:5])

  tensor = as_tensor(value)


{'input_values': [array([-3.3174595e-04, -1.2214939e-03, -5.0246180e-03, ...,
       -6.6962605e-03, -5.9453174e-02, -3.8864637e-05], dtype=float32), array([-4.6076626e-02, -8.0406994e-02, -7.4147314e-02, ...,
       -4.9441382e-02, -4.7110617e-02, -7.3776995e-05], dtype=float32), array([-0.00071336, -0.00459215, -0.00442695, ..., -0.01240169,
       -0.0024419 ,  0.00096849], dtype=float32), array([-1.0314876e-03,  2.7702912e-05,  7.5889082e-04, ...,
       -2.6096432e-02, -2.1230232e-02, -3.6365869e-05], dtype=float32), array([ 0.00704305,  0.04137399,  0.07761089, ..., -0.11417346,
       -0.07525888,  0.00263164], dtype=float32)]}

In [101]:
encoded_dataset = dataset.map(preprocess_function, remove_columns=["audio", "file"], batched=True)
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_values', 'label'],
        num_rows: 1552
    })
    test: Dataset({
        features: ['input_values', 'label'],
        num_rows: 388
    })
})

# Step 3: Training/Fine-tuning

In [102]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    model_checkpoint, 
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

loading configuration file https://huggingface.co/facebook/wav2vec2-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c7746642f045322fd01afa31271dd490e677ea11999e68660a92619ec7c892b4.ce1f96bfaf3d7475cb8187b9668c7f19437ade45fb9ceb78d2b06a2cec198015
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
 

Downloading:   0%|          | 0.00/363M [00:00<?, ?B/s]

storing https://huggingface.co/facebook/wav2vec2-base/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/ef45231897ce572a660ebc5a63d3702f1a6041c4c5fb78cbec330708531939b3.fcae05302a685f7904c551c8ea571e8bc2a2c4a1777ea81ad66e47f7883a650a
creating metadata file for /root/.cache/huggingface/transformers/ef45231897ce572a660ebc5a63d3702f1a6041c4c5fb78cbec330708531939b3.fcae05302a685f7904c551c8ea571e8bc2a2c4a1777ea81ad66e47f7883a650a
loading weights file https://huggingface.co/facebook/wav2vec2-base/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/ef45231897ce572a660ebc5a63d3702f1a6041c4c5fb78cbec330708531939b3.fcae05302a685f7904c551c8ea571e8bc2a2c4a1777ea81ad66e47f7883a650a
Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForSequenceClassification: ['quantizer.codevectors', 'project_q.bias', 'project_hid.bias', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'pro

In [110]:
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-manthan-gujarati-digits",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
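The arguments above determine the effective batch size and the number of optimizer updates. The sketch below works through that arithmetic for the 1552 training rows reported earlier. It assumes a single device and that leftover batches at the end of an epoch do not count as a separate update. Both are my assumptions, not something stated in the notebook.

```python
import math

num_examples = 1552             # training rows reported by the dataset
per_device_batch_size = 32      # batch_size in the notebook
gradient_accumulation_steps = 4
num_train_epochs = 10

# Gradients from 4 batches are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# The last batch of an epoch may be smaller, hence the ceiling.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Assumption: incomplete accumulation groups are floored away.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions the run performs 12 updates per epoch and 120 in total, with an effective batch size of 128.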


In [111]:
import numpy as np

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)


In [113]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/manthan40/wav2vec2-base-finetuned-manthan-gujarati-digits into local empty directory.


In [114]:
trainer.train()

***** Running training *****
  Num examples = 1552
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 120


Epoch,Training Loss,Validation Loss,Accuracy
0,1.3392,1.131465,0.966495
1,1.2319,0.948726,0.971649
2,1.0824,0.833811,0.981959
3,0.9995,0.753314,0.984536
4,0.8175,0.675887,0.992268
5,0.8015,0.6425,0.984536
6,0.7417,0.604755,0.987113
7,0.7181,0.584964,0.992268
8,0.6907,0.56871,0.989691
9,0.6511,0.561295,0.992268
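The accuracy column is the fraction of the 388 evaluation clips whose highest-scoring class matches the label. The sketch below reproduces that calculation in pure Python on made-up logits, then converts the reported accuracies back into counts of correct clips. The helper names and the toy data are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    hits = sum(1 for row, label in zip(logits, labels) if argmax(row) == label)
    return hits / len(labels)


# Made-up scores for four clips and three classes; the last clip is misclassified.
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.2, 0.4, 0.9]]
toy_labels = [0, 1, 2, 2]
toy_accuracy = accuracy(toy_logits, toy_labels)

# Convert the reported accuracies back into counts of correct clips.
num_eval_examples = 388
correct_best = round(0.992268 * num_eval_examples)
correct_first_epoch = round(0.966495 * num_eval_examples)

print(toy_accuracy, correct_best, correct_first_epoch)
```

With only 388 evaluation clips, one clip moves the accuracy by roughly 0.26 percentage points, so the epoch-to-epoch differences in the table come down to one to three clips.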


***** Running Evaluation *****
  Num examples = 388
  Batch size = 32
Saving model checkpoint to wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-12
Configuration saved in wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-12/config.json
Model weights saved in wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-12/pytorch_model.bin
Feature extractor saved in wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-12/preprocessor_config.json
Feature extractor saved in wav2vec2-base-finetuned-manthan-gujarati-digits/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 388
  Batch size = 32
Saving model checkpoint to wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-24
Configuration saved in wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-24/config.json
Model weights saved in wav2vec2-base-finetuned-manthan-gujarati-digits/checkpoint-24/pytorch_model.bin
Feature extractor saved in wav2vec2-base-finetuned-manthan-gujarati-

TrainOutput(global_step=120, training_loss=0.8881256699562072, metrics={'train_runtime': 326.9084, 'train_samples_per_second': 47.475, 'train_steps_per_second': 0.367, 'total_flos': 1.3861431455726208e+17, 'train_loss': 0.8881256699562072, 'epoch': 9.98})

In [115]:
trainer.push_to_hub()

Saving model checkpoint to wav2vec2-base-finetuned-manthan-gujarati-digits
Configuration saved in wav2vec2-base-finetuned-manthan-gujarati-digits/config.json
Model weights saved in wav2vec2-base-finetuned-manthan-gujarati-digits/pytorch_model.bin
Feature extractor saved in wav2vec2-base-finetuned-manthan-gujarati-digits/preprocessor_config.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/manthan40/wav2vec2-base-finetuned-manthan-gujarati-digits
   0b38f29..beb435e  main -> main

Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'new_dataset', 'type': 'new_dataset', 'args': 'train'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.9922680412371134}]}
remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/manthan40/wav2vec2-base-finetuned-manthan-gujarati-digits
   beb435e..cf60c22  main -> main



'https://huggingface.co/manthan40/wav2vec2-base-finetuned-manthan-gujarati-digits/commit/beb435ef19f0155bdba4a42e903da5765e329086'

In [109]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 388
  Batch size = 32


{'epoch': 9.98,
 'eval_accuracy': 0.9742268041237113,
 'eval_loss': 1.2405576705932617,
 'eval_runtime': 3.1446,
 'eval_samples_per_second': 123.384,
 'eval_steps_per_second': 4.134}