# Draft Notebook on the Common Voice Dataset

This notebook takes a close look at the dataset that we will be using in our project.

It walks through the basic steps of a deep learning project and starts a baseline model to understand the problem at hand.

## Acquiring the data via Kaggle API

⚠️ **Please note: the `kaggle.json` API token must be uploaded to the working directory before running the cells below.**

In [1]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [2]:
%%bash
pip install -q kaggle
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
kaggle datasets download -d mozillaorg/common-voice

Downloading common-voice.zip to /content




In [3]:
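%%bash
# unzip the downloaded file to `./data/`
# (the %%bash magic has to be the first line of the cell)
mkdir -p data
mv common-voice.zip ./data/
cd ./data/
# -q keeps unzip quiet; listing every extracted file can exceed the notebook's output rate limit
unzip -q ./common-voice.zip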




## Imports

In [35]:
import pandas as pd
import os
import math
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import librosa
import librosa.display
import torch.nn as nn
from pydub import AudioSegment
from tqdm import tqdm

## Getting to know the data

In [5]:
# Sampling a voice from each split

# train
filename = 'data/cv-other-train/cv-other-train/sample-000000.mp3'
display('Train Sample:')
display(ipd.Audio(filename))
# dev
filename = 'data/cv-other-dev/cv-other-dev/sample-000000.mp3'
display('Dev Sample:')
display(ipd.Audio(filename))
# test
filename = 'data/cv-other-test/cv-other-test/sample-000000.mp3'
display('Test Sample:')
display(ipd.Audio(filename))

'Train Sample:'

'Dev Sample:'

'Test Sample:'

In [6]:
# Load and preview the CSV metadata for each split

# train
train_data = pd.read_csv('/content/data/cv-valid-train.csv')
# keep only the first 5% of the training rows so that this draft runs quickly
train_data = train_data.iloc[:len(train_data) // 20].copy()
display('Train Data:')
display(train_data.head())
# dev
dev_data = pd.read_csv('/content/data/cv-valid-dev.csv')
display('Dev Data:')
display(dev_data.head())
# test
test_data = pd.read_csv('/content/data/cv-valid-test.csv')
display('Test Data:')
display(test_data.head())

'Train Data:'

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration
0,cv-valid-train/sample-000000.mp3,learn to recognize omens and follow them the o...,1,0,,,,
1,cv-valid-train/sample-000001.mp3,everything in the universe evolved he said,1,0,,,,
2,cv-valid-train/sample-000002.mp3,you came so that you could learn about your dr...,1,0,,,,
3,cv-valid-train/sample-000003.mp3,so now i fear nothing because it was those ome...,1,0,,,,
4,cv-valid-train/sample-000004.mp3,if you start your emails with greetings let me...,3,2,,,,


'Dev Data:'

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration
0,cv-valid-dev/sample-000000.mp3,be careful with your prognostications said the...,1,0,,,,
1,cv-valid-dev/sample-000001.mp3,then why should they be surprised when they se...,2,0,,,,
2,cv-valid-dev/sample-000002.mp3,a young arab also loaded down with baggage ent...,2,0,,,,
3,cv-valid-dev/sample-000003.mp3,i thought that everything i owned would be des...,3,0,,,,
4,cv-valid-dev/sample-000004.mp3,he moved about invisible but everyone could he...,1,0,fourties,female,england,


'Test Data:'

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration
0,cv-valid-test/sample-000000.mp3,without the dataset the article is useless,1,0,,,,
1,cv-valid-test/sample-000001.mp3,i've got to go to him,1,0,twenties,male,,
2,cv-valid-test/sample-000002.mp3,and you know it,1,0,,,,
3,cv-valid-test/sample-000003.mp3,down below in the darkness were hundreds of pe...,4,0,twenties,male,us,
4,cv-valid-test/sample-000004.mp3,hold your nose to keep the smell from disablin...,2,0,,,,


Most of the columns are not needed for our task, so we will keep only `filename` and `text`.

In [7]:
# Dropping unwanted columns
train_data.drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"], inplace=True)
dev_data.drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"], inplace=True)
test_data.drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"], inplace=True)

In [8]:
# Creating synthetic data of merged audio files

# get length of total audio files in each split
train_len = len(train_data)
dev_len = len(dev_data)
test_len = len(test_data)

print(f">>>Before Merging...\nTrain: {train_len}\nDev: {dev_len}\nTest: {test_len}")

# We will merge 5 audio files together

# create a dataframe for each split to house the merged audio files
train_merge = pd.DataFrame({'audio': [], 'text': [], 'x1': [], 'x2': [], 'x3': [], 'x4': [], 'x5': []})
dev_merge = pd.DataFrame({'audio': [], 'text': [], 'x1': [], 'x2': [], 'x3': [], 'x4': [], 'x5': []})
test_merge = pd.DataFrame({'audio': [], 'text': [], 'x1': [], 'x2': [], 'x3': [], 'x4': [], 'x5': []})

# function to populate the new dataframes with the merged audio files
def populateMerge(split: str):
  data = {
      'train': train_data,
      'dev': dev_data,
      'test': test_data,
  }
  merge = {
      'train': train_merge,
      'dev': dev_merge,
      'test': test_merge,
  }

  # each row is [filename, text]; load the clip that the row points to
  def load(row):
    return AudioSegment.from_file(f"./data/cv-valid-{split}/{row[0]}")

  bus = []  # buffer of rows waiting to be merged
  for row in data[split].values:
    if len(bus) == 5:
      # overlay() plays the clips simultaneously; pydub keeps the length of the
      # segment it is called on, so longer overlaid clips are truncated
      combined1 = load(bus[0]).overlay(load(bus[1]))
      combined2 = load(bus[2]).overlay(load(bus[3]))
      combined = combined1.overlay(combined2).overlay(load(bus[4]))
      text = [bus[i][1] for i in range(5)]
      merge[split].loc[len(merge[split])] = [
          combined, text, bus[0], bus[1], bus[2], bus[3], bus[4],
      ]
      bus = []
      # NOTE: the current `row` is not carried into the next group, so every
      # sixth row is skipped and each split yields len(split) // 6 merged clips
    else:
      bus.append(row)

# populate the dataframes according to each split
populateMerge('train')
populateMerge('dev')
populateMerge('test')

# get length of total audio files in each merge split
train_len = len(train_merge)
dev_len = len(dev_merge)
test_len = len(test_merge)

print(f">>>After Merging...\nTrain: {train_len}\nDev: {dev_len}\nTest: {test_len}")

>>>Before Merging...
Train: 9788
Dev: 4076
Test: 3995




>>>After Merging...
Train: 1631
Dev: 679
Test: 665
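
The merged counts above equal the original counts floor-divided by six (for example, 9788 // 6 = 1631), because the row that arrives while the buffer already holds five entries is not reused. If the intent is to use every row, one option is to slice the rows into consecutive groups of five instead. The helper below is only a sketch of that idea; `chunk_rows` is a hypothetical name that does not appear elsewhere in this notebook, and it works on any sequence, not only on DataFrame rows.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_rows(rows: Sequence[T], size: int = 5) -> List[List[T]]:
    """Split `rows` into consecutive groups of `size` items.

    A trailing group with fewer than `size` items is dropped, so every
    returned group is complete.
    """
    n_full = len(rows) // size
    return [list(rows[i * size:(i + 1) * size]) for i in range(n_full)]


# toy example: 12 rows give two full groups, the last two rows are left out
groups = chunk_rows(list(range(12)), size=5)
print(groups)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

With this grouping, a split of 9788 rows would give 9788 // 5 = 1957 groups rather than 1631.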


In [19]:
# Examine the result of merge

# merged audio
display('Merged Audio: ')
display(train_merge.loc[0]['audio'])
# audio 1
display('1st Audio: ')
display(AudioSegment.from_file(f"./data/cv-valid-train/{train_merge.loc[0]['x1'][0]}"))
# audio 2
display('2nd Audio: ')
display(AudioSegment.from_file(f"./data/cv-valid-train/{train_merge.loc[0]['x2'][0]}"))
# audio 3
display('3rd Audio: ')
display(AudioSegment.from_file(f"./data/cv-valid-train/{train_merge.loc[0]['x3'][0]}"))
# audio 4
display('4th Audio: ')
display(AudioSegment.from_file(f"./data/cv-valid-train/{train_merge.loc[0]['x4'][0]}"))
# audio 5
display('5th Audio: ')
display(AudioSegment.from_file(f"./data/cv-valid-train/{train_merge.loc[0]['x5'][0]}"))

'Merged Audio: '

'1st Audio: '

'2nd Audio: '

'3rd Audio: '

'4th Audio: '

'5th Audio: '

## Data Preprocessing

We will now apply several preprocessing steps to prepare the data for baseline modelling.


### Convert to Uniform Dimensions

In [None]:
# Standardise the dimensions of data
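
This cell is still empty in the draft. One common way to standardise dimensions is to zero-pad short clips and truncate long ones so that every waveform has the same number of samples. The sketch below assumes the audio is already a 1-D NumPy array; `pad_or_truncate` and `target_len` are hypothetical names chosen for illustration.

```python
import numpy as np


def pad_or_truncate(waveform: np.ndarray, target_len: int) -> np.ndarray:
    """Return a copy of `waveform` with exactly `target_len` samples."""
    if len(waveform) >= target_len:
        # too long: keep only the first `target_len` samples
        return waveform[:target_len].copy()
    # too short: append zeros (silence) at the end
    return np.pad(waveform, (0, target_len - len(waveform)), mode="constant")


short_clip = np.array([0.1, 0.2, 0.3], dtype=np.float32)
long_clip = np.arange(10, dtype=np.float32)
print(pad_or_truncate(short_clip, 5).shape)  # (5,)
print(pad_or_truncate(long_clip, 5).shape)   # (5,)
```

A reasonable `target_len` would be derived from the clip durations in the training split, for example a high percentile of the lengths, but that choice is left open here.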

### Data Augmentation of Raw Audio

In [None]:
# Add more variety to the data
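
This cell is also a placeholder. Two simple raw-audio augmentations are adding Gaussian noise at a chosen signal-to-noise ratio and shifting the waveform in time. The sketch below is one possible version under the assumption that the audio is a 1-D float array; the function names and the `snr_db` and `max_shift` parameters are illustrative and not part of the notebook.

```python
import numpy as np


def add_noise(waveform: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise so that the result has roughly `snr_db` dB SNR."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise


def random_shift(waveform: np.ndarray, max_shift: int, rng: np.random.Generator) -> np.ndarray:
    """Circularly shift the waveform by a random offset in [-max_shift, max_shift]."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    # np.roll wraps samples around the ends instead of padding with silence
    return np.roll(waveform, shift)


rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0, 100, 22050))
augmented = random_shift(add_noise(wave, snr_db=20.0, rng=rng), max_shift=2000, rng=rng)
print(augmented.shape)  # (22050,)
```

Augmentation would normally be applied to the training split only, so that the dev and test splits stay comparable across experiments.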

### Mel Spectrograms

In [None]:
# Capture the nature of the data as images instead of audio formats
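
This cell is a placeholder as well. To make the idea concrete, the sketch below computes a log-mel spectrogram with NumPy only: frame the signal, take the power spectrum of each frame, and pool the FFT bins with triangular filters spaced evenly on the mel scale. It is a simplified illustration and not the exact algorithm of any library (in practice `librosa.feature.melspectrogram` would be the natural choice here); all function names and default parameters are assumptions.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_band_edges(sr, n_mels):
    """Band edges in Hz: band m spans edges[m] .. edges[m + 2]."""
    return mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters of shape (n_mels, n_fft // 2 + 1)."""
    edges = mel_band_edges(sr, n_mels)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=256, n_mels=64):
    """Return a (n_mels, n_frames) array in dB; assumes len(y) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10)).T


# one second of a 440 Hz tone as a stand-in for a real clip
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
spec = log_mel_spectrogram(tone, sr=sr)
print(spec.shape)  # (64, 83)
```

The resulting 2-D array can be treated like a single-channel image, which is what makes image-style models and masking-based augmentation applicable.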

### MFCC

In [34]:
# Extract the most essential frequency coefficients

# helper that turns a pydub AudioSegment into a fixed-size MFCC vector
def features_extractor(audio):
    # downmix to mono so that the samples are not interleaved across channels
    audio = audio.set_channels(1)
    # use the segment's own sample rate instead of assuming 22050 Hz
    sample_rate = audio.frame_rate
    samples = np.array(audio.get_array_of_samples()).astype(np.float32)
    # scale the integer PCM samples to the range [-1, 1]
    samples /= float(1 << (8 * audio.sample_width - 1))
    # extract 40 MFCCs per frame: shape (40, n_frames)
    mfccs_features = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=40)
    # average over time to get a single 40-dimensional vector per clip
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features


# build a list of [features, text] pairs for every row of a merged split
def extract_split(merge_df):
    return [[features_extractor(row[0]), row[1]] for row in tqdm(merge_df.values)]


# iterate over all the splits
extracted_features_train = extract_split(train_merge)
extracted_features_dev = extract_split(dev_merge)
extracted_features_test = extract_split(test_merge)

100%|██████████| 1631/1631 [00:49<00:00, 32.67it/s]
100%|██████████| 679/679 [00:22<00:00, 30.75it/s]
100%|██████████| 665/665 [00:16<00:00, 39.88it/s]


### Data Augmentation of Spectrograms

In [None]:
# Apply random frequency and time masking
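
This cell is a placeholder too. Frequency and time masking usually means blanking one random band of frequency rows and one random span of time columns in each spectrogram. The sketch below shows a single mask of each kind on a 2-D NumPy array; `mask_spectrogram` and its parameters are hypothetical names, and the fill value of 0.0 is an assumption (the spectrogram mean or minimum are other common choices).

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width, max_time_width, rng, fill_value=0.0):
    """Return a copy of `spec` (n_freq, n_time) with one frequency and one time mask.

    Assumes max_freq_width <= n_freq and max_time_width <= n_time.
    """
    masked = spec.copy()
    n_freq, n_time = masked.shape

    # frequency mask: blank `f` consecutive rows starting at `f0`
    f = int(rng.integers(0, max_freq_width + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    masked[f0:f0 + f, :] = fill_value

    # time mask: blank `t` consecutive columns starting at `t0`
    t = int(rng.integers(0, max_time_width + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    masked[:, t0:t0 + t] = fill_value
    return masked


rng = np.random.default_rng(0)
spec = np.ones((40, 100))
augmented = mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=rng)
print(augmented.shape)  # (40, 100)
```

Because the input is copied, the original spectrogram stays untouched and a fresh mask can be drawn every epoch.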

## Model Experimenting

We will now use the features extracted from the audio files as input to the model.
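
Before any model can consume them, the `[features, text]` pairs produced in the MFCC step have to be turned into arrays. The sketch below stacks the feature vectors into a matrix and standardises every split with statistics computed on the training split only. The helper names are hypothetical, and the random vectors merely stand in for the real `extracted_features_*` lists.

```python
import numpy as np


def to_feature_matrix(pairs):
    """Split [features, text] pairs into an (N, D) float32 matrix and a list of texts."""
    X = np.stack([features for features, _ in pairs]).astype(np.float32)
    texts = [text for _, text in pairs]
    return X, texts


def standardize(train_X, *others, eps=1e-8):
    """Scale all splits to zero mean and unit variance using train statistics only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0) + eps
    return tuple((X - mean) / std for X in (train_X, *others))


# stand-in data: 8 training and 4 dev clips with 40 MFCC means and 5 transcripts each
rng = np.random.default_rng(0)
fake_train = [[rng.normal(size=40), ["a", "b", "c", "d", "e"]] for _ in range(8)]
fake_dev = [[rng.normal(size=40), ["a", "b", "c", "d", "e"]] for _ in range(4)]

X_train, y_train = to_feature_matrix(fake_train)
X_dev, y_dev = to_feature_matrix(fake_dev)
X_train, X_dev = standardize(X_train, X_dev)
print(X_train.shape, X_dev.shape)  # (8, 40) (4, 40)
```

Using the training statistics for every split avoids leaking information from the dev and test data into the model inputs.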