# Notebook for arranging the datasets, splitting them into train and test sets, and augmentation

### To understand this notebook, it is worth taking note of the points below

+ For the crowdsourced UK English dataset, we manually merged the male and female classes.
+ For the LibriTTS-British dataset, the data was spread across multiple nested directories, so we brought everything up to a single parent directory (see the dataset for more context: https://www.kaggle.com/datasets/oscarvl/libritts-british-accents).
+ The audiomentations library was used to augment the dataset.
+ We augmented the UK accents data but not the LibriTTS data; the LibriTTS-British data was used for testing and fine-tuning, not for training.

### Because we manually joined the male and female accents in the crowdsourced UK accents dataset, this notebook is divided into three sections:
#### (1) Arranging the LibriTTS dataset to have all the classes under a single parent directory
#### (2) Splitting the crowdsourced UK accents dataset into train and test
#### (3) Applying data augmentation to the train data of the crowdsourced UK accents dataset

Installing the audiomentations library

In [None]:
! pip install audiomentations

Collecting audiomentations
  Obtaining dependency information for audiomentations from https://files.pythonhosted.org/packages/be/0b/a7f3df0bc7625008933276103eaa008c388cc7848163fc562949b379b149/audiomentations-0.33.0-py3-none-any.whl.metadata
  Downloading audiomentations-0.33.0-py3-none-any.whl.metadata (10 kB)
Downloading audiomentations-0.33.0-py3-none-any.whl (76 kB)
Installing collected packages: audiomentations
Successfully installed audiomentations-0.33.0





#### Importing the required libraries

In [None]:
import os
import librosa
import soundfile as sf
import random
import shutil
from audiomentations import AddBackgroundNoise, PolarityInversion, Compose, AddGaussianNoise, PitchShift, HighPassFilter

<br />
<br />
<br />


## Section 1: Arranging the LibriTTS dataset to have all the classes under a single parent directory

Defining the path to the LibriTTS dataset

In [None]:
data_dir = "archive"

#### Moving all the files in the subdirectories of the LibriTTS dataset up to their parent directories

In [None]:
for sub_parent_directory in os.listdir(data_dir):
    parent_directory = os.path.join(data_dir, sub_parent_directory)
    for foldername, subfolders, filenames in os.walk(parent_directory):
        # Files that already sit directly in the parent directory stay where they are
        if os.path.normpath(foldername) == os.path.normpath(parent_directory):
            continue
        for filename in filenames:
            # Build the full path for the file
            file_path = os.path.join(foldername, filename)
            # Move the file up to the parent directory
            shutil.move(file_path, parent_directory)
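
One case the loop above does not cover is two files in different subdirectories sharing a name: `shutil.move` raises an error when the destination file already exists. The sketch below shows one possible way to handle that, run on a throwaway directory tree. The helper name `flatten_directory`, the `_1`, `_2` suffix scheme and the demo file names are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import shutil
import tempfile


def flatten_directory(parent_directory):
    """Move every file found below parent_directory directly into it.

    Hypothetical helper: if a file name is already taken in the parent,
    a numeric suffix (_1, _2, ...) is added instead of raising an error.
    """
    # Collect the paths first so that moving files does not disturb the walk
    to_move = []
    for foldername, _subfolders, filenames in os.walk(parent_directory):
        if os.path.normpath(foldername) == os.path.normpath(parent_directory):
            continue  # already at the top level
        for filename in filenames:
            to_move.append(os.path.join(foldername, filename))

    moved = 0
    for file_path in to_move:
        stem, ext = os.path.splitext(os.path.basename(file_path))
        target = os.path.join(parent_directory, stem + ext)
        counter = 1
        while os.path.exists(target):
            target = os.path.join(parent_directory, f"{stem}_{counter}{ext}")
            counter += 1
        shutil.move(file_path, target)
        moved += 1
    return moved


# Demonstration on a temporary tree with two files that share a name
demo_root = tempfile.mkdtemp()
for speaker in ("speaker_a", "speaker_b"):
    os.makedirs(os.path.join(demo_root, speaker, "chapter_1"))
    with open(os.path.join(demo_root, speaker, "chapter_1", "clip.wav"), "w") as f:
        f.write(speaker)

moved_count = flatten_directory(demo_root)
top_level_files = sorted(
    name for name in os.listdir(demo_root)
    if os.path.isfile(os.path.join(demo_root, name))
)
print(moved_count, top_level_files)
```

Whether renaming is the right policy depends on the dataset; if duplicate names would indicate a problem, letting the move fail loudly may be preferable.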

#### Now that the subdirectories are empty, delete them

In [None]:
# Remove all subdirectories in the parent directory
for sub_parent_directory in os.listdir(data_dir):
    parent_directory = f"{data_dir}/{sub_parent_directory}"
    for foldername in os.listdir(parent_directory):
        folder_path = os.path.join(parent_directory, foldername)
        if os.path.isdir(folder_path):
            shutil.rmtree(folder_path)

#### Finally, we are only interested in the accent data, which are WAV files, so we remove every file in the directories that is not a WAV file

In [None]:
for sub_parent_directory in os.listdir(data_dir):
    parent_directory = f"{data_dir}/{sub_parent_directory}"
    for filename in os.listdir(parent_directory):
        file_path = f"{parent_directory}/{filename}"

        # Remove non-WAV files
        if not file_path.lower().endswith(".wav"):
            os.remove(file_path)

Listing the directories now shows just 4 directories, one for each of our desired classes

In [None]:
os.listdir("archive")

['libritts-english', 'libritts-irish', 'libritts-scottish', 'libritts-welsh']
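
As an extra check on the clean-up, the file extensions left in each class directory can be counted. The following is a minimal sketch; the helper name `count_extensions` and the demo file names are assumptions, and the demo runs on a temporary stand-in folder rather than the real `archive` directory.

```python
import os
import tempfile
from collections import Counter


def count_extensions(data_dir):
    """Return {class_name: Counter of lower-cased file extensions}."""
    summary = {}
    for class_name in sorted(os.listdir(data_dir)):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        summary[class_name] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(class_path)
        )
    return summary


# Stand-in for the cleaned "archive" folder (file names are made up)
demo_dir = tempfile.mkdtemp()
for class_name, names in {"libritts-irish": ["a.wav", "b.WAV"],
                          "libritts-welsh": ["c.wav"]}.items():
    os.makedirs(os.path.join(demo_dir, class_name))
    for name in names:
        open(os.path.join(demo_dir, class_name, name), "w").close()

summary = count_extensions(demo_dir)
for class_name, counts in summary.items():
    print(class_name, dict(counts))
```

Calling `count_extensions("archive")` on the real data should then report only `.wav` entries for each of the four classes if the clean-up worked as intended.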

<br />
<br />
<br />
<br />
<br />

## Section 2: Splitting the crowdsourced UK accent dataset into train and test (80:20)

Defining the paths to the train and test data


In [None]:
train_dir = "Data/train"
test_dir = "Data/test"

Creating a function that randomly picks 20% of the files of a class in the specified data directory and moves them to the specified test directory

In [None]:
def move_20_percent_to_test(data_dir, test_dir, current_class):
    # List the audio files of the class once so the selection is made from a fixed list
    class_files = os.listdir(f"{data_dir}/{current_class}")
    twenty_percent = int((20/100) * len(class_files))

    # Randomly pick 20% of the files without repetition
    files_to_move = random.sample(class_files, twenty_percent)

    for file_name in files_to_move:
        shutil.move(f"{data_dir}/{current_class}/{file_name}", f"{test_dir}/{current_class}")

#### Creating the test directory for each class

In [None]:
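for current_class in os.listdir(train_dir):
    # exist_ok avoids an error if the directory was already created in an earlier run
    os.makedirs(f"{test_dir}/{current_class}", exist_ok = True)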

Using the move_20_percent_to_test function to move 20% of each class to the test directories created

In [None]:
for current_class in os.listdir(train_dir):
    move_20_percent_to_test(train_dir, test_dir, current_class)
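
Because the split is random, rerunning it produces a different test set each time. The sketch below shows one way to make the split repeatable with a seeded generator and to confirm the resulting 80:20 proportions. The helper name `split_class`, the seed value 42 and the placeholder files are assumptions for illustration only.

```python
import os
import random
import shutil
import tempfile


def split_class(train_dir, test_dir, current_class, fraction=0.2, seed=42):
    """Move a reproducible random `fraction` of one class from train to test."""
    # Sorting first makes the result independent of the order os.listdir returns
    class_files = sorted(os.listdir(os.path.join(train_dir, current_class)))
    n_to_move = int(fraction * len(class_files))
    rng = random.Random(seed)  # local generator: same seed gives the same split
    files_to_move = rng.sample(class_files, n_to_move)
    os.makedirs(os.path.join(test_dir, current_class), exist_ok=True)
    for file_name in files_to_move:
        shutil.move(os.path.join(train_dir, current_class, file_name),
                    os.path.join(test_dir, current_class))
    return files_to_move


# Demonstration with 10 empty placeholder files in a temporary folder
root = tempfile.mkdtemp()
demo_train = os.path.join(root, "train")
demo_test = os.path.join(root, "test")
os.makedirs(os.path.join(demo_train, "southern"))
for i in range(10):
    open(os.path.join(demo_train, "southern", f"clip_{i}.wav"), "w").close()

moved = split_class(demo_train, demo_test, "southern")
n_train = len(os.listdir(os.path.join(demo_train, "southern")))
n_test = len(os.listdir(os.path.join(demo_test, "southern")))
print(n_train, n_test)
```

Note that `int()` rounds down, so a class whose size is not a multiple of five ends up with slightly less than 20% of its files in the test set.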

<br />
<br />
<br />
<br />
<br />

## Section 3: Augmenting the crowdsourced UK accent data

### The "audio_data_augmentation" class
#### This class is designed to carry out different augmentations on audio samples in a specified directory


#### About the "audio_data_augmentation" class
(1) This class is instantiated with 3 parameters, in this order: the path to the audio, the path where the augmented audio samples will be stored and the path to the background noises to be added <br />
(2) The methods of the class are add_noises, pitch_shift, high_pass_filter, pick_background_noise, random_augment and augment_samples

#### About the "audio_data_augmentation" class methods
<h4 style="text-decoration: underline;">(1) add_noises</h4>
<p> The add_noises method randomly selects between Gaussian noise and the recorded background noises in the background noise directory, with probabilities of 0.2 and 0.8 respectively</p>
<p> The method takes one parameter, p, which defines the probability of the selected noise being applied to an audio sample</p>

<h4 style="text-decoration: underline;">(2) pitch_shift</h4>
<p> The pitch_shift method shifts the pitch up or down by as much as 8 semitones. It also calls the add_noises method with a probability of 0.4, which means that whenever pitch_shift is used to augment an audio sample, there is a 40% chance that noise is also applied</p>

<h4 style="text-decoration: underline;">(3) high_pass_filter</h4>
<p> This method applies a high-pass filter to the audio sample: a cutoff frequency is picked at random between 2000 Hz and 4000 Hz, and frequencies below that cutoff are attenuated.</p>

<h4 style="text-decoration: underline;">(4) pick_background_noise</h4>
<p> This method calls the add_noises method to augment an audio sample with either Gaussian noise or recorded background noise. It uses a probability of 1, which means that whichever noise is selected will definitely be added to the audio sample</p>

<h4 style="text-decoration: underline;">(5) random_augment</h4>
<p> This method randomly returns one of the augmentation methods above, with probabilities of 0.05 for pitch_shift, 0.05 for high_pass_filter and 0.9 for pick_background_noise</p>

<h4 style="text-decoration: underline;">(6) augment_samples</h4>
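<p> This method applies a specified augmentation to every audio sample in a class directory and writes the results to the matching directory under the augmented path, creating aug_per_sample augmented copies of each sample</p>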
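
To get a feel for the selection weights described above, the two random choices can be simulated with plain strings standing in for the transforms. This is only an illustrative sketch: the number of draws and the seed are arbitrary assumptions, and no audio is processed.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

techniques = ["pitch_shift", "high_pass_filter", "pick_background_noise"]
technique_weights = [0.05, 0.05, 0.9]   # weights used by random_augment
noise_types = ["background_noise", "gaussian_noise"]
noise_weights = [0.8, 0.2]              # weights used by add_noises

n_draws = 10_000
technique_counts = Counter(rng.choices(techniques, weights=technique_weights, k=n_draws))
noise_counts = Counter(rng.choices(noise_types, weights=noise_weights, k=n_draws))

# Observed share of each option across all draws
for name in techniques:
    print(f"{name}: {technique_counts[name] / n_draws:.3f}")
for name in noise_types:
    print(f"{name}: {noise_counts[name] / n_draws:.3f}")
```

The observed shares should land close to the configured weights. Keep in mind that random_augment returns a single transform, so the technique is drawn once per call rather than once per audio sample.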

In [None]:
class audio_data_augmentation():
    def __init__(self, audio_path, aug_path, background_noises):
        self.audio_path = audio_path
        self.aug_path = aug_path
        self.background_noises = background_noises

    def add_noises(self, p = 1):
        # (1) Create a list holding the background noise and Gaussian noise transforms respectively
        Noises = [
            AddBackgroundNoise(sounds_path = self.background_noises,
                               min_snr_in_db = 3.0, max_snr_in_db = 20.0,
                               noise_transform = PolarityInversion(),p = p),

            AddGaussianNoise(min_amplitude=0.1, max_amplitude=0.2, p = p)
        ]

        # (2) Randomly pick between background noise and Gaussian noise with probabilities of 0.8 and 0.2 respectively
        choice = random.choices(Noises, weights = [0.8, 0.2], k = 1)[0]
        return choice

    def pitch_shift(self):
        return Compose([self.add_noises(p = 0.4),PitchShift(min_semitones = -8, max_semitones = 8, p = 1)])

    def high_pass_filter(self):
        return Compose([HighPassFilter(min_cutoff_freq = 2000, max_cutoff_freq = 4000, p = 1)])

    def pick_background_noise(self):
        return Compose([self.add_noises(p = 1)])

    def random_augment(self):
        return random.choices([self.pitch_shift(),
                               self.high_pass_filter(),
                               self.pick_background_noise()], weights = [0.05, 0.05, 0.9], k=1)[0]

    def augment_samples(self, directory, aug_technique, aug_per_sample = 1):
        for audio in os.listdir(f"{self.audio_path}/{directory}"):
            for i in range(aug_per_sample):
                signal, sr = librosa.load(f"{self.audio_path}/{directory}/{audio}", sr = 22050)
                augmented_signal = aug_technique(signal, sr)
                sf.write(f"{self.aug_path}/{directory}/aug_{i}_{audio}", augmented_signal, sr)
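
The `min_snr_in_db` and `max_snr_in_db` arguments in add_noises bound the signal-to-noise ratio at which the background noise is mixed in. The plain-Python sketch below illustrates what mixing at a target SNR means; it is meant to convey the idea and is not audiomentations' own implementation. The helper names `rms` and `mix_at_snr`, the 440 Hz tone and the white noise are all assumptions for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal RMS / noise RMS matches snr_db, then add it."""
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(signal, noise)]


sr = 22050
rng = random.Random(0)
# 0.1 s of a 440 Hz tone as the "speech" and white noise as the "background"
signal = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 10)]
noise = [rng.uniform(-1.0, 1.0) for _ in signal]

for snr_db in (3.0, 20.0):  # the two bounds used in add_noises
    mixed = mix_at_snr(signal, noise, snr_db)
    added_noise = [m - s for m, s in zip(mixed, signal)]
    measured = 20 * math.log10(rms(signal) / rms(added_noise))
    print(f"target {snr_db} dB, measured {measured:.2f} dB")
```

A lower SNR means louder noise relative to the speech: at 3 dB the noise RMS is roughly 0.7 times the signal RMS, while at 20 dB it is 0.1 times.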

### Testing the class

#### Creating variables to hold the paths to a small set of audio samples for testing purposes

In [None]:
test_path = "aug_test/test_train"
test_aug_path = "aug_test/test_augmented_train"
# Named test_class so it does not overwrite test_dir from Section 2
test_class = "southern"

#### Creating a variable to hold the directory with the recorded background noises to be added

In [None]:
background_noises_dir = "background_noise"

In [None]:
instance_to_test_class = audio_data_augmentation(test_path, test_aug_path, background_noises_dir)

In [None]:
instance_to_test_class.augment_samples(test_class, instance_to_test_class.random_augment())

##### The test successfully created augmented versions of the audio samples. Now I proceed to use the class to augment the samples in the dataset
<br /><br /><br /><br />

## Augmenting the UK accent dataset

#### Creating a list to hold the directory names for the accents

In [None]:
accent_directories = os.listdir("Data/augmented_train")
accent_directories

['irish', 'midlands', 'northern', 'scottish', 'southern', 'welsh']

#### Defining the paths to the data to augment and where the augmented data should reside

In [None]:
train_data_path = "Data/train"
aug_data_path = "Data/augmented_train"

#### Creating an instance of the "audio_data_augmentation" class

In [None]:
augment_uk_accent_data = audio_data_augmentation(train_data_path, aug_data_path, background_noises_dir)

#### Calling the "augment_samples" method to create a randomly augmented version of the UK accent data

In [None]:
for accent_directory in accent_directories:
    augment_uk_accent_data.augment_samples(accent_directory, augment_uk_accent_data.random_augment())
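
Given the `aug_{i}_{audio}` naming used in augment_samples and the default of one augmentation per sample, every training file should end up with a counterpart named `aug_0_<file>` in the augmented directory. The sketch below lists any that are missing. The helper name `find_missing_augmentations` and the demo files are assumptions for illustration; the demo runs on a temporary folder rather than on `Data/`.

```python
import os
import tempfile


def find_missing_augmentations(train_dir, aug_dir, aug_per_sample=1):
    """Return (accent, expected_file_name) pairs that are absent from aug_dir."""
    missing = []
    for accent in sorted(os.listdir(train_dir)):
        existing = set(os.listdir(os.path.join(aug_dir, accent)))
        for audio in sorted(os.listdir(os.path.join(train_dir, accent))):
            for i in range(aug_per_sample):
                expected = f"aug_{i}_{audio}"  # naming used by augment_samples
                if expected not in existing:
                    missing.append((accent, expected))
    return missing


# Demonstration: two training clips, only one of which was augmented
root = tempfile.mkdtemp()
demo_train = os.path.join(root, "train", "welsh")
demo_aug = os.path.join(root, "augmented_train", "welsh")
os.makedirs(demo_train)
os.makedirs(demo_aug)
for name in ("clip_1.wav", "clip_2.wav"):
    open(os.path.join(demo_train, name), "w").close()
open(os.path.join(demo_aug, "aug_0_clip_1.wav"), "w").close()

missing = find_missing_augmentations(os.path.join(root, "train"),
                                     os.path.join(root, "augmented_train"))
print(missing)
```

On the real data, `find_missing_augmentations("Data/train", "Data/augmented_train")` returning an empty list would suggest that every training sample was augmented.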