<a href="https://www.kaggle.com/code/orestasdulinskas/british-dialect-recognition?scriptVersionId=186865420" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

![voice recognition](https://cdn-icons-png.flaticon.com/512/1231/1231058.png)

# British dialect recognition
---
## Background
Language variation is central to the cultural and social identity of a region. English as spoken in the UK and Ireland comprises many dialects, each with distinct phonetic, lexical, and syntactic features. These dialects reflect historical and geographical influences and serve as markers of identity for individuals and communities. Machine learning makes it possible to study this variation systematically. This project uses it to predict English dialects from audio recordings, contributing to the broader fields of sociolinguistics and dialectology.

## Objectives
The primary objective of this project is to train a recurrent neural network (RNN) model capable of predicting the dialect of English spoken by a speaker based on audio recordings. By doing so, the project seeks to:
1. Identify key linguistic features that distinguish different English dialects in the UK and Ireland.
2. Develop a robust and accurate machine learning model that can generalize across various speakers and contexts.

## Data
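The dataset comprises recordings from speakers of a broad range of dialects from the UK and Ireland, stored in one folder per dialect and speaker gender (eleven folders in total; Irish English has male speakers only). These eleven folders are the classes the model predicts. Participants self-identified their dialect from the following categories: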

- Irish English
- Midlands English
- Northern English
- Scottish English
- Southern English
- Welsh English

To ensure comprehensive coverage of each dialect's unique features, different elicitation scripts were crafted for each speaker. However, a set of common sentences was included in all scripts to allow for direct comparison across dialects. These scripts were meticulously designed to highlight specific linguistic features pertinent to each dialect, facilitating contrastive analysis. Additionally, pronunciation variants of place names and other lexical items were captured.

Each elicitation line in the dataset is associated with a LINE_ID, an identifier linking the line to its source. This allows for the retrieval of the same line across different speakers and dialects, enabling detailed comparative analysis. The sources for these lines include Wikipedia, The Rainbow Passage, and modified lines from virtual assistant tasks, curated to include target words for accent elicitation while preserving the original content.

By utilizing this diverse and carefully curated dataset, the project aims to uncover the intricate patterns and distinctive characteristics of English dialects, ultimately leading to a model that can accurately predict dialects based on spoken language.
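
As a minimal sketch of the LINE_ID lookup described above, the snippet below groups a few made-up records by line identifier. The record layout, the `EN0001`-style IDs and the file names are hypothetical placeholders, not the dataset's actual index format.

```python
from collections import defaultdict

# Hypothetical (line_id, accent_folder, file_name) records, for illustration only.
records = [
    ("EN0001", "irish_english_male", "clip_a.wav"),
    ("EN0001", "welsh_english_female", "clip_b.wav"),
    ("EN0002", "scottish_english_male", "clip_c.wav"),
]

def group_by_line_id(records):
    """Map each LINE_ID to the (accent, file) pairs that recorded that line."""
    grouped = defaultdict(list)
    for line_id, accent, file_name in records:
        grouped[line_id].append((accent, file_name))
    return dict(grouped)

by_line = group_by_line_id(records)
print(by_line["EN0001"])
```

With a real index loaded in the same shape, `by_line[some_id]` would list every speaker's recording of the same sentence, which is what makes a direct cross-dialect comparison possible.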

# Data Ingestion
---

In [None]:
import os
import librosa
import warnings
warnings.filterwarnings("ignore")

accents = ['irish_english_male',
           'midlands_english_female',
           'midlands_english_male',
           'northern_english_female',
           'northern_english_male',
           'scottish_english_female',
           'scottish_english_male',
           'southern_english_female',
           'southern_english_male',
           'welsh_english_female',
           'welsh_english_male']

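def load_audio_files(data_dir, accents, sr=16000):
    """Load every .wav file under data_dir/<accent>/ and label it with the accent's index."""
    audio_data = []
    labels = []

    for label, accent in enumerate(accents):
        folder_path = os.path.join(data_dir, accent)
        # Sort so the sample order is reproducible across runs and platforms.
        for file_name in sorted(os.listdir(folder_path)):
            if file_name.endswith('.wav'):
                file_path = os.path.join(folder_path, file_name)
                # Resample to a common rate so all clips share one time base.
                audio, _ = librosa.load(file_path, sr=sr)
                audio_data.append(audio)
                labels.append(label)

    return audio_data, labels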

data_dir = '/kaggle/input/uk-and-ireland-english-dialect-speech/'
audio_data, labels = load_audio_files(data_dir, accents)
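
The folder names combine dialect and speaker gender, so the eleven class labels cover six dialects. A minimal sketch of how the two parts could be separated, should a dialect-only label be needed later; `split_accent` is a hypothetical helper, not part of the original pipeline.

```python
def split_accent(folder_name):
    """Split e.g. 'northern_english_female' into ('northern_english', 'female')."""
    dialect, gender = folder_name.rsplit('_', 1)
    return dialect, gender

accent_folders = ['irish_english_male',
                  'midlands_english_female', 'midlands_english_male',
                  'northern_english_female', 'northern_english_male',
                  'scottish_english_female', 'scottish_english_male',
                  'southern_english_female', 'southern_english_male',
                  'welsh_english_female', 'welsh_english_male']

# Collapse the eleven folders to their distinct dialect names.
dialects = sorted({split_accent(name)[0] for name in accent_folders})
print(len(accent_folders), 'folders ->', len(dialects), 'dialects')
print(dialects)
```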

# Exploratory Data Analysis
---

### Dialect distribution

In [None]:
number_classes = {}

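for accent in accents:
    # Count only .wav files so the totals match what load_audio_files reads.
    wav_files = [f for f in os.listdir(os.path.join(data_dir, accent)) if f.endswith('.wav')]
    number_classes[accent] = len(wav_files)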

In [None]:
import pandas as pd
import plotly.express as px

subject = pd.DataFrame.from_dict(number_classes, orient='index', columns=['Count'])
px.bar(subject, x=subject.index, y='Count', text='Count', template='ggplot2', title='Dialect distribution')

### Audio sample lengths

In [None]:
import wave
import contextlib

durations = []

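for accent in accents:
    folder_path = os.path.join(data_dir, accent)
    for file_name in os.listdir(folder_path):
        if not file_name.endswith('.wav'):
            continue
        try:
            with contextlib.closing(wave.open(os.path.join(folder_path, file_name), 'r')) as f:
                frames = f.getnframes()
                rate = f.getframerate()
                durations.append(frames / float(rate))
        except (wave.Error, EOFError):
            # Skip files the wave module cannot parse (e.g. unsupported or truncated headers).
            pass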
        
px.histogram(durations, template='ggplot2', title='Audio file lengths', labels={'value':'Audio length in seconds'})
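
The duration calculation above is the number of frames divided by the frame rate, both read from the WAV header. A self-contained sketch that checks this on a synthetic file; the temporary file and the `wav_duration_seconds` helper are illustrative assumptions, not part of the dataset or the original code.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Duration in seconds = frame count / frames per second, from the WAV header."""
    with wave.open(path, 'rb') as f:
        return f.getnframes() / float(f.getframerate())

# Write one second of 16 kHz, 16-bit mono silence to a temporary file.
tmp_path = os.path.join(tempfile.mkdtemp(), 'silence.wav')
with wave.open(tmp_path, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    f.setframerate(16000)
    f.writeframes(b'\x00\x00' * 16000)

duration = wav_duration_seconds(tmp_path)
print(duration)
```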

### Audio Samples

In [None]:
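import matplotlib.pyplot as plt
import librosa.display  # waveshow lives in this submodule; import it explicitly
import IPython.display as ipd
from IPython.display import display
plt.style.use('ggplot')

def sample_data(path, category):
    """Plot the waveform of one recording and embed an audio player for it."""
    plt.figure(figsize=(15, 3))
    data, sample_rate = librosa.load(path)
    print(category, '\n')
    librosa.display.waveshow(data, sr=sample_rate)
    display(ipd.Audio(path))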

In [None]:
path = data_dir + accents[0] + '/irm_03397_00650953544.wav'
sample_data(path, accents[0])

In [None]:
path = data_dir + accents[1] + '/mif_02484_01361903134.wav'
sample_data(path, accents[1])

In [None]:
path = data_dir + accents[2] + '/mim_04310_00366760846.wav'
sample_data(path, accents[2])

In [None]:
path = data_dir + accents[3] + '/nof_04310_00012750120.wav'
sample_data(path, accents[3])

In [None]:
path = data_dir + accents[4] + '/nom_09697_01709812810.wav'
sample_data(path, accents[4])

In [None]:
path = data_dir + accents[5] + '/scf_06136_01402841624.wav'
sample_data(path, accents[5])

In [None]:
path = data_dir + accents[6] + '/scm_04310_01623687431.wav'
sample_data(path, accents[6])

In [None]:
path = data_dir + accents[7] + '/sof_07508_00253667648.wav'
sample_data(path, accents[7])

In [None]:
path = data_dir + accents[8] + '/som_07505_00831846630.wav'
sample_data(path, accents[8])

In [None]:
path = data_dir + accents[9] + '/wef_07049_00214407505.wav'
sample_data(path, accents[9])

In [None]:
path = data_dir + accents[10] + '/wem_12484_00567781102.wav'
sample_data(path, accents[10])

# Pre-processing
---

### Extract audio features

In [None]:
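def extract_features(audio_data, sr=16000, n_mfcc=13):
    """Return one (frames, n_mfcc) MFCC matrix per clip."""
    features = []
    for audio in audio_data:
        mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
        # librosa returns (n_mfcc, frames); transpose so time is the first axis.
        features.append(mfccs.T)
    return features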

features = extract_features(audio_data)
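
To get a feel for the size of each feature matrix, the sketch below estimates how many MFCC frames a clip produces. It assumes librosa's default `hop_length=512` and centred framing, neither of which the call above overrides; `mfcc_frame_count` is a hypothetical helper for this estimate.

```python
def mfcc_frame_count(n_samples, hop_length=512):
    """Frames produced with centred framing: one per hop, plus the initial frame."""
    return 1 + n_samples // hop_length

sample_rate = 16000
clip_seconds = 3
n_frames = mfcc_frame_count(sample_rate * clip_seconds)

# After the transpose, each clip becomes a (frames, 13) matrix.
print((n_frames, 13))
# Time covered by one hop, in milliseconds.
print(1000 * 512 / sample_rate)
```

Under these assumptions a three-second clip yields 94 frames, each advancing 32 ms, so sequence lengths grow linearly with clip duration.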

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

max_len = max([len(feature) for feature in features])
X = pad_sequences(features, maxlen=max_len, padding='post', dtype='float32')
y = np.array(labels)
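
What post-padding does can be sketched with plain NumPy. The snippet pads two toy sequences to a common length and then flags the padded timesteps, which are the ones a Keras `Masking(mask_value=0.0)` layer skips (timesteps where every feature equals the mask value). The toy arrays and the `pad_post` helper are illustrative only.

```python
import numpy as np

def pad_post(sequences, pad_value=0.0):
    """Pad (frames, n_features) arrays at the end so they share one length."""
    max_len = max(len(seq) for seq in sequences)
    n_features = sequences[0].shape[1]
    padded = np.full((len(sequences), max_len, n_features), pad_value, dtype='float32')
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    return padded

# Two toy "clips" with 2 and 4 frames of 13 features each.
toy = [np.ones((2, 13), dtype='float32'), np.ones((4, 13), dtype='float32')]
X_toy = pad_post(toy)

# A timestep counts as padding when all of its features equal the pad value.
is_padding = np.all(X_toy == 0.0, axis=-1)
print(X_toy.shape)
print(is_padding[0])
```

Because every clip is padded to the longest one, a single very long recording sets the sequence length for the whole dataset.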

### Splitting data into training, validation and test sets

In [None]:
from sklearn.model_selection import train_test_split

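# Stratify both splits so every class keeps its share in the training, validation and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)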
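
Applying a 20% split twice gives roughly a 64/16/20 division of the data. A small arithmetic sketch, assuming the held-out share is rounded up at each step; the sample count of 1000 is a made-up example, not the dataset size.

```python
def split_sizes(n_samples, test_pct=20, val_pct=20):
    """Return (train, validation, test) sizes for two successive percentage splits."""
    n_test = -(-n_samples * test_pct // 100)      # ceiling division
    n_remaining = n_samples - n_test
    n_val = -(-n_remaining * val_pct // 100)      # 20% of what is left, not of the total
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(1000))
```

The validation set is 20% of the remaining 80%, which is why it ends up at 16% of the full dataset rather than 20%.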

# Model Training
---

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, BatchNormalization, Dense

def build_rnn_model(input_shape, output_dim):
    model = Sequential()
    model.add(Masking(mask_value=0.0, input_shape=input_shape))
    model.add(Bidirectional(LSTM(128, return_sequences=True)))
    model.add(Dropout(0.5))
    model.add(LSTM(128, return_sequences=False))
    model.add(BatchNormalization())
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

input_shape = (max_len, 13)
output_dim = len(accents)  # one output unit per dialect/gender folder (11)
model = build_rnn_model(input_shape, output_dim)

model.summary()
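
The parameter counts reported by `model.summary()` can be estimated by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the default concatenating merge of `Bidirectional`; treat the totals as an estimate to check against the summary rather than a guaranteed match.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

layer_params = {
    'bidirectional_lstm': 2 * lstm_params(13, 128),   # forward and backward copies
    'lstm': lstm_params(2 * 128, 128),                # input is the concatenated 256-dim output
    'batch_normalization': 4 * 128,                   # gamma, beta, moving mean, moving variance
    'dense_64': dense_params(128, 64),
    'dense_out': dense_params(64, 11),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f'{name}: {count}')
print('total:', total_params)
```

The `Masking` and `Dropout` layers hold no weights, so almost all of the capacity sits in the two recurrent layers.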

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ReduceLROnPlateau

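# Monitor validation loss so that stopping and learning-rate decay track generalisation rather than training fit.
early_stopping = EarlyStopping(monitor='val_loss', patience=10, min_delta=0.0001, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=0.0001)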
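
The learning-rate schedule implied by `ReduceLROnPlateau` can be traced by hand: each time the monitored loss stalls for five epochs the rate is halved, but never below `min_lr`. The sketch assumes the Adam default starting rate of 0.001, since the model is compiled with `optimizer='adam'` and no explicit rate; `lr_after_reductions` is a hypothetical helper.

```python
def lr_after_reductions(n_reductions, initial_lr=0.001, factor=0.5, min_lr=0.0001):
    """Learning rate after a given number of plateau-triggered reductions."""
    lr = initial_lr
    for _ in range(n_reductions):
        lr = max(lr * factor, min_lr)   # halve, but clamp at the floor
    return lr

schedule = [lr_after_reductions(n) for n in range(6)]
print(schedule)
```

Under these assumptions the floor is reached on the fourth reduction, after which further plateaus no longer change the rate.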

In [None]:
history = model.fit(X_train,
                    y_train,
                    epochs=200,
                    batch_size=16,
                    validation_data=(X_val, y_val),
                    callbacks=[early_stopping, reduce_lr])

In [None]:
def performance_graph(history):
    """Plot loss and accuracy curves and print the best validation scores."""
    history_df = pd.DataFrame(history.history)
    fig, axs = plt.subplots(1, 2, figsize=(15, 4))
    # Rows 0 and 1 (the first two epochs) are left out of both plots.
    history_df.loc[2:, ['loss', 'val_loss']].plot(ax=axs[0])
    history_df.loc[2:, ['accuracy', 'val_accuracy']].plot(ax=axs[1])

    print(f"Best Validation Loss: {history_df['val_loss'].min():0.4f}"
          f"\nBest Validation accuracy: {history_df['val_accuracy'].max():0.4f}")

In [None]:
performance_graph(history)

# Evaluation
---

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

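def performance_metrics(model, X_test, y_test):
    """Print a per-class report and plot the confusion matrix for the test set."""
    preds = model.predict(X_test)
    preds_labels = preds.argmax(axis=1)

    target_names = accents
    # Pass the full label list so the report still works if a class is absent from y_test.
    class_ids = list(range(len(target_names)))

    print(classification_report(y_test, preds_labels, labels=class_ids, target_names=target_names), '\n')

    # Rows are true classes, columns are predicted classes.
    cf_matrix = confusion_matrix(y_test, preds_labels, labels=class_ids, normalize='all')
    fig = px.imshow(pd.DataFrame(cf_matrix, columns=target_names, index=target_names),
          template='ggplot2', title='Confusion Matrix', aspect='auto', text_auto=True, zmin=0,
          zmax=1, labels={'x': 'Predicted', 'y': 'Actual', 'color': 'Proportion'})
    fig.show()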

In [None]:
performance_metrics(model, X_test, y_test)
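
The `normalize` argument of `confusion_matrix` decides what each cell is divided by. The sketch below contrasts normalising over all samples with normalising per true class, using a made-up two-class count matrix.

```python
import numpy as np

# Toy counts: rows are true classes, columns are predicted classes.
counts = np.array([[8, 2],
                   [1, 9]])

# normalize='all': every cell is a share of the whole test set.
norm_all = counts / counts.sum()

# normalize='true': every row sums to 1, so the diagonal is per-class recall.
norm_true = counts / counts.sum(axis=1, keepdims=True)

print(norm_all)
print(norm_true)
```

With eleven classes, cells normalised over all samples stay well below 1, so they use only a small part of a 0 to 1 colour scale. Row-normalised values are easier to read as per-class recall.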

### Prediction on randomly picked audio samples

In [None]:
import random

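def predict_dialect(model, accent):
    """Predict the dialect of one randomly chosen recording from the given accent folder."""
    folder_path = os.path.join(data_dir, accent)
    # The file is drawn from the whole folder, so it may have been seen during training.
    wav_files = [f for f in os.listdir(folder_path) if f.endswith('.wav')]
    audio_file = os.path.join(folder_path, random.choice(wav_files))
    audio, sr = librosa.load(audio_file, sr=16000)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    # Pad (or truncate) to the sequence length the model was trained on.
    X = pad_sequences([mfccs.T], maxlen=max_len, padding='post', dtype='float32')
    prediction = model.predict(X)
    predicted_label = np.argmax(prediction, axis=1)
    print('Audio:', accent, '\n\nPrediction:', accents[predicted_label[0]], ' Confidence:', prediction[0][predicted_label[0]], '\n')
    display(ipd.Audio(audio_file))

In [None]:
for accent in accents:
    predict_dialect(model, accent)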