This notebook implements **Probabilistic Approach for Detection of Vocal Pathologies in the Arabic Speech**, a method by Naim TERBEH for detecting vocal pathologies in Arabic speech.

For more information, read his paper [Probabilistic Approach for Detection of Vocal Pathologies in the Arabic Speech](https://www.researchgate.net/publication/274832247_Probabilistic_Approach_for_Detection_of_Vocal_Pathologies_in_the_Arabic_Speech).

# Phonetic Distance and Classification

Generating the phonetic distance requires the following steps:

1. We prepare n healthy speech corpora **($C_{i}$, 1 ≤ i ≤ n)** and, for each corpus, determine the corresponding phonetic model **$M_{i}$ (1 ≤ i ≤ n)**.
2. We define **S = { $\alpha_{ij}$; 1 ≤ i, j ≤ n and i ≠ j }**, the set of angles that separate **$M_{i}$** and **$M_{j}$** (with $\alpha_{ij} = \alpha_{ji}$ and $\alpha_{ii} = 0$).
3. We define the value **Max = maximum{S}**.
4. We define the value **δ = standard deviation{S}**.
5. We define the value **Avg = average{S}**.
6. We calculate **β = Max + |Avg − δ|**.
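
Steps 3 to 6 can be sketched as follows, assuming the pairwise angles are already available as a flat list. The function name `phonetic_distance` and the sample angles are hypothetical, and the population standard deviation (NumPy's default) is assumed for δ.

```python
import numpy as np


def phonetic_distance(angles):
    """Return beta = Max + |Avg - delta| for a collection of pairwise angles."""
    s = np.asarray(angles, dtype=float)
    max_s = s.max()
    avg_s = s.mean()
    std_s = s.std()  # population standard deviation (ddof=0), an assumption
    return max_s + abs(avg_s - std_s)


# Hypothetical angles (in radians) between three models
example_angles = [0.2, 0.4, 0.6]
beta_example = phonetic_distance(example_angles)
print(round(beta_example, 4))
```

Because the absolute value is never negative, β is always at least as large as the largest angle in S.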

To calculate the set **S**, we use the two expressions of the scalar product:

$$M_{i} \cdot M_{j}=\sum_{k=1}^{d}M_{i}[k]M_{j}[k]$$

where $d$ is the length of the model vectors (one entry per bi-phoneme), and

$$M_{i} \cdot M_{j}=\left \| M_{i} \right \| \cdot \left \| M_{j} \right \|\cdot \cos \left (\alpha_{ij}  \right )$$

where $\alpha_{ij}$ is the angle between $M_{i}$ and $M_{j}$. Equating the two gives:

$$\cos \left (\alpha_{ij}  \right ) = \frac{M_{i} \cdot M_{j}}{\left \| M_{i} \right \| \cdot \left \| M_{j} \right \|}$$
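
As a small worked example with two hypothetical two-dimensional models, $M_{1} = (1, 0)$ and $M_{2} = (1, 1)$:

$$M_{1} \cdot M_{2} = 1 \cdot 1 + 0 \cdot 1 = 1, \qquad \left \| M_{1} \right \| = 1, \qquad \left \| M_{2} \right \| = \sqrt{2}$$

$$\cos \left (\alpha_{12} \right ) = \frac{1}{\sqrt{2}} \quad \Rightarrow \quad \alpha_{12} = \frac{\pi}{4} \approx 0.785 \text{ rad}$$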

In [17]:
import re
import string


class Cleaner:
    """
    This class will serve for cleanning the corpus.
    It keeps only the arabic letters in a corpus, removing punctuations, spectial caracters 
    and normalizing the arabic letters. 
    """
    
    @classmethod
    def keep_only_arabic(cls, text):
        """ Keep only arabic letters
        The interval [\u0600-\u06FF] represents utf-8 code point representation for the arabic letters
        For more information visit https://www.utf8-chartable.de/unicode-utf8-table.pl and chose Arabic as block
        """
        if not text:
            return text
        else:
            text = re.findall(r'[\u0600-\u06FF]+', text)
            # Remove all items in the resulting list that have len(item) <= 1
            text = [word for word in text if len(word) > 1]
            # Rejoin the words
            text = ' '.join(text)
            return text

    @classmethod
    def remove_diacritics(cls, text):
        """Remove diacritics"""
        arabic_diacritics = re.compile("""
                                     ّ    | # Tashdid
                                     َ    | # Fatha
                                     ً    | # Tanwin Fath
                                     ُ    | # Damma
                                     ٌ    | # Tanwin Damm
                                     ِ    | # Kasra
                                     ٍ    | # Tanwin Kasr
                                     ْ    | # Sukun
                                     ـ     # Tatwil/Kashida
                                 """, re.VERBOSE)
        if not text:
            return text
        else:
            text = re.sub(arabic_diacritics, '', text)
            return text

    @classmethod
    def remove_punctuations(cls, text):
        """Remove punctuations"""
        arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
        english_punctuations = string.punctuation
        punctuations_list = arabic_punctuations + english_punctuations

        if not text:
            return text
        else:
            translator = str.maketrans('', '', punctuations_list)
            return text.translate(translator)

    @classmethod
    def normalize_arabic(cls, text):
        """Normalize characters"""
        if not text:
            return text
        else:
            text = re.sub("[إأآ]", "ا", text)
            text = re.sub("ى", "ي", text)
            text = re.sub("ؤ", "و", text)
            text = re.sub("ئ", "ي", text)
            text = re.sub("ة", "ت", text)
            text = re.sub("گ", "ك", text)
            return text
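
The effect of the `[\u0600-\u06FF]+` filter used by `keep_only_arabic` can be checked in isolation. The snippet below is a standalone sketch that does not call `Cleaner`; the sample sentence is made up for illustration.

```python
import re

# Hypothetical mixed-script input
sample = "Hello مرحبا 123 بالعالم!"

# Keep only runs of characters from the Unicode Arabic block
arabic_runs = re.findall(r'[\u0600-\u06FF]+', sample)

# Drop single-character runs, then rejoin the remaining words with spaces
kept = ' '.join(run for run in arabic_runs if len(run) > 1)
print(kept)
```

The Latin word, the ASCII digits and the exclamation mark fall outside the range, so only the two Arabic words remain.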


In [18]:
import numpy as np


class Corpus:
    """This class will be the numerical representation for a corpus"""
    
    def __init__(self, corpus=None, path_to_corpus=None):
        # Either a path to a corpus file or the corpus text itself is expected.
        # There is no check for both being None;
        # it may be added in later development.
        self.corpus = corpus
        self.path_to_corpus = path_to_corpus

        # If path to corpus file is set
        # read the file and store its content to self.corpus
        if self.path_to_corpus:
            with open(file=self.path_to_corpus, encoding="utf-8") as file:
                self.corpus = file.read()

        # A dict of bi-phonemes and corresponding frequency
        self.bi_phonemes_frequencies = self.calc_bi_ph_freq()
        # A NumPy array of the bi-phoneme frequencies
        self.bi_phonemes_frequencies_list = np.array(list(self.bi_phonemes_frequencies.values()))

    def calc_bi_ph_freq(self):
        # Step 1
        # Clean the corpus
        # Keep only arabic letters
        self.corpus = Cleaner.keep_only_arabic(self.corpus)
        # Remove diacritics, special characters, punctuations...
        self.corpus = Cleaner.remove_diacritics(self.corpus)
        self.corpus = Cleaner.remove_punctuations(self.corpus)
        # Normalize characters
        self.corpus = Cleaner.normalize_arabic(self.corpus)

        # Generate a dict with all possible bi-phonemes arrangement from the arabic alphabet
        # with values initialized to 0.0
        arabic_alphabet = 'ابجدهوزحطيكلمنسعفصقرشتثخذضظغ'
        bi_phonemes_arrangement = {''.join([letter_1, letter_2]): 0.0 for letter_1 in arabic_alphabet for letter_2 in
                                   arabic_alphabet}
        # Split the corpus into a list of words
        corpus_word_list = self.corpus.split()

        bi_phonemes_from_corpus = []

        # For each word in the corpus
        for word in corpus_word_list:
            # Example:
            # word = 'حمزة'
            # i takes the values 1, 2 and 3
            for i in range(1, len(word)):
                # In this example, bi_phoneme_from_word is successively
                # 'حم'
                # 'مز'
                # 'زة'
                bi_phoneme_from_word = word[i - 1] + word[i]
                # Increment the corresponding key
                if bi_phoneme_from_word in bi_phonemes_arrangement:
                    bi_phonemes_arrangement[bi_phoneme_from_word] += 1
                bi_phonemes_from_corpus.append(bi_phoneme_from_word)

        # Guard against an empty corpus, which would otherwise divide by zero
        total_bi_phonemes = len(bi_phonemes_from_corpus)
        if total_bi_phonemes == 0:
            raise ValueError("The corpus contains no bi-phonemes after cleaning")

        # A dict of bi-phonemes and their relative frequencies
        bi_phonemes_frequencies = {bi_phoneme: bi_phoneme_count / total_bi_phonemes
                                   for bi_phoneme, bi_phoneme_count in bi_phonemes_arrangement.items()}
        return bi_phonemes_frequencies
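
The counting step in `calc_bi_ph_freq` can be illustrated on a tiny input. This is a simplified sketch, not the class above: the helper name `bi_phoneme_frequencies` is hypothetical, it relies on `collections.Counter`, and it only reports pairs that actually occur instead of every possible pair of alphabet letters.

```python
from collections import Counter


def bi_phoneme_frequencies(text):
    """Relative frequency of each pair of adjacent letters within words."""
    pairs = [word[i - 1] + word[i]
             for word in text.split()
             for i in range(1, len(word))]
    total = len(pairs)
    if total == 0:
        return {}
    counts = Counter(pairs)
    return {pair: count / total for pair, count in counts.items()}


# A three-letter word and a four-letter word give 2 + 3 adjacent pairs
frequencies = bi_phoneme_frequencies('بسم الله')
for pair, freq in frequencies.items():
    print(pair, freq)
```

Pairs never span a word boundary, which matches the per-word loop in the class.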


In [19]:
import numpy as np
from numpy.linalg import norm


class Model:
    """This class will serve for determinating and hloding our magic numbers"""
    
    def __init__(self, *args):
        # A list of bi-phonemes frequencies, we'll name it models, a model for each corpus
        self._models = []
        self._models.extend(args)

        # Initialize S (self.separating_angles)
        # S will contain the angles separating the models
        self.angles_between_models = [[0 for _ in range(len(args))] for _ in range(len(args))]

        # Calculate the angles that separates the models,
        # and store each value in it's appropriate location in S
        for index1, model1 in enumerate(self._models):
            for index2, model2 in enumerate(self._models):
                # Calculate the scalar product
                # We can replace @ operator with np.dot()
                scalar_product = model1 @ model2
                # Calculate norm product
                norm_product = np.linalg.norm(model1) * np.linalg.norm(model2)
                # Calculate cos(alpha)
                # Calculate cos(alpha)
                # cos_alpha = (model1 @ model2) / (norm(model1) * norm(model2))
                cos_alpha = scalar_product / norm_product

                # cos_alpha = np.round(cos_alpha, 10)

                # if cos_alpha > 1:
                #     cos_alpha = 1
                # elif cos_alpha < -1:
                #     cos_alpha = -1
                angle = np.arccos(np.clip(cos_alpha, -1, 1))

                self.angles_between_models[index1][index2] = angle

        # Final step
        # Getting avg, max and std out of S
        # S is a symmetric matrix, so we need only the half -1 of its values
        # The upper or lower half doesn't matter
        # Converting S into a set
        x = set(np.ndarray.flatten(np.array(self.angles_between_models)))
        x.discard(0)
        x = np.array(list(x))
        self._max_s = np.max(x)
        self._avg_s = np.average(x)
        self._std_s = np.std(x)
        # Our beloved BETA
        self.beta = self._max_s + np.absolute(self._avg_s - self._std_s)

        # A global model for later usage when verifying a speech if healthy or not
        self.global_reference_model = np.average(np.array(self._models), axis=0)
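
The double loop in `Model` can also be written with array operations. The following is an alternative sketch, not the notebook's implementation; the function name `pairwise_angles` and the toy models are assumptions made for illustration.

```python
import numpy as np


def pairwise_angles(models):
    """Matrix of angles (in radians) between the rows of `models`."""
    m = np.asarray(models, dtype=float)
    # Scale every row to unit length so the dot product equals cos(alpha)
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    # Clip to [-1, 1] to guard against floating-point round-off
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return np.arccos(cos)


# Three hypothetical models of dimension 2
toy_models = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
angles = pairwise_angles(toy_models)

# Strict upper triangle: each pairwise angle exactly once
s = angles[np.triu_indices(len(toy_models), k=1)]
# Reference model: element-wise average of the models
reference = np.mean(toy_models, axis=0)
print(np.round(s, 4), reference)
```

Taking the strict upper triangle avoids relying on the diagonal being exactly zero after round-off.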


# Classification procedure

The proposed method for classifying Arabic speech can be summarized in the following steps:

- Generate n phonetic models of Arabic speech (one corpus per phonetic model) and calculate the maximum distance between them (the phonetic distance).
- Generate the phonetic reference model (the average of the n previous models).
- For each new sequence to be classified, generate the phonetic model specific to the speaker (the speaker can be normal, native, with a disability, …).
- Compare these two models and classify the input speech as healthy or pathological.
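
The last two steps can be sketched as follows, assuming that "compare" means measuring the angle between the speaker's model and the reference model and checking it against the phonetic distance β. The function name `classify`, the threshold value and the toy vectors are hypothetical.

```python
import numpy as np


def classify(speaker_model, reference_model, beta):
    """Label a speaker by the angle between their model and the reference."""
    u = np.asarray(speaker_model, dtype=float)
    v = np.asarray(reference_model, dtype=float)
    cos_phi = np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    # Assumption: an angle of at least beta is treated as pathological
    return 'Pathological' if phi >= beta else 'Healthy'


reference = [0.5, 0.5]         # hypothetical reference model
close_speaker = [0.45, 0.55]   # small angle to the reference
far_speaker = [1.0, 0.0]       # angle of pi/4 to the reference
print(classify(close_speaker, reference, beta=0.5))
print(classify(far_speaker, reference, beta=0.5))
```

The notebook itself derives `phi` from `Model(...).beta`, so its value can differ from the plain angle used in this sketch.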

In [20]:
import glob

In [29]:
# Step 1
# Get beta and the global reference model
dataset_path = r'C:\PFE\datasets\v1\*.txt'
models = []
for file_name in glob.glob(dataset_path):
    models.append(Corpus(path_to_corpus=file_name).bi_phonemes_frequencies_list)

m = Model(*models)

beta = m.beta
GLOBAL_REF_MODEL = m.global_reference_model

with open(file='text.txt', encoding="utf-8") as file:
    txt = file.read()

# Test example 1
speaker_1 = " السلحفاة "
# Test example 2
speaker_2 = " الثلحفاة "
# Test example 3
speaker_3 = " بسم الله الرحمن الرحيم "
# Test example 4
speaker_4 = " بسم الله الرحمان الرحيم "

txt = txt + speaker_4

In [30]:
import pandas as pd

result = pd.DataFrame(np.round(m.angles_between_models, 7),
                      columns=['Model_1', 'Model_2', 'Model_3', 'Model_4'],
                      index=['Model_1', 'Model_2', 'Model_3', 'Model_4'])
result

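|         | Model_1  | Model_2  | Model_3  | Model_4  |
|---------|----------|----------|----------|----------|
| Model_1 | 0.0      | 0.190761 | 0.471203 | 0.224548 |
| Model_2 | 0.190761 | 0.0      | 0.473858 | 0.236543 |
| Model_3 | 0.471203 | 0.473858 | 0.0      | 0.483466 |
| Model_4 | 0.224548 | 0.236543 | 0.483466 | 0.0      |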


In [31]:
# Step 2
# Get the speaker's beta (phi), computed from the speaker's model and the global reference model
speaker = Model(Corpus(corpus=txt).bi_phonemes_frequencies_list, GLOBAL_REF_MODEL)

phi = speaker.beta

In [32]:
# Step 3
# Get the classification
if phi >= beta:
    print('classification: Pathological')
else:
    print('classification: Healthy')

classification: Pathological
