##### Description of functions and parameters:

1. **Imports and basic configuration**
   - Imports the required libraries: `sys`, `os`, `re`, `shutil`, `subprocess`, plus `praatio` and the project-specific `outils.conll3` module.
   - Appends the path to the `scripts` folder to `sys.path`.

2. **convert_misc_to_dict(misc: str) -> dict**
   - Converts a MISC string made of `key=value` pairs separated by `|` into a dictionary.
   - Parameter:
     - `misc`: String holding the miscellaneous (MISC) data.
   - Returns:
     - A dictionary of the key-value pairs extracted from the string.

3. **create_textgrid(file_path: str, output_textgrid_path: str)**
   - Creates a TextGrid file from a CoNLL-U file.
   - Parameters:
     - `file_path`: Path to the CoNLL-U file.
     - `output_textgrid_path`: Path where the output TextGrid file is saved.

4. **preprocess_text_list(text_list: list, file_name: str) -> list**
   - Preprocesses a list of words by splitting contractions into their component tokens (for example, `don't` becomes `do` and `n't`).
   - Parameters:
     - `text_list`: List of words to preprocess.
     - `file_name`: File name, used to decide which contractions are split.
   - Returns:
     - A list of words in which the contractions have been replaced by their component tokens.

5. **create_textgrid_taln(file_path: str, output_textgrid_path: str)**
   - Creates a TextGrid file from a CoNLL-U file, with adjustments specific to certain files (contractions are split through `preprocess_text_list`, depending on the file name).
   - Parameters:
     - `file_path`: Path to the CoNLL-U file.
     - `output_textgrid_path`: Path where the output TextGrid file is saved.
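
To make the `key=value` format handled by `convert_misc_to_dict` concrete, here is a minimal standalone sketch. The helper name `parse_misc` and the timing values are made up for illustration; the only things taken from the notebook are the `|` separator, the `AlignBegin`/`AlignEnd` keys, and the division by 1000 that turns milliseconds into seconds.

```python
def parse_misc(misc: str) -> dict:
    """Illustrative parser for a 'key=value|key=value' MISC string (hypothetical helper)."""
    result = {}
    for pair in misc.split("|"):
        # Placeholders such as '_' carry no key=value data
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        result[key] = value
    return result


# Made-up alignment values, in milliseconds
example = "AlignBegin=1200|AlignEnd=1850"
parsed = parse_misc(example)

# The notebook converts alignment times to seconds before building intervals
start_s = int(parsed["AlignBegin"]) / 1000
end_s = int(parsed["AlignEnd"]) / 1000
print(parsed, start_s, end_s)
```

An empty MISC field, written `_` in CoNLL-U, yields an empty dictionary with this sketch.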

##### Global parameters:

- `input_conllu_dir`: Directory containing the CoNLL-U files to process.
- `output_tg_dir`: Directory where the output TextGrid files are saved.
- `input_wav_dir`: Directory containing the WAV audio files that match the CoNLL-U files.

##### Main process:

1. The script iterates over every file in `input_conllu_dir`.
2. For each file ending in `_MG.conllu` or `_M.conllu`:
   - It determines the output path of the TextGrid file.
   - It calls `create_textgrid` for `_M.conllu` files and `create_textgrid_taln` for `_MG.conllu` files (files whose name contains `PRT_01` are skipped in the `_MG` case).
   - It copies the matching WAV audio file into the output directory, if it exists and has not been copied yet.
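
The routing described above can be sketched as two small pure functions. The names `output_folder_for` and `converter_for` are hypothetical and do not appear in the notebook; they only restate the rules of the main loop (three name parts for `ABJ` files, two otherwise, and one converter per suffix).

```python
import os
from typing import Optional


def output_folder_for(stem: str) -> str:
    """Return the per-recording folder name derived from a file stem."""
    parts = stem.split("_")
    # ABJ recordings use three name parts (e.g. ABJ_GWA_14), the others two
    keep = 3 if stem.startswith("ABJ") else 2
    return "_".join(parts[:keep])


def converter_for(file_name: str) -> Optional[str]:
    """Return the name of the conversion function used for a CoNLL-U file."""
    if file_name.endswith("_MG.conllu"):
        return "create_textgrid_taln"
    if file_name.endswith("_M.conllu"):
        return "create_textgrid"
    return None  # any other file is ignored


for name in ["ABJ_GWA_14_Mary-Lifestory_MG.conllu",
             "ENU_07_South-Eastern-Politics_M.conllu"]:
    stem = os.path.splitext(name)[0]
    print(name, "->", output_folder_for(stem), converter_for(name))
```

Keeping these rules in pure functions makes them easy to check without touching the file system.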

##### Summary:
The script converts CoNLL-U files into TextGrid files, preprocessing the text to handle contractions, and copies the matching WAV audio files into a structured output directory.

In [1]:
import os
import re
import shutil
import subprocess
import sys

current_path = os.getcwd()
scripts_path = os.path.join(current_path, 'scripts')

# Add the path to the 'scripts' folder dynamically
sys.path.append(scripts_path)

from outils.conll3 import *
from praatio import textgrid

In [2]:
def convert_misc_to_dict(misc: str) -> dict:
    """
    Convert miscellaneous information from a string to a dictionary.

    Parameters:
    misc (str): A string containing miscellaneous data in key=value format, separated by '|'.

    Returns:
    dict: A dictionary with key-value pairs extracted from the miscellaneous string.
    """
    result_dict = {}
    for pair in misc.split("|"):
        # Skip placeholders such as '_' or '#' that carry no key=value data
        if "=" not in pair:
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, value = pair.split("=", 1)
        result_dict[key] = value
    return result_dict

In [3]:
def create_textgrid(file_path: str, output_textgrid_path: str):
    """
    Create a TextGrid file from a CoNLL-U file.

    Parameters:
    file_path (str): Path to the CoNLL-U file.
    output_textgrid_path (str): Path where the output TextGrid file will be saved.
    """
    tg = textgrid.Textgrid()
    entryList = []
    trees = conllFile2trees(file_path)
    last_end_time = 0
    # post_start_time = 0
    intervals = []
    for tree_pos, tree in enumerate(trees):
        line = str(tree)
        # print(line)
        text_match = re.search(r"# text = (.+)", line)
        print("\n text_match:", text_match)  # Ensure text_match is not None
        if text_match:
            text_list = text_match.group(1).split()
        else:
            continue  # Skip this iteration if no match is found
        # print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
        print(text_list)
        misc_list = []
        # print(line)
        for l in line.split("\n"):
            if re.search(r"\d: ", l):
                idx = int(l.split(":")[0])  # get index
                misc_list.append(
                    {idx: convert_misc_to_dict(tree[idx].get("misc", "#"))}
                )
        # print(misc_list)
        # print("--------------------------------------------------------------------------------")
        words = tree.words
        # print(words)
        # print("=================================================================================")
        sentence = []
        i_text = 0
        current_text = ""
        current_xmin = 0
        current_xmax = 0
        for i in range(len(words)):
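            if i == 0 and tree_pos == 0:
                # Pad the start of the recording with a silence interval ("#")
                # that runs up to the first token's AlignBegin (milliseconds).
                first_begin = misc_list[i_text].get(i_text + 1).get("AlignBegin")
                if first_begin is not None:
                    current_xmax = int(first_begin) / 1000
                    if current_xmax != 0.0:
                        intervals.append((0, current_xmax, "#"))
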

            if words[i] != "#":
                sentence.append(words[i])
                
            if i == len(words) - 1:
                if sentence:
                    print('sentence: ', sentence)
                    sentence_without_punc = [
                        sentence[j]
                        for j in range(len(sentence))
                        if tree[(i_text + j + 1)].get("tag") != "PUNCT"
                    ]
                    # put the words of the sentence into a string
                    current_text = " ".join(sentence_without_punc)
                    start = i_text
                    while i_text < len(text_list):
                        print(
                            "text_list[i_text]:",
                            text_list[i_text],
                            "sentence[-1]:",
                            sentence[-1],
                            "i_text - start +1:",
                            i_text - start + 1,
                            "len(sentence):",
                            len(sentence),
                            "start:",
                            start,
                            "i_text:",
                            i_text,
                            "text_list[i_text]:",
                            text_list[i_text],
                        )
                        if len(sentence) == 1 and text_list[i_text] != "#":
                            align_begin = (
                                misc_list[i_text].get(i_text + 1).get("AlignBegin")
                            )
                            current_xmin = int(align_begin) / 1000
                            align_end = (
                                misc_list[i_text].get(i_text + 1).get("AlignEnd")
                            )
                            current_xmax = int(align_end) / 1000
                            if current_xmin != current_xmax:
                                intervals.append((
                                        current_xmin, current_xmax, current_text
                                    )
                                )
                            i_text += 1
                            break

                        elif text_list[i_text] == sentence[0] and start == i_text:
                            pos = tree[i_text + 1].get("tag")
                            temp_i_text = i_text
                            while pos == "PUNCT" and words[i_text] != "#":
                                temp_i_text += 1
                                pos = tree[temp_i_text + 1].get("tag")
                            align_begin = (
                                misc_list[temp_i_text]
                                .get(temp_i_text + 1)
                                .get("AlignBegin")
                            )
                            current_xmin = int(align_begin) / 1000

                        elif text_list[i_text] == sentence[
                            -1
                        ] and i_text - start + 1 == len(sentence):
                            print(sentence)
                            temp_i_text = i_text
                            pos = tree[temp_i_text + 1].get("tag")
                            not_punct = ["#"]
                            while pos == "PUNCT" and words[i_text] not in not_punct:
                                temp_i_text -= 1
                                pos = tree[temp_i_text + 1].get("tag")
                            align_end = (
                                misc_list[temp_i_text]
                                .get(temp_i_text + 1)
                                .get("AlignEnd")
                            )
                            current_xmax = int(align_end) / 1000
                            if current_xmin != current_xmax:
                                intervals.append((
                                        current_xmin, current_xmax, current_text
                                    )
                                )
                            i_text += 1
                            break
                        i_text += 1

    tier = textgrid.IntervalTier('trans', intervals, 0, current_xmax)
    tg.addTier(tier)
    tg.save(output_textgrid_path, format="long_textgrid", includeBlankSpaces=True)
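
The leading-silence step at the top of `create_textgrid` can be isolated as follows. This is a minimal sketch: the helper name `leading_silence` and the sample values are made up, and only the rule itself (an interval labelled `#` from 0 to the first `AlignBegin`, converted from milliseconds to seconds) comes from the function above.

```python
def leading_silence(first_align_begin_ms):
    """Return a (xmin, xmax, label) silence interval, or None if none is needed."""
    if first_align_begin_ms is None:
        return None  # no alignment information on the first token
    end_s = int(first_align_begin_ms) / 1000
    if end_s == 0.0:
        return None  # speech starts at 0, so there is nothing to pad
    return (0, end_s, "#")


print(leading_silence("350"))
print(leading_silence("0"))
print(leading_silence(None))
```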

In [4]:
def preprocess_text_list(text_list: list, file_name: str) -> list:
    """
    Preprocesses a list of words by splitting contractions into their component tokens.

    Parameters:
    text_list (list): A list of words.
    file_name (str): Name (or path) of the file, used to decide which contractions are split.

    Returns:
    list: A list of words with contractions replaced by their component tokens.
    """
    new_text_list = []
    contractions = {
        "don't": ["do", "n't"],
        "i'm": ["i", " 'm"],
        "what's": ["what", " 's"],
        "can't": ["ca", "n't"],
        "we're": ["we", " 're"],
        # "dat's": ["dat", " 's"], # occurs in both forms in the corpus (dat's and dat 's)
        "didn't": ["did", "n't"],
        "devil's": ["devil", " 's"],
        "you'll": ["you", " 'll"],
        }

    if "WAZP_03" not in file_name and "ENU_01" not in file_name and "ONI_07" not in file_name:
        contractions["it's"] = ["it", " 's"]
        
    if "WAZA_02" not in file_name and "PRT_11" not in file_name:
        contractions["cannot"] = ["can", "not"]
    
    if "KAD_10" in file_name:
        contractions["dat's"] = ["dat", " 's"]
        
    for word in text_list:
        if word.lower() in contractions:  
            new_text_list.extend(contractions[word.lower()])
        else:
            new_text_list.append(word)
    return new_text_list
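
As a quick illustration of the lookup performed above, the following standalone sketch applies a two-entry table to a short word list. The function name `split_contractions` and the sample sentence are hypothetical; the table entries are copied from `preprocess_text_list`.

```python
def split_contractions(words, table):
    """Replace each word found in the table (case-insensitively) by its tokens."""
    out = []
    for word in words:
        # Words missing from the table are kept unchanged
        out.extend(table.get(word.lower(), [word]))
    return out


table = {"don't": ["do", "n't"], "i'm": ["i", " 'm"]}
result = split_contractions(["I'm", "sure", "you", "don't"], table)
print(result)
```

Because the replacement tokens come from the table, the original casing of a matched word is lost (`I'm` yields `i` and ` 'm`), and the leading space stored in some entries is carried into the output.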

In [5]:
def create_textgrid_taln(file_path: str, output_textgrid_path: str) -> None:
    """
    Create a TextGrid file from a CoNLL-U formatted file. Here, remove the # in the gold CoNLL-U files.

    Parameters:
    file_path (str): Path to the CoNLL-U file.
    output_textgrid_path (str): Path where the output TextGrid file will be saved.
    
    Returns:
    None
    """
    tg = textgrid.Textgrid()
    entryList = []
    trees = conllFile2trees(file_path)
    last_end_time = 0

    intervals = []
    for tree_pos, tree in enumerate(trees):
        line = str(tree)
        text_match = re.search(r"# text = (.+)", line)
        if text_match is None:
            continue  # Skip trees without a "# text" line, as create_textgrid does
        text_list = text_match.group(1).split()
        text_list = preprocess_text_list(text_list, file_path)
        misc_list = []
        for l in line.split("\n"):
            if re.search(r"\d: ", l):
                idx = int(l.split(":")[0])  # get index
                misc_list.append(
                    {idx: convert_misc_to_dict(tree[idx].get("misc", "#"))}
                )
        words = tree.words
        sentence = []
        i_text = 0
        current_text = ""
        current_xmin = 0
        current_xmax = 0

        # print("\nwords: ", words)
        for i in range(len(words)):
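            if i == 0 and tree_pos == 0:
                # Pad the start of the recording with a silence interval ("#")
                # that runs up to the first token's AlignBegin (milliseconds).
                first_begin = misc_list[i_text].get(i_text + 1).get("AlignBegin")
                if first_begin is not None:
                    current_xmax = int(first_begin) / 1000
                    if current_xmax != 0.0:
                        intervals.append((0, current_xmax, "#"))
                        last_end_time = current_xmax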

            # Include words starting with #
            sentence.append(words[i])

            if i == len(words) - 1:
                if sentence:
                    
                    sentence_without_punc = [
                        sentence[j]
                        for j in range(len(sentence))
                        if tree[(i_text + j + 1)].get("tag") != "PUNCT"
                    ]
                    current_text = " ".join(sentence_without_punc)

                    print('\nsentence: ', sentence)
                    start = i_text

                    while i_text < len(text_list):
                        # print(
                        #     "text_list[i_text]:",
                        #     text_list[i_text],
                        #     "sentence[-1]:",
                        #     sentence[-1],
                        #     "i_text - start +1:",
                        #     i_text - start + 1,
                        #     "len(sentence):",
                        #     len(sentence),
                        #     "start:",
                        #     start,
                        #     "i_text:",
                        #     i_text,
                        #     "text_list[i_text]:",
                        #     text_list[i_text],
                        # )

                        if len(sentence) == 1:
                            print("len(sentence) == 1")
                            print("sentence: ", sentence)
                            align_begin = (
                                misc_list[i_text].get(i_text + 1).get("AlignBegin")
                            )
                            current_xmin = int(align_begin) / 1000
                            align_end = (
                                misc_list[i_text].get(i_text + 1).get("AlignEnd")
                            )
                            current_xmax = int(align_end) / 1000
                            if current_xmin != current_xmax:
                                if intervals and current_xmin < intervals[-1][1]:
                                    current_xmin = intervals[-1][1]
                                intervals.append(
                                    (
                                        current_xmin, current_xmax, current_text
                                    )
                                )
                            i_text += 1
                            break

                        elif text_list[i_text] == sentence[0] and start == i_text:
                            print("text_list[i_text] == sentence[0] and start == i_text")
                            print("sentence: ", sentence)
                            pos = tree[i_text + 1].get("tag")
                            temp_i_text = i_text
                            while pos == "PUNCT":
                                temp_i_text += 1
                                pos = tree[temp_i_text + 1].get("tag")
                            align_begin = (
                                misc_list[temp_i_text]
                                .get(temp_i_text + 1)
                                .get("AlignBegin")
                            )
                            current_xmin = int(align_begin) / 1000

                        elif text_list[i_text] == sentence[
                            -1
                        ] and i_text - start + 1 == len(sentence):
                            print("text_list[i_text] == sentence[-1] and i_text - start + 1 == len(sentence)")
                            print("sentence: ", sentence)
                            temp_i_text = i_text
                            pos = tree[temp_i_text + 1].get("tag")
                            while pos == "PUNCT":
                                temp_i_text -= 1
                                pos = tree[temp_i_text + 1].get("tag")
                            align_end = (
                                misc_list[temp_i_text]
                                .get(temp_i_text + 1)
                                .get("AlignEnd")
                            )
                            current_xmax = int(align_end) / 1000
                            if current_xmin != current_xmax:
                                if intervals and current_xmin < intervals[-1][1]:
                                    current_xmin = intervals[-1][1]
                                intervals.append(
                                    (
                                        current_xmin, current_xmax, current_text
                                    )
                                )
                            i_text += 1
                            break
                        i_text += 1
                # print(intervals)

    tier = textgrid.IntervalTier('trans', intervals, 0, current_xmax)
    tg.addTier(tier)
    tg.save(output_textgrid_path, format="long_textgrid", includeBlankSpaces=True)
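
One difference from `create_textgrid` is that `create_textgrid_taln` moves the start of an interval forward when it would overlap the previous one. A minimal sketch of that rule, using a hypothetical helper `append_clamped` and made-up times:

```python
def append_clamped(intervals, xmin, xmax, label):
    """Append (xmin, xmax, label), moving xmin forward to avoid overlap."""
    if xmin == xmax:
        return  # zero-length intervals are dropped, as in the function above
    if intervals and xmin < intervals[-1][1]:
        # Start where the previous interval ends
        xmin = intervals[-1][1]
    intervals.append((xmin, xmax, label))


intervals = []
append_clamped(intervals, 0.0, 1.5, "okay")
append_clamped(intervals, 1.2, 2.0, "toh na well")  # overlaps by 0.3 s
append_clamped(intervals, 2.0, 2.0, "dropped")      # zero length
print(intervals)
```

Like the original code, this sketch tests for a zero-length interval before clamping, so it does not guard against an interval whose adjusted start ends up at or after its end.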


## Parameters

In [6]:
input_conllu_dir = "../data/SUD_Naija-NSC-master/" # Path to the directory containing the CoNLL-U files
output_tg_dir = "../data/TG_WAV/" # Path to the directory where the TextGrid files will be saved
input_wav_dir = "/Users/perrine/Desktop/Stage_2023-2024/WAV/" # Path to the directory containing the WAV files

# Create the output directory if it does not exist yet
os.makedirs(output_tg_dir, exist_ok=True)

In [7]:
for fichier in os.listdir(input_conllu_dir):
    # Match the same suffixes as the branches below (_M = create_textgrid, _MG = create_textgrid_taln)
    if fichier.endswith(("_MG.conllu", "_M.conllu")):
        chemin_conllu = os.path.join(input_conllu_dir, fichier)
        # Generate the output TextGrid file name
        nom_fichier_sans_extension = os.path.splitext(fichier)[0]

        if nom_fichier_sans_extension.startswith("ABJ"):
            folder = "_".join(nom_fichier_sans_extension.split("_")[:3])
        else:
            folder = "_".join(nom_fichier_sans_extension.split("_")[:2])

        output_folder_path = os.path.join(output_tg_dir, folder)

        if not os.path.exists(output_folder_path):
            os.makedirs(output_folder_path)
            
        # print(nom_fichier_sans_extension)
        chemin_textgrid = os.path.join(
            "", f"./{output_tg_dir}/{folder}/{nom_fichier_sans_extension}.TextGrid"
        )

        # chemin_textgrid = os.path.join('', f'./TEXTGRID_WAV_gold_non_gold_TALN/{folder}/{nom_fichier_sans_extension}.TextGrid')

        directory = os.path.dirname(chemin_textgrid)
        if not os.path.exists(directory):
            os.makedirs(directory)

        # Call the function for each CoNLL-U file
        if fichier.endswith("_M.conllu"):
            # if nom_fichier_sans_extension == "ENU_07_South-Eastern-Politics_M":
            print("\n Creating TextGrid for", nom_fichier_sans_extension)
            create_textgrid(chemin_conllu, chemin_textgrid)
            
            # Copy the corresponding WAV file to the new directory
            wave_file = nom_fichier_sans_extension + ".wav"
            wave_file_src = os.path.join(input_wav_dir, wave_file)
            wave_file_dst = os.path.join(output_folder_path, wave_file)
            
            if os.path.exists(wave_file_src) and not os.path.exists(wave_file_dst):
                print("Copying", wave_file_src, "to", wave_file_dst)
                shutil.copy(wave_file_src, wave_file_dst)

        elif fichier.endswith("_MG.conllu"):
            if "PRT_01" not in nom_fichier_sans_extension:
                print("\n Creating TextGrid for", nom_fichier_sans_extension, 'in', chemin_textgrid)
                create_textgrid_taln(chemin_conllu, chemin_textgrid)

                # Copy the corresponding WAV file to the new directory
                wave_file = nom_fichier_sans_extension + ".wav"
                wave_file_src = os.path.join(input_wav_dir, wave_file)
                wave_file_dst = os.path.join(output_folder_path, wave_file)

                if os.path.exists(wave_file_src) and not os.path.exists(wave_file_dst):
                    print("Copying", wave_file_src, "to", wave_file_dst)
                    shutil.copy(wave_file_src, wave_file_dst)


 Creating TextGrid for ABJ_GWA_14_Mary-Lifestory_MG in ./../data/TG_WAV//ABJ_GWA_14/ABJ_GWA_14_Mary-Lifestory_MG.TextGrid

sentence:  ['okay', '//']
text_list[i_text] == sentence[0] and start == i_text
sentence:  ['okay', '//']
text_list[i_text] == sentence[-1] and i_text - start + 1 == len(sentence)
sentence:  ['okay', '//']

sentence:  ['toh', '#', 'na', 'well', '//']
text_list[i_text] == sentence[0] and start == i_text
sentence:  ['toh', '#', 'na', 'well', '//']
text_list[i_text] == sentence[-1] and i_text - start + 1 == len(sentence)
sentence:  ['toh', '#', 'na', 'well', '//']

sentence:  ['#', 'make', 'I', 'talk', '{', 'as', 'm~', '||', 'about', '}', 'my', 'life', '{', 'as', 'moder', 'now', '||', 'as', 'eh', 'motherhood', 'now', '}', '//']
text_list[i_text] == sentence[0] and start == i_text
sentence:  ['#', 'make', 'I', 'talk', '{', 'as', 'm~', '||', 'about', '}', 'my', 'life', '{', 'as', 'moder', 'now', '||', 'as', 'eh', 'motherhood', 'now', '}', '//']
text_list[i_text] == sent