##### Description of functions and parameters:

1. **Imports and basic configuration**
   - Imports the required libraries: `re`, `os`, `sys`, `textgrid` from `praatio`, and `outils.conll3`.
   - Dynamically adds the path to the 'scripts' folder to `sys.path`.

2. **Functions**
   - **load_textgrid(textgrid: str, alignment=False) -> dict**
     - Retrieves the sentences of a TextGrid file as a dictionary.
     - Parameters:
       - `textgrid`: Path to the TextGrid file.
       - `alignment`: Boolean indicating whether the word alignments should be included.
     - Returns:
       - A dictionary containing the sentences, keyed by file name and then by sentence ID.

   - **update_sent_text_conllu(conllu: str, sentences: dict, out: str)**
     - Updates the sentence texts in a CoNLL-U file using the sentences provided in a dictionary.
     - Parameters:
       - `conllu`: Path to the CoNLL-U file.
       - `sentences`: Dictionary containing the sentences to update.
       - `out`: Path to the output file where the updated CoNLL-U file is saved.
     - Returns:
       - Two dictionaries mapping each sentence ID to its old text and its new text, respectively.

   - **compare_and_replace(old_tokens, new_tokens)**
     - Matches the tokens of the new sentence against those of the old sentence. A new token that matches an old one (ignoring case and `~`, with the same neighbouring tokens) takes the old form; any other new token is kept as is.
     - Parameters:
       - `old_tokens`: The tokens of the old sentence, as an `{index: token}` dictionary.
       - `new_tokens`: The tokens of the new sentence, as an `{index: token}` dictionary.
     - Returns:
       - A renumbered `{index: token}` dictionary for the new sentence, with matched tokens replaced by their old forms.

   - **update_conllu_id(conllu_path: str, output_path: str, old_sent: dict) -> None**
     - Updates the token indices and adjusts the HEAD indices in a CoNLL-U file.
     - Parameters:
       - `conllu_path`: Path of the CoNLL-U file to update.
       - `output_path`: Path to the output file where the updated CoNLL-U file is saved.
       - `old_sent`: Dictionary containing the previous sentences.

   - **correct_alignements(sentences: dict, conllu_path: str)**
     - Corrects the alignments in a CoNLL-U file, in place, using the alignments provided in a dictionary.
     - Parameters:
       - `sentences`: Dictionary containing the alignments (as returned by `load_textgrid` with `alignment=True`).
       - `conllu_path`: Path of the CoNLL-U file to correct.

   - **adjust_alignments_delete_x(conllu_path: str)**
     - Adjusts the alignments in a CoNLL-U file, in place, by removing the 'X' values and replacing them with valid values.
     - Parameters:
       - `conllu_path`: Path of the CoNLL-U file to adjust.
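
To make the return value of `load_textgrid` more concrete, here is a sketch of the two dictionary shapes it builds, following the code further down. The file name, sentence ID, word IDs and timings are made-up placeholders (the `sentence:token` form of the word IDs is inferred from how `correct_alignements` looks them up), not values taken from the corpus.

```python
# Hypothetical values; only the nesting mirrors what load_textgrid builds.

# alignment=False -> {file_name: {sent_id: sent_text}}
sentences_plain = {
    "FILE_01-merged": {
        "1": "hello # world",
    }
}

# alignment=True -> {file_name: {sent_id: {"text": ..., "words": {word_id: (start, end, word_text)}}}}
sentences_aligned = {
    "FILE_01-merged": {
        "1": {
            "text": "hello # world",
            "words": {
                "1:1": (0.0, 0.5, "hello"),
                "1:2": (0.5, 0.75, "#"),
                "1:3": (0.75, 1.25, "world"),
            },
        }
    }
}

# Each word entry unpacks into start time, end time (both in seconds) and text
start, end, word_text = sentences_aligned["FILE_01-merged"]["1"]["words"]["1:3"]
print(word_text, start, end)
```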

##### Global parameters:

- `input_textgrid_directory`: Directory containing the TextGrid files.
- `input_conllu_directory`: Directory containing the CoNLL-U files.
- `conllu_output_directory`: Directory where the updated CoNLL-U files are saved.

##### Main process:

1. **Loading and updating the CoNLL-U files**
   - The script walks through `input_conllu_directory` and its subfolders.
   - For each CoNLL-U file whose name ends in `_MG.conllu` or `_M.conllu`, it loads the corresponding TextGrid file.
   - The sentences in the CoNLL-U file are updated with the new sentences extracted from the TextGrid file.
   - The token indices are updated and the alignments are corrected.

2. **Correcting the alignments**
   - The sentence alignments are corrected using the alignment information provided in the TextGrid files.
   - The 'X' values left in the alignments can be removed and replaced with valid values by `adjust_alignments_delete_x`; that call is commented out in the main loop.
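
The pairing between a CoNLL-U file and its TextGrid can be sketched as follows. The helper names `textgrid_stem` and `textgrid_path` are hypothetical; the rule itself (three underscore-separated fields for files starting with `ABJ`, two otherwise, then `<stem>/new/<stem>-merged.TextGrid`) is the one applied in the main loop of this notebook.

```python
import os

def textgrid_stem(conllu_filename: str) -> str:
    """Return the recording identifier shared by a CoNLL-U file and its TextGrid."""
    base = os.path.basename(conllu_filename)
    # Files starting with "ABJ" use three fields (e.g. ABJ_GWA_14), the others two (e.g. KAD_17)
    n_fields = 3 if base.startswith("ABJ") else 2
    return "_".join(base.split("_")[:n_fields])

def textgrid_path(textgrid_dir: str, conllu_filename: str) -> str:
    """Build the path of the merged TextGrid that corresponds to a CoNLL-U file."""
    stem = textgrid_stem(conllu_filename)
    return os.path.join(textgrid_dir, stem, "new", stem + "-merged.TextGrid")

print(textgrid_stem("ABJ_GWA_14_Mary-Lifestory_MG.conllu"))  # ABJ_GWA_14
print(textgrid_stem("KAD_17_Turkeys_MG.conllu"))             # KAD_17
```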

##### Summary:
The script updates the sentences in the CoNLL-U files using the new sentences extracted from the TextGrid files, corrects the token indices and the alignments, and saves the updated CoNLL-U files to an output directory. The sentence alignments are corrected using the alignment information provided in the TextGrid files.

In [1]:
import re
import os
import sys

current_path = os.getcwd()
scripts_path = os.path.join(current_path, 'scripts')

# Dynamically add the path to the 'scripts' folder
sys.path.append(scripts_path)

from outils.conll3 import *
from praatio import textgrid as tgio

In [2]:
def load_textgrid(textgrid:str, alignment=False) -> dict:
    """
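    Retrieves the sentences of a TextGrid file as a dictionary.

    Without alignment: {file_name: {sent_id: sent_text}}
    With alignment:    {file_name: {sent_id: {"text": sent_text, "words": {word_id: (start, end, word_text)}}}}
    The "Sent-ID" tier holds the sentence IDs and the "Sent-Text" tier holds the sentence texts.

    Parameters:
    textgrid (str): Path to the textgrid file
    alignment (bool): If True, also include the word alignments from the "Word-ID" and "Word-Text" tiers

    Returns:
    dict: Dictionary containing the sentences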
    """
    tg = tgio.openTextgrid(textgrid, includeEmptyIntervals=False)
    file_name = os.path.basename(textgrid).replace(".TextGrid", "")
    sentences = {file_name: {}}

    # Verify that the "Sent-ID", "Sent-Text", "Word-ID" and "Word-Text" tiers all exist
    if "Sent-ID" in tg.tierNames and "Sent-Text" in tg.tierNames and "Word-ID" in tg.tierNames and "Word-Text" in tg.tierNames:
        sent_id_tier = tg.getTier("Sent-ID")
        sent_text_tier = tg.getTier("Sent-Text")
        word_id_tier = tg.getTier("Word-ID")
        word_text_tier = tg.getTier("Word-Text")

        if alignment:
            for i in range(min(len(sent_id_tier.entries), len(sent_text_tier.entries))):
                sent_id = sent_id_tier.entries[i].label
                sent_text = sent_text_tier.entries[i].label
                sentences[file_name][sent_id] = {"text": sent_text, "words": {}}
                for j in range(min(len(word_id_tier.entries), len(word_text_tier.entries))):
                    word_id = word_id_tier.entries[j].label
                    word_text = word_text_tier.entries[j].label
                    start = word_text_tier.entries[j].start
                    end = word_text_tier.entries[j].end
                    sentences[file_name][sent_id]["words"][word_id] = (start, end, word_text)
        else:
            for i in range(min(len(sent_id_tier.entries), len(sent_text_tier.entries))):
                sent_id = sent_id_tier.entries[i].label
                sent_text = sent_text_tier.entries[i].label
                sentences[file_name][sent_id] = sent_text


    # print(sentences)
    return sentences
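
With `alignment=True`, the loop above stores every interval of the word tiers under every sentence. If each word should only be attached to the sentence that contains it in time, one possible refinement is sketched below. It assumes that intervals expose `start`, `end` and `label` (as praatio's do) and that a word lies entirely inside its sentence interval; the `Interval` tuple and the helper name `words_in_span` are stand-ins for illustration, not part of the notebook.

```python
from collections import namedtuple

# Stand-in for a TextGrid interval (assumption: same three fields as praatio intervals)
Interval = namedtuple("Interval", ["start", "end", "label"])

def words_in_span(sent, word_ids, word_texts):
    """Collect {word_id: (start, end, word_text)} for the words inside one sentence interval."""
    words = {}
    for wid, wtext in zip(word_ids, word_texts):
        # Keep only the words whose interval falls within the sentence interval
        if wtext.start >= sent.start and wtext.end <= sent.end:
            words[wid.label] = (wtext.start, wtext.end, wtext.label)
    return words

sent = Interval(0.0, 1.0, "1")
word_ids = [Interval(0.0, 0.5, "1:1"), Interval(0.5, 1.0, "1:2"), Interval(1.0, 1.5, "2:1")]
word_texts = [Interval(0.0, 0.5, "hello"), Interval(0.5, 1.0, "world"), Interval(1.0, 1.5, "again")]

print(words_in_span(sent, word_ids, word_texts))
```

Whether this filtering is appropriate depends on how the tiers were produced; the lookup in `correct_alignements` uses sentence-prefixed word IDs, so it works with either layout.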

In [3]:
def update_sent_text_conllu(conllu: str, sentences: dict, out: str):
    """
    Update the sentences in a CONLL-U file using the provided sentences in a dictionary.
    
    Parameters:
    conllu (str): The path to the CONLL-U file.
    sentences (dict): The dictionary containing the sentences to update.
    out (str): The path to the output file to save the updated CONLL-U file.

    Returns:
    tuple: Two dictionaries {sent_id: text} holding the old and the new sentence texts.
    """
    with open(conllu, 'r', encoding='utf-8') as f:
        data = f.read()

    updated_data = []
    old_sent = {}
    new_sent = {}
    current_file = os.path.basename(conllu).replace(".conllu", "")
    if current_file.startswith("ABJ"):
        textgrid_name = "_".join(current_file.split("_")[:3])
    else:
        textgrid_name = "_".join(current_file.split("_")[:2])

    textgrid_name = textgrid_name + "-merged"
    
    for sentence in data.split("\n\n"):
        # print(data)
        if sentence.strip():
            lines = sentence.split("\n")
            sent_id_match = re.search(r"# sent_id = (.+)", sentence)
            if sent_id_match:
                sent_id = sent_id_match.group(1)
                sent_id = sent_id.split("__")[1]
                # print(textgrid_name, sentences[textgrid_name])
                if textgrid_name in sentences and sent_id in sentences[textgrid_name]:
                    new_text = sentences[textgrid_name][sent_id]
                    for i, line in enumerate(lines):
                        if line.startswith("# text ="):
                            # maxsplit=1 keeps the full text even if it contains " = "
                            old_text = line.split(" = ", 1)[1]
                            lines[i] = f"# text = {new_text}"
                            old_sent[sent_id] = old_text
                            new_sent[sent_id] = new_text
                            break
                    
            updated_data.append("\n".join(lines))

    with open(out, 'w', encoding='utf-8') as f:
        f.write("\n\n".join(updated_data) + "\n")
    
    return old_sent, new_sent
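
The core of the update, replacing the `# text` line of one sentence block, can be isolated as below. This is a simplified sketch rather than the function above: `replace_text_line` is a hypothetical helper, and the sample block is invented.

```python
def replace_text_line(block: str, new_text: str):
    """Return (updated_block, old_text); old_text is None if the block has no '# text = ' line."""
    lines = block.split("\n")
    old_text = None
    for i, line in enumerate(lines):
        if line.startswith("# text = "):
            # maxsplit=1 keeps the whole text even if it contains " = "
            old_text = line.split(" = ", 1)[1]
            lines[i] = f"# text = {new_text}"
            break
    return "\n".join(lines), old_text

block = "# sent_id = FILE_01__3\n# text = a = b\n1\ta\ta\t_\t_\t_\t_\t_\t_\t_"
updated, old = replace_text_line(block, "a # = b")
print(old)  # a = b
```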

In [4]:
def compare_and_replace(old_tokens, new_tokens):
    """
    Replaces the modified tokens in the new annotations with the old annotations and adds the missing tokens from the new annotations.

    Parameters:
    old_tokens (dict): The old annotations.
    new_tokens (dict): The new annotations.

    Returns:
    dict: The new annotations with the modified tokens replaced by the old annotations.
    """
    new_tokens_dic = {}
    new_key = 1

    # print("\nold ", old_tokens)
    # print("new ", new_tokens)

    # Dict to keep track of token occurrences
    old_token_occurrences = {}
    for idx, token in old_tokens.items():
        token_upper = token.upper()
        if token_upper not in old_token_occurrences:
            old_token_occurrences[token_upper] = []
        old_token_occurrences[token_upper].append(idx)

    used_old_indices = set()

    def get_context(tokens, idx):
        """ Helper function to get the context of a token """
        prev_idx = idx - 1
        while prev_idx in tokens and tokens[prev_idx] == '#':
            prev_idx -= 1
        prev_token = tokens.get(prev_idx, '')

        next_idx = idx + 1
        while next_idx in tokens and tokens[next_idx] == '#':
            next_idx += 1
        next_token = tokens.get(next_idx, '')

        return prev_token.upper(), next_token.upper()

    for new_idx, new_token in new_tokens.items():
        new_token_upper = new_token.upper()
        token_found = False

        if new_token_upper in old_token_occurrences:
            new_prev, new_next = get_context(new_tokens, new_idx)

            for old_idx in old_token_occurrences[new_token_upper]:
                if old_idx not in used_old_indices:
                    old_prev, old_next = get_context(old_tokens, old_idx)

                    if old_prev == new_prev and old_next == new_next:
                        new_tokens_dic[new_key] = old_tokens[old_idx]
                        # print("old=new ", new_tokens_dic)
                        new_key += 1
                        used_old_indices.add(old_idx)
                        token_found = True
                        break

        if not token_found:
            # Handling tokens with `~`
            new_token_tild = new_token.replace("~", "")
            for old_idx, old_token in old_tokens.items():
                old_token_tild = old_token.replace("~", "")
                if new_token_tild.upper() == old_token_tild.upper() and old_idx not in used_old_indices:
                    old_prev, old_next = get_context(old_tokens, old_idx)
                    new_prev, new_next = get_context(new_tokens, new_idx)

                    if old_prev == new_prev and old_next == new_next:
                        new_tokens_dic[new_key] = old_token
                        # print("tild ", new_tokens_dic)
                        new_key += 1
                        used_old_indices.add(old_idx)
                        token_found = True
                        break

        if not token_found:
            new_tokens_dic[new_key] = new_token
            # print("# or new_token ", new_tokens_dic)
            new_key += 1

    return new_tokens_dic
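
The matching above relies on comparing the neighbours of a token while skipping `#` markers. A standalone copy of that idea is sketched here; `context` mirrors the nested `get_context` helper, and the two token dictionaries are invented examples.

```python
def context(tokens: dict, idx: int):
    """Return the upper-cased previous and next tokens of idx, skipping '#' tokens."""
    prev_idx = idx - 1
    while prev_idx in tokens and tokens[prev_idx] == "#":
        prev_idx -= 1
    next_idx = idx + 1
    while next_idx in tokens and tokens[next_idx] == "#":
        next_idx += 1
    # A missing neighbour (sentence boundary) is represented by an empty string
    return tokens.get(prev_idx, "").upper(), tokens.get(next_idx, "").upper()

old_tokens = {1: "I", 2: "dey", 3: "go"}
new_tokens = {1: "I", 2: "#", 3: "dey", 4: "go"}

print(context(old_tokens, 2))  # ('I', 'GO')
print(context(new_tokens, 3))  # ('I', 'GO')
```

Because the inserted `#` is skipped, "dey" has the same context in both sentences and is therefore treated as the same token.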

In [5]:
def update_conllu_id(conllu_path: str, output_path: str, old_sent: dict) -> None:
    """
    Updates the token indexing and adjusts the HEAD indices in a CONLL-U file.
    Only sentences whose ID appears in old_sent are written to the output file.
    
    Parameters:
    conllu_path (str): The path of the CONLL-U file to update.
    output_path (str): The output file path to save the updated CONLL-U file.
    old_sent (dict): The dictionary containing the previous sentences.

    Returns:
    None
    """
    with open(conllu_path, 'r', encoding='utf-8') as f:
        data = f.read().strip().split('\n\n')

    processed_sentences = []

    # Get the ID of the old sentence {sent_id: {word_id: word}}
    old_dico = {}
    for sent_id, sent_text in old_sent.items():
        sent_text = sent_text.split()
        dict_id_word = {i + 1: word for i, word in enumerate(sent_text)}
        old_dico[sent_id] = dict_id_word

    new_dico = {}
    for sentence in data:
        lines = sentence.strip().split('\n')
        metadata = []
        token_lines = []
        for line in lines:
            if line.startswith('#'):
                metadata.append(line)
            else:
                token_lines.append(line.split('\t'))

        sent_id = None
        for m in metadata:
            if m.startswith("# sent_id ="):
                sent_id = m.split(" = ")[1]
                sent_id = sent_id.split("__")[1]
                new_dico[sent_id] = {}
                
            if m.startswith("# text ="):
                sent_text = m.split(" = ", 1)[1]
                sent_text = sent_text.split()
                dict_id_word = {i + 1: word for i, word in enumerate(sent_text)}
                new_dico[sent_id] = dict_id_word

        # Check if sent_id is in old_dico and new_dico
        if sent_id in old_dico and sent_id in new_dico:
            # Correcting the token indices using the dictionary + adding missing tokens to token_lines
            corrected_token_lines = []
            old_tokens = old_dico[sent_id]
            new_tokens = new_dico[sent_id]

            new_tokens = compare_and_replace(old_tokens, new_tokens)

            # print("\n\nOld sentence and its words with ID:", old_tokens)
            # print("New sentence and its words with ID:", new_tokens)
            
            # Dictionary to keep track of already used indices for each word
            used_old_indices = {}

            for new_idx, new_word in new_tokens.items():
                # If the token itself is missing, we add it with the corresponding index or if the token is "#" and the index is different from the old token
                # print("\n\n",conllu_path)
                # print("\nNew index:", new_idx, "New word:", new_word)
                old_word_x = [old_word for old_idx, old_word in old_tokens.items() if old_idx == new_idx]
                old_idx_x = [old_idx for old_idx, old_word in old_tokens.items() if old_idx == new_idx]
                # print("Old word:", old_word_x, "Old index:", old_idx_x)

                new_word_not_in_old_tokens = new_word.upper() not in [old_word.upper() for old_word in old_tokens.values()]
                new_word_is_hash_with_different_index = new_word == "#" and (not old_idx_x or new_idx != old_idx_x[0])
                new_word_is_hash_different_old_word = new_word == "#" and old_word_x and old_word_x[0] != "#"

                # Adding debug prints
                # print("new_word_not_in_old_tokens:", new_word_not_in_old_tokens)
                # print("new_word_is_hash_with_different_index:", new_word_is_hash_with_different_index)
                # print("new_word_is_hash_different_old_word:", new_word_is_hash_different_old_word)
                # print("Main condition:", new_word_not_in_old_tokens or new_word_is_hash_with_different_index)



                if new_word_not_in_old_tokens or new_word_is_hash_with_different_index or new_word_is_hash_different_old_word:
                    # print("Missing token:", new_word)
                    if new_word == "#":
                        # print(conllu_path)
                        # print("Sent ID:", sent_id)
                        # print("Old sentence and its words with ID:", old_tokens)
                        # print("New sentence and its words with ID:", new_tokens)
                        # print("Missing token", new_word)
                        corrected_token_lines.append([str(new_idx), new_word, new_word, 'PUNCT', '_', '_', '_', '_', '_', '_'])
                    else:
                        # print("\n\n",conllu_path)
                        # print("Sent ID:", sent_id)
                        # print("Old sentence and its words with ID:", old_tokens)
                        # print("New sentence and its words with ID:", new_tokens)
                        # print("Missing token", new_word)
                        corrected_token_lines.append([str(new_idx), new_word, new_word, '_', '_', '_', '_', '_', '_', '_'])
                else:
                    # If the token is present, update it with the corresponding index in the new dictionary
                    matching_old_indices = [old_idx for old_idx, old_word in old_tokens.items() if old_word.upper() == new_word.upper()]

                    try:
                        # Find the first unused index
                        available_old_idx = next(old_idx for old_idx in matching_old_indices if old_idx not in used_old_indices.get(new_word.upper(), []))
                    except StopIteration:
                        print(f"Error: no available index for the word '{new_word}' with index '{new_idx}'")
                        continue
                    
                    # Update the dictionary of used indices
                    if new_word.upper() not in used_old_indices:
                        used_old_indices[new_word.upper()] = []
                    used_old_indices[new_word.upper()].append(available_old_idx)
                    
                    if available_old_idx - 1 < len(token_lines):
                        old_word = old_tokens[available_old_idx]
                        corrected_token_lines.append([str(new_idx), old_word] + token_lines[available_old_idx - 1][2:])
                    else:
                        print(f"Error: index {available_old_idx - 1} is out of range for the tokens in the sentence with sent_id = {sent_id}")


                # print("Corrected token lines:", corrected_token_lines)

            for token_line in corrected_token_lines:
                head_idx = token_line[6]
                if head_idx != "0" and head_idx != "_":
                    head_idx = int(head_idx)
                    for idx, word in new_tokens.items():
                        # print("Index:", idx, "Word:", word)
                        if idx <= head_idx and word == '#' and idx not in used_old_indices.get('#', []):
                            # print("Head index:", head_idx, "New head index:", head_idx + 1)
                            head_idx += 1
                    new_head_idx = head_idx
                    # print("Head index:", head_idx, "New head index:", new_head_idx)
                    token_line[6] = str(new_head_idx)


            # Reformatting token lines
            formatted_token_lines = ['\t'.join(token_line) for token_line in corrected_token_lines]
            processed_sentences.append('\n'.join(metadata + formatted_token_lines))

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(processed_sentences) + '\n')
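
The HEAD adjustment at the end of the function compensates for `#` tokens inserted at or before the head position. A reduced sketch of that shift is given below; `shift_head` and its `inserted_positions` argument are hypothetical names, and the function above identifies the inserted `#` tokens by scanning `new_tokens` rather than receiving their positions directly.

```python
def shift_head(head: int, inserted_positions) -> int:
    """Shift a 1-based HEAD index to account for tokens inserted at the given new positions."""
    if head == 0:
        return 0  # 0 denotes the root and is never shifted
    for pos in sorted(inserted_positions):
        # Each insertion at or before the (already shifted) head pushes it one slot right
        if pos <= head:
            head += 1
    return head

print(shift_head(3, [2]))     # 4
print(shift_head(3, [2, 4]))  # 5
print(shift_head(1, [2]))     # 1
```

In the second call, the old sentence `a b c` becomes `a # b # c`, so a dependent pointing at `c` (old index 3) must now point at index 5.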



In [6]:
def correct_alignements(sentences: dict, conllu_path: str):
    """
    Correct the alignments in a CONLL-U file using the alignments provided in a dictionary.

    Parameters:
    sentences (dict): The dictionary containing the alignments.
    conllu_path (str): The path of the CONLL-U file to be corrected.

    Returns:
    None
    """

    punctuation_list = [
        ">",
        "<",
        "//",
        "?//",
        "]",
        "}",
        "|c",
        ">+",
        "||",
        "}//",
        "&//",
        "//.",
        ")",
        "|r",
        ">=",
        "//+",
        "<+",
        "?//]",
        "//]",
        "//=",
        "!//",
        "?//=",
        "!//=",
        "//)",
        "|a",
        "&?//",
        "!//]",
        "&//]",
        "//&",
        "?//]",
        "!//)",
        "&?//]",
        "{",
        "(",
        "[",
        "&",
        "||e",
        "<{",
        "//t",
        "{|c",
        "|}",
        "|",
        "/",
        "?",
        "+",
        ",",
    ]

    with open(conllu_path, 'r', encoding='utf-8') as f:
        data = f.read().strip().split('\n\n')

    corrected_data = []

    for sentence in data:
        lines = sentence.strip().split('\n')
        metadata = []
        token_lines = []
        for line in lines:
            if line.startswith('#'):
                metadata.append(line)
            else:
                token_lines.append(line.split('\t'))

        sent_id = None
        for m in metadata:
            if m.startswith("# sent_id ="):
                sent_id = m.split(" = ")[1]
                break

        if sent_id and "__" in sent_id:
            sent_id_key = sent_id.split("__")[1]
            file_name = sent_id.split("__")[0]
            if file_name.startswith("ABJ"):
                file_name = "_".join(file_name.split("_")[:3])
            else:
                file_name = "_".join(file_name.split("_")[:2])
            
            file_name = file_name + "-merged"

            previous_end = None

            if file_name in sentences and sent_id_key in sentences[file_name]:
                for token in token_lines:
                    token_id = sent_id_key + ":" + token[0]
                    token_id_plus_1 = sent_id_key + ":" + str(int(token[0]) + 1)

                    # print("1. ", token_id, token)
                    if token_id in sentences[file_name][sent_id_key]["words"]:
                        word_info = sentences[file_name][sent_id_key]["words"][token_id]
                        word_info_plus_1 = sentences[file_name][sent_id_key]["words"].get(token_id_plus_1, None)
                        
                        # print(" 2. ", word_info)
                        if token[1] == word_info[2] and word_info[2]:
                            start, end, token_text = word_info
                            token[1] = token_text
                            previous_end = end
                            # Convert seconds to milliseconds (any previous value of the last column is overwritten)
                            token[-1] = f"AlignBegin={int(start * 1000)}|AlignEnd={int(end * 1000)}"

                        elif word_info_plus_1 and token[1] == word_info_plus_1[2] and word_info_plus_1[2]:
                            start, end, token_text = word_info_plus_1
                            token[1] = token_text
                            previous_end = end
                            token[-1] = f"AlignBegin={int(start * 1000)}|AlignEnd={int(end * 1000)}"

                        elif word_info_plus_1 and token[1] != word_info[2] and token[1] != word_info_plus_1[2]:
                            start, end, token_text = word_info_plus_1
                            if previous_end and previous_end < start:
                                token[-1] = f"AlignBegin={int(previous_end * 1000)}|AlignEnd={int(start * 1000)}"
                                previous_end = start
                            else:
                                token[-1] = f"AlignBegin=X|AlignEnd=X"

                        else:
                            token[-1] = f"AlignBegin=X|AlignEnd=X"

                    elif token[1] in punctuation_list:
                        if previous_end is not None:
                            token[-1] = f"AlignBegin={int(previous_end * 1000)}|AlignEnd={int(previous_end * 1000)}"
                        else:
                            token[-1] = f"AlignBegin=X|AlignEnd=X"

                    else:
                        token[-1] = f"AlignBegin=X|AlignEnd=X"

                corrected_lines = ['\t'.join(token) for token in token_lines]
                corrected_sentence = '\n'.join(metadata + corrected_lines)
                corrected_data.append(corrected_sentence)
            else:
                corrected_data.append(sentence)
        else:
            corrected_data.append(sentence)

    corrected_conllu = '\n\n'.join(corrected_data)

    # The file is corrected in place
    with open(conllu_path, 'w', encoding='utf-8') as f:
        f.write(corrected_conllu)
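
The alignment string written to the last column can be produced and read back as sketched below. `format_alignment` and `parse_alignment` are hypothetical helpers; the output format and the `X` placeholder are the ones used in this notebook, and the regular expression follows the pattern used in `adjust_alignments_delete_x`.

```python
import re

def format_alignment(start_s: float, end_s: float) -> str:
    """Format start/end times given in seconds as a millisecond alignment string."""
    # int() truncates toward zero, as in the function above
    return f"AlignBegin={int(start_s * 1000)}|AlignEnd={int(end_s * 1000)}"

def parse_alignment(misc: str):
    """Return (begin, end) in milliseconds, with None for 'X'; None if no alignment is present."""
    m = re.search(r"AlignBegin=(\d+|X)\|AlignEnd=(\d+|X)", misc)
    if not m:
        return None
    return tuple(None if value == "X" else int(value) for value in m.groups())

print(format_alignment(0.5, 1.25))                 # AlignBegin=500|AlignEnd=1250
print(parse_alignment("AlignBegin=X|AlignEnd=X"))  # (None, None)
```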

In [7]:
def adjust_alignments_delete_x(conllu_path:str):
    """
    Adjusts the alignments in a CONLL-U file by removing 'X' values and replacing them with valid values.

    Parameters:
    conllu_path (str): The path of the CONLL-U file to adjust.
    """
    with open(conllu_path, 'r', encoding='utf-8') as f:
        data = f.read().strip().split('\n\n')

    adjusted_data = []
    
    for sentence in data:
        lines = sentence.split('\n')
        first_alignbegin = None
        last_alignend = None

        # Get the alignments of each token
        alignments = []
        for line in lines:
            if '\t' in line:
                parts = line.split('\t')
                align_info = parts[-1]
                alignbegin_match = re.search(r'AlignBegin=(\d+|X)', align_info)
                alignend_match = re.search(r'AlignEnd=(\d+|X)', align_info)
                if alignbegin_match and alignend_match:
                    alignbegin = alignbegin_match.group(1)
                    alignend = alignend_match.group(1)
                    if alignbegin != 'X':
                        alignbegin = int(alignbegin)
                    else:
                        alignbegin = None
                    if alignend != 'X':
                        alignend = int(alignend)
                    else:
                        alignend = None
                    alignments.append((alignbegin, alignend))
                    if first_alignbegin is None and alignbegin is not None:
                        first_alignbegin = alignbegin
                    if alignend is not None:
                        last_alignend = alignend

        # Make sure the first and last alignments are defined
        if first_alignbegin is None:
            first_alignbegin = 0  # Default to 0 if no value found
        if last_alignend is None and alignments:
            last_alignend = alignments[-1][1] if alignments[-1][1] is not None else 0

        if not alignments:
            adjusted_data.append(sentence)
            continue

        # Adjust missing alignments
        for i, (alignbegin, alignend) in enumerate(alignments):
            if alignbegin is None:
                if i > 0 and alignments[i-1][1] is not None:
                    alignbegin = alignments[i-1][1]
                else:
                    alignbegin = first_alignbegin
            if alignend is None:
                if i < len(alignments) - 1 and alignments[i+1][0] is not None:
                    alignend = alignments[i+1][0]
                else:
                    alignend = last_alignend
            alignments[i] = (alignbegin, alignend)

        # Set alignend = alignbegin if alignbegin > alignend
        for i, (alignbegin, alignend) in enumerate(alignments):
            if alignbegin > alignend:
                alignend = alignbegin
            alignments[i] = (alignbegin, alignend)

        for i in range(1, len(alignments)):
            alignbegin, alignend = alignments[i]
            # print("i-1 :" , i-1, alignments[i-1])
            # print("old :", i, alignbegin, alignend, alignments[i])
            prev_alignend = alignments[i-1][1]
            if i + 1 < len(alignments) and prev_alignend > alignments[i+1][1]:
                prev_alignend = alignments[i-1][0]
            # print("prev : ", prev_alignend)
            if alignbegin is not None and alignend is not None:
                if prev_alignend is not None:
                    if alignbegin != prev_alignend or alignbegin > alignend:
                        if alignbegin == alignend:
                            alignbegin = prev_alignend
                            alignments[i] = (alignbegin, alignbegin)
                        else:
                            alignbegin = prev_alignend
                            alignments[i] = (alignbegin, alignend)
                        # print("new : ", alignbegin, alignend, alignments[i])

        # Adjust the lines with alignments
        adjusted_lines = []
        alignment_index = 0
        for line in lines:
            if '\t' in line:
                parts = line.split('\t')
                align_info = parts[-1]
                if 'AlignBegin' in align_info and 'AlignEnd' in align_info:
                    alignbegin, alignend = alignments[alignment_index]
                    # print(alignbegin, alignend, alignments[alignment_index])
                    alignment_index += 1
                    if parts[3] == 'PUNCT' and parts[1] != "#":
                        # Punctuation other than "#" gets a zero-length interval
                        alignend = alignbegin
                    # Clamp negative values to 0
                    if alignbegin is not None:
                        alignbegin = max(0, alignbegin)
                    if alignend is not None:
                        alignend = max(0, alignend)
                    parts[-1] = f'AlignBegin={alignbegin}|AlignEnd={alignend}'
                adjusted_lines.append('\t'.join(parts))
            else:
                adjusted_lines.append(line)
        # print(adjusted_lines)
        adjusted_data.append('\n'.join(adjusted_lines))

    with open(conllu_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(adjusted_data))
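
The first pass of the function, filling the `X` (here `None`) bounds from the neighbouring tokens, can be sketched in isolation. `fill_gaps` is a hypothetical simplification: it takes fixed defaults where the function above falls back to the first begin and last end found in the sentence, and it leaves out the later consistency passes.

```python
def fill_gaps(alignments, default_begin=0, default_end=0):
    """Fill None bounds: a missing begin takes the previous end, a missing end the next begin."""
    filled = list(alignments)
    for i, (begin, end) in enumerate(filled):
        if begin is None:
            has_prev = i > 0 and filled[i - 1][1] is not None
            begin = filled[i - 1][1] if has_prev else default_begin
        if end is None:
            has_next = i < len(filled) - 1 and filled[i + 1][0] is not None
            end = filled[i + 1][0] if has_next else default_end
        filled[i] = (begin, end)
    return filled

print(fill_gaps([(0, 100), (None, None), (250, 300)]))
# [(0, 100), (100, 250), (250, 300)]
```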


## Parameters

In [8]:
input_textgrid_directory = "../data/TG_WAV/"
input_conllu_directory = "../data/SUD_Naija-NSC-master/gold_nongold/"
conllu_output_directory = "../data/conllu_output/gold_nongold/"

Correcting the transcriptions by adding the missing # tokens

In [9]:
os.makedirs(conllu_output_directory, exist_ok=True)

for root, dirs, files in os.walk(input_conllu_directory):
    # print(root)
    for file in files:
        # print(file)
        if file.endswith("_MG.conllu") or file.endswith("_M.conllu"):
            conllu_file = os.path.join(root, file)

            # print(conllu_file)
            if file.startswith("ABJ"):
                textgrid_name = "_".join(file.split("_")[:3])
            else:
                textgrid_name = "_".join(file.split("_")[:2])

            textgrid_path = os.path.join(input_textgrid_directory, textgrid_name + "/new/" + textgrid_name + "-merged.TextGrid")

            # print(textgrid_path)

            if os.path.exists(textgrid_path):
                # if "JOS_38" in textgrid_path:
                print(f"\n Processing {conllu_file} with {textgrid_path}")
                sentences_dict = load_textgrid(textgrid_path)
                conllu_output_file = os.path.join(conllu_output_directory, file)
                old_text, new_text = update_sent_text_conllu(conllu_file, sentences_dict, conllu_output_file)
                update_conllu_id(conllu_output_file, conllu_output_file, old_text)
                sentences_all = load_textgrid(textgrid_path, alignment=True)
                correct_alignements(sentences_all, conllu_output_file)
                # adjust_alignments_delete_x(conllu_output_file)
                


 Processing ../data/SUD_Naija-NSC-master/gold_nongold/ABJ_GWA_14_Mary-Lifestory_MG.conllu with ../data/TG_WAV/ABJ_GWA_14/new/ABJ_GWA_14-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/KAD_17_Turkeys_MG.conllu with ../data/TG_WAV/KAD_17/new/KAD_17-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/WAZL_08_Edewor-Lifestory_MG.conllu with ../data/TG_WAV/WAZL_08/new/WAZL_08-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/JOS_19_Bukuru_MG.conllu with ../data/TG_WAV/JOS_19/new/JOS_19-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/PRT_07_Drummer_MG.conllu with ../data/TG_WAV/PRT_07/new/PRT_07-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/WAZL_03_News-On-Gmns_MG.conllu with ../data/TG_WAV/WAZL_03/new/WAZL_03-merged.TextGrid

 Processing ../data/SUD_Naija-NSC-master/gold_nongold/BEN_36_Clever-Girl_MG.conllu with ../data/TG_WAV/BEN_36/new/BEN_36-merged.TextGrid

 Processing ../data