# Traitement intermédiaire

Ce notebook permet de récupérer, depuis un fichier de sous-titre, deux nouveaux fichiers qui serviront à avoir des sous-titres segmentés en phrases :
1. Un fichier de sous-titre .vtt sur segmenté, où un sous-titre correspond exclusivement à une phrase ou à un morceau de phrases 
2. Un fichier .txt contenant une phrase par ligne

In [15]:
import spacy
import os
import re
import module_traitement as m
spacy.prefer_gpu()
nlp = spacy.load("fr_dep_news_trf")
from spacy.language import Language

In [98]:
# Pour Mediapi
file_with_path = m.lister_fichiers_with_path("../data/aligned_mediapi/")
folder = m.lister_fichiers("../data/aligned_mediapi/")
output_seg = "../data/new_segmentation_mediapi/"
output_sent = "../data/sentence_mediapi/"

In [2]:
# Pour Matignon - LSF 
file_with_path = m.lister_fichiers_with_path("../data/cr_audio_aligned/")
folder = m.lister_fichiers("../data/cr_audio_aligned/")
output_seg = "../data/new_segmentation_cr/"
output_sent = "../data/sentence_matignon/"

## Nettoyage fichier

Nettoyer le fichier en amont des différents pré-traitement pour facilier la detection de phrase et le découpage. 

In [177]:
# TEST - Pour Matignon - LSF 
file_with_path = m.lister_fichiers_with_path("../data/cr_audio_aligned/")
folder = m.lister_fichiers("../data/cr_audio_aligned/")
output_seg = "test_new_seg_cr/"
output_cleaning = "test_cleaning_cr/"
#output_sent = "../data/sentence_matignon/"

In [178]:
def get_dict_vtt_clean(input):
    with open(input,encoding="utf-8") as f:
        lines = f.readlines()

    dict_sub = {}
    i = 0
    j = 0  

    while j < len(lines): 
        element = lines[j]
        if element.startswith("00:") or element.startswith("01:") or element.startswith("02:"):
            # Extraire le temps de début et de fin
            timing_line = element.strip().split(' --> ')
            start_time, end_time = timing_line

            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith("00:") and not lines[j+1].startswith("01:") and not lines[j+1].startswith("02:"):
                j += 1
                content = lines[j]
                if content.startswith("-"):
                    content = content.replace("-","")
                text = text + " " + content.strip()
                text=re.sub(r'\[INAUDIBLE\],?', 'Inaudible.', text)
                text=text.replace("[ INAUDIBLE ]","Inaudible.")
                text=text.replace("... -G. Attal : ","")
                text=text.replace("-G. Attal : ","")
                text=text.replace("G. Attal : ","")
                text = text.replace("-Bonjour", "Bonjour")
                text = text.replace("Bonjour.", "Bonjour,")
                text = text.replace('".','"')
                text = text.replace('"?','"')
                text = text.replace('"!','"')
                text = text.replace('"...','"')
                text = re.sub(r'["“”«»]', '', text)
                text = text.replace(" Inaudible.",". Inaudible.") # to check 0-R

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub


In [179]:
for file,name in zip(file_with_path,folder):
    dict_sub = get_dict_vtt_clean(file)
    m.create_vtt_file(dict_sub,f"{output_cleaning}/{name}")

## Fichier sur-segmenté

Dans ces fichiers, on fait les traitements suivants : 
1. Récupérer les ponctuations fortes (point, point d'exclamation, point d'interrogation) au milieu des sous-titre
2. Garder les timestamps en mémoire
3. Avoir la durée de prononciation d'une lettre, pour découper le sous-titre en fonction
4. Couper le sous-titre et générer un nouveau fichier

### Traitement 

Le prochain traitement peut s'effectuer deux fois, au cas où il y avait plusieurs ponctations fortes. Je peux modifier l'expression régulière pour ne pas prendre les LETTRE_MAJ. dans ma segmentation.

In [152]:
import module_traitement as m

In [232]:
file_with_path = m.lister_fichiers_with_path("test_cleaning_cr/")
folder = m.lister_fichiers("test_cleaning_cr/")

In [233]:
def time_to_seconds(timestamp):
    # Split the timestamp into hours, minutes, seconds, and milliseconds
    milliseconds = int(timestamp.split('.')[1])
    tmp = timestamp.split('.')[0]
    hours, minutes, seconds = map(int, tmp.split(':'))

    # Calculate the total seconds
    total_seconds = hours * 3600 + minutes * 60 + seconds + milliseconds / 1000.0

    return total_seconds

In [234]:
for file,name in zip(file_with_path,folder):
    print(f"TRAITEMENT {file} ---- {name}")
    dict_sub = m.get_dict_vtt(file)
    new_dict = {}
    mm = 0
    pattern = r'([.!?]+)'
    sous_unite = []
    for k, v in dict_sub.items():
        for kk, vv in v.items():
            if kk == "text":
                # Replace the point between two capital letters with '#'
                modified_text = re.sub(r'(?<=[A-Z])\.(?=[A-Z])', '#', vv)
                # Use re.split() to split the text based on the pattern
                sentences = re.split(pattern, modified_text)
                # Combine pairs of adjacent list elements (sentence + punctuation)
                result = [sentences[i] + sentences[i + 1] if i < len(sentences) - 1 else sentences[i] for i in range(0, len(sentences), 2)]
                # Remove empty strings from the result
                result = [sentence.strip() for sentence in result if sentence.strip()]
                if len(result) == 1:
                    if mm not in new_dict:
                        new_dict[mm]=v
                        mm = mm +1
                else:
                    if result:
                        print(f"resultat : {result}")
                        start_time_str = v["start"]
                        end_time_str = v["end"]
                        # start_time = m.conv_str_to_time(start_time_str)
                        # end_time = m.conv_str_to_time(end_time_str)
                        nb_of_carach = len(v["text"])
                        duration = time_to_seconds(end_time_str) - time_to_seconds(start_time_str)
                        duration_sec = duration
                        print(duration)
                        sec_par_letter = duration_sec / nb_of_carach
                        for match in result:
                            len_match = len(match)
                            duration_match = len_match*sec_par_letter
                            if mm not in new_dict:
                                print(end_time)
                                end_time = m.ajouter_secondes(start_time_str,duration_match)
                                print(end_time)
                                print(f"start time ({type(start_time_str)}) : {start_time_str}, end time ({type(end_time)}) : {end_time}, text : {match}")
                                new_dict[mm]={'start':start_time_str,"end":end_time,'text':match}
                                start_time_str = end_time
                                mm = mm +1
                    else:
                        continue
    m.create_vtt_file(new_dict,f"{output_seg}/{name}")

                    
    

TRAITEMENT test_cleaning_cr/-LhfYZ1ihpI.vtt ---- -LhfYZ1ihpI.vtt
resultat : ['Monsieur le Ministre, bonjour.', 'Marie Chanterait']
1.5600000000000023
00:31:12.765
00:06:42.555
start time (<class 'str'>) : 00:06:41.560, end time (<class 'str'>) : 00:06:42.555, text : Monsieur le Ministre, bonjour.
00:06:42.555
00:06:43.086
start time (<class 'str'>) : 00:06:42.555, end time (<class 'str'>) : 00:06:43.086, text : Marie Chanterait
resultat : ['Alors, quand ?', "L'année prochaine ?", '2025 ?', '2026 ?']
3.0400000000000205
00:06:43.086
00:07:06.846
start time (<class 'str'>) : 00:07:05.960, end time (<class 'str'>) : 00:07:06.846, text : Alors, quand ?
00:07:06.846
00:07:08.049
start time (<class 'str'>) : 00:07:06.846, end time (<class 'str'>) : 00:07:08.049, text : L'année prochaine ?
00:07:08.049
00:07:08.429
start time (<class 'str'>) : 00:07:08.049, end time (<class 'str'>) : 00:07:08.429, text : 2025 ?
00:07:08.429
00:07:08.809
start time (<class 'str'>) : 00:07:08.429, end time (<cla

## Sortir un fichier de phrases

L'idée est de sortie un fichier contenant une phrase par ligne pour pouvoir créer plus simplement le nouveau fichier de sous-titre segmenté en phrases. 

### Traitement

Utilisation de SpaCy pour récupérer les phrases

In [218]:
file_with_path

['test_cleaning_cr/-LhfYZ1ihpI.vtt',
 'test_cleaning_cr/-sgE2QHsskA.vtt',
 'test_cleaning_cr/0-RtcGGzUoE.vtt',
 'test_cleaning_cr/0IViLqFmgIY.vtt',
 'test_cleaning_cr/0hXvxmgHk_c.vtt',
 'test_cleaning_cr/1AjRdJ5d_Ww.vtt',
 'test_cleaning_cr/1ILfD_BjLNk.vtt',
 'test_cleaning_cr/1MHphyCtLLE.vtt',
 'test_cleaning_cr/3TEX9ruhaXo.vtt',
 'test_cleaning_cr/6fsIXStr6w4.vtt',
 'test_cleaning_cr/8IqhpOiPMxY.vtt',
 'test_cleaning_cr/8ZUIw7jcaZE.vtt',
 'test_cleaning_cr/B62_uSapEhI.vtt',
 'test_cleaning_cr/CDsP8gaVGbg.vtt',
 'test_cleaning_cr/CxgdUjywiDE.vtt',
 'test_cleaning_cr/F5-w4cvC_L0.vtt',
 'test_cleaning_cr/FDNmSH37IEk.vtt',
 'test_cleaning_cr/G3Tz-srNvjs.vtt',
 'test_cleaning_cr/H0dUQaWH3UE.vtt',
 'test_cleaning_cr/H448NJiwMRI.vtt',
 'test_cleaning_cr/K7UFz1gAzAQ.vtt',
 'test_cleaning_cr/K7WqKupeGVk.vtt',
 'test_cleaning_cr/KhxgwQDJpCg.vtt',
 'test_cleaning_cr/LicjiPStTmU.vtt',
 'test_cleaning_cr/M5SVtrVMQm0.vtt',
 'test_cleaning_cr/NiQIWYxm8EQ.vtt',
 'test_cleaning_cr/QXVJZkmva3s.vtt',
 

In [235]:
output_sent = "test_clean_sent/"

In [236]:
def get_dict_vtt(input):
    with open(input,encoding="utf-8") as f:
        lines = f.readlines()

    dict_sub = {}
    i = 0
    j = 0  

    while j < len(lines): 
        element = lines[j]
        if element.startswith("00:") or element.startswith("01:") or element.startswith("02:"):
            # Extraire le temps de début et de fin
            timing_line = element.strip().split(' --> ')
            start_time, end_time = timing_line

            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith("00:") and not lines[j+1].startswith("01:") and not lines[j+1].startswith("02:"):
                j += 1
                content = lines[j]
                text = text + " " + content.strip()

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub

In [237]:
for files, name in zip(file_with_path,folder):
    print("Traitement",files,name)
    dict_sub = m.get_dict_vtt(files)
    text = ""
    sentences = []
    for k,v in dict_sub.items():
        for kk,vv in v.items():
            if kk=="text":
                text = text + vv + " "
    doc = nlp(text)
    assert doc.has_annotation("SENT_START")
    for sent in doc.sents:
        sentences.append(sent.text)
    with open(f"{output_sent}/{name}","w",encoding="utf-8") as output:
        for sent in sentences:
            output.write(sent+"\n")

Traitement test_cleaning_cr/-LhfYZ1ihpI.vtt -LhfYZ1ihpI.vtt
Traitement test_cleaning_cr/-sgE2QHsskA.vtt -sgE2QHsskA.vtt
Traitement test_cleaning_cr/0-RtcGGzUoE.vtt 0-RtcGGzUoE.vtt
Traitement test_cleaning_cr/0IViLqFmgIY.vtt 0IViLqFmgIY.vtt
Traitement test_cleaning_cr/0hXvxmgHk_c.vtt 0hXvxmgHk_c.vtt
Traitement test_cleaning_cr/1AjRdJ5d_Ww.vtt 1AjRdJ5d_Ww.vtt
Traitement test_cleaning_cr/1ILfD_BjLNk.vtt 1ILfD_BjLNk.vtt
Traitement test_cleaning_cr/1MHphyCtLLE.vtt 1MHphyCtLLE.vtt
Traitement test_cleaning_cr/3TEX9ruhaXo.vtt 3TEX9ruhaXo.vtt
Traitement test_cleaning_cr/6fsIXStr6w4.vtt 6fsIXStr6w4.vtt
Traitement test_cleaning_cr/8IqhpOiPMxY.vtt 8IqhpOiPMxY.vtt
Traitement test_cleaning_cr/8ZUIw7jcaZE.vtt 8ZUIw7jcaZE.vtt
Traitement test_cleaning_cr/B62_uSapEhI.vtt B62_uSapEhI.vtt
Traitement test_cleaning_cr/CDsP8gaVGbg.vtt CDsP8gaVGbg.vtt
Traitement test_cleaning_cr/CxgdUjywiDE.vtt CxgdUjywiDE.vtt
Traitement test_cleaning_cr/F5-w4cvC_L0.vtt F5-w4cvC_L0.vtt
Traitement test_cleaning_cr/FDNmSH37IEk.

### Nettoyer fichier phrase
- remonter la ponctuation forte si elle est isolée sur une ligne
- remonter les sections commençant par une virgule si elles sont isolées du reste de la phrase sur la ligne suivante

In [238]:
files= m.lister_fichiers_with_path(output_sent)

In [239]:
files

['test_clean_sent/-LhfYZ1ihpI.vtt',
 'test_clean_sent/-sgE2QHsskA.vtt',
 'test_clean_sent/0-RtcGGzUoE.vtt',
 'test_clean_sent/0IViLqFmgIY.vtt',
 'test_clean_sent/0hXvxmgHk_c.vtt',
 'test_clean_sent/1AjRdJ5d_Ww.vtt',
 'test_clean_sent/1ILfD_BjLNk.vtt',
 'test_clean_sent/1MHphyCtLLE.vtt',
 'test_clean_sent/3TEX9ruhaXo.vtt',
 'test_clean_sent/6fsIXStr6w4.vtt',
 'test_clean_sent/8IqhpOiPMxY.vtt',
 'test_clean_sent/8ZUIw7jcaZE.vtt',
 'test_clean_sent/B62_uSapEhI.vtt',
 'test_clean_sent/CDsP8gaVGbg.vtt',
 'test_clean_sent/CxgdUjywiDE.vtt',
 'test_clean_sent/F5-w4cvC_L0.vtt',
 'test_clean_sent/FDNmSH37IEk.vtt',
 'test_clean_sent/G3Tz-srNvjs.vtt',
 'test_clean_sent/H0dUQaWH3UE.vtt',
 'test_clean_sent/H448NJiwMRI.vtt',
 'test_clean_sent/K7UFz1gAzAQ.vtt',
 'test_clean_sent/K7WqKupeGVk.vtt',
 'test_clean_sent/KhxgwQDJpCg.vtt',
 'test_clean_sent/LicjiPStTmU.vtt',
 'test_clean_sent/M5SVtrVMQm0.vtt',
 'test_clean_sent/NiQIWYxm8EQ.vtt',
 'test_clean_sent/QXVJZkmva3s.vtt',
 'test_clean_sent/RenNdZa3-Q

In [240]:
def get_dict_vtt(input):
    with open(input,encoding="utf-8") as f:
        lines = f.readlines()

    dict_sub = {}
    i = 0
    j = 0  

    while j < len(lines): 
        element = lines[j]
        if element.startswith("00:") or element.startswith("01:") or element.startswith("02:"):
            # Extraire le temps de début et de fin
            timing_line = element.strip().split(' --> ')
            start_time, end_time = timing_line

            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith("00:") and not lines[j+1].startswith("01:") and not lines[j+1].startswith("02:"):
                j += 1
                content = lines[j]
                text = text + " " + content.strip()

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub

In [241]:
ponctuations = {"!", ".", "?", "....", "...",'"',":"}

for file in files:
    with open(file, 'r', encoding="utf-8") as f:
        print(f"Traitement de {file}")
        ligne_precedente = ""
        liste_sent = f.readlines()
        if len(liste_sent) > 3:
            print(f"File long enough, traitement : {file} --- {len(liste_sent)}")
            i = 0
            txt = ""
            while i < len(liste_sent):
                if i < len(liste_sent) - 1:
                    if liste_sent[i + 1].strip() in ponctuations or liste_sent[i+1].startswith(","):
                        txt = txt + liste_sent[i].strip() + liste_sent[i + 1]
                        i = i + 2
                        var = True
                    else:
                        txt = txt + liste_sent[i]
                        i = i + 1
                        var = False
                else:
                    if var == False:
                        txt = txt + liste_sent[i]
                        i = i + 1
                    else:
                        i = i +1

            with open(file, "w", encoding="utf-8") as f:
                print(f"Ecriture du nouveau fichier {file}")
                f.write(txt)

            print(f"{file} done")
            i = 0
        else:
            print(f"not long enough : {file}")


Traitement de test_clean_sent/-LhfYZ1ihpI.vtt
File long enough, traitement : test_clean_sent/-LhfYZ1ihpI.vtt --- 366
Ecriture du nouveau fichier test_clean_sent/-LhfYZ1ihpI.vtt
test_clean_sent/-LhfYZ1ihpI.vtt done
Traitement de test_clean_sent/-sgE2QHsskA.vtt
File long enough, traitement : test_clean_sent/-sgE2QHsskA.vtt --- 190
Ecriture du nouveau fichier test_clean_sent/-sgE2QHsskA.vtt
test_clean_sent/-sgE2QHsskA.vtt done
Traitement de test_clean_sent/0-RtcGGzUoE.vtt
File long enough, traitement : test_clean_sent/0-RtcGGzUoE.vtt --- 327
Ecriture du nouveau fichier test_clean_sent/0-RtcGGzUoE.vtt
test_clean_sent/0-RtcGGzUoE.vtt done
Traitement de test_clean_sent/0IViLqFmgIY.vtt
File long enough, traitement : test_clean_sent/0IViLqFmgIY.vtt --- 441
Ecriture du nouveau fichier test_clean_sent/0IViLqFmgIY.vtt
test_clean_sent/0IViLqFmgIY.vtt done
Traitement de test_clean_sent/0hXvxmgHk_c.vtt
File long enough, traitement : test_clean_sent/0hXvxmgHk_c.vtt --- 151
Ecriture du nouveau fichier

- remonter ce qu'il y a après ":" si c'est isolée de la phrase sur la ligne suivante

In [242]:
ponctuations = {":"}

# Ouvrez le fichier en mode lecture
for file in files:
    with open(file,"r",encoding="utf-8") as f:
        print(f"Traitement de {file}")
        ligne_precedente = ""
        liste_sent = f.readlines()
        if len(liste_sent) > 3:
            print(f"File long enough, traitement : {file} --- {len(liste_sent)}")
            i=0
            txt=""
            while i < len(liste_sent):
                if i < len(liste_sent)-1:
                    if liste_sent[i].strip().endswith(":"):
                        txt = txt + liste_sent[i].strip() + " " + liste_sent[i+1]
                        i = i+2
                        var = True
                    else:
                        txt = txt + liste_sent[i]
                        i = i +1
                        var = False
                else:
                    if var == False:
                        txt = txt + liste_sent[i]
                        i = i +1
                    else:
                        i = i +1
            with open(file,"w",encoding="utf-8") as f:
                print(f"Ecriture du nouveau fichier {file}")
                f.write(txt)
            print(f"{file} done")
            i = 0
        else:
            print(f"not long enough : {file}")



Traitement de test_clean_sent/-LhfYZ1ihpI.vtt
File long enough, traitement : test_clean_sent/-LhfYZ1ihpI.vtt --- 365
Ecriture du nouveau fichier test_clean_sent/-LhfYZ1ihpI.vtt
test_clean_sent/-LhfYZ1ihpI.vtt done
Traitement de test_clean_sent/-sgE2QHsskA.vtt
File long enough, traitement : test_clean_sent/-sgE2QHsskA.vtt --- 189
Ecriture du nouveau fichier test_clean_sent/-sgE2QHsskA.vtt
test_clean_sent/-sgE2QHsskA.vtt done
Traitement de test_clean_sent/0-RtcGGzUoE.vtt
File long enough, traitement : test_clean_sent/0-RtcGGzUoE.vtt --- 325
Ecriture du nouveau fichier test_clean_sent/0-RtcGGzUoE.vtt
test_clean_sent/0-RtcGGzUoE.vtt done
Traitement de test_clean_sent/0IViLqFmgIY.vtt
File long enough, traitement : test_clean_sent/0IViLqFmgIY.vtt --- 436
Ecriture du nouveau fichier test_clean_sent/0IViLqFmgIY.vtt
test_clean_sent/0IViLqFmgIY.vtt done
Traitement de test_clean_sent/0hXvxmgHk_c.vtt
File long enough, traitement : test_clean_sent/0hXvxmgHk_c.vtt --- 150
Ecriture du nouveau fichier