# Intermediate processing

This notebook takes a subtitle file and produces two new files, which are used to build subtitles segmented into sentences:
1. An over-segmented .vtt subtitle file, in which each subtitle contains exactly one sentence or one fragment of a sentence
2. A .txt file containing one sentence per line

In [1]:
import spacy
import os
import re
import module_traitement as m
spacy.prefer_gpu()
nlp = spacy.load("fr_dep_news_trf")
from spacy.language import Language

In [98]:
# For Mediapi
file_with_path = m.lister_fichiers_with_path("../data/aligned_mediapi/")
folder = m.lister_fichiers("../data/aligned_mediapi/")
output_seg = "../data/new_segmentation_mediapi/"
output_sent = "../data/sentence_mediapi/"

In [2]:
# For Matignon - LSF
file_with_path = m.lister_fichiers_with_path("../data/cr_audio_aligned/")
folder = m.lister_fichiers("../data/cr_audio_aligned/")
output_seg = "../data/new_segmentation_cr/"
output_sent = "../data/sentence_matignon/"

## File cleaning

Clean the file before the other preprocessing steps, to make sentence detection and splitting easier.

In [177]:
# TEST - For Matignon - LSF
file_with_path = m.lister_fichiers_with_path("../data/cr_audio_aligned/")
folder = m.lister_fichiers("../data/cr_audio_aligned/")
output_seg = "test_new_seg_cr/"
output_cleaning = "test_cleaning_cr/"
#output_sent = "../data/sentence_matignon/"

In [6]:
# TEST - For Mediapi
file_with_path = m.lister_fichiers_with_path("../data/aligned_mediapi/")
folder = m.lister_fichiers("../data/aligned_mediapi/")
output_seg = "test_new_seg_mediapi/"
output_cleaning = "test_cleaning_mediapi/"
#output_sent = "../data/sentence_matignon/"

In [7]:
def get_dict_vtt_clean(path):
    """Parse a .vtt file into {index: {'start', 'end', 'text'}} and clean each cue's text."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # A timing line is recognised by its hour field (the code assumes files shorter than 3 hours)
    time_prefixes = ("00:", "01:", "02:")
    dict_sub = {}
    i = 0
    j = 0

    while j < len(lines):
        element = lines[j]
        if element.startswith(time_prefixes):
            # Extract the start and end times
            start_time, end_time = element.strip().split(' --> ')

            # Gather every text line of the cue
            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith(time_prefixes):
                j += 1
                content = lines[j]
                if content.startswith("-"):
                    # Drop only the leading dialogue dash; hyphens inside words ("Est-ce") are kept
                    content = content[1:]
                text = text + " " + content.strip()

            # Clean once per cue: the last rule below is not idempotent, so running it
            # after every appended line would insert extra periods
            text = re.sub(r'\[INAUDIBLE\],?', 'Inaudible.', text)
            text = text.replace("[ INAUDIBLE ]", "Inaudible.")
            text = text.replace("... -G. Attal : ", "")
            text = text.replace("-G. Attal : ", "")
            text = text.replace("G. Attal : ", "")
            text = text.replace("-Bonjour", "Bonjour")
            text = text.replace("Bonjour.", "Bonjour,")
            # Remove punctuation that directly follows a closing quote
            # (the ellipsis goes first, otherwise the '".' rule would leave two dots behind)
            text = text.replace('"...', '"')
            text = text.replace('".', '"')
            text = text.replace('"?', '"')
            text = text.replace('"!', '"')
            text = re.sub(r'["“”«»]', '', text)
            text = text.replace(" Inaudible.", ". Inaudible.")  # to check 0-R

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub


In [8]:
for file,name in zip(file_with_path,folder):
    dict_sub = get_dict_vtt_clean(file)
    m.create_vtt_file(dict_sub,f"{output_cleaning}/{name}")
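
To see what the cleaning rules do to a single cue, here is a minimal sketch that isolates most of them in a standalone function. The name `clean_cue_text` is hypothetical and not part of the notebook or of `module_traitement`. The sketch leaves out the final `" Inaudible."` rule and, as an assumption of this sketch, applies the ellipsis rule before the single-period rule.

```python
import re

def clean_cue_text(text):
    """Apply a subset of the notebook's cleaning rules to the text of one cue."""
    # Normalise the markers for inaudible speech
    text = re.sub(r'\[INAUDIBLE\],?', 'Inaudible.', text)
    text = text.replace("[ INAUDIBLE ]", "Inaudible.")
    # Drop speaker labels, longest form first
    for label in ("... -G. Attal : ", "-G. Attal : ", "G. Attal : "):
        text = text.replace(label, "")
    # Treat an opening greeting as the start of a longer sentence
    text = text.replace("-Bonjour", "Bonjour").replace("Bonjour.", "Bonjour,")
    # Remove punctuation that directly follows a closing quote, then the quotes
    for suffix in ('"...', '".', '"?', '"!'):
        text = text.replace(suffix, '"')
    text = re.sub(r'["“”«»]', '', text)
    return text.strip()

print(clean_cue_text('-G. Attal : Bonjour. Il a dit "oui".'))
print(clean_cue_text("[INAUDIBLE], merci."))
```

Note that the quote rules also remove the sentence-final period when it follows a closing quote, which is what the notebook's rules do as written.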

## Over-segmented file

These files are produced by the following steps:
1. Find strong punctuation (period, exclamation mark, question mark) in the middle of a subtitle
2. Keep the timestamps in memory
3. Compute the time taken to pronounce one character, so the subtitle can be split proportionally
4. Split the subtitle and generate a new file
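
The timing part of these steps can be sketched as follows. This is a minimal illustration, not the notebook's code: `to_ms`, `to_timestamp` and `split_cue` are hypothetical helpers, timestamps are assumed to be in the `HH:MM:SS.mmm` form, and the arithmetic is done in integer milliseconds (rounded down) to avoid floating-point drift.

```python
def to_ms(timestamp):
    """Convert 'HH:MM:SS.mmm' to integer milliseconds."""
    hms, millis = timestamp.split(".")
    hours, minutes, seconds = map(int, hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)

def to_timestamp(ms):
    """Convert integer milliseconds back to 'HH:MM:SS.mmm'."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

def split_cue(start, end, full_text, pieces):
    """Give each piece a share of the cue duration proportional to its length."""
    duration_ms = to_ms(end) - to_ms(start)
    total_chars = len(full_text)
    cursor = to_ms(start)
    cues = []
    for piece in pieces:
        piece_ms = duration_ms * len(piece) // total_chars
        cues.append({"start": to_timestamp(cursor),
                     "end": to_timestamp(cursor + piece_ms),
                     "text": piece})
        cursor += piece_ms  # the next piece starts where this one ends
    return cues

pieces = ["Est-ce qu'il se livrera à un massacre ?", "Mystère..."]
for cue in split_cue("00:06:37.500", "00:06:40.420", " ".join(pieces), pieces):
    print(cue)
```

With a 2.92 s cue of 50 characters, each character gets 58.4 ms, so the 39-character question ends at `00:06:39.777` and the 10-character `Mystère...` ends at `00:06:40.361`. Because the pieces are stripped of the spaces between them, the last piece ends slightly before the original end time.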

### Processing

The next step can be run twice, in case a subtitle contained several strong punctuation marks. I can modify the regular expression so that a capital letter followed by a period (`LETTRE_MAJ.`) is not used as a split point.
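
A minimal sketch of such a split, assuming that `#` never occurs in the subtitles so it can serve as a temporary placeholder. The function name `split_on_strong_punctuation` is hypothetical, and restoring the periods at the end is a choice made for this sketch.

```python
import re

def split_on_strong_punctuation(text):
    """Split text after runs of '.', '!' or '?', keeping acronyms such as S.N.C.F intact."""
    # Hide periods that sit between two capital letters
    protected = re.sub(r'(?<=[A-Z])\.(?=[A-Z])', '#', text)
    # The capturing group makes re.split keep the punctuation as separate items
    parts = re.split(r'([.!?]+)', protected)
    # Glue each sentence back to the punctuation that follows it
    merged = [parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
              for i in range(0, len(parts), 2)]
    # Drop empty items and restore the hidden periods
    return [s.strip().replace('#', '.') for s in merged if s.strip()]

print(split_on_strong_punctuation("La S.N.C.F annonce une grève. Pourquoi ?"))
```

A period that follows a single capital letter and precedes a space, as in `M. Dupont`, is still treated as a sentence boundary by this pattern.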

In [9]:
import module_traitement as m

In [10]:
file_with_path = m.lister_fichiers_with_path("test_cleaning_mediapi/")
folder = m.lister_fichiers("test_cleaning_mediapi/")

In [11]:
def time_to_seconds(timestamp):
    # Split the timestamp into hours, minutes, seconds, and milliseconds
    milliseconds = int(timestamp.split('.')[1])
    tmp = timestamp.split('.')[0]
    hours, minutes, seconds = map(int, tmp.split(':'))

    # Calculate the total seconds
    total_seconds = hours * 3600 + minutes * 60 + seconds + milliseconds / 1000.0

    return total_seconds

In [13]:
for file, name in zip(file_with_path, folder):
    print(f"TRAITEMENT {file} ---- {name}")
    dict_sub = m.get_dict_vtt(file)
    new_dict = {}
    mm = 0  # index of the next cue written to new_dict
    pattern = r'([.!?]+)'
    for k, v in dict_sub.items():
        # Replace the point between two capital letters with '#' so that it is not a split point
        modified_text = re.sub(r'(?<=[A-Z])\.(?=[A-Z])', '#', v["text"])
        # Use re.split() to split the text based on the pattern
        sentences = re.split(pattern, modified_text)
        # Combine pairs of adjacent list elements (sentence + punctuation)
        result = [sentences[i] + sentences[i + 1] if i < len(sentences) - 1 else sentences[i] for i in range(0, len(sentences), 2)]
        # Remove empty strings and restore the protected points
        # (assumes '#' does not otherwise occur in the subtitles)
        result = [sentence.strip().replace('#', '.') for sentence in result if sentence.strip()]
        if len(result) == 1:
            # A single sentence: keep the cue unchanged
            new_dict[mm] = v
            mm = mm + 1
        elif result:
            print(f"resultat : {result}")
            start_time_str = v["start"]
            end_time_str = v["end"]
            nb_of_carach = len(v["text"])
            duration = time_to_seconds(end_time_str) - time_to_seconds(start_time_str)
            print(duration)
            # Average time per character, used to give each piece a proportional share
            sec_par_letter = duration / nb_of_carach
            for match in result:
                duration_match = len(match) * sec_par_letter
                end_time = m.ajouter_secondes(start_time_str, duration_match)
                print(end_time)
                print(f"start time ({type(start_time_str)}) : {start_time_str}, end time ({type(end_time)}) : {end_time}, text : {match}")
                new_dict[mm] = {'start': start_time_str, "end": end_time, 'text': match}
                # The next piece starts where this one ends
                start_time_str = end_time
                mm = mm + 1
    m.create_vtt_file(new_dict, f"{output_seg}/{name}")

                    
    

TRAITEMENT test_cleaning_mediapi/b7f2d8f0c3.vtt ---- b7f2d8f0c3.vtt
TRAITEMENT test_cleaning_mediapi/3d0b82b459.vtt ---- 3d0b82b459.vtt
resultat : ["Est-ce qu'il se livrera à un massacre ?", 'Mystère...']
2.920000000000016
00:06:39.777
start time (<class 'str'>) : 00:06:37.500, end time (<class 'str'>) : 00:06:39.777, text : Est-ce qu'il se livrera à un massacre ?
00:06:40.361
start time (<class 'str'>) : 00:06:39.777, end time (<class 'str'>) : 00:06:40.361, text : Mystère...
TRAITEMENT test_cleaning_mediapi/44f554f914.vtt ---- 44f554f914.vtt
TRAITEMENT test_cleaning_mediapi/b9a51f4361.vtt ---- b9a51f4361.vtt
TRAITEMENT test_cleaning_mediapi/4e073949b1.vtt ---- 4e073949b1.vtt
TRAITEMENT test_cleaning_mediapi/0b1437bc85.vtt ---- 0b1437bc85.vtt
TRAITEMENT test_cleaning_mediapi/752500b761.vtt ---- 752500b761.vtt
TRAITEMENT test_cleaning_mediapi/cba6cefad2.vtt ---- cba6cefad2.vtt
TRAITEMENT test_cleaning_mediapi/41c606553f.vtt ---- 41c606553f.vtt
TRAITEMENT test_cleaning_mediapi/3f1b2118c

## Producing a sentence file

The idea is to output a file containing one sentence per line, which makes it simpler to build the new subtitle file segmented into sentences.

### Processing

spaCy is used to extract the sentences.

In [18]:
file_with_path

['test_cleaning_mediapi/b7f2d8f0c3.vtt',
 'test_cleaning_mediapi/3d0b82b459.vtt',
 'test_cleaning_mediapi/44f554f914.vtt',
 'test_cleaning_mediapi/b9a51f4361.vtt',
 'test_cleaning_mediapi/4e073949b1.vtt',
 'test_cleaning_mediapi/0b1437bc85.vtt',
 'test_cleaning_mediapi/752500b761.vtt',
 'test_cleaning_mediapi/cba6cefad2.vtt',
 'test_cleaning_mediapi/41c606553f.vtt',
 'test_cleaning_mediapi/3f1b2118ca.vtt',
 'test_cleaning_mediapi/ed969c3e70.vtt',
 'test_cleaning_mediapi/26f1ff8385.vtt',
 'test_cleaning_mediapi/2b4b33189c.vtt',
 'test_cleaning_mediapi/ada7d10a19.vtt',
 'test_cleaning_mediapi/fd41a24117.vtt',
 'test_cleaning_mediapi/15abfc95ae.vtt',
 'test_cleaning_mediapi/495145911e.vtt',
 'test_cleaning_mediapi/5ef5fa319a.vtt',
 'test_cleaning_mediapi/0fd91cb814.vtt',
 'test_cleaning_mediapi/9a58b08185.vtt',
 'test_cleaning_mediapi/ac6160b61a.vtt',
 'test_cleaning_mediapi/74bb642e72.vtt',
 'test_cleaning_mediapi/17217ca54b.vtt',
 'test_cleaning_mediapi/928499b438.vtt',
 'test_cleaning_

In [19]:
output_sent = "test_clean_sent_mediapi/"

In [16]:
def get_dict_vtt(input):
    with open(input,encoding="utf-8") as f:
        lines = f.readlines()

    dict_sub = {}
    i = 0
    j = 0  

    while j < len(lines): 
        element = lines[j]
        if element.startswith("00:") or element.startswith("01:") or element.startswith("02:"):
            # Extract the start and end times
            timing_line = element.strip().split(' --> ')
            start_time, end_time = timing_line

            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith("00:") and not lines[j+1].startswith("01:") and not lines[j+1].startswith("02:"):
                j += 1
                content = lines[j]
                text = text + " " + content.strip()

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub

In [20]:
for files, name in zip(file_with_path,folder):
    print("Traitement",files,name)
    dict_sub = m.get_dict_vtt(files)
    text = ""
    sentences = []
    for k,v in dict_sub.items():
        for kk,vv in v.items():
            if kk=="text":
                text = text + vv + " "
    doc = nlp(text)
    assert doc.has_annotation("SENT_START")
    for sent in doc.sents:
        sentences.append(sent.text)
    with open(f"{output_sent}/{name}","w",encoding="utf-8") as output:
        for sent in sentences:
            output.write(sent+"\n")

Traitement test_cleaning_mediapi/b7f2d8f0c3.vtt b7f2d8f0c3.vtt
Traitement test_cleaning_mediapi/3d0b82b459.vtt 3d0b82b459.vtt
Traitement test_cleaning_mediapi/44f554f914.vtt 44f554f914.vtt
Traitement test_cleaning_mediapi/b9a51f4361.vtt b9a51f4361.vtt
Traitement test_cleaning_mediapi/4e073949b1.vtt 4e073949b1.vtt
Traitement test_cleaning_mediapi/0b1437bc85.vtt 0b1437bc85.vtt
Traitement test_cleaning_mediapi/752500b761.vtt 752500b761.vtt
Traitement test_cleaning_mediapi/cba6cefad2.vtt cba6cefad2.vtt
Traitement test_cleaning_mediapi/41c606553f.vtt 41c606553f.vtt
Traitement test_cleaning_mediapi/3f1b2118ca.vtt 3f1b2118ca.vtt
Traitement test_cleaning_mediapi/ed969c3e70.vtt ed969c3e70.vtt
Traitement test_cleaning_mediapi/26f1ff8385.vtt 26f1ff8385.vtt
Traitement test_cleaning_mediapi/2b4b33189c.vtt 2b4b33189c.vtt
Traitement test_cleaning_mediapi/ada7d10a19.vtt ada7d10a19.vtt
Traitement test_cleaning_mediapi/fd41a24117.vtt fd41a24117.vtt
Traitement test_cleaning_mediapi/15abfc95ae.vtt 15abfc9

### Cleaning the sentence file
- pull strong punctuation up onto the previous line when it is isolated on its own line
- pull a fragment that starts with a comma up onto the previous line when it has been separated from the rest of its sentence
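
The two rules above can be sketched as a pure function over a list of lines. This is an illustrative variant rather than the notebook's loop: `pull_up_fragments` is a hypothetical name, the function returns stripped lines instead of rewriting the file, and it also handles several isolated fragments in a row.

```python
PUNCTUATION = {"!", ".", "?", "....", "...", '"', ":"}

def pull_up_fragments(lines):
    """Attach isolated punctuation and comma-initial fragments to the previous line."""
    merged = []
    for line in lines:
        fragment = line.strip()
        if merged and (fragment in PUNCTUATION or line.startswith(",")):
            # No space is inserted: the fragment continues the previous sentence
            merged[-1] = merged[-1] + fragment
        else:
            merged.append(fragment)
    return merged

lines = ["Il est venu\n", ", puis il est reparti.\n", "Mystère\n", "...\n", "Fin.\n"]
print(pull_up_fragments(lines))
```

Building a new list and appending to its last element avoids index bookkeeping, so the final line is never skipped.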

In [21]:
files= m.lister_fichiers_with_path(output_sent)

In [22]:
files

['test_clean_sent_mediapi/b7f2d8f0c3.vtt',
 'test_clean_sent_mediapi/3d0b82b459.vtt',
 'test_clean_sent_mediapi/44f554f914.vtt',
 'test_clean_sent_mediapi/b9a51f4361.vtt',
 'test_clean_sent_mediapi/4e073949b1.vtt',
 'test_clean_sent_mediapi/0b1437bc85.vtt',
 'test_clean_sent_mediapi/752500b761.vtt',
 'test_clean_sent_mediapi/cba6cefad2.vtt',
 'test_clean_sent_mediapi/41c606553f.vtt',
 'test_clean_sent_mediapi/3f1b2118ca.vtt',
 'test_clean_sent_mediapi/ed969c3e70.vtt',
 'test_clean_sent_mediapi/26f1ff8385.vtt',
 'test_clean_sent_mediapi/2b4b33189c.vtt',
 'test_clean_sent_mediapi/ada7d10a19.vtt',
 'test_clean_sent_mediapi/fd41a24117.vtt',
 'test_clean_sent_mediapi/15abfc95ae.vtt',
 'test_clean_sent_mediapi/495145911e.vtt',
 'test_clean_sent_mediapi/5ef5fa319a.vtt',
 'test_clean_sent_mediapi/0fd91cb814.vtt',
 'test_clean_sent_mediapi/9a58b08185.vtt',
 'test_clean_sent_mediapi/ac6160b61a.vtt',
 'test_clean_sent_mediapi/74bb642e72.vtt',
 'test_clean_sent_mediapi/17217ca54b.vtt',
 'test_clea

In [23]:
def get_dict_vtt(input):
    with open(input,encoding="utf-8") as f:
        lines = f.readlines()

    dict_sub = {}
    i = 0
    j = 0  

    while j < len(lines): 
        element = lines[j]
        if element.startswith("00:") or element.startswith("01:") or element.startswith("02:"):
            # Extract the start and end times
            timing_line = element.strip().split(' --> ')
            start_time, end_time = timing_line

            text = ""
            while j + 1 < len(lines) and not lines[j + 1].startswith("00:") and not lines[j+1].startswith("01:") and not lines[j+1].startswith("02:"):
                j += 1
                content = lines[j]
                text = text + " " + content.strip()

            dict_sub[i] = {'start': start_time, 'end': end_time, 'text': text.strip()}
            i += 1

        j += 1

    return dict_sub

In [24]:
ponctuations = {"!", ".", "?", "....", "...", '"', ":"}

for file in files:
    print(f"Traitement de {file}")
    # Read everything first so the file is closed before it is rewritten
    with open(file, 'r', encoding="utf-8") as f:
        liste_sent = f.readlines()
    if len(liste_sent) > 3:
        print(f"File long enough, traitement : {file} --- {len(liste_sent)}")
        i = 0
        txt = ""
        while i < len(liste_sent):
            # Merge the next line into the current one when it is an isolated
            # punctuation mark or starts with a comma
            if i < len(liste_sent) - 1 and (liste_sent[i + 1].strip() in ponctuations or liste_sent[i + 1].startswith(",")):
                txt = txt + liste_sent[i].strip() + liste_sent[i + 1]
                i = i + 2
            else:
                # Also covers the last line: had a merge consumed it, the loop would already be over
                txt = txt + liste_sent[i]
                i = i + 1

        with open(file, "w", encoding="utf-8") as f:
            print(f"Ecriture du nouveau fichier {file}")
            f.write(txt)

        print(f"{file} done")
    else:
        print(f"not long enough : {file}")


Traitement de test_clean_sent_mediapi/b7f2d8f0c3.vtt
not long enough : test_clean_sent_mediapi/b7f2d8f0c3.vtt
Traitement de test_clean_sent_mediapi/3d0b82b459.vtt
File long enough, traitement : test_clean_sent_mediapi/3d0b82b459.vtt --- 61
Ecriture du nouveau fichier test_clean_sent_mediapi/3d0b82b459.vtt
test_clean_sent_mediapi/3d0b82b459.vtt done
Traitement de test_clean_sent_mediapi/44f554f914.vtt
not long enough : test_clean_sent_mediapi/44f554f914.vtt
Traitement de test_clean_sent_mediapi/b9a51f4361.vtt
File long enough, traitement : test_clean_sent_mediapi/b9a51f4361.vtt --- 8
Ecriture du nouveau fichier test_clean_sent_mediapi/b9a51f4361.vtt
test_clean_sent_mediapi/b9a51f4361.vtt done
Traitement de test_clean_sent_mediapi/4e073949b1.vtt
not long enough : test_clean_sent_mediapi/4e073949b1.vtt
Traitement de test_clean_sent_mediapi/0b1437bc85.vtt
not long enough : test_clean_sent_mediapi/0b1437bc85.vtt
Traitement de test_clean_sent_mediapi/752500b761.vtt
File long enough, traiteme

- pull up whatever follows a ":" when it has been separated from its sentence and placed on the next line
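
As a minimal sketch of this rule, the function below joins any line ending with a colon to the line that follows it. `join_after_colon` is a hypothetical name. Unlike the notebook's loop, it works on a list of stripped lines and keeps joining if the merged line still ends with a colon.

```python
def join_after_colon(lines):
    """Join each line that ends with ':' to the line that follows it."""
    merged = []
    for line in lines:
        fragment = line.strip()
        if merged and merged[-1].endswith(":"):
            # A space separates the colon from the continuation
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    return merged

print(join_after_colon(["Il a déclaré :\n", "Je reviendrai.\n", "Fin.\n"]))
```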

In [25]:
ponctuations = {":"}

# Read each file, then merge every line ending with ":" with the line that follows it
for file in files:
    print(f"Traitement de {file}")
    with open(file, "r", encoding="utf-8") as f:
        liste_sent = f.readlines()
    if len(liste_sent) > 3:
        print(f"File long enough, traitement : {file} --- {len(liste_sent)}")
        i = 0
        txt = ""
        while i < len(liste_sent):
            if i < len(liste_sent) - 1 and liste_sent[i].strip().endswith(":"):
                txt = txt + liste_sent[i].strip() + " " + liste_sent[i + 1]
                i = i + 2
            else:
                # Also covers the last line: had a merge consumed it, the loop would already be over
                txt = txt + liste_sent[i]
                i = i + 1
        with open(file, "w", encoding="utf-8") as f:
            print(f"Ecriture du nouveau fichier {file}")
            f.write(txt)
        print(f"{file} done")
    else:
        print(f"not long enough : {file}")



Traitement de test_clean_sent_mediapi/b7f2d8f0c3.vtt
not long enough : test_clean_sent_mediapi/b7f2d8f0c3.vtt
Traitement de test_clean_sent_mediapi/3d0b82b459.vtt
File long enough, traitement : test_clean_sent_mediapi/3d0b82b459.vtt --- 61
Ecriture du nouveau fichier test_clean_sent_mediapi/3d0b82b459.vtt
test_clean_sent_mediapi/3d0b82b459.vtt done
Traitement de test_clean_sent_mediapi/44f554f914.vtt
not long enough : test_clean_sent_mediapi/44f554f914.vtt
Traitement de test_clean_sent_mediapi/b9a51f4361.vtt
File long enough, traitement : test_clean_sent_mediapi/b9a51f4361.vtt --- 8
Ecriture du nouveau fichier test_clean_sent_mediapi/b9a51f4361.vtt
test_clean_sent_mediapi/b9a51f4361.vtt done
Traitement de test_clean_sent_mediapi/4e073949b1.vtt
not long enough : test_clean_sent_mediapi/4e073949b1.vtt
Traitement de test_clean_sent_mediapi/0b1437bc85.vtt
not long enough : test_clean_sent_mediapi/0b1437bc85.vtt
Traitement de test_clean_sent_mediapi/752500b761.vtt
File long enough, traiteme