# GENERATING TIME-ALIGNED TEXT

# 📎 Documentation

* `input_format`: The source of the audio/video file to be transcribed
  * `youtube`: A YouTube video
    * The transcribed file(s) are saved to this Colab, and will be deleted when the Colab runtime is disconnected.
  * `gdrive`: A file in your Google Drive account
    * If you select this option, you will need to allow this notebook to connect to your Google Drive account.
    * The transcribed file(s) are saved to the same folder as the original file.
  * `local`: A local file that you have uploaded to this Colab
    * If you select this option, you will need to first upload the file to the Files tab (see Step 1 [here](https://wandb.ai/wandb_fc/gentle-intros/reports/How-to-transcribe-your-audio-to-text-for-free-with-SRTs-VTTs---VmlldzozMzc1MzU3)).
    * The transcribed file(s) are saved to this Colab, and will be deleted when the Colab runtime is disconnected.
* `file`: The URL of the YouTube video or the path of the audio file to be transcribed.
  * Example: `file = "https://www.youtube.com/watch?v=AUDIO"` (transcribing a YouTube video)
  * Example: `file = "/content/drive/My Drive/AUDIO.mp3"` (transcribing a Google Drive file)
  * Example: `file = "/content/AUDIO.mp3"` (transcribing a local file)
* `plain`: Whether to save the transcription as a text file or not.
* `srt`: Whether to save the transcription as an SRT file or not.
* `vtt`: Whether to save the transcription as a VTT file or not.
* `tsv`: Whether to save the transcription as a TSV (tab-separated values) file or not.
* `download`: Whether to download the transcribed file(s) or not.


# ✨ README

This is the companion Colab for the article "[How to transcribe your audio to text, for free (with SRTs/VTTs!)](https://wandb.ai/wandb_fc/gentle-intros/reports/How-to-transcribe-your-audio-to-text-for-free-with-SRTs-VTTs---VmlldzozNDczNTI0)".

This Colab shows how to use OpenAI's Whisper to transcribe audio and audiovisual files, and how to save that transcription as a plain text file or as a VTT/SRT caption file.


In [None]:
# @title 🌴 Change the values in this section

# @markdown Select the source of the audio/video file to be transcribed
input_format = "youtube" #@param ["youtube", "gdrive", "local"]

# @markdown Enter the URL of the YouTube video or the path of the audio file to be transcribed
file = "https://www.youtube.com/watch?v=iFl2_XlSvX4" #@param {type:"string"}

#@markdown Click here if you'd like to save the transcription as text file
plain = True #@param {type:"boolean"}

# @markdown Click here if you'd like to save the transcription as an SRT file
srt = True #@param {type:"boolean"}

#@markdown Click here if you'd like to save the transcription as a VTT file
vtt = False #@param {type:"boolean"}

#@markdown Click here if you'd like to save the transcription as a TSV file
tsv = False #@param {type:"boolean"}

#@markdown Click here if you'd like to download the transcribed file(s) locally

download = False #@param {type:"boolean"}

# 🛠 Set Up

The blocks below install the necessary Python libraries (including Whisper), configure Whisper, and define various helper functions.



## 🤝 Dependencies

In [None]:
# Dependencies

!pip install -q yt-dlp
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q git+https://github.com/m-bain/whisperx.git
!pip install silero-vad pydub torch --quiet

!pip install html2image
import os, re
import torch
from pathlib import Path
!apt-get update
!apt-get install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


import whisper
from whisper.utils import get_writer


## 👋 Whisper configuration

This Colab uses `large`, [the largest multilingual](https://github.com/openai/whisper#available-models-and-languages) Whisper model.


In [None]:
# Use CUDA, if available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the desired model
model = whisper.load_model("large").to(DEVICE)

100%|█████████████████████████████████████| 2.88G/2.88G [00:38<00:00, 79.8MiB/s]


## 💪 YouTube helper functions

Code for helper functions when running Whisper on a YouTube video.

In [None]:
import yt_dlp
def to_snake_case(name):
    return name.lower().replace(" ", "_").replace(":", "_").replace("__", "_")

def download_youtube_audio_yt_dlp(url ,output_path="audioL.mp3"):
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': output_path.replace(".mp3", ""),
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],


        'quiet': True,
        'noplaylist': True
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

    return output_path
import subprocess

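def recortar_audio(input_path, output_path, start_time, end_time):
    """Trim input_path to the range start_time..end_time and write it to output_path."""
    comando = [
        "ffmpeg",
        "-y",  # overwrite output_path if it already exists instead of prompting
        "-i", input_path,
        "-ss", start_time,
        "-to", end_time,
        "-c", "copy",  # copy the stream without re-encoding
        output_path
    ]
    subprocess.run(comando, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)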
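
Because `recortar_audio` silences FFmpeg's output, a bad time range fails without any message. The following is a minimal sketch of a pre-check, assuming timestamps are given as `SS`, `MM:SS`, or `HH:MM:SS` strings. The helper names `timestamp_to_seconds` and `clip_duration` are hypothetical and not part of the notebook.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert "SS", "MM:SS", or "HH:MM:SS" into seconds."""
    parts = timestamp.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timestamp: {timestamp!r}")
    seconds = 0.0
    for part in parts:
        # Each field to the left is worth 60 times the next one
        seconds = seconds * 60 + float(part)
    return seconds


def clip_duration(start_time: str, end_time: str) -> float:
    """Return the clip length in seconds, rejecting empty or reversed ranges."""
    start = timestamp_to_seconds(start_time)
    end = timestamp_to_seconds(end_time)
    if end <= start:
        raise ValueError("end_time must be later than start_time")
    return end - start


# The range this notebook trims later on
print(clip_duration("00:1:40", "00:2:13"))  # 33.0
```

Running a check like this before calling `recortar_audio` makes a mistyped range raise an error instead of silently producing no clip.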




# ✍ Transcribing with Whisper

Ultimately, calling Whisper is as easy as one line!
* `result = model.transcribe(file)`

Most of the new `transcribe_file` function just exports the transcription as a text, SRT, VTT, or TSV file.
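
To show what the exporters work from, here is a minimal sketch that turns a result-shaped dictionary into SRT text by hand. The `sample_result` values are made up, and real Whisper segments carry more keys than the three used here. The notebook itself relies on Whisper's `get_writer`, so the helper names `format_srt_timestamp` and `segments_to_srt` are illustrative only.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from a list of {"start", "end", "text"} dictionaries."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical data shaped like the dictionary returned by model.transcribe
sample_result = {
    "text": " Hola mundo. Esto es una prueba.",
    "language": "es",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hola mundo."},
        {"start": 1.5, "end": 3.25, "text": " Esto es una prueba."},
    ],
}

print(segments_to_srt(sample_result["segments"]))
```

The same `segments` list is what the alignment step later in this notebook consumes, which is why the plain `text` string alone is not enough for it.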

In [None]:
def transcribe_file(model, file, plain, srt, vtt, tsv, download):
    """
    Runs Whisper on an audio file

    Parameters
    ----------
    model: Whisper
        The Whisper model instance.

    file: str
        The file path of the file to be transcribed.

    plain: bool
        Whether to save the transcription as a text file or not.

    srt: bool
        Whether to save the transcription as an SRT file or not.

    vtt: bool
        Whether to save the transcription as a VTT file or not.

    tsv: bool
        Whether to save the transcription as a TSV file or not.

    download: bool
        Whether to download the transcribed file(s) or not.

    Returns
    -------
    A dictionary containing the resulting text ("text") and segment-level details ("segments"), and
    the spoken language ("language"), which is detected when `decode_options["language"]` is None.
    """
    file_path = Path(file)
    print(f"Transcribing file: {file_path}\n")

    output_directory = file_path.parent

    # Run Whisper (the language is fixed to Spanish, so no language detection happens)
    result = model.transcribe(file, verbose = False, language = "es")

    if plain:
        txt_path = file_path.with_suffix(".txt")
        print(f"\nCreating text file")

        with open(txt_path, "w", encoding="utf-8") as txt:
            txt.write(result["text"])
    if srt:
        print(f"\nCreating SRT file")
        srt_writer = get_writer("srt", output_directory)
        srt_writer(result, str(file_path.stem))

    if vtt:
        print(f"\nCreating VTT file")
        vtt_writer = get_writer("vtt", output_directory)
        vtt_writer(result, str(file_path.stem))

    if tsv:
        print(f"\nCreating TSV file")

        tsv_writer = get_writer("tsv", output_directory)
        tsv_writer(result, str(file_path.stem))

    if download:
        from google.colab import files

        # The exports are written next to the source file, so search that folder
        stem = file_path.stem

        for colab_file in output_directory.glob(f"{stem}*"):
            if colab_file.suffix in [".txt", ".srt", ".vtt", ".tsv"]:
                print(f"Downloading {colab_file}")
                files.download(str(colab_file))

    return result

# 🧼 Code for removing noise from the audio

In [None]:
# Imports
import torch
import numpy as np
import os
from pydub import AudioSegment

# Load the Silero VAD model and its helper functions.
# A separate name (`vad_model`) keeps the Whisper `model` loaded above from being overwritten.
vad_model, vad_utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                      model='silero_vad',
                                      force_reload=True)

# Unpack the helper functions
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = vad_utils


def aplicar_vad(input_path="audioS.mp3", output_path="audio.mp3"):
    """Keep only the speech portions of input_path and write them to output_path."""
    # Read the audio at the sampling rate used for voice detection
    audio = read_audio(input_path, sampling_rate=16000)

    # Get the segments that contain speech
    speech_timestamps = get_speech_timestamps(audio, vad_model, sampling_rate=16000)

    if len(speech_timestamps) == 0:
        print("❌ No se detectó voz en el audio.")
        return None

    # Concatenate the speech segments
    clean_audio = collect_chunks(speech_timestamps, audio)

    # Save the result
    import soundfile as sf
    sf.write(output_path, clean_audio, 16000)
    print(f"✅ Audio sin silencios guardado en: {output_path}")
    return output_path

Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
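
Before replacing the audio with its speech-only version, it can help to see how much would be cut. This is a minimal sketch, assuming the timestamps are dictionaries with `start` and `end` given in samples, which is the default for `get_speech_timestamps`. The sample values and the helper name `speech_seconds` are made up for illustration.

```python
SAMPLING_RATE = 16000  # same rate the VAD cell uses


def speech_seconds(speech_timestamps, sampling_rate=SAMPLING_RATE):
    """Total duration, in seconds, of the detected speech segments."""
    total_samples = sum(ts["end"] - ts["start"] for ts in speech_timestamps)
    return total_samples / sampling_rate


# Hypothetical detection result for a 6-second clip (96000 samples)
sample_timestamps = [
    {"start": 16000, "end": 48000},  # 1.0 s to 3.0 s
    {"start": 64000, "end": 80000},  # 4.0 s to 5.0 s
]
clip_seconds = 96000 / SAMPLING_RATE

kept = speech_seconds(sample_timestamps)
print(f"speech kept: {kept:.1f} s, removed: {clip_seconds - kept:.1f} s")
```

Note that concatenating only the speech chunks shifts every later timestamp, so subtitles or alignments computed on the cleaned audio no longer line up with the original recording.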


# Aligning the text with the audio

In [None]:
def alinear_con_whisperx(audio_path, transcription_text, idioma="es", device="cuda" if torch.cuda.is_available() else "cpu"):
    """
    Align Whisper segments with the audio at word level using WhisperX.

    transcription_text must be a list of segment dictionaries with text and
    start/end times (for example, result["segments"]), not a plain string.
    Returns the word-level segments and saves them next to the audio file
    with the .aligned.json suffix.
    """
    import whisperx
    import torch
    import re
    from pathlib import Path
    import json

    def limpiar_texto_espanol(texto):
        # Drop emojis and any character outside this whitelist
        texto = re.sub(r"[^\w\sáéíóúüñÁÉÍÓÚÜÑ.,;:¡!¿?\"'()\-\[\]{}<>…–—°%€$@#&+=*/\\]", "", texto)
        texto = texto.replace(":", "")   # Remove colons
        texto = texto.replace("'", "")   # Remove single quotes
        texto = texto.replace('"', '')   # Remove double quotes
        texto = texto.replace("\\", "")  # Remove backslashes
        return texto

    # Plain text has no timestamps, so it cannot be aligned
    if isinstance(transcription_text, str):
        raise ValueError("Se requiere una transcripción segmentada con tiempos (start, end) para alinear.")

    # Clean emojis and unusual symbols out of each segment's text
    for s in transcription_text:
        s["text"] = limpiar_texto_espanol(s["text"])

    # Drop empty segments
    transcription_text = [s for s in transcription_text if s.get("text", "").strip()]

    # Load the alignment model
    model_a, metadata = whisperx.load_align_model(language_code=idioma, device=device)

    # Align
    alignment_result = whisperx.align(
        transcript=transcription_text,
        model=model_a,
        align_model_metadata=metadata,
        audio=audio_path,
        device=device,
        return_char_alignments=False
    )
    # Get the word-level aligned segments
    alineado = alignment_result["word_segments"]

    # Save the result to a JSON file
    json_path = Path(audio_path).with_suffix(".aligned.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(alineado, f, ensure_ascii=False, indent=4)


    print(f"Alineación guardada en: {json_path}")

    return alignment_result["word_segments"]
# Usage example:
# Assuming the Whisper transcription is already in the variable `result`,
# pass the timestamped segments (plain text raises a ValueError):
# transcripcion = result["segments"]
# word_segments = alinear_con_whisperx("audio.mp3", transcripcion, idioma="es")



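
The later cells read `start` and `end` from every entry of the word-level list. Here is a minimal sketch of a guard, assuming (as the WhisperX README describes for words it cannot align, such as numerals) that some entries may arrive without timings. The sample entries and the helper name `split_by_timing` are hypothetical.

```python
def split_by_timing(word_segments):
    """Separate words that have both 'start' and 'end' from those that do not."""
    timed, untimed = [], []
    for word in word_segments:
        if "start" in word and "end" in word:
            timed.append(word)
        else:
            untimed.append(word)
    return timed, untimed


# Hypothetical entries shaped like the list saved to audio.aligned.json
sample_word_segments = [
    {"word": "Mucha", "start": 0.10, "end": 0.42, "score": 0.91},
    {"word": "money,", "start": 0.48, "end": 0.95, "score": 0.88},
    {"word": "2024"},  # assumed case: a word that could not be aligned
]

timed, untimed = split_by_timing(sample_word_segments)
print(len(timed), "timed words;", [w["word"] for w in untimed], "left without timing")
```

Keeping the untimed words in a separate list, rather than discarding them, leaves room to assign them a time later, for example by borrowing from their neighbours.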

# 💬 TRANSCRIPTION

This block actually calls `transcribe_file` 😉


In [None]:
import os
import shutil

# Files and folders to delete
elementos = [
    "audio.mp3", "audioL.mp3", "audio.txt", "audio.srt",
    "salida.mp4", "audio.aligned.json", "inputs.txt",
     "imagenes_por_frase","t", "tt",  # etc.
]

for elemento in elementos:
    if os.path.exists(elemento):
        if os.path.isdir(elemento):
            shutil.rmtree(elemento)
            print(f"🧹 Carpeta eliminada: {elemento}")
        else:
            os.remove(elemento)
            print(f"✅ Archivo eliminado: {elemento}")
    else:
        print(f"❌ No existe: {elemento}")


❌ No existe: audio.mp3
❌ No existe: audioL.mp3
❌ No existe: audio.txt
❌ No existe: audio.srt
❌ No existe: salida.mp4
❌ No existe: audio.aligned.json
❌ No existe: inputs.txt
❌ No existe: imagenes_por_frase
❌ No existe: t
❌ No existe: tt


In [None]:
if input_format == "youtube":
    # Download the audio stream of the YouTube video
    url = file  # In this case, `file` is the URL of the video
    audio = download_youtube_audio_yt_dlp(url)

    # Keep only the excerpt from 1:40 to 2:13
    recortar_audio(audio, "audio.mp3", "00:01:40", "00:02:13")
    # aplicar_vad("audioS.mp3", "audio.mp3")  # optional: strip non-speech audio

    # Run Whisper on the audio stream
    result = transcribe_file(model, "audio.mp3", plain, srt, vtt, tsv, download)
elif input_format == "gdrive":
    # Authorize a connection between Google Drive and Google Colab
    from google.colab import drive
    drive.mount('/content/drive')

    # Run Whisper on the specified file
    result = transcribe_file(model, file, plain, srt, vtt, tsv, download)
elif input_format == "local":
    # Run Whisper on the specified file
    result = transcribe_file(model, file, plain, srt, vtt, tsv, download)

Transcribing file: audio.mp3



100%|██████████| 3298/3298 [00:06<00:00, 541.62frames/s]


Creating text file

Creating SRT file





# ALIGN IT!

In [None]:
transcripcion = result["segments"]  # <-- A list of segment dictionaries with start/end times ✅

print(result["text"])
# Pass the segment list (not the plain text) so WhisperX has timestamps to align against.
# The function also saves the word-level alignment as a JSON file.
alineado = alinear_con_whisperx("audio.mp3", transcripcion, idioma="es")




 Mucha money, pero los millones no aguantan el llanto De cuando me siento solo en mi cuarto De cuando el balcón no se siente tan alto To' quieren la fama, to' quieren el cuarto Y yo ¿Qué cambiaría la gloria y mi riqueza? Solo por saber cómo es que tú vas a Por sacarte un día de mi casa Ay, ay


DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_es.pt" to /root/.cache/torch/hub/checkpoints/wav2vec2_voxpopuli_base_10k_asr_es.pt
100%|██████████| 360M/360M [00:04<00:00, 85.6MB/s]


Alineación guardada en: audio.aligned.json


# PRO IMAGE GENERATION FOR AUTOMATED VIDEOS AND RENDERING 🎬

In [None]:
import json
import os
import re
import random


def parse_srt_time(timestr):
    """
    Convert an SRT timestamp (HH:MM:SS,mmm) to seconds (float).
    """
    hours, minutes, rest = timestr.split(':')
    seconds, ms = rest.split(',')
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(ms) / 1000


def load_srt(srt_path):
    """
    Parse an .srt file and return a list of phrases with start, end, and text.
    """
    phrases = []
    with open(srt_path, 'r', encoding='utf-8') as f:
        content = f.read().strip()

    blocks = re.split(r'\n\n+', content)
    for block in blocks:
        lines = block.splitlines()
        if len(lines) >= 3:
            start_str, end_str = lines[1].split(' --> ')
            start = parse_srt_time(start_str.strip())
            end = parse_srt_time(end_str.strip())
            text = ' '.join(lines[2:]).strip()
            phrases.append({'start': start, 'end': end, 'text': text})
    return phrases


def load_alignment_json(aligned_json_path):
    """
    Load the aligned JSON and return the list of words with their timings.
    Supports:
      - Full WhisperX format: {'segments': [...], ...}
      - Simplified format: [{ 'word'|'text', 'start', 'end' }, ...]
    """
    with open(aligned_json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Full WhisperX format
    if isinstance(data, dict) and 'segments' in data:
        words = []
        for seg in data['segments']:
            words.extend(seg.get('words', []))
    # Flat list of words (an empty list is accepted too)
    elif isinstance(data, list) and all(isinstance(w, dict) for w in data):
        words = data
    else:
        raise ValueError("Formato de JSON alineado no válido")
    return words


def assign_filter(text, index):
    """Assign a filter type: 1 about 80% of the time, otherwise 2 or 3."""
    if random.random() < 0.8:
        return 1
    else:
        return random.randint(2, 3)

def assign_font(text):
    """Return the font file for a word.

    Uppercase and regular text currently both use PoetsenOne-Regular.ttf.
    """
    if text.isupper():
        return "PoetsenOne-Regular.ttf"
    return "PoetsenOne-Regular.ttf"


def is_closing_word(word, phrases, phrase_id):
    """Return True if the word ends within 0.1 s of the end of its phrase."""
    phrase = phrases[phrase_id] if phrase_id is not None else None
    if not phrase:
        return False
    return abs(word['end'] - phrase['end']) < 0.1


def get_phrase_id(word, phrases):
    """Return the index of the phrase whose time span contains the word, or None."""
    for i, p in enumerate(phrases):
        if word['start'] >= p['start'] and word['end'] <= p['end']:
            return i
    return None


def build_visual_metadata(srt_file, aligned_json_file, output_json):
    phrases = load_srt(srt_file)
    words = load_alignment_json(aligned_json_file)
    result = []

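    for i, w in enumerate(words):
        # Skip words that have no timing (for example, words WhisperX could not align)
        if 'start' not in w or 'end' not in w:
            continue
        text = w.get('word', w.get('text', '')).strip()
        start = w['start']
        end = w['end']
        duration = end - start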
        phrase_id = get_phrase_id(w, phrases)
        filter_type = assign_filter(text, i)
        font = assign_font(text)
        closing = is_closing_word(w, phrases, phrase_id)

        result.append({
            "text": text,
            "start": start,
            "end": end,
            "duration": duration,
            "font": font,
            "filter_type": filter_type,
            "is_closing_word": closing,
            "phrase_id": phrase_id
        })

    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print(f"Visual metadata guardada en '{output_json}'")

# Notebook/Colab usage:
aligned_json_file = "audio.aligned.json"
output_json = "visual_metadata.json"
if os.path.exists(output_json):
  os.remove(output_json)

srt_file = "audio.srt"
build_visual_metadata(srt_file, aligned_json_file, output_json)


Visual metadata guardada en 'visual_metadata.json'
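
The phrase lookup in the cell above only matches a word when it lies entirely inside one subtitle's time span. The sketch below replays that rule on an in-memory SRT snippet. The timings are made up, and `parse_srt_text` and `phrase_index` are simplified stand-ins for `load_srt` and `get_phrase_id` that read from a string instead of a file.

```python
import re

# Hypothetical two-phrase subtitle file
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,500
Mucha money, pero los millones

2
00:00:02,500 --> 00:00:05,000
no aguantan el llanto
"""


def parse_srt_time(timestr):
    """Convert an SRT timestamp (HH:MM:SS,mmm) to seconds."""
    hours, minutes, rest = timestr.split(":")
    seconds, ms = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(ms) / 1000


def parse_srt_text(content):
    """Parse SRT text into a list of {'start', 'end', 'text'} phrases."""
    phrases = []
    for block in re.split(r"\n\n+", content.strip()):
        lines = block.splitlines()
        if len(lines) >= 3:
            start_str, end_str = lines[1].split(" --> ")
            phrases.append({
                "start": parse_srt_time(start_str.strip()),
                "end": parse_srt_time(end_str.strip()),
                "text": " ".join(lines[2:]).strip(),
            })
    return phrases


def phrase_index(word, phrases):
    """Index of the phrase that fully contains the word, or None."""
    for i, p in enumerate(phrases):
        if word["start"] >= p["start"] and word["end"] <= p["end"]:
            return i
    return None


phrases = parse_srt_text(SAMPLE_SRT)
inside = {"word": "money,", "start": 0.5, "end": 0.9}
straddling = {"word": "millones", "start": 2.3, "end": 2.7}

print(phrase_index(inside, phrases))      # 0
print(phrase_index(straddling, phrases))  # None
```

A word that straddles a subtitle boundary gets `None`, which is why the rendering cell further down fills missing `phrase_id` values with the last known one.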


In [None]:
import json
import os
import string
from PIL import Image, ImageDraw, ImageFont

# ---------------- CONFIGURATION ----------------
DATA_FILE        = "visual_metadata.json"  # JSON with the visual metadata generated above
OUTPUT_ROOT      = "imagenes_por_frase"    # Root output folder
ANCHO, ALTO      = 1080, 1920               # Width and height of each image
FUENTE_DIR       = ""               # Folder where the .ttf fonts are stored
BASE_FONT_SIZE   = 120                      # Base font size
COLOR_FONDO       = (0, 0, 0)               # Background color
COLOR_TEXTO       = (255, 255, 255)         # Text color
MAX_WORDS_PER_LINE = 2
# -----------------------------------------------
def draw_text_multiline(draw, text, font, max_width, start_y):
    """
    Draw the text centered on the canvas, with at most MAX_WORDS_PER_LINE words per line.
    The max_width argument is accepted but is not used to wrap lines.
    """
    words = text.split()
    lines = []
    current_line = []

    for word in words:
        current_line.append(word)
        if len(current_line) >= MAX_WORDS_PER_LINE:
            lines.append(current_line)
            current_line = []
    if current_line:
        lines.append(current_line)

    # Draw the lines
    line_height = font.getbbox("Ay")[3] - font.getbbox("Ay")[1] + 20
    total_height = len(lines) * line_height
    y = start_y - total_height // 2

    for line in lines:
        line_text = " ".join(line)
        line_width = draw.textlength(line_text, font=font)
        x = (ANCHO - line_width) // 2
        draw.text((x, y), line_text, font=font, fill=COLOR_TEXTO)
        y += line_height

def ensure_dir(path):
    os.makedirs(path, exist_ok=True)


def main():
    # Load the full metadata
    with open(DATA_FILE, 'r', encoding='utf-8') as f:
        metadata = json.load(f)

    # Sort by start time and fill in missing phrase_id values with the last known one
    metadata = sorted(metadata, key=lambda x: x.get('start', 0))
    last_pid = None
    for item in metadata:
        if item.get('phrase_id') is not None:
            last_pid = item['phrase_id']
        else:
            item['phrase_id'] = last_pid

    # Group by the integer part of phrase_id
    groups = {}
    for item in metadata:
        pid = float(item['phrase_id']) if item.get('phrase_id') is not None else 0.0
        key = int(pid)
        groups.setdefault(key, []).append(item)

    # Process each phrase group
    for group_id, items in sorted(groups.items()):
        folder = os.path.join(OUTPUT_ROOT, f"frase_{group_id:04d}")
        ensure_dir(folder)

        # Sort by decimal phrase_id, then by start time
        items_sorted = sorted(
            items,
            key=lambda x: (float(x['phrase_id']), x.get('start', 0))
        )

        # Try to load the base font
        font_name = items_sorted[0].get('font', '')
        base_path = os.path.join(FUENTE_DIR, font_name)
        try:
            ImageFont.truetype(base_path, BASE_FONT_SIZE)
        except Exception as e:
            print(f"⚠️ No se pudo cargar fuente '{base_path}': {e}. Usando defecto.")
            base_path = None

        text_accum = ""
        prev_end = None
        metadata_list = []
        counter = 0

        for w in items_sorted:
            palabra = w['text']
            style = int(w.get('filter_type', 1))
            start, end = w.get('start'), w.get('end')
            prev_accum = text_accum

            # Build the texts and font sizes according to the style
            texts, sizes = [], []
            if style == 1:
                text_accum += palabra + " "
                texts = [text_accum.strip()]
                sizes = [BASE_FONT_SIZE*1.1]
            elif style == 2:
                # Emphasis style: add the word to the accumulated text and
                # render the whole frame in a larger font, so the word is not dropped
                text_accum += palabra + " "
                texts = [text_accum.strip()]
                sizes = [BASE_FONT_SIZE * 1.4]

            elif style == 3:
                text_accum += palabra + " "
                letters = len(palabra)
                for i in range(1, letters + 1):
                    texts.append(prev_accum + palabra[:i])
                    sizes.append(BASE_FONT_SIZE)
            else:
                text_accum += palabra + " "
                texts = [text_accum.strip()]
                sizes = [BASE_FONT_SIZE]

            # Render the frames
            for idx, (txt, size) in enumerate(zip(texts, sizes)):
                try:
                    img = Image.new('RGB', (ANCHO, ALTO), COLOR_FONDO)
                    draw = ImageDraw.Draw(img)
                    font = ImageFont.truetype(base_path, size) if base_path else ImageFont.load_default()
                    draw_text_multiline(draw, txt, font, ANCHO - 100, ALTO // 2)

                    # File name
                    name = f"palabra_{counter:03d}"
                    if len(texts) > 1:
                        name += f"_{idx:02d}"
                    name += ".png"
                    img.save(os.path.join(folder, name))

                    # Record metadata only for the last frame of the word
                    if idx == len(texts) - 1:
                        if start is not None and end is not None:
                            dur = (end - prev_end) if prev_end is not None else (end - start)
                            prev_end = end
                        else:
                            dur = 0.0
                        metadata_list.append({
                            "archivo": name,
                            "start": round(start or 0.0, 3),
                            "end": round(end or 0.0, 3),
                            "duration": round(dur, 3)
                        })
                except Exception as e:
                    print(f"❌ ERROR al renderizar '{palabra}', estilo {style}, frame {idx}: {e}")
                    continue


            counter += 1

        # Save metadata.json
        with open(os.path.join(folder, 'metadata.json'), 'w', encoding='utf-8') as mf:
            json.dump(metadata_list, mf, ensure_ascii=False, indent=2)

    print(f"✅ Finalizado: {len(groups)} frases generadas en '{OUTPUT_ROOT}'")

if __name__ == '__main__':
    main()


✅ Finalizado: 9 frases generadas en 'imagenes_por_frase'
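
Style 3 in the cell above reveals a word letter by letter on top of the text already shown. Here is a minimal sketch of just that text-building step, with the hypothetical helper name `typewriter_texts`. It does no rendering and mirrors the `prev_accum + palabra[:i]` expression from the loop.

```python
def typewriter_texts(previous_text: str, word: str) -> list[str]:
    """Return one frame text per letter of `word`, appended to `previous_text`."""
    return [previous_text + word[:i] for i in range(1, len(word) + 1)]


# The accumulated text keeps its trailing space, as in the rendering loop
frames = typewriter_texts("Mucha ", "money")
for frame in frames:
    print(frame)
```

In the rendering cell only the last of these frames is written to `metadata.json`, so the intermediate letter frames are saved as PNG files but never receive a duration of their own.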


In [None]:
import os
import json

# ————— CONFIGURATION —————
CARPETA_SALIDA = "imagenes_por_frase"
OUTPUT_FILE   = "inputs.txt"
# ————————————————

# 1) Read every phrase folder
entries = []
for frase in sorted(os.listdir(CARPETA_SALIDA)):
    path_frase = os.path.join(CARPETA_SALIDA, frase)
    metadata_path = os.path.join(path_frase, "metadata.json")
    if not os.path.isdir(path_frase) or not os.path.isfile(metadata_path):
        continue

    # 2) Load the metadata.json of that phrase
    with open(metadata_path, "r", encoding="utf-8") as mf:
        metadata = json.load(mf)

    # 3) Add the path of each image (relative to the working directory)
    for item in metadata:
        # Path to the PNG
        item["path"] = os.path.join(path_frase, item["archivo"])
        entries.append(item)

# 4) Sort globally by start time
entries = [e for e in entries if e.get("start") is not None]
entries.sort(key=lambda e: e["start"])

# 5) Write inputs.txt using the duration of EACH image
with open(OUTPUT_FILE, "w", encoding="utf-8") as fout:
    for e in entries:
        ruta = e["path"]
        dur  = e.get("duration", 0.1)  # fall back to 0.1 s if duration is missing
        fout.write(f"file '{ruta}'\n")
        fout.write(f"duration {dur}\n")

    # FFmpeg needs the last image repeated without a duration
    if entries:
        fout.write(f"file '{entries[-1]['path']}'\n")

print(f"✅ '{OUTPUT_FILE}' regenerado con duraciones individuales por imagen.")



✅ 'inputs.txt' regenerado con duraciones individuales por imagen.
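
For reference, the file written above alternates `file` and `duration` lines and ends by repeating the last image. The sketch below builds the same lines in memory so the layout is easy to inspect. The entries are made up, and `build_concat_lines` is a hypothetical helper, not part of the notebook.

```python
def build_concat_lines(entries):
    """Return the lines of an inputs.txt-style list for the given entries."""
    lines = []
    for entry in entries:
        lines.append(f"file '{entry['path']}'")
        lines.append(f"duration {entry.get('duration', 0.1)}")
    if entries:
        # The last image is listed once more, without a duration
        lines.append(f"file '{entries[-1]['path']}'")
    return lines


# Hypothetical entries, already sorted by start time
sample_entries = [
    {"path": "imagenes_por_frase/frase_0000/palabra_000.png", "start": 0.10, "duration": 0.32},
    {"path": "imagenes_por_frase/frase_0000/palabra_001.png", "start": 0.48, "duration": 0.47},
]

print("\n".join(build_concat_lines(sample_entries)))
```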


In [None]:
import os
import sys
import subprocess
import shutil
# This version generates the video correctly
def build_video_constant_fps(
    inputs_txt="inputs.txt",
    audio_file="audio.mp3",
    output_file="salida.mp4",
    fps=30,
    temp_dir="__frames_temp__"
):
    # 1. Check that the inputs exist
    if not os.path.isfile(inputs_txt):
        print(f"❌ No existe {inputs_txt}"); sys.exit(1)
    if not os.path.isfile(audio_file):
        print(f"❌ No existe {audio_file}"); sys.exit(1)

    # 2. Read inputs.txt into a list of (path, duration)
    seq = []
    with open(inputs_txt, "r", encoding="utf-8") as f:
        lines = [l.strip() for l in f if l.strip()]
    i = 0
    while i < len(lines):
        if lines[i].startswith("file"):
            path = lines[i].split("'",2)[1]
            dur = None
            if i+1<len(lines) and lines[i+1].startswith("duration"):
                dur = float(lines[i+1].split()[1])
                i += 2
            else:
                i += 1
            seq.append((path, dur))
        else:
            i += 1

    # 3. Prepare a clean temporary folder
    if os.path.isdir(temp_dir): shutil.rmtree(temp_dir)
    os.makedirs(temp_dir)

    # 4. Expand each image into n_frames = round(dur * fps) copies
    frame_idx = 0
    for img_path, dur in seq:
        if dur is None or dur <= 0:
            continue
        n_frames = max(1, int(round(dur * fps)))
        for _ in range(n_frames):
            dst = os.path.join(temp_dir, f"frame_{frame_idx:06d}.png")
            # copy to avoid possible issues with hard links
            shutil.copy(img_path, dst)
            frame_idx += 1

    if frame_idx == 0:
        print("❌ Ningún frame generado."); sys.exit(1)
    print(f"ℹ️  Generados {frame_idx} frames en {temp_dir}/")

    # 5. FFmpeg command with explicit mapping and start_number
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-start_number", "0",
        "-i", os.path.join(temp_dir, "frame_%06d.png"),
        "-i", audio_file,
        "-map", "0:v:0",        # video from the first input
        "-map", "1:a:0",        # audio from the second input
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        "-movflags", "+faststart",
        output_file
    ]

    print("🔧 Ejecutando FFmpeg:")
    print(" ".join(cmd))
    try:
        proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"✅ Video generado: {output_file}")
    except subprocess.CalledProcessError as e:
        print("❌ FFmpeg falló (retcode={}):".format(e.returncode))
        print(e.stderr)
        sys.exit(1)

    # 6. Clean up the temporary folder
    shutil.rmtree(temp_dir)
    print(f"🧹 Carpeta temporal {temp_dir} eliminada.")

if __name__ == "__main__":
    build_video_constant_fps()


ℹ️  Generados 940 frames en __frames_temp__/
🔧 Ejecutando FFmpeg:
ffmpeg -y -framerate 30 -start_number 0 -i __frames_temp__/frame_%06d.png -i audio.mp3 -map 0:v:0 -map 1:a:0 -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest -movflags +faststart salida.mp4
✅ Video generado: salida.mp4
🧹 Carpeta temporal __frames_temp__ eliminada.
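
The cell above rounds each image's duration to a whole number of frames on its own, so small rounding errors can add up over many short words and let the text drift against the audio. A minimal sketch of one way to limit that, assuming a hypothetical helper `frames_per_image` that rounds the cumulative timeline instead of each duration:

```python
def frames_per_image(durations, fps=30):
    """Allocate frames so the running total tracks the running time."""
    counts = []
    elapsed = 0.0        # seconds covered so far
    frames_so_far = 0    # frames already allocated
    for dur in durations:
        elapsed += dur
        target = round(elapsed * fps)   # frame index where this image should end
        counts.append(target - frames_so_far)
        frames_so_far = target
    return counts


print(frames_per_image([1.0, 0.5], fps=30))
print(sum(frames_per_image([0.05] * 10, fps=30)))
```

The trade-off is that an image shorter than one frame can receive zero frames, whereas the notebook's `max(1, ...)` always shows every image at least once.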


# New section

# IMAGE GENERATION FOR AUTOMATIC VIDEOS AND RENDERING 🎬

In [None]:
import os
import json
import string
from datetime import timedelta
from PIL import Image, ImageDraw, ImageFont
# This is the first version of the image generation; it serves as an example for the later versions
# —————— CONFIGURATION ——————
ANCHO, ALTO       = 1080, 1920
CARPETA_SALIDA    = "imagenes_por_frase"
SRT_FILE          = "audio.srt"             # your Whisper SRT file
JSON_ALIGNED      = "audio.aligned.json"    # aligned JSON from WhisperX
FUENTE_PATH       = "/content/PoetsenOne-Regular.ttf"
TAMANO_FUENTE     = 120
COLOR_FONDO       =  (0, 0, 0)
COLOR_TEXTO       =  (255,   255,   255)
# ————————————————————————
def generar_offset_inicial(aligned_words, carpeta_salida, ancho, alto, color_fondo):
    """
    Generate an initial blank image if the audio does not start at 0 s.
    If the first word has no start timestamp, 0.0 is used instead.
    """
    primer_start = None
    if aligned_words:
        # a first word without a timestamp is assumed to start at zero
        primer_start = aligned_words[0].get("start")
        if primer_start is None:
            primer_start = 0.0

    if primer_start is not None and primer_start > 0.1:  # ignore very short silences
        # Separate folder: "frase_0000" is also used by the first phrase,
        # whose metadata.json would overwrite this one
        carpeta_offset = os.path.join(carpeta_salida, "frase_offset")
        os.makedirs(carpeta_offset, exist_ok=True)

        # Create the blank image
        img = Image.new("RGB", (ancho, alto), color_fondo)
        nombre_img = "offset_000.png"
        ruta_img   = os.path.join(carpeta_offset, nombre_img)
        img.save(ruta_img)

        # Save the metadata
        metadata = [{
            "archivo":  nombre_img,
            "start":    0.0,
            "end":      primer_start,
            "duration": primer_start
        }]
        with open(os.path.join(carpeta_offset, "metadata.json"), "w", encoding="utf-8") as mf:
            json.dump(metadata, mf, ensure_ascii=False, indent=2)

        print(f"🕒 Offset generado correctamente hasta {primer_start:.2f}s")
    else:
        print("ℹ️ No se generó offset: empieza muy cerca de 0s.")


# helper: converts "HH:MM:SS,mmm" → seconds (float)
def parse_timestamp(ts: str) -> float:
    h, m, rest = ts.split(":", 2)
    s, ms      = rest.split(",", 1)
    return int(h)*3600 + int(m)*60 + int(s) + int(ms)/1000.0


# function to draw word-wrapped, centered text
def draw_text_multiline(draw, text, font, max_width_px, start_y):
    words = text.strip().split()
    lines = []
    current_line = ""
    for word in words:
        test_line = f"{current_line} {word}".strip()
        if draw.textlength(test_line, font=font) <= max_width_px:
            current_line = test_line
        else:
            if current_line:  # avoid an empty line when a single word is too wide
                lines.append(current_line)
            current_line = word
    if current_line:
        lines.append(current_line)

    # center vertically around start_y
    line_height = font.getbbox("A")[3] - font.getbbox("A")[1] + 15
    total_height = line_height * len(lines)
    y = start_y - total_height // 2

    for line in lines:
        line_width = draw.textlength(line, font=font)
        x = (ANCHO - line_width) // 2
        draw.text((x, y), line, font=font, fill=COLOR_TEXTO)
        y += line_height

# 1) Parse the SRT to get the final word of each phrase
final_words = []
with open(SRT_FILE, "r", encoding="utf-8") as f:
    raw_blocks = f.read().strip().split("\n\n")

for block in raw_blocks:
    lines = block.splitlines()
    if len(lines) < 3:
        continue
    text = " ".join(lines[2:]).strip()
    if not text:
        continue
    last = text.split()[-1]
    # strip punctuation and normalize
    last_clean = last.strip(string.punctuation).lower()
    final_words.append(last_clean)

# 2) Load the aligned words from the JSON, whether it is a flat list or a list of segments
with open(JSON_ALIGNED, "r", encoding="utf-8") as f:
    data = json.load(f)

aligned_words = []

if isinstance(data, dict) and "segments" in data:
    # Typical WhisperX layout: segments with the words inside
    for seg in data["segments"]:
        if isinstance(seg, dict) and "words" in seg:
            aligned_words.extend(seg["words"])

elif isinstance(data, list) and all(isinstance(item, dict) and "word" in item for item in data):
    # Flat list of words
    aligned_words = data

else:
    raise ValueError("Formato de audio.aligned.json no reconocido.")

# Generate the initial offset only after the words are loaded;
# calling it on the still-empty list could never produce an offset
generar_offset_inicial(aligned_words, CARPETA_SALIDA, ANCHO, ALTO, COLOR_FONDO)


# 3) Prepare the output folder and the font
os.makedirs(CARPETA_SALIDA, exist_ok=True)
fuente = ImageFont.truetype(FUENTE_PATH, TAMANO_FUENTE)

# 4) Walk through aligned_words and build the images phrase by phrase
phrase_idx      = 0
phrase_folder   = None
metadata        = []
text_accum      = ""
previous_end    = None
word_counter    = 0
outputs_created = 0

for w in aligned_words:
    if phrase_idx >= len(final_words):
        break  # no more phrases defined in the SRT

    word_text       = w["word"]
    start, end      = w.get("start"), w.get("end")
    key_last        = final_words[phrase_idx]
    # normalize the current word
    cleaned = word_text.strip(string.punctuation).lower()

    # Create the phrase folder on the first word of the phrase
    if phrase_folder is None:
        phrase_folder = os.path.join(CARPETA_SALIDA, f"frase_{phrase_idx:04d}")
        os.makedirs(phrase_folder, exist_ok=True)

    # Accumulate the text and create the image
    text_accum += word_text + " "
    img = Image.new("RGB", (ANCHO, ALTO), COLOR_FONDO)
    draw = ImageDraw.Draw(img)
    draw_text_multiline(draw, text_accum.strip(), fuente, ANCHO - 100, ALTO // 2)

    nombre_img = f"palabra_{word_counter:03d}.png"
    ruta_img   = os.path.join(phrase_folder, nombre_img)
    img.save(ruta_img)

    # Compute the duration
    if start is not None and end is not None:
        if previous_end is None:
            dur = end - start
        else:
            dur = end - previous_end
        previous_end = end
    else:
        dur = None

    # Compare against None explicitly so a timestamp of 0.0 is kept
    metadata.append({
        "archivo":  nombre_img,
        "start":    round(start, 3) if start is not None else None,
        "end":      round(end, 3)   if end   is not None else None,
        "duration": round(dur, 3)   if dur   is not None else None
    })

    word_counter += 1

    # If this word is the last one of the phrase, close the phrase and reset
    if cleaned == key_last:
        for item in metadata:
            if item['start'] is None:
                item['start'] = 0.0  # replace None with 0.0
            if item['end'] is None:
                item['end'] = 0.0    # replace None with 0.0

        # save metadata.json
        with open(os.path.join(phrase_folder, "metadata.json"), "w", encoding="utf-8") as mf:
            json.dump(metadata, mf, ensure_ascii=False, indent=2)
        # reset for the next phrase
        phrase_idx   += 1
        phrase_folder = None
        metadata      = []
        text_accum    = ""
        previous_end  = None
        word_counter  = 0
        outputs_created += 1

print(f"✅ Generadas {outputs_created} frases con imágenes y metadata.")


ℹ️ No se generó offset: empieza muy cerca de 0s.


OSError: cannot open resource
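
Pillow typically raises `OSError: cannot open resource` from `ImageFont.truetype` when the font file cannot be opened, which here suggests that `FUENTE_PATH` does not exist in the runtime. A minimal sketch of a pre-check, assuming a hypothetical helper `first_existing_font` and a candidate list built from the two font paths that appear in this notebook:

```python
import os


def first_existing_font(candidates):
    """Return the first path in `candidates` that exists on disk, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Candidate paths taken from this notebook; adjust them to your runtime
CANDIDATE_FONTS = [
    "/content/PoetsenOne-Regular.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
]

font_path = first_existing_font(CANDIDATE_FONTS)
print(font_path or "No candidate font found; upload a .ttf or adjust FUENTE_PATH")
```

The returned path can then be passed to `ImageFont.truetype` in place of a hard-coded `FUENTE_PATH`.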

In [None]:
import os
import json
from PIL import Image, ImageDraw, ImageFont
# This code generates one image per word
# ——————— CONFIGURATION ———————
ANCHO, ALTO       = 1080, 1920
CARPETA_SALIDA    = "imagenes_por_frase"
ARCHIVO_JSON      = "audio.aligned.json"
FUENTE_PATH       = "/content/fuentes/Roboto-Bold.ttf"  # ← already exists and is valid
TAMANO_FUENTE     = 120
COLOR_FONDO       = (0, 0, 0)
COLOR_TEXTO       = (255, 255, 255)
# ————————————————

# 1) Load the JSON
with open(ARCHIVO_JSON, "r", encoding="utf-8") as f:
    data = json.load(f)

# 2) Determine the list of segments
if isinstance(data, dict) and "segments" in data:
    segments = data["segments"]
elif isinstance(data, list):
    segments = data
else:
    raise ValueError("JSON no tiene formato esperado (ni dict['segments'] ni list)")

# 3) Prepare the output folder and the font
os.makedirs(CARPETA_SALIDA, exist_ok=True)
fuente = ImageFont.truetype(FUENTE_PATH, TAMANO_FUENTE)

# 4) Walk through each phrase/segment
for i, segment in enumerate(segments):
    # Extract the word list (may be a list of dicts or of strings)
    if isinstance(segment, dict) and "words" in segment:
        # WhisperX segment carrying its own word list
        palabras = segment["words"]
    elif isinstance(segment, dict) and "word" in segment:
        palabras = segment["word"]
        if isinstance(palabras, str):
            # flat layout: the item is itself a single word dict
            palabras = [segment]
    elif isinstance(segment, dict) and "text" in segment:
        # fallback: text only, no timestamps
        palabras = [{"word": w} for w in segment["text"].split()]
    else:
        # skip anything that matches none of these layouts
        continue

    # Folder for this phrase
    carpeta_frase = os.path.join(CARPETA_SALIDA, f"frase_{i:04d}")
    os.makedirs(carpeta_frase, exist_ok=True)

    texto_acumulado = ""
    previous_end   = None
    metadata       = []

    # 5) For each word, generate the image and its metadata entry
    for j, palabra in enumerate(palabras):
        # Normalize the text and timestamps
        if isinstance(palabra, dict):
            w      = palabra.get("word", "")
            start  = palabra.get("start", None)
            end    = palabra.get("end", None)
        else:
            w      = str(palabra)
            start  = end = None

        texto_acumulado += w + " "

        # — Create the image
        img  = Image.new("RGB", (ANCHO, ALTO), COLOR_FONDO)
        draw = ImageDraw.Draw(img)
        bbox = draw.textbbox((0,0), texto_acumulado.strip(), font=fuente)
        tw   = bbox[2] - bbox[0]
        th   = bbox[3] - bbox[1]
        x    = (ANCHO - tw) // 2
        y    = (ALTO  - th) // 2
        draw.text((x, y), texto_acumulado.strip(), font=fuente, fill=COLOR_TEXTO)

        # — Save the image
        nombre_img = f"palabra_{j:03d}.png"
        ruta_img   = os.path.join(carpeta_frase, nombre_img)
        img.save(ruta_img)

        # — Compute the actual duration (if timestamps are available)
        if end is not None:
            if previous_end is None:
                dur = (end - start) if start is not None else None
                ini = start
            else:
                dur = end - previous_end
                ini = previous_end
            previous_end = end
        else:
            ini = dur = None

        metadata.append({
            "archivo": nombre_img,
            "start":   ini,
            "end":     end,
            "duration": dur
        })

    # 6) Write metadata.json
    with open(os.path.join(carpeta_frase, "metadata.json"), "w", encoding="utf-8") as mf:
        json.dump(metadata, mf, ensure_ascii=False, indent=2)

print("✅ Imágenes y metadata generadas con éxito.")


✅ Imágenes y metadata generadas con éxito.
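
Both image-generation cells branch on the layout of `audio.aligned.json`. A minimal sketch of a shared normalizer, assuming the two layouts handled above (a dict with `segments` that each carry a `words` list, or a flat list of word dicts); the name `flatten_words` is hypothetical:

```python
def flatten_words(data):
    """Return a flat list of word dicts from either supported JSON layout."""
    if isinstance(data, dict) and "segments" in data:
        words = []
        for seg in data["segments"]:
            # Segments without a "words" list contribute nothing
            words.extend(seg.get("words", []))
        return words
    if isinstance(data, list) and all(isinstance(w, dict) and "word" in w for w in data):
        return list(data)
    raise ValueError("Unrecognized alignment JSON layout")


segmented = {"segments": [{"words": [{"word": "High", "start": 0.0, "end": 1.9}]},
                          {"text": "no words here"}]}
flat = [{"word": "High", "start": 0.0, "end": 1.9}]
print(flatten_words(segmented) == flatten_words(flat))
```

Normalizing once keeps the per-word loop identical regardless of which tool produced the JSON.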


In [None]:
import os
import json

# ————— CONFIGURATION —————
CARPETA_SALIDA = "imagenes_por_frase"
OUTPUT_FILE   = "inputs.txt"
# ————————————————

# 1) Read every phrase folder
entries = []
for frase in sorted(os.listdir(CARPETA_SALIDA)):
    path_frase = os.path.join(CARPETA_SALIDA, frase)
    metadata_path = os.path.join(path_frase, "metadata.json")
    if not os.path.isdir(path_frase) or not os.path.isfile(metadata_path):
        continue

    # 2) Load this phrase's metadata.json
    with open(metadata_path, "r", encoding="utf-8") as mf:
        metadata = json.load(mf)

    # 3) Add the full path of each image
    for item in metadata:
        # Full path to the PNG
        item["path"] = os.path.join(path_frase, item["archivo"])
        entries.append(item)

# 4) Sort globally by start time
entries = [e for e in entries if e.get("start") is not None]
entries.sort(key=lambda e: e["start"])

# 5) Write inputs.txt using the duration of EACH image
with open(OUTPUT_FILE, "w", encoding="utf-8") as fout:
    for e in entries:
        ruta = e["path"]
        dur  = e.get("duration")
        if dur is None:
            dur = 0.1  # fall back to 0.1 s when duration is missing or null
        fout.write(f"file '{ruta}'\n")
        fout.write(f"duration {dur}\n")

    # FFmpeg needs the last image repeated without a duration
    if entries:
        fout.write(f"file '{entries[-1]['path']}'\n")

print(f"✅ '{OUTPUT_FILE}' regenerado con duraciones individuales por imagen.")



✅ 'inputs.txt' regenerado con duraciones individuales por imagen.
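
Before rendering, it can help to check that the durations in `inputs.txt` add up to roughly the length of the audio. A minimal sketch, assuming the `file` / `duration` layout written by the cell above; `total_duration` is a hypothetical helper:

```python
def total_duration(concat_text):
    """Sum every `duration <seconds>` line of a concat list."""
    total = 0.0
    for line in concat_text.splitlines():
        line = line.strip()
        if line.startswith("duration "):
            total += float(line.split()[1])
    return total


sample = "file 'a.png'\nduration 0.5\nfile 'b.png'\nduration 0.25\nfile 'b.png'\n"
print(total_duration(sample))
```

In the notebook this could be applied to the contents of `inputs.txt`; a total far from the audio length suggests missing timestamps or gaps between words.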


In [None]:
import os
import sys
import subprocess
import shutil
# this version already generates the video correctly
def build_video_constant_fps(
    inputs_txt="inputs.txt",
    audio_file="audio.mp3",
    output_file="salida.mp4",
    fps=30,
    temp_dir="__frames_temp__"
):
    # 1. Validate that the input files exist
    if not os.path.isfile(inputs_txt):
        print(f"❌ No existe {inputs_txt}"); sys.exit(1)
    if not os.path.isfile(audio_file):
        print(f"❌ No existe {audio_file}"); sys.exit(1)

    # 2. Read inputs.txt → list of (path, duration)
    seq = []
    with open(inputs_txt, "r", encoding="utf-8") as f:
        lines = [l.strip() for l in f if l.strip()]
    i = 0
    while i < len(lines):
        if lines[i].startswith("file"):
            path = lines[i].split("'",2)[1]
            dur = None
            if i+1<len(lines) and lines[i+1].startswith("duration"):
                dur = float(lines[i+1].split()[1])
                i += 2
            else:
                i += 1
            seq.append((path, dur))
        else:
            i += 1

    # 3. Prepare a clean temporary folder
    if os.path.isdir(temp_dir): shutil.rmtree(temp_dir)
    os.makedirs(temp_dir)

    # 4. Expand each image into n_frames = round(dur * fps) copies
    frame_idx = 0
    for img_path, dur in seq:
        if dur is None or dur <= 0:
            continue
        n_frames = max(1, int(round(dur * fps)))
        for _ in range(n_frames):
            dst = os.path.join(temp_dir, f"frame_{frame_idx:06d}.png")
            # copy to avoid possible issues with hard links
            shutil.copy(img_path, dst)
            frame_idx += 1

    if frame_idx == 0:
        print("❌ Ningún frame generado."); sys.exit(1)
    print(f"ℹ️  Generados {frame_idx} frames en {temp_dir}/")

    # 5. FFmpeg command with explicit mapping and start_number
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-start_number", "0",
        "-i", os.path.join(temp_dir, "frame_%06d.png"),
        "-i", audio_file,
        "-map", "0:v:0",        # video from the first input
        "-map", "1:a:0",        # audio from the second input
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        "-movflags", "+faststart",
        output_file
    ]

    print("🔧 Ejecutando FFmpeg:")
    print(" ".join(cmd))
    try:
        proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"✅ Video generado: {output_file}")
    except subprocess.CalledProcessError as e:
        print("❌ FFmpeg falló (retcode={}):".format(e.returncode))
        print(e.stderr)
        sys.exit(1)

    # 6. Clean up the temporary folder
    shutil.rmtree(temp_dir)
    print(f"🧹 Carpeta temporal {temp_dir} eliminada.")

if __name__ == "__main__":
    build_video_constant_fps()


ℹ️  Generados 1640 frames en __frames_temp__/
🔧 Ejecutando FFmpeg:
ffmpeg -y -framerate 30 -start_number 0 -i __frames_temp__/frame_%06d.png -i audio.mp3 -map 0:v:0 -map 1:a:0 -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest -movflags +faststart salida.mp4
✅ Video generado: salida.mp4
🧹 Carpeta temporal __frames_temp__ eliminada.
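
The cell above copies every image once per frame, which is simple but writes many PNGs to disk. Since `inputs.txt` already follows the concat demuxer layout, an alternative is to pass it to FFmpeg directly. The sketch below only builds the command list and has not been run against this notebook's data; `concat_command` is a hypothetical helper, and timing with still images may differ from the frame-expansion approach:

```python
def concat_command(inputs_txt="inputs.txt", audio_file="audio.mp3",
                   output_file="salida.mp4", fps=30):
    """Build an FFmpeg command that reads the image list via the concat demuxer."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",   # read inputs_txt as a concat list
        "-i", inputs_txt,
        "-i", audio_file,
        "-map", "0:v:0",                # video from the image list
        "-map", "1:a:0",                # audio from the second input
        "-r", str(fps),                 # constant output frame rate
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        output_file,
    ]


cmd = concat_command()
print(" ".join(cmd))
```

Relative paths inside the list are resolved relative to the list file's location, so `inputs.txt` should sit where the image paths expect it.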


# AUTOMATIC VIDEO GENERATION 2 🎬

# Set up

In [None]:
# Install ffmpeg and its Python wrapper
!apt update -qq && apt install -y ffmpeg
!pip install -q ffmpeg-python

# --- INITIAL CONFIGURATION ---
import os

# File paths
video_entrada = "video_base.mp4"
video_salida = "video_final.mp4"

# Subtitle or text to overlay (can be replaced by automatically generated text)
texto = "Este es un texto que irá encima del video!"

# Visual parameters
fuente = "Arial"
color_texto = "white"
tamaño_fuente = 48
pos_x = 20
pos_y = "(h - text_h - 20)"  # Measured up from the bottom edge

# Other settings
modo_vertical = True  # True for TikTok/Instagram-style videos


32 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


# RENDER VIDEO, METHOD 1

In [None]:
# Cleanup:
import os

# List of files to delete
archivos = ["audio.mp3", "audioL.mp3", "audio.txt", "audio.srt", "video_final.mp4", "audio.aligned.json", "inputs.txt"]

for archivo in archivos:
    if os.path.exists(archivo):
        os.remove(archivo)
        print(f"✅ Archivo eliminado: {archivo}")
    else:
        print(f"❌ No existe: {archivo}")


✅ Archivo eliminado: audio.mp3
✅ Archivo eliminado: audioL.mp3
✅ Archivo eliminado: audio.txt
✅ Archivo eliminado: audio.srt
❌ No existe: video_final.mp4
✅ Archivo eliminado: audio.aligned.json
✅ Archivo eliminado: inputs.txt


In [None]:
# ffmpeg usually comes preinstalled in Colab; install it first if it is missing

import json
import re
import subprocess

def obtener_duracion_audio(audio_path):
    """Return the audio duration in seconds, using ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "json", audio_path],
        capture_output=True,
        text=True
    )
    duracion = float(json.loads(result.stdout)["format"]["duration"])
    return duracion

def limpiar_texto(texto):
    # Drop every character outside the allowed set
    texto = re.sub(r"[^\w\sáéíóúüñÁÉÍÓÚÜÑ.,;:¡!¿?\"'()\-\[\]{}<>…–—°%€$@#&+=*/\\]", "", texto)
    texto = texto.replace(":", "")   # remove colons (option separator in drawtext)
    texto = texto.replace("'", "")   # remove single quotes
    texto = texto.replace('"', '')   # remove double quotes
    texto = texto.replace("\\", "")  # remove backslashes
    return texto

def generar_filtros_drawtext(frases, font_path="/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"):
    """
    Build one drawtext filter per word (with its start and end times),
    applying a fade-in/fade-out effect while the word is shown.
    """
    filtros = []
    for f in frases:
        # Clean the text, then wrap it in single quotes
        texto = limpiar_texto(f.get("word", ""))
        texto = f"'{texto}'"

        start = float(f["start"])
        end = float(f["end"])
        fade_dur = 0.05  # fade-in/out duration in seconds, adjustable

        fade_in_end = start + fade_dur
        fade_out_start = end - fade_dur

        # Build the alpha expression
        alpha_expr = (
            f"if(lt(t,{start:.3f}),0,"
            f"if(lt(t,{fade_in_end:.3f}),(t-{start:.3f})/{fade_dur:.3f},"
            f"if(lt(t,{fade_out_start:.3f}),1,"
            f"if(lt(t,{end:.3f}),({end:.3f}-t)/{fade_dur:.3f},0))))"
        )
        # Escape the commas inside the expression so they are not parsed as filter separators
        alpha_expr = alpha_expr.replace(",", r"\,")

        filtro = (
            f"drawtext=fontfile='{font_path}':"
            f"text={texto}:"
            f"fontcolor=white:fontsize=60:"
            f"x=(w-text_w)/2:"
            f"y=(h-text_h)/2:"
            f"alpha='{alpha_expr}'"
        )
        filtros.append(filtro)
    # Join the drawtext filters with commas so they are applied in sequence to the same video.
    filtro_completo = ",".join(filtros)
    return filtro_completo

def main():
    # Input/output paths
    audio = "audio.mp3"
    json_file = "audio.aligned.json"
    output = "video_final.mp4"

    # Get the audio duration to create the base video
    duracion_audio = obtener_duracion_audio(audio)
    # Add 1 extra second to the base video duration (adjustable)
    duracion_video = duracion_audio + 1

    # Read the JSON file with the word alignment (expected here as a flat list of word dicts)
    with open(json_file, "r", encoding="utf-8") as f:
        frases = json.load(f)

    # Build the drawtext filter
    drawtext_filter = generar_filtros_drawtext(frases)
    print("Filtro drawtext generado:\n", drawtext_filter)

    # Create a black base video with the computed duration, 1280x720 at 30 fps
    comando_crear_base = [
        "ffmpeg",
        "-y",  # Overwrite if it exists
        "-f", "lavfi",
        "-i", f"color=size=1280x720:duration={duracion_video}:rate=30:color=black",
        "base.mp4"
    ]

    print("Creando video base...")
    resultado = subprocess.run(comando_crear_base, stderr=subprocess.PIPE, text=True)
    if resultado.returncode != 0:
        print("Error al crear el video base:", resultado.stderr)
        return

    # Combine the base video and the audio, and apply the text filter
    comando_ffmpeg = [
        "ffmpeg",
        "-y",  # Overwrite the output
        "-i", "base.mp4",
        "-i", audio,
        "-vf", drawtext_filter,
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-c:a", "aac",
        "-shortest",
        output
    ]

    print("Ejecutando ffmpeg para generar el video final...")
    resultado = subprocess.run(comando_ffmpeg, stderr=subprocess.PIPE, text=True)
    if resultado.returncode != 0:
        print("❌ Error al generar el video:")
        print(resultado.stderr)
    else:
        print("✅ Video generado con éxito:", output)

if __name__ == "__main__":
    main()


Filtro drawtext generado:
 drawtext=fontfile='/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf':text='High':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:alpha='if(lt(t\,0.000)\,0\,if(lt(t\,0.050)\,(t-0.000)/0.050\,if(lt(t\,1.882)\,1\,if(lt(t\,1.932)\,(1.932-t)/0.050\,0))))',drawtext=fontfile='/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf':text='quality':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:alpha='if(lt(t\,1.953)\,0\,if(lt(t\,2.003)\,(t-1.953)/0.050\,if(lt(t\,3.271)\,1\,if(lt(t\,3.321)\,(3.321-t)/0.050\,0))))',drawtext=fontfile='/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf':text='on':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:alpha='if(lt(t\,3.784)\,0\,if(lt(t\,3.834)\,(t-3.784)/0.050\,if(lt(t\,4.358)\,1\,if(lt(t\,4.408)\,(4.408-t)/0.050\,0))))',drawtext=fontfile='/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf':text='the':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:alpha='if(lt(t\,4.428)\,0\,if(lt(t\