Purpose

Please grant a star if you found this repo helpful :)

Purpose

Produces an accurate LLM friendly JSON/~~Toon~~ transcription of a video for editing tasks. While minimizing token cost.

About the Project

This tool focuses on creating structured, machine-readable transcriptions of video content for editing workflows. It combines audio transcription and video-to-text analysis using tools such as Soniox for speech recognition and a two-phase Gemini Video-to-Text (VTT) transcription system for video analysis.

Audio Transcription

The Soniox ASR tool is used to process speech, generating both diarized speaker transcriptions and timestamps.

Video Transcription (Two-Phase Gemini VTT)

Gemini VTT splits video analysis to a two-phase process:

Phase 1 (Coarse Analysis): Processes the video at 2 fps using a low token count per frame (~66 tokens). This phase identifies general timeline events and marks video segments requiring further refinement.
Phase 2 (Refined Analysis): Uses the marked output from Phase 1 to process only the highlighted segments. It runs at 10 fps with a high token count per frame (~258 tokens), focusing on fine-grained analysis for critical sections of the video, ensuring both precision and efficiency.

These tools work together to produce a unified timeline, which includes transcribed speech, speaker information, and visual events. This structured JSON output makes it easier to understand, index, or process video content.

Example:

Video: https://www.pinterest.com/pin/448600812903681285/
Context: "A talking head video, of a guy talking about motivation and inspiration."
Full Output: https://pastebin.com/JTARvpxb
Output Snippet:

{
  "version": "vidscriber.v1",
  // Top level summary of the video
  "summary": "A man in a black hoodie speaks directly to the camera about the importance of setting reminders for goals, emphasizing that it's about the 'next logical step' rather than constant motivation. He suggests physical notebooks or digital apps as reminder tools and concludes by stressing the importance of celebrating each step completed.",
  // Array of utterances and their associated video nodes
  "speech_timeline": [
    {
      "id": "utt_0",
      "time": "[0.300->2.400]", // Start and end time of the utterance
      "speaker": "1", // ID of the speaker
      "language": "en", 
      "text": "Next, this is huge, and this is set reminders.",
      // Video nodes that contain this utterance
      "video_nodes": [
        "state_0 (Speaker discusses goal reminders and celebrating progress) [0.300->2.400]",
        "state_0.span_0 (Introduction to setting reminders) [0.300->2.400]"
      ],
      // Array of words in the utterance with accurate timestamps
      "words": [
        "[0.300->0.480] Next,",
        "[0.540->0.600] this",
        "[0.660->0.720] is",
        "[0.840->1.140] huge,",
        "[1.140->1.200] and",
        "[1.320->1.380] this",
        "[1.440->1.500] is",
        "[1.620->1.680] set",
        "[1.920->2.400] reminders."
      ]
    },
    ... Rest of the the utterances
  ],
  // Array of video nodes
  "video_timeline": [
    {
      "id": "state_0",
      "kind": "state", // Type of node
      "time": "[0.000->46.900]", // Start and end time of the node
      "label": "Speaker discusses goal reminders and celebrating progress", // General title
      // Tags that describe the node
      "tags": [
        "presentation",
        "speaker:onscreen",
        "gaze:camera",
        "object:microphone",
        "background:shelves"
      ],
      // Child nodes that are inside this node
      "children": [
        {
          "id": "state_0.span_0",
          "kind": "span",
          "time": "[0.000->4.900]",
          "label": "Introduction to setting reminders",
          "tags": [
            "topic:reminders",
            "emotion:serious",
            "emotion:smiling",
            "gesture:touching_face"
          ]
        },
        ... Rest of the child nodes
      ]
    }
    ... Rest of the top level nodes (In this case there is only one)
  ]
}

How To Run:

Set up .env following .env.example.
You can run the tool in your IDE using the src/ide/index.js file:
- Assign a video link or a file path to the uri field.
- Describe the video's context in the context field (Optional, to make transcription more accurate).
- Insure the RECOMPUTE_FROM field is set to "download" for new files.
- Run the file.
- You can see the output of each step in the ./cache folder.

Steps

Normalze video and audio quality
Use ASR model to transcribe and diarize speech (and maybe other sound events).
User a Video Understanding model to transcribe events and visuals.
Combine outputs into a single timeline.
~~Translate JSON timeline into Toon format.~~ (Not doing since the transcription structure is too complex for Toon to be beneficial)
Define functions for google cloud functions integration.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
src		src
.env.example		.env.example
.gcloudignore		.gcloudignore
.gitignore		.gitignore
README.md		README.md
index.js		index.js
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Purpose

About the Project

Audio Transcription

Video Transcription (Two-Phase Gemini VTT)

Example:

How To Run:

Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Purpose

About the Project

Audio Transcription

Video Transcription (Two-Phase Gemini VTT)

Example:

How To Run:

Steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages