# YoutubeCC (`yt-cc`) - A YouTube Closed Captions (`.json3`) parser

`yt-cc` is a Python library that parses YouTube closed-caption (`.json3`) files. It lets you query a precise time range of a video's transcript or iterate over the whole transcript line by line.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path

from yt_dlp import YoutubeDL
from yt_cc import YoutubeCC

In [3]:
DATA_PATH = Path.cwd() / ".tmp"

## Downloading the Closed Captions with `yt-dlp`

To download the closed captions of a YouTube video, use the `yt-dlp` command-line tool, which can be installed with `pip`:

```bash
pip install yt-dlp
```

To fetch a video's automatically generated captions in the `.json3` format without downloading the video itself, run:

```bash
yt-dlp --write-auto-sub --skip-download --sub-format json3 <video-url>
```

The same download can be done through the `yt-dlp` Python API:

In [4]:
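YoutubeDL(
    params=dict(
        paths=dict(home=str(DATA_PATH)),  # write files under DATA_PATH
        skip_download=True,  # captions only, no video
        outtmpl="%(id)s.%(ext)s",  # name files after the video ID
        subtitlesformat="json3",
        writeautomaticsub=True,  # fetch the auto-generated captions
    )
).download(["https://www.youtube.com/watch?v=oHWuv1Aqrzk"])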

yt_cc_file_path = next(DATA_PATH.glob("*.json3"))
yt_cc_file_path

[youtube] Extracting URL: https://www.youtube.com/watch?v=oHWuv1Aqrzk
[youtube] oHWuv1Aqrzk: Downloading webpage
[youtube] oHWuv1Aqrzk: Downloading ios player API JSON
[youtube] oHWuv1Aqrzk: Downloading android player API JSON




[youtube] oHWuv1Aqrzk: Downloading m3u8 information
[info] oHWuv1Aqrzk: Downloading subtitles: en
[info] oHWuv1Aqrzk: Downloading 1 format(s): 248+251
Deleting existing file /home/arthur/Documents/02.workspace/02.active/clips-analytics/yt-cc/.tmp/oHWuv1Aqrzk.en.json3
[info] Writing video subtitles to: /home/arthur/Documents/02.workspace/02.active/clips-analytics/yt-cc/.tmp/oHWuv1Aqrzk.en.json3
[download] Destination: /home/arthur/Documents/02.workspace/02.active/clips-analytics/yt-cc/.tmp/oHWuv1Aqrzk.en.json3
[download] 100% of   71.66KiB in 00:00:00 at 1.28MiB/s


PosixPath('/home/arthur/Documents/02.workspace/02.active/clips-analytics/yt-cc/.tmp/oHWuv1Aqrzk.en.json3')
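Before handing the file to `yt-cc`, it can help to peek at the raw structure. The sketch below uses only the standard library and assumes the `.json3` layout commonly seen in these files: a top-level `events` list whose items carry `tStartMs`, `dDurationMs`, and an optional `segs` list with `utf8` text. Those key names, the hand-written sample, and the `summarize_events` helper are assumptions made for illustration; they are not part of `yt-cc`.

```python
import json

# Hypothetical, hand-written sample that mimics the assumed .json3 layout.
SAMPLE_JSON3 = """
{
  "events": [
    {"tStartMs": 0, "dDurationMs": 218599, "id": 1},
    {"tStartMs": 2820, "dDurationMs": 5279, "wWinId": 1,
     "segs": [{"utf8": "is"}, {"utf8": " there", "tOffsetMs": 599}]},
    {"tStartMs": 5510, "dDurationMs": 2589, "wWinId": 1, "aAppend": 1,
     "segs": [{"utf8": "\\n"}]}
  ]
}
"""


def summarize_events(raw):
    """Return (start_ms, duration_ms, text) for each event in a json3 payload."""
    rows = []
    for event in json.loads(raw).get("events", []):
        # Events without "segs" (such as the first one) carry no text.
        text = "".join(seg.get("utf8", "") for seg in event.get("segs", []))
        rows.append((event.get("tStartMs"), event.get("dDurationMs"), text))
    return rows


for row in summarize_events(SAMPLE_JSON3):
    print(row)
```

To try it on the downloaded file, pass `yt_cc_file_path.read_text(encoding="utf-8")` instead of `SAMPLE_JSON3`.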

## Parsing the Closed Captions with `yt-cc`

In [5]:
youtube_caption = YoutubeCC(yt_cc_file_path)
youtube_caption

In [6]:

for i, line_cc in enumerate(youtube_caption):
    print(line_cc)
    if i > 5:
        break

LineCC(event_id=1.0, start_time_ms=0.0, duration_ms=218599.0, window_id=nan, window_style_id=1.0, window_position_id=1.0, append=nan, segments=[])
LineCC(event_id=nan, start_time_ms=2820.0, duration_ms=5279.0, window_id=1.0, window_style_id=nan, window_position_id=nan, append=nan, segments=['is', ' there', ' cool', ' small', ' projects', ' like', ' uh'])
LineCC(event_id=nan, start_time_ms=5510.0, duration_ms=2589.0, window_id=1.0, window_style_id=nan, window_position_id=nan, append=1.0, segments=['\n'])
LineCC(event_id=nan, start_time_ms=5520.0, duration_ms=6180.0, window_id=1.0, window_style_id=nan, window_position_id=nan, append=nan, segments=['archive', ' sanity', ' and', ' and', ' so', ' on', ' that', " you're"])
LineCC(event_id=nan, start_time_ms=8089.0, duration_ms=3611.0, window_id=1.0, window_style_id=nan, window_position_id=nan, append=1.0, segments=['\n'])
LineCC(event_id=nan, start_time_ms=8099.0, duration_ms=5580.0, window_id=1.0, window_style_id=nan, window_position_id=nan

In [7]:
youtube_caption.lines

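| | event_id | start_time_ms | duration_ms | window_id | window_style_id | window_position_id | append |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 218599.0 | | 1.0 | 1.0 | |
| 1 | | 2820 | 5279.0 | 1.0 | | | |
| 2 | | 5510 | 2589.0 | 1.0 | | | 1.0 |
| 3 | | 5520 | 6180.0 | 1.0 | | | |
| 4 | | 8089 | 3611.0 | 1.0 | | | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 197 | | 212580 | 3900.0 | 1.0 | | | |
| 198 | | 214850 | 1630.0 | 1.0 | | | 1.0 |
| 199 | | 214860 | 3739.0 | 1.0 | | | |
| 200 | | 216470 | 2129.0 | 1.0 | | | 1.0 |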


In [8]:
youtube_caption.segments

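| | text | asr_confidence | offset_ms | pen_id | start_time_ms | line_id |
|---|---|---|---|---|---|---|
| 0 | is | 248.0 | | | 2820 | 1 |
| 1 | there | 248.0 | 599.0 | | 3419 | 1 |
| 2 | cool | 248.0 | 839.0 | | 3659 | 1 |
| 3 | small | 248.0 | 1020.0 | | 3840 | 1 |
| 4 | projects | 248.0 | 1380.0 | | 4200 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 763 | sounds | 248.0 | 1260.0 | | 216120 | 199 |
| 764 | kind | 240.0 | 1320.0 | | 216180 | 199 |
| 765 | of | 248.0 | 1500.0 | | 216360 | 199 |
| 766 | `\n` | | | | 216470 | 200 |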
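In the rows shown here, each segment's `start_time_ms` equals the `start_time_ms` of its line (looked up through `line_id`) plus the segment's `offset_ms`, with a missing offset counting as zero. The sketch below checks that relationship in plain Python on the first five segments of line 1. It is an observation about this sample rather than a documented guarantee of `yt-cc`, and `absolute_start_ms` is a hypothetical helper.

```python
def absolute_start_ms(line_start_ms, offset_ms):
    """Segment start = line start + offset; a missing offset (None) counts as 0."""
    return line_start_ms + (offset_ms or 0)


# Line 1 starts at 2820 ms (value copied from the lines output).
line_start_ms = {1: 2820}

# (text, offset_ms, start_time_ms, line_id), copied from the segments output.
segments = [
    ("is", None, 2820, 1),
    ("there", 599, 3419, 1),
    ("cool", 839, 3659, 1),
    ("small", 1020, 3840, 1),
    ("projects", 1380, 4200, 1),
]

for text, offset_ms, start_time_ms, line_id in segments:
    computed = absolute_start_ms(line_start_ms[line_id], offset_ms)
    assert computed == start_time_ms, (text, computed, start_time_ms)
```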


In [9]:
print(youtube_caption.get_text(start_time_ms=0, end_time_ms=10000))

is there cool small projects like uh
archive sanity and and so on that you're
thinking about the the the
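The third line of the output stops mid-sentence, which suggests the time window is applied per segment rather than per line. A minimal sketch of that kind of filtering follows. It is not the implementation used by `yt-cc`: `text_between` is a hypothetical helper, the half-open window `[start, end)` is an assumption, and the sample segments reuse the start times and texts shown earlier (with the leading spaces seen in the `LineCC` output).

```python
def text_between(segments, start_time_ms, end_time_ms):
    """Concatenate the text of segments that start inside [start, end)."""
    return "".join(
        text
        for segment_start_ms, text in segments
        if start_time_ms <= segment_start_ms < end_time_ms
    )


# (start_time_ms, text) pairs taken from the outputs above.
sample_segments = [
    (2820, "is"),
    (3419, " there"),
    (3659, " cool"),
    (3840, " small"),
    (4200, " projects"),
    (5510, "\n"),
]

print(text_between(sample_segments, 0, 4000))  # prints "is there cool small"
```

Because each segment carries its own leading space or newline, plain concatenation is enough to rebuild readable text in this sample.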
