<span style="font-size:1.5em;color:red;">**NOTE**</span>: YouTube periodically updates its auto-generated captions, so running this code in the future may produce slightly different transcripts (or timestamps) from those used in the analyses. The transcripts in `data/raw/` were fetched from the YouTube API on April 25, 2022.
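
One way to check whether the captions have drifted is to diff a freshly fetched transcript string against the saved one. The sketch below is only an illustration: `diff_transcripts` is a hypothetical helper (not part of `khan_helpers`), and the two short strings are made-up stand-ins for a saved and a re-fetched transcript.

```python
import difflib


def diff_transcripts(saved_text, fresh_text, context=0):
    """Return unified-diff lines showing how fresh_text differs from saved_text.

    Hypothetical helper for illustration; an empty list means no differences.
    """
    return list(difflib.unified_diff(
        saved_text.splitlines(),
        fresh_text.splitlines(),
        fromfile='saved',
        tofile='fresh',
        n=context,      # number of unchanged context lines to show
        lineterm='',    # inputs have no trailing newlines
    ))


# made-up example strings in the notebook's timestamp/text line format
saved = "00:00.03\nwhat I want to do\n00:01.829"
fresh = "00:00.03\nwhat I want to do\n00:01.83"
changes = diff_transcripts(saved, fresh)
print('\n'.join(changes))
```

In this notebook, the same comparison could presumably be run on `FORCES_TRANSCRIPT_PATH.read_text()` and the `forces_transcript` string built further down.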

# Imports

In [1]:
from datetime import datetime as dt

from youtube_transcript_api import YouTubeTranscriptApi as youtube

from khan_helpers.constants import RAW_DIR

Experiment & Participant classes, helper functions, and variables used across multiple notebooks can be found in `/mnt/code/khan_helpers/khan_helpers`, or on GitHub, [here](https://github.com/contextlab/efficient-learning-khan/tree/master/code/khan_helpers).<br />You can also view source code directly from the notebook with:<br /><pre>    from khan_helpers.functions import show_source<br />    show_source(foo)</pre>

# Define functions

In [2]:
def get_transcript_data(video_id):
    # list all transcripts available for the video, then fetch the
    # auto-generated English one
    transcript_list = youtube.list_transcripts(video_id)
    auto_transcript = transcript_list.find_generated_transcript(['en'])
    return auto_transcript.fetch()
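
`transcript_to_str` below indexes each fetched chunk as a dict with `'text'`, `'start'`, and `'duration'` keys. As far as I know, some later releases of `youtube_transcript_api` return snippet objects with attributes of the same names instead, so it is worth checking the installed version. A minimal sketch of a normalizing shim, where `chunk_to_dict` is a hypothetical helper and the attribute-style chunk is an assumption about other library versions:

```python
from types import SimpleNamespace

CHUNK_KEYS = ('text', 'start', 'duration')


def chunk_to_dict(chunk):
    """Return a plain dict with 'text', 'start', and 'duration' entries.

    Accepts either a dict-style chunk (as used in this notebook) or an
    object exposing the same names as attributes (assumed alternative).
    """
    if isinstance(chunk, dict):
        return {key: chunk[key] for key in CHUNK_KEYS}
    return {key: getattr(chunk, key) for key in CHUNK_KEYS}


# stand-ins for the two chunk styles; no network access needed
dict_chunk = {'text': 'what I want to do', 'start': 0.03, 'duration': 3.99}
attr_chunk = SimpleNamespace(text='what I want to do', start=0.03, duration=3.99)
print(chunk_to_dict(dict_chunk) == chunk_to_dict(attr_chunk))
```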

In [3]:
def transcript_to_str(transcript_data):
    transcript_lines = []
    for chunk in transcript_data:
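        # captions may span multiple lines; flatten each chunk to one line
        text = chunk['text'].replace('\n', ' ')
        timestamp = chunk['start']
        # format as MM:SS.ffffff, then trim trailing zeros. Note that 
        # fromtimestamp uses local time, so the minutes field is only 
        # correct for whole-hour UTC offsets and videos under an hour
        ts_str = dt.fromtimestamp(timestamp).strftime("%M:%S.%f").rstrip('0')
        # keep at least one decimal digit (e.g., "00:36." -> "00:36.0")
        if ts_str.endswith('.'):
            ts_str += '0'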
            
        transcript_lines.append(ts_str)
        transcript_lines.append(text)
        
    # timestamp resolution is ms, rounding just deals with floating 
    # point error
    end_time = round(chunk['start'] + chunk['duration'], 3)
    end_time_str = dt.fromtimestamp(end_time).strftime("%M:%S.%f").rstrip('0')
    if end_time_str.endswith('.'):
        end_time_str += '0'
    
    transcript_lines.append(end_time_str)
    return '\n'.join(transcript_lines)
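
The timestamp formatting above goes through `dt.fromtimestamp`, which depends on the machine's local time zone and wraps the minutes field after an hour. A sketch of an arithmetic-only alternative follows; `format_timestamp` is a hypothetical helper, not the function used to produce the saved transcripts, and it assumes millisecond resolution as noted in the comment above.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as MM:SS.f with trailing zeros trimmed.

    Hypothetical, time-zone-independent alternative. Unlike strftime('%M'),
    minutes keep counting past 59 rather than wrapping.
    """
    # work in integer milliseconds to sidestep floating point error
    total_ms = round(seconds * 1000)
    minutes, ms_remainder = divmod(total_ms, 60_000)
    secs, ms = divmod(ms_remainder, 1000)
    # trim trailing zeros but keep at least one decimal digit
    fraction = f"{ms:03d}".rstrip('0') or '0'
    return f"{minutes:02d}:{secs:02d}.{fraction}"


for offset in (0.03, 1.829, 21.3, 36.0):
    print(format_timestamp(offset))
```

For offsets under an hour this should produce the same strings as the cell above (for example `00:00.03` and `00:01.829`).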

# Set constants

In [4]:
FORCES_VIDEO_ID = 'FEF6PxWOvsk'
BOS_VIDEO_ID = 'i-NNWI8Ccas'

FORCES_TRANSCRIPT_PATH = RAW_DIR.joinpath('forces_transcript_timestamped.txt')
BOS_TRANSCRIPT_PATH = RAW_DIR.joinpath('bos_transcript_timestamped.txt')

# Get automatically generated lecture transcripts from YouTube's API

In [5]:
forces_transcript_data = get_transcript_data(FORCES_VIDEO_ID)
forces_transcript = transcript_to_str(forces_transcript_data)
print(forces_transcript)

00:00.03
what I want to do in this video is give
00:01.829
a very high-level overview of the four
00:04.02
fundamental forces four fundamental
00:07.62
forces of the universe and I'm going to
00:09.54
start with gravity I'm going to start
00:12.719
with gravity and it might surprise some
00:15.57
of you that gravity is actually the
00:17.43
weakest of the four fundamental forces
00:19.32
that's surprising because you say wow
00:21.3
that's what keeps us glued not glued but
00:23.76
it keeps us from jumping off the planet
00:25.65
it's what keeps the moon in orbit around
00:27.24
the earth the earth in orbit around the
00:29.55
Sun the Sun in orbit around the the
00:33.18
center of the Milky Way galaxy so it's
00:35.43
it's a little bit surprising that it's
00:37.23
actually the weakest of the forces and
00:39.6
and that starts to make sense when you
00:42.809
actually think about things on maybe
00:44.28
more of a human scale or a molecular
00:46.14
scale or even an atomic scale even o

In [6]:
bos_transcript_data = get_transcript_data(BOS_VIDEO_ID)
bos_transcript = transcript_to_str(bos_transcript_data)
print(bos_transcript)

00:00.03
let's imagine we have a huge cloud of
00:02.399
hydrogen atoms floating in space Hugh
00:05.67
and I say huge cloud huge both in
00:07.44
distance and in mass if you were to
00:09.69
combine all of the hydrogen atoms it
00:11.79
would just be this really really massive
00:14.549
thing so you have this huge cloud well
00:16.92
we know that gravity would make the
00:19.17
atoms actually attracted to each other
00:21.42
instantly we normally don't think about
00:23.13
the gravity of atoms but it would slowly
00:25.949
affect these atoms and they'd slowly
00:28.199
draw close to each other it would slowly
00:30.779
condense they'd slowly they slowly move
00:33.63
towards the center of mass of all of the
00:36.329
atoms
00:36.84
they'd slowly move in and so if we fast
00:39.69
forward if we fast forward this cloud is
00:42.84
going to get denser and denser it's
00:45.3
going to get denser and denser and the
00:46.95
hydrogen atoms are going to start
00:48.329
bumping into each othe

In [7]:
# FORCES_TRANSCRIPT_PATH.write_text(forces_transcript)
# BOS_TRANSCRIPT_PATH.write_text(bos_transcript)
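
The saved files alternate timestamp and text lines and end with one final end-time line, as produced by `transcript_to_str`. A minimal sketch of reading that format back into chunks is shown here; `parse_timestamped_transcript` and `to_seconds` are hypothetical helpers, and each chunk's `'end'` is taken to be the next timestamp in the file, which is an assumption rather than the original `start + duration` (except for the last chunk).

```python
def to_seconds(ts_str):
    """Convert an 'MM:SS.f' timestamp string to seconds."""
    minutes, secs = ts_str.split(':')
    return int(minutes) * 60 + float(secs)


def parse_timestamped_transcript(text):
    """Parse alternating timestamp/text lines into a list of chunk dicts."""
    lines = text.strip().split('\n')
    # n chunks -> n timestamps + n text lines + 1 closing timestamp
    if len(lines) % 2 == 0:
        raise ValueError('expected an odd number of lines')

    timestamps = [to_seconds(line) for line in lines[0::2]]
    texts = lines[1::2]
    return [
        {'start': start, 'end': end, 'text': body}
        for start, end, body in zip(timestamps[:-1], timestamps[1:], texts)
    ]


# short sample in the same layout as the printed transcripts above
sample = "00:00.03\nwhat I want to do\n00:01.829\na very high-level\n00:04.02"
chunks = parse_timestamped_transcript(sample)
print(chunks)
```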