# CREATING LEX TRANSCRIPT DATABASE

In this notebook, we will guide you through the process of creating a Lex Transcript database in MongoDB.

We will use as input the data from the YouTube videos, which is available both in the JSON file `LexFridman_videos.json` and in the MongoDB collection that we created in `create_video_db.ipynb`. The output will be a new MongoDB collection and a storage folder with one JSON file per transcript.

In [43]:
import os
import json
import random
import mongo_utils as mu
from pymongo import UpdateOne
from pymongo.errors import BulkWriteError

### MONGODB CONNECTION

In [2]:
client = mu.connect_to_mongodb()

In [3]:
db_name = 'lex_podcast'
db = mu.get_database(client, db_name)
db.list_collection_names()

['LexFridmanPodcast']

Let's retrieve the videos from the collection using the function `get_documents` defined in the `mongo_utils.py` file.

In [4]:
collection_name = 'LexFridmanPodcast'
collection = db[collection_name]
# Get all the videos in the collection
videos = mu.get_documents(collection, limit=None)
print(f"Number of videos retrieved: {len(videos)}")
print(f"Keys of the videos: {videos[random.choice(list(videos.keys()))].keys()}")
idx = random.choice(list(videos.keys()))
print("Example of video:")
print(f"Title of the video: {videos[idx]['title']}")
print(f"Timestamps of the video: {videos[idx]['timestamps']}")
print(f"Description of the video: {videos[idx]['description']}")

Number of videos retrieved: 436
Keys of the videos: dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title', 'video_id'])
Example of video:
Title of the video: Travis Stevens: Judo, Olympics, and Mental Toughness | Lex Fridman Podcast #223
Timestamps of the video: {'0:00': '- Introduction', '4:39': '- What is Judo?', '12:27': "- Travis's signature throw", '17:52': '- Fundamentals', '19:44': '- Throws', '32:36': '- Gripping', '41:09': '- Weight cutting', '1:10:22': '- Injuries', '1:14:22': '- Jiu-Jitsu', '1:18:05': '- Lex on his judo competition experience', '1:21:30': '- Levels of mastery', '1:34:41': '- Matches', '1:48:42': '- Travis inspired Lex to practice judo', '1:54:56': '- London 2012 Olympic games', '2:36:33': '- 2016 Olympic games', '3:10:56': '- Mixed team competition', '3:18:21': '- The value of epic throws', '3:21:49': '- Shohei Ono', '3:28:11': '- Chess', '3:33:14': '- The coach', '3:39:50': '- Advice for young people'}
Description of the video:

### DEFINE YOUTUBE FUNCTIONS

To get the transcripts of the YouTube videos, we will use the `youtube-transcript-api` library.

In [5]:
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import JSONFormatter, TextFormatter, PrettyPrintFormatter

In [11]:
from typing import Optional, Union

def get_transcript(video_id: str, style: Optional[str] = None) -> tuple[Optional[Union[list, str]], bool]:
    """
    Get the transcript for a YouTube video.
    :param video_id: The ID of the YouTube video.
    :param style: The style of the transcript. Can be 'json', 'text', 'pretty' or None.
    :return: The transcript (None on failure) and a boolean indicating if the transcript was retrieved successfully.
    :raises ValueError: If the style is not one of the supported values.
    """
    formatters = {
        'json': JSONFormatter,
        'text': TextFormatter,
        'pretty': PrettyPrintFormatter,
    }
    # Reject unknown styles up front so the function always returns a (transcript, success) tuple
    if style and style.lower() not in formatters:
        raise ValueError(f"Unknown style '{style}'. Use 'json', 'text', 'pretty' or None.")

    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
    except Exception as e:
        print(f"Error retrieving transcript for {video_id}: {str(e)}")
        return None, False

    # Without a style, return the raw list of transcript lines
    if not style:
        return transcript, True
    # Otherwise format the transcript with the matching formatter
    return formatters[style.lower()]().format_transcript(transcript), True

Let's retrieve the transcript for a random video. Pay attention to the format of the transcript.

In [16]:
# choosing a random video
idx = random.choice(list(videos.keys()))
video_id = idx
video_data = videos[video_id]

In [17]:
print(f"Title of the video:\n{video_data['title']}")
transcript, success = get_transcript(video_id=video_id, style=None)
print(f"Transcript retrieved successfully: {success}")
if success:
    print(f"Length of the transcript: {len(transcript)}")
    print(f"Transcript: {transcript}")

Title of the video:
Glenn Loury: Race, Racism, Identity Politics, and Cancel Culture | Lex Fridman Podcast #285
Transcript retrieved successfully: True
Length of the transcript: 5918


### STRUCTURING THE DATA

We have now retrieved the key information about the videos and their transcripts. Next, we need to organise it into a new structure in order to create the Lex Transcript database.
1. For each video we have the video ID, the title, the publication date and time, the full text and the transcript.
2. The transcript is a list of dictionaries, one per transcript line, with the following structure:
    - `start`: The start time of the line in seconds.
    - `duration`: The duration of the line in seconds.
    - `text`: The text of the line.
3. We also have the timestamps as optional information. They are stored as a dictionary with the following structure:
    - key: The start time of the section in `MM:SS` or `H:MM:SS` format.
    - value: The description of the section, prefixed with `- `.
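
To make these shapes concrete, here is a minimal sketch. The transcript lines are made up for illustration, the chapter entries are copied from the example video printed earlier, and `to_seconds` is a hypothetical helper that is not part of the notebook. It also shows why the timestamp keys should be converted to seconds before sorting: as plain strings they sort in the wrong order.

```python
# Illustrative shapes only: the transcript lines below are made up.
transcript_lines = [
    {"text": "welcome to the podcast", "start": 0.0, "duration": 3.5},
    {"text": "today we talk about judo", "start": 3.5, "duration": 4.0},
]

# Chapter entries copied from the example video shown earlier.
timestamps = {
    "0:00": "- Introduction",
    "4:39": "- What is Judo?",
    "1:10:22": "- Injuries",
    "12:27": "- Travis's signature throw",
}


def to_seconds(stamp: str) -> int:
    """Hypothetical helper: convert 'MM:SS' or 'H:MM:SS' to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Each transcript line ends at start + duration.
end_times = [line["start"] + line["duration"] for line in transcript_lines]

# Sorting the keys as strings does not give chronological order...
by_string = sorted(timestamps)
# ...while sorting by the parsed number of seconds does.
by_time = sorted(timestamps, key=to_seconds)

print(end_times)
print(by_string)
print(by_time)
```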


Let's write a function that organises the data into a new structured way.

In [18]:
def parse_timestamp(timestamp: str) -> int:
    """
    Parse a timestamp in the format HH:MM:SS or MM:SS to seconds.
    :param timestamp: The timestamp to parse.
    :return: The timestamp in seconds.
    """
    parts = timestamp.split(':')
    if len(parts) == 2:
        return int(parts[0]) * 60 + int(parts[1])
    elif len(parts) == 3:
        return int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
    else:
        raise ValueError(f"Invalid timestamp format: {timestamp}")
    

def create_transcript_structure(video_data: dict, transcript: list) -> dict:
    """
    Create a structured transcript for a video. The bulk of this function deals with dividing 
    the transcript into sections based on the timestamps.
    :param video_data: The video data.
    :param transcript: The transcript of the video.
    :return: The structured transcript.
    """
    # Initialize the structure with the video data
    structure = {
        "video_id": video_data['video_id'],
        "title": video_data['title'],
        "date": video_data['published_at'].split('T')[0],
        "transcript": [],
        "full_text": ""
    }
    timestamps = video_data.get('timestamps', {})
    # If there are timestamps, we will use them to create the transcript structure
    if timestamps:
        # Parse the timestamps and sort them by the start time
        timestamp_seconds = [(parse_timestamp(t), t, desc.lstrip('- ').strip()) for t, desc in timestamps.items()]
        timestamp_seconds.sort(key=lambda x: x[0])
        # Initialize the current section index and text
        current_section_index = 0
        current_section_text = []
        # Iterate over the transcript
        for line in transcript:
            start_time = float(line['start'])
            # If the start time is greater than the next timestamp, we save the current section
            while (current_section_index + 1 < len(timestamp_seconds) and 
                   start_time >= timestamp_seconds[current_section_index + 1][0]):
                # Save the current section
                if current_section_text:
                    structure['transcript'].append({
                        "timestamp": timestamp_seconds[current_section_index][1],
                        "section": timestamp_seconds[current_section_index][2],
                        "text": ' '.join(current_section_text)
                    })
                # Move to the next section
                current_section_index += 1
                # Reset the current section text    
                current_section_text = []
            # Add the text to the current section
            current_section_text.append(line['text'])

        # Add the last section
        if current_section_text:
            structure['transcript'].append({
                "timestamp": timestamp_seconds[current_section_index][1],
                "section": timestamp_seconds[current_section_index][2],
                "text": ' '.join(current_section_text)
            })
    else:
        # If no timestamps, create a single section with the full transcript
        structure['transcript'].append({
            "timestamp": "0:00",
            "section": "Full Transcript",
            "text": ' '.join(line['text'] for line in transcript)
        })

    structure['full_text'] = ' '.join(line['text'] for line in transcript)

    return structure
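
The sectioning logic can also be checked on toy data. The sketch below is a standalone alternative, not the function above: it assigns each transcript line to a section with a binary search (`bisect`) instead of a running index. The names `to_seconds` and `group_by_section` and the transcript lines are assumptions made for illustration.

```python
from bisect import bisect_right


def to_seconds(stamp: str) -> int:
    """Hypothetical helper: convert 'MM:SS' or 'H:MM:SS' to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def group_by_section(lines: list, timestamps: dict) -> list:
    """Group transcript lines under the latest section that started before them."""
    # (start in seconds, original key, cleaned description), sorted by start time
    sections = sorted(
        (to_seconds(t), t, desc.lstrip("- ").strip()) for t, desc in timestamps.items()
    )
    starts = [s[0] for s in sections]
    grouped = {}
    for line in lines:
        # Index of the last section whose start is <= the line's start.
        # Lines before the first timestamp fall into the first section.
        idx = max(bisect_right(starts, float(line["start"])) - 1, 0)
        grouped.setdefault(idx, []).append(line["text"])
    # Sections without any text are skipped
    return [
        {"timestamp": sections[i][1], "section": sections[i][2], "text": " ".join(texts)}
        for i, texts in sorted(grouped.items())
    ]


# Made-up transcript lines; the two chapter entries come from the example output below.
toy_timestamps = {"0:00": "- Introduction", "1:10": "- Martin Luther King Jr."}
toy_lines = [
    {"start": 0.0, "duration": 3.0, "text": "hello"},
    {"start": 65.5, "duration": 4.0, "text": "still intro"},
    {"start": 70.0, "duration": 2.0, "text": "next chapter"},
]

sections = group_by_section(toy_lines, toy_timestamps)
for section in sections:
    print(section)
```

A line that starts exactly at a section boundary (70 seconds here) goes into the new section, matching the `>=` comparison used in `create_transcript_structure`.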

Here is an example of the transcript structure for the random video chosen before.

In [31]:
transcript_structure = create_transcript_structure(video_data=video_data, transcript=transcript)
print(f"Transcript structure keys: {transcript_structure.keys()}")
for key in transcript_structure.keys():
    if key == 'transcript':
        for section in transcript_structure[key]:
            print(f"{section['timestamp']} - {section['section']}:\n{section['text'][:150]}...\n")
    print(f"{key}:\n{transcript_structure[key]}")

Transcript structure keys: dict_keys(['video_id', 'title', 'date', 'transcript', 'full_text'])
video_id:
YbJZnShMQAo
title:
Glenn Loury: Race, Racism, Identity Politics, and Cancel Culture | Lex Fridman Podcast #285
date:
2022-05-14
0:00 - Introduction:
i hate affirmative action i don't just disagree with it i don't just think it's against the 14th amendment i hate it the hatred comes from an understa...

1:10 - Martin Luther King Jr.:
martin luther king juniors i have a dream speech i think is the greatest speech in american history if i may i'd like to read a few words of it sure a...

9:58 - History of slavery:
what do you learn about human nature by looking at the history of slavery in america oh my so what does that tell you about people well i think of two...

24:36 - Equality of outcome:
on this topic of equality in uh the 21st century so what does equality mean today if you start to think about this idea of equality of outcome or the ...

40:59 - Math and economics:
mathematics

### CREATING AND POPULATING THE NEW COLLECTION

The only step we are missing is to create a new collection in the already existing database and populate it with the structured data for all the videos.

As with the videos collection, we will create the new collection by inserting one initial document with the `createDB_from_data` function from the `mongo_utils.py` file.

In [36]:
video_list = list(videos.keys())
print(len(video_list))
first_video = video_list.pop()
print(first_video)
print(len(video_list))
videos[first_video]

436
piHkfmeU7Wo
435


{'description': '',
 'duration': 'PT57M54S',
 'published_at': '2018-05-29T13:16:25Z',
 'tags': None,
 'timestamps': None,
 'title': 'Christof Koch: Consciousness | Lex Fridman Podcast #2',
 'video_id': 'piHkfmeU7Wo'}

In [37]:
db_name = 'lex_podcast'     # the same database we used before
collection_name = 'Podcast_transcripts'     # new collection name (new collection)
# retrieve the transcript for the first video
transcript, success = get_transcript(video_id=first_video, style=None)
if success:
    # create the structured document
    initial_document = create_transcript_structure(video_data=videos[first_video], transcript=transcript)
    print(f"Initial document created with keys: {initial_document.keys()}")
    # create the database
    db = mu.createDB_from_data(
        client=client, 
        database_name=db_name, 
        collection_name=collection_name, 
        initial_document=initial_document, 
        custom_id=first_video
    )
    if db is None:
        # this means that the database or the collection already exists
        db = mu.get_database(client, db_name)
else:
    print(f"Could not retrieve the transcript for {first_video}; the collection was not created.")
        

Initial document created with keys: dict_keys(['video_id', 'title', 'date', 'transcript', 'full_text'])
Database 'lex_podcast' already exists.
Database 'lex_podcast' and collection 'Podcast_transcripts' created successfully.
Inserted document with ID: piHkfmeU7Wo
Database 'lex_podcast' created with collection 'Podcast_transcripts'.


In [38]:
# Check if the database and the collection were created
print(f"Databases in the client: {client.list_database_names()}")
print(f"Collections in the database '{db_name}': {db.list_collection_names()}")

collection = db[collection_name]
all_documents = list(collection.find())
print(f"Number of documents in the collection: {len(all_documents)}")
if len(all_documents) <= 10:
    print("All documents in the collection:")
    for doc in all_documents:
        print(doc)

Databases in the client: ['admin', 'config', 'lex_podcast', 'local']
Collections in the database 'lex_podcast': ['Podcast_transcripts', 'LexFridmanPodcast']
Number of documents in the collection: 1
All documents in the collection:
{'_id': 'piHkfmeU7Wo', 'video_id': 'piHkfmeU7Wo', 'title': 'Christof Koch: Consciousness | Lex Fridman Podcast #2', 'date': '2018-05-29', 'transcript': [{'timestamp': '0:00', 'section': 'Full Transcript', 'text': "as part of MIT course success zero nine nine on artificial general intelligence I got a chance to sit down with Christophe Coe who's one of the seminal figures in neurobiology in neuroscience and generally in the study of consciousness he is the president the chief scientific officer of the Allen Institute for brain science in Seattle from 1986 to 2013 he was the professor at Caltech before that he was at MIT he is extremely well sited over a hundred thousand citations his research his writing his ideas have had big impact on the scientific communit

Now we need to populate the collection with the rest of the videos. We won't insert the documents one at a time, because they are pretty large and each insert would be a separate round trip to the server. Instead, we are going to create a new function that processes the videos and writes them to the collection in batches.
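
As a minimal sketch of the batching idea on its own, independent of MongoDB, the hypothetical helper `batched` below splits any iterable into lists of at most `batch_size` items. The batch size of 50 and the count of 435 remaining videos are taken from this notebook; everything else is an assumption for illustration.

```python
from itertools import islice


def batched(items, batch_size: int):
    """Hypothetical helper: yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Nothing left, so an empty batch is never yielded
            return
        yield batch


# 435 placeholder items stand in for the remaining videos
sizes = [len(batch) for batch in batched(range(435), 50)]
print(sizes)
```

Because the generator stops before yielding an empty list, every batch it produces contains at least one item.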

In [46]:
def process_and_store_videos(collection, videos, local_storage: bool = False, verbose: bool = False):
    """
    Process and store videos in batches.
    :param collection: The MongoDB collection to store the videos.
    :param videos: The videos to process and store.
    :param local_storage: Whether to store the videos locally as JSON files.
    :param verbose: Whether to print verbose output.
    """
    if local_storage:
        json_storage_dir = 'JSON_storage'
        os.makedirs(json_storage_dir, exist_ok=True)

    bulk_operations = []
    missing_transcripts = []
    processed_count = 0
    for video_id, video_data in videos.items():
        if verbose:
            print(f"Processing video: {video_id}: {video_data['title']}")
        # retrieve the transcript for the video
        transcript, success = get_transcript(video_id=video_id)
        # check if the transcript was retrieved successfully and if it has more than 5 lines
        if success and len(transcript) > 5:
            # create the structured document
            structure = create_transcript_structure(video_data=video_data, transcript=transcript)

            if local_storage:
                # Save JSON file only if it is specified
                filename = f"{video_id}.json"
                with open(os.path.join(json_storage_dir, filename), 'w', encoding='utf-8') as f:
                    json.dump(structure, f, ensure_ascii=False, indent=2)
            
            # Prepare upsert operation
            bulk_operations.append(
                UpdateOne(
                    {"_id": video_id},
                    {"$set": structure},
                    upsert=True
                )
            )
            processed_count += 1
            print(f'Processed transcript number {processed_count}')
            if verbose:
                print(f"Processed transcript for {video_data['title']}")
        else:
            # if the transcript is not retrieved successfully, we add the video ID to the missing transcripts list and won't insert it in the database
            print(f"Transcript for {video_data['title']} not retrieved successfully")
            missing_transcripts.append(video_id)

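        # Perform bulk upsert every 50 operations or at the end.
        # Skip empty batches: bulk_write raises InvalidOperation when given no operations.
        if bulk_operations and (len(bulk_operations) >= 50 or video_id == list(videos.keys())[-1]):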
            if verbose:
                print(f"Performing bulk upsert for {len(bulk_operations)} videos")
            try:
                result = collection.bulk_write(bulk_operations)
                print(f"Bulk upsert result: {result.bulk_api_result}")
            except BulkWriteError as bwe:
                print(f"Bulk write error: {bwe.details}")
            bulk_operations = []

    return missing_transcripts

Note that this function can also save the transcripts to a local folder called `JSON_storage`, which is created if it does not exist. Each transcript is saved as a JSON file named after its video ID. Saving the transcripts locally makes the work reproducible, also for people not using MongoDB.
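
For readers who only have the local files, the sketch below shows one way to read them back into a dictionary keyed by video ID. `load_local_transcripts` is a hypothetical helper, not part of the notebook or of `mongo_utils.py`, and the demo writes a made-up document to a temporary folder instead of touching `JSON_storage`.

```python
import json
import tempfile
from pathlib import Path


def load_local_transcripts(storage_dir: str = "JSON_storage") -> dict:
    """Hypothetical helper: load every <video_id>.json file into a dict keyed by video ID."""
    transcripts = {}
    for path in sorted(Path(storage_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            transcripts[path.stem] = json.load(f)
    return transcripts


# Round-trip demo with a made-up document in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    example = {
        "video_id": "abc123",
        "title": "Example title",
        "date": "2020-01-01",
        "transcript": [{"timestamp": "0:00", "section": "Full Transcript", "text": "hello"}],
        "full_text": "hello",
    }
    with open(Path(tmp_dir) / "abc123.json", "w", encoding="utf-8") as f:
        json.dump(example, f, ensure_ascii=False, indent=2)

    loaded = load_local_transcripts(tmp_dir)

print(list(loaded.keys()))
```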

In [48]:
# Delete the video we inserted before from videos dictionary
del videos[first_video]
# Process and store the rest of the videos
missing_transcripts = process_and_store_videos(
    collection=collection, 
    videos=videos, 
    local_storage=True, 
    verbose=False,
)

Processed transcript number 1
Processed transcript number 2
Processed transcript number 3
Processed transcript number 4
Processed transcript number 5
Processed transcript number 6
Processed transcript number 7
Processed transcript number 8
Processed transcript number 9
Processed transcript number 10
Processed transcript number 11
Processed transcript number 12
Processed transcript number 13
Processed transcript number 14
Processed transcript number 15
Processed transcript number 16
Processed transcript number 17
Processed transcript number 18
Processed transcript number 19
Processed transcript number 20
Processed transcript number 21
Processed transcript number 22
Processed transcript number 23
Processed transcript number 24
Processed transcript number 25
Processed transcript number 26
Processed transcript number 27
Processed transcript number 28
Processed transcript number 29
Processed transcript number 30
Processed transcript number 31
Processed transcript number 32
Processed transcr

In [49]:
print(f"Number of missing transcripts: {len(missing_transcripts)}")
print(f"Missing transcripts: {missing_transcripts}")

Number of missing transcripts: 16
Missing transcripts: ['aJoRMFWn2Jk', 'pNlfHgHJweQ', '1XGiTDWfdpM', 'j4PEu4sVD40', 'H_szemxPcTI', 'Vrz8YDl9CeA', 'FUS6ceIvUnI', 'FKh8hjJNhWc', 'Pl3x4GINtBQ', 'HrehEWYj16s', 'plcc6E-E1uU', 'NOReE-3EBhI', 'LAyZ8IYfGxQ', 'nWTvXbQHwWs', 'vNOTDn3D_RI', 'q0mokx-iiws']


In [54]:
all_transcripts = mu.get_documents(collection)
print(f"Number of transcripts in the collection: {len(all_transcripts)}")

Number of transcripts in the collection: 420
