# CREATING LEX VIDEOS DATABASE

In this notebook, we show how to create a database from a JSON file containing information about each video on the Lex Fridman YouTube channel.
The JSON file is available in the data folder of this repository.
The code below inserts the video data into a MongoDB collection.

In [1]:
import json
import random
import mongo_utils as mu

### MONGODB CONNECTION

First, let's connect to MongoDB and get a client object. We use helper functions from the `mongo_utils.py` file.

In [2]:
client = mu.connect_to_mongodb()

In [3]:
mu.check_connection(client)

True
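The helpers in `mongo_utils.py` are not shown in this notebook. As a rough idea of what a connectivity check can look like, the sketch below sends MongoDB's `ping` command through the client. The name `ping_ok` is hypothetical and is not taken from `mongo_utils.py`; the connection settings mentioned in the docstring are the ones that appear in the `MongoClient` printed later in this notebook.

```python
def ping_ok(client) -> bool:
    """Return True if the MongoDB server behind `client` answers a ping.

    Hypothetical helper, not the actual mongo_utils implementation.
    `client` is expected to be a pymongo.MongoClient, for example
        MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
    """
    try:
        # 'ping' is a cheap server command; it raises if no server is reachable.
        client.admin.command("ping")
    except Exception:
        return False
    return True
```

With a real client this would be called as `ping_ok(client)`, in the same place as `mu.check_connection(client)` above.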

### LOADING DATA FROM JSON

Let's load the JSON file and print some information about the data.

If you are wondering how to get a JSON file like this, see the https://github.com/CharlieNestor/retrieve_video_info_YouTube_channel repository on my GitHub page. That project downloads the information about the videos of a YouTube channel and stores it either in a JSON file or in a MongoDB collection.

In [4]:
# Load the JSON file
with open('LexFridman_videos.json', 'r') as file:
    data = json.load(file)

print(f"Type of data: {type(data)}")
print(f"Number of videos: {len(data)}")
print(f"Fields of each video: {data[random.choice(list(data.keys()))].keys()}")
print(f"Sample of video: \n{data[random.choice(list(data.keys()))]}")

Type of data: <class 'dict'>
Number of videos: 806
Fields of each video: dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title', 'video_id'])
Sample of video: 
{'description': "The talks at the Deep Learning School on September 24/25, 2016 were amazing. I clipped out individual talks  from the full live streams and provided links to each below in case that's useful for people who want to watch specific talks several times (like I do). Please check out the official website (http://www.bayareadlschool.org) and full live streams below.\n\nHaving read, watched, and presented deep learning material over the past few years, I have to say that this is one of the best collection of introductory deep learning talks I've yet encountered. Here are links to the individual talks and the full live streams for the two days:\n\n1. Foundations of Deep Learning (Hugo Larochelle, Twitter) - https://youtu.be/zij_FTbJHsk\n2. Deep Learning for Computer Vision (Andrej Karpathy, O
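Because the fields printed above come from a single randomly chosen video, they may not reflect every record in the file. A minimal sketch of a more systematic check is to count how many records share each distinct set of field names. The function name `field_set_counts` and the tiny `sample` dictionary are made up for illustration; on the real file you would pass `data` instead.

```python
from collections import Counter


def field_set_counts(videos: dict) -> Counter:
    """Count how many records share each distinct set of field names."""
    return Counter(tuple(sorted(record.keys())) for record in videos.values())


# Tiny made-up sample standing in for the loaded `data` dictionary.
sample = {
    "id_a": {"title": "A", "duration": "PT1M", "video_id": "id_a"},
    "id_b": {"title": "B", "duration": "PT2M"},
    "id_c": {"title": "C", "duration": "PT3M", "video_id": "id_c"},
}

# Most common field sets first; rare ones point at incomplete records.
for fields, count in field_set_counts(sample).most_common():
    print(count, fields)
```

Any field set with a low count is a hint that some records are missing a field.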

We are interested only in the videos from the Lex Fridman Podcast.

In [5]:
# Keep only numbered episodes: titles containing both 'Lex Fridman Podcast' and '#'
podcast_videos = {}
for video_id, video_data in data.items():
    title = video_data['title']
    if 'Lex Fridman Podcast' in title and '#' in title:
        podcast_videos[video_id] = video_data

print(f"Number of Lex Fridman podcast videos: {len(podcast_videos)}")

Number of Lex Fridman podcast videos: 436
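The filter above only checks that the title contains the podcast name and a `#` somewhere. If the episode number itself is useful, one possible approach is a regular expression that extracts it. This is a sketch under the assumption that titles follow the `... | Lex Fridman Podcast #N` pattern seen in this notebook; the names `EPISODE_PATTERN` and `episode_number` are hypothetical, and the pattern is stricter than the original filter, so it may not select exactly the same videos.

```python
import re

# Assumed title pattern: "... Lex Fridman Podcast #<number>"
EPISODE_PATTERN = re.compile(r"Lex Fridman Podcast #(\d+)")


def episode_number(title: str):
    """Return the episode number found in `title`, or None if there is none."""
    match = EPISODE_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(episode_number("Max Tegmark: Life 3.0 | Lex Fridman Podcast #1"))
print(episode_number("Deep Learning School 2016: Individual Talks"))
```

The first call prints `1` and the second prints `None`, since the second title has no episode marker.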


The field `video_id` will be one of the main keys in this database and in future work, so we need to make sure that it is present in all the documents. The cell below prints every record that lacks it and fills it in from the dictionary key.

In [6]:
for key, value in podcast_videos.items():
    if 'video_id' not in value:
        print(key)
        print(value['title'])
        print(value.keys())
        podcast_videos[key]['video_id'] = key

J7aiEwp1x9k
Craig Jones: Jiu Jitsu, $2 Million Prize, CJI, ADCC, Ukraine & Trolling | Lex Fridman Podcast #439
dict_keys(['title', 'published_at', 'description', 'duration', 'tags', 'timestamps'])
Kbk9BiPhm7o
Elon Musk: Neuralink and the Future of Humanity | Lex Fridman Podcast #438
dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title'])
TXabC2Ave74
Neil Adams: Judo, Olympics, Winning, Losing, and the Champion Mindset | Lex Fridman Podcast #427
dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title'])
iAlwZyRUOVM
Kimbal Musk: The Art of Cooking, Tesla, SpaceX, Zip2, and Family | Lex Fridman Podcast #417
dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title'])
5t1vTLU7s40
Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
dict_keys(['description', 'duration', 'published_at', 'tags', 'timestamps', 'title'])
qa-wl8_wpZA
Serhii Plokhy: History of Ukraine, 

### CREATING AND POPULATING THE DATABASE

First, we are going to choose one video and use it to create the database and the collection. Then we will proceed to insert the remaining videos.

In [7]:
video_list = list(podcast_videos.keys())
print(len(video_list))
first_video = video_list.pop()
print(first_video)
print(len(video_list))
podcast_videos[first_video]

436
Gi8LUnhP5yU
435


{'description': '',
 'duration': 'PT1H22M58S',
 'published_at': '2018-04-19T14:11:52Z',
 'tags': None,
 'timestamps': None,
 'title': 'Max Tegmark: Life 3.0 | Lex Fridman Podcast #1',
 'video_id': 'Gi8LUnhP5yU'}

Let's define the name of the database and the collection that we are going to use.

In [8]:
db_name = 'lex_podcast'
collection_name = 'LexFridmanPodcast'

Here we instantiate the database and the collection: they are created if they do not exist yet, and retrieved otherwise.

In [9]:
db = mu.createDB_from_data(client, db_name, collection_name, podcast_videos[first_video], custom_id=first_video)
if db is None:
    db = mu.get_database(client, db_name)
print(db)

Database 'lex_podcast' already exists.
Collection 'LexFridmanPodcast' already exists in database 'lex_podcast'.
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=5000), 'lex_podcast')


In [10]:
# Check if the database and the collection were created
print(f"Databases in the client: {client.list_database_names()}")
print(f"Collections in the database '{db_name}': {db.list_collection_names()}")

collection = db[collection_name]
all_documents = list(collection.find())
print(f"Number of documents in the collection: {len(all_documents)}")
if len(all_documents) <= 10:
    print("All documents in the collection:")
    for doc in all_documents:
        print(doc)

Databases in the client: ['admin', 'config', 'lex_podcast', 'local']
Collections in the database 'lex_podcast': ['LexFridmanPodcast']
Number of documents in the collection: 1
All documents in the collection:
{'_id': 'Gi8LUnhP5yU', 'description': '', 'duration': 'PT1H22M58S', 'published_at': '2018-04-19T14:11:52Z', 'tags': None, 'timestamps': None, 'title': 'Max Tegmark: Life 3.0 | Lex Fridman Podcast #1', 'video_id': 'Gi8LUnhP5yU'}


Let's insert the remaining videos into the collection.

In [11]:
for video_id in video_list:
    mu.insert_document(collection, document=podcast_videos[video_id], key=video_id)
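`mu.insert_document` is not shown in this notebook, so how it behaves when the notebook is run a second time is not known here. As a hedged illustration of one way to make the insertion safe to repeat, the sketch below uses pymongo's `update_one` with `upsert=True` and the `$setOnInsert` operator, which writes the document only when no document with that `_id` exists. The name `insert_if_missing` is hypothetical and is not part of `mongo_utils.py`.

```python
def insert_if_missing(collection, document: dict, key: str) -> bool:
    """Insert `document` under `_id == key` unless that `_id` already exists.

    Hypothetical helper, not the mongo_utils implementation. `collection` is
    expected to behave like a pymongo Collection. Returns True if a new
    document was written, False if one with this `_id` was already there.
    """
    # Drop any '_id' from the payload: on upsert the filter already supplies it.
    payload = {k: v for k, v in document.items() if k != "_id"}
    result = collection.update_one(
        {"_id": key},
        {"$setOnInsert": payload},
        upsert=True,
    )
    # upserted_id is None when an existing document matched the filter.
    return result.upserted_id is not None
```

In the loop above, this would be called as `insert_if_missing(collection, podcast_videos[video_id], video_id)`. Existing documents are left untouched, which is a design choice: to overwrite them instead, `replace_one` with `upsert=True` would be the alternative.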

In [12]:
all_documents = list(collection.find())
print(f"Number of documents in the collection after insertion: {len(all_documents)}")

Number of documents in the collection after insertion: 436
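A matching count does not by itself prove that every expected video is stored. A small sketch of a stricter check compares the expected ids with the `_id` values actually present in the collection. The name `missing_ids` is hypothetical, and the sketch assumes, as the document printed earlier suggests, that each document's `_id` is the video id.

```python
def missing_ids(collection, expected_ids) -> set:
    """Return the expected ids that have no document with a matching `_id`."""
    # Project only '_id' so the full documents are not transferred.
    stored = {doc["_id"] for doc in collection.find({}, {"_id": 1})}
    return set(expected_ids) - stored
```

Here `missing_ids(collection, podcast_videos.keys())` should return an empty set if every podcast video was inserted.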
