# Running ETL to Add MLOps World Data

In [None]:
%load_ext autoreload
%autoreload 2


## Dataset: MLOps World Videos

We draw videos from four playlists, representing MLOps World conferences in 2021 and 2022.

In [None]:
import json
import pprint

pp = pprint.PrettyPrinter(indent=2)

with open("data/mlops-world-playlists.json", "r") as f:
    playlists = json.load(f)

pp.pprint(playlists)


In [None]:
url = f"https://www.youtube.com/playlist?list={playlists[0]['playlistId']}"
url


## Plan of Attack:
1. Collect the IDs for all of the videos by calling the YouTube API with `requests`.
2. Grab the subtitles for each video using the `youtube-transcript-api` Python SDK.
3. Reorganize the subtitles using `srt`.
4. Pass them up to MongoDB (using `pymongo`) for storage.

But that's going to mean adding a bunch of dependencies to our project,
and we don't need them anywhere else,
like in the chatbot application.

Plus, once we start adding more kinds of ETL,
like Markdown files or SQL databases,
everything gets worse:
the chance of conflict goes up, the setup gets harder.

Is there a better way?

## Yes, let's use Modal for ETL (and everything else).

Modal is a serverless runtime for creating and deploying containerized applications in a way that is snappy, scalable, and Pythonic.

In [None]:
import etl.videos as videos
from etl.shared import display_modal_image


Containers are based on images.

In [None]:
display_modal_image(videos.image)


Applications are defined not by images but by `Stub`s,
which combine images with configuration, like mounted volumes and secrets,
and manage application lifecycles.

In [None]:
videos.stub


There's a lot of depth there, but for now,
the important thing is that a `Stub`
lets us run certain functions in containers on Modal's infrastructure.

In [None]:
pp.pprint(videos.stub.registered_functions)


Modal is a cloud service, so you'll need an account.

If you don't have an account, you can get one by running the command below:
```bash
!make modal-token
```
Follow the instructions, adding the token to your `.env.dev` file.

Then you can authenticate with the command below:

In [None]:
%env ENV=dev
!make modal-auth


Let's see it in action.

# First, we collect the IDs for the videos we want to add to our chatbot.

The YouTube API is a huge pain to use.

The official SDK isn't much better, and it requires OAuth.

So let's just pull the info we need with `requests`.

We'll use a function defined in our `videos` module:

In [None]:
videos.get_playlist_videos.get_raw_f()


But we'll execute it `remote`ly, inside the container defined by `videos.image`.

In [None]:
with videos.stub.run():
    mlops_world_videos = videos.get_playlist_videos.remote(playlists[0]["playlistId"])


In [None]:
print(len(mlops_world_videos))
mlops_world_videos[0]
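
For readers curious what a function like `get_playlist_videos` might look like inside, here is a minimal sketch of paging through the YouTube Data API's `playlistItems` endpoint with `requests`. It is not the code in `etl/videos.py`: the names `fetch_page` and `collect_videos`, the `api_key` argument, and the `"title"` key are assumptions made for illustration.

```python
from typing import Callable, Optional

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def fetch_page(playlist_id: str, api_key: str, page_token: Optional[str] = None) -> dict:
    """Fetch one page (up to 50 items) of a playlist from the YouTube Data API."""
    # imported inside the function, as is common when a dependency only exists in the container
    import requests

    params = {"part": "snippet", "playlistId": playlist_id, "maxResults": 50, "key": api_key}
    if page_token is not None:
        params["pageToken"] = page_token
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def collect_videos(playlist_id: str, api_key: str, fetch: Callable = fetch_page) -> list:
    """Follow nextPageToken until the playlist is exhausted, keeping each video's id and title."""
    videos, page_token = [], None
    while True:
        page = fetch(playlist_id, api_key, page_token)
        for item in page.get("items", []):
            snippet = item["snippet"]
            # the "id"/"title" dict layout is an assumption for this sketch
            videos.append({"id": snippet["resourceId"]["videoId"], "title": snippet["title"]})
        page_token = page.get("nextPageToken")
        if page_token is None:
            return videos
```

Passing the page fetcher in as an argument keeps the paging loop separate from the network call, so the loop can be exercised with canned responses.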


Because we're executing in a container,
we can execute in multiple containers concurrently without any extra work.

We just `.map` instead of calling `.remote`:

In [None]:
with videos.stub.run():
    videos_by_playlist = videos.get_playlist_videos.map(pl["playlistId"] for pl in playlists[1:])
    for playlist_videos in videos_by_playlist:
        mlops_world_videos.extend(playlist_videos)


In [None]:
len(mlops_world_videos)
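
As a rough local analogy for the `.map` pattern (not Modal's implementation, which fans out across containers rather than threads), the standard library's `ThreadPoolExecutor.map` also applies one function to many inputs concurrently. The `count_words` function and its inputs are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text: str) -> int:
    """A stand-in for any function we want to apply to many inputs."""
    return len(text.split())


inputs = ["one two", "three", "four five six"]

# max_workers caps how many calls run at once;
# Executor.map yields results in the same order as the inputs
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(count_words, inputs))

print(results)  # [2, 1, 3]
```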


We'll persist the results to a file so we can use them later if needed.

In [None]:
with open("data/mlops-world-videos.json", "w") as f:
    json.dump(mlops_world_videos, f)


In [None]:
with open("data/mlops-world-videos.json", "r") as f:
    mlops_world_videos = json.load(f)


# Now, let's get the subtitles for each video.

Again, the YouTube API is a pain, and there just so happens to be a tightly specialized Python library
for getting subtitles from YouTube videos.

It's not worth polluting a global environment for, but as an addition to a container specific to one sub-application, it's perfect.

<small>Note that we use a new feature of `Modal.Function`s: a `concurrency_limit`.
Always be polite when using pirate APIs!</small>

In [None]:
with videos.stub.run():
    transcripts = videos.get_transcript.map([video["id"] for video in mlops_world_videos])
    transcripts = [transcript for transcript in transcripts if transcript is not None]


In [None]:
transcripts[0][:10]


# Now we merge these subtitle lines into chapter transcripts.

Subtitles on YouTube are optimized for displaying alongside a video,
but we want something more like a written transcript.

That'll be easier for our language model to understand.

But we also want to respect the substructure of videos -- YouTube videos are often broken into chapters,
which typically contain discrete topics or content.

Preserving this structure will also make it easier to provide contextualized information to the language model.

We need to get the chapter information from the YouTube API, so let's do another `map`ped API call.

In [None]:
with videos.stub.run():
    chapters_by_video = list(videos.get_chapters.map([video["id"] for video in mlops_world_videos]))


In [None]:
print(f"Number of Chapters: {len(chapters_by_video)}")
print(f"Number of Videos: {len(mlops_world_videos)}")
# some videos don't have chapters, so we added in a fake chapter for those
real_chapters = [chapters for chapters in chapters_by_video if chapters[0]['title'] != 'Full Video']
print(f"Fraction of Videos with Real Chapters: {round(len(real_chapters) / len(mlops_world_videos), 2)}")
real_chapters[:2]
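
To make the merging step concrete, here is a minimal sketch of grouping subtitle lines under the chapter they start in. It is not the implementation in `etl/videos.py`: the `start` and `end` keys (in seconds) on chapters, the `start` and `text` keys on subtitle lines, and the `transcript` key on the output are all assumptions for illustration.

```python
def merge_subtitles_into_chapters(subtitles: list, chapters: list) -> list:
    """Attach to each chapter the joined text of the subtitle lines that start inside it."""
    merged = []
    for chapter in chapters:
        lines = [
            sub["text"].strip()
            for sub in subtitles
            if chapter["start"] <= sub["start"] < chapter["end"]
        ]
        # copy the chapter and add the merged text under a hypothetical "transcript" key
        merged.append({**chapter, "transcript": " ".join(lines)})
    return merged


subtitles = [
    {"text": "welcome everyone", "start": 0.0},
    {"text": "today we talk about ETL", "start": 4.2},
    {"text": "first, extraction", "start": 61.0},
]
chapters = [
    {"title": "Intro", "start": 0, "end": 60},
    {"title": "Extract", "start": 60, "end": 300},
]

for chapter in merge_subtitles_into_chapters(subtitles, chapters):
    print(chapter["title"], "->", chapter["transcript"])
```

A video without real chapters fits the same shape: a single chapter that spans the whole video collects every subtitle line.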


Now, let's take the transcript for each chapter
and convert it into a more generic `Document` format
(courtesy of `langchain`).

But what's really cool here is that we've now got over 💯 documents,
and Modal will run the generation process concurrently across all of them --
on up to 100 containers at once!

This example isn't the most exciting, sure --
but imagine we had to transcribe the videos ourselves with a speech-to-text (STT) model!

In [None]:
with videos.stub.run():
    chapters_by_video = list(videos.stub.add_transcript.map(chapters_by_video, transcripts))
    # "transpose": list of two-entry dicts -> (ids, titles) tuples; relies on each dict's key order
    video_ids, video_titles = zip(*map(lambda dct: dct.values(), mlops_world_videos))
    documents = list(videos.stub.create_documents.map(chapters_by_video, video_ids, video_titles))


In [None]:
documents[2][0]
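
A document here is essentially a chunk of text plus metadata that points back to its origin. The sketch below builds one as a plain dict rather than a `langchain` `Document`. The `text` and `metadata.source` fields and the `?v=...&t=...s` URL layout follow how documents are queried and parsed later in this notebook, while `create_document`, the chapter keys, and the other metadata keys are hypothetical.

```python
def create_document(chapter: dict, video_id: str, video_title: str) -> dict:
    """Turn one chapter into a text-plus-metadata record with a timestamped source link."""
    start = int(chapter["start"])  # whole seconds, as used in YouTube's t= parameter
    return {
        "text": chapter["transcript"],
        "metadata": {
            "source": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
            # the two keys below are illustrative, not taken from the project
            "title": video_title,
            "chapter-title": chapter["title"],
        },
    }


doc = create_document(
    {"title": "Intro", "start": 12.7, "transcript": "hello"}, "abc123", "Some Talk"
)
print(doc["metadata"]["source"])  # https://www.youtube.com/watch?v=abc123&t=12s
```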


In [None]:
from etl.shared import unchunk

flat_documents = unchunk(documents)


In [None]:
flat_documents[2]
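
Flattening a list of per-video lists into one list is a small operation. A helper like `unchunk` could plausibly be written as below, though this is a guess at its behavior rather than the code in `etl/shared.py`, and the name `flatten` is ours.

```python
from itertools import chain


def flatten(list_of_lists: list) -> list:
    """Concatenate the inner lists, preserving order."""
    return list(chain.from_iterable(list_of_lists))


print(flatten([["a", "b"], ["c"], []]))  # ['a', 'b', 'c']
```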


# Finally, let's send the data into a structured store.

We're using MongoDB,
but at small scale, the choice of database doesn't matter much.

We might've used Redis or Postgres,
or any general database that also has vector indexing,
or we might've gone directly to a vector database like
Milvus, Weaviate, Vespa, Pinecone, or Chroma.

Schemaless-ness is nice for a simple project,
and Mongo's JSON documents map neatly onto JSON-y
Pydantic models, like those used in LangChain.

In [None]:
from etl import shared


display_modal_image(shared.image)


You'll need to set up an instance on MongoDB Atlas to proceed.

See the instructions under `setup/`.

In [None]:
!make secrets


We'll drop any existing instances of the collection,
then insert our new documents.

We use a concurrency limit and bulk writes to control the load on the database.

In [None]:
db, collection = "fsdl-dev", "ask-fsdl"  # we run this in a dev database

# drop the collection if it exists, it's just dev
!modal run app.main::drop_docs --db {db} --collection {collection}


# add the documents to the database
with shared.stub.run():
    # split into 10 chunks, matching concurrency limit of adder
    chunked_documents = shared.chunk_into(flat_documents, 10)
    list(
        shared.add_to_document_db.map(
            chunked_documents, kwargs={"db": db, "collection": collection}
        )
    )
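
Splitting the documents into a fixed number of chunks means each mapped call receives one batch to write in bulk. Here is a minimal sketch of such a splitter, assuming contiguous chunks of near-equal size. The real `shared.chunk_into` may behave differently, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(items: list, n_chunks: int) -> list:
    """Split items into n_chunks contiguous pieces whose sizes differ by at most one."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `remainder` chunks each take one extra item
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


print(split_into_chunks(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```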


To check we did everything right, let's do a quick query to pull out a document.

In [None]:
with shared.stub.run():
    # pull only YouTube videos
    query = {"metadata.source": {"$regex": "youtube", "$options": "i"}}
    # project out the text field, it can get large
    projection = {"text": 0}
    # get just one result to show it worked
    result = shared.query_one.remote(query, projection, db=db, collection=collection)

pp.pprint(result)


In [None]:
from IPython.display import YouTubeVideo

id_str, time_str = result["metadata"]["source"].split("?v=")[-1].split("&t=")
YouTubeVideo(id_str, start=int(time_str.strip("s")), width=800, height=400)
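
The string splitting above works as long as the source URL contains both `?v=` and `&t=`. As a hedged alternative, the standard library's `urllib.parse` can read the same query parameters without depending on their order. The helper name `parse_source` and the fallback to a start time of 0 are assumptions of this sketch.

```python
from urllib.parse import parse_qs, urlparse


def parse_source(url: str) -> tuple:
    """Extract the video ID and start time in seconds from a YouTube watch URL."""
    params = parse_qs(urlparse(url).query)
    video_id = params["v"][0]
    # t looks like "95s"; fall back to the start of the video if it is missing
    start_seconds = int(params.get("t", ["0s"])[0].rstrip("s"))
    return video_id, start_seconds


print(parse_source("https://www.youtube.com/watch?v=abc123&t=95s"))  # ('abc123', 95)
```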


# Okay but for real though, this is ML _Ops_ World: how do we do this outside of a Jupyter Notebook?

Our `videos.stub` has a `LocalEntrypoint`.


In [None]:
videos.stub.registered_entrypoints


`LocalEntrypoint`s let us define mixed local-remote executions
like the one we just ran above and run them from the command line.

In [None]:
videos.main



```bash
!modal run etl/videos.py --json-path data/mlops-world-videos.json
```