# Indexing for the Ask Lex ChatGPT Plugin

This notebook walks through indexing videos from a YouTube channel (in this case Lex Fridman's) and storing their transcripts as embeddings in a Pinecone vector DB, to be used by a ChatGPT retrieval plugin.

To begin, we install the prerequisite libraries and set up our API keys.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -qU openai pod-gpt datasets git+https://github.com/openai/whisper.git py-dotenv


Now load the API keys. Note that creating the embeddings via OpenAI costs money unless your usage falls within their free credits.

In [4]:
import os
import py_dotenv

py_dotenv.read_dotenv("drive/MyDrive/Colab Notebooks/.env")

# get openai api key at platform.openai.com
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# get pinecone api key at app.pinecone.com
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
# find your environment next to the api key in pinecone console
PINECONE_ENV = os.getenv("PINECONE_ENVIRONMENT")

# only used if transcribing
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
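
Because `os.getenv` silently returns `None` for a variable that is not set, it can help to check the keys before going further. The following is a minimal sketch; `missing_keys` is a hypothetical helper written for this illustration, not part of `py_dotenv` or `pod_gpt`.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to try out without touching real variables.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Keys needed for the indexing section (YOUTUBE_API_KEY is only for transcribing).
REQUIRED = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]

missing = missing_keys(REQUIRED)
if missing:
    print("Missing keys:", ", ".join(missing))
```

If anything is reported as missing, check the path passed to `read_dotenv` and the variable names inside the `.env` file.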

## Downloading and Transcribing (Optional)

This section is optional: you can jump ahead to the next section and use the prebuilt `jamescalam/lex-transcripts` dataset. If you do run this section, note that it will likely take multiple days.

In [None]:
import pod_gpt

channel = pod_gpt.Channel(
    channel_id='UCSHZKyawb77ixDdsGog4iWA',  # Lex Fridman YT channel
    api_key=YOUTUBE_API_KEY  # Google YouTube API
)

You can limit the number of videos returned by setting the `max_results` parameter. To return **all** videos, omit the parameter and call `channel.get_videos_info()`.

In [None]:
channel.get_videos_info(max_results=2)

Load the Whisper audio transcription model:

In [None]:
import torch
import whisper

# prep whisper model
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model = whisper.load_model("large").to(device)

Transcribe audio:

In [None]:
channel.transcribe_videos(model)

Save transcribed audio to local JSONL file:

In [None]:
channel.save()

## Using Prebuilt Dataset

Rather than going through the process above, we can skip ahead a little with a prebuilt Lex dataset from Hugging Face. We load it like so:

In [5]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/lex-transcripts',
    split='train'
)
data

Downloading and preparing dataset json/jamescalam--lex-transcripts to /root/.cache/huggingface/datasets/jamescalam___json/jamescalam--lex-transcripts-6a9688b7915283fe/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/jamescalam___json/jamescalam--lex-transcripts-6a9688b7915283fe/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


Dataset({
    features: ['video_id', 'channel_id', 'title', 'published', 'transcript', 'source'],
    num_rows: 499
})

## Indexing

Now that the dataset is ready, we can begin indexing it in Pinecone using OpenAI's `text-embedding-ada-002` model. To begin, we initialize a `pod_gpt.Indexer`:

In [None]:
import pod_gpt

indexer = pod_gpt.Indexer(
    openai_api_key=OPENAI_API_KEY,
    pinecone_api_key=PINECONE_API_KEY,
    pinecone_environment=PINECONE_ENV,
    index_name="ask-lex"
)
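
Full podcast transcripts are far longer than the embedding model's input limit (8,191 tokens for `text-embedding-ada-002`), so text generally has to be split into smaller passages before it is embedded. How `pod_gpt.Indexer` does this internally is not shown in this notebook. The sketch below only illustrates the general idea with a hypothetical helper, `chunk_words`, that splits on whitespace-separated words rather than model tokens; it is not the library's implementation.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split `text` into overlapping chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with numbers standing in for transcript words.
sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap trades some duplicated storage for better retrieval of passages that straddle a boundary; the sizes used here are arbitrary illustration values.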

Now we index our data:

In [None]:
from tqdm.auto import tqdm

for row in tqdm(data):
    row['published'] = row['published'].strftime('%Y%m%d')
    indexer(pod_gpt.VideoRecord(**row))
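
The loop converts each `published` value to a compact date string before building the record. As a small illustration of that format, assuming `published` is a `datetime` object (which the `strftime` call implies), with a made-up date:

```python
from datetime import datetime

# Hypothetical publish time, used only to show the format.
published = datetime(2023, 4, 1, 12, 30)

# '%Y%m%d' keeps the four-digit year, two-digit month and two-digit day,
# and drops the time of day.
as_string = published.strftime('%Y%m%d')
print(as_string)  # prints 20230401
```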


Once indexing is complete, we can move on to building the remainder of the plugin. See [this video](https://youtu.be/bAQ6VRewf0w) for a more detailed walkthrough.