In [None]:
import txtai

In [None]:
txtai.__version__

'5.0.0'

Install prerequisite packages.

# YouTube Indexing and Queries

In this notebook we work through an example of indexing and querying YouTube video transcription data. We start by loading the dataset.

In [None]:
from datasets import load_dataset
import pandas as pd

In [None]:
ytt = load_dataset(
    "./data",
    split="train",
)
ytt





Dataset({
    features: ['url', 'title', 'text', 'start_second', 'end_second'],
    num_rows: 8806
})

Each sample includes video-level information (`url`, `title`) and snippet-level information (`text`, `start_second`, `end_second`).

In [None]:
for x in ytt:
    print(x)
    break

{'url': 'https://www.youtube.com/watch?v=8SF_h3xF3cE&t=0s', 'title': 'Lesson 1: Practical Deep Learning for Coders 2022', 'text': "  Welcome to practical deep learning for coders lesson one.  This is version five of this course.  And it's the first do one we've done in two years.  So we've got a lot of cool things to cover.  It's amazing how much has changed.  Here is a XKCD from the end of 2015.  Who here is saying XKCD comics before?  Pretty much everybody, not surprising.  So the basic joke here is I'll let you read it,  and then I'll come back to it.  So it can be hard to tell what's easy and what's nearly impossible.", 'start_second': 0.0, 'end_second': 58.0}
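In the sample above, the `url` ends in `t=0s`, which matches `start_second: 0.0`. A minimal sketch of reading that offset back out of a link with the standard library follows. The helper name `start_from_url` is hypothetical, and the `t=<seconds>s` format is assumed from the samples in this notebook rather than taken from any YouTube specification.

```python
from urllib.parse import urlparse, parse_qs


def start_from_url(url: str) -> float:
    """Return the snippet offset in seconds encoded in the `t` query parameter."""
    query = parse_qs(urlparse(url).query)
    # Assumed format: "<seconds>s", for example "0s" or "1713s"
    raw = query["t"][0]
    return float(raw.rstrip("s"))


# Values copied from the first sample printed above
sample = {
    "url": "https://www.youtube.com/watch?v=8SF_h3xF3cE&t=0s",
    "start_second": 0.0,
}

offset = start_from_url(sample["url"])
print(offset == sample["start_second"])
```

A check like this could be used to confirm that the link and the `start_second` field agree before relying on either one.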


# Generate the txtai database with vectors

The next step is indexing this dataset with txtai. For this we need a sentence transformer model (to encode the text into embeddings) and a txtai `Embeddings` index to store them.

We initialize the `Embeddings` instance first; its `path` setting names the sentence transformer model.
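As a rough picture of what "encode the text, then search an index" means, here is a toy in-memory sketch in plain Python. The three-dimensional vectors, the `documents` dictionary and the `search` helper are all made up for illustration; a real model produces much larger vectors, and this is not how txtai or any vector database is implemented. Dot-product scoring is used because the `-dot-` in the model name below suggests that metric.

```python
# Toy 3-dimensional "embeddings"; values are invented for illustration only
documents = {
    0: ("intro to the course", [0.9, 0.1, 0.0]),
    1: ("what a convolution does", [0.1, 0.8, 0.3]),
    2: ("matrix multiplication on GPUs", [0.2, 0.3, 0.9]),
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vector, limit=2):
    """Score every stored vector against the query and return the best matches."""
    scored = [
        {"id": uid, "text": text, "score": dot(query_vector, vector)}
        for uid, (text, vector) in documents.items()
    ]
    # Highest score first, then keep only the requested number of results
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:limit]


results = search([0.0, 1.0, 0.2])
for r in results:
    print(r["id"], r["text"], round(r["score"], 2))
```

The result dictionaries use the same `id`, `text` and `score` keys that appear in the search output later in this notebook.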

In [None]:
from txtai.embeddings import Embeddings

In [None]:
embeddings = Embeddings(
    {
        "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
        "content": True,
        "objects": True,
    }
)

In [None]:
embeddings.index([(uid, data, None) for uid, data in enumerate(ytt)])

In [None]:
pd.DataFrame(embeddings.search('what is a convolutional neural network?', limit=50)).sort_values('score')

rank,id,text,score
49,5707,"Let's more discuss in the next lesson, we'll talk a lot about neural net architecture details. But the details we'll focus on are what happens to the inputs at the very first stage and what happens to the outputs at the very last stage. We'll talk a bit about what happens in the middle, but a lot less. And the reason why is it's a thing that you put into the inputs that's going to change for every single data set you do. And what do you want to happen to the outputs that are going to...",0.566333
48,3331,"It's modifying things together and adding them up. So there'd be one more step to do to make this a layer of a neural network which is if this had any negatives, we'd place them with yours. That's why matrix multiplication is the critical foundation or mathematical operation in basically all of deep learning. So the GPUs that we use, the thing that they are good at is this, matrix multiplication. They have special cores called tensor cores which can basically only do one thing which ...",0.566635
47,8424,"And what you see in the in the next column is a version of the image where the horizontal lines being recognized. And another one where the vertical lines are being recognized. And if you think back to that Xyla and Fergus paper that talked about what the layers of a neural net does, this is an absolutely an example of something that we know that the first layer of a neural network tends to learn how to do. Now, how did I do this? I did this using something called a convolution. And so...",0.56664
46,1316,"And then it takes those as inputs to a next layer. It does the same thing. It multiplies them a bunch of times and adds them up. And it does that a few times. And that's called a neural network. Now, the model, therefore, is not going to do anything useful. And this leads weights to very carefully, it shows it. And so the way it works is that we actually start out with these weights as being random. So initially, this thing doesn't do anything useful at all.",0.566887
45,2750,"We could have a look at zero dot model dot stages dot zero dot blocks dot one dot MLP dot F C one and parameters and other big bunch of numbers. So what's going on here? What are these numbers and where it is that they come from and how come these numbers can figure out whether something is a basset hound or not? Okay. So to answer that question, we're going to have a look at a Kaggle notebook. How does a neural network really work? I've got a local version of it here, which I'm go...",0.568176
44,2760,"So we start out with a very, very flexible, in fact, an infinitely flexible as we discussed function, a neural network. And we get it to do a particular thing, which is to recognize the patterns in the data examples we give it. So let's do a much simpler example than a neural network. Let's do a quadratic. So let's create a function F, which is 3x squared plus 2x plus 1. Okay. So it's a quadratic with coefficient 3, 2, and 1. So we can plot that function F and give it a title.",0.568182
43,8410,"And I mentioned that there are other things that can go in the middle as well, but we haven't really talked about what those other things are. So I thought we might look at one of the most important and interesting version of things that can go in the middle. But what you'll see is it turns out it's actually just another kind of matrix modification, which might not be obvious at first, but I'll explain. We're going to look at something called a convolution, and convolutions are at the h...",0.56865
42,1322,"going to do anything useful. And this leads weights to very carefully, it shows it. And so the way it works is that we actually start out with these weights as being random. So initially, this thing doesn't do anything useful at all. So what we do, the way Arthur Samuel described it back in the late 50s, the inventor of machine learning, is he said, OK, let's take the inputs and the weights, put them through our model. He wasn't talking particularly about neural networks. He's ju...",0.569121
41,4195,"I tend to use this one because it's less typing. So you can see now we've got these concatenated rows. So head is the first few rows. So we've now got some documents to do an LP with. Now, the problem is, as you know from the last lesson, neural networks work with numbers. We've got to take some numbers and we're going to multiply them by matrices. We're going to replace the negatives with zeros, net them up, and we're going to do that a few times. That's our neural network.",0.569592
40,1323,"And this leads weights to very carefully, it shows it. And so the way it works is that we actually start out with these weights as being random. So initially, this thing doesn't do anything useful at all. So what we do, the way Arthur Samuel described it back in the late 50s, the inventor of machine learning, is he said, OK, let's take the inputs and the weights, put them through our model. He wasn't talking particularly about neural networks. He's just like, whatever model you li...",0.570379


In [None]:
ytt['url'][8413]

'https://www.youtube.com/watch?v=htiNBPxcXgo&t=2757s'
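Because the ids were assigned with `enumerate(ytt)` at indexing time, a result's `id` is also the row position in the dataset, which is what the lookup above relies on. A small sketch of attaching the source link to every result at once follows. The `results` and `rows` values are hypothetical stand-ins for the search output and the dataset, and `with_source` is an illustrative helper, not part of txtai.

```python
# Hypothetical stand-ins for the search output and the dataset rows
results = [
    {"id": "2", "text": "snippet about convolutions", "score": 0.71},
    {"id": "0", "text": "snippet about the course", "score": 0.42},
]
rows = [
    {"url": "https://example.com/watch?v=a&t=0s", "title": "Lesson A"},
    {"url": "https://example.com/watch?v=a&t=58s", "title": "Lesson A"},
    {"url": "https://example.com/watch?v=b&t=120s", "title": "Lesson B"},
]


def with_source(results, rows):
    """Attach the url and title of the matching row to each search result."""
    enriched = []
    for r in results:
        # int() covers ids that come back as strings as well as integers
        row = rows[int(r["id"])]
        enriched.append({**r, "url": row["url"], "title": row["title"]})
    return enriched


for r in with_source(results, rows):
    print(round(r["score"], 2), r["title"], r["url"])
```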

In [None]:
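# save() needs a target path; "ytt-index" is a placeholder name, not from the original notebook
embeddings.save("ytt-index")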

In [None]:
'https://www.youtube.com/watch?v=8SF_h3xF3cE&list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU&index=1&t=1713s'

In [None]:
'https://www.youtube.com/watch?v=8SF_h3xF3cE&t=1713s'

In [None]:
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

We can see the embedding dimension of `768` above. We will need this when creating our Pinecone index.

In [None]:
embed_dim = retriever.get_sentence_embedding_dimension()
embed_dim

768

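Now we can initialize our index. The cell that creates it is not shown here; `index` below is assumed to be a handle to an existing Pinecone index with dimension `embed_dim`.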

In [None]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.01,
 'namespaces': {'': {'vector_count': 11298}}}

# Querying

When querying, we encode the query text with the same retriever model and pass the resulting vector to the Pinecone `query` endpoint.
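The retriever printed earlier ends with a `Normalize()` layer, so its vectors have unit length and a dot product between two of them equals their cosine similarity. The sketch below illustrates a `top_k` lookup under that assumption in plain Python. The `stored` vectors and the `query` helper are invented for illustration and are not Pinecone's implementation or API.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical stored vectors, normalized the same way the query will be
stored = {
    "a": normalize([3.0, 4.0]),
    "b": normalize([1.0, 0.0]),
    "c": normalize([0.0, 2.0]),
}


def query(vector, top_k=2):
    """Return the top_k (key, score) pairs by cosine similarity."""
    q = normalize(vector)
    scores = {key: sum(x * y for x, y in zip(q, v)) for key, v in stored.items()}
    return heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])


matches = query([0.0, 5.0])
for key, score in matches:
    print(key, round(score, 2))
```

The query has to go through the same encoding and normalization as the stored vectors; otherwise the scores are not comparable.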

In [None]:
pd.set_option('display.max_colwidth', 500)

In [None]:
query = "What is deep learning?"

xq = retriever.encode([query]).tolist()

In [None]:
xc = index.query(xq, top_k=5,
                 include_metadata=True)
for context in xc['results'][0]['matches']:
    print(context['metadata']['text'], end="\n---\n")

 terms of optimization but what's the algorithm for updating the parameters or updating whatever the state of the network is and then the the last part is the the data set like how do you actually represent the world as it comes into your machine learning system so I think of deep learning as telling us something about what does the model look like and basically to qualify as deep I
---
 any theoretical components any theoretical things that you need to understand about deep learning can be sick later for that link again just watched the word doc file again in that I mentioned the link also the second channel is my channel because deep learning might be complete deep learning playlist that I have created is completely in order okay to the other
---
 under a rock for the last few years you have heard of the deep networks and how they have revolutionised computer vision and kind of the standard classic way of doing this is it's basically a classic supervised learning problem you are givi