# YouTubeSearch

Create a vector store from the transcripts of YouTube videos so that a user can search their content with an LLM.

## Installing libraries

In [1]:
!pip install python-dotenv
!pip install -qU langchain
!pip install -qU huggingface_hub
!pip install transformers
!pip install lark qdrant-client
!pip install sentence_transformers

!pip install youtube-transcript-api
!pip install pytube
!pip install yt_dlp
!pip install pydub


## Get data

### Get Hugging Face token

In [2]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv(filename="/content/drive/MyDrive/token/api_token.env")) # read local .env file

# os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'HF_API_KEY'

### Load the Google flan-t5-large LLM from Hugging Face Hub

In [3]:
from langchain import HuggingFaceHub

flan_t5 = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    model_kwargs={"temperature":1e-10}
)

### Get list of URLs of YouTube videos

We use the DeepLearning.AI playlist by Andrew Ng, Machine Learning Engineering for Production (MLOps), which consists of 40 videos totaling ~5 hours.

In [4]:
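# Use a context manager so the file is closed after reading
with open("/content/drive/MyDrive/data/LangChain/urls.txt", "r") as file:
    # One URL per line; strip the trailing newline from each
    urls = [url.strip("\n") for url in file]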
# urls

### Load YouTube Documents in LangChain

Using `YoutubeLoader`, load the transcript of each video as a document.

In [5]:
from langchain.document_loaders import YoutubeLoader
docs = []
for url in urls:
    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
    docs.extend(loader.load())
len(docs)

40

### Split Transcript

Because a full transcript can be too long for the embedding model to handle in one piece, we split each transcript into smaller, slightly overlapping chunks.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150,
                                            #    separators=["\n\n", "\n", "\. "]
                                               )
splits = text_splitter.split_documents(docs)
len(splits)

206
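
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windows with overlap. It is a simplification for illustration only: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on separators rather than at arbitrary character positions.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk.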

## Create Vector Store

### Get the embedding model from Hugging Face Hub using LangChain's embedding integration

In [7]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"

embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)


### Create embeddings and store them in a Qdrant vector store (in memory)

In [8]:
from langchain.vectorstores import Qdrant
vectordb = Qdrant.from_documents(splits,
                                 embeddings,
                                 location=":memory:",
                                 collection_name="youtube")
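
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity. It is an assumption-laden illustration, not Qdrant's implementation: `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names, and a real store uses model-generated embeddings and indexed search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": hand-written vectors standing in for model output
toy_store = [
    ("deployment", [1.0, 0.0, 0.0]),
    ("data labeling", [0.0, 1.0, 0.0]),
    ("monitoring", [0.9, 0.1, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(nearest)
```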

### Use a Conversational Chain for the chatbot

In [11]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

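`ConversationBufferMemory` has an open issue on GitHub when it is combined with `return_source_documents` in `ConversationalRetrievalChain`: the chain then returns more than one output, and the memory does not know which one to store. As suggested in that issue, we work around it with a subclass that saves only the `answer` output.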

In [12]:
class AnswerConversationBufferMemory(ConversationBufferMemory):
    def save_context(self, inputs, outputs) -> None:
        return super().save_context(inputs, {'response': outputs['answer']})
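
The effect of this subclass can be seen in isolation with simplified stand-ins. The sketch below does not use LangChain: `SingleKeyMemory` and `AnswerOnlyMemory` are hypothetical classes that only mimic the relevant behavior, namely a memory that refuses more than one output key and a subclass that forwards just the answer.

```python
class SingleKeyMemory:
    """Hypothetical stand-in: a memory that accepts exactly one output key."""

    def __init__(self):
        self.history = []

    def save_context(self, inputs, outputs):
        if len(outputs) != 1:
            raise ValueError(f"One output key expected, got {list(outputs)}")
        (response,) = outputs.values()
        self.history.append((inputs["question"], response))


class AnswerOnlyMemory(SingleKeyMemory):
    """Forward only the 'answer' output, dropping the source documents."""

    def save_context(self, inputs, outputs):
        super().save_context(inputs, {"response": outputs["answer"]})


# Shape of the chain output when source documents are returned as well
chain_outputs = {
    "answer": "a set of tools and principles",
    "source_documents": ["doc1", "doc2"],
}

plain = SingleKeyMemory()
try:
    plain.save_context({"question": "What is MLOps?"}, chain_outputs)
    failed = False
except ValueError:
    failed = True  # two output keys, so the plain memory refuses to save

fixed = AnswerOnlyMemory()
fixed.save_context({"question": "What is MLOps?"}, chain_outputs)
print(failed, fixed.history)
```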

In [13]:
memory = AnswerConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [14]:
qa = ConversationalRetrievalChain.from_llm(flan_t5,
                                           vectordb.as_retriever(search_type="mmr"),
                                           memory=memory,
                                           return_source_documents=True)
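
The retriever above uses `search_type="mmr"` (maximal marginal relevance), which picks documents that are relevant to the query but not near-duplicates of each other. The following is a minimal sketch of that selection rule on toy 2-dimensional vectors. The function names, the toy vectors, and the `lambda_mult` weighting are assumptions made for illustration, not the library's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_docs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
toy_query = [1.0, 0.2]

diverse = mmr(toy_query, toy_docs, k=2, lambda_mult=0.5)
relevance_only = mmr(toy_query, toy_docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns the two near-duplicates; with `0.5` the second pick is the less similar but more distinct document.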

In [17]:
query = "What is ML Ops"
result = qa({"question": query})
result["answer"]

'systematic ways to think about scoping data modeling and deployment and also software tools to support the best'

In [18]:
result["source_documents"][0]

Document(page_content="i've learned on how to define effective machine learning projects throughout this course you also learn about ml ops or machine learning operations which is an emerging discipline that comprises a set of tools and principles to support progress through the machine learning project life cycle but especially these three steps for example at landing ai where on co we used to do a lot of these steps manually which is okay but slow but after building an emma ops 2 called landing lens for computer vision applications all these steps became much quicker the key idea in ml ops is that systematic ways to think about scoping data modeling and deployment and also software tools to support the best practices so that's it in this course we're going to start at the end goal start from deployment and then work our way backwards as you already know being the deploy a system is one of the most important and valuable skills in machine learning today so let's go on to the next vide