## install library dependencies

In [1]:
!pip install -q youtube-transcript-api langchain_community langchain_core langchain chromadb langchain_huggingface langchain_groq python-dotenv


## import necessary libraries

In [2]:
import re
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

## load environment variables

In [4]:
load_dotenv()

True
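
`ChatGroq` reads its API key from the `GROQ_API_KEY` environment variable, and a `True` result from `load_dotenv()` does not by itself confirm that this particular key is present. A minimal sketch of a check that could run before initializing the model follows; the helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage: report which required keys are absent.
missing = missing_env_vars(["GROQ_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```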

## initialize model

In [5]:
model = ChatGroq(model="llama3-8b-8192", temperature=0.5)

In [6]:
model.invoke("what is ml?")

AIMessage(content='ML can refer to several things, depending on the context:\n\n1. **Machine Learning**: Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that involves training algorithms to learn from data, recognize patterns, and make predictions or decisions without being explicitly programmed. ML is a key technology behind many applications, such as image and speech recognition, natural language processing, and recommender systems.\n2. **Microlearning**: Microlearning (ML) is a learning strategy that involves breaking down complex topics into shorter, bite-sized chunks, often between 3-10 minutes long. This approach is designed to help learners absorb information quickly and efficiently, using various formats such as videos, podcasts, and interactive simulations.\n3. **Master of Laws**: Master of Laws (ML) is a postgraduate law degree that is often required for those who want to specialize in a particular area of law or become a legal academic.\n4. **Mileage Log*

## load embedding model

In [7]:
embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)


## define output parser

In [8]:
parser = StrOutputParser()

## create prompt template

In [9]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      <context>
      {context}
      </context>

      Question: {question}
    """,
    input_variables = ['context', 'question']
)

## Implement RAG

### create yt video transcript extractor

In [10]:
def extract_video_id(url):
    """
    Extract video ID from various YouTube URL formats
    """
    patterns = [
        r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?#]+)',
        r'youtube\.com\/watch\?.*v=([^&\n?#]+)'
    ]

    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None
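
The regex approach above covers `watch`, `youtu.be`, and `embed` links. As a hedged alternative, the same extraction can be sketched with `urllib.parse` from the standard library. The function name `extract_video_id_urllib` is hypothetical, the `shorts` path handling is an addition not present in the original, and the sketch assumes a full URL that includes its scheme.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id_urllib(url):
    """Hypothetical alternative to extract_video_id based on URL parsing."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Short links carry the ID as the first path segment.
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in ("youtube.com", "m.youtube.com"):
        # Standard watch links carry the ID in the "v" query parameter,
        # regardless of where it appears among the other parameters.
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        # Embed and shorts links carry the ID as the second path segment.
        parts = parsed.path.split("/")
        if len(parts) >= 3 and parts[1] in ("embed", "shorts"):
            return parts[2] or None

    return None


print(extract_video_id_urllib(
    "https://www.youtube.com/watch?si=eRSOsvbm2lOvraWk&v=JxgmHe2NyeY&feature=youtu.be"
))
```

Parsing the query string avoids depending on the position of `v=` in the URL, at the cost of rejecting inputs that lack a scheme.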

In [11]:
def get_youtube_transcript(youtube_url):
    """
    Extract transcript from a YouTube video URL

    Args:
        youtube_url (str): YouTube video URL

    Returns:
        dict: Contains transcript text, raw transcript data, and metadata
    """
    try:
        # Extract video ID from URL
        video_id = extract_video_id(youtube_url)
        if not video_id:
            return {"error": "Invalid YouTube URL"}

        # Get transcript
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        text_transcript = " ".join(transcript['text'] for transcript in transcript_list)

        return {
            "video_id": video_id,
            "transcript_text": text_transcript,
            "timestamped_transcript": transcript_list,
            "total_segments": len(transcript_list)
        }

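    except TranscriptsDisabled:
        # Raised when the uploader has turned off transcripts for the video
        return {"error": "Transcripts are disabled for this video"}
    except Exception as e:
        return {"error": f"Failed to get transcript: {str(e)}"}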


In [12]:
# Example usage
if __name__ == "__main__":
    # Test the function
    youtube_url = "https://www.youtube.com/watch?si=eRSOsvbm2lOvraWk&v=JxgmHe2NyeY&feature=youtu.be"  # Example URL
    result = get_youtube_transcript(youtube_url)

    if "error" in result:
        print(f"Error: {result['error']}")
    else:
        print(f"Video ID: {result['video_id']}")
        print(f"Total segments: {result['total_segments']}")
        print("\nTranscript:")
        print(result['transcript_text'][:500] + "..." if len(result['transcript_text']) > 500 else result['transcript_text'])

        print("\nFirst few timestamped entries:")
        for i, entry in enumerate(result['timestamped_transcript'][:3]):
            print(f"{i+1}. [{entry['start']:.2f}s] {entry['text']}")

Video ID: JxgmHe2NyeY
Total segments: 9542

Transcript:
so today's session what all things we are basically going to discuss so first of all we going to discuss about different types of machine learning algorithm like how many different types of machine learning algor understand the purpose of taking this session is to clear the interviews okay clear the interviews once you go for a data science interviews and all the main purpose is to clear the interviews I've seen people who knew machine learning algorithms in a proper way okay they were definitel...

First few timestamped entries:
1. [6.64s] so today's session what all things we
2. [8.44s] are basically going to discuss so first
3. [10.08s] of all we going to discuss about
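
For long videos, raw second offsets are hard to read. A minimal sketch of rendering the `start` field as `HH:MM:SS` follows. The helper names `format_timestamp` and `render_entries` are hypothetical, and the sample entries simply copy the three segments printed above.

```python
def format_timestamp(seconds):
    """Convert a second offset (float) to an HH:MM:SS string, truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_entries(entries):
    """Format transcript entries that have 'start' and 'text' keys."""
    return [f"[{format_timestamp(e['start'])}] {e['text']}" for e in entries]


# Sample data copied from the output above.
sample_entries = [
    {"start": 6.64, "text": "so today's session what all things we"},
    {"start": 8.44, "text": "are basically going to discuss so first"},
    {"start": 10.08, "text": "of all we going to discuss about"},
]

for line in render_entries(sample_entries):
    print(line)
```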


### Create text splitter

In [13]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

In [14]:
document_chunks = text_splitter.create_documents([result['transcript_text']])

In [17]:
len(document_chunks)

443
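
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and spaces); it only illustrates how consecutive chunks can share `chunk_overlap` characters. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character text so the overlap is easy to inspect.
demo_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

With the defaults, each chunk begins 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.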

### create vectorstore and retriever

In [54]:
vectorstores = Chroma.from_documents(
    documents=document_chunks,
    embedding=embedding
)

In [57]:
vectorstores.get(include=['embeddings'])

{'ids': ['f6f66a88-5aaa-4c1b-bd40-48114c16e3ed',
  '9f5c59ef-9737-4a31-8dcd-f6dd78d9f4da',
  '67a443c5-17f7-406d-b7fa-fecbca756ee0',
  '0038b86f-3282-4375-85c8-20012ce66a7e',
  '3c9d9481-c7dc-4f9e-8260-588ee7223967',
  '4dc39d18-5dba-43bf-b840-da0520d0f158',
  'd5021573-b8a3-4a9e-8736-11458154efac',
  '4ab7f5c9-1395-4c35-bf13-2917a69fbd83',
  '9da98fe6-12e0-4b30-9a06-dba6cfe1a392',
  'a500969a-8687-4383-8776-a05a08421529',
  '904ef2f3-0503-4773-acfd-085d4b9e1e76',
  '5cf5418c-2410-4685-8e9b-bc6850080f7e',
  'ca1c35fd-4f38-4333-a748-ab1a5ea9c618',
  'fc633057-4ceb-4f77-8a38-ae88bd3f631e',
  '23489a22-3dc8-47d3-b337-458052598dd1',
  '53da2c36-7df6-484e-b809-d53fccfdb7bb',
  '8d849cfe-0b0f-4dae-8eba-f4bfa16a11df',
  '19760a88-c822-49ce-ab7d-320054344f96',
  '411be411-557e-46b6-8f55-db70080a1ae3',
  '50950781-c0e1-405e-bc40-d95355f9fd65',
  '94a562aa-a5f9-4280-a53b-100fadff5b48',
  '34202583-75bc-478a-82f5-f3ace11557cd',
  '0cb788b0-d965-451f-8c6a-d67f37f1c35e',
  '7dadc43c-4cc4-4779-9b92-

In [18]:
def create_retriever(document_chunks, embedding):
  """
  Build a Chroma vector store from the document chunks and return
  a similarity retriever that fetches the top 3 matching chunks.
  """
  vectorstore = Chroma.from_documents(
    documents=document_chunks,
    embedding=embedding
  )
  return vectorstore.as_retriever(search_type="similarity", search_kwargs={'k': 3})
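
Conceptually, a similarity retriever with `k=3` ranks stored vectors against the query vector and keeps the three best matches. The sketch below shows that idea with cosine similarity on toy 2-D vectors. It is an illustration only: the helper names are hypothetical, the vectors are made up, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" (hypothetical data, not produced by the real model).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_indices([1.0, 0.0], toy_docs, k=3))
```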

## Create Chain

In [19]:
def format_docs(retrieved_docs):
  context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
  return context_text

In [20]:
def create_chain(url):
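  # first extract transcript
  print("transcripting the video ...\n", "="*50)
  result = get_youtube_transcript(url)
  # stop early if the transcript could not be fetched
  if "error" in result:
    raise ValueError(result["error"])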

  # create chunks of documents
  print("creating chunks of documents ...\n", "="*50)
  document_chunks = text_splitter.create_documents([result['transcript_text']])

  # create retriever
  print("creating retriever ...\n", "="*50)
  retriever = create_retriever(document_chunks, embedding)

  # create chain
  print("creating chain ...\n", "="*50)
  parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
  })
  final_chain = parallel_chain | prompt | model | parser
  return final_chain
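
The data flow inside `create_chain` can be sketched with plain functions: the same question feeds both the retrieval branch and the passthrough branch, and the resulting dict fills the prompt. This is only an analogy in standard Python, not LangChain code. `Doc`, `fake_retriever`, `build_prompt_inputs`, and `render_prompt` are hypothetical stand-ins, and the template is a shortened version of the one defined earlier.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


TEMPLATE = (
    "Answer ONLY from the provided transcript context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {question}"
)


def fake_retriever(question):
    # Hypothetical retriever: ignores the question and returns fixed chunks.
    return [Doc("KNN uses the K nearest points."), Doc("K-means is unsupervised.")]


def format_docs(retrieved_docs):
    return "\n\n".join(doc.page_content for doc in retrieved_docs)


def build_prompt_inputs(question):
    # Mirrors the parallel step: one input, two keys in the output dict.
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }


def render_prompt(question):
    # Mirrors the prompt step: the dict keys fill the template placeholders.
    return TEMPLATE.format(**build_prompt_inputs(question))


print(render_prompt("what is knn?"))
```

In the real chain, the rendered prompt is then passed to the model, and the parser extracts the text of the reply.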

## Testing the chain

In [21]:
chain = create_chain(url="https://www.youtube.com/watch?si=eRSOsvbm2lOvraWk&v=JxgmHe2NyeY&feature=youtu.be")

transcripting the video ...
creating chunks of documents ...
creating retriever ...
creating chain ...


In [22]:
chain.invoke('Can you summarize the video')

'The speaker is summarizing the video by saying that they have efficiently covered many topics, including K-hierle clustering, solid score DB clustering, and kin hierle clustering. They mention that they will cover SVM, XG boost, and PCA in the next session. The speaker also wants to explain the definition of bias and variance, which they will cover in a future session.'

In [25]:
print(chain.invoke("summarize the video and provide in-depth knowledge"))

Based on the provided transcript, here is a summary of the video and some in-depth knowledge:

**Summary:** The video is about introducing the concept of artificial intelligence (AI) and machine learning. The speaker explains the difference between blackbox and whitebox models, and mentions that unsupervised machine learning is a type of machine learning where the model doesn't have a specific output. The speaker also mentions K-means clustering as an example of unsupervised machine learning.

**In-depth knowledge:**

* **Artificial Intelligence (AI):** AI is a process of creating applications that can perform tasks without human intervention. It involves creating applications that can make decisions, perform tasks, and interact with humans.
* **Machine Learning:** Machine learning is a subset of AI that involves training models to make predictions or take actions based on data.
* **Unsupervised Machine Learning:** Unsupervised machine learning is a type of machine learning where the m

In [26]:
print(chain.invoke("provide me the list of topics covered in the video"))

Based on the provided transcript context, the topics covered in the video are:

1. Kin Hierle Clustering
2. DB Clustering
3. SVM
4. SVR
5. XG Boost
6. PCA
7. Bias and Variance
8. Naive Bayes
9. K-Nearest Neighbors (KNN) Algorithm
10. Introduction to Machine Learning
11. AI vs ML vs DL vs Data Science
12. Supervised vs Unsupervised Machine Learning
13. Linear Regression

Note that the speaker also mentions that they will cover hyperparameter tuning and practical examples for each algorithm, but these topics are not explicitly listed as separate points.


In [27]:
print(chain.invoke("what is K-Nearest Neighbors with mathematical concept"))

According to the provided transcript context, K-Nearest Neighbors (KNN) works as follows:

* KNN takes the K nearest closest points (in this case, K=5) to a given point.
* The distance used is the Manhattan distance, which is calculated as: `|X2 - X1| + |Y2 - Y1|`
* The point is then categorized based on the majority category of its K nearest neighbors. In the example given, the point is categorized as belonging to the Red category since the majority of its K nearest neighbors (3 out of 5) belong to the Red category.

This is the mathematical concept of K-Nearest Neighbors as discussed in the transcript.
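
The procedure described in that answer can be sketched in plain Python. The points and labels below are hypothetical, chosen so that 3 of the 5 nearest neighbors are Red as in the example mentioned; they are not taken from the video, and the function names are illustrative.

```python
from collections import Counter


def manhattan_distance(p, q):
    """|x2 - x1| + |y2 - y1| for two 2-D points."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


def knn_classify(query, labeled_points, k=5):
    """Label the query by majority vote among its k nearest neighbors."""
    by_distance = sorted(
        labeled_points,
        key=lambda item: manhattan_distance(query, item[0]),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: (point, label) pairs.
training_points = [
    ((1, 0), "Red"),
    ((0, 2), "Red"),
    ((2, 1), "Red"),
    ((5, 5), "Red"),
    ((1, 1), "Blue"),
    ((-2, -2), "Blue"),
    ((6, 0), "Blue"),
]

print(knn_classify((0, 0), training_points, k=5))
```

For the query point `(0, 0)`, the five nearest neighbors lie at Manhattan distances 1, 2, 2, 3, and 4; three of them are Red and two are Blue, so the vote goes to Red.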
