<a href="https://colab.research.google.com/github/Phaneendraaa/RAG_YT/blob/main/RAG_YT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sentence-transformers faiss-cpu langchain mistralai mistral-common transformers accelerate openai-whisper yt-dlp

Collecting faiss-cpu
  Using cached faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting mistralai
  Using cached mistralai-1.9.1-py3-none-any.whl.metadata (33 kB)
Collecting mistral-common
  Using cached mistral_common-1.6.3-py3-none-any.whl.metadata (3.3 kB)
Collecting openai-whisper
  Using cached openai_whisper-20250625.tar.gz (803 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting yt-dlp
  Using cached yt_dlp-2025.6.30-py3-none-any.whl.metadata (174 kB)
Collecting eval-type-backport>=0.2.0 (from mistralai)
  Using cached eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from tor

In [2]:
# yt-dlp's FFmpegExtractAudio postprocessor needs the FFmpeg binary on PATH.
# The PyPI package named "ffmpeg" does not provide that binary, so install the system package.
!apt-get -qq install -y ffmpeg


In [3]:
import os

import whisper
from yt_dlp import YoutubeDL


def download_audio(youtube_url, filename="audio", cookies_file="cookies.txt"):
    """Download the best audio stream and convert it to <filename>.mp3 (requires FFmpeg)."""
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': filename,  # no .mp3 extension here; the postprocessor adds it
        'quiet': False,
        'noplaylist': True,
    }
    # Only pass a cookies file that actually exists.
    if cookies_file and os.path.exists(cookies_file):
        ydl_opts['cookiefile'] = cookies_file

    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])

def transcribe_youtube_audio(youtube_url, cookies_file="cookies.txt"):
    """Download a video's audio, transcribe it with Whisper, and return the text."""
    base_audio_name = "audio"
    mp3_audio_file = base_audio_name + ".mp3"

    print("Downloading audio...")
    download_audio(youtube_url, filename=base_audio_name, cookies_file=cookies_file)

    if not os.path.exists(mp3_audio_file):
        raise FileNotFoundError(f"Audio file not found: {mp3_audio_file}")

    try:
        print("Loading Whisper model...")
        model = whisper.load_model("base")

        print("Transcribing audio...")
        result = model.transcribe(mp3_audio_file)
        return result["text"]
    finally:
        # Remove the temporary audio file even if transcription fails.
        print("Deleting audio file...")
        os.remove(mp3_audio_file)


In [4]:
#url="https://youtu.be/yF7wP7--nls?si=YI2c3tSZCMSFeSKa"
#text = transcribe_youtube_audio(url)
#print(text)

In [5]:
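import faiss
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from mistralai import Mistral
from sentence_transformers import SentenceTransformer


def retrieve_relevant_docs(query, embed_model, index, docs, top_k=3):
    """Return the top_k transcript chunks closest to the query in embedding space."""
    query_embedding = np.asarray(embed_model.encode([query]), dtype="float32")
    # Cap top_k at the number of chunks; FAISS pads missing results with index -1,
    # which would otherwise silently select the last chunk.
    top_k = min(top_k, len(docs))
    distances, indices = index.search(query_embedding, top_k)
    return [docs[i] for i in indices[0] if i >= 0]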

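def training(video_url):
    """Transcribe the video, split the transcript into chunks, and build a FAISS index."""
    content = transcribe_youtube_audio(video_url)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    docs = splitter.split_text(content)
    if not docs:
        raise ValueError("Transcript is empty; nothing to index.")
    embed_model = SentenceTransformer("all-MiniLM-L6-v2")
    # FAISS expects a float32 matrix of shape (n_chunks, dimension).
    embeddings = np.asarray(embed_model.encode(docs), dtype="float32")
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
    index.add(embeddings)
    return docs, embed_model, index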

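def query_with_rag(query, docs, embed_model, index):
    """Answer the query with Mistral, using the retrieved transcript chunks as context."""
    # Read the key from the environment instead of hard-coding a secret in the notebook.
    # The variable name MISTRAL_API_KEY is a convention assumed here; set it before running.
    api_key = os.environ["MISTRAL_API_KEY"]
    client = Mistral(api_key=api_key)
    context = "\n\n".join(retrieve_relevant_docs(query, embed_model, index, docs))
    messages = [
        {"role": "system", "content": "You are an assistant that answers questions based on transcripts."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]
    response = client.chat.complete(
        model="mistral-medium",
        messages=messages
    )
    return response.choices[0].message.content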
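
The `chunk_size=500` and `chunk_overlap=50` settings used in `training` can be illustrated with a simplified fixed-window splitter. This is only a sketch of the overlap idea, not how `RecursiveCharacterTextSplitter` is implemented: the real splitter also prefers to break on separators such as paragraph and line boundaries, which this version ignores. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed windows where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts `step` characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 1200-character sample yields three chunks; neighbours share 50 characters.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.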

In [7]:
cache = {}

In [8]:
def first_train(url):
    """Build the index for a URL once and reuse it on later calls."""
    if url not in cache:
        cache[url] = training(url)
    return cache[url]
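
Because `cache` is keyed by the raw URL string, two share links to the same video (for example with different `si=` tracking parameters) would each trigger a full download and transcription. One possible refinement, sketched here under the assumption that only the common `youtu.be` and `youtube.com` URL shapes matter, is to key the cache by video ID. The helpers `youtube_video_id` and `cache_key` are hypothetical names, not part of the notebook.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the video ID for common YouTube URL shapes, or None if not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] in ("shorts", "embed", "live"):
            return parts[1]
    return None


def cache_key(url):
    """Use the video ID when it can be extracted; otherwise fall back to the raw URL."""
    return youtube_video_id(url) or url


print(cache_key("https://youtu.be/Wa8_nLwQIpg?si=1-9Nmqp9jvfezeDU"))
print(cache_key("https://www.youtube.com/watch?v=Wa8_nLwQIpg&t=10s"))
```

With this in place, `first_train` could look up `cache[cache_key(url)]` instead of `cache[url]`.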

In [14]:
url = "https://youtu.be/Wa8_nLwQIpg?si=1-9Nmqp9jvfezeDU"
docs,embed_model,index = first_train(url)
query = "what are dopamine pathways?"
ans  = query_with_rag(query,docs,embed_model,index)
print(ans)

Dopamine pathways are neural routes in the brain where dopamine is transmitted. Based on the transcript, there are three main dopamine pathways mentioned:

1. **Mesostriatal or Nigrostriatal Pathway**: This is the largest dopamine pathway, stretching from the substantia nigra to the striatum. It is primarily associated with movement and is implicated in disorders like Parkinson's disease.

2. **Mesolimbic Pathway**: This pathway extends from the ventral tegmental area to the nucleus accumbens and other limbic structures. It is often linked to the processing of rewarding experiences and is involved in the brain's reward system.

3. **Mesocortical Pathway**: This pathway runs from the ventral tegmental area throughout the cerebral cortex. It plays a role in various cognitive functions, including motivation, emotional response, and memory.

These pathways are crucial for understanding the diverse roles of dopamine in the brain, from movement to reward processing and cognitive functions.
