A RAG-based AI combines a language model with external information retrieval. It first fetches relevant documents from a knowledge source, then generates answers using that context. This improves accuracy, keeps responses grounded in real data, and allows dynamic updates without retraining.


🎓 RAG-Based AI Course Assistant

Author: Piyush Ramteke

A Retrieval-Augmented Generation (RAG) system that transforms video course content into an intelligent, searchable knowledge base. This project enables users to ask natural language questions about video content and receive contextual answers with precise timestamps.


📋 Table of Contents

  • Overview
  • Features
  • Architecture
  • Pipeline Workflow
  • Technologies Used
  • Installation
  • Usage
  • Project Structure
  • Use Cases
  • How It Works
  • Configuration
  • Future Improvements
  • Contributing
  • License
  • Acknowledgments
  • Contact

🌟 Overview

This RAG-based AI system processes video tutorials (specifically the Sigma Web Development Course) and creates a semantic search engine that allows learners to:

  • Ask questions in natural language
  • Get answers with specific video references
  • Navigate directly to relevant timestamps in videos
  • Search across multiple video lectures simultaneously

The system leverages OpenAI Whisper for speech-to-text transcription, BGE-M3 embeddings for semantic understanding, and Ollama LLMs (like Llama 3.2) for generating human-like responses.


✨ Features

  • 🎬 Video to Audio Conversion: Automatically extracts audio from video files using FFmpeg
  • 🗣️ Speech-to-Text Transcription: Uses the Whisper large-v2 model with Hindi-to-English translation
  • 📝 Chunk-based Processing: Splits transcriptions into timestamped segments for precise retrieval
  • 🔍 Semantic Search: Uses BGE-M3 embeddings for meaning-based search (not just keywords)
  • 🤖 AI-Powered Responses: Generates contextual answers using local LLMs via Ollama
  • ⏱️ Timestamp Navigation: Provides exact timestamps for relevant content
  • 💾 Persistent Storage: Saves embeddings using joblib for fast subsequent queries

๐Ÿ—๏ธ Architecture

┌────────────────────────────────────────────────────────────────────┐
│                            RAG PIPELINE                            │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │  Videos  │───►│  Audio   │───►│  JSON    │───►│Embeddings│      │
│  │  (.mp4)  │    │  (.mp3)  │    │(transcr.)│    │ (.joblib)│      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│       │               │               │               │            │
│       ▼               ▼               ▼               ▼            │
│    FFmpeg          Whisper        Chunking      BGE-M3 Model       │
│                    large-v2                                        │
│                                                                    │
├────────────────────────────────────────────────────────────────────┤
│                           QUERY PIPELINE                           │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │  User    │───►│  Query   │───►│ Cosine   │───►│   LLM    │      │
│  │  Query   │    │ Embedding│    │Similarity│    │ Response │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│       │               │               │               │            │
│       ▼               ▼               ▼               ▼            │
│  Natural Lang      BGE-M3       Top-5 Chunks     Llama 3.2         │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

🔄 Pipeline Workflow

Stage 1: Video to Audio Conversion

# video_to_mp3.py
FFmpeg extracts audio → Creates .mp3 files with structured naming
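
Under the hood this stage is a thin FFmpeg wrapper. Here is a minimal illustrative sketch, assuming the videos/ and Audios/ folder names used elsewhere in this README (the actual video_to_mp3.py may differ):

# Hypothetical sketch of the video-to-audio stage (not the exact video_to_mp3.py)
import os
import subprocess

os.makedirs("Audios", exist_ok=True)
for name in sorted(os.listdir("videos")):
    if name.endswith(".mp4"):
        out = os.path.join("Audios", os.path.splitext(name)[0] + ".mp3")
        # -vn drops the video stream; -q:a 0 keeps high-quality variable-bitrate audio
        subprocess.run(
            ["ffmpeg", "-i", os.path.join("videos", name), "-vn", "-q:a", "0", out],
            check=True,
        )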

Stage 2: Audio Transcription

# mp3_to_json.py
Whisper model → Transcribes audio → Generates timestamped JSON chunks
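
A hedged sketch of this stage, assuming the openai-whisper Python API; the chunk fields mirror the JSON example shown in the How It Works section, though the exact on-disk layout produced by mp3_to_json.py is an assumption here:

# Hypothetical sketch of the transcription stage (not the exact mp3_to_json.py)
import json
import whisper

model = whisper.load_model("large-v2")
# task="translate" turns Hindi speech into English text
result = model.transcribe("Audios/01_Installing VS Code & How Websites Work.mp3", task="translate")

chunks = [
    {"number": "1", "title": "Installing VS Code & How Websites Work",
     "start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
    for seg in result["segments"]
]

with open("jsons/01_Installing VS Code & How Websites Work.mp3.json", "w") as f:
    json.dump(chunks, f, indent=2)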

Stage 3: Embedding Generation

# preprocess_json.py
BGE-M3 model → Creates embeddings → Stores in embeddings.joblib

Stage 4: Query Processing

# process_incoming.py
User query → Semantic search → LLM generates contextual response

🛠️ Technologies Used

  • Speech Recognition: OpenAI Whisper (large-v2)
  • Embeddings: BGE-M3 (via Ollama)
  • LLM: Llama 3.2 / DeepSeek-R1 (via Ollama)
  • Audio Processing: FFmpeg
  • Data Processing: Pandas, NumPy, Scikit-learn
  • Storage: Joblib
  • API Server: Ollama (localhost:11434)
  • Language: Python 3.x

📦 Installation

Prerequisites

  1. Python 3.8+ installed
  2. FFmpeg installed and in PATH
  3. Ollama installed and running

Step 1: Clone the Repository

git clone <repository-url>
cd "Rag Based AI"

Step 2: Install Python Dependencies

pip install pandas numpy scikit-learn joblib requests

# Whisper itself is installed from the bundled submodule in Step 4

Step 3: Install Ollama Models

# Install embedding model
ollama pull bge-m3

# Install LLM (choose one)
ollama pull llama3.2
# or
ollama pull deepseek-r1
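
Once the models are pulled, you can confirm they are visible to the local Ollama API, which this project assumes is running at its default address, localhost:11434:

# Quick sanity check that the Ollama server is reachable (assumes the default port)
import requests

resp = requests.get("http://localhost:11434/api/tags")  # lists locally installed models
resp.raise_for_status()
print([m["name"] for m in resp.json()["models"]])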

Step 4: Install Whisper

cd whisper
pip install -e .

🚀 Usage

1. Convert Videos to Audio

Place your video files in the videos/ folder and run:

python video_to_mp3.py

2. Transcribe Audio Files

python mp3_to_json.py

This creates JSON files with timestamped transcriptions in the jsons/ folder.

3. Generate Embeddings

python preprocess_json.py

This creates embeddings.joblib containing all chunk embeddings.
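
If you want to sanity-check the result, the file can be loaded back with joblib. The stored structure is an implementation detail of preprocess_json.py; this sketch assumes a pandas DataFrame with one row per chunk, so treat the column layout as an assumption:

# Hypothetical inspection of embeddings.joblib (stored structure is assumed)
import joblib

df = joblib.load("embeddings.joblib")
print(len(df), "chunks loaded")
print(df.head())  # expected: chunk metadata (title, start, end, text) plus an embedding per row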

4. Ask Questions

python process_incoming.py

Example interaction:

Ask a Question: Where is HTML concluded in this course?

Response: HTML is concluded in Video 13 titled "Entities, Code tag and more on HTML". 
You can find the conclusion at around 8:40 (520 seconds). The instructor also 
mentions in Video 14 "Introduction to CSS" at the beginning (around 0:05) that 
HTML has been completed. I recommend watching Video 13 from timestamp 8:40 onwards 
for the HTML conclusion!

๐Ÿ“ Project Structure

Rag Based AI/
│
├── video_to_mp3.py        # Converts videos to MP3 audio files
├── mp3_to_json.py         # Transcribes audio using Whisper
├── preprocess_json.py     # Creates embeddings from transcriptions
├── process_incoming.py    # Main query processing script
│
├── embeddings.joblib      # Stored embeddings database
├── prompt.txt             # Last generated prompt (for debugging)
├── response.txt           # Last LLM response (for debugging)
│
├── Audios/                # Converted audio files (.mp3)
├── jsons/                 # Transcription JSON files
│   ├── 01_Installing VS Code & How Websites Work.mp3.json
│   ├── 02_Your First HTML Website.mp3.json
│   ├── ... (18 video transcriptions)
│
├── whisper/               # OpenAI Whisper submodule
│   ├── whisper/           # Core Whisper library
│   ├── tests/             # Test files
│   └── notebooks/         # Jupyter notebooks
│
└── README.md              # This file

💡 Use Cases

🎓 Educational Platforms

  • Course Navigation: Help students find specific topics in lengthy video courses
  • Study Assistant: Answer questions about course content with precise references
  • Revision Helper: Quickly locate topics for exam preparation
  • Content Discovery: Search across multiple lectures simultaneously

๐Ÿข Enterprise Applications

Use Case Description
Training Videos Make corporate training searchable
Meeting Recordings Find specific discussions in recorded meetings
Webinar Archives Search through past webinars efficiently
Knowledge Base Create searchable video documentation

📺 Content Creators

  • Viewer Support: Help viewers find specific content
  • Content Indexing: Automatic chapter generation for videos
  • FAQ Automation: Auto-answer common viewer questions
  • Accessibility: Make video content accessible via text search

🔬 Research Applications

  • Lecture Archives: Search through academic lecture recordings
  • Interview Analysis: Find specific quotes in recorded interviews
  • Conference Videos: Navigate through conference presentations
  • Podcast Search: Make podcast episodes searchable

โš™๏ธ How It Works

1. Transcription Process

The Whisper model processes audio files and generates timestamped segments:

{
    "number": "1",
    "title": "Installing VS Code & How Websites Work",
    "start": 0.0,
    "end": 3.5,
    "text": "From today's video, we will start the Sigma Web Development course."
}

2. Embedding Generation

Each text chunk is converted to a 1024-dimensional vector using BGE-M3:

embedding = create_embedding([chunk_text])  # Returns [1024] vector
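
Here create_embedding stands for a small wrapper around Ollama's local embedding endpoint. A minimal sketch, assuming Ollama's /api/embed route on localhost:11434 (the project's actual helper may differ):

# Hypothetical create_embedding wrapper around Ollama's embedding API
import requests

def create_embedding(texts):
    r = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "bge-m3", "input": texts},
    )
    r.raise_for_status()
    return r.json()["embeddings"]  # one 1024-dimensional vector per input text

query_vector = create_embedding(["Where is HTML concluded?"])[0]
print(len(query_vector))  # 1024 for bge-m3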

3. Semantic Search

User queries are embedded and compared using cosine similarity:

similarities = cosine_similarity(all_embeddings, [query_embedding])
top_5_chunks = get_top_n(similarities, n=5)
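
get_top_n above is shorthand for a standard top-k selection. A sketch using scikit-learn and NumPy (variable names are illustrative, not the exact process_incoming.py code):

# Hypothetical top-k retrieval over the precomputed chunk embeddings
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_top_n(all_embeddings, query_embedding, n=5):
    # all_embeddings: array of shape (num_chunks, 1024); query_embedding: one 1024-dim vector
    sims = cosine_similarity(all_embeddings, [query_embedding]).flatten()
    top_idx = np.argsort(sims)[::-1][:n]  # indices of the n most similar chunks
    return top_idx, sims[top_idx]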

4. Response Generation

Top chunks are formatted into a prompt and sent to the LLM:

prompt = f"""
Here are video subtitle chunks: {relevant_chunks}
User question: {user_query}
Answer with video references and timestamps...
"""
response = llm.generate(prompt)
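
llm.generate here is shorthand for a call to the local Ollama server. A hedged sketch using its non-streaming /api/generate endpoint (the actual script may use a different route or model):

# Hypothetical call to the local Ollama server for the final answer
import requests

def generate_response(prompt, model="llama3.2"):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    r.raise_for_status()
    return r.json()["response"]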

โš™๏ธ Configuration

Change Embedding Model

In preprocess_json.py and process_incoming.py:

"model": "bge-m3"  # Change to preferred embedding model

Change LLM Model

In process_incoming.py:

"model": "llama3.2"  # Options: llama3.2, deepseek-r1, mistral, etc.

Adjust Number of Retrieved Chunks

In process_incoming.py:

top_results = 5  # Increase for more context, decrease for speed

Change Whisper Model

In mp3_to_json.py:

model = whisper.load_model("large-v2")  # Options: tiny, base, small, medium, large, large-v2

🔮 Future Improvements

  • Web Interface - Create a Streamlit/Gradio UI for easier interaction
  • Multi-language Support - Extend beyond Hindi-English translation
  • Real-time Processing - Process videos as they're uploaded
  • GPU Acceleration - Optimize for faster embedding generation
  • Vector Database - Replace joblib with Chroma/Pinecone for scalability
  • Caching Layer - Cache common queries for faster responses
  • API Endpoints - Create REST API for integration with other systems
  • Video Player Integration - Direct links to video timestamps
  • Batch Processing - Handle multiple queries simultaneously
  • Fine-tuning - Custom model training on domain-specific content

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Please read CODE_OF_CONDUCT.md for guidelines.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • OpenAI Whisper - For the excellent speech recognition model
  • Ollama - For making local LLM deployment easy
  • BGE-M3 - For the powerful multilingual embedding model
  • Sigma Web Development Course - The course content used for demonstration

📞 Contact


Made with ❤️ by Piyush Ramteke for better learning experiences
