<a href="https://colab.research.google.com/github/CosmicMicra/Rag-based-content-generation/blob/main/Rag_based_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 RAG + Gemini study

---

### 👨‍🏫 Objective

To build an AI assistant that can **analyze the transcript of any YouTube educational video**, and automatically generate:
- A **detailed summary**
- **10 flashcards** for memory retention
- **10 MCQs** to test understanding
- **5 external links** to expand learning

---


### 📦 Install Required Libraries

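We install the required dependencies:
- `google-generativeai` for Gemini API access (imported below as `google.generativeai`); `google-genai` is installed as well
- `youtube-transcript-api` to extract transcripts
- `faiss-cpu` and `sentence-transformers` for RAG
- `hf_xet` for faster Hugging Face model downloads

---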

In [None]:
!pip uninstall -qy jupyterlab jupyterlab-lsp
!pip install -qU 'google-genai==1.7.0'
!pip install --upgrade -q youtube-transcript-api
!pip install --upgrade -q google-generativeai
!pip install faiss-cpu -q
!pip install --upgrade -q sentence-transformers
!pip install hf_xet

Collecting hf_xet
  Downloading hf_xet-1.1.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (879 bytes)
Downloading hf_xet-1.1.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Installing collected packages: hf_xet
Successfully installed hf_xet-1.1.2


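### 🔐 Load Gemini API Key from Colab Secrets

The Gemini API key is read from Colab's secret store (`google.colab.userdata`), so it never appears in the notebook. The first cell below is the equivalent Kaggle Secrets version, left commented out.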

---

In [None]:
#import google.generativeai as genai
#from google.generativeai import types
#from IPython.display import Markdown, HTML, display
#from kaggle_secrets import UserSecretsClient

#GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
#genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
import google.generativeai as genai
from google.generativeai import types
from IPython.display import Markdown, HTML, display
from google.colab import userdata
api_key = userdata.get('GOOGLE_API_KEY')

# Configure the API with the key
genai.configure(api_key=api_key)
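
If the notebook may also run outside Colab, the key lookup can fall back to an environment variable. The sketch below is an assumption rather than part of the original notebook: `load_api_key` is a hypothetical helper, and it assumes the key is exported as `GOOGLE_API_KEY` when `google.colab` is unavailable.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Hypothetical helper: read the key from Colab secrets, else from the environment."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name.
        return os.environ.get(name)
    return userdata.get(name)
```

With this helper, the configuration call would become `genai.configure(api_key=load_api_key())`.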

### 🎥 YouTube Transcript Extraction

Extracting the transcript of any YouTube video using `youtube-transcript-api`. The text is returned as chunks to support passage-level retrieval.

---

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from urllib.parse import urlparse, parse_qs

def get_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.hostname in ('www.youtube.com', 'youtube.com'):
        return parse_qs(query.query).get('v', [None])[0]
    return None

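def get_transcript(video_url):
    """Return the transcript as a list of text lines, or an error string on failure."""
    video_id = get_video_id(video_url)
    if video_id is None:
        return f"Transcript not available: could not parse a video ID from {video_url}"
    try:
        # Each entry is a dict with 'text', 'start' and 'duration'; keep only the text.
        # Note: newer releases of youtube-transcript-api may expect
        # YouTubeTranscriptApi().fetch(video_id) instead of this static call.
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return [t['text'] for t in transcript]  # return list for chunking
    except Exception as e:
        return f"Transcript not available: {e}"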
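
To see which URL shapes the parser above accepts, the following self-contained sketch repeats the same `urlparse` logic under a hypothetical name, `extract_video_id`, and runs it on a few sample links. The second and third URLs are variations made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Same logic as get_video_id: handles youtu.be short links and watch?v= URLs."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path[1:]  # drop the leading "/"
    if parsed.hostname in ("www.youtube.com", "youtube.com"):
        return parse_qs(parsed.query).get("v", [None])[0]
    return None  # unrecognised host


samples = [
    "https://youtu.be/5sLYAQS9sWQ?si=Pwz-R4z3qC-rVKJ7",
    "https://www.youtube.com/watch?v=5sLYAQS9sWQ&t=42s",
    "https://m.youtube.com/watch?v=5sLYAQS9sWQ",  # mobile host is not in the allowed list
]
for url in samples:
    print(url, "->", extract_video_id(url))
```

The mobile link returns `None`, so `m.youtube.com` would have to be added to the hostname tuple before such links work.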

### 📚 Chunking + Embedding for RAG

We join every 5 consecutive transcript lines into one chunk, embed each chunk with the `all-MiniLM-L6-v2` model from `sentence-transformers`, and store the vectors in a FAISS `IndexFlatL2` index, which performs exact L2 search.
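
The grouping step can be tried without downloading a model. This minimal sketch uses a hypothetical helper, `group_lines`, and made-up transcript lines; it mirrors the slicing used in `chunk_and_embed` but skips embedding and indexing.

```python
def group_lines(lines, size=5):
    """Join every `size` consecutive transcript lines into one chunk."""
    return [" ".join(lines[i:i + size]) for i in range(0, len(lines), size)]


# Seven made-up transcript lines: the last chunk holds the two leftover lines.
lines = [f"line {n}" for n in range(1, 8)]
for chunk in group_lines(lines):
    print(repr(chunk))
```

Because the chunks are fixed-size and do not overlap, a sentence that spans a group boundary is split across two chunks.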

---


In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')

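def chunk_and_embed(transcript_chunks):
    # Join every 5 consecutive transcript lines into one retrieval chunk.
    chunks = [" ".join(transcript_chunks[i:i+5]) for i in range(0, len(transcript_chunks), 5)]
    # FAISS expects a 2-D float32 array of shape (n_chunks, dim).
    embeddings = np.asarray(embed_model.encode(chunks), dtype="float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact (brute-force) L2 index
    index.add(embeddings)
    return chunks, index, embeddings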



### 🧠 Generate Learning Content using RAG + Gemini

This function implements Retrieval-Augmented Generation:
- Retrieve top-5 chunks relevant to the query
- Add few-shot prompt examples
- Call Gemini to generate a summary, an ASCII diagram, flashcards, MCQs, and links
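
The retrieval step amounts to a nearest-neighbor search over chunk vectors. As a rough illustration, the sketch below reproduces what an exact L2 index does, in plain Python and with made-up 2-D vectors standing in for real embeddings; `top_k_l2` is a hypothetical name, not a FAISS function.

```python
def top_k_l2(query, vectors, k=5):
    """Return (squared_distance, index) pairs for the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector.
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:min(k, len(scored))]


# Toy "embeddings" for three chunks and one query.
chunks = ["intro to loops", "decision trees", "for loop syntax"]
vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]
query = (1.0, 0.0)

hits = top_k_l2(query, vectors, k=2)
context = "\n".join(chunks[idx] for _, idx in hits)
print(context)
```

In the notebook, `index.search` plays the role of `top_k_l2`, and the joined text becomes the transcript section of the prompt.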

---


In [None]:
def generate_learning_content(query, chunks, index, embeddings):
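    # RAG: retrieve the top 5 relevant chunks (fewer if the transcript is short)
    query_embed = np.asarray(embed_model.encode([query]), dtype="float32")
    k = min(5, len(chunks))
    D, I = index.search(query_embed, k)
    # FAISS pads missing results with -1, so skip any negative index.
    relevant = "\n".join(chunks[i] for i in I[0] if i >= 0)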

    few_shot_examples = """
    Example 1:
    Transcript:
    Neural networks are made of layers of neurons. Each neuron takes input, does some math, and passes it on.

    Summary:
    Neural networks consist of interconnected neurons organized in layers that process data through mathematical transformations.

    ASCII format Diagram:

    ```
    +---------------+     +---------------+     +---------------+
    | Input Layer   |     | Hidden Layer  |     | Output Layer  |
    +---------------+     +---------------+     +---------------+
           |                   |                   |
           v                   v                   v
    +---------------+     +---------------+     +---------------+
    | Neuron 1 (I1) | --> | Neuron 1 (H1) | --> | Neuron 1 (O1) |
    +---------------+     +---------------+     +---------------+
           | \                 | \                 |
           |  \                |  \                |
           |   \               |   \               |
           v    \              v    \              v
    +---------------+     +---------------+     +---------------+
    | Neuron 2 (I2) | --> | Neuron 2 (H2) | --> | Neuron 2 (O2) |
    +---------------+     +---------------+     +---------------+
           |     \            |     \            |
           |      \           |      \           |
           |       \          |       \          |
           v        \         v        \         v
    +---------------+     +---------------+     +---------------+
    | Neuron 3 (I3) | --> | Neuron 3 (H3) | --> | Neuron 3 (O3) |
    +---------------+     +---------------+     +---------------+
           |
           v
          ...
    ```

    Explanation:

    Layers: The diagram shows three main layers:
        Input Layer: Receives the initial data. (I1, I2, I3, ...)
        Hidden Layer: Performs intermediate calculations. (H1, H2, H3, ...) Neural networks can have multiple hidden layers.
        Output Layer: Produces the final result. (O1, O2, O3, ...)
    Neurons: Each layer consists of neurons (represented as boxes).
    Connections (Arrows): The arrows represent the connections between neurons, where data and weights are passed.
    Data Flow: Data flows from the input layer, through the hidden layer(s), and finally to the output layer.
    ...: The dots indicate that there can be more neurons in each layer.


    Flashcards:
    Q: How do neural networks process data?
    A: Neural networks process data through mathematical transformations.

    MCQs:
    Q: What is a neural network composed of?
    a) Trees
    b) Layers of neurons ✅
    c) Genes
    d) Tables

    Links:
    - https://www.ibm.com/topics/neural-networks

    Example 2:
    Transcript:
    The concept of a decision tree involves creating a model that splits data based on certain features to make decisions. At each decision node, a condition is evaluated, and data is routed to the next node until a final decision is made at the leaf.

    Summary:
    A decision tree is a flowchart-like model where data is split based on feature conditions at decision nodes, ultimately reaching a final decision at the leaf nodes.

    ASCII format Diagram:

    ```
                   +---------------+
                   |   Root Node   |
                   +---------------+
                         |
              +----------+----------+
              |                     |
      +---------------+     +---------------+
      | Decision Node |     | Decision Node |
      +---------------+     +---------------+
              |                     |
        +-----+-----+          +-----+-----+
        |           |          |           |
    +---------------+     +---------------+
    |   Leaf Node   |     |   Leaf Node   |
    +---------------+     +---------------+

    ```
    Explanation:

    Root Node: The starting point of the decision tree.

    Decision Nodes: These nodes represent points where data is split based on certain conditions.

    Leaf Nodes: These represent the final decision made after evaluating all conditions along the tree.

    Splitting Conditions: At each decision node, data is routed based on specific conditions, such as a threshold value or category.

    Flashcards:
    Q: What does a decision tree use to make decisions?
    A: A decision tree splits data based on feature conditions at decision nodes to make final decisions at leaf nodes.

    MCQs:
    Q: What is a key feature of a decision tree?
    a) Linear relationships
    b) Data splitting based on conditions ✅
    c) Random selection of data
    d) Single-layer structure

    Links:

    https://www.towardsdatascience.com/understanding-decision-trees-20613db75dbb

    Example 3:
    Transcript:
    A for loop is a control structure that allows a block of code to be repeated multiple times. It continues to execute until a specific condition is no longer true.

    Summary:
    A for loop repeats a block of code a set number of times or until a condition fails.

    ASCII format Diagram:

    ```

    +--------------------------+
    | Start                    |
    +--------------------------+
                 |
                 v
       +--------------------+
       | Initialize counter |
       +--------------------+
                |
                v
       +----------------------+
       | Check condition      |
       +----------------------+
                |
           +----+----+
           |         |
           v         v
      +---------+  +---------+
      | Execute |  | Exit    |
      +---------+  +---------+
           |
           v
      +----------------+
      | Update Counter |
      +----------------+
           |
           v
        +----------------------+
        | Check condition      |
        +----------------------+
     ```

    Explanation:

     Start: Marks the beginning of the loop.

     Initialize Counter: Sets the starting value of the counter (e.g., i = 0).

     Check Condition: Evaluates whether the loop should continue (e.g., i < 5).

     Execute: If the condition is true, the block of code is executed.

     Update Counter: After each iteration, the counter is updated (e.g., i++).

     Exit: If the condition is false, the loop exits.

     Flashcards:
     Q: How does a for loop work?
     A: A for loop repeats a block of code until a specified condition is no longer true.

     MCQs:
     Q: What is the purpose of a for loop?
     a) To execute code once
     b) To repeat code multiple times ✅
     c) To execute code conditionally
     d) To exit the program

     Links:

     https://www.programiz.com/python-programming/for-loop
    """

    prompt = f"""
    You are a helpful AI assistant.
    {few_shot_examples}

    Now based on this transcript:
    {relevant}

    Generate:
    1. A detailed summary
    2. Provide ASCII format Diagram
    3. 10 flashcards (Q&A)
    4. 10 MCQs with 4 options each, mark the correct one
    5. 5 external links to explore more
    """

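    # GenerativeModel expects a Gemini model name. "gemini-2.0-flash" is an assumed
    # choice; swap in any Gemini model your API key can access.
    model = genai.GenerativeModel("gemini-2.0-flash")
    response = model.generate_content(prompt)
    return response.text


# Alternative (not used here): the same prompt sent through the OpenAI API,
# assuming an OpenAI `client` has been created elsewhere.
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",  # or "gpt-4" if you have access
#     messages=[{"role": "user", "content": prompt}],
#     temperature=0.7,
#     max_tokens=1500,  # adjust as needed
# )
# text = response.choices[0].message.content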

### 🚀 Run Full Pipeline

Input a YouTube link and run the full flow:
1. Extract transcript
2. Chunk + embed + index
3. Query Gemini for educational content

---


In [None]:
youtube_link = "https://youtu.be/5sLYAQS9sWQ?si=Pwz-R4z3qC-rVKJ7"

transcript_chunks = get_transcript(youtube_link)
if isinstance(transcript_chunks, str):
    print(transcript_chunks)
else:
    chunks, index, embeddings = chunk_and_embed(transcript_chunks)
    output = generate_learning_content("summarize and generate learning materials", chunks, index, embeddings)
    print(output)


Transcript not available: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=5sLYAQS9sWQ! This is most likely caused by:

YouTube is blocking requests from your IP. This usually is due to one of the following reasons:
- You have done too many requests and your IP has been blocked by YouTube
- You are doing requests from an IP belonging to a cloud provider (like AWS, Google Cloud Platform, Azure, etc.). Unfortunately, most IPs from cloud providers are blocked by YouTube.

There are two things you can do to work around this:
1. Use proxies to hide your IP address, as explained in the "Working around IP bans" section of the README (https://github.com/jdepoix/youtube-transcript-api?tab=readme-ov-file#working-around-ip-bans-requestblocked-or-ipblocked-exception).
2. (NOT RECOMMENDED) If you authenticate your requests using cookies, you will be able to continue doing requests for a while. However, YouTube will eventually permanently ban the account that you have u

### ✅ Summary of Key Concepts Used

---

This notebook combines several GenAI techniques:

- **Few-shot prompting**: Three worked examples in the prompt guide Gemini toward structured summaries, ASCII diagrams, flashcards, MCQs, and links.
- **Document understanding**: YouTube video transcripts are processed to extract the key information for educational content.
- **Long context window**: The prompt combines the few-shot examples with the retrieved transcript chunks, relying on Gemini's capacity for large prompts.
- **Gen AI evaluation**: The generated summaries, flashcards, and MCQs are assessed by manual review; the notebook contains no automated evaluation step.
- **Retrieval augmented generation (RAG)**: The transcript chunks closest to the query (up to 5) are retrieved and passed to Gemini as context.
- **Vector search/vector store/vector database**: FAISS (`IndexFlatL2`) stores the chunk embeddings and performs the nearest-neighbor search.
- **Embeddings**: `sentence-transformers` (`all-MiniLM-L6-v2`) converts each transcript chunk into a semantic vector for similarity-based retrieval.

---
