# 🧠 RAG + Gemini study

---

### 👨‍🏫 Objective

Build an AI assistant that **analyzes the transcript of a YouTube educational video** (one with an English transcript) and automatically generates:
- A **detailed summary**
- An **ASCII diagram** of the main concept
- **10 flashcards** for memory retention
- **10 MCQs** to test understanding
- **5 external links** to expand learning

---


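### 📦 Install Required Libraries

We install the dependencies the notebook uses:
- `google-generativeai` for Gemini API access (the code imports `google.generativeai`; the pinned `google-genai` package is also installed but never imported)
- `youtube-transcript-api` to extract transcripts
- `faiss-cpu` and `sentence-transformers` for RAG
- `scikit-learn` and `matplotlib` for the topic-clustering drafts
- `python-dotenv` to load the API key from a `.env` file

---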

In [46]:

!pip install -qU 'google-genai==1.7.0'
!pip install --upgrade -q youtube-transcript-api
!pip install --upgrade -q google-generativeai
!pip install faiss-cpu -q
!pip install --upgrade -q sentence-transformers
!pip install hf_xet
!pip install python-dotenv
!pip install matplotlib
!pip install scikit-learn
%matplotlib inline




### 🔐 Load Gemini API Key

Load the Gemini API key from a local `.env` file (the `GOOGLE_API_KEY` variable) and use it to configure the `google.generativeai` client.

---

In [47]:
import google.generativeai as genai
from google.generativeai import types
from IPython.display import Markdown, HTML, display
import os
from dotenv import load_dotenv
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

os.environ["TOKENIZERS_PARALLELISM"] = "false"

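# Load API key from .env file
load_dotenv()
api_key = os.getenv('GOOGLE_API_KEY')
if not api_key:
    # Fail early instead of reporting success with a missing key
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file.")

# Configure the API with the key
genai.configure(api_key=api_key)

print("✅ API key loaded from .env file!")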

✅ API key loaded from .env file!


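### 🎥 YouTube Transcript Extraction

Extract the transcript of a YouTube video with `youtube-transcript-api`. The transcript is returned as a list of caption lines, which are grouped into chunks later to support passage-level retrieval. If the transcript cannot be fetched, an error string is returned instead.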

---

In [48]:
from youtube_transcript_api import YouTubeTranscriptApi
from urllib.parse import urlparse, parse_qs

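def get_video_id(url):
    """Extract the video ID from a youtu.be or youtube.com/watch URL."""
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.hostname in ('www.youtube.com', 'youtube.com'):
        return parse_qs(query.query).get('v', [None])[0]
    return None

def get_transcript(video_url):
    """Return the transcript as a list of caption lines, or an error string on failure."""
    video_id = get_video_id(video_url)
    if video_id is None:
        return f"Transcript not available: could not parse a video ID from {video_url}"
    try:
        # Note: this static call matches older youtube-transcript-api releases;
        # newer 1.x releases may require YouTubeTranscriptApi().fetch(video_id, languages=...) instead.
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en-US', 'en'])
        return [t['text'] for t in transcript]  # return list for chunking
    except Exception as e:
        return f"Transcript not available: {e}"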
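
`get_video_id` above covers `youtu.be` short links and `youtube.com/watch?v=...` URLs. As a sketch of how it could be extended, the hypothetical `extract_video_id` below also accepts the `m.youtube.com` host and the `/embed/<id>` and `/shorts/<id>` path forms. Treating those as the only extra forms is an assumption, not an exhaustive list of YouTube URL shapes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts treated as YouTube watch pages in this sketch (an assumption)
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID, or None if the URL is not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short link: the ID is the first path segment
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] in ("embed", "shorts"):
            return parts[1]
    return None


short_id = extract_video_id("https://youtu.be/T-D1OfcDW1M")
watch_id = extract_video_id("https://www.youtube.com/watch?v=T-D1OfcDW1M&t=10s")
```

Returning `None` for anything unrecognized lets the caller report a clear "could not parse" message before any network request is made.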

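### 📚 Chunking + Embedding for RAG

We split the transcript into groups of 5 caption lines, embed each group with the `all-MiniLM-L6-v2` model from `sentence-transformers`, and store the vectors in a FAISS `IndexFlatL2` index, which performs exact L2 (Euclidean) nearest-neighbor search.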

---


In [49]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')

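def chunk_and_embed(transcript_chunks):
    # Group every 5 consecutive caption lines into one chunk
    chunks = [" ".join(transcript_chunks[i:i+5]) for i in range(0, len(transcript_chunks), 5)]
    # Encode the chunks into dense vectors (one row per chunk)
    embeddings = embed_model.encode(chunks)
    # Exact (brute-force) L2 index sized to the embedding dimension
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(np.asarray(embeddings, dtype=np.float32))  # FAISS expects float32
    return chunks, index, embeddings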
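
The grouping step can be tried without loading the embedding model. The sketch below reproduces the fixed-size grouping in plain Python and adds an optional `overlap` parameter so that neighboring chunks share lines. The overlap option and the `group_lines` name are illustrative additions, not part of the notebook's `chunk_and_embed`.

```python
from typing import List


def group_lines(lines: List[str], size: int = 5, overlap: int = 0) -> List[str]:
    """Join consecutive caption lines into chunks of `size` lines.

    `overlap` (hypothetical extension) is the number of lines shared by
    neighboring chunks; 0 reproduces the non-overlapping grouping.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    return [" ".join(lines[i:i + size]) for i in range(0, len(lines), step)]


# Synthetic caption lines standing in for a real transcript
caption_lines = [f"line{i}" for i in range(92)]
plain_chunks = group_lines(caption_lines)
overlapping_chunks = group_lines(caption_lines, size=5, overlap=1)
```

With 92 caption lines and no overlap this yields 19 chunks, the same counts that appear in the run output of this notebook.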


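### 🧠 Generate Learning Content using RAG + Gemini

This function implements Retrieval-Augmented Generation:
- Retrieve the top-5 chunks most relevant to the query
- Add few-shot prompt examples
- Call Gemini to generate the summary, ASCII diagram, flashcards, MCQs, and links

The first two cells below are earlier drafts that add K-means topic clustering; the last cell is the version the pipeline runs.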

---
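
`IndexFlatL2` performs exact, exhaustive search and reports squared Euclidean distances. The pure-Python sketch below mirrors that idea on toy 2-D vectors so the retrieval step can be followed by hand. `top_k_l2` and the sample data are hypothetical; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
from typing import List, Sequence, Tuple


def top_k_l2(query: Sequence[float],
             vectors: Sequence[Sequence[float]],
             k: int) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the k vectors closest to `query`."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((i, dist))
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings" and the chunks they stand for (illustrative data only)
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
toy_chunks = ["intro", "retrieval", "generation", "outro"]

hits = top_k_l2((0.9, 0.1), toy_vectors, k=2)
context = "\n".join(toy_chunks[i] for i, _ in hits)
```

The joined `context` plays the role of the `relevant` string that is pasted into the prompt.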


In [39]:
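# Earlier draft of generate_learning_content (adds K-means topic clustering);
# it is redefined by the later cells.
def generate_learning_content(query, chunks, index, embeddings):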

    # RAG: Retrieve top 5 relevant chunks
    print("Performing RAG retrieval...")
    query_embed = embed_model.encode([query])
    D, I = index.search(query_embed, 5)
    relevant = "\n".join([chunks[i] for i in I[0]])
    print(f"Retrieved {len(I[0])} relevant chunks")

    n_chunks = len(chunks)
    n_topics = min(3, max(2, n_chunks // 2)) if n_chunks >= 2 else 1

    if n_chunks < n_topics:
        n_topics = n_chunks  # KMeans can't have more clusters than data points

    if n_topics >= 1:
        kmeans = KMeans(n_clusters=n_topics, random_state=42)
        cluster_labels = kmeans.fit_predict(embeddings)

        topic_counts = Counter(cluster_labels)
        plt.figure(figsize=(6, 6))
        plt.pie(topic_counts.values(), labels=[f'Topic {i+1}' for i in topic_counts.keys()], autopct='%1.1f%%')
        plt.show()
    else:
        print("❌ Not enough data to form clusters.")





    prompt = f"""
Based on the transcript excerpts below, generate:

1. A detailed summary
2. ASCII diagram
3. 10 flashcards (Q&A format)
4. 10 MCQs with 4 options each, mark correct with ✅
5. 5 external links

Transcript:
{relevant}
"""

    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    
    print("Generating content...")
    response = model.generate_content(
        prompt,
        generation_config=genai.types.GenerationConfig(
            max_output_tokens=4000,
            temperature=0.7
        )
    )
    
    print("✅ Content generated successfully!")
    return response.text
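
The cluster-count rule in the draft above is easier to read as a small function. This sketch restates the same arithmetic (at least 2 and at most 3 topics, never more than the number of chunks); the `choose_n_topics` name is hypothetical.

```python
def choose_n_topics(n_chunks: int) -> int:
    """Number of K-means clusters used for a transcript with `n_chunks` chunks."""
    n_topics = min(3, max(2, n_chunks // 2)) if n_chunks >= 2 else 1
    # KMeans cannot have more clusters than data points
    return min(n_topics, n_chunks)


# Cluster counts for a few transcript sizes
topic_table = {n: choose_n_topics(n) for n in (0, 1, 2, 5, 6, 19)}
```

Under this rule a third topic only appears once the transcript has at least 6 chunks, and an empty transcript yields 0 topics, which is the case the "not enough data" branch handles.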

In [31]:
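# Earlier draft of generate_learning_content (K-means topics named with TF-IDF);
# it is redefined by the final version in the next cell.
def generate_learning_content(query, chunks, index, embeddings):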
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from collections import Counter
    import matplotlib.pyplot as plt
    
    # RAG: Retrieve top 5 relevant chunks
    print("Performing RAG retrieval...")
    query_embed = embed_model.encode([query])
    D, I = index.search(query_embed, 5)
    relevant = "\n".join([chunks[i] for i in I[0]])
    print(f"Retrieved {len(I[0])} relevant chunks")

    # K-MEANS: Find topics
    n_chunks = len(chunks)
    n_topics = min(3, max(2, n_chunks // 2)) if n_chunks >= 2 else 1

    if n_chunks < n_topics:
        n_topics = n_chunks

    topic_names = []  # stays empty when there is nothing to cluster
    if n_topics >= 1:
        kmeans = KMeans(n_clusters=n_topics, random_state=42)
        cluster_labels = kmeans.fit_predict(embeddings)

        # Name each topic after its highest-scoring term
        # (TF-IDF fitted on a single document reduces to term frequency)
        for topic_id in range(n_topics):
            topic_chunks = [chunks[i] for i in range(len(chunks)) if cluster_labels[i] == topic_id]
            topic_text = ' '.join(topic_chunks)
            
            vectorizer = TfidfVectorizer(max_features=10, stop_words='english')
            try:
                tfidf_matrix = vectorizer.fit_transform([topic_text])
                feature_names = vectorizer.get_feature_names_out()
                tfidf_scores = tfidf_matrix.toarray()[0]
                best_term_idx = tfidf_scores.argmax()
                topic_name = feature_names[best_term_idx].capitalize()
                topic_names.append(topic_name)
            except ValueError:
                # Raised when the cluster text has no usable vocabulary
                topic_names.append(f"Topic {topic_id + 1}")

        # Show chart; sizes are listed in topic-id order so they line up with topic_names
        topic_counts = Counter(cluster_labels)
        topic_sizes = [topic_counts[topic_id] for topic_id in range(n_topics)]
        plt.figure(figsize=(6, 6))
        plt.pie(topic_sizes, labels=topic_names, autopct='%1.1f%%')
        plt.title("📊 Content Topics (K-means + TF-IDF)")
        plt.savefig("topics_chart.png"); plt.show()

    # ENHANCED PROMPT: Use both K-means topics + RAG content
    prompt = f"""
Based on ML analysis showing topics: {', '.join(topic_names)}

Generate:
1. A detailed summary covering these ML-discovered topics
2. ASCII diagram showing relationship between {', '.join(topic_names)}
3. 10 flashcards (focus on {', '.join(topic_names)})
4. 10 MCQs with 4 options each, mark correct with ✅
5. 5 external links

Transcript:
{relevant}
"""

    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    
    print("Generating content...")
    response = model.generate_content(
        prompt,
        generation_config=genai.types.GenerationConfig(
            max_output_tokens=4000,
            temperature=0.7
        )
    )
    
    # Add ML analysis to output
    ml_summary = "\n\n📊 ML ANALYSIS RESULTS:\n"
    ml_summary += f"🔍 Topics Discovered: {', '.join(topic_names)}\n"
    ml_summary += f"📈 Content Structure: {n_topics} main themes identified\n"
    ml_summary += "🎯 RAG Retrieved: Top 5 most relevant content chunks\n"
    
    print("✅ Content generated successfully!")
    return response.text + ml_summary
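
Because the vectorizer in this draft is fitted on one document per topic, every term receives the same IDF weight, so the highest-scoring term is simply the most frequent word that is not a stop word. The standard-library sketch below shows that naming step in isolation. `STOP_WORDS` is a tiny illustrative list rather than scikit-learn's English stop-word list, and `name_topic` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; scikit-learn's list is much longer)
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "that", "by"}


def name_topic(text: str, fallback: str) -> str:
    """Label a cluster with its most frequent non-stop-word, or `fallback` if none exists."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 1)
    if not counts:
        return fallback
    # Highest count wins; ties are broken alphabetically for determinism
    best = min(counts, key=lambda w: (-counts[w], w))
    return best.capitalize()


topic_label = name_topic(
    "The retriever finds context. The retriever ranks the context by distance. Retriever!",
    fallback="Topic 1",
)
```

A label chosen this way reflects word frequency inside one cluster only; it does not measure how distinctive the word is compared with the other clusters.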

In [50]:
def generate_learning_content(query, chunks, index, embeddings):
    # RAG: Retrieve top 5 relevant chunks
    query_embed = embed_model.encode([query])
    D, I = index.search(query_embed, 5)
    relevant = "\n".join([chunks[i] for i in I[0]])

    few_shot_examples = """
    Example 1:
    Transcript:
    Neural networks are made of layers of neurons. Each neuron takes input, does some math, and passes it on.

    Summary:
    Neural networks consist of interconnected neurons organized in layers that process data through mathematical transformations.

    ASCII format Diagram:

    ```
    +---------------+     +---------------+     +---------------+
    | Input Layer   |     | Hidden Layer  |     | Output Layer  |
    +---------------+     +---------------+     +---------------+
           |                   |                   |
           v                   v                   v
    +---------------+     +---------------+     +---------------+
    | Neuron 1 (I1) | --> | Neuron 1 (H1) | --> | Neuron 1 (O1) |
    +---------------+     +---------------+     +---------------+
           | \                 | \                 |
           |  \                |  \                |
           |   \               |   \               |
           v    \              v    \              v
    +---------------+     +---------------+     +---------------+
    | Neuron 2 (I2) | --> | Neuron 2 (H2) | --> | Neuron 2 (O2) |
    +---------------+     +---------------+     +---------------+
           |     \            |     \            |
           |      \           |      \           |
           |       \          |       \          |
           v        \         v        \         v
    +---------------+     +---------------+     +---------------+
    | Neuron 3 (I3) | --> | Neuron 3 (H3) | --> | Neuron 3 (O3) |
    +---------------+     +---------------+     +---------------+
           |
           v
          ...
    ```

    Explanation:

    Layers: The diagram shows three main layers:
        Input Layer: Receives the initial data. (I1, I2, I3, ...)
        Hidden Layer: Performs intermediate calculations. (H1, H2, H3, ...) Neural networks can have multiple hidden layers.
        Output Layer: Produces the final result. (O1, O2, O3, ...)
    Neurons: Each layer consists of neurons (represented as boxes).
    Connections (Arrows): The arrows represent the connections between neurons, where data and weights are passed.
    Data Flow: Data flows from the input layer, through the hidden layer(s), and finally to the output layer.
    ...: The dots indicate that there can be more neurons in each layer.


    Flashcards:
    Q: How do neural networks process data?
    A: Neural networks process data through mathematical transformations.

    MCQs:
    Q: What is a neural network composed of?
    a) Trees
    b) Layers of neurons ✅
    c) Genes
    d) Tables

    Links:
    - https://www.ibm.com/topics/neural-networks

    Example 2:
    Transcript:
    The concept of a decision tree involves creating a model that splits data based on certain features to make decisions. At each decision node, a condition is evaluated, and data is routed to the next node until a final decision is made at the leaf.

    Summary:
    A decision tree is a flowchart-like model where data is split based on feature conditions at decision nodes, ultimately reaching a final decision at the leaf nodes.

    ASCII format Diagram:

    ```
                   +---------------+
                   |   Root Node   |
                   +---------------+
                         |
              +----------+----------+
              |                     |
      +---------------+     +---------------+
      | Decision Node |     | Decision Node |
      +---------------+     +---------------+
              |                     |
        +-----+-----+          +-----+-----+
        |           |          |           |
    +---------------+     +---------------+
    |   Leaf Node   |     |   Leaf Node   |
    +---------------+     +---------------+

    ```
    Explanation:

    Root Node: The starting point of the decision tree.

    Decision Nodes: These nodes represent points where data is split based on certain conditions.

    Leaf Nodes: These represent the final decision made after evaluating all conditions along the tree.

    Splitting Conditions: At each decision node, data is routed based on specific conditions, such as a threshold value or category.

    Flashcards:
    Q: What does a decision tree use to make decisions?
    A: A decision tree splits data based on feature conditions at decision nodes to make final decisions at leaf nodes.

    MCQs:
    Q: What is a key feature of a decision tree?
    a) Linear relationships
    b) Data splitting based on conditions ✅
    c) Random selection of data
    d) Single-layer structure

    Links:

    https://www.towardsdatascience.com/understanding-decision-trees-20613db75dbb

    Example 3:
    Transcript:
    A for loop is a control structure that allows a block of code to be repeated multiple times. It continues to execute until a specific condition is no longer true.

    Summary:
    A for loop repeats a block of code a set number of times or until a condition fails.

    ASCII format Diagram:

    ```

    +--------------------------+
    | Start                    |
    +--------------------------+
                 |
                 v
       +--------------------+
       | Initialize counter |
       +--------------------+
                |
                v
       +----------------------+
       | Check condition      |
       +----------------------+
                |
           +----+----+
           |         |
           v         v
      +---------+  +---------+
      | Execute |  | Exit    |
      +---------+  +---------+
           |
           v
      +----------------+
      | Update Counter |
      +----------------+
           |
           v
        +----------------------+
        | Check condition      |
        +----------------------+
     ```

    Explanation:

     Start: Marks the beginning of the loop.

     Initialize Counter: Sets the starting value of the counter (e.g., i = 0).

     Check Condition: Evaluates whether the loop should continue (e.g., i < 5).

     Execute: If the condition is true, the block of code is executed.

     Update Counter: After each iteration, the counter is updated (e.g., i++).

     Exit: If the condition is false, the loop exits.

     Flashcards:
     Q: How does a for loop work?
     A: A for loop repeats a block of code until a specified condition is no longer true.

     MCQs:
     Q: What is the purpose of a for loop?
     a) To execute code once
     b) To repeat code multiple times ✅
     c) To execute code conditionally
     d) To exit the program

     Links:

     https://www.programiz.com/python-programming/for-loop
    """

    prompt = f"""
    You are a helpful AI assistant.
    {few_shot_examples}

    Now based on this transcript:
    {relevant}

    Generate:
    1. A detailed summary
    2. Provide ASCII format Diagram
    3. 10 flashcards (Q&A)
    4. 10 MCQs with 4 options each, mark the correct one
    5. 5 external links to explore more
    """

    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    response = model.generate_content(prompt)
    return response.text
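
FAISS pads `index.search` results with `-1` when the index holds fewer than `k` vectors, and in Python `chunks[-1]` would silently return the last chunk. The sketch below filters such indices before the context is built. `select_chunks` is a hypothetical helper, and the index row is a hand-written stand-in for `I[0]`.

```python
from typing import List, Sequence


def select_chunks(chunks: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Return the chunks for valid, non-repeated indices, keeping the ranking order."""
    seen = set()
    selected = []
    for i in indices:
        i = int(i)  # FAISS returns NumPy integers
        if 0 <= i < len(chunks) and i not in seen:
            seen.add(i)
            selected.append(chunks[i])
    return selected


# Three chunks but k=5 requested: the last two slots are padded with -1
short_transcript = ["chunk A", "chunk B", "chunk C"]
padded_row = [2, 0, 1, -1, -1]
relevant_text = "\n".join(select_chunks(short_transcript, padded_row))
```

This matters only for short videos; with 19 chunks, as in the run shown later, a request for 5 neighbors is always filled.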


   

### 📄 HTML Output Generation & Formatting

Convert the generated learning content into a styled HTML page (`output.html`): flashcards become cards whose answers appear on hover, URLs become clickable links, and the page opens in the default browser.

---

In [64]:
def save_html(output):
    import re
    import webbrowser
    
    # Convert to hoverable format: each pattern targets one flashcard layout Gemini may emit
    
    patterns = [
        (r'Q:\s*(.*?)\nA:\s*(.*?)(?=\n\n|\nQ:|\Z)', r'<div class="card">🧠 \1<span class="answer">💡 \2</span></div>'),
        (r'\*\*Q:\*\*\s*(.*?)\n\*\*A:\*\*\s*(.*?)(?=\n\n|\n\*\*Q:|\Z)', r'<div class="card">🧠 \1<span class="answer">💡 \2</span></div>'),
        (r'Front:\s*(.*?)\nBack:\s*(.*?)(?=\n\n|\nFront:|\Z)', r'<div class="card">🧠 \1<span class="answer">💡 \2</span></div>'),
        (r'\d+\.\s*Q:\s*(.*?)\nA:\s*(.*?)(?=\n\n|\n\d+\.|\Z)', r'<div class="card">🧠 \1<span class="answer">💡 \2</span></div>'),
        (r'\* \*\*Q:\*\* (.*?)\* \*\*A:\*\* (.*?)(?=\* \*\*Q|\*\*4\.|\Z)', r'<div class="card">🧠 \1<span class="answer">💡 \2</span></div>')
    ]
    
    
    for pattern, replacement in patterns:
        output = re.sub(pattern, replacement, output, flags=re.DOTALL)


    output = output.replace('```', '')  
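    output = re.sub(r'\*\*(.*?)\*\*', r'<b>\1</b>', output)  # **bold** -> <b>
    output = re.sub(r'(https?://[^\s]+)', r'<a href="\1" target="_blank">\1</a>', output)  # linkify URLs
    output = output.replace('\n', '<br>')  # preserve line breaks
    output = re.sub(r'\b([abcd]\))', r'<br>\1', output)  # each MCQ option on its own line; \b avoids matching inside words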
    html = f'''<html>
    <head>
    <meta charset="utf-8">
    <style>
        .card {{ margin: 10px 0; padding: 10px; background: #e8f4fd; border-radius: 5px; cursor: pointer; }}
        .answer {{ display: none; }}
        .card:hover .answer {{ display: inline; }}
    </style>
    </head>
    <body style="font-family:Arial;padding:40px;max-width:800px;margin:0 auto;line-height:1.6;">
        <h1>Learning Materials</h1>
        <pre style="background:#f5f5f5;padding:20px;border-radius:5px;white-space:pre-wrap;">{output}</pre>
    </body>
    </html>'''
    
    with open("output.html", "w", encoding="utf-8") as f:  # UTF-8 so emoji are written correctly
        f.write(html)
    webbrowser.open("output.html")
    print("✅ Saved ")
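
The first pattern above can be exercised on its own. The sketch below applies the same `Q:`/`A:` regular expression to a small sample and escapes the captured text with `html.escape` before inserting it into the markup, which `save_html` does not do. The escaping step and the `flashcards_to_html` name are additions for illustration.

```python
import html
import re

# Same Q:/A: pattern as the first entry in `patterns`
CARD_PATTERN = re.compile(r'Q:\s*(.*?)\nA:\s*(.*?)(?=\n\n|\nQ:|\Z)', re.DOTALL)


def flashcards_to_html(text: str) -> str:
    """Wrap each Q/A pair in a card <div>, escaping HTML special characters."""
    def render(match: "re.Match[str]") -> str:
        question = html.escape(match.group(1).strip())
        answer = html.escape(match.group(2).strip())
        return f'<div class="card">{question}<span class="answer">{answer}</span></div>'
    return CARD_PATTERN.sub(render, text)


sample = (
    "Q: What does RAG stand for?\nA: Retrieval-Augmented Generation\n\n"
    "Q: Is 1 < 2?\nA: Yes"
)
cards_html = flashcards_to_html(sample)
```

Escaping keeps characters such as `<` in a model answer from being interpreted as markup when the page is rendered.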

### 🚀 Run Full Pipeline

Input a YouTube link and run the full flow:
1. Extract transcript
2. Chunk + embed + index
3. Query Gemini for educational content
4. Display the result in the notebook and save it as HTML

---


In [52]:
youtube_link = "https://youtu.be/T-D1OfcDW1M"


print(f" YouTube link: {youtube_link}")

transcript_chunks = get_transcript(youtube_link)


print(f"Result type: {type(transcript_chunks)}")

if isinstance(transcript_chunks, str):
    print("❌ Got error string:")
    print(transcript_chunks)
else:
    print(f"✅ Got transcript! Number of chunks: {len(transcript_chunks)}")
    
    print("Starting chunk_and_embed...")
    chunks, index, embeddings = chunk_and_embed(transcript_chunks)
    print(f"✅ Chunk embedding done! {len(chunks)} chunks created")
    
    print(" Generating content...")  # calls generate_learning_content defined above
    output = generate_learning_content("summarize and generate learning materials", chunks, index, embeddings)
    print("✅ Content generation complete!")

    # Render the output in the notebook, then save it as HTML
    # (kept inside the else branch so it only runs when a transcript was fetched)
    display(Markdown(output))
    save_html(output)

 YouTube link: https://youtu.be/T-D1OfcDW1M
Result type: <class 'list'>
✅ Got transcript! Number of chunks: 92
Starting chunk_and_embed...
✅ Chunk embedding done! 19 chunks created
 Generating content...
✅ Content generation complete!


## Retrieval-Augmented Generation (RAG)

**1. Summary:**

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by combining their generative capabilities with external information retrieval.  Traditional LLMs generate text solely based on their internal knowledge, which can lead to inaccuracies or outdated information. RAG addresses this by first retrieving relevant context from external sources (e.g., databases, websites) based on the user's prompt. This retrieved information is then fed to the LLM, allowing it to generate a more informed and accurate response grounded in external evidence.  This two-step process of retrieval and generation improves the reliability and factual accuracy of LLM outputs.

**2. ASCII Diagram:**

```
+-----------------+     +-----------------+     +-----------------+      +-----------------+
|  User Prompt    | --> |   Retriever     | --> | Relevant Context | --> | LLM (Generator) | --> |    Response     |
+-----------------+     +-----------------+     +-----------------+      +-----------------+
                                                         ^
                                                         |
                                                         +-----------------+
                                                         | External Sources |
                                                         +-----------------+

```


**3. Flashcards:**

Q: What does RAG stand for?
A: Retrieval-Augmented Generation

Q: What is the core idea behind RAG?
A: Combining information retrieval with LLM generation.

Q: What problem does RAG address?
A: Inaccuracies and outdated information in LLM-generated text.

Q: What is the role of the "Retriever" in RAG?
A: To find relevant information from external sources based on the user prompt.

Q: What are examples of external sources used in RAG?
A: Databases, websites, knowledge graphs.

Q: How does RAG improve LLM outputs?
A: By grounding the generated text in external evidence.

Q: What is a "prompt" in the context of LLMs?
A: The user's input or query to the LLM.

Q: What is the "Generation" part of RAG?
A: The LLM generating text based on the retrieved context and the prompt.

Q: What is a key difference between traditional LLMs and RAG?
A: RAG uses external information, while traditional LLMs rely solely on internal knowledge.

Q: Who presented the anecdote about children's questions and LLMs in this context?
A: Marina Danilevsky, Senior Research Scientist at IBM Research.


**4. MCQs:**

Q1: What does the "R" in RAG stand for?
a) Recursive b) Retrieval ✅ c) Recurrent d) Real-time

Q2: RAG primarily aims to improve which aspect of LLMs?
a) Speed b) Creativity c) Accuracy ✅ d) Size

Q3: The "Retriever" component in RAG interacts with:
a) Only the LLM b) Only the user prompt c) External sources ✅ d) Internal LLM parameters

Q4: What is passed to the LLM in RAG?
a) Only the user prompt b) Only retrieved context c) Both the prompt and retrieved context ✅ d) Neither the prompt nor the context

Q5:  What is the final output of the RAG framework?
a) Retrieved context b) User prompt c) LLM-generated response ✅ d)  List of external sources


Q6: Which issue with LLMs does RAG aim to solve?
a) Slow response times b) Difficulty understanding complex prompts c)  Hallucinations and outdated information ✅ d) Limited vocabulary

Q7: What is a benefit of using RAG with LLMs?
a) Reduced computational cost b) Increased creativity c) Improved factual accuracy ✅ d) Simplified model training


Q8: Which of these is NOT a component of RAG?
a) Retriever b) Generator c) Translator ✅ d) External sources


Q9: Marina Danilevsky's anecdote about children's questions highlights:
a) The speed of LLMs b) The limitations of current LLM knowledge ✅ c) The complexity of user prompts d) The need for more powerful hardware

Q10: In RAG, the LLM is instructed to:
a) Generate text immediately b) Retrieve relevant content first ✅ c) Ignore the user prompt d) Focus only on internal knowledge


**5. External Links:**

1. [Retrieval-Augmented Generation (RAG): https://www.promptingguide.ai/techniques/rag](https://www.promptingguide.ai/techniques/rag)
2. [LangChain for LLM Application Development: https://python.langchain.com/en/latest/index.html](https://python.langchain.com/en/latest/index.html)  (Often used to implement RAG)
3. [Haystack: An open-source NLP framework: https://haystack.deepset.ai/](https://haystack.deepset.ai/) (Includes RAG functionality)
4. [LlamaIndex: Connect LLMs to your data: https://gpt-index.readthedocs.io/en/latest/](https://gpt-index.readthedocs.io/en/latest/)  (Another tool for building RAG applications)
5. [IBM Research: https://research.ibm.com/](https://research.ibm.com/) (Explore more about IBM's work in AI and LLMs) 


✅ Saved 


In [65]:
html_file = input("Do you want to save as HTML? (y/n): ")
if html_file.lower() == 'y':
    save_html(output)

✅ Saved 


### ✅ Summary of Key Concepts Used

---

This notebook uses several GenAI concepts:

- **Few-shot prompting**: Guided the Gemini model with three worked examples so that it generates structured summaries, ASCII diagrams, flashcards, MCQs, and links from video transcripts.
- **Document understanding**: Processed and analyzed YouTube video transcripts to extract key information for educational content creation.
- **Long context window**: Relied on Gemini's long context window to fit the few-shot examples and the retrieved transcript chunks in a single prompt.
- **Gen AI evaluation**: Assessed the quality of the generated summaries, flashcards, and MCQs by manual review of the output; the notebook implements no automated evaluation.
- **Retrieval augmented generation (RAG)**: Retrieved the most relevant transcript chunks and passed them to Gemini so that the learning materials are grounded in the video's content.
- **Vector search/vector store/vector database**: Used FAISS to store and search transcript embeddings for efficient retrieval of relevant content.
- **Embeddings**: Converted transcript chunks into semantic vectors using `sentence-transformers` to enable similarity-based retrieval for RAG.

---
