In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

gemini_long_context_path = kagglehub.competition_download('gemini-long-context')
beastgokul_aimo_book_path = kagglehub.dataset_download('beastgokul/aimo-book')
packagemanager_pm_68852765_at_12_04_2024_04_06_22_path = kagglehub.notebook_output_download('packagemanager/pm-68852765-at-12-04-2024-04-06-22')
google_gemini_1_5_flash_api_api_gemini_1_5_flash_1_path = kagglehub.model_download('google/gemini-1.5-flash-api/Api/gemini-1.5-flash/1')

print('Data source import complete.')


# **Sherlock Holmes X Gemini: The Case of the Missing Formula**

This notebook discusses how **Sherlock Holmes** uses the powerful **Gemini 1.5** to handle large and complex datasets. With the ability to process **2 million tokens**, **Gemini** becomes an invaluable tool for Sherlock in solving challenging problems and organizing vast amounts of unstructured data.



[![Gemini 1.5 Overview](https://img.youtube.com/vi/uGq5UEGfE4o/0.jpg)](https://www.youtube.com/watch?v=uGq5UEGfE4o)

🔗 [Click here to watch the video on YouTube](https://www.youtube.com/watch?v=uGq5UEGfE4o) 🚀


Sherlock Holmes sat at his desk, surrounded by a disorganized pile of papers, books, and notes. A new case had arrived—a renowned mathematician had vanished, and with him, a groundbreaking formula. But there was a twist: the only clue left behind was a series of unsolved **Olympiad math problems**.

"*Watson, we need that dataset,*" Sherlock muttered. "*And we need it fast!*"


The **Olympiad dataset** was scattered across various sources—**PDFs**, handwritten notes, and even a **YouTube playlist**. Manually sorting through this mess would take *too long*.

"*Time is of the essence,*" Sherlock said, as he turned to **Gemini 1.5**.


## **The Problem: Too Much Data, Too Little Time**
Sorting through mountains of data can feel impossible. Whether it's:


* PDFs full of useful information but hard to navigate
* Hours of YouTube videos with no easy way to find key points
* Scattered notes that don’t seem to connect.

Doing this manually takes forever and leads to frustration. That’s where Gemini 1.5 comes in—a tool designed to quickly organize and analyze large amounts of information. It makes finding what you need faster and easier.



## **The Discovery: Meet Gemini 1.5 🚀**
Context has come a "long" way with Gemini 1.5 🌌✨.

Handling a large amount of data has always been a challenge for AI models. Until recently, most Large Language Models (LLMs) could only work with about 100k tokens at a time. This was fine for smaller tasks, but it quickly became a problem as datasets grew larger and more complex.

###  **What Makes Gemini 1.5 Different?**

Gemini 1.5 changes the game by raising its context window to 2 million tokens, roughly 20 times the ~100k-token limit mentioned above. This means it can process and "remember" much larger datasets without relying on extra tools like:

- RAG (Retrieval-Augmented Generation), or
- Vector databases to retrieve information.

By keeping everything in the context window, Gemini 1.5 simplifies workflows and handles even massive datasets with ease. It’s the perfect solution for anyone managing complex data, like Sherlock!

<img src="https://i.ibb.co/VM1PDk3/Picsart-24-12-02-15-04-21-486.png" alt="Picsart-24-12-02-15-04-21-486" border="0">

---


### **Data Types Processed by Gemini**

**Gemini** processes a wide variety of data types, covering the file formats most common in daily work.

- **PDF**: Portable Document Format (`application/pdf`), commonly used for reports, articles, and documents.
- **JavaScript**: Scripts for web applications (`application/x-javascript`, `text/javascript`).
- **Python**: Python code files (`application/x-python`, `text/x-python`), often used for programming and data analysis.
- **TXT**: Plain text files (`text/plain`), typically used for storing raw data or notes.
- **HTML**: HyperText Markup Language files (`text/html`), used for creating web pages.
- **CSS**: Cascading Style Sheets files (`text/css`), used for styling web pages.
- **Markdown**: Markup language for creating formatted text (`text/md`), popular for documentation.
- **CSV**: Comma-Separated Values files (`text/csv`), widely used for data storage and analysis.
- **XML**: Extensible Markup Language files (`text/xml`), used for data storage and transport.
- **RTF**: Rich Text Format files (`text/rtf`), commonly used for text documents with basic formatting.

This diverse set of formats shows how Gemini is capable of handling both structured and unstructured data, making it an ideal solution for various applications across different industries.
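
To make the list above concrete, here is a minimal sketch of a lookup from file extension to the MIME type passed at upload time. The MIME strings come from the list; the extension keys and the helper name `mime_type_for` are assumptions made for illustration and are not part of any SDK.

```python
from pathlib import Path

# MIME types taken from the list above; the extension keys are an assumption.
GEMINI_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".js": "text/javascript",
    ".py": "text/x-python",
    ".txt": "text/plain",
    ".html": "text/html",
    ".css": "text/css",
    ".md": "text/md",
    ".csv": "text/csv",
    ".xml": "text/xml",
    ".rtf": "text/rtf",
}


def mime_type_for(path):
    """Return the MIME type for a file path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()
    return GEMINI_MIME_TYPES.get(suffix)


print(mime_type_for("aimo/problems.pdf"))
print(mime_type_for("transcripts.txt"))
```

A `None` result is a cue to skip the file or convert it before uploading.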



# **Solving the Chaos: How Sherlock Uses Gemini 1.5**
With Gemini 1.5’s advanced capabilities, Sherlock was no longer bogged down by the overwhelming data. Instead of relying on inefficient methods, Sherlock could directly upload his raw PDFs and YouTube transcriptions, and within seconds, Gemini 1.5 transformed this chaos into organized, actionable data.

## **Here’s how Sherlock used Gemini 1.5 to solve his problem**:
---

<a href="https://ibb.co/4tQJ4Zp"><img src="https://i.ibb.co/jfpH3vD/Screenshot-2024-12-04-083515.png" alt="Screenshot-2024-12-04-083515" border="0"></a>

The flowchart illustrates how raw data from different sources—PDFs, YouTube Playlist, and text files—can be processed efficiently using Gemini 1.5. It emphasizes:



1. **Diverse Data Sources**: Demonstrates handling various formats (e.g., PDFs, videos, text files) in one pipeline.
2. **Centralized Processing**: Shows how Gemini 1.5 processes the data to streamline the workflow.
3. **Efficiency with Caching**: The processed data is stored in a cache for quick access, reducing redundant computations.
4. **Structured Output**: The result is a well-organized Dataset, ready for analysis or further use.



---
###  **1. Setup: Connecting APIs and Configuring Keys**
Before diving into data processing, let's set up the environment with the required libraries and API keys. This section connects to the YouTube Data API (used to list playlist videos), imports `youtube_transcript_api` for transcript retrieval, and configures Gemini 1.5 through the `google.generativeai` SDK.

In [None]:
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi
import re
import random
import google.generativeai as genai



from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

API_KEY = user_secrets.get_secret("YOUTUBE_API_KEY")
GEMINI_API_KEY = user_secrets.get_secret("GEMINI_API_KEY")
genai.configure(api_key = GEMINI_API_KEY)
youtube = build('youtube', 'v3', developerKey=API_KEY)

### **2. PDF Upload and Processing**  
Gemini 1.5 simplifies the process of handling PDFs. Instead of manually reading and organizing the data, Sherlock can directly upload PDFs, and Gemini 1.5 processes them automatically, extracting the relevant information with ease.

#### **Steps for PDF Upload and Processing:**

1. **Upload PDF File**  
   - Use the `upload_pdf_to_gemini()` function to upload a PDF file. This function takes the file path and MIME type (which defaults to `'application/pdf'`). Once uploaded, it returns details about the file, including a URI to access it.

2. **Monitor File Processing**  
   - The `wait_for_files_active()` function checks the processing status of uploaded files. It ensures that files are fully processed and ready to use. The function waits for each file to move from the **PROCESSING** state to the **ACTIVE** state, re-checking the file's status every 10 seconds. If a file ends up in any state other than **ACTIVE**, the function raises an error.


In [None]:
import time

# Function to upload a PDF file to Gemini
def upload_pdf_to_gemini(path, mime_type='application/pdf'):
    file = genai.upload_file(path, mime_type=mime_type)
    print(f"Uploaded file '{file.display_name}' as: {file.uri}")
    return file

# Function to wait for PDFs to be processed and become active
def wait_for_files_active(files):
    print("Waiting for file processing...")
    for file in files:
        while file.state.name == "PROCESSING":
            print(".", end="", flush=True)
            time.sleep(10)
            file = genai.get_file(file.name)
        if file.state.name != "ACTIVE":
            raise Exception(f"File {file.name} failed to process")
    print("...all files ready!")
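
The loop above polls for as long as a file stays in the PROCESSING state. As a hedged variant, the sketch below adds an overall timeout. The helper `wait_until_active` and its parameters are hypothetical. It takes any zero-argument callable that returns the current state name, so the polling logic can be exercised without network access.

```python
import time


def wait_until_active(get_state, timeout_s=300.0, poll_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "ACTIVE".

    Raises RuntimeError on any state other than PROCESSING/ACTIVE and
    TimeoutError if the deadline passes while still PROCESSING.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state != "PROCESSING":
            raise RuntimeError(f"Unexpected file state: {state}")
        if clock() >= deadline:
            raise TimeoutError(f"Still PROCESSING after {timeout_s} seconds")
        sleep(poll_s)


# Simulated state sequence; no real upload is involved here.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_active(lambda: next(states), sleep=lambda seconds: None))
```

With the SDK calls used in this notebook, the callable could be `lambda: genai.get_file(file.name).state.name`.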

In [None]:
import os
pdf_folder = '/kaggle/input/aimo-book/aimo'

# Keep only the PDF files in the folder
pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]

print(f"PDFs to be uploaded: {pdf_files}")

### **3. YouTube Playlist Integration**  
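Sherlock fetched the existing English transcripts for every video in a YouTube playlist using the YouTube Data API and `youtube_transcript_api`, saving time on manual transcription. The transcripts are later passed to **Gemini 1.5** together with the PDF text.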

#### **Steps for Integration:**

1. **Fetch Video IDs and Titles:**  
   Use the `get_videos_from_playlist()` function to retrieve video IDs and titles from a playlist.

2. **Fetch Video Transcripts:**  
   The `get_video_transcript()` function retrieves and cleans the transcript for each video.

3. **Save Transcripts to File:**  
   The code saves each video’s URL and its transcript to a file named `transcripts.txt`.

In [None]:
# Function to get video IDs and Titles from the AIMO Playlist
def get_videos_from_playlist(playlist_id):
    videos = []
    next_page_token = None

    while True:
        request = youtube.playlistItems().list(
            part='contentDetails,snippet',
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page_token
        )
        response = request.execute()

        for item in response['items']:
            video_id = item['contentDetails']['videoId']
            title = item['snippet']['title']
            videos.append({'video_id': video_id, 'title': title})

        next_page_token = response.get('nextPageToken')

        if not next_page_token:
            break

    return videos

# Let's get all the videos from the Playlist
videos = get_videos_from_playlist("PLWg28JU8y6C9MG7A_GrWcBg8dH2fKOEDB")
print(f"Total videos found: {len(videos)}")

In [None]:
# Function to fetch transcript for a single video
def get_video_transcript(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        transcript = transcript_list.find_transcript(['en'])
        transcript_data = transcript.fetch()
        text = ' '.join([item['text'] for item in transcript_data])
        # Clean the text
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    except Exception:
        # No English transcript, or the lookup failed: an empty string marks it as unavailable
        return ""


# Function to get the video URL
def get_video_url(video_id):
    return f"https://www.youtube.com/watch?v={video_id}"

# Retrieve all transcripts and write to a file
output_file = 'transcripts.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for idx, video in enumerate(videos):
        video_id = video['video_id']
        video_title = video['title']
        video_url = get_video_url(video_id)

        transcript = get_video_transcript(video_id)

        if transcript:
            f.write(f"{video_url}\n")
            f.write(f"{transcript}\n\n")
        else:
            f.write(f"{video_url}\n")
            f.write("Transcript not available.\n\n")
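
The file written above has a simple layout: a URL line, a transcript line, then a blank line. Because the transcripts are whitespace-collapsed into a single line, the file can be read back by splitting on blank lines. The sketch below assumes exactly that layout; `parse_transcripts` is a hypothetical helper, not part of the original notebook.

```python
def parse_transcripts(text):
    """Parse the transcripts.txt layout into a list of {"url", "transcript"} dicts.

    Videos recorded as "Transcript not available." get transcript=None.
    """
    records = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().split("\n", 1)
        if len(lines) != 2:
            continue  # skip malformed or empty blocks
        url, transcript = lines[0].strip(), lines[1].strip()
        if transcript == "Transcript not available.":
            transcript = None
        records.append({"url": url, "transcript": transcript})
    return records


# Made-up sample in the same layout as transcripts.txt
sample = (
    "https://www.youtube.com/watch?v=abc\n"
    "hello world\n\n"
    "https://www.youtube.com/watch?v=def\n"
    "Transcript not available.\n\n"
)
records = parse_transcripts(sample)
missing = [r["url"] for r in records if r["transcript"] is None]
print(f"{len(records)} videos, {len(missing)} without a transcript")
```

Counting the missing transcripts this way shows how much of the playlist actually reaches the model.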

## **A Milestone in AI: Gemini 1.5 vs Conventional LLMs**

This milestone is significant because traditional LLMs were limited to **100k tokens** for context, which meant relying on external systems like **vector stores** for large datasets. In contrast, **Gemini 1.5** can handle **2 million tokens**, enabling it to store all relevant context directly in memory—eliminating the need for third-party systems.

### **See the comparison below:**

| **Feature**         | **Conventional LLMs**    | **Gemini 1.5 Pro**            |
|---------------------|--------------------------|------------------------------|
| **Token Capacity**   | ~100k Tokens             | **2 Million Tokens**         |
| **External Data**    | Requires Vector Store    | **No External Data Needed**  |
| **Data Handling**    | Limited Context          | **Memory-Optimized**         |

With **Gemini 1.5**, Sherlock could now process entire datasets in-memory, making data retrieval faster and more accurate.


<a href="https://ibb.co/JnPVGcr"><img src="https://i.ibb.co/wy1VPdp/Screenshot-2024-11-30-203818.png" alt="Screenshot-2024-11-30-203818" border="0"></a>

In [None]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')

In [None]:
# Read the contents of transcripts.txt
with open('transcripts.txt', 'r', encoding='utf-8') as file:
    contents = file.read()

response = model.count_tokens(contents)
print(f"Token Count of Youtube Playlist Videos: {response.total_tokens}")
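
Once a token count is known, it can be compared against the context window before sending anything. The sketch below is plain arithmetic, assuming the 2-million-token figure quoted in this notebook as the default limit. `fits_in_context` and the `reserve_for_output` value are hypothetical, and the limit varies by model variant, so pass the one that applies.

```python
def fits_in_context(total_tokens, context_limit=2_000_000, reserve_for_output=8_192):
    """Return True if the input plus a reserved output budget fits in the window.

    context_limit defaults to the 2M figure quoted in this notebook;
    reserve_for_output is an illustrative assumption, not an API constant.
    """
    return total_tokens + reserve_for_output <= context_limit


for count in (150_000, 1_995_000):
    print(count, fits_in_context(count))
```

In the notebook, `response.total_tokens` from `model.count_tokens(...)` would be the value passed in.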

### **PDF Text Extraction**  
Sherlock used this approach to extract text from multiple PDF files automatically.

#### **Steps for PDF Extraction:**

1. **Extract Text from Each PDF:**  
   The `extract_text_from_pdf()` function opens a PDF file, reads its pages, and extracts text. If text is found, it is added to a string. If an error occurs during reading, an empty string is returned.

2. **Process Multiple PDFs:**  
   The script scans a directory of PDF files, using `glob` to find all `.pdf` files. For each PDF, it extracts the text and appends it to a combined string (`pdf_text`).

In [None]:
import PyPDF2
import glob
import os
from PyPDF2.errors import PdfReadError

def extract_text_from_pdf(pdf_path):
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            extracted_text = ""
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text:
                    extracted_text += text
            return extracted_text
    except PdfReadError as e:
        return ""
    except Exception as e:
        return ""

# Directory with all PDF files
pdf_directory = '/kaggle/input/aimo-book/aimo/*.pdf'

# Initialize a string to hold combined text from all PDFs
pdf_text = ""

# Loop through each PDF file in the directory
for pdf_file_path in glob.glob(pdf_directory):
    if os.path.isfile(pdf_file_path):
        text = extract_text_from_pdf(pdf_file_path)
        pdf_text += text

print("Extraction complete.")
pdf_text += contents
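
Because `extract_text_from_pdf()` returns an empty string on any error, a failed or image-only PDF disappears silently from the combined text. A minimal sketch of one way to surface that is shown below. `extract_all` is a hypothetical helper that accepts any extractor callable, so it is demonstrated here with a stand-in dictionary rather than real PDFs.

```python
def extract_all(paths, extract):
    """Run extract(path) on each path in sorted order.

    Returns (combined_text, empty_paths), where empty_paths lists the
    files that produced no text.
    """
    parts = []
    empty_paths = []
    for path in sorted(paths):  # sorted for a reproducible order
        text = extract(path)
        if text.strip():
            parts.append(text)
        else:
            empty_paths.append(path)
    return "\n".join(parts), empty_paths


# Stand-in extractor for illustration; real use would pass extract_text_from_pdf.
fake_texts = {"a.pdf": "one", "b.pdf": "", "c.pdf": "three"}
combined, empty = extract_all(fake_texts.keys(), fake_texts.get)
print(f"{len(combined)} characters extracted; no text from: {empty}")
```

In the notebook this could be called as `extract_all(glob.glob(pdf_directory), extract_text_from_pdf)`.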

In [None]:
response = model.count_tokens(pdf_text)
print(response)

In [None]:
# Save the combined PDF and transcript text for upload
with open('textbook.txt', "w", encoding="utf-8") as f:
    f.write(pdf_text)

### **File Upload and Processing with Gemini 1.5**

Sherlock used **Gemini 1.5** to upload and process text files. Here's how it works:

#### **Steps for File Upload and Processing:**

1. **Upload Text File:**  
   The file is uploaded to Gemini using `genai.upload_file()`. This function takes the file path as input and returns a file object.

2. **Wait for File Processing:**  
   The script waits for the file to finish processing. It repeatedly checks the file's state using `file.state.name` until it transitions from **PROCESSING** to **ACTIVE**.

3. **Notify When Processing is Complete:**  
   Once the file is processed, the URI of the processed file is printed.

In [None]:
file = genai.upload_file("textbook.txt")

# Poll until the uploaded file leaves the PROCESSING state
while file.state.name == 'PROCESSING':
    print('Waiting for text file to be processed.')
    time.sleep(2)
    file = genai.get_file(file.name)

print(f'Text processing complete: {file.uri}')

## **The Magic of Context Caching ✨**

In the fast-paced world of AI, processing the same data repeatedly wastes valuable time and computational resources. Enter **context caching**, a revolutionary feature in **Gemini 1.5** that allows you to **reuse structured content** without recalculating it every time.

With **context caching**, Gemini 1.5 transforms into a **super-efficient powerhouse**. By storing data and instructions in reusable caches, you can skip the repetitive steps and focus on what matters: generating intelligent, high-quality results.

---

## **How It Works: Implementing Context Caching in Gemini 1.5**

Implementing context caching allows you to store processed content and reuse it efficiently. Here's how it's done:

### **Step 1: Create a Cached Content Object**

1. **Model**: Set the model to `"gemini-1.5-flash-002"`.
2. **Unique Identifier**: Use a `display_name` (e.g., 'pdf_file') for cache identification.
3. **System Instructions**: Define the task or behavior of the AI (e.g., generating questions from the text).
4. **Content**: Add the data you want to store in the cache.
5. **TTL (Time-to-Live)**: Specify how long the cache remains active.

In [None]:
from google.generativeai import caching
import datetime


cache = caching.CachedContent.create(

    model="gemini-1.5-flash-002",

    display_name='pdf_file', # used to identify the cache

    system_instruction=(

        "You are a highly skilled educational content creator specializing in generating challenging "
        "numerical problems from provided texts. Your task is to analyze the text and generate 10 unique "
        "questions related to numerical calculations, formulas, or concepts. Each question should be accompanied "
        "by the correct answer and an explanation of the steps involved in reaching the solution. "
        "Ensure that each generated problem is clear and concise. Provide explanations for the solutions with a "
        "focus on logical steps, intermediate calculations, and clear formulas. Each batch of generated data should "
        "follow this structure for all 10 questions: Question, Answer, Explanation. Avoid unnecessary content and keep the focus on generating high-quality educational material.\n\n"

  """Topic,Question,Answer,Explanation
"Geometry","What is the area of a rectangle with length 12 cm and width 8 cm?","96 cm²","Area of a rectangle = length × width. 12 × 8 = 96 cm²."
"Geometry","What is the circumference of a circle with a radius of 7 cm?","44 cm (approx.)","Circumference of a circle = 2 × π × radius. Using π ≈ 3.14, 2 × 3.14 × 7 ≈ 44 cm."
"Geometry","What is the volume of a cube with a side length of 5 cm?","125 cm³","Volume of a cube = side³. 5³ = 125 cm³."
"Geometry","If a triangle has a base of 10 cm and a height of 6 cm, what is its area?","30 cm²","Area of a triangle = 0.5 × base × height. 0.5 × 10 × 6 = 30 cm²."
"Geometry","What is the length of the hypotenuse in a right triangle with legs of 6 cm and 8 cm?","10 cm","Using the Pythagorean theorem: hypotenuse² = 6² + 8². √(36 + 64) = 10 cm."
Topic,Question,Answer,Explanation
"Percentages","What is 25% of 80?","20","25% of 80 is calculated as 0.25 * 80 = 20."
"Percentages","If a product costs $120 after a 20% discount, what was its original price?","$150","If $120 is 80% of the original price, then the original price = $120 / 0.8 = $150."
"Percentages","How much is 15% of 240?","36","15% of 240 is calculated as 0.15 * 240 = 36."
"Percentages","What is the increase from 70 to 98 as a percentage?","40%","Percentage increase = ((98 - 70) / 70) * 100 = 40%."
"Percentages","What is 10% of 350 minus 5% of 200?","15","10% of 350 is 35, and 5% of 200 is 10. The difference is 35 - 10 = 15."
"""
    ),

    contents=[file],

    ttl=datetime.timedelta(minutes=630),

)
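
The TTL above is given in minutes, which makes it easy to misread. The short check below converts it to hours and compares it with the 10-hour time limit used by the batching code in this notebook. That the TTL was chosen to outlast the batch run is an inference, not something the notebook states.

```python
import datetime

cache_ttl = datetime.timedelta(minutes=630)              # value passed to the cache
batch_time_limit = datetime.timedelta(seconds=10 * 3600)  # limit used by the batch run

margin = cache_ttl - batch_time_limit
print(f"Cache TTL: {cache_ttl.total_seconds() / 3600:.1f} h")  # Cache TTL: 10.5 h
print(f"Margin over the batch time limit: {margin}")           # 0:30:00
```

If the batch time limit is raised, the TTL needs to grow with it, or later requests will reference a cache that has already expired.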

### **Step 2: Leverage the Cached Content**

Once the cache is created, it can be used to initialize a Generative Model. This model accesses the cached context, allowing for rapid and efficient output generation without reprocessing the content every time.

By retrieving the preprocessed content directly from the cache, this method significantly reduces runtime and boosts efficiency.

In [None]:
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

### **Why Do We Need Batching?**

When working with large amounts of text, processing everything at once can overwhelm the AI model due to **token limits**. Batching helps us break down large datasets into manageable chunks, ensuring we can handle huge amounts of data without exceeding these limits. 🚀

> By using batching, we can process large texts efficiently and still maintain high performance.
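
Batching by a fixed number of characters is the simplest approach, but it can cut a word or a formula in half. As a hedged alternative, the sketch below prefers to break at the last whitespace inside each window and falls back to a hard cut when there is none. `chunk_on_whitespace` is a hypothetical helper, and sizes are measured in characters, not tokens.

```python
def chunk_on_whitespace(text, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Breaks after the last space or newline inside each window when one
    exists; otherwise cuts at exactly chunk_size. Joining the chunks
    reproduces the original text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks


print(chunk_on_whitespace("alpha beta gamma delta", 12))
# ['alpha beta ', 'gamma delta']
```

Because the chunks concatenate back to the input, no text is lost or duplicated between batches.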

In [None]:
import time
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """
    Generate dataset from a single chunk using caching and system instructions.
    """
    try:
        prompt = f"""
        Create 10 challenging numerical problems based on the following text:

        Text:
        {chunk}

        Each problem should include:
        - Topic
        - Question
        - Answer
        - Explanation

        Respond in a structured plain text format:
        1. Topic: <topic>
           Question: <question>
           Answer: <answer>
           Explanation: <explanation>

        Repeat for all 10 problems.
        """
        response = model.generate_content(prompt)
        return response.text  # Return plain text response
    except Exception as e:
        print(f"Error processing chunk: {e}")
        return ""  # Return empty string on error

def process_text_in_batches(full_text, chunk_size, time_limit, output_file, max_workers=5):
    """
    Process text in parallel batches with a time limit and save responses to a text file.
    """
    start_time = time.time()
    chunks = [full_text[i:i+chunk_size] for i in range(0, len(full_text), chunk_size)]

    with open(output_file, "a", encoding="utf-8") as f:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Submit max_workers chunks at a time so the requests in a batch run concurrently
            for start in range(0, len(chunks), max_workers):
                elapsed_time = time.time() - start_time
                if elapsed_time > time_limit:
                    print(f"Time limit reached. Stopping after {start} chunks.")
                    break

                batch = chunks[start:start + max_workers]
                print(f"Processing chunks {start+1}-{start+len(batch)}/{len(chunks)}...")
                # executor.map yields results in submission order
                for i, chunk_output in enumerate(executor.map(process_chunk, batch), start=start):
                    if chunk_output.strip():  # Check if the response is not empty
                        f.write(f"Chunk {i+1}:\n")
                        f.write(chunk_output)
                        f.write("\n\n")  # Separate chunks with a blank line

# Main script
chunk_size = 10000  # Adjust chunk size as needed
time_limit_seconds = 10 * 3600  # 10 hours in seconds
output_file = "dataset.txt"

start_time = time.time()
process_text_in_batches(pdf_text, chunk_size, time_limit_seconds, output_file)

elapsed_time = time.time() - start_time
print(f"Script completed in {elapsed_time / 3600:.2f} hours.")
print(f"Dataset saved to '{output_file}'.")
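
The output file holds plain text in the `Topic / Question / Answer / Explanation` layout requested by the prompt. To turn it into records, a small parser along the following lines may help. This is a sketch that assumes the model followed the requested layout exactly; replies with markdown emphasis or different labels would need extra handling. `parse_problems` is a hypothetical helper.

```python
import re

# Matches e.g. "1. Topic: Geometry" or "   Answer: 96"
FIELD_RE = re.compile(r"^\s*(?:\d+\.\s*)?(Topic|Question|Answer|Explanation):\s*(.*)$")
# Matches the "Chunk N:" headers written by the batching code
CHUNK_RE = re.compile(r"^Chunk \d+:\s*$")


def parse_problems(text):
    """Parse the plain-text problem format into a list of dicts."""
    problems = []
    current = None
    last_field = None
    for line in text.splitlines():
        if CHUNK_RE.match(line):
            continue
        match = FIELD_RE.match(line)
        if match:
            field, value = match.group(1).lower(), match.group(2).strip()
            if field == "topic":  # a Topic line starts a new problem
                current = {"topic": value}
                problems.append(current)
            elif current is not None:
                current[field] = value
            last_field = field if current is not None else None
        elif line.strip() and current is not None and last_field:
            # Continuation of a multi-line field
            current[last_field] += " " + line.strip()
    return problems


# Made-up sample in the requested layout
sample = """Chunk 1:
1. Topic: Geometry
   Question: What is the area of a 12 cm by 8 cm rectangle?
   Answer: 96
   Explanation: Multiply length
   by width.

2. Topic: Percentages
   Question: What is 25% of 80?
   Answer: 20
   Explanation: 0.25 * 80 = 20.
"""
for problem in parse_problems(sample):
    print(problem["topic"], "->", problem["answer"])
```

The resulting list of dictionaries can be written out with `csv.DictWriter` or loaded into a DataFrame.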


### **⚙️ Transforming Chaos into Order**

Sherlock uploaded the *unstructured data* into **Gemini 1.5**, instructing it to process the **PDFs**, **text files**, and **videos**. In moments, **Gemini 1.5** organized the data into a neat, structured format—**questions**, **answers**, and **explanations** were now clearly categorized.

With this newfound order, Sherlock could now focus on solving the case.

---

### **💡 The Breakthrough**

With the **dataset** now organized, Sherlock quickly analyzed the problems. The **missing formula** was hidden within the questions. By cross-referencing the problems and solutions, he discovered a hidden **code** that pointed to the missing mathematician’s location.

"*Watson, we’ve cracked it!*" Sherlock exclaimed. "*The formula was the key all along!*"


# **Why Is Gemini 1.5 a Game-Changer?**

In today’s fast-paced world, where data is growing exponentially, the need for automated systems that can process and organize large volumes of information efficiently is critical. **Gemini 1.5** revolutionizes the way we interact with data, especially in scenarios involving unstructured and semi-structured data, like PDFs, YouTube playlists, and text files.

Before Gemini 1.5, Sherlock Holmes struggled to manage vast amounts of unprocessed data. He had to rely on manual methods and external tools, which were not only time-consuming but also prone to human error. With **Gemini 1.5’s 2 million token capacity**, Sherlock was able to store relevant context in memory, eliminating the need for external databases and drastically improving data retrieval times. This made the entire process of organizing, structuring, and generating insights from raw data faster and more accurate.

By integrating **context caching** and leveraging the full potential of **Gemini 1.5**, Sherlock streamlined his entire workflow, achieving a level of efficiency that would have been impossible with traditional methods. The ability to handle millions of tokens of data directly in memory transformed Sherlock’s approach to problem-solving, allowing him to tackle even the most complex datasets with ease.

In essence, **Gemini 1.5** not only saved time but also enhanced Sherlock’s capability to extract actionable insights, making it an indispensable tool for any modern data-driven process. With **Gemini 1.5**, we see how the future of AI-powered research and data processing lies in systems that are not just reactive, but proactive, turning vast amounts of raw data into usable, organized knowledge almost instantaneously. 🌟



---
### **🎉 The Resolution**

Thanks to **Gemini 1.5**, Sherlock was able to piece together the clues and find the missing mathematician. The **crime** was solved, and the formula was recovered.

---

# **Conclusion**

As the data transformed from a jumbled mess into neat, structured insights, Sherlock couldn’t help but let out a satisfied chuckle. "**Well, Watson,**" he said, "**it seems I’ve solved a case that doesn’t require a magnifying glass. Just a little help from technology.**"


This story demonstrates the **power of Gemini 1.5** in transforming *raw, unstructured data* into **valuable, structured insights**. Whether solving a crime or tackling a complex dataset, **organization** and **structure** are key to success. **Gemini 1.5** is the tool you need to turn **chaos** into **clarity**!


<a href="https://ibb.co/4gw5WCh"><img src="https://i.ibb.co/Vp0PVbf/Leonardo-Phoenix-Create-an-image-of-cartoon-character-aka-bene-3.jpg" alt="Leonardo-Phoenix-Create-an-image-of-cartoon-character-aka-bene-3" border="0"></a>