<a href="https://www.kaggle.com/code/rohandwivedi2005/capstone?scriptVersionId=234995167" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Teacher for google-genai SDK

### **This chatbot is for developers who want to learn the new google-genai SDK or switch from the old SDK to the new one**

## Scraping Data

First, we scrape the google-genai documentation website for up-to-date information about the SDK. We do this with `beautifulsoup4` (Beautiful Soup), a Python library for web scraping that extracts data from HTML files.
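
To make the "HTML in, plain text out" step concrete, here is a minimal, dependency-free sketch of the idea using only the standard library. It is not Beautiful Soup's implementation; the class `TextExtractor` and the helper `html_to_text` are hypothetical names made up for this illustration, and skipping `<script>`/`<style>` content is a choice made here, not something the notebook's scraper is confirmed to do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text nodes, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the page text as one line with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample = "<html><head><title>T</title><style>p{}</style></head><body><h1>Hello</h1><p>World  \n again</p></body></html>"
print(html_to_text(sample))  # → T Hello World again
```

The final `" ".join(text.split())` step mirrors the whitespace clean-up the scraper below applies before writing each page to the output file.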

In [1]:
pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import json
import os
import re

In [3]:
URL_1 = "https://googleapis.github.io/python-genai/"
URL_2 = "https://googleapis.github.io/python-genai/genai.html"
LIST = [URL_1,URL_2]
OUTPUT_FILE = "GENAI_SDK_DOCS.txt"
REQUEST_DELAY = 1
HEADERS = {'User-Agent': 'SimpleScraper/1.0'}



output_path = os.path.join("/kaggle/working/", OUTPUT_FILE)
print(f"Saving data to: {output_path}")
print("-" * 30)

try:
    with open(output_path, 'w', encoding='utf-8') as f:
        for current_url in LIST:
            time.sleep(REQUEST_DELAY)  # Delay *before* each request
            response = requests.get(current_url, headers=HEADERS, timeout=15)
            response.raise_for_status()  # Raise on HTTP errors (4xx, 5xx)

            soup = BeautifulSoup(response.content, 'html.parser')
            text = soup.get_text(separator=' ', strip=True)
            text = ' '.join(text.split())  # Collapse runs of whitespace
            if text:
                f.write(f"--- URL: {current_url} ---\n")
                f.write(text + "\n")
                f.write("\n")
                print(f"  -> Text extracted and saved for {current_url}")
            else:
                print(f"  -> No significant text content found for {current_url}. Skipping.")

# Most specific handlers first: RequestException is a subclass of IOError,
# and IOError is a subclass of Exception.
except requests.exceptions.RequestException as e:
    print(f"  -> Network/HTTP Error processing {current_url}: {e}")
except IOError as e:
    print(f"FATAL ERROR: Could not open or write to output file {output_path}: {e}")
except Exception as e:
    print(f"  -> Error processing {current_url}: {e}")

print("-" * 30)
print(f"Crawling finished.")
print(f"Data saved in Kaggle environment at: {output_path}")



Saving data to: /kaggle/working/GENAI_SDK_DOCS.txt
------------------------------
  -> Text extracted and saved for https://googleapis.github.io/python-genai/
  -> Text extracted and saved for https://googleapis.github.io/python-genai/genai.html
------------------------------
Crawling finished.
Data saved in Kaggle environment at: /kaggle/working/GENAI_SDK_DOCS.txt


In [4]:
PATH = output_path #Store this path for later use

### This is where the file is stored

In [5]:
import os
for dirname, _, filenames in os.walk('/kaggle/working'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/working/__notebook__.ipynb
/kaggle/working/GENAI_SDK_DOCS.txt


## Embeddings Generation

Next, we generate embeddings for the extracted data and store them in a vector database; here we use ChromaDB. But first, we split our text file into smaller chunks that the embedding model can process.

In [6]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"


In [7]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

### Fetch API Key

In [8]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

### Various Models for generating embeddings

In [9]:

client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)
        

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


### Main Function for embeddings generation

This function automatically retries on transient API errors (HTTP 429 and 503). Each time it is called, it takes a batch of text chunks as its argument and returns their embeddings.

In [10]:
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.genai import types
from google.api_core import retry

In [11]:
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True
    

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
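
As a rough illustration of what a predicate-based retry does, the following standalone sketch retries a function with exponential backoff only when the predicate accepts the raised error. It is not the `google.api_core.retry.Retry` implementation, which has its own defaults; `FakeAPIError`, `retry_on`, and all parameter names here are hypothetical and exist only for this example.

```python
import functools
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def retry_on(predicate, max_attempts=5, initial_delay=1.0, multiplier=2.0, sleep=time.sleep):
    """Retry the wrapped function while predicate(exc) is true, backing off between attempts."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-retriable error.
                    if attempt == max_attempts or not predicate(exc):
                        raise
                    sleep(delay)
                    delay *= multiplier  # Exponential backoff

        return wrapper

    return decorator


# Same shape as the notebook's predicate: retry only on rate limits and outages.
is_retriable_demo = lambda e: isinstance(e, FakeAPIError) and e.code in {429, 503}
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets you swap in a no-op or a recorder when experimenting, so the backoff schedule can be inspected without actually waiting.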

### Split the text into smaller chunks

Here we split the text into chunks of 2000 characters with an overlap of 200 characters. Keeping chunks small limits how much irrelevant context is retrieved: if the chunks were large, each retrieved chunk would carry a lot of unnecessary text for a given query. With short chunks, the context stays focused and precise, which helps the model answer the user's question.

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [13]:
with open(PATH, "r", encoding="utf-8") as f:  # Match the encoding used when writing
    text = f.read()

In [14]:
CHUNK_SIZE = 2000 # Max number of characters per chunk
CHUNK_OVERLAP = 200 # Number of characters to overlap between chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.split_text(text)

In [15]:
print(len(chunks))
print(len(chunks[3]))

780
1999
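
To show how chunk size and overlap interact, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, which this version does not attempt. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence that straddles a boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.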


### Create The embeddings database

Finally, we combine everything we have built by creating the vector database. We send the chunks to our embedding function in batches so that we do not make an API call for every chunk. We use a batch size of 100 to keep each request below the function's limit. For more about API calls and token limits, check the [API documentation](https://ai.google.dev/gemini-api/docs/models).

In [16]:
DB_NAME = "googlecardb"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)
current_id_offset = 0
total_chunks = len(chunks)
API_BATCH_SIZE = 100  # Number of chunks sent per embedding request

for i in range(0, total_chunks, API_BATCH_SIZE):
    batch_chunks = chunks[i : i + API_BATCH_SIZE]
    
    batch_ids = [str(current_id_offset + j) for j in range(len(batch_chunks))]
    current_id_offset += len(batch_chunks)

    db.add(documents=batch_chunks,ids=batch_ids)
    

In [17]:
db.count()

780
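
The batching loop above can also be written as a small generator that yields each batch together with its sequential string IDs. This is just one possible way to structure it; `batched_with_ids` is a hypothetical helper and is not part of ChromaDB.

```python
def batched_with_ids(items, batch_size):
    """Yield (ids, batch) pairs, where ids are the items' global positions as strings."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        ids = [str(start + j) for j in range(len(batch))]
        yield ids, batch


for ids, batch in batched_with_ids(["a", "b", "c", "d", "e"], batch_size=2):
    print(ids, batch)
# → ['0', '1'] ['a', 'b']
# → ['2', '3'] ['c', 'd']
# → ['4'] ['e']
```

With 780 chunks and a batch size of 100, this yields 8 batches, the last one holding the remaining 80 chunks. Each pair could then be passed to a call like `db.add(documents=batch, ids=ids)`.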

### Prompting for the chatbot

Add extra prompting so that the LLM answers in the style you want. First, the query fetches the most relevant documents from the vector database we created. We then pass the fetched text to the LLM as context, so that it can answer from the latest information about the SDK. This technique is called RAG (Retrieval-Augmented Generation). It is a form of grounding, similar in spirit to how the Gemini API can use Google Search to fetch the latest information about a query and answer based on it.
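
The retrieve-then-prompt flow can be sketched without any external services. In this toy version, word-count vectors and cosine similarity stand in for the Gemini embeddings and ChromaDB lookup used in this notebook; every function name here (`bag_of_words`, `cosine`, `retrieve`, `build_prompt`) is hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, n_results=2):
    """Return the n_results documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, bag_of_words(d)), reverse=True)
    return ranked[:n_results]


def build_prompt(query, passages):
    """Join the retrieved passages into one context block and wrap it in instructions."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer:\n"
    )


docs = [
    "import the sdk with from google import genai",
    "chromadb stores vectors",
    "the weather is sunny",
]
top = retrieve("how to import the sdk", docs, n_results=1)
print(build_prompt("how to import the sdk", top))
```

Joining the passages with blank lines gives the model readable context instead of a Python list representation, which is one option to consider for the prompt in the next cell.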

In [18]:
from IPython.display import Markdown
embed_fn.document_mode = False

query = "how to import google-genai sdk"

result = db.query(query_texts=[query], n_results=5)


context = result['documents'][0]

prompt = f"""
Role: You are a teacher for the new google-genai SDK. Whatever question you get, answer it in a teaching manner.


Answer the following question based ONLY on the provided context:

Context:
{context}

Question:
{query}

Answer:
"""

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=prompt,
)

Markdown(response.text)



Okay, let's learn how to import the Google Gen AI SDK in Python.

Based on the documentation, here's the import statement you'll use:

```python
from google import genai
from google.genai import types
```

This imports the main `genai` module and also the `types` submodule, which contains various data types and classes used within the SDK.  You'll likely need both for most tasks.


## Future Possibilities

In the future, we could add the Gemini cookbook to this project so the chatbot can also cover the latest use cases of the Gemini API, with explanations.
This project aims to encourage adoption of the new SDK.
