## **Nugen Intelligence**
<img src="https://nugen.in/logo.png" alt="Nugen Logo" width="200"/>

Domain-aligned foundational models at industry leading speeds and zero-data retention! To learn more, visit [Nugen](https://docs.nugen.in/introduction)

### **Generating Embeddings with Nugen API**

This lesson demonstrates how to generate embeddings for texts using Nugen embeddings APIs. To do that, we will be following the steps mentioned below:
1. Extract information from Wikipedia
2. Break it into smaller sections
3. Generate high-performance embeddings using the [Nugen API](https://docs.nugen.in/introduction)


With Nugen’s cutting-edge API, you can easily generate embeddings that are optimized for speed and accuracy, enabling faster and more relevant results in your applications.

## Embedding Wikipedia Articles for Semantic Search (with Nugen)

### Purpose
This notebook demonstrates how to:
- Fetch Wikipedia articles programmatically
- Convert article text into vector embeddings using Nugen
- Perform semantic search over Wikipedia content

### Why this matters
Traditional keyword search fails to capture meaning.
Embedding-based search enables:
- Context-aware retrieval
- Better relevance
- Scalable knowledge search systems

### Prerequisites
- Python 3.9+
- Valid Nugen API credentials
- Internet access
- Required Python libraries installed

### How to run
Run each cell from top to bottom.
If an error occurs, check the **Troubleshooting** section at the end.

### **Setup**
**Install Required Libraries**
We'll install the required Python libraries to interact with Wikipedia, split sections, and count tokens.

> Note: It is recommended to use a virtual environment before installing dependencies.

In [1]:
!pip install --quiet mwclient==0.11.0 mwparserfromhell==0.6.6 pandas==1.5.3 tiktoken==0.7.0 openai==1.34.0 requests


[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: C:\Users\Roshni\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


**Import Necessary Libraries**

These libraries help us work with Wikipedia articles, clean and process them, and prepare them for embedding using the Nugen API.

In [2]:
# MediaWiki client for interacting with Wikipedia
import mwclient

# Parser to clean and extract Wikipedia article content
import mwparserfromhell

# Data handling and tabular operations
import pandas as pd

# Standard utilities
import re
import random
import requests

# Token counting for embedding size control
import tiktoken

### **Access the Nugen API**

**API Key Setup**

First, we need to set up the Nugen API to generate embeddings. To do this, you'll need an API key from Nugen. To access free API keys, you can visit [Nugen Dashboard](https://nugen-platform-frontend.azurewebsites.net/dashboard) Once you have your API key, make sure to replace <your_api_key> in the code with the actual key you get from Nugen.

### Secure API Key Setup

For security reasons, API keys should **not be hardcoded**.
Set your Nugen API key as an environment variable before running this notebook.

**Example (Linux / macOS):**
```bash
export NUGEN_API_KEY="your_api_key_here"

In [3]:
url_api_server = "https://api.nugen.in/inference/embeddings"
api_key = "nugen-LCKMzU2MmCZiSXCcOOm3Ow"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

### **Get Wikipedia Articles**
**Choosing Wikipedia Articles**

We are going to retrieve articles related to the 2022 Winter Olympics using a Wikipedia category. This section searches for all pages within that category.

In [4]:
CATEGORY_TITLE = "Category:2022 Winter Olympics"
WIKI_SITE = "en.wikipedia.org"

**Extract Article Titles**

We now gather all the article titles under this category.


In [5]:
def titles_from_category(category, max_depth):
    """Return a set of page titles in a given Wiki category and its subcategories."""
    titles = set()
    for cm in category.members():
        if type(cm) == mwclient.page.Page:
            titles.add(cm.name)
        elif isinstance(cm, mwclient.listing.Category) and max_depth > 0:
            deeper_titles = titles_from_category(cm, max_depth=max_depth - 1)
            titles.update(deeper_titles)
    return titles

# Initialize the Wikipedia client
site = mwclient.Site(WIKI_SITE)
category_page = site.pages[CATEGORY_TITLE]
titles = titles_from_category(category_page, max_depth=1)

# Select 20% of the articles for processing you can modify this according to your use case.
sample_size = int(0.2 * len(titles))
sampled_titles = random.sample(list(titles), sample_size)
print(f"Selected {len(sampled_titles)} article titles for processing.")

Selected 34 article titles for processing.


**How It Works**

1. **`titles_from_category` function**: Recursively traverses a Wikipedia category and collects article titles from the category and its subcategories, up to a configurable depth.
2. **`max_depth` parameter**: Limits recursion depth to prevent excessive API calls and infinite category loops.

### **Chunk Documents**
Now that we have our reference documents, we need to prepare them for search.

For this specific example on Wikipedia articles, we'll:

* Discard less relevant-looking sections like External Links and Footnotes
* Clean up the text by removing reference tags (e.g., ), whitespace, and super short sections
* Split each article into sections
* Prepend titles and subtitles to each section's text to preserve semantic context for embedding models
* If a section is long (say, > 1,600 tokens), we'll recursively split it into smaller sections, trying to split along     semantic boundaries like paragraphs

In [6]:
SECTIONS_TO_IGNORE = [
    "See also",
    "References",
    "External links",
    "Further reading",
    "Footnotes",
    "Bibliography",
    "Sources",
    "Citations",
    "Literature",
    "Footnotes",
    "Notes and references",
    "Photo gallery",
    "Works cited",
    "Photos",
    "Gallery",
    "Notes",
    "References and sources",
    "References and notes",
]

def all_subsections_from_section(section, parent_titles, sections_to_ignore):
    """Extract subsections from a Wikipedia section."""
    headings = [str(h) for h in section.filter_headings()]
    title = headings[0]
    if title.strip("=" + " ") in sections_to_ignore:
        return []
    titles = parent_titles + [title]
    full_text = str(section)
    section_text = full_text.split(title)[1]
    if len(headings) == 1:
        return [(titles, section_text)]
    else:
        first_subtitle = headings[1]
        section_text = section_text.split(first_subtitle)[0]
        results = [(titles, section_text)]
        for subsection in section.get_sections(levels=[len(titles) + 1]):
            results.extend(all_subsections_from_section(subsection, titles, sections_to_ignore))
        return results

The all_subsections_from_section function is designed to extract subsections from a specific section of a Wikipedia article. This function is used in the context of processing a page’s text, finding headings, and breaking the content down into smaller chunks (subsections). It helps you organize the text under each heading while ignoring certain sections you don't want to include (like references or external links).



In [7]:
def all_subsections_from_title(
    title: str,
    sections_to_ignore: set[str] = SECTIONS_TO_IGNORE,
    site_name: str = WIKI_SITE,
) -> list[tuple[list[str], str]]:
    """From a Wikipedia page title, return a flattened list of all nested subsections.
    Each subsection is a tuple, where:
        - the first element is a list of parent subtitles, starting with the page title
        - the second element is the text of the subsection (but not any children)
    """
    site = mwclient.Site(site_name)
    page = site.pages[title]
    text = page.text()
    parsed_text = mwparserfromhell.parse(text)
    headings = [str(h) for h in parsed_text.filter_headings()]
    if headings:
        summary_text = str(parsed_text).split(headings[0])[0]
    else:
        summary_text = str(parsed_text)
    results = [([title], summary_text)]
    for subsection in parsed_text.get_sections(levels=[2]):
        results.extend(all_subsections_from_section(subsection, [title], sections_to_ignore))
    return results


## Extracting Subsections from a Wikipedia Article

The function takes a Wikipedia page title and returns all the subsections of that page, along with their corresponding parent titles.
It extracts the page's text, identifies headings, and organizes the content into a structured list of tuples.

Each tuple contains:
1. A list of parent titles (starting with the page title)
2. The text of the subsection (excluding any sub-subsections)

### Function Parameters

1. **title: str**  
   The title of the Wikipedia page to extract subsections from  
   *(e.g., `"Python (programming language)"`)*

2. **sections_to_ignore: set[str]** *(default: `SECTIONS_TO_IGNORE`)*  
   A set of section titles that should be excluded from extraction, such as  
   `"References"` or `"External links"`

3. **site_name: str** *(default: `WIKI_SITE`)*  
   The Wikipedia domain to use  
   *(e.g., `"en.wikipedia.org"` for English Wikipedia)*

### Why This Matters

Large Wikipedia articles are not ideal for direct embedding due to their size and mixed topics.
By splitting articles into logically grouped subsections while preserving hierarchical context, this approach:
- Improves embedding quality
- Enhances semantic search relevance
- Reduces noise from non-informative sections

This structured extraction enables efficient document chunking and improves semantic retrieval accuracy in downstream embedding and RAG workflows.

### Clean Up Sections

After extracting the subsections, we clean the text to remove non-content artifacts such as:
- Reference markers (e.g., `[1]`, `[2]`)
- Extra whitespace
- Formatting noise introduced during wiki parsing

This ensures that each text chunk is embedding-ready and does not introduce irrelevant tokens into the vector store.

The function above splits the articles into smaller sections.

In [9]:
wikipedia_sections = []
for title in titles:
    wikipedia_sections.extend(all_subsections_from_title(title))
print(f"Found {len(wikipedia_sections)} sections in {len(titles)} pages.")

ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php?meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg&continue=&action=query&format=json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000018F71DEBA30>: Failed to establish a new connection: [WinError 10053] An established connection was aborted by the software in your host machine'))

In [None]:
# clean text
def clean_section(section: tuple[list[str], str]) -> tuple[list[str], str]:
    """
    Return a cleaned up section with:
        - <ref>xyz</ref> patterns removed
        - leading/trailing whitespace removed
    """
    titles, text = section
    text = re.sub(r"<ref.*?</ref>", "", text)
    text = text.strip()
    return (titles, text)


wikipedia_sections = [clean_section(ws) for ws in wikipedia_sections]

# Filter out short/blank sections
def keep_section(section: tuple[list[str], str]) -> bool:
    _, text = section
    return len(text) >= 16


original_num_sections = len(wikipedia_sections)
wikipedia_sections = [ws for ws in wikipedia_sections if keep_section(ws)]
print(f"Filtered out {original_num_sections-len(wikipedia_sections)} sections, leaving {len(wikipedia_sections)} sections.")

# Display example data
for ws in wikipedia_sections[:5]:
    print(ws[0])
    print(ws[1][:77] + "...")
    print()

Filtered out 0 sections, leaving 1706 sections.
['Almaty bid for the 2022 Winter Olympics']
{{Short description|Unsuccessful bid by Almaty, Kazakhstan}}
{{Olympic bid|20...

['Almaty bid for the 2022 Winter Olympics', '==History==']
[[File:Medeobanen1.jpg|thumb|230px|right|[[Medeo]], a probable location near ...

['Almaty bid for the 2022 Winter Olympics', '==History==', '===Previous bids===']
{{main|Almaty bid for the 2014 Winter Olympics}}
Almaty bid to host the [[201...

['Almaty bid for the 2022 Winter Olympics', '==Venues==']
The venue plan comprises:...

['Almaty bid for the 2022 Winter Olympics', '==Venues==', '=== City venues ===', '==== Olympic City ====']
*Almaty Olympic Ice Arena - Figure skating, short track
*Non-competitive venu...



### Handle Text Length (Tokens)

Embedding models perform best when input text stays within reasonable token limits.
To achieve this, we measure token counts using a model-compatible tokenizer and
split long sections into balanced chunks while preserving natural text boundaries.

In [19]:
GPT_MODEL = "gpt-4o-mini"
# Used only to select a compatible tokenizer via tiktoken.
# No OpenAI model inference is performed.

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """
    Count the number of tokens in a text string using a model-specific tokenizer.

    This function is used to ensure text chunks remain within reasonable
    token limits for embedding generation.

    Args:
        text: Input text string.
        model: Model name used only to select the tokenizer.

    Returns:
        Number of tokens in the input text.
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


This function counts the number of tokens (units of text) in a given string based on a specific tokenizer model (in this case, gpt-4o-mini). 

Here's what each part does:

1. Input: A text string.
2. Output: The number of tokens in the string.
How: It uses a tokenizer specific to the given model (like how many words or chunks the model recognizes) to "encode" the text and count its tokens.

In [20]:
def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str, str]:
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]
    total_tokens = num_tokens(string)
    halfway = total_tokens // 2
    best_diff = halfway
    for i, _ in enumerate(chunks):
        left = delimiter.join(chunks[: i + 1])
        left_tokens = num_tokens(left)
        diff = abs(halfway - left_tokens)
        if diff >= best_diff:
            break
        best_diff = diff
    left = delimiter.join(chunks[:i])
    right = delimiter.join(chunks[i:])
    return [left, right]

This function splits a large string into two parts at a logical breakpoint (like a sentence or paragraph).

Input:
1. string: A large text string.
2. delimiter: The point where we want to split the string (default is a new line \n).
3. Output: Two parts of the string, left and right.

How: It tries to find the point where the string should be split into two halves based on token count. It looks for the closest match to half the total number of tokens, and then splits the string into two logical parts.

In [21]:
def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = True
) -> str:
    """
    Truncate a text string to a maximum number of tokens using a tokenizer.

    The function encodes the input text into tokens, trims it to the specified
    token limit if necessary, and decodes it back into a string. This ensures
    compatibility with embedding and generation model token constraints.

    Args:
        string: Input text to truncate.
        model: Model name used to select the tokenizer.
        max_tokens: Maximum number of tokens allowed.
        print_warning: Whether to print a warning if truncation occurs.

    Returns:
        The truncated text string.
    """
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated = encoding.decode(encoded_string[:max_tokens])

    if print_warning and len(encoded_string) > max_tokens:
        print(
            f"Warning: Truncated string from "
            f"{len(encoded_string)} tokens to {max_tokens} tokens."
        )

    return truncated

This function ensures that text does not exceed a specified token limit by safely truncating it using a tokenizer-compatible encoding and decoding process.

Input:

1. string: The text to be truncated.
2. model: The tokenizer model to use.
3. max_tokens: The maximum number of tokens allowed.
4. print_warning: If the string is shortened, a warning will be printed.

Output: The truncated string (cut down to the allowed number of tokens).
How: It encodes the string into tokens, and if the number of tokens exceeds the limit, it trims the string and prints a warning.

In [22]:
def split_strings_from_subsection(subsection: tuple[list[str], str], max_tokens: int = 1000, model: str = GPT_MODEL, max_recursion: int = 5) -> list[str]:
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    if num_tokens(string) <= max_tokens:
        return [string]
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    for delimiter in ["\n\n", "\n", ". "]:
        left, right = halved_by_delimiter(text, delimiter=delimiter)
        if left == "" or right == "":
            continue
        results = []
        for half in [left, right]:
            half_subsection = (titles, half)
            half_strings = split_strings_from_subsection(
                half_subsection,
                max_tokens=max_tokens,
                model=model,
                max_recursion=max_recursion - 1,
            )
            results.extend(half_strings)
        return results
    return [truncated_string(string, model=model, max_tokens=max_tokens)]

# Split sections into chunks
MAX_TOKENS = 1600
wikipedia_strings = []
for section in wikipedia_sections:
    wikipedia_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

1706 Wikipedia sections split into 2006 strings.


This function breaks down a large section of text into smaller pieces that are below a token limit.

Input:
1. subsection: A tuple containing a list of titles (headers) and the main text.
2. max_tokens: The maximum number of tokens each part can have.
3. model: The tokenizer model.
4. max_recursion: How many times the function can call itself to keep splitting the text if it's too big.

Output: A list of smaller text strings, each below the token limit.

How: It tries to break the text into smaller parts by splitting it at logical places (like paragraph breaks or sentence breaks) and keeps doing so recursively until the chunks are small enough

### **Generate Embeddings**
**Prepare Text for Embedding**

Once the Wikipedia sections have been cleaned and split into token-safe chunks, we convert each chunk into a numerical embedding. These embeddings enable semantic similarity search and retrieval in downstream RAG workflows.

To improve efficiency and reliability, embeddings are generated in batches.


In [None]:
# Split sections into chunks
MAX_TOKENS = 1600
wikipedia_strings = []
for section in wikipedia_sections:
    wikipedia_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

# Fetch embeddings from Nugen API
BATCH_SIZE = 100
EMBEDDING_MODEL = "nugen-flash-embed"
embeddings = []

for batch_start in range(0, len(wikipedia_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = wikipedia_strings[batch_start:batch_end]
    print(f"Processing batch {batch_start} to {batch_end-1}")

    payload = {
        "input": batch,
        "model": EMBEDDING_MODEL
    }

    try:
        response = requests.post(url_api_server, json=payload, headers=headers)
        response.raise_for_status()
        data = response.json()
        batch_embeddings = [e['embedding'] for e in data['data']]
        embeddings.extend(batch_embeddings)
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
        print(f"Response content: {response.content}")
    except Exception as e:
        print(f"An error occurred: {e}")

1706 Wikipedia sections split into 2006 strings.
Processing batch 0 to 99
HTTP error occurred: 403 Client Error: Forbidden for url: https://api.nugen.in/inference/embeddings
Response content: b'{"detail":"Could not validate credentials. Reason: API key not found"}'
Processing batch 100 to 199
HTTP error occurred: 403 Client Error: Forbidden for url: https://api.nugen.in/inference/embeddings
Response content: b'{"detail":"Could not validate credentials. Reason: API key not found"}'
Processing batch 200 to 299
HTTP error occurred: 403 Client Error: Forbidden for url: https://api.nugen.in/inference/embeddings
Response content: b'{"detail":"Could not validate credentials. Reason: API key not found"}'
Processing batch 300 to 399
HTTP error occurred: 403 Client Error: Forbidden for url: https://api.nugen.in/inference/embeddings
Response content: b'{"detail":"Could not validate credentials. Reason: API key not found"}'
Processing batch 400 to 499
HTTP error occurred: 403 Client Error: Forbidd

### **Save the Results**

In [26]:
# Save the embeddings
df = pd.DataFrame({"text": wikipedia_strings, "embedding": embeddings})
SAVE_PATH = "winter_olympics_2022.csv"
df.to_csv(SAVE_PATH, index=False)
print(f"Embeddings saved to {SAVE_PATH}")

ValueError: All arrays must be of the same length