## Install dependencies (optional) 🔧

The cell below holds a commented-out `pip install` command for the packages this notebook needs. Uncomment and run it if `vllm`, `openai`, or `tqdm` is missing.


In [1]:
# !pip install vllm openai tqdm

## Start vLLM server (overview) 🚀

This section contains code to start a vLLM OpenAI-compatible server locally. The server is launched with model-specific arguments and the script waits until the `/v1/models` endpoint responds successfully. Adjust GPU/memory flags as needed for your environment.


### Start vLLM server (start & wait) ⚙️

Starts the server in the background using `subprocess.Popen` and polls `/v1/models` every 2 seconds (up to 60 attempts) until it returns HTTP 200. A commented-out block keeps the earlier, higher-memory launch parameters for reference.


In [2]:
import subprocess
import time
import requests

# Kill any existing vLLM process
!pkill -f vllm.entrypoints.openai.api_server

# Start vLLM server in background
# vllm_process = subprocess.Popen([
#     "python", "-m", "vllm.entrypoints.openai.api_server",
#     "--model", "Qwen/Qwen2.5-7B-Instruct",
#     "--host", "0.0.0.0",
#     "--port", "8000",
#     "--gpu-memory-utilization", "0.9",
#     "--max-model-len", "4096"
# ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

vllm_process = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen2.5-7B-Instruct",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--gpu-memory-utilization", "0.85",  # Lower from 0.9
    "--max-model-len", "2048",  # Lower from 4096
    "--disable-log-requests",  # Reduce logging overhead
    "--max-num-seqs", "8",  # Cap concurrent sequences to limit memory use
    "--swap-space", "4"  # CPU swap space per GPU, in GiB
], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)  # an unread PIPE can fill up and stall the server

# Wait for server to be ready
print("Starting vLLM server...")
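for i in range(60):
    try:
        response = requests.get("http://localhost:8000/v1/models", timeout=3)
        if response.status_code == 200:
            print("✅ vLLM server is ready!")
            break
    except requests.exceptions.RequestException:
        pass  # server is not accepting connections yet
    time.sleep(2)
    if i % 5 == 0:
        print(f"Waiting... ({i*2}s)")
else:
    # Runs only if the loop finished without hitting `break`
    print("❌ vLLM server did not become ready in time")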

Starting vLLM server...
Waiting... (0s)
Waiting... (10s)
Waiting... (20s)
Waiting... (30s)
✅ vLLM server is ready!
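
The polling loop above counts attempts, so the real waiting time depends on how long each request takes. A minimal sketch of a deadline-based alternative follows. The helper name `wait_until_ready` and its parameters are hypothetical (not part of the notebook), and the readiness probe is passed in as a callable so the logic can be tried without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A failing probe (e.g. connection refused) means "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Possible use in this notebook (assumes `requests` and the local server URL):
# ready = wait_until_ready(
#     lambda: requests.get("http://localhost:8000/v1/models", timeout=3).status_code == 200
# )
# if not ready:
#     raise RuntimeError("vLLM server did not start")
```

Returning a boolean instead of printing lets the caller decide whether a failed start should stop the notebook.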


### Check vLLM server health ✅

A quick health check that queries `http://localhost:8000/v1/models` to verify the server is responsive. Useful to run after starting or restarting the server.


In [3]:
import requests

response = requests.get("http://localhost:8000/v1/models", timeout=5)
response.raise_for_status()  # raises if the server did not return a 2xx status

## Optional: Install pandas 🧾

A commented-out `pip install` line for `pandas`. Uncomment and run it if the environment does not already have the package.


In [4]:
# !pip install pandas

## Load dataset CSV into pandas 📥

Loads `studio_results_20260104_1052.csv` into a DataFrame for subsequent processing. Inspect the head to confirm successful load.


In [5]:
import pandas as pd

df = pd.read_csv("studio_results_20260104_1052.csv")
df.head()

| Unnamed: 0 | title | skills |
|---|---|---|
| 0 | Growth Analyst | Statistical analysis, SQL, Scripting (Ruby, Py... |
| 1 | Senior Brand Designer (Contract) | Graphic Design, digital design, print design, ... |
| 2 | Accounts Support Specialist | problem solving, customer support, writing, gr... |
| 3 | Site Reliability Engineer | Windows Server, Microsoft Azure, PowerShell, S... |
| 4 | Site Reliability Engineer | Windows Server, Microsoft Azure, PowerShell, S... |


### Create list of anchor titles 📝

Extract the `title` column as a Python list. The list is named `skills_list` even though it holds job titles; these titles are the input anchors for triplet generation.


In [6]:
skills_list = df["title"].tolist()
len(skills_list)

5000
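
The preview above shows the same title on rows 3 and 4, so the 5000 anchors may contain repeats. Whether repeats are wanted is a modelling decision this notebook does not state. If they are not, a sketch like the following drops them while keeping the original order. The helper name `unique_anchors` is hypothetical.

```python
from typing import Iterable, List


def unique_anchors(titles: Iterable) -> List[str]:
    """Drop repeated titles, keeping the first occurrence and the order."""
    # Skip non-string entries (such as NaN from pandas) and blank strings.
    cleaned = (t.strip() for t in titles if isinstance(t, str) and t.strip())
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(cleaned))


# Possible use in this notebook:
# skills_list = unique_anchors(df["title"].tolist())
```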

## Setup vLLM-compatible OpenAI client 🔌

Import the required modules and create an `OpenAI` client that points at the local vLLM server. The client requires a non-empty `api_key`, so a dummy value is passed; the local server does not check it unless it was started with an API key.


In [7]:
from openai import OpenAI
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm.notebook import tqdm
from typing import Optional, Dict, List
import time


# vLLM OpenAI-compatible client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

### Prompt template for triplet generation ✍️

Defines `build_prompt(anchor)`, which instructs the model to return a JSON object with `positive` and `negative` fields for a given anchor. The prompt asks for strict JSON with no explanations; the model can still deviate, so the reply is validated when it is parsed.


In [8]:
def build_prompt(anchor: str) -> str:
    return f"""
You are a dataset generator for semantic similarity training.

Given an ANCHOR sentence, generate:
1. POSITIVE: A sentence with the SAME meaning as the anchor.
2. NEGATIVE: A sentence from the SAME DOMAIN but DIFFERENT meaning.

Rules:
- Do NOT copy anchor text exactly
- Keep language and tone consistent
- Do NOT explain anything
- Output STRICT JSON only

JSON format:
{{
  "positive": "...",
  "negative": "..."
}}

ANCHOR:
{anchor}
"""

### Generate a single triplet with retries 🔁

`generate_triplet(anchor)` calls the model, strips Markdown code fences from the reply, and parses the JSON. It makes up to `max_retries` attempts in total (two by default) and returns a dict with `anchor`, `positive`, and `negative`, or `None` if every attempt fails.


In [9]:
def generate_triplet(anchor: str, max_retries: int = 2) -> Optional[Dict]:
    """Generate a single triplet; `max_retries` is the total number of attempts."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="Qwen/Qwen2.5-7B-Instruct",
                messages=[
                    {"role": "system", "content": "You generate high-quality contrastive training data in JSON format."},
                    {"role": "user", "content": build_prompt(anchor)}
                ],
                temperature=0.7,
                max_tokens=256,
                timeout=30
            )

            content = response.choices[0].message.content.strip()
            content = content.replace("```json", "").replace("```", "").strip()
            parsed = json.loads(content)
            
            if "positive" not in parsed or "negative" not in parsed:
                continue
                
            return {
                "anchor": anchor,
                "positive": parsed["positive"],
                "negative": parsed["negative"]
            }
            
        except json.JSONDecodeError:
            if attempt == max_retries - 1:
                print(f"❌ JSON parse failed: {anchor[:50]}...")
            continue
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"❌ Error: {e}")
            time.sleep(0.1)
            continue
    
    return None

print("✅ Functions loaded!")

✅ Functions loaded!
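
`generate_triplet` removes code fences with string replacement, which still fails when the model wraps the JSON in extra sentences. One possible hardening is sketched below: scan the reply for the first decodable JSON object, then check that both fields are non-empty strings. The helper names `extract_json_object` and `valid_pair` are hypothetical and not part of the notebook.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


def valid_pair(parsed) -> bool:
    """Check that both fields exist and are non-empty strings."""
    if not isinstance(parsed, dict):
        return False
    return all(
        isinstance(parsed.get(key), str) and parsed[key].strip()
        for key in ("positive", "negative")
    )
```

Inside `generate_triplet`, `parsed = extract_json_object(content)` followed by `if not valid_pair(parsed): continue` could stand in for the `json.loads` call and the key check.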


## Test single generation ✅

Quick test that runs `generate_triplet` on a sample anchor and prints the result. Use this to validate the vLLM server and parsing logic before generating the full dataset.


### Run a single test and inspect output 🔍

Execute a single example and print the returned JSON to ensure `positive` and `negative` fields are present and well-formed.


In [10]:
# Test with one example
test_anchor = "Python programming for data analysis"
test_result = generate_triplet(test_anchor)

if test_result:
    print("✅ Test successful!")
    print(json.dumps(test_result, indent=2))
else:
    print("❌ Test failed - check vLLM server")

✅ Test successful!
{
  "anchor": "Python programming for data analysis",
  "positive": "Learning Python to manipulate datasets",
  "negative": "Building a website using Python frameworks"
}


## Notebook UI dependencies (optional) 🧩

A commented-out command that upgrades `notebook` and `ipywidgets`. It may help if `tqdm.notebook` progress bars or other widgets fail to render in your environment.


In [11]:
# !pip install --upgrade notebook ipywidgets

### Progress bar example (commented) ⏳

A minimal example showing how to use `tqdm.notebook.tqdm` for progress feedback when generating triplets. Kept commented for reference.


In [12]:
# from tqdm.notebook import tqdm

# for i in tqdm(range(10),  desc="Generating Triplets"):
#     print(i)

## Initialize dataset containers 📚

Create `dataset` for successful triplets and `failed_anchors` to record anchors that could not be generated.


In [13]:
dataset = []
failed_anchors = []

### Dataset initialization and bookkeeping 🧾

`dataset` holds the valid triplets, and `failed_anchors` is meant to collect anchors that still fail after all attempts. Only `dataset` is written to the checkpoint file during long-running generation loops.


In [14]:
# def generate_dataset_parallel(
#     anchors: List[str], 
#     max_workers: int = 8,  # Lower default
#     timeout: int = 60  # Per request timeout
# ) -> List[Dict]:
#     """Generate dataset with parallel processing."""
#     dataset = []
#     failed_anchors = []
    
#     with ThreadPoolExecutor(max_workers=max_workers) as executor:
#         future_to_anchor = {
#             executor.submit(generate_triplet, anchor): anchor 
#             for anchor in anchors
#         }
        
#         for future in tqdm(
#             as_completed(future_to_anchor, timeout=timeout), 
#             total=len(anchors),
#             desc="Generating Triplets"
#         ):
#             try:
#                 anchor = future_to_anchor[future]
#                 result = future.result(timeout=timeout)
                
#                 if result:
#                     dataset.append(result)
#                 else:
#                     failed_anchors.append(anchor)
#             except Exception as e:
#                 print(f"⚠️ Timeout/Error: {str(e)[:50]}")
#                 failed_anchors.append(future_to_anchor[future])
    
#     print(f"\n✅ Success: {len(dataset)}/{len(anchors)} ({len(dataset)/len(anchors)*100:.1f}%)")
#     print(f"❌ Failed: {len(failed_anchors)}")
    
#     return dataset

## Parallel generation helper (commented) ⚡

The cell above holds a commented-out parallel implementation built on `ThreadPoolExecutor`, with timeout handling and progress reporting. To use it, uncomment it and adjust `max_workers` to suit your hardware.
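
Two details of the commented helper are worth noting. The `timeout` passed to `as_completed` bounds the whole iteration rather than each request, and the function returns only `dataset`, so the failed anchors are lost. The sketch below is one possible variant, not the notebook's implementation: it leaves per-request timeouts to the worker and returns both lists. The name `generate_parallel` and the `worker` parameter are hypothetical; in this notebook the worker would be `generate_triplet`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Optional, Tuple


def generate_parallel(
    anchors: List[str],
    worker: Callable[[str], Optional[Dict]],
    max_workers: int = 4,
) -> Tuple[List[Dict], List[str]]:
    """Run `worker` over `anchors` in threads; return (results, failed)."""
    results: List[Dict] = []
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_anchor = {executor.submit(worker, a): a for a in anchors}
        for future in as_completed(future_to_anchor):
            anchor = future_to_anchor[future]
            try:
                result = future.result()
            except Exception:
                # An exception inside the worker counts as a failed anchor.
                result = None
            if result:
                results.append(result)
            else:
                failed.append(anchor)
    return results, failed


# Possible use in this notebook:
# dataset, failed_anchors = generate_parallel(skills_list, generate_triplet)
```

Results arrive in completion order, not input order, so sort or join on `anchor` afterwards if the order matters.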


In [15]:
def restart_vllm_server():
    """Kill any running vLLM server, relaunch it, and wait until it is ready."""
    print("🔄 Restarting server...")
    !pkill -f vllm.entrypoints.openai.api_server
    time.sleep(5)
    
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "Qwen/Qwen2.5-7B-Instruct",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--gpu-memory-utilization", "0.8",
        "--max-model-len", "1024",
        "--disable-log-requests",
        "--enforce-eager"  # Disable CUDA graphs to save GPU memory
    ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)  # an unread PIPE can fill up and stall the server
    
    for i in range(40):
        try:
            response = requests.get("http://localhost:8000/v1/models", timeout=3)
            if response.status_code == 200:
                print("✅ Ready!")
                return
        except requests.exceptions.RequestException:
            pass  # server is not accepting connections yet
        time.sleep(2)
    raise RuntimeError("vLLM server did not become ready after restart")

# Start initial server
# restart_vllm_server()

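### Server restart helper 🔁

`restart_vllm_server()` in the cell above stops any running vLLM server with `pkill` (which sends SIGTERM by default), relaunches it, and waits until `/v1/models` responds. The relaunch uses more conservative settings than the initial start (`--gpu-memory-utilization 0.8`, `--max-model-len 1024`, `--enforce-eager`). Useful for long runs where memory leaks or failures may require a restart.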
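
`pkill -f` matches every process whose command line contains the pattern, and the helper discards the `Popen` handle it creates. If the handle is kept instead, the server can be stopped directly. A minimal sketch, assuming a hypothetical helper named `stop_process`:

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """Terminate `proc`, escalating to kill if it does not exit in time."""
    if proc.poll() is None:              # still running
        proc.terminate()                 # polite request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()                  # force stop
            proc.wait()
    return proc.returncode


# Possible use in this notebook, with the handle from the start cell:
# stop_process(vllm_process)
```

Waiting on the handle also replaces the fixed `time.sleep(5)` with a check that the old process has actually exited.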


In [16]:
from tqdm.notebook import tqdm
import json

dataset = []
restart_every = 400

for idx, skill in enumerate(tqdm(skills_list, desc="Generating Triplets")):
    # Restart server every 400 requests
    if idx > 0 and idx % restart_every == 0:
        with open('checkpoint_title.json', 'w') as f:
            json.dump({'dataset': dataset, 'idx': idx}, f)
        restart_vllm_server()
    
    ans = generate_triplet(skill)

    if ans:
        dataset.append(ans)
    else:
        failed_anchors.append(skill)  # keep failures so they can be retried later
    
    # Save checkpoint every 50
    if (idx + 1) % 50 == 0:
        with open('checkpoint_title.json', 'w') as f:
            json.dump({'dataset': dataset, 'idx': idx + 1}, f)

print(f"✅ Done: {len(dataset)}/{len(skills_list)}")

Generating Triplets:   0%|          | 0/5000 [00:00<?, ?it/s]

🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
🔄 Restarting server...
✅ Ready!
✅ Done: 5000/5000


## Main generation loop with checkpointing 💾

The cell above iterates over `skills_list`, generates one triplet per title, and writes `checkpoint_title.json` every 50 items. It also restarts the server every `restart_every` (400) requests to mitigate memory issues. Adjust `restart_every` and the checkpoint frequency as needed.
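
The loop writes checkpoints but always starts from index 0, so an interrupted run repeats earlier work. The sketch below shows one way to resume. It assumes the same file layout as the loop (`dataset` plus `idx`, where `idx` is the next index to process). The helper names `save_checkpoint` and `load_checkpoint` are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def save_checkpoint(path: str, dataset: List[Dict], idx: int) -> None:
    """Write the checkpoint to a temp file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"dataset": dataset, "idx": idx}, f)
    # Swapping the finished file in avoids leaving a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Tuple[List[Dict], int]:
    """Return (dataset, next index), or an empty state if no file exists."""
    if not os.path.exists(path):
        return [], 0
    with open(path) as f:
        state = json.load(f)
    return state["dataset"], state["idx"]


# Possible use in this notebook:
# dataset, start = load_checkpoint("checkpoint_title.json")
# for idx in range(start, len(skills_list)):
#     ...  # generate, append, and call save_checkpoint every 50 items
```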


In [17]:
# from tqdm.notebook import tqdm

# dataset = []

# for i in tqdm(skills_list,  desc="Generating Triplets"):
#     ans = generate_triplet(i)
    
#     if not ans:
#         continue
    
#     dataset.append(ans)

### Sequential generation (simple) 🧭

The cell above is a commented-out, simplified loop that iterates over `skills_list` and appends valid triplets to `dataset`, with no checkpointing or server restarts. It is simple but slower than parallel approaches.


In [18]:
# # Run generation
# # Run with lower workers
# generate_dataset_parallel(
#     anchors=skills_list,
#     max_workers=4,  # Try 8, then increase to 16 if stable
#     timeout=60
# )

### Run parallel generation (commented) 🛠️

The cell above shows a commented-out call to the `generate_dataset_parallel` helper. Adjust `max_workers` and `timeout` for your system before enabling it.


## Save results to CSV 📤

After generation, convert `dataset` to a `pandas.DataFrame`, inspect, and save it to `title.csv`.


In [19]:
len(dataset), len(failed_anchors)

(5000, 0)

### Inspect dataset sizes and failures 📈

The cell above checks the lengths of `dataset` and `failed_anchors` before the conversion to a DataFrame.


In [20]:
skills_df = pd.DataFrame(dataset)

skills_df.head()

# index=False avoids an extra "Unnamed: 0" column when the file is read back
skills_df.to_csv("title.csv", index=False)

### Convert to DataFrame and save to CSV 💾

The cell above creates a `pandas.DataFrame` named `skills_df` from `dataset`, previews it, and saves it to `title.csv` for downstream use.
