This is a project on domain name suggestion.

Proposing a suitable domain name is a tricky task for entrepreneurs. Clarity, pronunciation, popular reception, cultural implications, and trademark laws and regulations all have to be taken into account.

Targets include:
1. Reproducible Performance with Model Version Tracking
2. Runnable evaluation framework that works across all model iterations
3. **Optional**: Deploy selected model as API endpoint


# First Step: Create a small, diverse synthetic dataset for domain-name suggestion tasks

It includes:
1. Briefs (JSONL)
2. Candidates with labels (JSONL)
3. Pairwise preference judgments (JSONL)
4. A README-style methodology (Markdown)


# Synthetic Dataset for Domain Name Suggestions

**Generated:** {datetime.utcnow().isoformat()}Z

## Files
- `domain_briefs.jsonl` — Diverse briefs (industry, tone, keywords, constraints, complexity)
- `domain_candidates.jsonl` — Candidate suggestions with scores, pass/fail, safety flags
- `domain_pairwise.jsonl` — Synthetic pairwise preferences for DPO/IPO

## Diversity Coverage
- **Business types**: fintech, eco cosmetics, B2B AI, coffee roaster, tutoring (FR), dev tools (DE), travel (ES), wellness, home IoT, climate nonprofit, JP stationery (translit), AR food delivery (translit), pet supplements, outdoor rentals, kids coding.
- **Languages/scripts**: EN, FR, DE, ES (Latin). JP/AR represented via **Latin transliteration** to avoid IDN in this first version.
- **Complexity levels**: basic, moderate, advanced (randomly assigned in this version), intended to indicate constraint richness and prompt realism.

## Methodology
1. **Brief Construction**: For each business type, we define language, tone, keywords, and constraints:
   - `max_len` (10/12/14), `allowed_tlds` (domain-appropriate),
   - forbid digits/hyphens, ASCII-only for v1 (IDN can be added later).
2. **Candidate Generation**: Nonce-word generator from a curated syllable bank creates pronounceable, brandable strings. We avoid real brands or adult/illegal terms.
3. **Safety & Constraints**: We inject a small fraction of *intentionally flawed* candidates (digits, hyphens, or trademark-like typosquats such as `go0gle-...`) to train and evaluate filters. No explicit harmful content is included.
4. **Weak Labels**: Each candidate has pseudo-scores (`brandability`, `brevity`, `keyword_fit`) for reranking studies. Replace with human ratings over time.
5. **Pairwise Preferences**: For each brief, candidates are ranked by constraint pass and then by `brandability`; the top four are paired as winners against the bottom four (lower-scored or flagged) as losers. This supports preference optimization (DPO/IPO/KTO).
6. **Intended Use**: Bootstrapping a generation→filter→rerank pipeline and automated tests. For production, add multilingual scripts, IDN/homograph checks, and human review.
7. **Ethics & Safety**: The dataset purposefully avoids generating or normalizing harmful categories (hate, sexual content, illegal goods/services, self-harm, extremist content). It includes negative examples only in the form of benign constraint violations and obvious trademark-like typosquats to test refusal/filters.
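
The constraint and safety checks described in steps 1 and 3 can be sketched as a standalone filter. This is a minimal illustration, not the project's actual filter: the function name `check_domain` and the reason codes `too_long`, `tld_not_allowed`, and `non_ascii` are assumptions introduced here (only `contains_digit` and `contains_hyphen` appear in the generator), and the TLD check is an addition the generator itself does not perform.

```python
def check_domain(domain, constraints):
    """Return (passes, reasons) for one candidate against a brief's constraints."""
    reasons = []
    # Split "brevexa.com" into the name part and the TLD (with its leading dot).
    name, dot, rest = domain.partition(".")
    tld = dot + rest
    if len(name) > constraints["max_len"]:
        reasons.append("too_long")  # hypothetical reason code
    if tld not in constraints["allowed_tlds"]:
        reasons.append("tld_not_allowed")  # hypothetical reason code
    if constraints.get("forbid_digits") and any(ch.isdigit() for ch in name):
        reasons.append("contains_digit")
    if constraints.get("forbid_hyphens") and "-" in name:
        reasons.append("contains_hyphen")
    if constraints.get("ascii_only") and not domain.isascii():
        reasons.append("non_ascii")  # hypothetical reason code
    return (not reasons, reasons)

constraints = {"max_len": 12, "allowed_tlds": [".com", ".io"],
               "forbid_digits": True, "forbid_hyphens": True, "ascii_only": True}
print(check_domain("brevexa.com", constraints))         # (True, [])
print(check_domain("go0gle-brevexa.com", constraints))  # (False, ['too_long', 'contains_digit', 'contains_hyphen'])
```

Returning the list of reasons rather than a bare boolean keeps the filter usable both as a hard gate and as a source of labels for the `safety.reasons` field.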

## Schemas
### Brief
```json
{"brief_id":"uuid","title":"string","language":"en","script":"Latin","tone":"string","keywords":["k1","k2"],"constraints":{"max_len":12,"allowed_tlds":[".com",".io"],"forbid_digits":true,"forbid_hyphens":true,"ascii_only":true},"complexity":"basic|moderate|advanced","notes":"string"}
```


### Candidate
```json
{"candidate_id":"uuid","brief_id":"uuid","domain":"brevexa.com","rationale":"string","scores":{"brandability":0.85,"brevity":0.9,"keyword_fit":0.7},"passes_constraints":true,"safety":{"flagged":false,"reasons":[]}}
```
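
Records can be checked against the Candidate schema above before they are used for training. The following is a rough sketch using only the standard library; `validate_candidate` and its problem codes are hypothetical names, and the assumption that every score lies in the range 0 to 1 is inferred from the example values rather than stated in the schema.

```python
import json

# Expected top-level keys and their Python types, taken from the Candidate schema.
REQUIRED_CANDIDATE_KEYS = {
    "candidate_id": str, "brief_id": str, "domain": str, "rationale": str,
    "scores": dict, "passes_constraints": bool, "safety": dict,
}
SCORE_KEYS = ("brandability", "brevity", "keyword_fit")

def validate_candidate(record):
    """Return a list of problems found in one candidate record (empty if none)."""
    problems = []
    for key, expected in REQUIRED_CANDIDATE_KEYS.items():
        if key not in record:
            problems.append(f"missing:{key}")
        elif not isinstance(record[key], expected):
            problems.append(f"wrong_type:{key}")
    scores = record.get("scores")
    if isinstance(scores, dict):
        for key in SCORE_KEYS:
            value = scores.get(key)
            # Assumption: scores are numeric and fall within [0, 1].
            if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
                problems.append(f"bad_score:{key}")
    return problems

line = ('{"candidate_id":"c1","brief_id":"b1","domain":"brevexa.com","rationale":"r",'
        '"scores":{"brandability":0.85,"brevity":0.9,"keyword_fit":0.7},'
        '"passes_constraints":true,"safety":{"flagged":false,"reasons":[]}}')
print(validate_candidate(json.loads(line)))  # []
```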




In [3]:

import json, random, uuid, os, re
from datetime import datetime
import pandas as pd


random.seed(42)

# -------------------------
# Utilities
# -------------------------
def uid():
    return str(uuid.uuid4())

def ensure_dir(path):
    os.makedirs(path, exist_ok=True)

OUT_DIR = "."
ensure_dir(OUT_DIR)

# -------------------------
# 1) Briefs
# -------------------------
business_catalog = [
    # (title, keywords, tone, tlds, language, script, notes)
    ("Fintech payments wallet", ["pay", "wallet", "secure"], "premium, trustworthy", [".com",".io",".pay"], "en", "Latin", ""),
    ("Eco-friendly cosmetics", ["vegan", "plant", "glow"], "gentle, natural", [".com",".beauty",".shop"], "en", "Latin", ""),
    ("B2B AI analytics", ["insight", "metrics", "predict"], "modern, technical", [".ai",".io",".com"], "en", "Latin", ""),
    ("Artisanal coffee roaster", ["beans","roast","origin"], "craft, warm", [".com",".coffee",".shop"], "en", "Latin", ""),
    ("Online French tutoring", ["cours","langue","coach"], "convivial, sérieux", [".fr",".com"], "fr", "Latin", ""),
    ("SaaS developer tools (DE)", ["code","build","deploy"], "prägnant, professionell", [".de",".dev",".io"], "de", "Latin", ""),
    ("Travel planning app (ES)", ["viaje","ruta","plan"], "amable, inspirador", [".es",".app",".com"], "es", "Latin", ""),
    ("Wellness & yoga studio", ["flow","breathe","calm"], "soothing, minimalist", [".com",".studio",".fit"], "en", "Latin", ""),
    ("Home automation IoT", ["smart","home","mesh"], "sleek, futuristic", [".com",".tech",".io"], "en", "Latin", ""),
    ("Nonprofit climate org", ["climate","earth","action"], "serious, hopeful", [".org",".earth",".com"], "en", "Latin", ""),
    ("Japanese stationery (JP translit)", ["pen","paper","kawaii"], "cute, refined", [".jp",".shop",".com"], "ja", "Latin", "Transliterated keywords only"),
    ("Arabic food delivery (translit)", ["souk","fresh","sah"], "friendly, reliable", [".com",".me",".app"], "ar", "Latin", "Transliterated keywords only"),
    ("Pet supplements DTC", ["pet","chew","boost"], "friendly, credible", [".com",".pet",".shop"], "en", "Latin", ""),
    ("Outdoor gear rental", ["camp","hike","rent"], "adventurous, practical", [".com",".outdoors",".rentals"], "en", "Latin", ""),
    ("Kids coding classes", ["code","kids","learn"], "playful, educational", [".com",".school",".academy"], "en", "Latin", ""),
]

complexity_levels = ["basic","moderate","advanced"]

def make_briefs(catalog):
    briefs = []
    for (title, keywords, tone, tlds, lang, script, notes) in catalog:
        briefs.append({
            "brief_id": uid(),
            "title": title,
            "language": lang,
            "script": script,
            "tone": tone,
            "keywords": keywords,
            "constraints": {
                "max_len": random.choice([10,12,14]),
                "allowed_tlds": tlds,
                "forbid_digits": True,
                "forbid_hyphens": True,
                "ascii_only": True
            },
            "complexity": random.choice(complexity_levels),
            "notes": f"Synthetic brief; availability not verified. {notes}".strip()
        })
    return briefs

briefs = make_briefs(business_catalog)

# -------------------------
# 2) Candidate generation
# -------------------------

# A small syllable bank to form pronounceable nonce words (harmless content only)
syllables = [
    "bre","ve","xa","no","va","ly","zo","ri","ta","lo","fi","ki","ra","ne","mi","do","tu","su","pla","tri","quo","zen","lum","sio",
    "meta","nex","ora","kiri","terra","flux","vanta","pleni","astra","omni","veri","cora","mira","luma","axi","primo","alto","vivo",
    "nori","lumi","kora","vexa","tava","moro","lino","nexa","pivo","dela","soma","trio"
]

def gen_nonce(max_len):
    # build a pronounceable-ish name from 2-3 syllables; max_len applies to the
    # name only (the caller appends the TLD afterwards)
    for _ in range(80):
        parts = random.sample(syllables, k=random.choice([2,3]))
        name = "".join(parts).lower()
        name = re.sub(r'(.)\1{2,}', r'\1\1', name)  # compress 3+ repeats
        if len(name) <= max_len and name.isascii() and name.isalpha():
            return name
        # fallback: trim
        if len(name) > max_len:
            name = name[:max_len]
            if name.isalpha():
                return name
    return "novexa"

def make_candidates_for_brief(b, k=12):
    cands = []
    max_len = b["constraints"]["max_len"]
    allowed_tlds = b["constraints"]["allowed_tlds"]

    # functions that create intentionally *bad* examples (for training filters)
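    def inj_digit(base):
        # swap "o" for "0"; if the base has no "o", append a digit so the
        # sample is always flawed as intended
        flawed = base.replace("o","0")
        if flawed == base:
            flawed = base + "1"
        return flawed + random.choice(allowed_tlds)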
    def inj_hyphen(base): return base + "-pro" + random.choice(allowed_tlds)
    def inj_trademark_like(base): return "go0gle-" + base + random.choice(allowed_tlds)

    bad_funcs = [inj_digit, inj_hyphen, inj_trademark_like]

    for i in range(k):
        base = gen_nonce(max_len)
        tld = random.choice(allowed_tlds)
        domain = f"{base}{tld}"
        rationale = f"Short coined word aligned to keywords ({', '.join(b['keywords'])}) and tone '{b['tone']}'."
        safety = {"flagged": False, "reasons": []}
        passes = True

        # inject one flawed sample per 6
        if (i+1) % 6 == 0:
            dom = random.choice(bad_funcs)(base)
            domain = dom

        # checks
        name_part = domain.split(".")[0]
        if len(name_part) > max_len:
            passes = False
        if any(ch.isdigit() for ch in domain):
            passes = False; safety["flagged"]=True; safety["reasons"].append("contains_digit")
        if "-" in domain:
            passes = False; safety["flagged"]=True; safety["reasons"].append("contains_hyphen")
        if "go0gle" in domain:
            passes = False; safety["flagged"]=True; safety["reasons"].append("trademark_like")

        scores = {
            "brandability": round(random.uniform(0.6, 0.95),2),
            "brevity": round(max(0.3, 1 - len(name_part)/max(6, max_len)),2),
            "keyword_fit": round(random.uniform(0.55, 0.9),2)
        }
        cands.append({
            "candidate_id": uid(),
            "brief_id": b["brief_id"],
            "domain": domain,
            "rationale": rationale,
            "scores": scores,
            "passes_constraints": passes,
            "safety": safety
        })
    return cands

candidates = []
for b in briefs:
    candidates.extend(make_candidates_for_brief(b, k=12))

# -------------------------
# 3) Pairwise preferences (synthetic DPO/IPO data)
# -------------------------
pairwise = []
for b in briefs:
    bcands = [c for c in candidates if c["brief_id"] == b["brief_id"]]
    ranked = sorted(bcands, key=lambda x: (x["passes_constraints"], x["scores"]["brandability"]), reverse=True)
    tops = ranked[:4]
    bots = ranked[-4:]
    for a in tops:
        for d in bots:
            pairwise.append({
                "pair_id": uid(),
                "brief_id": b["brief_id"],
                "winner_candidate_id": a["candidate_id"],
                "loser_candidate_id": d["candidate_id"],
                "reason_codes": ["brandability","constraint_pass","safety_margin"]
            })

# -------------------------
# 4) Write files
# -------------------------
paths = {
    "briefs": os.path.join(OUT_DIR, "domain_briefs.jsonl"),
    "candidates": os.path.join(OUT_DIR, "domain_candidates.jsonl"),
    "pairwise": os.path.join(OUT_DIR, "domain_pairwise.jsonl"),
    "readme": os.path.join(OUT_DIR, "README_methodology.md"),
    "script": os.path.join(OUT_DIR, "generate_synthetic_dataset.py"),
}

with open(paths["briefs"], "w", encoding="utf-8") as f:
    for b in briefs:
        f.write(json.dumps(b, ensure_ascii=False)+"\n")

with open(paths["candidates"], "w", encoding="utf-8") as f:
    for c in candidates:
        f.write(json.dumps(c, ensure_ascii=False)+"\n")

with open(paths["pairwise"], "w", encoding="utf-8") as f:
    for p in pairwise:
        f.write(json.dumps(p, ensure_ascii=False)+"\n")
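
The pairwise file stores candidate IDs only. Preference-training tools typically expect text triples instead; TRL's `DPOTrainer`, for instance, uses `prompt`, `chosen`, and `rejected` columns. Under that assumption, the join might look like the sketch below. The function name `to_preference_rows` and the prompt template are hypothetical and would need adapting to the chosen model's chat format.

```python
def to_preference_rows(briefs, candidates, pairwise):
    """Join ID-based pairs into prompt/chosen/rejected rows."""
    brief_by_id = {b["brief_id"]: b for b in briefs}
    cand_by_id = {c["candidate_id"]: c for c in candidates}
    rows = []
    for p in pairwise:
        b = brief_by_id[p["brief_id"]]
        # Hypothetical prompt template built from the brief fields.
        prompt = (
            f"Suggest a domain name for: {b['title']}\n"
            f"Tone: {b['tone']}\n"
            f"Keywords: {', '.join(b['keywords'])}\n"
            f"Max length: {b['constraints']['max_len']}; "
            f"TLDs: {', '.join(b['constraints']['allowed_tlds'])}"
        )
        rows.append({
            "prompt": prompt,
            "chosen": cand_by_id[p["winner_candidate_id"]]["domain"],
            "rejected": cand_by_id[p["loser_candidate_id"]]["domain"],
        })
    return rows

# Tiny in-memory example shaped like the generated records.
demo_briefs = [{"brief_id": "b1", "title": "Artisanal coffee roaster", "tone": "craft, warm",
                "keywords": ["beans", "roast"],
                "constraints": {"max_len": 12, "allowed_tlds": [".com", ".coffee"]}}]
demo_candidates = [{"candidate_id": "c1", "brief_id": "b1", "domain": "brevexa.com"},
                   {"candidate_id": "c2", "brief_id": "b1", "domain": "n0vexa-pro.com"}]
demo_pairs = [{"pair_id": "p1", "brief_id": "b1",
               "winner_candidate_id": "c1", "loser_candidate_id": "c2"}]
rows = to_preference_rows(demo_briefs, demo_candidates, demo_pairs)
print(rows[0]["chosen"], "preferred over", rows[0]["rejected"])
```

Calling it with the `briefs`, `candidates`, and `pairwise` lists built above should yield one row per pair, ready to be written out as JSONL or loaded into a dataset object.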



## Model Development & Iteration
- Baseline model: fine-tune an initial open-source LLM. Common fine-tuning recipes can be used for this.
- Improved model(s): address discovered issues, e.g. through:
  - Dataset augmentation
  - Different fine-tuning approaches (LoRA, full fine-tuning, etc.)
  - Hyperparameter optimization
- Save and version all model checkpoints.
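
One lightweight way to version checkpoints is to give each run its own directory plus a small manifest recording the configuration that produced it. The sketch below uses only the standard library; `register_checkpoint`, the directory naming scheme, and the example hyperparameters (`lr`, `lora_r`) are all hypothetical, and a dedicated experiment tracker could serve the same purpose.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

def register_checkpoint(root, model_name, config, metrics=None):
    """Create a versioned checkpoint directory and write manifest.json into it."""
    os.makedirs(root, exist_ok=True)
    # Hash the sorted config so identical settings always give the same short ID.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    config_hash = hashlib.sha256(payload).hexdigest()[:8]
    # The version number is the count of existing runs for this model plus one
    # (this assumes earlier checkpoint directories are never deleted).
    existing = [d for d in os.listdir(root) if d.startswith(model_name + "-v")]
    version = len(existing) + 1
    ckpt_dir = os.path.join(root, f"{model_name}-v{version}-{config_hash}")
    os.makedirs(ckpt_dir)
    manifest = {
        "model_name": model_name,
        "version": version,
        "config_hash": config_hash,
        "config": config,
        "metrics": metrics or {},
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(ckpt_dir, "manifest.json"), "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return ckpt_dir

with tempfile.TemporaryDirectory() as root:
    first = register_checkpoint(root, "baseline", {"lr": 2e-4, "lora_r": 16})
    second = register_checkpoint(root, "baseline", {"lr": 1e-4, "lora_r": 16})
    print(os.path.basename(first))
    print(os.path.basename(second))
```

Model weights would be saved into the returned directory by the training code, so each manifest sits next to the checkpoint it describes.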

In [4]:
!pip -q install "transformers>=4.43" "datasets>=2.19" "accelerate>=0.33" "peft>=0.12" "bitsandbytes>=0.43" "trl>=0.9" sentencepiece evaluate huggingface_hub


In [None]:
from huggingface_hub import login
login()  # prompts for a Hugging Face access token
