# **Domain Generator — Business Dataset Creation** 

> To keep the notebook from becoming too long, most of the code for training, evaluation, and other tasks was moved into modules, which can be found in the project’s directories and are referenced in the imports.

## **Domain Dataset Generation (Groq API)**

> To create the dataset, I used the Groq API, which is free for now, with the LLaMA 3 70B model.

In [1]:
#from create_datasets.domain_dataset import DomainDataset

#generator = DomainDataset(from_scratch=False)
#generator.generate(n=5)

> I used a prompt to generate fictional companies, each with a fictional business description. For each description, the model provided a list of domain names, typically five.
> Two points are important here. First, due to Groq API limitations, I cannot generate all the data I need in a single call, or even in a few calls. I have to make many API requests to reach a sufficient dataset size, so I worked in iterations and stopped at iteration 42. Each time new data is added, the dataset version number increases automatically.
> Second, because I make multiple calls, some companies appear several times, with slightly different descriptions and different domain names. Later in this document, I explain how I handled this to avoid introducing bias into the training process.
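
The automatic version bump described above could look roughly like the sketch below. It is only an illustration: `next_version_name` is a hypothetical helper, and the only thing taken from the notebook is the `domain_dataset_v<N>.json` naming pattern.

```python
import re

# File naming pattern taken from the notebook: domain_dataset_v42.json
VERSION_RE = re.compile(r"^domain_dataset_v(\d+)\.json$")

def next_version_name(existing_names):
    """Return the file name for the next dataset version (hypothetical helper)."""
    versions = []
    for name in existing_names:
        match = VERSION_RE.match(name)
        if match:  # ignore files that do not follow the pattern
            versions.append(int(match.group(1)))
    # Start at v1 when no versioned file exists yet
    return f"domain_dataset_v{max(versions, default=0) + 1}.json"

print(next_version_name(["domain_dataset_v41.json", "domain_dataset_v42.json"]))
# prints: domain_dataset_v43.json
```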

## **Load Data**

In [15]:
from train.data_processing import load_json_dataset

model_path = "model-based"
data_path = "data/attempt_0/domain_dataset_v42.json"

data = load_json_dataset(data_path)

In [18]:
data[:3]

[{'business_description': 'EcoCycle is a startup that specializes in recycling and upcycling of electronic waste. Their goal is to reduce the massive amounts of e-waste that end up in landfills every year and promote sustainable living. They partner with companies and organizations to collect and process electronic waste, converting it into raw materials that can be used to manufacture new products.',
  'domain': ['ecocycle.co',
   'recycleit.today',
   'greenbytes.net',
   'techrecycle.io']},
 {'business_description': 'FitFusion is a fitness studio that offers a unique blend of yoga, Pilates, and dance classes. Their expert instructors design fusion workouts that cater to different fitness levels, from beginners to advanced practitioners. Their studios are equipped with state-of-the-art equipment and provide a serene atmosphere for members to relax and rejuvenate.',
  'domain': ['fitfusion.studio',
   'fusionfitness.co',
   'yogadance.co',
   'pilatesplus.fit',
   'movewithfusion.com'

## **Group data by company**

> Once the data generation was complete, I grouped the entries by company. Since the data is added per API call, it comes in as a list of dictionaries, each containing a business description and a list of domain names. I then grouped these entries by company so that each company has all its associated descriptions, along with the domain names linked to each description.

In [19]:
from train.data_processing import group_by_company
data_grouped_by_company_name  = group_by_company(data)

In [25]:
data_grouped_by_company_name['EcoCycle'][:2]

[{'business_description': 'EcoCycle is a startup that specializes in recycling and upcycling of electronic waste. Their goal is to reduce the massive amounts of e-waste that end up in landfills every year and promote sustainable living. They partner with companies and organizations to collect and process electronic waste, converting it into raw materials that can be used to manufacture new products.',
  'domain': ['ecocycle.co',
   'recycleit.today',
   'greenbytes.net',
   'techrecycle.io']},
 {'business_description': 'EcoCycle is a sustainable waste management company that provides innovative recycling solutions to households and businesses. Our mission is to reduce landfill waste and promote a greener future.',
  'domain': ['ecocycle.io',
   'greencycle.net',
   'wastewarriors.co',
   'recyclerevolution.com',
   'sustainablewaste.com']}]
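
Since `group_by_company` lives in a module that is not shown here, the following is only a minimal sketch of one way such grouping could work. It assumes the company name is the first word of the description (as in "EcoCycle is a startup ..."); the real implementation may extract the name differently.

```python
from collections import defaultdict

def group_by_company_sketch(records):
    """Group records by company name.

    Assumption: the company name is the first whitespace-separated token
    of a non-empty 'business_description'.
    """
    grouped = defaultdict(list)
    for record in records:
        name = record["business_description"].split(maxsplit=1)[0]
        grouped[name].append(record)
    return dict(grouped)

sample = [
    {"business_description": "EcoCycle is a startup that recycles e-waste.",
     "domain": ["ecocycle.co", "greenbytes.net"]},
    {"business_description": "FitFusion is a fitness studio.",
     "domain": ["fitfusion.studio"]},
    {"business_description": "EcoCycle is a sustainable waste management company.",
     "domain": ["ecocycle.io"]},
]

grouped = group_by_company_sketch(sample)
print({name: len(entries) for name, entries in grouped.items()})
# prints: {'EcoCycle': 2, 'FitFusion': 1}
```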

## **Splitting raw data**

> Next, the data is split into training and test sets. Because the split is done on the company groups built in the previous step, a company that appears in the training set cannot appear in the test set. This prevents the model from simply reproducing domain names it has already seen, even when the descriptions are slightly different.

In [27]:
from train.data_processing import split_data

# Split the grouped (not yet flattened) data into train and eval sets
train_data, eval_data = split_data(data_grouped_by_company_name, test_size=0.2, seed=42)
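
A company-level split can be sketched as follows. This is an assumption about how `split_data` might behave, not its actual code: company names are shuffled with a fixed seed and a fraction of the companies (not of the individual rows) is held out.

```python
import random

def split_by_company_sketch(grouped, test_size=0.2, seed=42):
    """Hold out a fraction of companies so no company is in both sets."""
    names = sorted(grouped)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_eval = max(1, round(len(names) * test_size))
    eval_names = set(names[:n_eval])
    train = {n: grouped[n] for n in names if n not in eval_names}
    evals = {n: grouped[n] for n in names if n in eval_names}
    return train, evals

# Toy input: 10 companies, each with one entry
grouped = {f"Company{i}": [{"business_description": f"Company{i} is ...",
                            "domain": [f"company{i}.com"]}] for i in range(10)}

train, evals = split_by_company_sketch(grouped)
print(len(train), len(evals))               # prints: 8 2
print(set(train) & set(evals))              # prints: set()
```

Splitting by company rather than by row also means the final row counts need not match `test_size` exactly, which is consistent with the 14,849 / 3,291 split reported later in the notebook (about 18% of rows in the evaluation set).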

## **Flatten data**

> I then flatten the data so that each description is associated with a single domain name. Since each description is linked to several domain names (typically five), I expand the dataset so that each description–domain pair becomes its own entry.

In [28]:
from train.data_processing import flatten

# Flatten data so each business description has one domain per item
flatten_train_data = flatten(train_data)
flatten_eval_data = flatten(eval_data)

In [29]:
flatten_train_data[:3]

[{'business_description': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'domain': 'mindfulmentor.co'},
 {'business_description': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'domain': 'mindfulcoaching.io'},
 {'business_description': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'domain': 'productivitypros.net'}]
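
The flattening step can be illustrated with the short sketch below. It assumes the input is a dictionary mapping each company to its list of entries (the shape returned by the grouping step); the actual `flatten` function may accept a different structure.

```python
def flatten_sketch(grouped):
    """Expand each (description, [domains]) entry into one row per domain."""
    rows = []
    for entries in grouped.values():
        for entry in entries:
            for domain in entry["domain"]:
                rows.append({
                    "business_description": entry["business_description"],
                    "domain": domain,
                })
    return rows

toy_grouped = {
    "MindfulMentor": [
        {"business_description": "MindfulMentor is an online coaching service.",
         "domain": ["mindfulmentor.co", "mindfulcoaching.io"]},
    ],
}

flat_rows = flatten_sketch(toy_grouped)
print(len(flat_rows))            # prints: 2
print(flat_rows[1]["domain"])    # prints: mindfulcoaching.io
```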

## **Formatting data for Supervised Fine-tuning**

> Next, the training and evaluation data were transformed into an input–output format.

In [30]:
from train.data_processing import format_for_sft

# Input-output formatting
formatted_train_data = format_for_sft(flatten_train_data)
formatted_eval_data = format_for_sft(flatten_eval_data)

In [31]:
formatted_train_data[:3]

[{'input': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'output': 'mindfulmentor.co'},
 {'input': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'output': 'mindfulcoaching.io'},
 {'input': 'MindfulMentor is an online coaching service that provides personalized guidance on mindfulness, productivity, and goal-setting. Their certified coaches offer one-on-one sessions.',
  'output': 'productivitypros.net'}]

## **Converting to HuggingFace Datasets**

> Next, we convert the training and evaluation data into the Hugging Face Dataset format.

In [36]:
from train.data_processing import convert_to_dataset

# Convert formatted data to HuggingFace Dataset format
train_dataset = convert_to_dataset(formatted_train_data)
eval_dataset = convert_to_dataset(formatted_eval_data)

# Print number of samples for sanity check
print("Number of training samples:", len(train_dataset))
print("Number of evaluation samples:", len(eval_dataset))
print("Total number of samples:", len(train_dataset) + len(eval_dataset))

Number of training samples: 14849
Number of evaluation samples: 3291
Total number of samples: 18140


## **Compute max train token length**

> Next, we measure the maximum token length of the inputs and outputs. These values are used later to choose a safe `max_length` for tokenization and `max_seq_length` for training.

In [37]:
from transformers import AutoTokenizer
from train.data_processing import get_max_token_length

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Compute max token lengths for input and output fields
max_train_input_len = get_max_token_length(train_dataset, tokenizer, "input")
max_train_output_len = get_max_token_length(train_dataset, tokenizer, "output")
max_eval_input_len = get_max_token_length(eval_dataset, tokenizer, "input")
max_eval_output_len = get_max_token_length(eval_dataset, tokenizer, "output")

# Display all lengths
print("Train set:")
print("  Max input token length :", max_train_input_len)
print("  Max output token length:", max_train_output_len)

print("\nEval set:")
print("  Max input token length :", max_eval_input_len)
print("  Max output token length:", max_eval_output_len)

# Upper bound on total sequence length (longest input + longest output)
print("\nmax_seq_length (train):", max_train_input_len + max_train_output_len)
print("max_seq_length (eval) :", max_eval_input_len + max_eval_output_len)

Train set:
  Max input token length : 79
  Max output token length: 11

Eval set:
  Max input token length : 85
  Max output token length: 10

max_seq_length (train): 90
max_seq_length (eval) : 95
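
As a rough illustration of what a helper like `get_max_token_length` computes, the sketch below takes any `encode` callable that turns text into a list of tokens. The whitespace tokenizer is a stand-in for illustration only; with a Hugging Face tokenizer one would pass something like `lambda text: tokenizer(text)["input_ids"]`. It also shows why the sum of the two maxima is an upper bound: the longest input and the longest output need not belong to the same row.

```python
def max_token_length_sketch(rows, encode, field):
    """Longest tokenized length of `field` across all rows."""
    return max(len(encode(row[field])) for row in rows)

def toy_encode(text):
    # Stand-in tokenizer: one token per whitespace-separated word
    return text.split()

rows = [
    {"input": "a b c d", "output": "x"},
    {"input": "a", "output": "x y z"},
]

max_in = max_token_length_sketch(rows, toy_encode, "input")     # 4
max_out = max_token_length_sketch(rows, toy_encode, "output")   # 3
# Longest actual input + output pair within a single row
joint_max = max(len(toy_encode(r["input"])) + len(toy_encode(r["output"])) for r in rows)

print(max_in + max_out, joint_max)   # prints: 7 5
```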


## **Tokenize train and eval datasets**

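> Next, we tokenize the training and evaluation data, padding each example to `max_length = 100`, just above the largest combined length estimated above (95 tokens).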

In [38]:
from functools import partial
from train.data_processing import tokenize_for_sft

tokenize_fn = partial(
    tokenize_for_sft,
    tokenizer=tokenizer,
    max_length = 100
)

# Tokenize train dataset
tokenized_train_dataset = train_dataset.map(
    tokenize_fn,
    batched=False,
    remove_columns=train_dataset.column_names
)

# Tokenize eval dataset
tokenized_eval_dataset = eval_dataset.map(
    tokenize_fn,
    batched=False,
    remove_columns=eval_dataset.column_names
)


In [44]:
print(tokenized_train_dataset[0])

{'input_ids': [14683, 1007, 28755, 308, 271, 349, 396, 3270, 21464, 2372, 369, 5312, 3327, 1332, 15988, 356, 2273, 19965, 28725, 24504, 28725, 304, 5541, 28733, 15062, 28723, 6723, 20654, 25360, 2405, 624, 28733, 266, 28733, 538, 13912, 28723, 2273, 1007, 466, 271, 28723, 1115, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -1
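
The printed example suggests the usual supervised fine-tuning layout: prompt tokens, then target tokens and an end-of-sequence token, then padding, with the prompt positions masked to `-100` in `labels`. The sketch below reproduces that layout on toy token ids. It is an assumption about what `tokenize_for_sft` does, and in particular masking the padding positions with `-100` is a guess, since the printed labels are truncated.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_sft_example(prompt_ids, target_ids, eos_id, pad_id, max_length):
    """Concatenate prompt + target + EOS, mask the prompt in labels, pad to max_length."""
    input_ids = (prompt_ids + target_ids + [eos_id])[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id])[:max_length]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        # Assumption: padded positions are excluded from the loss as well
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }

example = build_sft_example(
    prompt_ids=[5, 6, 7],   # toy ids standing in for the description
    target_ids=[8, 9],      # toy ids standing in for the domain name
    eos_id=2, pad_id=0, max_length=8,
)
print(example["input_ids"])       # prints: [5, 6, 7, 8, 9, 2, 0, 0]
print(example["attention_mask"])  # prints: [1, 1, 1, 1, 1, 1, 0, 0]
print(example["labels"])          # prints: [-100, -100, -100, 8, 9, 2, -100, -100]
```

With this layout the loss is computed only on the domain name and the end-of-sequence token, so the model is not trained to reproduce the description itself.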

## **Training**

The training was launched using the parameters listed below, grouped into three categories: Training Parameters, Quantization Parameters, and LoRA Parameters.

| Category | Parameter | Value |
|----------|-----------|-------|
| **Training** | num_train_epochs | 20 |
|  | per_device_train_batch_size | 64 |
|  | per_device_eval_batch_size  | 64 |
|  | gradient_accumulation_steps | 4 |
|  | eval_accumulation_steps  | 16 |
|  | eval_strategy | steps |
|  | eval_steps | 25 |
|  | logging_steps | 25 |
|  | save_steps | 25 |
|  | learning_rate | 5e-5 |
|  | weight_decay | 0.001 |
|  | fp16 | False |
|  | bf16 | True |
|  | max_grad_norm | 0.3 |
|  | max_steps | -1 |
|  | warmup_ratio | 0.1 |
|  | group_by_length | True |
|  | lr_scheduler_type | cosine |
|  | max_seq_length | 100 |
|  | optimizer | paged_adamw_32bit |
|  | metric_for_best_model | eval_loss |
|  | load_best_model_at_end | True |
|  | greater_is_better | False |
|  | early_stopping_patience | 3 |
|  | early_stopping_threshold | 0.001 |
| **Quantization** | load_in_4bit | True |
|  | bnb_4bit_quant_type | nf4 |
|  | bnb_4bit_compute_dtype | bfloat16 |
|  | bnb_4bit_use_double_quant | False |
| **LoRA** | r | 8 |
|  | lora_alpha | 16 |
|  | bias | none |
|  | task_type | CAUSAL_LM |
|  | target_modules | q_proj, k_proj, v_proj, o_proj |
|  | lora_dropout | 0.05 |
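
A few quantities follow directly from the table. The arithmetic below is a quick sanity check, assuming training runs on a single device (the number of devices is not stated in the notebook) and using the training set size and the last logged step reported elsewhere in this notebook.

```python
per_device_train_batch_size = 64
gradient_accumulation_steps = 4
num_devices = 1            # assumption: single GPU
train_samples = 14849      # training set size reported earlier in the notebook
last_logged_step = 225     # last step in the training log shown further down

# Samples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Approximate optimizer steps needed to see the whole training set once
steps_per_epoch = train_samples / effective_batch
# Approximate number of epochs completed at the last logged step
epochs_at_last_step = last_logged_step / steps_per_epoch

# LoRA update scaling factor: lora_alpha / r
lora_scale = 16 / 8

print(effective_batch)                  # prints: 256
print(round(steps_per_epoch, 1))        # prints: 58.0
print(round(epochs_at_last_step, 2))    # prints: 3.88
print(lora_scale)                       # prints: 2.0
```

Under that assumption, the run covers roughly 4 of the 20 configured epochs before it ends.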


In [9]:
from config.train_config import train_config
from train.train_runner import train_model

model, attempt_path = train_model(tokenized_train_dataset, tokenized_eval_dataset, train_config, models_dir = "models")


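| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25 | 10.9906 | 7.469403 |
| 50 | 6.4399 | 4.417335 |
| 75 | 3.7942 | 3.455979 |
| 100 | 3.2646 | 3.308002 |
| 125 | 3.1711 | 3.292005 |
| 150 | 3.1224 | 3.287942 |
| 175 | 3.093 | 3.297572 |
| 200 | 3.0518 | 3.321847 |
| 225 | 3.0358 | 3.335022 |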
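
The log ends at step 225, which matches the early-stopping settings in the parameter table (`early_stopping_patience = 3`, `early_stopping_threshold = 0.001`). The sketch below replays the logged validation losses through a simplified patience counter. It is an approximation of the idea, not the exact logic of the Hugging Face `EarlyStoppingCallback`.

```python
# Validation losses copied from the training log above
val_losses = {
    25: 7.469403, 50: 4.417335, 75: 3.455979, 100: 3.308002, 125: 3.292005,
    150: 3.287942, 175: 3.297572, 200: 3.321847, 225: 3.335022,
}

def replay_early_stopping(losses, patience=3, threshold=0.001):
    """Return (best_step, stop_step) for a simplified patience-based rule."""
    best_step, best_loss, bad_evals = None, float("inf"), 0
    for step, loss in losses.items():
        if best_loss - loss > threshold:      # meaningful improvement
            best_step, best_loss, bad_evals = step, loss, 0
        else:                                 # no improvement at this evaluation
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None

best_step, stop_step = replay_early_stopping(val_losses)
print(best_step, stop_step)   # prints: 150 225
```

The validation loss is lowest at step 150 and rises over the next three evaluations while the training loss keeps falling, so with `load_best_model_at_end = True` the checkpoint from step 150 is the one expected to be restored.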




## **Evaluate**

> Next, inference is run on the evaluation data to test the model. The results are first assessed with cosine similarity, and then an LLM judge (GPT-4, via API) assigns each generated domain a confidence score.

### **Predict domain name**

In [11]:
from evaluate.generation_utils import generate

prediction_eval_data = generate(
    model = model,
    tokenizer = tokenizer,
    data = flatten_eval_data,
    proportion = 1.0,
    seed = 0,
    max_new_tokens = 10
)

In [13]:
prediction_eval_data[:5]

[{'business_description': 'LumiStudio is a photography studio that specializes in portrait, wedding, and event photography. They offer customized photography packages and editing services.',
  'generated_domain': 'photographyhug.com',
  'raw_generation': 'photographyhug.com We provide photography services for'},
 {'business_description': 'LumiStudio is a photography studio that specializes in portrait, wedding, and event photography. They offer customized photography packages and editing services.',
  'generated_domain': 'photorevivals.coourb.com',
  'raw_generation': 'photorevivals.coourb.com'},
 {'business_description': 'LumiStudio is a photography studio that specializes in portrait, wedding, and event photography. They offer customized photography packages and editing services.',
  'generated_domain': 'shutterstudio.net',
  'raw_generation': 'shutterstudio.net 3Tango'},
 {'business_description': 'LumiStudio is a photography studio that specializes in portrait, wedding, and event ph
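
The outputs above show that `raw_generation` can continue past the domain name, while `generated_domain` keeps only the leading part. One simple way to get that behavior is sketched below. Both helpers are hypothetical and are not taken from `evaluate.generation_utils`; the syntax check is an extra illustration.

```python
import re

# Loose syntactic check: dot-separated labels made of letters, digits, hyphens
DOMAIN_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")

def extract_domain(raw_generation):
    """Keep the first whitespace-separated token of the raw generation."""
    tokens = raw_generation.strip().split()
    return tokens[0].lower() if tokens else ""

def looks_like_domain(candidate):
    """True if the candidate is at least shaped like a domain name."""
    return bool(DOMAIN_RE.match(candidate))

print(extract_domain("photographyhug.com We provide photography services for"))
# prints: photographyhug.com
print(extract_domain("shutterstudio.net 3Tango"))
# prints: shutterstudio.net
print(looks_like_domain("our"))
# prints: False
print(looks_like_domain("photorevivals.coourb.com"))
# prints: True
```

A shape check like this would catch outputs such as `our` (seen in the judged samples later on), but not malformed yet syntactically valid strings such as `photorevivals.coourb.com`.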

### **Evaluate with cosine similarity**

In [14]:
from evaluate.similarity_score import SimilarityScorer

scorer = SimilarityScorer()

mean_score, scores = scorer.evaluate_batch(
    data = prediction_eval_data,
    proportion = 1.0
)
print(f"Cosine Similarity Mean : {mean_score}")

Cosine Similarity Mean : 0.8084388961910803


> Once the predictions are made, cosine similarity is used to measure the semantic closeness between each description and the domain name generated for it. This approach has limitations: a deliberately lightweight embedding model (`intfloat/e5-small-v2`) was chosen to avoid memory constraints, so the goal is only to check that the generated domain is at least somewhat semantically related to the description. The method is also a low-cost alternative to an API-based judge, which can be expensive. The Groq API used to generate the data was also considered for this role, but GPT-4 is used in the next section. As seen above, scores can be quite high because the model tends to reuse words from the description in the generated domain, which inflates the similarity score without making the domain more relevant.
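
For reference, the score itself is the cosine of the angle between two embedding vectors. The sketch below computes it on small made-up vectors; in the notebook the vectors would come from the embedding model inside `SimilarityScorer`, whose internals are not shown here.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings", for illustration only
description_vec = [0.2, 0.7, 0.1]
domain_vec = [0.25, 0.6, 0.3]

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
toy_score = cosine_similarity(description_vec, domain_vec)

print(round(same_direction, 6), orthogonal)   # prints: 1.0 0.0
```

E5-family embedding models are also generally reported to produce cosine scores concentrated in a fairly high band, so a mean of 0.81 is probably better read as a relative signal than as an absolute measure of quality.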

In [2]:
import json
# Temporarily save the prediction dataset
#with open("prediction_eval_data.json", "w", encoding="utf-8") as f:
#    json.dump(prediction_eval_data, f, ensure_ascii=False, indent=2)

## **LLM as a Judge**

In [3]:
# Load prediction data
with open("prediction_eval_data.json", "r", encoding="utf-8") as f:
    prediction_eval_data = json.load(f)

In [45]:
from evaluate.llm_as_judge import evaluate

mean_conf, scores = evaluate(
    data=prediction_eval_data,
    model_name="gpt-4",
    proportion=0.1,
    seed=0,
    temperature=0.0
)
print(f"GPT-4 confidence Mean : {mean_conf}")

GPT-4 confidence Mean : 0.5364741641337372


> To evaluate the generated domains with an LLM, `gpt-4` was used as a judge. The evaluation was not run on the entire evaluation dataset; instead, a random sample of about 10% of the generated results was taken, roughly 330 description–domain pairs out of 3,291. Using the prompt defined in `llm_as_judge.py` in the `evaluate` directory, `gpt-4` was asked to score the relevance of each generated domain to its description on a scale from 0 to 1. The resulting average confidence score was 0.54, which points to clear limitations: some generated domains were plainly irrelevant.
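
The sampling and averaging around the judge can be sketched without calling any API. In the example below, `stub_judge` is a placeholder standing in for the GPT-4 call, and `evaluate_sample_sketch` is a hypothetical function; the real prompt and response parsing live in the `evaluate` module and are not reproduced here.

```python
import random

def evaluate_sample_sketch(data, judge, proportion=0.1, seed=0):
    """Score a seeded random subset of the predictions and average the scores."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * proportion))
    sample = rng.sample(data, k)
    scores = [
        {**item, "confidence": judge(item["business_description"], item["generated_domain"])}
        for item in sample
    ]
    mean_conf = sum(s["confidence"] for s in scores) / len(scores)
    return mean_conf, scores

def stub_judge(description, domain):
    # Placeholder rule instead of an LLM call: penalize outputs without a dot
    return 0.8 if "." in domain else 0.1

# Toy predictions with the same size as the notebook's evaluation set
toy_predictions = [
    {"business_description": f"Company{i} sells things.", "generated_domain": f"company{i}.com"}
    for i in range(3291)
]

mean_conf, judged = evaluate_sample_sketch(toy_predictions, stub_judge)
print(len(judged))   # prints: 329
```

Fixing the seed keeps the judged subset identical across runs, so scores from different model attempts stay comparable.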

In [14]:
scores[:5]

[{'business_description': 'HomeHelp is a home maintenance and repair service that connects homeowners with trusted and experienced technicians. From plumbing to electrical work, HomeHelp has got it covered.',
  'generated_domain': 'our',
  'confidence': 0.1},
 {'business_description': 'PetalPushers is a flower delivery service that offers same-day delivery of fresh bouquets and arrangements. They source their flowers from local farmers to ensure the highest quality.',
  'generated_domain': 'same-day-petal.com',
  'confidence': 0.8},
 {'business_description': 'Provenance is a fine art gallery that showcases original works by emerging and established artists. Their exhibitions feature painting, sculpture, photography, and mixed media.',
  'generated_domain': 'artsphere.net',
  'confidence': 0.7},
 {'business_description': 'FreshFare is a meal kit delivery service providing healthy, pre-portioned ingredients and easy-to-follow recipes for home cooking. Their menu changes seasonally to ens

## **Save specific models**

> The final step is to save the weights of the best-performing model.

In [39]:
from train.train_utils import save_weights

# Save the adapter weights once training and evaluation are done
save_dir = "models/weights/attempt_11/last_weights_evaluated"
save_weights(model, save_dir)

'models/weights/attempt_11/last_weights_evaluated'

## **Merge Models**

> Finally, the weights are merged with the base model to produce the final fine-tuned model.

In [40]:
from merge.merge_lora import merge
import torch

out = merge(
    base_model_id="model-based",
    adapter_path="models/weights/attempt_11/last_weights_evaluated",
    output_base_path="./models/merged_models",
    use_case="generate-domain",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
print("Saved to:", out)

Saved to: models/merged_models/Mistral-7B-v0.1-generate-domain_v4
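
Conceptually, merging a LoRA adapter folds the low-rank update back into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below shows that arithmetic on plain Python lists with a rank-1 adapter. It is not the code of `merge.merge_lora`; in practice this step is commonly done with PEFT's `merge_and_unload()`, and whether the author's module uses it is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora_sketch(w, lora_b, lora_a, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

w = [[1.0, 0.0],
     [0.0, 1.0]]            # toy base weight, shape (2, 2)
lora_b = [[1.0], [2.0]]     # toy B, shape (2, rank=1)
lora_a = [[0.5, 0.0]]       # toy A, shape (rank=1, 2)

# r=1 and lora_alpha=2 give the same scale (2.0) as the notebook's r=8, lora_alpha=16
merged = merge_lora_sketch(w, lora_b, lora_a, r=1, lora_alpha=2)
print(merged)   # prints: [[2.0, 0.0], [2.0, 1.0]]
```

After merging, the model no longer needs the adapter at inference time, since the update is part of the base weights.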
