<a href="https://colab.research.google.com/github/LaurynasRekasius/Domain_Name_Generator/blob/main/notebooks/Synthetic_Data_Generation_Mistral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Objective:
- Generate synthetic data for fine-tuning
- Dataset should have two components:
  - Business description
  - List of suggested domain names for corresponding description
    - Names should be in JSON format, to get the best results from fine-tuning

Note:
- I chose Mistral models because of their license: it allows generated data to be used for training other models, which is not always the case.
- I am using the Mistral AI API here because of its low cost and fast inference.
- I picked Mistral 7B over Mixtral 8x7B for the main dataset generation because it has a better price/performance ratio for this particular task.

This notebook runs on CPU.


# Setup Libraries

In [1]:
# MistralClient and ChatMessage belong to the pre-1.0 mistralai client, so pin below 1.0
!pip install -q "mistralai<1.0"

from IPython.display import clear_output
import re
import json
import random

clear_output()

In [2]:
import os

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

# Read the key from the environment instead of hardcoding it in the notebook
api_key = os.environ["MISTRAL_API_KEY"]
model = "open-mixtral-8x7b"  # Alternative: "open-mistral-7b"

client = MistralClient(api_key=api_key)

# Generating a list of diverse business industries

The first step is to generate a list of business industries. Seeding each prompt with a different industry makes it easier to generate diverse business descriptions with less repetition.

In [3]:
messages = [
    ChatMessage(role="user",
                content="""
Generate a diverse and unique list of business industries starting from the most popular ones. The list should have 200 items.
"""
    )
          ]

chat_response = client.chat(
    model=model,
    messages=messages,
    temperature = 0.7
)

industries_string = chat_response.choices[0].message.content
print(industries_string)

1. Technology/Software
2. Financial Services
3. Healthcare
4. Retail
5. Professional Services (e.g. law, consulting)
6. Manufacturing
7. Construction
8. Education
9. Real Estate
10. Wholesale Trade
11. Arts, Entertainment, and Recreation
12. Accommodation and Food Services
13. Transportation and Warehousing
14. Information Services
15. Administrative and Support Services
16. Mining, Quarrying, and Oil and Gas Extraction
17. Utilities
18. Agriculture, Forestry, Fishing and Hunting
19. Management of Companies and Enterprises
20. Scientific Research and Development Services
21. Repair and Maintenance
22. Waste Management and Remediation Services
23. Personal and Laundry Services
24. Religious, Grantmaking, Civic, Professional and Similar Organizations
25. Social Assistance
26. Telecommunications
27. Insurance Carriers
28. Publishing Industries (except Internet)
29. Broadcasting (except Internet)
30. Motion Picture and Sound Recording Industries
31. Broadcasting and Telecommunications
32. 

Format the output as a list for easier usage

In [4]:
# Regular expression to match the pattern
pattern = r'\d+\.\s*(.*)'

# Find all matches of the pattern in the string, excluding the numbers
industries_list = re.findall(pattern, industries_string)

# Remove duplicate industries (note: set() does not preserve the original order)
industries = list(set(industries_list))

print(len(industries))

179
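
One thing to watch in the parsing step: a truncated trailing item such as `32. ` still matches the pattern and can contribute an empty string, and `set()` discards the model's ordering. The following is a minimal sketch of a stricter parser, not the code used to produce the count above. `parse_numbered_list` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import re

def parse_numbered_list(text):
    """Extract item text from lines like '12. Retail', dropping empties and duplicates."""
    # Anchor to single lines so whitespace matching cannot spill onto the next line
    pattern = re.compile(r"^[ \t]*\d+\.[ \t]*(.*?)[ \t]*$", re.MULTILINE)
    items = [match.strip() for match in pattern.findall(text)]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(item for item in items if item))

# Hypothetical sample with a duplicate and a truncated last item
sample = "1. Technology/Software\n2. Retail\n3. Retail\n4. \n"
print(parse_numbered_list(sample))
```

Keeping the order matters only if the "most popular first" ranking from the prompt should survive into later steps.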


# Create Initial Sample of Business Descriptions

These descriptions will bring more variety to the prompts, so that the final dataset contains more relevant and diverse business descriptions.

The following code uses Mistral 7B only.

In [None]:
model = "open-mistral-7b"  # Model selection
num_queries = 3  # Number of times to run query for each industry

# Initialize the Mistral client
client = MistralClient(api_key=api_key)

In [None]:
# Initialize storage for results
all_results = []
all_raw = []

# Base prompt template with placeholder for industries
base_prompt_template = """
Generate a single detailed and creative business description for a company in the {industry} industry, followed by five unique website name suggestions. The names should be relevant to the business description and business name if provided, focusing on the core services or products offered. The names should not include domain extensions like .com or prefixes like www. Ensure each business name is a single word or a creative combination of words that could potentially be used as a domain name. Provide the business description and the five suggested names in a JSON format.

Here is a response format example:
{{
    "description": "An innovative forestry company offering tree planting services, reforestation projects, and sustainable wood products.",
    "names": [
        "forestfuel",
        "treeteam",
        "ecogrowth",
        "woodwise",
        "greengrowth"
    ]
}}
"""

In [None]:
# Iterate through each industry
for industry in industries:
    print(industry)  # For debugging

    for _ in range(num_queries):  # Run num_queries queries per industry
        # Update prompt with correct industry
        content = base_prompt_template.format(industry=industry)
        #print(content)  # For debugging
        messages = [ChatMessage(role="user", content=content)]

        chat_response = client.chat(model=model,
                                    messages=messages,
                                    max_tokens=200,
                                    temperature=0.3)

        # Store the raw chat response content
        response_content = chat_response.choices[0].message.content if chat_response.choices else "No response"
        print(response_content)

        # Attempt to parse the response; only a JSON object counts as valid
        try:
            response_dict = json.loads(response_content)
            if not isinstance(response_dict, dict):
                raise ValueError("Response is not a JSON object")
            response_dict['industry'] = industry  # Add the industry to the result
            json_status = "YES"
        except (json.JSONDecodeError, ValueError) as e:
            json_status = "NO"
            response_dict = {"error": f"Failed to process response: {e}", "industry": industry}

        result_record = {
            "response": response_dict,
            "JSON": json_status,
        }

        # Add the results to lists
        all_results.append(result_record)
        all_raw.append(chat_response)

# Filter only valid JSON responses
sample_results = [result for result in all_results if result['JSON'] == "YES"]

Streaming output truncated to the last 5000 lines.
        "mindfulfaith",
        "serenitystream"
    ],
    "industry": "Religious Organizations"
}
{
    "description": "A spiritual sanctuary providing inspirational sermons, community outreach programs, and personalized counseling services to foster emotional and spiritual growth.",
    "names": [
        "soulseed",
        "sparkfaith",
        "heartwisdom",
        "mindfulspark",
        "peacebloom"
    ],
    "industry": "Religious Organizations"
}
{
    "description": "A spiritual sanctuary providing inspirational sermons, community outreach programs, and personalized counseling services to foster emotional and spiritual growth.",
    "names": [
        "soulseed",
        "sparkfaith",
        "heartwisdom",
        "mindfulspirit",
        "serenitystream"
    ],
    "industry": "Religious Organizations"
}
Grantmaking and Giving Services
{
    "description": "Empathy Grants: A compassionate grantmaking organi

# Full dataset generation

In [None]:
def get_one_random_example_for_industry(industry, reference):
    # Filter examples for the given industry with valid JSON responses
    industry_examples = [
        example for example in reference
        if example['response']['industry'] == industry and example['JSON'] == 'YES'
    ]

    # Randomly choose one example from the filtered list, or None if the industry has no valid examples
    chosen_example = random.choice(industry_examples) if industry_examples else None

    # Prepare the output by selecting only the 'description' and 'names' keys
    formatted_example = None
    if chosen_example:
        formatted_example = {
            "description": chosen_example['response']['description'],
            "names": chosen_example['response']['names']
        }
    return formatted_example
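
The function above rescans the whole reference list on every call. As a sketch of an alternative, assuming the same record shape as `sample_results`, the examples can be grouped by industry once and then sampled from the index. `build_example_index` and `pick_example` are hypothetical names, and the two records below are invented for the demonstration.

```python
import random
from collections import defaultdict

def build_example_index(reference):
    """Group valid sample records by industry, keeping only description and names."""
    index = defaultdict(list)
    for record in reference:
        if record.get("JSON") != "YES":
            continue
        response = record.get("response", {})
        # Skip records that parsed as JSON but lack the expected keys
        if "description" in response and "names" in response:
            index[response.get("industry")].append(
                {"description": response["description"], "names": response["names"]}
            )
    return index

def pick_example(index, industry, rng=random):
    """Return one random example for the industry, or None if there is none."""
    candidates = index.get(industry)  # .get() avoids creating empty entries
    return rng.choice(candidates) if candidates else None

# Hypothetical records in the same shape as sample_results
reference = [
    {"JSON": "YES", "response": {"description": "A bakery.", "names": ["breadly"], "industry": "Retail"}},
    {"JSON": "NO", "response": {"error": "bad json", "industry": "Retail"}},
]
index = build_example_index(reference)
print(pick_example(index, "Retail"))
print(pick_example(index, "Mining"))
```

The key check also guards against responses that are valid JSON but miss `description` or `names`, which would otherwise raise a `KeyError`.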

In [None]:
# Initialize storage for results
all_results = []
all_raw = []

# Number of valid responses to collect for each industry,
# targeting roughly 5000 records in total
num_queries = round(5000/len(industries))

# Base prompt template with placeholders for examples and industry
base_prompt_template = """
Generate a single detailed and creative business description for a company in the {industry} industry, followed by five unique website name suggestions. The names should be relevant to the business description and business name if provided, focusing on the core services or products offered. The names should not include domain extensions like .com or prefixes like www. Ensure each business name is a single word or a creative combination of words that could potentially be used as a domain name. Provide the business description and the five suggested names in a JSON format.

Here is a response format example:
{examples}


"""

In [None]:
# Main dataset generation
# checks for valid json output pattern and keeps ratio fixed for all industries

# Iterate through each industry
for industry in industries:
    print(industry)  # For debugging

    valid_json_count = 0  # Counter for valid JSON responses
    total_attempts = 0  # Counter for total attempts, including invalid responses

    while valid_json_count < num_queries:  # Run until valid JSON responses reach num_queries
        total_attempts += 1  # Increment total attempts each iteration

        # Fetch a new example for each run
        chosen_example = get_one_random_example_for_industry(industry, sample_results)
        examples_str = ""
        if chosen_example:
            # Format the single example for inclusion in the prompt
            formatted_example = json.dumps(chosen_example, indent=2)
            examples_str = formatted_example

        # Update content for each query to include the new example
        content = base_prompt_template.format(industry=industry, examples=examples_str)
        messages = [ChatMessage(role="user", content=content)]

        chat_response = client.chat(model=model, messages=messages, max_tokens=200, temperature=0.3)

        # Store the raw chat response content
        response_content = chat_response.choices[0].message.content if chat_response.choices else "No response"
        print(response_content)

        try:
            response_dict = json.loads(response_content)
            if not isinstance(response_dict, dict):
                raise ValueError("Response is not a JSON object")
            response_dict['industry'] = industry  # Add the industry to the result
            json_status = "YES"
            valid_json_count += 1  # Only count responses that parsed into a JSON object
        except (json.JSONDecodeError, ValueError) as e:
            json_status = "NO"
            response_dict = {"error": f"Failed to process response: {e}", "industry": industry}

        result_record = {
            "response": response_dict,
            "JSON": json_status,
            "attempt": total_attempts,  # Track the attempt number for debugging or analysis
        }

        all_results.append(result_record)  # Add the result record to the all_results list
        all_raw.append(chat_response)

    print(f"Industry: {industry}, Valid JSON Count: {valid_json_count}, Total Attempts: {total_attempts}")
    # Reset the counters for the next industry
    valid_json_count = 0
    total_attempts = 0


Streaming output truncated to the last 5000 lines.
  "description": "TechGuard: A premier testing laboratory dedicated to ensuring the reliability and safety of technology products in various industries. Our cutting-edge facilities and team of certified engineers provide comprehensive testing services, including functional testing, safety certifications, and performance analysis. By partnering with TechGuard, clients can deliver high-quality, secure, and compliant technology solutions to their customers.",
  "names": [
    "CertiShield",
    "TechSafeLabs",
    "ReliabilityHub",
    "SafetyTestPro",
    "ComplianceGuard"
  ]
}
{
  "description": "Veracity Labs: A pioneering testing laboratory dedicated to uncovering truth and ensuring reliability in various industries. Our team of meticulous scientists employs advanced technologies to analyze and validate the authenticity of consumer goods, food and beverages, pharmaceuticals, and environmental samples. With a commitment 

In [None]:
# Check used tokens
total_prompt_tokens = sum(response.usage.prompt_tokens for response in all_raw)
total_completion_tokens = sum(response.usage.completion_tokens for response in all_raw)

print(f"Total prompt tokens: {total_prompt_tokens}")
print(f"Total completion tokens: {total_completion_tokens}")


Total prompt tokens: 641974
Total completion tokens: 373747
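
The token totals can be turned into a rough cost figure. The sketch below assumes per-million-token pricing with separate input and output rates. The prices are placeholders set to 1.00, not actual Mistral pricing, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(prompt_tokens, completion_tokens, input_price_per_m, output_price_per_m):
    """Estimate API cost from token counts and per-million-token prices."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_m
    output_cost = completion_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# Token totals reported above; both prices are PLACEHOLDERS, replace with current rates
cost = estimate_cost(641_974, 373_747, input_price_per_m=1.00, output_price_per_m=1.00)
print(f"Estimated cost at placeholder prices: {cost:.2f}")
```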


In [None]:
# filter only valid data
data = [result for result in all_results if result['JSON'] == "YES"]

len(data)

5000
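
The `JSON == "YES"` flag only says that `json.loads` accepted the response. It does not check that the record has the shape the prompt asked for. A minimal sketch of a stricter structural check follows. `is_valid_record` is a hypothetical helper, and the rule that names contain only letters, digits, and hyphens is an assumption derived from the prompt's "no domain extensions" instruction.

```python
import json
import re

def is_valid_record(record, expected_names=5):
    """Check for a non-empty description plus the expected number of bare names."""
    if not isinstance(record, dict):
        return False
    description = record.get("description")
    names = record.get("names")
    if not isinstance(description, str) or not description.strip():
        return False
    if not isinstance(names, list) or len(names) != expected_names:
        return False
    # Assumed rule: letters, digits, and hyphens only, so no dots or spaces
    return all(
        isinstance(name, str) and re.fullmatch(r"[A-Za-z0-9-]+", name) is not None
        for name in names
    )

# Hypothetical responses: one well-formed, one with an extension and too few names
good = json.loads('{"description": "A forestry company.", "names": ["forestfuel", "treeteam", "ecogrowth", "woodwise", "greengrowth"]}')
bad = json.loads('{"description": "A forestry company.", "names": ["forestfuel.com", "treeteam"]}')
print(is_valid_record(good), is_valid_record(bad))
```

Applying such a check before counting a response as valid would change how many attempts each industry needs, so the record count above reflects the simpler filter only.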

In [None]:
# create train/test split. Make sure to keep industry distribution even

from collections import defaultdict
import math

# Transform the list of dictionaries
transformed_data = []
for item in data:
    transformed_item = {
        'prompt': item['response']['description'],
        'response': [name.lower() for name in item['response']['names']], # make sure all names are lowercase
        'industry': item['response']['industry']
    }
    transformed_data.append(transformed_item)

# Group data by industry
industry_groups = defaultdict(list)
for item in transformed_data:
    industry_groups[item['industry']].append(item)

# Initialize train and test sets
train_data = []
test_data = []

# Split proportion
split_proportion = 0.7  # 70% train, 30% test

# Stratify split by industry
for industry, items in industry_groups.items():
    split_index = math.floor(len(items) * split_proportion)
    train_data.extend(items[:split_index])
    test_data.extend(items[split_index:])

# Output the structure and some statistics about the datasets
print(f"Training dataset: {len(train_data)} rows")
print(f"Testing dataset: {len(test_data)} rows")

Training dataset: 3427 rows
Testing dataset: 1573 rows
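
Two details of this split are worth noting. Because `math.floor` is applied per industry, the training share ends up slightly below 70% overall (3427 of 5000 rows). The slices also follow generation order within each industry. The following is a sketch of a variant that shuffles each group with a fixed seed so that the split is reproducible and independent of generation order. `stratified_split` is a hypothetical helper, and the rows are made-up data.

```python
import math
import random
from collections import defaultdict

def stratified_split(items, key, train_fraction=0.7, seed=42):
    """Split items into train/test while keeping each group's proportion."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)

    rng = random.Random(seed)  # Local generator so the global random state is untouched
    train, test = [], []
    for group in groups.values():
        shuffled = group[:]  # Copy so the caller's lists are not reordered
        rng.shuffle(shuffled)
        cut = math.floor(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Hypothetical rows: 9 in one industry, 4 in another
rows = [{"industry": "Retail", "id": i} for i in range(9)]
rows += [{"industry": "Mining", "id": i} for i in range(4)]
train, test = stratified_split(rows, key="industry")
print(len(train), len(test))
```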


In [None]:
# Write datasets to a JSON file

json_file_path = 'train_dict.json'
with open(json_file_path, 'w') as file:
    json.dump(train_data, file)

print(f"Dataset dictionary has been written to {json_file_path}.")

json_file_path = 'test_dict.json'
with open(json_file_path, 'w') as file:
    json.dump(test_data, file)

print(f"Dataset dictionary has been written to {json_file_path}.")


Dataset dictionary has been written to train_dict.json.
Dataset dictionary has been written to test_dict.json.


# Validation dataset creation

To make sure the business descriptions differ from those in the training and test datasets, I am using the Mixtral 8x7B model for this task.

These descriptions will also have a different format, which might be closer to what the domain team will have.

In [5]:
# Set model
model = "open-mixtral-8x7b"

# Initialize storage for results
all_descriptions = []
all_descriptions_raw = []

num_queries = round(500/len(industries))    # Number of times to run query for each industry

# Base prompt template with a placeholder for the industry
base_prompt_template = """
Generate a short and creative business description for a company in the {industry} industry.

"""

In [6]:
# Validation dataset generation

# Iterate through each industry
for industry in industries:
    print(industry)  # For debugging

    for _ in range(num_queries):  # Run a fixed number of queries per industry

        # Build the prompt for the current industry
        content = base_prompt_template.format(industry=industry)
        messages = [ChatMessage(role="user", content=content)]

        chat_response = client.chat(model=model,
                                    messages=messages,
                                    max_tokens=200,
                                    temperature=0.7)

        # Store the raw chat response content
        response_content = chat_response.choices[0].message.content if chat_response.choices else "No response"
        print(response_content)

        all_descriptions.append(response_content)  # Add the description text to all_descriptions
        all_descriptions_raw.append(chat_response)

Used Merchandise Stores
"Revive Treasures: Unearthing Pre-loved Gems! We breathe new life into yesteryear's treasures, offering a vibrant collection of used merchandise that's been thoughtfully curated and sustainably sourced. From vintage fashion finds to pre-owned home décor, every item at Revive Treasures tells a story waiting to be discovered. Join us in our mission to reduce waste, promote circular economy, and celebrate the charm of the past. Your next favorite piece is just a thrift store adventure away!"
"Revive Treasures: Unleashing the Hidden Gems of Yesteryears! We handpick, polish, and present preloved merchandise, transforming forgotten treasures into your unique statement pieces. With a twist of nostalgia and a dash of sustainability, Revive Treasures invites you to explore, reimagine, and celebrate the charm of the past, today!"
"At Revive Treasures, we believe that one person's trash is another's treasure! As your friendly neighborhood used merchandise store, we handpic

In [26]:
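# Clean up to better meet the definition of a "short business description":
# normalize escaped quotes and newlines, then keep descriptions under 500 characters

cleaned_descriptions = [desc.replace("\\'", "'").replace("\n", " ") for desc in all_descriptions]
filtered_descriptions = [desc for desc in cleaned_descriptions if len(desc) < 500]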

In [28]:
# Save validation dataset to file

json_file_path = 'eval_data.json'
with open(json_file_path, 'w') as file:
    json.dump(filtered_descriptions, file)
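
The sample output above shows some descriptions wrapped in literal double quotes, and the generation loop stores the string "No response" when the API returns no choices. The sketch below is a slightly stricter cleaning step that handles both cases. It is an optional variant rather than the filter used to produce `eval_data.json`. `clean_description` is a hypothetical helper, and the raw strings are invented.

```python
def clean_description(text, max_length=500):
    """Normalize one generated description; return None if it should be dropped."""
    cleaned = text.replace("\\'", "'").replace("\n", " ").strip()
    # Drop one pair of wrapping double quotes, as seen in some responses
    if len(cleaned) >= 2 and cleaned[0] == '"' and cleaned[-1] == '"':
        cleaned = cleaned[1:-1].strip()
    # Drop empty results, the "No response" placeholder, and overly long text
    if not cleaned or cleaned == "No response" or len(cleaned) >= max_length:
        return None
    return cleaned

# Hypothetical raw responses
raw = ['"Revive Treasures:\nUnearthing pre-loved gems!"', "x" * 600, "No response"]
cleaned = [c for c in (clean_description(r) for r in raw) if c is not None]
print(cleaned)
```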