# **Job Description**  

This study references the group's pipeline but makes several adjustments based on the specifics of the sub-project.

In [None]:
!pip install bitsandbytes
!pip install --upgrade transformers accelerate



In [None]:
!pip install transformers
!pip install peft bitsandbytes accelerate datasets



# **Model Loading: mistralai/Mistral-7B-Instruct-v0.3**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from huggingface_hub import login
import torch
from google.colab import userdata

# 1️⃣ Load model

login(userdata.get('HF_TOKEN'))
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True  # quantize the weights to 8-bit
)

# Load the model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    token=True  # `use_auth_token` is deprecated in favor of `token`
)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)

tokenizer.pad_token = tokenizer.eos_token

llm_pipeline = pipeline("text-generation",
                        model=model,
                        tokenizer=tokenizer,
                        do_sample=True,
                        temperature=1.1,
                        top_p=0.9,
                        eos_token_id=tokenizer.eos_token_id)

print("✅ LLM loaded (8-bit)")



Device set to use cuda:0


✅ LLM loaded (8-bit)


In [None]:
prompt = "Job Title: AI Research Intern\n\nDescription: Internship opportunity for exploring generative AI applications in business settings.\n\nGenerate the full job description below:"
output = llm_pipeline(prompt)[0]['generated_text']
print(output)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Job Title: AI Research Intern

Description: Internship opportunity for exploring generative AI applications in business settings.

Generate the full job description below:

Title: AI Research Intern

Description:

We are excited to offer an internship opportunity for a passionate and creative AI Research Intern to join our dynamic team. This position is ideal for students or recent graduates with a strong interest in artificial intelligence and its applications in business settings.

As an AI Research Intern, you will have the chance to work on cutting-edge projects, collaborate with industry experts, and gain valuable hands-on experience in the field of AI. Your primary responsibilities will include:

1. Conducting research on generative AI models and their potential applications in business.
2. Developing and implementing AI solutions to improve our business processes and products.
3. Collaborating with cross-functional teams to integrate AI technologies into our workflows.
4. Assist

# **Try Different Ways to Generate Synthetic Data**  

Generation was slow (more than 30 hours in total), so I parallelized it manually across sessions as my time allowed. Several methods were tested to improve the quality of the synthetic data, so in addition to de-duplication during generation, a second round of de-duplication over the merged data is required.

The JD data subsets are stored across multiple jd_part_*.json files, which will later be consolidated and further cleaned.
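
Consolidating the parts could look roughly like the sketch below. It is an illustration under assumptions, not the script used in this project: the helper names `_key` and `merge_jd_parts`, the case and whitespace normalization, and the SHA-256 keys are all hypothetical choices. Only the `jd_part_*.json` naming and the `instruction` / `input` / `response` fields come from the notebook. A content digest is used instead of Python's built-in `hash()`, because string hashes are randomized per process by default and so cannot be compared across separately launched sessions.

```python
import glob
import hashlib
import json


def _key(text):
    """Digest of case- and whitespace-normalized text (the normalization is an assumption)."""
    normalized = " ".join(str(text).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_jd_parts(pattern="jd_part_*.json"):
    """Load every file matching `pattern` and drop repeated entries."""
    seen_prompts, seen_responses, merged = set(), set(), []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            entries = json.load(f)
        for entry in entries:
            # Same two checks as during generation: instruction+input, then response.
            prompt_key = (_key(entry["instruction"]), _key(entry["input"]))
            response_key = _key(entry["response"])
            if prompt_key in seen_prompts or response_key in seen_responses:
                continue
            seen_prompts.add(prompt_key)
            seen_responses.add(response_key)
            merged.append(entry)
    return merged
```

Calling `merge_jd_parts()` in the directory that holds the part files returns one list, keeping the first occurrence of each entry in file-name order.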

Problems faced during generation:  


1.   The first challenge was the long generation time caused by the length of each JD. This was partly mitigated by generating the data in smaller fragments (which can also be run in parallel by manually launching multiple sessions) and merging the outputs at the end.
2.   The quality of the generated JD samples was not ideal. Some outputs ran several samples together into a single long entry. This was resolved by adding truncation logic to the code.
3.   Outputs were also duplicated. This was partially addressed by the updated generation script, which removes duplicates during generation, followed by a second round of cleaning on the entire dataset. As a result, the number of samples actually generated was far more than 1,000.
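
The truncation step in item 2 can be sketched as a small parser. This is an illustration under assumptions, not the project's exact code: `parse_first_task` is a hypothetical helper, and it expects only the text the model produced after the prompt, in the `Instruction:` / `Input:` / `Response:` layout used by the seed data. Everything from a second `Instruction:` marker onward is cut off, so samples that ran together yield only the first one.

```python
def parse_first_task(generated_text):
    """Return the first Instruction/Input/Response block, or None if malformed."""
    try:
        block = generated_text.split("Instruction:", 1)[1]
        instruction, rest = block.split("\nInput:", 1)
        input_text, response = rest.split("\nResponse:", 1)
    except (IndexError, ValueError):
        # A marker is missing, so the output does not follow the expected format.
        return None
    # Truncate at the next task, if the model kept generating.
    response = response.split("\nInstruction:", 1)[0]
    entry = {
        "instruction": instruction.strip(),
        "input": input_text.strip(),
        "response": response.strip(),
    }
    # Reject blocks with an empty field (an assumption; the notebook does not check this).
    if not all(entry.values()):
        return None
    return entry
```

If the input still contains the few-shot prompt, the first block found is the first example in the prompt rather than the new sample, so the prompt should be stripped before parsing.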

In [None]:
# ✅ Synthetic JD Generator - Few-shot + Mistral + Theme Pool Control + Theme Quotas
import json
import random
import time
from tqdm import tqdm
from transformers import pipeline

# -------------------- CONFIG --------------------
TASK_ID = 5
BATCH_SIZE = 20
FEW_SHOT_K = 3
MAX_NEW_TOKENS = 256
SEED_FILE = "normalized_jd_seed_data.json"
OUTPUT_JSON = f"jd_part_{TASK_ID}.json"
SAMPLES_PER_THEME = 10

# -------------------- SETUP --------------------
random.seed(1000 + TASK_ID)

with open(SEED_FILE, "r", encoding="utf-8") as f:
    seed_data = json.load(f)
assert len(seed_data) >= FEW_SHOT_K, "⚠️ Not enough seed examples for few-shot."

few_shot_pool = seed_data.copy()

# Theme pool
theme_pool = [
    # Technology & Engineering
    "Software Development",
    "AI & Machine Learning",
    "Data Science & Analytics",
    "Cybersecurity",
    "Cloud & DevOps",
    "Embedded Systems & IoT",

    # Design & Creative
    "UX/UI Design",
    "Graphic & Visual Design",
    "Product Design",
    "Motion Design / Animation",
    "Creative Direction",
    "Industrial / Furniture Design",
    "Branding & Identity",

    # Education & Research
    "K-12 Teaching",
    "EdTech Product Design",
    "Academic Research",
    "Curriculum Development",
    "Online Tutoring",

    # Business & Management
    "Project Management",
    "Product Management",
    "Operations Management",
    "Strategy & Consulting",
    "Business Analysis",

    # Marketing & Communications
    "Content Marketing",
    "Digital Advertising / SEO",
    "Brand Management",
    "Social Media Strategy",
    "Corporate Communications",

    # Finance & Accounting
    "Investment Analysis",
    "Accounting & Bookkeeping",
    "Financial Planning & Analysis",
    "Risk Management",
    "Crypto & DeFi Finance",

    # Sales & Customer Success
    "B2B Tech Sales",
    "Retail Sales",
    "Customer Success",
    "Sales Engineering",
    "Inside Sales / SDR",

    # Healthcare & Life Sciences
    "Nursing & Clinical Support",
    "Healthcare Administration",
    "Medical Device QA",
    "Clinical Research",
    "Mental Health Support",

    # Skilled Trades & Logistics
    "HVAC / Plumbing / Electrical",
    "Warehouse Operations",
    "Facility Maintenance",
    "Dispatch & Logistics",
    "Fire & Building Safety",

    # Emerging / Interdisciplinary Roles
    "AI + Art",
    "ClimateTech Product Management",
    "Bioinformatics",
    "AI Ethics Consulting",
    "Mixed Reality Production",
    "Voice UX Design",
    "Digital Twin Modeling"
]


# Injected llm_pipeline = pipeline(...) from main script

# -------------------- HELPERS --------------------
seen_responses = set()
seen_prompts = set()
per_theme_counter = {theme: 0 for theme in theme_pool}


def build_few_shot_prompt(few_shot_examples, theme):
    prompt_parts = ["Below are examples of job description (JD) tasks:\n"]
    for ex in few_shot_examples:
        prompt_parts.append(
            f"Instruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\n"
            f"Response: {ex['response']}\n"
        )
    prompt_parts.append(
        f"Now generate **only one** new JD task following the same format.\n"
        f"Theme: {theme}\n"
        "⚠️ Make sure the instruction, input and task framing are DIFFERENT from all the above.\n"
        "Avoid using similar instructions or inputs.\n"
        "Use a different angle, structure, or job subtype within the theme.\n"
        "Creativity is encouraged. Do NOT reuse any of the instruction/input wording.\n"
        "Start immediately below:\n\n"
        "Instruction:\n"
        "Input:\n"
        "Response:\n"
        "\n(Note: Avoid repeating any topic or phrasing from the examples above. Use a fresh task and setting.)"
    )
    return "\n".join(prompt_parts)


def is_duplicate(new_entry):
    prompt_key = (new_entry['instruction'], new_entry['input'])
    response_hash = hash(new_entry['response'])
    if prompt_key in seen_prompts:
        print("⚠️ Skipped duplicate JD based on instruction+input.")
        return True
    if response_hash in seen_responses:
        print("⚠️ Skipped duplicate JD based on response hash.")
        return True
    seen_prompts.add(prompt_key)
    seen_responses.add(response_hash)
    return False

# -------------------- GENERATION LOOP --------------------
generated_data = []
theme_index = 0

# keep looping until all themes have enough samples
theme_quota_met = lambda: all(v >= SAMPLES_PER_THEME for v in per_theme_counter.values())

while not theme_quota_met():
    batch_prompts = []
    theme_batch = []
    for _ in range(BATCH_SIZE):
        # pick a theme that still needs samples
        for _ in range(len(theme_pool)):
            theme = theme_pool[theme_index % len(theme_pool)]
            theme_index += 1
            if per_theme_counter[theme] < SAMPLES_PER_THEME:
                break
        few_shots = random.sample(few_shot_pool, FEW_SHOT_K)
        prompt = build_few_shot_prompt(few_shots, theme)
        batch_prompts.append(prompt)
        theme_batch.append(theme)

    start_time = time.time()
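    batch_outputs = llm_pipeline(
        batch_prompts,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=0.9,
        top_p=0.95,
        do_sample=True,
        batch_size=BATCH_SIZE,
        # Return only the continuation. By default `generated_text` also contains the
        # prompt, so the parser below would pick up the first few-shot example instead.
        return_full_text=False
    )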
    end_time = time.time()
    print(f"🔹 Batch took {end_time - start_time:.2f} seconds")

    for response, theme in zip(batch_outputs, theme_batch):
        full_text = response[0]["generated_text"]
        try:
            first_block = full_text.split("Instruction:", 1)[1]
            inst_part, rest = first_block.split("\nInput:", 1)
            input_part, resp_part = rest.split("\nResponse:", 1)
            resp_part = resp_part.split("\nInstruction:")[0]

            new_entry = {
                "instruction": inst_part.strip(),
                "input": input_part.strip(),
                "response": resp_part.strip(),
            }

            if not is_duplicate(new_entry):
                generated_data.append(new_entry)
                few_shot_pool.append(new_entry)
                per_theme_counter[theme] += 1

                if len(generated_data) % 10 == 0:
                    print(f"\n🔹 {len(generated_data)} samples so far.")

        except Exception as e:
            print("⚠️ Skipped malformed output:", full_text[:100])

# -------------------- SAVE OUTPUT --------------------
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, indent=4)

print(f"✅ All done! Generated {len(generated_data)} samples ({SAMPLES_PER_THEME} per theme) → Saved to {OUTPUT_JSON}")
# This run was interrupted manually because it was taking too long.

Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🔹 Batch took 50.62 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.

🔹 10 samples so far.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.


Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🔹 Batch took 50.53 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.

🔹 20 samples so far.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.


KeyboardInterrupt: 

In [None]:
# ✅ Enhanced Synthetic JD Generator with Expanded Prompt Diversity + Stop Sequence + Monitoring
import json
import random
import time
from tqdm import tqdm
from transformers import pipeline, StoppingCriteria, StoppingCriteriaList
import torch

# -------------------- CONFIG --------------------
TASK_ID = 1
BATCH_SIZE = 20
NUM_SAMPLES = 1000
FEW_SHOT_K = 3
MAX_NEW_TOKENS = 256
SEED_FILE = "normalized_jd_seed_data.json"
OUTPUT_JSON = f"jd_part_{TASK_ID}.json"

# -------------------- SETUP --------------------
random.seed(1000 + TASK_ID)

with open(SEED_FILE, "r", encoding="utf-8") as f:
    seed_data = json.load(f)
assert len(seed_data) >= FEW_SHOT_K, "\u26a0\ufe0f Not enough seed examples for few-shot."

few_shot_pool = seed_data.copy()

# -------------------- PROMPT TEMPLATE POOL --------------------
prompt_templates = [
    # Creative Style Prompt
    """Create a fresh, imaginative JD generation task that stands out.

{few_shots}
Instructions for this task:
- Be unconventional and surprising.
- Use storytelling, hypotheticals, or scenario-based task prompts.
- Focus on creativity over corporate tone.
- Explore unusual job intersections (e.g., blockchain + art).

Example format:
Instruction:
Input:
Response:""",

    # Analytical Comparison Prompt
    """Generate a job-description-related task that compares or analyzes job types.

{few_shots}
Instructions for this task:
- Use comparative or evaluative language.
- The instruction can ask the model to contrast roles, job ads, or applicant profiles.
- Target HR specialists or hiring managers.

Example format:
Instruction:
Input:
Response:""",

    # Enhanced Diversity Template
    """Below are example tasks for generating Job Descriptions (JDs):

{few_shots}
Now generate **one unique Job Description (JD)** instruction-following task in the format below.

🎯 Theme: {theme}
🧠 Diversity Rules (MUST follow):
1. Do NOT reuse ANY phrasing or wording from previous instructions.
2. Do NOT repeat the same job titles or input formats.
3. Use different writing styles (e.g., question, command, hypothetical scenario).
4. Vary the task difficulty (easy/hard), target audience, or industry focus.
5. Think of uncommon job subcategories or emerging roles in the theme.

🚩 Example of BAD (REJECTED) instruction:
\"Write a job description for a data analyst.\" ← Too generic, overused.

✅ Example of GOOD instruction:
\"Given a startup launching a fitness AI app, generate a creative job description for a hybrid role combining UX design and behavioral psychology.\"

📜 Output Format:
Instruction:
Input:
Response:

Note: This task must be distinct in both structure and content from all prior examples."""
]

# -------------------- ENHANCED SUBTYPE CONFIG --------------------
job_levels = [
    "Internship", "Entry-level", "Junior", "Mid-level", "Senior",
    "Lead", "Principal", "Staff-level", "Director-level"
]

job_modes = [
    "Remote", "On-site", "Hybrid", "Remote-first", "Client-facing"
]

job_types = [
    "Full-time", "Part-time", "Contract", "Freelance", "Temporary"
]

job_industries = [
    "Fintech", "Healthcare", "E-commerce", "Gaming", "Cybersecurity",
    "Education", "Media & Entertainment", "Travel & Hospitality", "Real Estate", "Blockchain",
    "Green Energy", "Logistics", "Retail", "Manufacturing", "Biotechnology",
    "Social Media", "Automotive", "Agritech", "Edtech", "AI Research",
    "Robotics", "Insurance", "Legal Tech", "Fashion", "Food Delivery",
    "Cloud Infrastructure", "IoT", "Space Tech", "Telecommunications", "Smart Home Tech",
    "Supply Chain", "Digital Health", "3D Printing", "Quantum Computing", "Augmented Reality",
    "DevOps & Tooling", "Open-source Platforms", "Digital Marketing", "Online Education"
]

job_specials = [
    "Startup environment", "Government project", "NGO role",
    "AI research lab", "Open-source focused team", "Early-stage startup",
    "High-frequency trading firm", "Data consultancy", "Cross-functional product team"
]

styles = ["Formal", "Creative", "Concise", "Technical"]
skill_focuses = ["Python", "SQL", "Machine Learning", "NLP", "Data Visualization"]
themes = ["Data Science", "Software Engineering"]


def random_job_subtype():
    parts = [
        random.choice(job_levels),
        random.choice(job_types),
        random.choice(job_modes),
        f"{random.choice(job_industries)} sector"
    ]
    if random.random() < 0.5:
        parts.append(random.choice(job_specials))
    return ", ".join(parts)

# -------------------- CUSTOM STOPPING CRITERIA --------------------
class StopOnSubsequence(StoppingCriteria):
    """Stop generation once the most recent tokens match `stop_sequence`."""

    def __init__(self, stop_sequence, tokenizer):
        self.stop_ids = tokenizer.encode(stop_sequence, add_special_tokens=False)

    def __call__(self, input_ids, scores, **kwargs):
        # Only the first sequence in the batch is checked.
        if len(input_ids[0]) < len(self.stop_ids):
            return False
        return input_ids[0][-len(self.stop_ids):].tolist() == self.stop_ids

stop_sequence = "\nInstruction:"
stopping_criteria = StoppingCriteriaList([StopOnSubsequence(stop_sequence, tokenizer)])

# -------------------- HELPERS --------------------
seen_responses = set()
seen_prompts = set()

def build_few_shot_prompt(few_shot_examples):
    theme = random.choice(themes)
    subtype = random_job_subtype()
    style = random.choice(styles)
    skill = random.choice(skill_focuses)

    few_shot_parts = []
    for ex in few_shot_examples:
        few_shot_parts.append(
            f"Instruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\n"
            f"Response: {ex['response']}\n"
        )
    few_shots = "\n".join(few_shot_parts)

    template = random.choice(prompt_templates)
    return template.format(
        few_shots=few_shots,
        theme=theme,
        subtype=subtype,
        style=style,
        skill=skill
    )

def is_duplicate(new_entry):
    prompt_key = (new_entry['instruction'], new_entry['input'])
    response_hash = hash(new_entry['response'])

    if prompt_key in seen_prompts:
        print("\u26a0\ufe0f Skipped duplicate JD based on instruction+input.")
        return True
    if response_hash in seen_responses:
        print("\u26a0\ufe0f Skipped duplicate JD based on response hash.")
        return True

    seen_prompts.add(prompt_key)
    seen_responses.add(response_hash)
    return False

# -------------------- GENERATION LOOP --------------------
generated_data = []
total_tokens_generated = 0

total_batches = range(0, NUM_SAMPLES, BATCH_SIZE)
for i in tqdm(total_batches, desc=f"\U0001f9e0 Task {TASK_ID}: Generating", dynamic_ncols=True, mininterval=1.0):
    batch_prompts = []
    for _ in range(BATCH_SIZE):
        few_shots = random.sample(few_shot_pool, FEW_SHOT_K)
        prompt = build_few_shot_prompt(few_shots)
        batch_prompts.append(prompt)

    start_time = time.time()
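    batch_outputs = llm_pipeline(
        batch_prompts,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=0.9,
        top_p=0.95,
        do_sample=True,
        stopping_criteria=stopping_criteria,
        # Return only the continuation. By default `generated_text` also contains the
        # prompt, so the parser below would pick up the first few-shot example instead.
        return_full_text=False
    )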
    end_time = time.time()

    print(f"\U0001f539 Task {TASK_ID} | Batch {i // BATCH_SIZE + 1} took {end_time - start_time:.2f} seconds")

    for response in batch_outputs:
        full_text = response[0]["generated_text"]

        try:
            first_block = full_text.split("Instruction:", 1)[1]
            inst_part, rest = first_block.split("\nInput:", 1)
            input_part, resp_part = rest.split("\nResponse:", 1)
            resp_part = resp_part.split("\nInstruction:")[0]

            new_entry = {
                "instruction": inst_part.strip(),
                "input": input_part.strip(),
                "response": resp_part.strip()
            }

            if not is_duplicate(new_entry):
                generated_data.append(new_entry)
                few_shot_pool.append(new_entry)

                # Token length monitoring
                token_len = len(tokenizer.tokenize(resp_part))
                total_tokens_generated += token_len

                if len(generated_data) % 10 == 0:
                    print("\n\U0001f539 Sample Output:")
                    print(json.dumps(new_entry, indent=2))

        except Exception as e:
            print("\u26a0\ufe0f Skipped malformed output:", full_text[:100])

    avg_tokens = total_tokens_generated / max(1, len(generated_data))
    print(f"\U0001f4ca Avg response length so far: {avg_tokens:.2f} tokens")

# -------------------- SAVE OUTPUT --------------------
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, indent=4)

print(f"\u2705 Task {TASK_ID} done! Generated {len(generated_data)} samples \u2192 Saved to {OUTPUT_JSON}")


🧠 Task 1: Generating:  10%|█         | 1/10 [03:12<28:49, 192.20s/it]

🔹 Task 1 | Batch 1 took 192.19 seconds
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 459.67 tokens


🧠 Task 1: Generating:  20%|██        | 2/10 [06:24<25:36, 192.03s/it]

🔹 Task 1 | Batch 2 took 191.89 seconds

🔹 Sample Output:
{
  "instruction": "External Mortgage Loan Officer",
  "input": "Outside sales role at Orlando Credit Union focused on networking, cross-selling, and growing the loan portfolio.",
  "response": "{'job_title': 'External Mortgage Loan Officer', 'job_summary': \"The External Mortgage Loan Officer works with branch teams and members via phone, in-person, email, and mail to provide a positive member experience. This role is an outside sales and business development position, requiring attendance at networking events and functions to build referral relationships and promote Orlando Credit Union's lending products.\", 'responsibilities': ['Provide personalized service to credit union members and attract new members.', 'Attend external events and networking functions to develop and maintain a quality referral network (including real estate professionals, builders, and other contacts).', 'Analyze loan applications by reviewing applicant f

🧠 Task 1: Generating:  30%|███       | 3/10 [09:35<22:21, 191.59s/it]

🔹 Task 1 | Batch 3 took 191.05 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 448.53 tokens


🧠 Task 1: Generating:  40%|████      | 4/10 [12:46<19:08, 191.42s/it]

🔹 Task 1 | Batch 4 took 191.15 seconds
⚠️ Skipped duplicate JD based on instruction+input.

🔹 Sample Output:
{
  "instruction": "Lead Installer",
  "input": "Oversee and execute home theater system installations with strong technical expertise and leadership.",
  "response": "{'job_title': 'Lead Installer', 'job_summary': 'BWE Home Theater is seeking a skilled Lead Installer with a passion for technology and a keen eye for detail. The role involves overseeing the installation of custom home theater systems, ensuring quality workmanship, and delivering exceptional customer service across various projects.', 'responsibilities': ['Lead and supervise installation teams to ensure efficient project completion and high-quality standards.', 'Perform site assessments and collaborate with clients to understand their requirements and preferences.', 'Plan and organize installation schedules, resources, and materials to meet project deadlines.', 'Install, configure, and calibrate audiovisual equipm

🧠 Task 1: Generating:  50%|█████     | 5/10 [15:57<15:57, 191.49s/it]

🔹 Task 1 | Batch 5 took 191.61 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 462.36 tokens


🧠 Task 1: Generating:  60%|██████    | 6/10 [19:09<12:46, 191.60s/it]

🔹 Task 1 | Batch 6 took 191.80 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 469.96 tokens


🧠 Task 1: Generating:  70%|███████   | 7/10 [22:21<09:35, 191.73s/it]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


🔹 Task 1 | Batch 7 took 191.99 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 473.24 tokens


🧠 Task 1: Generating:  80%|████████  | 8/10 [25:34<06:23, 191.97s/it]

🔹 Task 1 | Batch 8 took 192.50 seconds

🔹 Sample Output:
{
  "instruction": "Anaplan Developer",
  "input": "Remote full-time role leading multiple Anaplan projects for Fortune 500 customers.",
  "response": "{'job_title': 'Anaplan Developer', 'job_summary': 'Senior Consultant role in Anaplan development, leading projects and model-building for enterprise clients.', 'responsibilities': ['Develop and implement Anaplan models.', 'Translate client business requirements into scalable Anaplan solutions.', 'Lead proof-of-concept demonstrations.', 'Manage project timelines and development cycles.', 'Guide and mentor team members.', 'Ensure integration with ERP, CRM, and APS systems.'], 'qualifications': ['Bachelor\u2019s degree in Computer Science, Mathematics, or Statistics.', 'Experience in Anaplan model building and implementation.', 'Advanced Excel and multi-dimensional modeling skills.', 'Understanding of financial planning, EPM, or S&OP.'], 'how_to_apply': 'Submit application via compan

🧠 Task 1: Generating:  90%|█████████ | 9/10 [28:46<03:12, 192.00s/it]

🔹 Task 1 | Batch 9 took 192.05 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 455.94 tokens


🧠 Task 1: Generating: 100%|██████████| 10/10 [31:57<00:00, 191.75s/it]

🔹 Task 1 | Batch 10 took 191.24 seconds
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
⚠️ Skipped duplicate JD based on instruction+input.
📊 Avg response length so far: 459.05 tokens
✅ Task 1 done! Generated 38 samples → Saved to jd_part_1.json





In [None]:
# ✅ Synthetic JD Generator - 3-shot Prompt + Mistral + Topic-aware Prompting
import json
import random
import time
from tqdm import tqdm
from transformers import pipeline

# -------------------- CONFIG --------------------
TASK_ID = 1
BATCH_SIZE = 32
NUM_SAMPLES = 64
FEW_SHOT_K = 3
MAX_NEW_TOKENS = 256
SEED_FILE = "/content/jd_seed_data.json"
OUTPUT_JSON = f"/content/jd_part_{TASK_ID}.json"
GROUP_BY_TOPIC = False  # Set to True to use same-topic few-shot examples

# -------------------- SETUP --------------------
random.seed(1000 + TASK_ID)

with open(SEED_FILE, "r", encoding="utf-8") as f:
    seed_data = json.load(f)

assert len(seed_data) >= FEW_SHOT_K, "⚠️ Not enough seed examples for few-shot."

few_shot_pool = seed_data.copy()

# Injected externally
# llm_pipeline = pipeline(...)

# -------------------- HELPERS --------------------
seen_responses = set()
seen_prompts = set()

def build_few_shot_prompt(few_shot_examples):
    prompt_parts = ["Below are 3 example Job Description (JD) generation tasks:\n"]
    for idx, ex in enumerate(few_shot_examples, 1):
        prompt_parts.append(
            f"### Example {idx}\n"
            f"Instruction:\n{ex['instruction']}\n"
            f"Input:\n{ex['input']}\n"
            f"Response:\n{json.dumps(ex['response'], indent=2)}\n"
        )

    # Enhanced meta-guidance
    prompt_parts.append("""
Now generate **one unique Job Description (JD)** instruction-following task in the format below.

🎯 Theme: Any emerging or existing job domain
🧠 Diversity Rules (MUST follow):
1. Do NOT reuse ANY phrasing or wording from previous instructions.
2. Do NOT repeat the same job titles or input formats.
3. Use different writing styles (e.g., question, command, hypothetical scenario).
4. Vary the task difficulty (easy/hard), target audience, or industry focus.
5. Think of uncommon job subcategories or emerging roles in the theme.

🛑 Example of BAD (REJECTED) instruction:
"Write a job description for a data analyst." ← Too generic, overused.

✅ Example of GOOD instruction:
"Given a startup launching a fitness AI app, generate a creative job description for a hybrid role combining UX design and behavioral psychology."

📝 Output Format:
Instruction:
Input:
Response:

Note: This task must be distinct in both structure and content from all prior examples.
""")
    return "\n".join(prompt_parts)


def is_duplicate(new_entry):
    prompt_key = (new_entry['instruction'], new_entry['input'])
    response_hash = hash(new_entry['response'])

    if prompt_key in seen_prompts:
        print("⚠️ Skipped duplicate JD based on instruction+input.")
        return True
    if response_hash in seen_responses:
        print("⚠️ Skipped duplicate JD based on response hash.")
        return True

    seen_prompts.add(prompt_key)
    seen_responses.add(response_hash)
    return False

def sample_few_shots(pool, k=3):
    if not GROUP_BY_TOPIC:
        return random.sample(pool, k)
    # Simple topic grouping based on first keyword in instruction
    topic_to_items = {}
    for item in pool:
        topic = item['instruction'].split()[0]
        topic_to_items.setdefault(topic, []).append(item)
    candidates = [v for v in topic_to_items.values() if len(v) >= k]
    if not candidates:
        return random.sample(pool, k)
    return random.sample(random.choice(candidates), k)

# -------------------- GENERATION LOOP --------------------
generated_data = []

total_batches = range(0, NUM_SAMPLES, BATCH_SIZE)
for i in tqdm(total_batches, desc=f"🧠 Task {TASK_ID}: Generating", dynamic_ncols=True, mininterval=1.0):
    batch_prompts = []
    for _ in range(BATCH_SIZE):
        few_shots = sample_few_shots(few_shot_pool, FEW_SHOT_K)
        prompt = build_few_shot_prompt(few_shots)
        batch_prompts.append(prompt)

    print("\n📝 Sample Prompt Preview:")
    print("\n".join(batch_prompts[0].splitlines()[:20]))  # print first 20 lines of first prompt

    start_time = time.time()
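    batch_outputs = llm_pipeline(
        batch_prompts,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=1.1,
        top_p=0.9,
        do_sample=True,
        # Return only the completion; otherwise the parser below would match the
        # first "Instruction:" inside the few-shot examples of the prompt.
        return_full_text=False
    )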
    end_time = time.time()
    print(f"🔹 Task {TASK_ID} | Batch {i // BATCH_SIZE + 1} took {end_time - start_time:.2f} seconds")

    for response in batch_outputs:
        full_text = response[0]["generated_text"]
        try:
            first_block = full_text.split("Instruction:", 1)[1]
            inst_part, rest = first_block.split("\nInput:", 1)
            input_part, resp_part = rest.split("\nResponse:", 1)
            resp_part = resp_part.split("\nInstruction:")[0]

            new_entry = {
                "instruction": inst_part.strip(),
                "input": input_part.strip(),
                "response": resp_part.strip()
            }

            if not is_duplicate(new_entry):
                generated_data.append(new_entry)
                few_shot_pool.append(new_entry)

                if len(generated_data) % 5 == 0:
                    print("\n🔹 Sample Output:")
                    print(json.dumps(new_entry, indent=2))

        except Exception:
            print("⚠️ Skipped malformed output:", full_text[:100])

# -------------------- SAVE OUTPUT --------------------
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, indent=2)

print(f"✅ Task {TASK_ID} done! Generated {len(generated_data)} samples → Saved to {OUTPUT_JSON}")

🧠 Task 1: Generating:   0%|          | 0/2 [00:00<?, ?it/s]


📝 Sample Prompt Preview:
Below are 3 example Job Description (JD) generation tasks:

### Example 1
Instruction:
Paralegal - Corporate Law
Input:
Support role in corporate law department, handling contract review, legal research, and compliance documentation.
Response:
{
  "job_title": "Paralegal - Corporate Law",
  "job_summary": "Our corporate law team is seeking an experienced Paralegal to support attorneys in contract drafting, legal research, and document management for compliance and governance matters.",
  "responsibilities": [
    "Review, draft, and organize contracts and legal documents.",
    "Conduct legal research on corporate law and regulatory requirements.",
    "Assist in the preparation of board meeting materials and minutes.",
    "Coordinate filings with state and federal agencies.",
    "Maintain legal databases and case tracking systems."
  ],
  "qualifications": [
    "Associate or Bachelor's degree in Legal Studies or related field.",


🧠 Task 1: Generating:  50%|█████     | 1/2 [20:32<20:32, 1232.10s/it]

🔹 Task 1 | Batch 1 took 1232.10 seconds
⚠️ Skipped duplicate JD based on instruction+input.

🔹 Sample Output:
{
  "instruction": "Architectural Designer",
  "input": "Full-time on-site role in Fort Myers, FL for designing custom high-end homes, including new model homes and client projects, with strong CAD skills and permitting knowledge.",
  "response": "{\n  \"job_title\": \"Architectural Designer\",\n  \"job_summary\": \"Florida Lifestyle Homes, a leading custom high-end home building company in Southwest Florida, is seeking a talented Architectural Designer to join our dynamic team. You will design and develop luxury residences and new model homes while collaborating with clients and internal teams, ensuring compliance with local permitting processes.\",\n  \"responsibilities\": [\n    \"Collaborate with clients and internal teams to understand project requirements and design preferences.\",\n    \"Create detailed CAD drawings and architectural construction plans for custom high-en

🧠 Task 1: Generating: 100%|██████████| 2/2 [41:03<00:00, 1231.89s/it]

🔹 Task 1 | Batch 2 took 1231.66 seconds
⚠️ Skipped duplicate JD based on instruction+input.

🔹 Sample Output:
{
  "instruction": "External Mortgage Loan Officer",
  "input": "Outside sales role at Orlando Credit Union focused on networking, cross-selling, and growing the loan portfolio.",
  "response": "{\n  \"job_title\": \"External Mortgage Loan Officer\",\n  \"job_summary\": \"The External Mortgage Loan Officer works with branch teams and members via phone, in-person, email, and mail to provide a positive member experience. This role is an outside sales and business development position, requiring attendance at networking events and functions to build referral relationships and promote Orlando Credit Union's lending products.\",\n  \"responsibilities\": [\n    \"Provide personalized service to credit union members and attract new members.\",\n    \"Attend external events and networking functions to develop and maintain a quality referral network (including real estate professionals,




In [None]:
# ✅ Enhanced Synthetic JD Generator with Expanded Prompt Diversity + Stop Sequence + Monitoring
import json
import random
import time
from tqdm import tqdm
from transformers import pipeline, StoppingCriteria, StoppingCriteriaList
import torch

# -------------------- CONFIG --------------------
TASK_ID = 6
BATCH_SIZE = 32
NUM_SAMPLES = 200
FEW_SHOT_K = 3
MAX_NEW_TOKENS = 256
SEED_FILE = "jd_seed_data.json"
OUTPUT_JSON = f"jd_part_{TASK_ID}.json"

# -------------------- SETUP --------------------
random.seed(1000 + TASK_ID)

with open(SEED_FILE, "r", encoding="utf-8") as f:
    seed_data = json.load(f)
assert len(seed_data) >= FEW_SHOT_K, "\u26a0\ufe0f Not enough seed examples for few-shot."

few_shot_pool = seed_data.copy()

# -------------------- PROMPT TEMPLATE POOL --------------------
prompt_templates = [
    # Template 1
    """Below are examples of job description (JD) tasks:\n
{few_shots}
Now generate **only one** new JD task following the same format.
Theme: {theme}
Job Subtype: {subtype}
Preferred Style: {style}
Skill Focus: {skill}
\u26a0\ufe0f Make sure the instruction, input and task framing are DIFFERENT from all the above.
Avoid using similar instructions or inputs.
Creativity is encouraged. Do NOT reuse any of the instruction/input wording.
Start immediately below:

Instruction:
Input:
Response:""",

    # Template 2
    """You are an expert JD generator. Here are some sample tasks:\n
{few_shots}
Now, create ONE new task for the following setup:
- Theme: {theme}
- Job Subtype: {subtype}
- Writing Style: {style}
- Skill Focus: {skill}
Make sure the task setting, instruction and input are new and varied.
Begin now:

Instruction:
Input:
Response:""",

    # Template 3
    """Observe the few-shot examples below. Then create one **new** job description task.\n
{few_shots}
The new task should reflect:
* Theme: {theme}
* Role Type: {subtype}
* Content Style: {style}
* Focus Skill: {skill}
Be original. Refrain from copying structures or wording from above.
Proceed:

Instruction:
Input:
Response:"""
]

# -------------------- ENHANCED SUBTYPE CONFIG --------------------
job_levels = [
    "Internship", "Entry-level", "Junior", "Mid-level", "Senior",
    "Lead", "Principal", "Staff-level", "Director-level"
]

job_modes = [
    "Remote", "On-site", "Hybrid", "Remote-first", "Client-facing"
]

job_types = [
    "Full-time", "Part-time", "Contract", "Freelance", "Temporary"
]

job_industries = [
    "Fintech", "Healthcare", "E-commerce", "Gaming", "Cybersecurity",
    "Education", "Media & Entertainment", "Travel & Hospitality", "Real Estate", "Blockchain",
    "Green Energy", "Logistics", "Retail", "Manufacturing", "Biotechnology",
    "Social Media", "Automotive", "Agritech", "Edtech", "AI Research",
    "Robotics", "Insurance", "Legal Tech", "Fashion", "Food Delivery",
    "Cloud Infrastructure", "IoT", "Space Tech", "Telecommunications", "Smart Home Tech",
    "Supply Chain", "Digital Health", "3D Printing", "Quantum Computing", "Augmented Reality",
    "DevOps & Tooling", "Open-source Platforms", "Digital Marketing", "Online Education"
]

job_specials = [
    "Startup environment", "Government project", "NGO role",
    "AI research lab", "Open-source focused team", "Early-stage startup",
    "High-frequency trading firm", "Data consultancy", "Cross-functional product team"
]

styles = ["Formal", "Creative", "Concise", "Technical"]
skill_focuses = ["Python", "SQL", "Machine Learning", "NLP", "Data Visualization"]
themes = ["Data Science", "Software Engineering"]


def random_job_subtype():
    parts = [
        random.choice(job_levels),
        random.choice(job_types),
        random.choice(job_modes),
        f"{random.choice(job_industries)} sector"
    ]
    if random.random() < 0.5:
        parts.append(random.choice(job_specials))
    return ", ".join(parts)

# -------------------- CUSTOM STOPPING CRITERIA --------------------
class StopOnSubsequence(StoppingCriteria):
    def __init__(self, stop_sequence, tokenizer):
        self.stop_ids = tokenizer.encode(stop_sequence, add_special_tokens=False)

    def __call__(self, input_ids, scores, **kwargs):
        # NOTE: only the first sequence of the batch (input_ids[0]) is inspected.
        if len(input_ids[0]) < len(self.stop_ids):
            return False
        return input_ids[0][-len(self.stop_ids):].tolist() == self.stop_ids

stop_sequence = "\nInstruction:"
stopping_criteria = StoppingCriteriaList([StopOnSubsequence(stop_sequence, tokenizer)])

# -------------------- HELPERS --------------------
seen_responses = set()
seen_prompts = set()

def build_few_shot_prompt(few_shot_examples):
    theme = random.choice(themes)
    subtype = random_job_subtype()
    style = random.choice(styles)
    skill = random.choice(skill_focuses)

    few_shot_parts = []
    for ex in few_shot_examples:
        few_shot_parts.append(
            f"Instruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\n"
            f"Response: {ex['response']}\n"
        )
    few_shots = "\n".join(few_shot_parts)

    template = random.choice(prompt_templates)
    return template.format(
        few_shots=few_shots,
        theme=theme,
        subtype=subtype,
        style=style,
        skill=skill
    )

def is_duplicate(new_entry):
    prompt_key = (new_entry['instruction'], new_entry['input'])
    response_hash = hash(new_entry['response'])

    if prompt_key in seen_prompts:
        print("\u26a0\ufe0f Skipped duplicate JD based on instruction+input.")
        return True
    if response_hash in seen_responses:
        print("\u26a0\ufe0f Skipped duplicate JD based on response hash.")
        return True

    seen_prompts.add(prompt_key)
    seen_responses.add(response_hash)
    return False

# -------------------- GENERATION LOOP --------------------
generated_data = []
total_tokens_generated = 0

total_batches = range(0, NUM_SAMPLES, BATCH_SIZE)
for i in tqdm(total_batches, desc=f"\U0001f9e0 Task {TASK_ID}: Generating", dynamic_ncols=True, mininterval=1.0):
    batch_prompts = []
    for _ in range(BATCH_SIZE):
        few_shots = random.sample(few_shot_pool, FEW_SHOT_K)
        prompt = build_few_shot_prompt(few_shots)
        batch_prompts.append(prompt)

    start_time = time.time()
    batch_outputs = llm_pipeline(
        batch_prompts,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=1.0,
        top_p=0.95,
        do_sample=True,
        stopping_criteria=stopping_criteria
    )
    end_time = time.time()

    print(f"\U0001f539 Task {TASK_ID} | Batch {i // BATCH_SIZE + 1} took {end_time - start_time:.2f} seconds")

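    for response in batch_outputs:
        # NOTE: by default the pipeline returns the prompt followed by the completion,
        # so the first "Instruction:" matched below belongs to the first few-shot
        # example in the prompt unless the prompt is stripped first.
        full_text = response[0]["generated_text"]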

        try:
            first_block = full_text.split("Instruction:", 1)[1]
            inst_part, rest = first_block.split("\nInput:", 1)
            input_part, resp_part = rest.split("\nResponse:", 1)
            resp_part = resp_part.split("\nInstruction:")[0]

            new_entry = {
                "instruction": inst_part.strip(),
                "input": input_part.strip(),
                "response": resp_part.strip()
            }

            if not is_duplicate(new_entry):
                generated_data.append(new_entry)
                few_shot_pool.append(new_entry)

                # Token length monitoring
                token_len = len(tokenizer.tokenize(resp_part))
                total_tokens_generated += token_len

                if len(generated_data) % 10 == 0:
                    print("\n\U0001f539 Sample Output:")
                    print(json.dumps(new_entry, indent=2))

        except Exception:
            print("\u26a0\ufe0f Skipped malformed output:", full_text[:100])

    avg_tokens = total_tokens_generated / max(1, len(generated_data))
    print(f"\U0001f4ca Avg response length so far: {avg_tokens:.2f} tokens")

# -------------------- SAVE OUTPUT --------------------
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, indent=4)

print(f"\u2705 Task {TASK_ID} done! Generated {len(generated_data)} samples \u2192 Saved to {OUTPUT_JSON}")
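The parsing step in the generation loops splits on the first `Instruction:` marker. A text-generation pipeline returns the prompt followed by the completion unless told otherwise, so that first marker can belong to a few-shot example rather than to newly generated text. The sketch below parses only the continuation. `parse_completion` is a hypothetical helper that is not part of this notebook, and it assumes the model writes out all three markers in its continuation, which the templates above do not guarantee.

```python
def parse_completion(full_text, prompt):
    """Parse one Instruction/Input/Response task from the text generated after `prompt`.

    Hypothetical helper for illustration; returns None when a marker is missing.
    """
    # Drop the prompt so few-shot examples inside it cannot be matched.
    completion = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    try:
        after_inst = completion.split("Instruction:", 1)[1]
        inst_part, rest = after_inst.split("\nInput:", 1)
        input_part, resp_part = rest.split("\nResponse:", 1)
    except (IndexError, ValueError):
        return None
    # Cut off anything the model wrote after starting another task.
    resp_part = resp_part.split("\nInstruction:")[0]
    return {
        "instruction": inst_part.strip(),
        "input": input_part.strip(),
        "response": resp_part.strip(),
    }


# Toy strings standing in for a real prompt and model output.
toy_prompt = "Instruction: Seed task\nInput: seed input\nResponse: seed response\n\nNow write a new task.\n"
toy_generated = "Instruction: New task\nInput: new input\nResponse: new response\nInstruction: extra"
toy_entry = parse_completion(toy_prompt + toy_generated, toy_prompt)
print(toy_entry)
```

With this approach the seed example inside the prompt is never returned as a "new" entry, which should also reduce the number of duplicates skipped on instruction+input.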

# **Synthetic Data Merge and Cleaning**  

The final dataset used for fine-tuning is `jd_all_cleaned.json`.

Merge all `jd_part_*.json` files into a single file:

In [None]:
import os
import json
import glob

# Output filename
output_filename = "jd_all.json"

# Find all part files
file_pattern = "jd_part_*.json"
file_list = sorted(glob.glob(file_pattern))

print(f"🔍 Found {len(file_list)} JD part files:\n")

# Combine all data and count per file
all_data = []
file_counts = {}

for file_path in file_list:
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)
            count = len(data)
            all_data.extend(data)
            file_name = os.path.basename(file_path)
            file_counts[file_name] = count
        except Exception as e:
            print(f"⚠️ Failed to read {file_path}. Error: {e}")

# Print per-file counts
print("📊 JD counts per file:")
for name, count in file_counts.items():
    print(f"  - {name}: {count} entries")

# Final total
print(f"\n📦 Total JD entries combined (no deduplication): {len(all_data)}")

# Save merged data
with open(output_filename, "w", encoding="utf-8") as f:
    json.dump(all_data, f, indent=2, ensure_ascii=False)

print(f"\n✅ Merging complete. Output saved to: {output_filename}")

🔍 Found 20 JD part files:

📊 JD counts per file:
  - jd_part_1.json: 64 entries
  - jd_part_10.json: 64 entries
  - jd_part_11.json: 74 entries
  - jd_part_12.json: 73 entries
  - jd_part_13.json: 33 entries
  - jd_part_14.json: 33 entries
  - jd_part_15.json: 34 entries
  - jd_part_16.json: 16 entries
  - jd_part_17.json: 56 entries
  - jd_part_18.json: 71 entries
  - jd_part_19.json: 35 entries
  - jd_part_2.json: 18 entries
  - jd_part_20.json: 22 entries
  - jd_part_3.json: 39 entries
  - jd_part_4.json: 68 entries
  - jd_part_5.json: 58 entries
  - jd_part_6.json: 33 entries
  - jd_part_7.json: 62 entries
  - jd_part_8.json: 68 entries
  - jd_part_9.json: 66 entries

📦 Total JD entries combined (no deduplication): 987

✅ Merging complete. Output saved to: jd_all.json


Remove exact duplicates: drop every entry whose `response` field is identical to one already seen.

In [None]:
import json

# Load the merged data (written as UTF-8 with ensure_ascii=False above)
with open('jd_all.json', 'r', encoding='utf-8') as f:
    jd_data = json.load(f)

# Deduplicate based on the exact 'response' field
seen_responses = set()
unique_jd_data = []

for item in jd_data:
    resp = item['response']
    if resp not in seen_responses:
        seen_responses.add(resp)
        unique_jd_data.append(item)

# Print stats before and after deduplication
print(f"Original count: {len(jd_data)}")
print(f"Deduplicated count: {len(unique_jd_data)}")

# Save the cleaned data to a new file
with open('jd_all_deduplicated.json', 'w') as f:
    json.dump(unique_jd_data, f, indent=2)


Original count: 987
Deduplicated count: 827
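Exact matching treats two responses as different even when they differ only in letter case or whitespace. The following is a minimal sketch of a slightly more forgiving variant, assuming such differences are not meaningful for this dataset. `normalize` and `dedup_normalized` are hypothetical helpers, not part of the notebook.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_normalized(items, key="response"):
    """Keep the first item for each normalized value of `key`."""
    seen = set()
    unique = []
    for item in items:
        digest = hashlib.sha256(normalize(item[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


# Toy entries: the first two differ only in case and whitespace.
toy_items = [
    {"response": "Senior  Data Engineer\n"},
    {"response": "senior data engineer"},
    {"response": "Paralegal"},
]
toy_unique = dedup_normalized(toy_items)
print(len(toy_unique))
```

Storing a digest rather than the full string keeps the `seen` set small when responses are long.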


Remove near-duplicates: drop every entry whose `response` is at least 95% similar (`SequenceMatcher` ratio) to one already kept.

In [None]:
import json
from difflib import SequenceMatcher
from tqdm import tqdm

# Load deduplicated data
with open('jd_all_deduplicated.json', 'r') as f:
    data = json.load(f)

def is_similar(a, b, threshold=0.95):
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Deduplicate based on high similarity
cleaned_data = []
seen_responses = []

for item in tqdm(data, desc="Checking similarity"):
    response = item['response']
    if not any(is_similar(response, existing) for existing in seen_responses):
        seen_responses.append(response)
        cleaned_data.append(item)

print(f"Original count: {len(data)}")
print(f"After similarity-based deduplication: {len(cleaned_data)}")

# Save the final cleaned version
with open('jd_all_cleaned.json', 'w') as f:
    json.dump(cleaned_data, f, indent=2)


Checking similarity: 100%|██████████| 827/827 [05:22<00:00,  2.56it/s]

Original count: 827
After similarity-based deduplication: 703
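The pairwise comparison above calls `ratio()` for every candidate pair, which took several minutes for 827 entries. `SequenceMatcher` also offers `real_quick_ratio()` and `quick_ratio()`, both documented as upper bounds on `ratio()`, so they can rule out most pairs cheaply. A sketch under that assumption follows; `is_similar_fast` is a hypothetical drop-in for `is_similar`.

```python
from difflib import SequenceMatcher


def is_similar_fast(a, b, threshold=0.95):
    """Same decision as ratio() >= threshold, skipping the expensive call when possible."""
    matcher = SequenceMatcher(None, a, b)
    # Both quick variants are upper bounds on ratio(), so a value below the
    # threshold means the full ratio cannot reach it either.
    if matcher.real_quick_ratio() < threshold:
        return False
    if matcher.quick_ratio() < threshold:
        return False
    return matcher.ratio() >= threshold


# Toy strings: the first pair differs by one character, the second pair is unrelated.
text_a = "Review, draft, and organize contracts."
text_b = "Review, draft, and organise contracts."
text_c = "Totally different text"
print(is_similar_fast(text_a, text_b), is_similar_fast(text_a, text_c))
```

The result is the same as with the original function; only the amount of work per rejected pair changes.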





In [None]:
import json
from sklearn.model_selection import train_test_split

# 1️⃣ Load the cleaned JD data
with open("jd_all_cleaned.json", "r", encoding="utf-8") as f:
    jd_data = json.load(f)

print(f"📦 Loaded {len(jd_data)} JD entries")

# 2️⃣ 80/20 train/test split with sklearn
train_data, test_data = train_test_split(jd_data, test_size=0.2, random_state=42)

print(f"✅ Training set: {len(train_data)} entries")
print(f"✅ Test set: {len(test_data)} entries")

# 3️⃣ (Optional) Save the two splits to separate files
with open("jd_train.json", "w", encoding="utf-8") as f:
    json.dump(train_data, f, indent=2)

with open("jd_test.json", "w", encoding="utf-8") as f:
    json.dump(test_data, f, indent=2)


📦 Loaded 703 JD entries
✅ Training set: 562 entries
✅ Test set: 141 entries
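The split sizes follow from simple arithmetic. The sketch below assumes that scikit-learn rounds a fractional test size up and gives the remainder to the training set, which matches the counts printed above.

```python
import math

n_samples = 703   # entries in jd_all_cleaned.json, as printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size rounded up (assumed)
n_train = n_samples - n_test                # remainder goes to the training set
print(n_train, n_test)
```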


# **Fine Tuning**

In [None]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from huggingface_hub import login
from datasets import load_dataset

from google.colab import userdata  # Colab secrets helper used on the next line
login(userdata.get('HF_TOKEN'))
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"

# Configure 4-bit Quantization (BitsAndBytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and Quantize the Model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
# LoRA Configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

In [None]:
# Load Dataset
dataset = load_dataset("json", data_files={
    "train": "jd_train.json",
    "test": "jd_test.json"
})

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
def preprocess_function(examples):
    prompt_texts = [
        f"### Instruction:\n{inst}\n\n### Input:\n{inp}\n\n### Response:\n{resp}"
        for inst, inp, resp in zip(examples["instruction"], examples["input"], examples["response"])
    ]

    tokenized = tokenizer(
        prompt_texts,
        truncation=True,
        padding="max_length",
        max_length=1024
    )

    tokenized["labels"] = tokenized["input_ids"]
    return tokenized


tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["instruction", "input", "response"])

Map:   0%|          | 0/562 [00:00<?, ? examples/s]

Map:   0%|          | 0/141 [00:00<?, ? examples/s]
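In the preprocessing cell the labels cover the whole sequence, so the prompt tokens contribute to the loss as well as the response tokens. A common alternative is to mask the prompt and padding positions with `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists of made-up token ids; `build_labels` is a hypothetical helper. Note that `DataCollatorForLanguageModeling` with `mlm=False` derives labels from `input_ids`, so a different collator would likely be needed for this masking to take effect.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def build_labels(prompt_ids, response_ids, max_length, pad_id):
    """Return (input_ids, labels) where only response tokens contribute to the loss."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    pad = max_length - len(input_ids)
    return input_ids + [pad_id] * pad, labels + [IGNORE_INDEX] * pad


# Toy ids: three prompt tokens, two response tokens, padded to length 8.
toy_input_ids, toy_labels = build_labels([5, 6, 7], [8, 9], max_length=8, pad_id=0)
print(toy_input_ids)
print(toy_labels)
```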

In [None]:
# Configure TrainingArguments
training_args = TrainingArguments(
    output_dir="./mistral_finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    num_train_epochs=3,
    warmup_steps=20,
    fp16=True,
    lr_scheduler_type="cosine",
    save_total_limit=2,
    report_to="none",
)

# Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Start Training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

# Save LoRA adapter
trainer.save_model("./mistral_finetuned")

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.1334        | 0.125675        |
| 100  | 0.1           | 0.100869        |
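Only two evaluation rows appear because the run is short. A rough estimate of the number of optimizer steps, assuming a single device and ignoring how `Trainer` rounds partial accumulation windows, can be sketched as follows. The values are taken from the training arguments and the split sizes above.

```python
train_examples = 562      # size of jd_train.json
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 3                # num_train_epochs

effective_batch = per_device_batch * grad_accum
approx_steps_per_epoch = train_examples / effective_batch
approx_total_steps = epochs * approx_steps_per_epoch
print(effective_batch, round(approx_steps_per_epoch, 1), round(approx_total_steps))
```

With `eval_steps=50`, a run of roughly this length reaches evaluation only at steps 50 and 100, which is consistent with the table above.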




Download the fine-tuned model (LoRA adapter) as a zip archive

In [None]:
import shutil
from google.colab import files


shutil.make_archive('/content/mistral_finetuned', 'zip', '/content/mistral_finetuned')

files.download('/content/mistral_finetuned.zip')


Upload the zip archive again in a new session

In [None]:
from google.colab import files

uploaded = files.upload()  # Opens a file-upload dialog; select mistral_finetuned.zip

Saving mistral_finetuned.zip to mistral_finetuned.zip


In [None]:
import zipfile

with zipfile.ZipFile("mistral_finetuned.zip", 'r') as zip_ref:
    zip_ref.extractall("mistral_finetuned")

# **Evaluation**

1. Evaluate the baseline model with BLEU, ROUGE, perplexity, BERTScore, structure accuracy, and Distinct-n

In [None]:
!pip install evaluate sacrebleu rouge_score nltk



In [None]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


# Baseline Model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load the model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.pad_token = tokenizer.eos_token


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [None]:
import json

# Evaluate on the first 100 test samples only (adjustable)
with open("jd_test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)[:100]


In [None]:
from tqdm import tqdm

predictions = []
references = []

for example in tqdm(samples):
    instruction = example["instruction"]
    input_text = example["input"]
    reference = example["response"]

    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False  # greedy decoding; temperature/top_p only apply when sampling
    )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the generated response part
    if "### Response:" in decoded:
        pred_response = decoded.split("### Response:")[1].strip()
    else:
        pred_response = decoded

    predictions.append(pred_response)
    references.append(reference)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  8%|▊         | 8/100 [02:09<22:23, 14.61s/it]

In [None]:
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

# BLEU (each hypothesis needs a list of references, each reference a list of tokens)
bleu_score = corpus_bleu([[ref.split()] for ref in references], [pred.split() for pred in predictions])
print(f"🔵 BLEU Score: {bleu_score:.4f}")

# ROUGE (via rouge_scorer from the rouge_score package)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge1, rougeL = 0, 0
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    rouge1 += scores["rouge1"].fmeasure
    rougeL += scores["rougeL"].fmeasure

rouge1 /= len(references)
rougeL /= len(references)
print(f"🔴 ROUGE-1: {rouge1:.4f} | 🔴 ROUGE-L: {rougeL:.4f}")

🔵 BLEU Score: 0.0105
🔴 ROUGE-1: 0.2899 | 🔴 ROUGE-L: 0.1525


In [None]:
import torch.nn.functional as F
import math

def calculate_perplexity(prompt, target_response):
    full_text = prompt + target_response
    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[:, :-1]
        labels = inputs["input_ids"][:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="mean")
    return math.exp(loss.item())

# Average perplexity over the first 20 samples
ppl_scores = []
for i in range(20):
    prompt = f"### Instruction:\n{samples[i]['instruction']}\n\n### Input:\n{samples[i]['input']}\n\n### Response:\n"
    response = samples[i]["response"]
    ppl = calculate_perplexity(prompt, response)
    ppl_scores.append(ppl)

print(f"🟡 Avg Perplexity (20 samples): {sum(ppl_scores) / len(ppl_scores):.2f}")

🟡 Avg Perplexity (20 samples): 4.89
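`calculate_perplexity` averages the loss over every token of `prompt + target_response`, so the reported value mixes how well the model predicts the prompt with how well it predicts the response. Perplexity is the exponential of the mean negative log-likelihood, and restricting the mean to response tokens is a common variant. The sketch below illustrates the difference on made-up token probabilities; `perplexity` is a hypothetical helper and no model is involved.

```python
import math


def perplexity(token_logprobs, mask=None):
    """exp of the mean negative log-likelihood over the tokens selected by `mask`."""
    if mask is None:
        mask = [1] * len(token_logprobs)
    selected = [lp for lp, m in zip(token_logprobs, mask) if m]
    if not selected:
        raise ValueError("mask selects no tokens")
    return math.exp(-sum(selected) / len(selected))


# Toy natural-log probabilities: three prompt tokens followed by two response tokens.
toy_logprobs = [math.log(0.9), math.log(0.8), math.log(0.9), math.log(0.25), math.log(0.25)]
toy_response_mask = [0, 0, 0, 1, 1]

ppl_all_tokens = perplexity(toy_logprobs)
ppl_response_only = perplexity(toy_logprobs, toy_response_mask)
print(round(ppl_all_tokens, 2), round(ppl_response_only, 2))
```

In this toy case the easily predicted prompt tokens pull the all-token perplexity below the response-only value.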


In [None]:
from bert_score import score

P, R, F1 = score(predictions, references, lang="en", verbose=True)
print(f"💚 BERTScore - Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/3 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 2.38 seconds, 41.99 sentences/sec
💚 BERTScore - Precision: 0.8274, Recall: 0.7926, F1: 0.8095


In [None]:
required_fields = [
    "job_title", "job_summary", "responsibilities", "qualifications",
    "benefits", "schedule", "compensation"
]

def check_structure(response_json_str):
    try:
        jd = json.loads(response_json_str)
        return sum(1 for field in required_fields if field in jd) / len(required_fields)
    except Exception:
        return 0.0  # not valid JSON, so structure accuracy is 0

structure_scores = [check_structure(resp) for resp in predictions]
avg_structure_score = sum(structure_scores) / len(structure_scores)
print(f"🏗️ Structure Accuracy (avg fields covered): {avg_structure_score:.4f}")

🏗️ Structure Accuracy (avg fields covered): 0.0000
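A score of exactly 0 means `json.loads` failed on every prediction, or that no required field was found. The notebook does not show why. One plausible cause is extra text around the JSON object, since `json.loads` requires the whole string to be valid JSON. A minimal sketch that pulls the first JSON object out of free text follows. `extract_first_json_object` is a hypothetical helper, and it does not help when the JSON is cut off by the token limit.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


toy_text = (
    'Here is the JD:\n'
    '{"job_title": "Data Analyst", "responsibilities": ["Build reports"]}\n'
    'Let me know if you need changes.'
)
toy_jd = extract_first_json_object(toy_text)
print(toy_jd)
```

The returned dict could then be checked against `required_fields` in the same way as in `check_structure`.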


In [None]:
from collections import Counter

def distinct_n(sentences, n=1):
    total_ngrams = 0
    distinct_ngrams = set()
    for sent in sentences:
        tokens = sent.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total_ngrams += len(ngrams)
        distinct_ngrams.update(ngrams)
    return len(distinct_ngrams) / total_ngrams if total_ngrams > 0 else 0

d1 = distinct_n(predictions, 1)
d2 = distinct_n(predictions, 2)
print(f"🌈 Diversity - Distinct-1: {d1:.4f}, Distinct-2: {d2:.4f}")


🌈 Diversity - Distinct-1: 0.0925, Distinct-2: 0.2874
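Distinct-n is the number of unique n-grams divided by the total number of n-grams over all predictions, so values near 0 indicate heavy repetition. A small worked example on two toy sentences, using the same computation as `distinct_n` above under a different name to avoid clashing with the notebook's variables:

```python
def distinct_n_demo(sentences, n=1):
    """Ratio of unique n-grams to total n-grams across all sentences."""
    total = 0
    unique = set()
    for sent in sentences:
        tokens = sent.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


toy_sentences = ["the cat sat", "the cat ran"]
# Unigrams: 6 in total, 4 unique (the, cat, sat, ran).
toy_d1 = distinct_n_demo(toy_sentences, 1)
# Bigrams: 4 in total, 3 unique ((the, cat), (cat, sat), (cat, ran)).
toy_d2 = distinct_n_demo(toy_sentences, 2)
print(round(toy_d1, 4), round(toy_d2, 4))
```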


In [None]:
print("===== Evaluation Summary =====")
print(f"🔵 BLEU: {bleu_score:.4f}")
print(f"🔴 ROUGE-1: {rouge1:.4f} | ROUGE-L: {rougeL:.4f}")
print(f"🟡 Perplexity: {sum(ppl_scores) / len(ppl_scores):.2f}")
print(f"💚 BERTScore (F1): {F1.mean():.4f}")
print(f"🏗️ Structure Accuracy: {avg_structure_score:.4f}")
print(f"🌈 Diversity - Distinct-1: {d1:.4f} | Distinct-2: {d2:.4f}")

===== Evaluation Summary =====
🔵 BLEU: 0.0105
🔴 ROUGE-1: 0.2899 | ROUGE-L: 0.1525
🟡 Perplexity: 4.89
💚 BERTScore (F1): 0.8095
🏗️ Structure Accuracy: 0.0000
🌈 Diversity - Distinct-1: 0.0925 | Distinct-2: 0.2874


# Fine-Tuned Model

In [None]:
import json
import math
import torch
import torch.nn.functional as F
from tqdm import tqdm
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel, PeftConfig

# ⚙️ Load the tokenizer and the model (base model plus LoRA adapter)
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
ADAPTER_PATH = "./mistral_finetuned"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=bnb_config
)

model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()

# 📂 Load the test set (same subset as the baseline)
with open("jd_test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)[:100]  # adjustable

predictions = []
references = []

print("🚀 Generating predictions from fine-tuned model...")
for example in tqdm(samples):
    instruction = example["instruction"]
    input_text = example["input"]
    reference = example["response"]

    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding: with do_sample=False, temperature and top_p are ignored,
    # so they are not passed here.
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False
    )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "### Response:" in decoded:
        pred_response = decoded.split("### Response:")[1].strip()
    else:
        pred_response = decoded

    predictions.append(pred_response)
    references.append(reference)

# 📊 Compute the corpus-level BLEU score (whitespace tokenization)
bleu = corpus_bleu([[ref.split()] for ref in references], [pred.split() for pred in predictions])
print(f"🔵 BLEU Score: {bleu:.4f}")

# 📊 Compute ROUGE scores (mean F-measure over all samples)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge1, rougeL = 0, 0
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    rouge1 += scores["rouge1"].fmeasure
    rougeL += scores["rougeL"].fmeasure
rouge1 /= len(references)
rougeL /= len(references)
print(f"🔴 ROUGE-1: {rouge1:.4f} | 🔴 ROUGE-L: {rougeL:.4f}")

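# 📉 Compute perplexity (first 20 samples)
def calculate_perplexity(prompt, target_response):
    # Note: the loss below is averaged over every token of prompt + response,
    # not over the response tokens alone.
    full_text = prompt + target_response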
    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[:, :-1]
        labels = inputs["input_ids"][:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="mean")
    return math.exp(loss.item())

print("🧠 Calculating perplexity...")
ppl_scores = []
for i in range(20):  # number of samples is adjustable
    prompt = f"### Instruction:\n{samples[i]['instruction']}\n\n### Input:\n{samples[i]['input']}\n\n### Response:\n"
    response = samples[i]["response"]
    ppl = calculate_perplexity(prompt, response)
    ppl_scores.append(ppl)

print(f"🟡 Avg Perplexity (20 samples): {sum(ppl_scores) / len(ppl_scores):.2f}")


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

🚀 Generating predictions from fine-tuned model...


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 1/100 [00:30<50:23, 30.54s/it]
  2%|▏         | 2/100 [01:01<50:01, 30.63s/it]
  3%|▎         | 3/100 [01:31<49:20, 30.52s/it]
  4%|▍         | 4/100 [02:01<48:37, 30.39s/it]
  5%|▌         | 5/100 [02:32<48:02, 30.34s/it]
  6%|▌         | 6/100 [03:02<47:31, 30.33s/it]
  7%|▋         | 7/100 [03:32<46:55, 30.28s/it]
  8%|▊         | 8/100 [04:02<46:20, 30.22s/it]

🔵 BLEU Score: 0.6443
🔴 ROUGE-1: 0.8265 | 🔴 ROUGE-L: 0.7199
🧠 Calculating perplexity...
🟡 Avg Perplexity (20 samples): 1.11
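
The `calculate_perplexity` function above averages the loss over the whole `prompt + target_response` sequence, so prompt tokens contribute to the reported value. A minimal sketch of the arithmetic, assuming per-token log-probabilities are already available (the `perplexity` helper and the `token_logprobs` values are invented for illustration and do not come from the model), shows how restricting the average to response tokens changes the number:

```python
import math

def perplexity(token_logprobs, start=0):
    """exp of the mean negative log-probability over token_logprobs[start:]."""
    scored = token_logprobs[start:]
    if not scored:
        raise ValueError("no tokens to score")
    return math.exp(-sum(scored) / len(scored))

# Hypothetical natural-log probabilities for a 6-token sequence:
# the first 4 tokens belong to the prompt, the last 2 to the response.
token_logprobs = [-2.0, -1.0, -3.0, -2.0, -0.1, -0.3]
prompt_len = 4

ppl_full = perplexity(token_logprobs)                   # prompt + response
ppl_response = perplexity(token_logprobs, prompt_len)   # response only

print(f"full sequence: {ppl_full:.2f}")
print(f"response only: {ppl_response:.2f}")
```

In the notebook itself, a comparable response-only score could probably be obtained by setting the label positions that belong to the prompt to `-100`, which `F.cross_entropy` skips by default through its `ignore_index` argument. That variant is a suggestion and was not used to produce the numbers reported here.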


In [None]:
from bert_score import score

P, R, F1 = score(predictions, references, lang="en", verbose=True)
print(f"💚 BERTScore - Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/3 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 1.79 seconds, 55.80 sentences/sec
💚 BERTScore - Precision: 0.9519, Recall: 0.9571, F1: 0.9545


In [None]:
required_fields = [
    "job_title", "job_summary", "responsibilities", "qualifications",
    "benefits", "schedule", "compensation"
]

def check_structure(response_json_str):
    try:
        jd = json.loads(response_json_str)
        return sum(1 for field in required_fields if field in jd) / len(required_fields)
    except Exception:
        return 0.0  # not valid JSON: structure accuracy is 0

structure_scores = [check_structure(resp) for resp in predictions]
avg_structure_score = sum(structure_scores) / len(structure_scores)
print(f"🏗️ Structure Accuracy (avg fields covered): {avg_structure_score:.4f}")

🏗️ Structure Accuracy (avg fields covered): 0.0800
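
The average field coverage of 0.0800 is low because `check_structure` returns 0 whenever the full response string is not valid JSON. One possible cause, which this notebook does not confirm, is extra text around an otherwise valid JSON object. The sketch below uses a hypothetical helper, `extract_json_object`, that looks for the first decodable `{...}` object inside a response before counting fields; the sample strings are invented for illustration.

```python
import json

required_fields = [
    "job_title", "job_summary", "responsibilities", "qualifications",
    "benefits", "schedule", "compensation",
]

def extract_json_object(text):
    """Hypothetical helper: return the first decodable JSON object in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not a valid object at this brace; try the next one
        start = text.find("{", start + 1)
    return None

def check_structure_lenient(response):
    """Fraction of required fields present in the first JSON object found."""
    jd = extract_json_object(response)
    if jd is None:
        return 0.0
    return sum(1 for field in required_fields if field in jd) / len(required_fields)

# Invented examples: complete JSON, JSON wrapped in prose, and plain text.
complete = json.dumps({field: "x" for field in required_fields})
wrapped = 'Here is the job description:\n{"job_title": "Data Analyst", "job_summary": "Analyze data."}\nThanks!'
plain = "We are hiring a data analyst."

for name, resp in [("complete", complete), ("wrapped", wrapped), ("plain", plain)]:
    print(f"{name}: {check_structure_lenient(resp):.4f}")
```

Running both the strict and the lenient check over `predictions` would indicate whether the low score comes from missing fields or from formatting around the JSON.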


In [None]:
from collections import Counter

def distinct_n(sentences, n=1):
    total_ngrams = 0
    distinct_ngrams = set()
    for sent in sentences:
        tokens = sent.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total_ngrams += len(ngrams)
        distinct_ngrams.update(ngrams)
    return len(distinct_ngrams) / total_ngrams if total_ngrams > 0 else 0

d1 = distinct_n(predictions, 1)
d2 = distinct_n(predictions, 2)
print(f"🌈 Diversity - Distinct-1: {d1:.4f}, Distinct-2: {d2:.4f}")


🌈 Diversity - Distinct-1: 0.0175, Distinct-2: 0.0252
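
Distinct-n, as computed above, is the number of unique n-grams divided by the total number of n-grams pooled over all predictions, so a set of near-identical outputs scores low. A small worked example, with `distinct_n` re-declared so the block runs on its own and with made-up sentences:

```python
def distinct_n(sentences, n=1):
    """Unique n-grams divided by total n-grams, pooled over all sentences."""
    total_ngrams = 0
    distinct_ngrams = set()
    for sent in sentences:
        tokens = sent.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total_ngrams += len(ngrams)
        distinct_ngrams.update(ngrams)
    return len(distinct_ngrams) / total_ngrams if total_ngrams > 0 else 0

# Invented examples: the same sentence twice vs. two unrelated sentences.
repetitive = ["we offer great benefits", "we offer great benefits"]
varied = ["we offer great benefits", "join our growing team"]

for name, corpus in [("repetitive", repetitive), ("varied", varied)]:
    d1 = distinct_n(corpus, 1)
    d2 = distinct_n(corpus, 2)
    print(f"{name}: Distinct-1={d1:.4f}, Distinct-2={d2:.4f}")
```

Because the ratio is pooled over the whole prediction set, it tends to fall as more outputs with shared vocabulary are added, so two models are best compared on the same number of samples.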


In [None]:
print("===== Evaluation Summary =====")
# The fine-tuned BLEU is stored in `bleu`; `bleu_score` still holds the baseline value.
print(f"🔵 BLEU: {bleu:.4f}")
print(f"🔴 ROUGE-1: {rouge1:.4f} | ROUGE-L: {rougeL:.4f}")
print(f"🟡 Perplexity: {sum(ppl_scores) / len(ppl_scores):.2f}")
print(f"💚 BERTScore (F1): {F1.mean():.4f}")
print(f"🏗️ Structure Accuracy: {avg_structure_score:.4f}")
print(f"🌈 Diversity - Distinct-1: {d1:.4f} | Distinct-2: {d2:.4f}")

===== Evaluation Summary =====
🔵 BLEU: 0.6443
🔴 ROUGE-1: 0.8265 | ROUGE-L: 0.7199
🟡 Perplexity: 1.11
💚 BERTScore (F1): 0.9545
🏗️ Structure Accuracy: 0.0800
🌈 Diversity - Distinct-1: 0.0175 | Distinct-2: 0.0252
