<a href="https://colab.research.google.com/github/AnasEhtisham/FYP/blob/main/LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Step 1: Install Necessary Libraries (Focus on upgrading datasets first)

# Try to get the latest possible version of datasets.
# This might pull in a version with more up-to-date fsspec compatibility.
!pip install -q datasets --upgrade

# Now install the other packages. Pip will attempt to reconcile their dependencies
# with what 'datasets --upgrade' has established (including its fsspec version).
# The --upgrade flag encourages pip to fetch later versions if available and compatible.
!pip install -q transformers accelerate peft bitsandbytes huggingface_hub --upgrade

# Optional: After installation, you can list the installed versions to check them
# Remove the '#' from the line below to run it
# !pip list | grep -E "datasets|fsspec|gcsfs|transformers|huggingface-hub"

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/491.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/193.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.6/193.6 kB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/143.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
# Step 2: Import Libraries and Log in to Hugging Face
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from huggingface_hub import notebook_login

# Log in to Hugging Face Hub to download Llama 2
# You'll need a Hugging Face account and an access token with read permissions.
# Make sure you have accepted the Llama 2 license on Hugging Face.
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [4]:
# Step 3: Prepare Your Sample Data
# Using the example you provided

job_post_1 = """We’re looking for an experienced PHP developer with solid skills in HubSpot, WordPress, and MySQL to help with ongoing tasks on a live project.

This job will include:

Fixing existing issues and bugs
Integrating 3rd-party APIs with HubSpot and WordPress
Customizing PHP scripts and WordPress plugins
Working with MySQL for data handling and optimization
Coordinating with our internal team and following best practices

Requirements:
✅ Proven experience with HubSpot CMS and API
✅ Strong PHP and WordPress skills
✅ Experience integrating APIs (RESTful, HubSpot, etc.)
✅ Solid understanding of MySQL and database troubleshooting
✅ Familiarity with version control (e.g., Git)
✅ Good communication

This is a long-term, ongoing project, so we’re looking to build a relationship with a reliable and skilled developer.

When applying, please share:

A few relevant projects you’ve worked on
Your experience with HubSpot + WordPress API integration"""

proposal_1 = """Hey,

My name is Imran and I'm a WordPress developer with over 7 years of experience developing custom WP themes and plugin.

I have gone through the job and I understand the requirement that you need someone who knows how to integrate the hugspot API's with WordPress, can you please clarify if you're feeding data into Hubspot or is this integration be needed for some other purpose? I'm happy to assist for both scenarios.

I have good understanding of WordPress databases specially when it comes to optimizing MYSQL queries and managing requests generated from WP.

I'm good at GIT and understand the operations for version controlling.

I'm good at PHP and WordPress custom development, integrating API's and managing servers.

Here are some custom websites that I created from scratch for enterprise level clients (utilizing Hubspot through API's and Forms)https://www.advarra.com/ [Forms added to this website have special operations at the backend, also the Events are pulled through API's]https://www.coolblue.com/ [Hubspot forms and API integrations]https://integral.com [API integrations and custom work at the backend]

I'm happy to jump on a call and discuss my projects with you on a call - I'm available between 9am EST to 5pm EST [Monday - Friday].

Thanks,"""

# For fine-tuning, we need to structure this into a prompt and response format.
# Llama 2 uses a specific instruction format.
# We'll create a 'text' field that includes both the instruction and the desired output.

data = [
    {
        "job_description": job_post_1,
        "proposal": proposal_1
    }
    # Add more dictionaries here as you create more data
    # e.g., {"job_description": job_post_2, "proposal": proposal_2},
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)

# --- THIS IS WHERE YOU WILL EXPAND YOUR DATASET ---
# As you create more (job_post, proposal) pairs, you'll add them to the 'data' list above.
# For a real fine-tuning run, you would load this from a CSV or JSON file.
# Example of loading from CSV if you prepare it:
# from datasets import load_dataset
# dataset = load_dataset("csv", data_files={"train": "your_data.csv"})["train"]
# Make sure your CSV has 'job_description' and 'proposal' columns.

In [5]:
# Step 4: Define Model ID and Quantization Configuration

model_id = "meta-llama/Llama-2-7b-hf" # Using Llama 2 7B

# BitsAndBytesConfig for 4-bit quantization
# This significantly reduces memory usage, crucial for Colab free tier
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Recommended quantization type
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for computation
    bnb_4bit_use_double_quant=True,     # Optional, can improve quality slightly
)

In [6]:
# Step 5: Load Tokenizer and Model

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Llama 2 typically doesn't have a pad token, so we set it to eos_token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Ensure padding is on the right

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto", # Automatically maps model layers to available devices (GPU/CPU)
    trust_remote_code=True
)

# Prepare model for k-bit training (important for LoRA + quantization)
model = prepare_model_for_kbit_training(model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [8]:
# Step 6: Preprocess Data - Format and Tokenize
print("--- Step 6: Preprocessing Data ---")

# Ensure that 'dataset' (Hugging Face Dataset object from Step 3)
# and 'tokenizer' (loaded in Step 5) are defined before running this cell.

def format_and_tokenize(example):
    # Llama 2 instruction fine-tuning format
    # Adjust max_length based on your typical proposal length and Colab limits
    # Max sequence length for Llama 2 is 4096, but shorter is better for Colab free tier
    max_length = 1024 # Start with a reasonable length, e.g., 512, 1024, 2048

    prompt = f"""<s>[INST] Based on the following job description, write a compelling freelance proposal:

Job Description:
{example['job_description']} [/INST]
{example['proposal']} </s>""" # <s> and </s> are BOS/EOS tokens, [INST] marks instructions

    # Tokenize
    tokenized_inputs = tokenizer(
        prompt,
        max_length=max_length,
        padding="max_length", # Pad to max_length
        truncation=True,      # Truncate if longer than max_length
        return_tensors="pt"   # Return PyTorch tensors
    )
    # For CausalLM, labels are usually the same as input_ids. The model learns to predict the next token.
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].clone()
    return tokenized_inputs

# Apply the formatting and tokenization to the dataset
# batched=False processes one example at a time.
tokenized_dataset = dataset.map(format_and_tokenize, batched=False)
print("Dataset formatted and tokenized.")

# Display a sample of tokenized data for verification
print("\nSample of tokenized data (raw input_ids tensor from the first example):")
print(tokenized_dataset[0]['input_ids'])

print("\nDecoded sample (first item from the tokenized_dataset, special tokens skipped):")
# Correctly extract the 1D list/tensor of token IDs for decoding
input_ids_for_decode = tokenized_dataset[0]['input_ids'][0] # Get the first (and only) sequence from the batch of 1
if hasattr(input_ids_for_decode, 'tolist'): # Check if it's a tensor
    input_ids_list_for_decode = input_ids_for_decode.tolist()
else: # If not, assume it's already a list
    input_ids_list_for_decode = input_ids_for_decode
print(tokenizer.decode(input_ids_list_for_decode, skip_special_tokens=True))

# To see the version with special tokens (like <s>, </s>, [INST]):
# print("\nDecoded sample with special tokens:")
# print(tokenizer.decode(input_ids_list_for_decode, skip_special_tokens=False))

print("Data preprocessing complete.\n")


--- Step 6: Preprocessing Data ---


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Dataset formatted and tokenized.

Sample of tokenized data (raw input_ids tensor from the first example):
[[1, 1, 518, 25580, 29962, 16564, 373, 278, 1494, 4982, 6139, 29892, 2436, 263, 752, 7807, 3005, 295, 749, 24963, 29901, 13, 13, 11947, 12953, 29901, 13, 4806, 30010, 276, 3063, 363, 385, 18860, 5048, 13897, 411, 7773, 25078, 297, 14533, 5592, 327, 29892, 10803, 10923, 29892, 322, 9254, 304, 1371, 411, 373, 17696, 9595, 373, 263, 5735, 2060, 29889, 13, 13, 4013, 4982, 674, 3160, 29901, 13, 13, 29943, 861, 292, 5923, 5626, 322, 24557, 13, 23573, 1218, 29871, 29941, 5499, 29899, 22633, 23649, 411, 14533, 5592, 327, 322, 10803, 10923, 13, 7281, 5281, 5048, 12078, 322, 10803, 10923, 18224, 13, 5531, 292, 411, 9254, 363, 848, 11415, 322, 13883, 13, 7967, 4194, 1218, 411, 1749, 7463, 3815, 322, 1494, 1900, 23274, 13, 13, 1123, 1548, 1860, 29901, 13, 31681, 1019, 854, 7271, 411, 14533, 5592, 327, 315, 4345, 322, 3450, 13, 31681, 3767, 549, 5048, 322, 10803, 10923, 25078, 13, 31681, 28224,

In [9]:
# Step 7: Configure LoRA (Low-Rank Adaptation)

# LoRA significantly reduces the number of trainable parameters.
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices. Common values: 8, 16, 32, 64.
    lora_alpha=32, # Alpha scaling factor (r * 2 is a common starting point).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Modules to apply LoRA to in Llama 2.
                                                            # You can find these by printing model architecture.
    lora_dropout=0.05, # Dropout probability for LoRA layers.
    bias="none", # Set bias to 'none' for LoRA.
    task_type="CAUSAL_LM" # Causal Language Modeling.
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # This will show how few parameters are actually being trained.

trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484


In [11]:
# Step 8: Define Training Arguments

# Define output directory for checkpoints and final model
output_dir = "./upfreelance_llama2_7b_proposals_lora"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,  # Start with 1 due to Colab free tier memory limits.
    gradient_accumulation_steps=4,  # Effective batch size = batch_size * accumulation_steps (1*4=4)
    learning_rate=2e-4,             # Common learning rate for LoRA.
    num_train_epochs=1,             # Start with 1 epoch for testing. Increase for real training (e.g., 3-5).
                                    # **With only 1 data point, more epochs are meaningless here.**
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,               # Log every 10 steps.
    save_strategy="epoch",          # Save a checkpoint at the end of each epoch.
    # optim="paged_adamw_8bit",     # Optimizer that can save memory if needed (for 8-bit, with 4-bit it might not be necessary or compatible without extra checks)
    fp16=False, # Set to False when using 4-bit quantization as BitsAndBytes handles precision.
                # If using 8-bit or no quantization, and GPU supports it (T4 does), set to True.
    bf16=True, # Set to True if using bnb_4bit_compute_dtype=torch.bfloat16 and GPU supports it (T4 might not fully, Ampere+ does)
               # If not, ensure bnb_4bit_compute_dtype is torch.float16 and set fp16=True if not using 4-bit.
               # With 4-bit nf4 and compute_dtype bfloat16, bf16=True is generally correct.
    report_to="tensorboard",        # Optional: for tracking with TensorBoard.
    # max_steps = 10 # For quick testing: train only for a few steps. Remove for full training.
)

# Check GPU compatibility for bf16
if not torch.cuda.is_bf16_supported():
    print("BF16 is not supported on this GPU. TrainingArguments.bf16 will be ignored or might cause issues if True.")
    # If bf16 is not supported, and you used torch.bfloat16 in BitsAndBytesConfig,
    # you might need to switch BitsAndBytesConfig to use torch.float16
    # and set training_args.fp16 = True instead of bf16 = True.
    # For now, we'll proceed, but this is a common point of configuration.
    # Often, even if not fully supported, it might still run due to internal fallbacks or mixed precision.
    training_args.bf16 = False # Fallback if not supported
    # If you change compute_dtype to float16, also set training_args.fp16 = True

In [12]:
# Step 9: Initialize Trainer and Start Training (Placeholder for full training)

# Data collator for language modeling. MLM (Masked Language Modeling) is False for Causal LM.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset, # Using our single-example dataset
    # eval_dataset=tokenized_eval_dataset, # You would create this for evaluation
    data_collator=data_collator,
)

# Start training
print("Starting training (on a tiny dataset for demonstration)...")
# In a real scenario with more data, this would take time.
# With one data point and 1 epoch, it will be very fast.
try:
    trainer.train()
    print("Training finished.")
except Exception as e:
    print(f"An error occurred during training: {e}")
    print("This might be due to resource limitations or configuration issues.")
    print("Ensure your Colab instance has GPU allocated and consider reducing max_length or batch size further if it's an OOM error.")


# Save the LoRA adapters
lora_adapter_path = f"{output_dir}/final_lora_adapters"
model.save_pretrained(lora_adapter_path)
tokenizer.save_pretrained(lora_adapter_path) # Save tokenizer along with adapters
print(f"LoRA adapters saved to {lora_adapter_path}")

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training (on a tiny dataset for demonstration)...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


An error occurred during training: too many values to unpack (expected 4)
This might be due to resource limitations or configuration issues.
Ensure your Colab instance has GPU allocated and consider reducing max_length or batch size further if it's an OOM error.
LoRA adapters saved to ./upfreelance_llama2_7b_proposals_lora/final_lora_adapters


In [13]:
# Step 10: Inference with the Fine-Tuned LoRA Adapters

from peft import PeftModel
import gc

# Clear some memory before loading for inference
del model
del trainer
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

# Load the base model again (quantized)
base_model_name = "meta-llama/Llama-2-7b-hf"
quant_config_inf = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model_for_inference = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config_inf,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapters onto the base model
# Ensure lora_adapter_path is correct from the training step
inference_model = PeftModel.from_pretrained(base_model_for_inference, lora_adapter_path)
inference_model.eval() # Set model to evaluation mode

# Load the tokenizer (if not already loaded for inference)
tokenizer_inf = AutoTokenizer.from_pretrained(lora_adapter_path) # Load from where it was saved
if tokenizer_inf.pad_token is None:
    tokenizer_inf.pad_token = tokenizer_inf.eos_token
    tokenizer_inf.padding_side = "right"


# Prepare a new job post for proposal generation
new_job_post = """We are seeking a skilled Python developer to create a data scraping tool for our e-commerce analytics platform. The ideal candidate will have experience with libraries like Scrapy or Beautiful Soup, be able to handle dynamic JavaScript-rendered content, and store data efficiently in a PostgreSQL database.

Key Responsibilities:
- Design and build web scrapers.
- Implement data cleaning and validation.
- Integrate with our existing PostgreSQL database.
- Troubleshoot and maintain scraping scripts.

Requirements:
- Proven experience in web scraping.
- Strong Python skills (Scrapy, Beautiful Soup, Requests).
- Experience with JavaScript-heavy websites (e.g., using Selenium or Puppeteer via Pyppeteer).
- Familiarity with PostgreSQL.
- Ability to write clean, maintainable code.

Please provide examples of previous scraping projects.
"""

# Format the input prompt for the model (MUST match the training format)
prompt_template_inf = "<s>[INST] Based on the following job description, write a compelling freelance proposal:\n\nJob Description:\n{job_description} [/INST]\n"
inference_prompt = prompt_template_inf.format(job_description=new_job_post)

print("\n--- Generating Proposal ---")
print(f"Input Prompt:\n{inference_prompt}")

# Tokenize the input
inputs = tokenizer_inf(inference_prompt, return_tensors="pt", truncation=True, max_length=1024).to(inference_model.device)


# Generate text
# Adjust generation parameters as needed
# Note: Since we "trained" on only one example, the output will likely not be good.
# This just demonstrates the inference pipeline.
try:
    with torch.no_grad(): # Ensure no gradients are calculated during inference
        outputs = inference_model.generate(
            **inputs,
            max_new_tokens=500,  # Max length of the generated proposal
            temperature=0.7,     # Controls randomness. Lower is more deterministic.
            top_p=0.9,           # Nucleus sampling.
            top_k=50,            # Top-k sampling.
            do_sample=True,      # Enable sampling for more creative output.
            eos_token_id=tokenizer_inf.eos_token_id,
            pad_token_id=tokenizer_inf.pad_token_id if tokenizer_inf.pad_token_id is not None else tokenizer_inf.eos_token_id
        )

    generated_text = tokenizer_inf.decode(outputs[0], skip_special_tokens=True)

    # The output will contain the input prompt as well, so we can try to extract just the proposal part
    # This depends on the exact output format and might need adjustment.
    # A simple way is to find the end of the instruction marker.
    inst_marker = "[/INST]"
    proposal_part = generated_text.split(inst_marker)[-1].strip() if inst_marker in generated_text else generated_text

    print(f"\nGenerated Proposal (raw output includes prompt):\n{generated_text}")
    print(f"\nExtracted Proposal Part:\n{proposal_part}")

except Exception as e:
    print(f"An error occurred during inference: {e}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


--- Generating Proposal ---
Input Prompt:
<s>[INST] Based on the following job description, write a compelling freelance proposal:

Job Description:
We are seeking a skilled Python developer to create a data scraping tool for our e-commerce analytics platform. The ideal candidate will have experience with libraries like Scrapy or Beautiful Soup, be able to handle dynamic JavaScript-rendered content, and store data efficiently in a PostgreSQL database.

Key Responsibilities:
- Design and build web scrapers.
- Implement data cleaning and validation.
- Integrate with our existing PostgreSQL database.
- Troubleshoot and maintain scraping scripts.

Requirements:
- Proven experience in web scraping.
- Strong Python skills (Scrapy, Beautiful Soup, Requests).
- Experience with JavaScript-heavy websites (e.g., using Selenium or Puppeteer via Pyppeteer).
- Familiarity with PostgreSQL.
- Ability to write clean, maintainable code.

Please provide examples of previous scraping projects.
 [/INST]

