<a href="https://colab.research.google.com/github/Musab678/LDA/blob/main/last_final_assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git


In [2]:
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get('HUG_KEY'))  # Use your HF Token


In [3]:
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token=userdata.get('HUG_KEY'),
)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [5]:
# Apply LoRA adapter for training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Use "unsloth" for long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.2.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [6]:
# Define the train_prompt_style and EOS_TOKEN before using them
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.



### Instruction: You are an expert in the fields of Accounting, Banks and Banking, Business, Communication, Career management, E-Business, Economics, Entrepreneurship, Finance, Financial Planning, Industries and Professions, Investments, Management and Leadership, Marketing and Sales, Real Estate, Stock Trading, Tax, Small Business and Entrepreneurship.
Write a response focusing only on these business categories.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""


In [7]:
EOS_TOKEN = "<|endoftext|>"  # Or any other suitable end-of-sequence token


In [9]:
# Load dataset
dataset = load_dataset("json", data_files="/content/fine_tuning_data.jsonl", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [10]:
# Function to format the dataset according to the updated structure
def formatting_prompts_func(examples):
    inputs = examples["prompt"]  # Use the correct column name
    outputs = examples["completion"]  # Use the correct column name
    texts = []

    for input_text, output_text in zip(inputs, outputs):
        # Format the prompt with instructions to stay within the specified domains
        text = train_prompt_style.format(input_text, "", output_text) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/894 [00:00<?, ? examples/s]

In [11]:
# Inspect the first example to verify the format
print(dataset["text"][0])

Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.



### Instruction: You are an expert in the fields of Accounting, Banks and Banking, Business, Communication, Career management, E-Business, Economics, Entrepreneurship, Finance, Financial Planning, Industries and Professions, Investments, Management and Leadership, Marketing and Sales, Real Estate, Stock Trading, Tax, Small Business and Entrepreneurship.
Write a response focusing only on these business categories.

### Question:
lack of visibility into financial performance across different departments implement a financial performance dashboard with department-specific data and analytics

### Response:
<think>

</think>
accounting<|endoftext|>


In [12]:
# Apply LoRA adapter for training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Use "unsloth" for long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth: Already have LoRA adapters! We shall skip this step.


In [13]:
# Define trainer settings
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Use this for WandB or other reporting tools
    ),
)

# Start training
trainer_stats = trainer.train()

Map:   0%|          | 0/894 [00:00<?, ? examples/s]

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/894 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/894 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/894 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/894 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 894 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,2.9349
20,2.1634
30,2.0184
40,1.8145
50,1.817
60,1.4961


In [14]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.


### Instruction:
Instruction: You are an expert in the fields of Accounting, Banks and Banking, Business, Communication, Career management, E-Business, Economics, Entrepreneurship, Finance, Financial Planning, Industries and Professions, Investments, Management and Leadership, Marketing and Sales, Real Estate, Stock Trading, Tax, Small Business and Entrepreneurship.
Write a response focusing only on these business categories.

### Question:
{}

### Response:
<think>{}"""

In [29]:
def is_business_related(question):
    # Define expanded keywords for various business-related categories
    business_keywords = [
        "accounting", "bookkeeping", "audit", "financial statement", "ledgers", "tax filing", "balance sheet", "revenue",
        "expenses", "profit", "liabilities", "assets", "accounts receivable", "accounts payable",  # Accounting
        "bank", "banking", "loans", "credit", "savings", "interest rates", "mortgage", "account management", "transactions",
        "bank fees", "digital banking", "financial institution", "deposit", "withdrawal", "business", "sales", "strategy", "growth", "payment", "improvement", "revenue", "startup", "operations",
        "entrepreneurship", "management", "finance", "marketing", "sales", "investment", "market", "payment gateway",  # Banks and Banking
        "business", "strategy", "growth", "startup", "operations", "team management", "business model", "venture",
        "innovation", "profit", "scale", "market share", "industry", "competition",  # Business
        "communication", "email", "meeting", "negotiation", "public relations", "corporate communication", "branding",
        "presentations", "messaging",  # Business Communication
        "career", "job search", "resume", "skills", "job interview", "career development", "promotion", "networking",
        "professional growth", "career path",  # Career Management
        "e-commerce", "online store", "digital marketing", "online business", "website", "payment gateway", "online transaction",
        "digital sales", "product listings", "dropshipping", "affiliate marketing", "e-business model",  # E-Business
        "economics", "market", "supply and demand", "inflation", "GDP", "trade", "fiscal policy", "monetary policy",
        "recession", "growth", "economic theory", "economic trends", "market equilibrium",  # Economics
        "entrepreneur", "startup", "business plan", "funding", "innovation", "pitching", "venture capital", "small business",
        "founder", "leadership", "scalability", "risk-taking", "idea generation",  # Entrepreneurship
        "finance", "investments", "capital", "assets", "liabilities", "equity", "risk", "portfolio", "stock market",
        "bonds", "savings", "returns", "financial analysis", "wealth management",  # Finance
        "financial planning", "budgeting", "savings plan", "retirement", "financial goals", "investment strategies",
        "debt management", "financial advisor", "wealth planning", "estate planning", "tax planning",  # Financial Planning
        "industry", "profession", "career", "job market", "manufacturing", "technology", "healthcare", "retail", "education",
        "service sector", "labor market", "sector", "business professional",  # Industries and Professions
        "investment", "stocks", "bonds", "mutual funds", "ETF", "portfolio management", "risk assessment", "diversification",
        "asset allocation", "real estate investment", "venture capital", "angel investors",  # Investments
        "management", "leadership", "team management", "project management", "decision-making", "strategic planning",
        "organizational structure", "performance", "coaching", "mentoring",  # Management and Leadership
        "marketing", "sales", "advertising", "branding", "content marketing", "SEO", "SEM", "social media", "customer acquisition",
        "lead generation", "market research", "sales funnel", "digital marketing",  # Marketing and Sales
        "real estate", "property", "investment property", "rental", "leasing", "market trends", "commercial real estate",
        "residential property", "real estate agent", "real estate market", "home buying", "property management",  # Real Estate
        "stock trading", "shares", "portfolio", "stock market", "day trading", "brokerage", "stock analysis", "trading strategy",
        "market trends", "investment", "financial markets", "equity", "risk management",  # Stock Trading
        "tax", "taxation", "income tax", "corporate tax", "tax filing", "tax returns", "deductions", "tax planning", "tax strategy",
        "VAT", "tax bracket", "corporate tax rate", "tax laws", "compliance",  # Tax
        "small business", "entrepreneur", "startup", "business plan", "market research", "funding", "business development",
        "small business strategy", "scaling", "operations", "cash flow", "small business management", "business ownership"  # Small Business and Entrepreneurship
    ]
    return any(keyword in question.lower() for keyword in business_keywords)

question = "Best practices of sell tax"

if is_business_related(question):
    FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
    inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=1200,
        use_cache=True,
    )
    response = tokenizer.batch_decode(outputs)
    print(response[0].split("### Response:")[1])
else:
    print("The question is not business-related.")



<think>
Okay, so I'm trying to figure out the best practices for selling tax. First, I need to understand what selling tax means. It sounds like it's about selling tax-related products or services. Maybe that includes tax software, tax consulting, or tax education. I should think about the different industries and professions involved, like accounting, finance, and business. 

I remember reading that understanding the audience is crucial. So, if I'm selling tax software, my audience might be accountants, tax professionals, and small business owners. Each group has different needs, so I need to tailor my approach. For accountants, I might highlight features like automation and integration with accounting software. For small business owners, I might focus on ease of use and affordability. 

Next, I think about marketing strategies. Maybe using social media to reach a wider audience, or creating content that provides value, like tax tips or tutorials. Networking with professionals in the

In [17]:
# Save the trained model
model.save_pretrained("deepseek-business-solution-think")
tokenizer.save_pretrained("deepseek-business-solution-think")

('deepseek-business-solution-think/tokenizer_config.json',
 'deepseek-business-solution-think/special_tokens_map.json',
 'deepseek-business-solution-think/tokenizer.json')

In [20]:
from huggingface_hub import create_repo, upload_folder

# Define the model path and repository name
repo_name = "business-assistant-think"  # Removed trailing space
model_path = "/content/deepseek-business-solution-think"  # Replace with the path to your saved model folder

# Check if the repository exists, and skip creation if it does
try:
    create_repo(repo_name, private=False, exist_ok=True) # Use exist_ok=True to skip if it exists
except Exception as e:
    print(f"Warning: Repository creation failed with error: {e}")
    print("Continuing with upload, assuming the repository already exists.")

# Upload the model folder to Hugging Face
upload_folder(
    folder_path=model_path,
    repo_id=f"Musab123456/{repo_name}"  # Replace with your Hugging Face username and repo name
)


adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Musab123456/business-assistant-think/commit/4b18d746c95d2471037132e387e7a01584f2b47d', commit_message='Upload folder using huggingface_hub', commit_description='', oid='4b18d746c95d2471037132e387e7a01584f2b47d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Musab123456/business-assistant-think', endpoint='https://huggingface.co', repo_type='model', repo_id='Musab123456/business-assistant-think'), pr_revision=None, pr_num=None)

In [22]:
import shutil

# Create a zip file of the model folder
shutil.make_archive("/content/deepseek-business-solution-think", 'zip', "/content/deepseek-business-solution-think")


'/content/deepseek-business-solution-think.zip'

In [24]:
from google.colab import files

# Download the zip file
files.download('/content/deepseek-business-solution-think.zip')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>