# SAWiT AI Hackathon: Colloquial Dataset Creation and Model Training

Welcome to the **SAWiT AI Hackathon**! In this hackathon, you'll be creating a colloquial language dataset for one of the following languages:

- **Marathi**
- **Tamil**
- **Hindi**
- **Telugu**
- **Malayalam**
- **Bengali**

### Task Overview

1. **Pick a Language:** Choose one language from the list above.
2. **Create a Colloquial Dataset:** Collect and curate a dataset that represents the colloquial language of your selected language. This could include informal conversations, slang, and regional variations.
3. **Train a Model:** Use your dataset to train a machine learning model that understands or processes the colloquial language. You can use existing models and fine-tune them with your dataset.
4. **Push the Dataset & Model to Hugging Face:** Once the model is trained, push both your dataset and model to [Hugging Face](https://huggingface.co). You will need to create a Hugging Face account if you don't have one already.
5. **Share the Final Links for Evaluation:** After pushing the dataset and model to Hugging Face, share the final links for evaluation in [Hackathon Platform](https://hackathon.guvi.in/login?hackathon-id=ed007f87-a6b9-47b5-9530-eac83b4033bf).

### Helpful Resources

- **[Unsloth](https://www.unsloth.ai/)**: An optimization framework for fine-tuning Large Language Models (LLMs) that makes training 2-4x faster. It provides optimized implementations of common operations like LoRA (Low-Rank Adaptation) training, specialized kernels for faster computation, and memory-efficient training methods. The framework integrates with Hugging Face's ecosystem and focuses on making LLM fine-tuning more efficient and accessible.
- **[Hugging Face](https://huggingface.co/)**: A platform for hosting and sharing your datasets and models. You can push your trained models and datasets to Hugging Face and share the link for evaluation.

### Steps Overview:

1. **Dataset Creation**: Use Unsloth or similar approach for dataset creation (link above).
2. **Model Training**: Fine-tune or train a model using your dataset.
3. **Push to Hugging Face**: Upload your model and dataset to Hugging Face.
4. **Evaluation**: Share your Hugging Face model URL for evaluation.

### Example Output
``

**Input to Translate**: "What is data structure?"

**Output Expected (Tamil)**: "Data structure na yenna?"

``
### Good Luck!
We are excited to see your contributions! Happy coding and training! 🚀


`Below is an example that demonstrates the process of sample dataset creation, model training using Unsloth, and uploading it to Hugging Face`


In [1]:
# Install required packages
!pip install torch==2.5.1
!pip install transformers datasets accelerate bitsandbytes
!pip install unsloth
!pip install peft

import torch
from datasets import Dataset
from unsloth import FastLanguageModel
import pandas as pd
from datetime import datetime
from transformers import TrainingArguments, Trainer

# Verify GPU
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.5.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.5.1)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.5.1)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.5.1)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.5.1)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.5.1)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torc

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
CUDA Available: True
GPU Device: Tesla T4


In [3]:
# Initialize model with unsloth and PEFT
from peft import LoraConfig, get_peft_model

# First initialize the model
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
MAX_LENGTH = 128

# Initialize base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_LENGTH,
    dtype=torch.bfloat16,
    load_in_4bit=True,
    trust_remote_code=True,
  #  attn_implementation="eager" ,
    use_cache=False,
    # device_map="auto"
)

# Disable xformers attention
model.config.use_cache = False
model.config.pretraining_tp = 1
model.config.use_flash_attention = False

# Add LoRA adapter configuration
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters to the model
model = get_peft_model(model, lora_config)
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
model.print_trainable_parameters()  # This will show you the trainable parameters

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:  18%|#7        | 136M/762M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

trainable params: 4,505,600 || all params: 1,104,553,984 || trainable%: 0.4079


In [4]:
# Create training dataset, this is an example, You can load from hugging face
training_data = {
    'text': [
        "### Human: Translate to Tamil colloquial: What is data structure?\n### Assistant: Data structure na yenna?",
        "### Human: Translate to Tamil colloquial: How are you doing?\n### Assistant: Eppadi irukka?",
        "### Human: Translate to Tamil colloquial: What is your name?\n### Assistant: Un peru enna?",
        "### Human: Translate to Tamil colloquial: Where are you going?\n### Assistant: Enga pora?",
        "### Human: Translate to Tamil colloquial: What time is it?\n### Assistant: Time enna achu?",
        "### Human: Translate to Tamil colloquial: Can you explain this concept?\n### Assistant: Idha explain panna mudiyuma?",
        "### Human: Translate to Tamil colloquial: How does this work?\n### Assistant: Idhu eppadi work agudhu?",
        "### Human: Translate to Tamil colloquial: What is machine learning?\n### Assistant: Machine learning na enna?",
        "### Human: Translate to Tamil colloquial: Where can I find the documentation?\n### Assistant: Documentation enga irukku?",
        "### Human: Translate to Tamil colloquial: Why is this not working?\n### Assistant: Idhu yen work agala?"
    ]
}

# Create dataset
dataset = Dataset.from_dict(training_data)
dataset = dataset.shuffle(seed=42)
split_dataset = dataset.train_test_split(test_size=0.2)

# ADD THIS NEW PREPROCESSING FUNCTION HERE
def preprocess_function(examples):
    # Tokenize inputs
    model_inputs = tokenizer(
        examples['text'],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
        # return_tensors="pt"
    )

    # Create labels (same as input_ids for causal language modeling)
    model_inputs['labels'] = model_inputs['input_ids']

    return model_inputs

# THEN ADD THIS PREPROCESSING STEP
tokenized_train = split_dataset['train'].map(
    preprocess_function,
    remove_columns=['text'],
    batched=True
)
tokenized_val = split_dataset['test'].map(
    preprocess_function,
    remove_columns=['text'],
    batched=True
)

# Convert to PyTorch format AFTER preprocessing
tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_val.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


print(f"Training examples: {len(split_dataset['train'])}")
print(f"Validation examples: {len(split_dataset['test'])}")

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Training examples: 8
Validation examples: 2


In [11]:
# Hugging Face User Name
hugging_face_user_name="kalluriashok1326"

# Training configuration
training_args = TrainingArguments(
    output_dir=f"./english-tamil-colloquial-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    learning_rate=3e-4,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=1,  # More frequent updates for demo
    evaluation_strategy="steps",
    eval_steps=2,
    save_strategy="steps",
    save_steps=2,
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_model_id=f"{hugging_face_user_name}/english-tamil-colloquial-translator",
    fp16=True,
    gradient_accumulation_steps=2,
    warmup_steps=2,
    report_to=["none"],
    optim='adamw_torch',
     dataloader_pin_memory=False,
    torch_compile=False,  # Disable torch compile
    gradient_checkpointing=True  # Enable gradient checkpointing
)



In [12]:
# Login to Hugging Face
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [14]:
# Initialize trainer and start training (replace the existing training code)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val
)

# Modified training with progress bar
print("Starting training...")
trainer.train()

# Save and push to hub
trainer.save_model()
trainer.push_to_hub()

print("Preparing model for inference...")
model = FastLanguageModel.for_inference(model)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 8 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 4,505,600


Starting training...


Step,Training Loss,Validation Loss
2,14.9065,9.325418
4,14.9065,9.325418
6,14.9065,9.325418
8,14.9065,9.325418
10,14.9065,9.325418


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


Preparing model for inference...


In [15]:
def translate_to_tamil_colloquial(text):
    prompt = f"""### Human: You are a Tamil colloquial language translator. Translate the following English text to Tamil colloquial language (spoken Tamil).
Here are some examples:
"What is this?" -> "Idhu enna?"
"How are you?" -> "Eppadi irukka?"
"Where are you going?" -> "Enga pora?"

Now translate this: {text}
### Assistant: """
    inputs = tokenizer(prompt, return_tensors="pt", max_length=MAX_LENGTH, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    outputs = model.generate(
        **inputs,
        max_length=MAX_LENGTH,
        do_sample=True,          # Enable sampling
        temperature=0.7,         # Keep temperature
        num_return_sequences=1,  # Return only one sequence
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        num_beams=1,            # Disable beam search
        top_p=0.95,             # Add top_p sampling
        repetition_penalty=1.2   # Add repetition penalty
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("### Assistant:")[-1].strip()

# Test the model
test_sentences = [
    "What is your name?",
    "What is data structure?",
    "How does this work?",
    "Can you explain this to me?",
    "Where can I find the documentation?",
    "What is the error in this code?"
]

print("\nTesting the model:\n")
for sentence in test_sentences:
    translation = translate_to_tamil_colloquial(sentence)
    print(f"English: {sentence}")
    print(f"Tamil Colloquial: {translation}")
    print("-" * 50)


Testing the model:

English: What is your name?
Tamil Colloquial: Your name is 'Ramya.'

And now, in Tamil colloquial language:
'
--------------------------------------------------
English: What is data structure?
Tamil Colloquial: Data Structure is a set of rules and guidelines that help us organize, store, manipulate or retrieve information
--------------------------------------------------
English: How does this work?
Tamil Colloquial: 1) To use a translator, simply enter your message in the input box and click on “Translate”.
--------------------------------------------------
English: Can you explain this to me?
Tamil Colloquial: (speaks in a conversational tone)
Sure! Here's what I meant by asking,
--------------------------------------------------
English: Where can I find the documentation?
Tamil Colloquial: In Tamil, it would be translated as:
இன்னதுக்க
--------------------------------------------------
English: What is the error in this code?
Tamil Colloquial: The given input

## Interactive Demo
Try your own sentences below:

In [16]:
from IPython.display import HTML, display

def interactive_translation(text_input):
    if text_input:
        translation = translate_to_tamil_colloquial(text_input)
        display(HTML(f"""
        <div style='padding: 10px; border-radius: 5px;'>
            <b>English:</b> {text_input}<br>
            <b>Tamil Colloquial:</b> {translation}
        </div>
        """))

# Create interactive widget
from ipywidgets import interact, widgets
interact(interactive_translation,
         text_input=widgets.Text(description='English:', placeholder='Enter text to translate'))

interactive(children=(Text(value='', description='English:', placeholder='Enter text to translate'), Output())…