This notebook uses the `Llama-3` format for conversation style finetunes. We use [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style.

In [6]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install langtorch
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

In [1]:
import torch
import json
from torch.nn import functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

@torch.no_grad()
def eval_function(model, tokenizer, dataset_name='hellaswag', split='validation', language='pol_Latn', device='cuda', verbose=False):
    """
    Evaluate a language model on a given dataset.
    
    Args:
    - model: The language model to evaluate
    - tokenizer: The tokenizer for the model
    - dataset_name: 'hellaswag', 'belebele', or 'hellaswag_pl'
    - split: The dataset split to use (for hellaswag)
    - language: The language to use (for belebele)
    - device: The device to run the model on

    Returns:
    - accuracy: The normalized accuracy of the model on the dataset
    """
    
    def get_dataset():
        if dataset_name == 'hellaswag':
            ds = load_dataset("Rowan/hellaswag", split=split)
            return ds.map(lambda x: {"label": int(x["label"])})
        elif dataset_name == 'belebele':
            ds = load_dataset("facebook/belebele", language)
            # hellaswag format
            return ds.map(lambda x: {
                "ctx": x["question"],
                "label": int(x["correct_answer_num"]) - 1,
                "endings": [x["mc_answer1"], x["mc_answer2"], x["mc_answer3"], x["mc_answer4"]]
            })["test"]
        elif dataset_name == 'hellaswag_pl':
            ds = []
            with open("../../translate_hellaswag/translations.jsonl", "r") as f:
                for line in f.readlines():
                    if "{\n" in line:
                        jsonl_line = line
                    elif "}\n" in line:
                        jsonl_line += line
                        jsonl_line = jsonl_line.replace("\n", "")
                        ds.append(json.loads(jsonl_line))
                    else:
                        jsonl_line += line
                        
                for entry in ds:
                    entry["label"] = int(entry["label"])
            return ds
        else:
            raise ValueError("Unknown dataset name")

    def render_example(example):
        ctx = example["ctx"]
        label = example["label"]
        endings = example["endings"]
        if len(endings)>4:
            return None, None, None

        ctx_tokens = tokenizer.encode(ctx)
        tok_rows = []
        mask_rows = []
        for end in endings:
            end_tokens = tokenizer.encode(" " + end)
            tok_rows.append(ctx_tokens + end_tokens)
            mask_rows.append([0]*len(ctx_tokens) + [1]*len(end_tokens))

        max_len = max(len(row) for row in tok_rows)
        tokens = torch.zeros((4, max_len), dtype=torch.long)
        mask = torch.zeros((4, max_len), dtype=torch.long)
        for i, (tok_row, mask_row) in enumerate(zip(tok_rows, mask_rows)):
            tokens[i, :len(tok_row)] = torch.tensor(tok_row)
            mask[i, :len(mask_row)] = torch.tensor(mask_row)

        return tokens, mask, label

    model.to(device)
    model.eval()

    ds = get_dataset()
    num_correct_norm = 0
    num_total = 0
    with torch.no_grad():
        for example in ds:
            tokens, mask, label = render_example(example)
            if tokens is None:
                continue
            tokens = tokens.to(device)
            mask = mask.to(device)

            logits = model(tokens).logits
            shift_logits = logits[..., :-1, :].contiguous()
            shift_tokens = tokens[..., 1:].contiguous()
            shift_mask = mask[..., 1:].contiguous()

            flat_shift_logits = shift_logits.view(-1, shift_logits.size(-1))
            flat_shift_tokens = shift_tokens.view(-1)

            shift_losses = F.cross_entropy(flat_shift_logits, flat_shift_tokens, reduction='none')
            shift_losses = shift_losses.view(tokens.size(0), -1)

            masked_shift_losses = shift_losses * shift_mask
            avg_loss = masked_shift_losses.sum(dim=1) / shift_mask.sum(dim=1)
            pred_norm = avg_loss.argmin().item()

            num_total += 1
            num_correct_norm += int(pred_norm == label)

            if verbose and num_total % 100 == 0:
                print(f"Processed {num_total} examples. Current accuracy: {num_correct_norm/num_total:.4f}")

    accuracy = num_correct_norm / num_total
    print(f"Final accuracy: {accuracy:.4f}")
    
    return accuracy


# print(eval_function(model, tokenizer, "belebele"))

In [1]:
from unsloth import FastLanguageModel
import torch
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

chekpoint_path = "unsloth/Meta-Llama-3.1-8B"#-instruct" #"outputs/checkpoint-7000" 
model, tokenizer = FastLanguageModel.from_pretrained(
  model_name = chekpoint_path, 
  max_seq_length = max_seq_length,
  dtype = dtype,
  load_in_4bit = load_in_4bit
)


ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
GPU = NVIDIA H100 80GB HBM3. Max memory = 79.216 GB.
0.0 GB of memory reserved.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA H100 80GB HBM3. Max memory: 79.216 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 9.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [9]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

def scale_adapters(model, scale, previous_scale):
    for name, param in model.named_parameters():
        if 'lora' in name:
            param.mul_(scale/previous_scale)

def add_scaled_adapters(model_base, adapter_list):
    """
    Load a model with multiple LoRA adapters.
    
    Args:
    - model_name (str): Name of the base model
    - adapter_list (list): List of tuples (adapter_name, alpha) or a single tuple
    
    Returns:
    - model: The loaded model with adapters applied
    - tokenizer: The tokenizer for the model
    """

    if isinstance(adapter_list, tuple):
        adapter_list = [adapter_list]

    model = model_base
    for i, (adapter_name, alpha) in enumerate(adapter_list):
        model = PeftModel.from_pretrained(model, adapter_name)
        
    adjusted_alpha = alpha / len(adapter_list)
    scale_adapters(model, adjusted_alpha, 1)
    return model

def binary_search_optimal_alpha(model_base, adapter_name, eval_function, dataset_name='hellaswag', split='validation', language='pol_Latn', device='cuda', tolerance=1e-2, acc_0_2 = None):
    """
    Find the optimal alpha scaling factor for a single adapter using binary search.
    
    Args:
    - model_base (str or CausalLM): The base model or its name
    - adapter_name (str): Name of the LoRA adapter
    - eval_function (function): Evaluation function that returns accuracy
    - dataset_name (str): Name of the dataset to use for evaluation
    - split (str): Dataset split to use
    - language (str): Language for Belebele dataset
    - device (str): Device to run the model on
    - tolerance (float): Tolerance for binary search convergence
    
    Returns:
    - optimal_alpha (float): The optimal alpha scaling factor
    - best_accuracy (float): The best accuracy achieved
    """
    global tokenizer
    if isinstance(model_base, str):
        model_base, tokenizer = FastLanguageModel.from_pretrained(
          model_name = model_base, 
          max_seq_length = max_seq_length,
          dtype = dtype,
          load_in_4bit = load_in_4bit
        )
    
    left, right = 0.0, 1.125
    best_alpha = None
    best_accuracy = float('-inf')

    if acc_0_2 is None:
        # Calculate accuracy for alpha = 0 (no adapter)
        accuracy_left = eval_function(model_base, tokenizer, dataset_name, split, language, device)
        print(f"Alpha: 0.0000, Accuracy: {accuracy_left:.4f}")
        
        # Calculate accuracy for alpha = 2
        mid = right
        model = add_scaled_adapters(model_base, (adapter_name, right))
        accuracy_right = eval_function(model, tokenizer, dataset_name, split, language, device)
        print(f"Alpha: 2.0000, Accuracy: {accuracy_right:.4f}")
        
    else:
        mid = 1.0
        model = add_scaled_adapters(model_base, (adapter_name, mid))
        accuracy_left, accuracy_right = acc_0_2

    
    if accuracy_left > best_accuracy:
        best_accuracy = accuracy_left
        best_alpha = 0.0
    if accuracy_right > best_accuracy:
        best_accuracy = accuracy_right
        best_alpha = 2.0

    while right - left > tolerance:
        scale_adapters(model, (left + right) / 2, mid)
        mid = (left + right) / 2
        
        accuracy_mid = eval_function(model, tokenizer, dataset_name, split, language, device)
        
        print(f"Alpha: {mid:.4f}, Accuracy: {accuracy_mid:.4f}")
        
        if accuracy_mid > best_accuracy:
            best_accuracy = accuracy_mid
            best_alpha = mid

        # Decide which half to continue searching
        if accuracy_left < accuracy_mid > accuracy_right:
            # Peak is between left and right
            if mid - left < right - mid:
                right = mid
                accuracy_right = accuracy_mid
            else:
                left = mid
                accuracy_left = accuracy_mid
        elif accuracy_left > accuracy_mid:
            # Peak is on the left side
            right = mid
            accuracy_right = accuracy_mid
        else:
            # Peak is on the right side
            left = mid
            accuracy_left = accuracy_mid

    return best_alpha, best_accuracy

    
    return best_alpha, best_accuracy

model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"
adapter_name = "ASobieszek/l3.1-wiki-15k-128-wd0.5"

optimal_alpha, best_accuracy = binary_search_optimal_alpha(model_name, adapter_name, eval_function, "belebele", acc_0_2 = (0.2689,0.2833))
print(f"Optimal alpha: {optimal_alpha:.4f}, Best accuracy: {best_accuracy:.4f}")

# optimal_model, optimal_tokenizer = load_model_with_adapters(model_name, (adapter_name, optimal_alpha))

# final_accuracy = eval_function(optimal_model, optimal_tokenizer, "belebele")
# print(f"Final accuracy with optimal alpha: {final_accuracy:.4f}")

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA H100 80GB HBM3. Max memory: 79.216 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 9.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Final accuracy: 0.2844
Alpha: 0.5625, Accuracy: 0.2844
Final accuracy: 0.2856
Alpha: 0.8438, Accuracy: 0.2856
Final accuracy: 0.2833
Alpha: 0.9844, Accuracy: 0.2833
Final accuracy: 0.2822
Alpha: 0.9141, Accuracy: 0.2822


KeyboardInterrupt: 

In [1]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
from unsloth import FastLanguageModel
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

chekpoint_path = "unsloth/Meta-Llama-3.1-8B"#-instruct" #"outputs/checkpoint-7000" 
model, tokenizer = FastLanguageModel.from_pretrained(
  model_name = chekpoint_path, 
  max_seq_length = max_seq_length,
  dtype = dtype,
  load_in_4bit = load_in_4bit
)

# # Call the eval function
# print(eval_function(model_base, tokenizer, "belebele"))

# for adapter_name in ["ASobieszek/l3.1-wiki-15k-128-wd0.5", "ASobieszek/l3.1-wiki-20k-256-wd0.5", "ASobieszek/l3.1-wiki-15k-128-wd0.01"]:
#     model = PeftModel.from_pretrained(model_base, adapter_name)
#     print(adapter_name.split("wiki-")[1])
#     print(eval_function(model, tokenizer, "belebele"))

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
GPU = NVIDIA H100 80GB HBM3. Max memory = 79.216 GB.
0.0 GB of memory reserved.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA H100 80GB HBM3. Max memory: 79.216 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 9.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [5]:
from langtorch import TextModule, TextTensor, ChatTensor
from langtorch.tt import Activation
from langtorch import Text, Chat
from langtorch import Text
from huggingface_hub import notebook_login
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments, IntervalStrategy
from datasets import load_dataset, Dataset
from trl import SFTTrainer
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
from tqdm import tqdm

def load_data():
    class LLamaChat(Chat):
        def __str__(self):
            formatted = ""
            for role, content in self.items():
                formatted += f"<|start_header_id|>{role}<|end_header_id|>\n\n{Text(content,parse=False)}<|eot_id|>"
            if role == "user": # Add assistant prompt for generation
                formatted += "<|start_header_id|>assistant<|end_header_id|>"
            return formatted

    interview = pd.read_csv(r"interviews.csv")
    interview["doctor_diagnosis"] = interview["doctor_diagnosis"].apply(lambda x: x.split('code":"')[1].split('"')[0])
    interview_tensor = TextTensor(dict((k,interview[k].to_list()) for k in ["doctor_interview","doctor_diagnosis"]),parse = False)

    answers = TextTensor((
        ("system", "JesteÅ› lekarzem, ktÃ³ry stawia do danego wywiadu medycznego diagnozÄ™."),
        ("user","doctor_interview"),
        ("assistant","doctor_diagnosis"))
    )*interview_tensor

    answers.ttype = LLamaChat
    print(answers.item())
    def first_n_words(s, n):
        words = s.split()
        first_n_words = words[:n]
        return ' '.join(first_n_words)


    return [str(m) for m in answers.flat]


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32


<a name="Data"></a>
### Data Prep

In [None]:
from datasets import load_dataset, Dataset
from functools import partial
import re

def process_wikipedia_dataset(prompt, dataset_name="JonaszPotoniec/wikipedia-with-statistics-pl"):
    """
    Load, filter, and format a Wikipedia dataset using HuggingFace's Dataset class optimizations.
    
    Args:
    - prompt: To format entries
    - dataset_name: The name of the dataset to load from HuggingFace

    Returns:
    - A processed dataset with formatted prompts
    """
    
    # Define banned strings for filtering
    banned_strings = [
        "â€“ wieÅ›", "â€“ gmina", "â€“ miasto", "â€“ jednostka administracyjna", "â€“ stacja", "â€“ jaskinia", "â€“ park", "â€“ powiat", "â€“ chutor", "miejscowoÅ›Ä‡", "planetoida", "â€“ osada", "â€“ supernowa", "â€“ galaktyka", "â€“ gwiazda", "â€“ rzeka",  "â€“ osiedle",  "â€“ kolonia",  "â€“ port", "â€“ gaun", "â€“ przystanek", "â€“ jezioro", "â€“ okrÄ™g", "â€“ dystrykt", "â€“ droga", "â€“ ulica", "â€“ przysiÃ³Å‚ek", "â€“ wyspa", "â€“ pomnik", "â€“ wzniesienie", "â€“ rezerwat", "- prowincja", "- rejon", "- zatoka", "hrabstw", "- region", "- szczyt", "- potok", 
        "â€“ gitara", "â€“ perkusja", "â€“ piosenka",  "â€“ singel", "- budynek",
        "â€“ parafia",  "â€“ rzymskokatolick", "â€“ diecezja", "â€“ duchown", "â€“ koÅ›ciÃ³Å‚", "â€“ biskup", "â€“ prawosÅ‚awn", "â€“ synagoga", "â€“ droga", 
        "â€“ dawn", "â€“ zabytkowy", "â€“ herb", "â€“ oficerowie", "â€“ generaÅ‚", "- podpuÅ‚kownik", "- kasztelan", "- sÄ™dzia", "- pododdzia", "Å¼oÅ‚nierz", "- hrabia", "- funkcjonariusz",
        "w Rumunii", "struga",
        "w reÅ¼yserii", "debiutancki", 
        "- turniej", "- zawody", "wioÅ›lar", "Å‚yÅ¼wiar", "tenis", "piÅ‚kar", "piÅ‚ce", "klub piÅ‚karski", "aktor", "piosenkar"
    ]
    
    def format_and_filter1(example):
        """
        Format a single example and filter based on banned strings.
        """
        try:
            # Check if any banned string is in the text
            if not example['text'] or any(s in example['text'] for s in banned_strings):
                return {'text': ''}  # This will filter out the example
            if example['pageviews'] < 10:
                return {'text': 'obscure'}
            
            def format_text(t):
                for end in ["\n\nUwagi","\n\nPrzypisy","\n\nBibliografia", "\n\nLinki zewnÄ™trzne"]:
                    t = t.split(end)[0]
                return t  
            formatted_text = format_text(example['text'])
            return example | {'text': formatted_text}
        except:
            return {'text': ''}

    def split_long_entries(examples, max_len = 7_000):
        """
        Split long entries into multiple examples.
        """
        new_examples = []
        for example in examples:
            if len(example['text']) <= max_len:
                new_examples.append({'text': prompt.format(example['title'], example['text'].strip()) + EOS_TOKEN})
            else:
                text = example['text']
                while text:
                    # Find the last paragraph break within the first 10k characters
                    split_index = -1
                    if len(text)<max_len+1000:
                        split_index = len(text)
                    else:
                        while split_index == -1: # tweak this
                            split_index = text[:max_len].rfind('\n\n')
                            if split_index == -1:  # If no paragraph break, just take the first max_len
                                split_index = max_len
                    
                    chunk = {'text': prompt.format(example['title'], text[:split_index].strip()) + EOS_TOKEN}
                    new_examples.append(chunk)
                    
                    text = text[split_index:].strip()
                    if len(text)<500:
                        break
        
        return new_examples

    def format_and_filter2(example):
        try:
            formatted_text = prompt.format(example['title'], example['text']) + EOS_TOKEN
            return {'text': formatted_text}
        except:
            return {'text': ''}
    
    # Load the dataset
    dataset = load_dataset(dataset_name)['train']
    orig_len = len(dataset)

    # Apply initial formatting and filtering
    formatted_dataset = dataset.map(
        partial(format_and_filter1),
        num_proc=32  # Use multiple processes for speedup
    )
    
    # Filter out None values and obscure articles
    formatted_dataset = formatted_dataset.filter(lambda x: x['text'] != '' and x['text'] != 'obscure')
    diff = orig_len - len(formatted_dataset)
    print(f"Filtered out {diff} rows ({diff/orig_len*100:.2f}% of the original dataset)")
    
    # Split long entries
    all_examples = formatted_dataset.to_list()
    split_examples = split_long_entries(all_examples)
    
    print(f"Increased the dataset size by {(len(split_examples)/len(all_examples)-1)*100:.2f}% via splitting")
    formatted_dataset = Dataset.from_list(split_examples)
        
    print(formatted_dataset)
    
    # Final formatting
    # formatted_dataset = formatted_dataset.map(
    #     partial(format_and_filter2),
    #     remove_columns=formatted_dataset.column_names,
    #     num_proc=32
    # )
    
    # Filter out short entries
    formatted_dataset = formatted_dataset.filter(lambda x: len(x['text']) > 250)
    final_len = len(formatted_dataset)
    print(f"Final dataset has {final_len} entries ({final_len/orig_len*100:.2f}% of the original dataset)")
    
    return formatted_dataset

# Usage
EOS_TOKEN = tokenizer.eos_token
wikipedia_prompt = "{}\n\n{}"
dataset = process_wikipedia_dataset(wikipedia_prompt)

Filtered out 945258 rows (59.54% of the original dataset)


In [27]:
import pandas as pd
from collections import Counter
import re
def create_word_frequency_df(text_list, n=200):
    """
    Create a DataFrame with counts of the n most common words in a list of texts.
    
    Args:
    - text_list (list): A list of strings to analyze
    - n (int): Number of top words to include (default: 200)
    
    Returns:
    - pandas.DataFrame: A DataFrame with columns 'word' and 'count', sorted by count in descending order
    """
    # Combine all texts into a single string
    all_text = ' '.join(text_list)
    
    # Convert to lowercase and split into words
    words = re.findall(r'\w+', all_text.lower())
    
    # Count word frequencies
    word_counts = Counter(words)
    
    # Get the n most common words
    most_common = word_counts.most_common(n)
    
    # Create DataFrame
    df = pd.DataFrame(most_common, columns=['word', 'count'])
    
    return df
pd.set_option('display.max_rows', 500)
# create_word_frequency_df(dataset["text"][:100000])
for t in dataset["text"][::-1]:
    if len(t)>10_000:
        print(t)
        break

### UAProf
User Agent Profile (UAprof) to definicja opisu moÅ¼liwoÅ›ci telefonu komÃ³rkowego, utworzona przez organizacjÄ™ WAP Forum w ramach specyfikacji WAP 2.0, obecnie rozszerzana przez Open Mobile Alliance.

Powodem powstania definicji UAprof byÅ‚a ciÄ…gle wzrastajÄ…ca iloÅ›Ä‡ wspomaganych formatÃ³w i serwisÃ³w, poprzez co pole "Accept" w nagÅ‚Ã³wku HTTP byÅ‚o coraz dÅ‚uÅ¼sze. DziÄ™ki UAProf nagÅ‚Ã³wek ten musi zawieraÄ‡ tylko jeden URL, a kaÅ¼dy zainteresowany serwer moÅ¼e Å›ciÄ…gnÄ…Ä‡ plik XML z opisem wszystkich moÅ¼liwoÅ›ci telefonu i zainstalowanych na nim programÃ³w. Pole "Accept" moÅ¼e byÄ‡ dziÄ™ki temu ograniczone do najwaÅ¼niejszych formatÃ³w.

PrzykÅ‚ad 
(Nokia N73)

HTTP-Header (wycinek):
<nowiki>
Accept: text/javascript, text/ecmascript, application/x-javascript, text/html, application/vnd.wap.xhtml+xml, application/xhtml+xml, text/css, multipart/mixed, text/vnd.wap.wml, application/vnd.wap.wmlc, application/vnd.wap.wmlscriptc, application/java-archive, application/jav

If you're looking to make your own chat template, that also is possible! You must use the Jinja templating regime. We provide our own stripped down version of the `Unsloth template` which we find to be more efficient, and leverages ChatML, Zephyr and Alpaca styles.

More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from transformers import TrainerCallback, TrainerState, TrainerControl
import math
import wandb
import os
from unsloth import UnslothTrainer, UnslothTrainingArguments

wandb.login(key="2b23111621454d465a8227978bee4da77bc05133")
os.environ["WANDB_API_KEY"] = "2b23111621454d465a8227978bee4da77bc05133"

# Define a custom callback for logging
class WandbLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, **kwargs):
        # Log training loss
        if state.log_history[-1].get("loss") is not None:
            wandb.log({"train_loss": state.log_history[-1]["loss"], "step": state.global_step})

        # Log evaluation loss
        if state.log_history[-1].get("eval_loss") is not None:
            wandb.log({"eval_loss": state.log_history[-1]["eval_loss"], "step": state.global_step})

        torch.cuda.empty_cache()


Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33masobieszek[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
learning_rate = 7.5e-5
embedding_learning_rate = 1.5e-5
batch_size = 32
weight_decay = 0.5

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing=True,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 32,
        gradient_accumulation_steps = 1,

        warmup_ratio = 0.1,
        num_train_epochs = 3,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = learning_rate,
        embedding_learning_rate = embedding_learning_rate,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = weight_decay,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs_filtered",
        report_to="wandb",  # Enable wandb logging
        save_strategy = "steps",
        save_steps = 1000,
    ),
    callbacks=[WandbLoggingCallback()],
)
# Initialize wandb
wandb.init(project="wiki-pretrain-llama3.1-8b",
           entity="jutro",


    # track hyperparameters and run metadata
    config={
        "learning_rate": learning_rate,
        "embedding_learning_rate": embedding_learning_rate,
        "batch_size": batch_size,
        "lora_r": 128,
        "weight_decay": weight_decay,
    }
)

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# flush cuda memory
torch.cuda.empty_cache()
trainer.args.learning_rate

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train() # resume_from_checkpoint = True)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# model.save_pretrained("7k_wiki") # Local saving
# model.push_to_hub("ASobieszek/l3.1-7k", token = "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") # Online saving
model.push_to_hub("ASobieszek/l3.1-wiki-15k-128-wd0.5", token = "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") # Online saving

In [None]:

from peft import AutoModelForPeftCausalLM
from transformers import AutoTokenizer
model = AutoModelForPeftCausalLM.from_pretrained(
    "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = False,
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).