Install Unsloth and its optimization libraries.

In [1]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [1]:
!pip uninstall torch torchvision torchaudio -y
!pip cache purge
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 
!pip install -U unsloth
!pip install -U accelerate
!pip install -U bitsandbytes
!pip uninstall unsloth -y
!pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Found existing installation: torch 2.4.0
Uninstalling torch-2.4.0:
  Successfully uninstalled torch-2.4.0
Found existing installation: torchvision 0.19.0
Uninstalling torchvision-0.19.0:
  Successfully uninstalled torchvision-0.19.0
Found existing installation: torchaudio 2.4.0
Uninstalling torchaudio-2.4.0:
  Successfully uninstalled torchaudio-2.4.0
Files removed: 8
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/cu118/torch-2.5.0%2Bcu118-cp310-cp310-linux_x86_64.whl (838.3 MB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.20.0%2Bcu118-cp310-cp310-linux_x86_64.whl (6.5 MB)
Collecting

We import FastLanguageModel from Unsloth to load pre-trained Llama models, and torch for deep learning and GPU computation.

max_seq_length: This sets the maximum sequence length for the input tokens. The model can process up to 2048 tokens in a single sequence.

dtype: This is the data type for the model’s tensors. It’s set to None here for automatic detection based on the GPU.

load_in_4bit: This flag is set to True, enabling 4-bit quantization. This is a memory optimization technique: the weights are stored in 4-bit precision, so the model needs far less GPU memory, at the cost of some numerical precision.
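To get a feel for why 4-bit loading matters on a GPU with about 15 GB of memory, here is a rough back-of-the-envelope sketch. The parameter count of 3.2 billion is an assumption for a "3B" model, not a value read from the checkpoint, and the estimate ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so real usage will differ.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed just to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumed parameter count, for illustration only (not read from the model).
approx_params = 3_200_000_000

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(approx_params, bits):.2f} GiB")
```

Going from 16-bit to 4-bit cuts the weight storage to a quarter, which is what leaves room for LoRA adapters, gradients, and activations on a small GPU.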

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu118. CUDA = 7.5. CUDA Toolkit = 11.8.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers via:
`pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"`


LoRA enables fine-tuning by adding and updating only a small number of trainable parameters while keeping the main model frozen. get_peft_model wraps the model with LoRA functionality.

r = 16: This is the rank of the low-rank decomposition. The value 16 means that LoRA approximates each weight update with a pair of low-rank matrices of rank 16.

 target_modules: These are the specific components of the model where LoRA will be applied. In this case, it's applied to:

q_proj, k_proj, v_proj, o_proj: These represent different projection layers in the attention mechanism.
gate_proj, up_proj, down_proj: These correspond to components of the feed-forward network in the transformer model.

lora_alpha = 16: This is a scaling factor for the LoRA matrices. It controls the impact of the LoRA updates on the model's output. A higher alpha gives more weight to the LoRA updates, while a smaller alpha reduces their influence. In the standard LoRA formulation the update is scaled by lora_alpha / r, so with r = 16 the scale here is 1.

lora_dropout = 0: This specifies the dropout rate for the LoRA layers. Setting this to 0 means no dropout is applied, which is a reasonable default when dropout is not needed for regularization.
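The effect of r and target_modules on model size can be checked with a little arithmetic. The sketch below assumes the standard LoRA layout, where each adapted linear layer gets a matrix A of shape (r, in_features) and a matrix B of shape (out_features, r). The layer shapes and the layer count are taken from the model printout and the Unsloth patching message in this notebook; the variable and function names are illustrative.

```python
r = 16            # LoRA rank, as passed to get_peft_model
lora_alpha = 16   # LoRA scaling numerator
num_layers = 28   # decoder layers reported by Unsloth

# (in_features, out_features) of each targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # A has shape (rank, in_features); B has shape (out_features, rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total = per_layer * num_layers
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

The total comes out to 24,313,856, which matches the "Number of trainable parameters" that the trainer reports when training starts.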

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.10.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.



### Data Prep

In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset('json', data_files='/kaggle/input/dataset/processed_data.json', split='train')

Generating train split: 0 examples [00:00, ? examples/s]

Format the dataset into the chat template the model expects.
NOTE: this step may be unnecessary because our data has already been formatted.
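For context, standardize_sharegpt targets ShareGPT-style records, which use "from"/"value" keys, and converts them to the "role"/"content" format that chat templates expect. The sketch below is a hypothetical stand-in written for illustration, not Unsloth's implementation; ROLE_MAP and to_role_content are made-up names. It also shows why the step is harmless for data that is already in the target format.

```python
# Assumed mapping from ShareGPT speaker names to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert one conversation to a list of {"role", "content"} turns."""
    converted = []
    for turn in conversation:
        if "role" in turn and "content" in turn:
            # Already in the target format: pass through unchanged.
            converted.append(dict(turn))
        else:
            converted.append(
                {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            )
    return converted

sharegpt_style = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]
already_standard = [{"role": "user", "content": "Hi"}]

print(to_role_content(sharegpt_style))
print(to_role_content(already_standard))
```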

In [5]:
new_dataset = dataset.select(range(10000))
new_dataset

Dataset({
    features: ['conversations'],
    num_rows: 10000
})

In [6]:
from unsloth.chat_templates import standardize_sharegpt
new_dataset = standardize_sharegpt(new_dataset)
print(new_dataset)
new_dataset = new_dataset.map(formatting_prompts_func, batched = True)

Standardizing format:   0%|          | 0/10000 [00:00<?, ? examples/s]

Dataset({
    features: ['conversations'],
    num_rows: 10000
})


Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [7]:
new_dataset[5]["conversations"]

[{'content': 'You are an assistant', 'role': 'system'},
 {'content': 'Good morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was some sort of more rigorous test that could be used for this. Perhaps one regression per occupation with tons of dummy variables representi

And we see how the chat template transformed these conversations.

[Notice] Llama 3.1 Instruct's default chat template adds "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024" to the system message, so do not be alarmed!

In [8]:
new_dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGood morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was som


### Train the model

In [9]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,  # LLaMA model for fine-tuning
    tokenizer=tokenizer,  # Tokenizer for processing text data
    train_dataset=new_dataset,  # Training dataset (Q&A pairs)
    dataset_text_field="text",  # Field name in dataset containing text
    max_seq_length=max_seq_length,  # Max sequence length (number of tokens)
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),  # Collates batches for seq2seq tasks
    dataset_num_proc=2,  # Number of processes to speed up data loading
    packing=False,  # Don't pack short sequences together
    args=TrainingArguments(
        per_device_train_batch_size=4,  # Batch size per device (GPU/CPU)
        gradient_accumulation_steps=4,  # Accumulate gradients for larger effective batch size
        warmup_steps=5,  # Warmup steps to gradually increase the learning rate
        # max_steps=60,  # Alternative: stop training after 60 steps
        num_train_epochs = 3, # Train for 3 full passes over the dataset
        learning_rate=2e-3,  # Initial learning rate
        fp16=not is_bfloat16_supported(),  # Use FP16 if bfloat16 is not supported
        bf16=is_bfloat16_supported(),  # Use bfloat16 if supported by hardware
        logging_steps=1,  # Log metrics every step
        optim="adamw_8bit",  # Optimizer: AdamW with 8-bit precision for memory efficiency
        weight_decay=0.0001,  # Weight decay to prevent overfitting
        lr_scheduler_type="linear",  # Learning rate schedule: linear decay
        seed=3407,  # Set seed for reproducibility
        output_dir="outputs",  # Directory to save model checkpoints and logs

        # Add these lines:
        run_name = "My_Custom_Run_Name",  # A custom name for your run
        report_to = "none",  # Disable WandB (set to 'wandb' if you want to use it)
    ),
)


Map (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

We also use Unsloth's train_on_responses_only function to train only on the assistant outputs and ignore the loss on the user's inputs.

The train_on_responses_only function focuses fine-tuning on the assistant's responses rather than the entire conversation. It uses the special header tokens to locate where each user turn and each assistant turn begins, and masks everything outside the assistant turns so that those tokens do not contribute to the loss. The model still reads the user input as context, but it is only trained to produce the replies, which is the goal for tasks like chatbot development.
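The masking idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Unsloth's actual implementation: the markers are short lists of toy token ids rather than the tokenized header strings, and find_all and mask_non_responses are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def find_all(sequence, pattern):
    """Return every start index where `pattern` occurs in `sequence`."""
    n, m = len(sequence), len(pattern)
    return [i for i in range(n - m + 1) if sequence[i:i + m] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the response marker
        # The response ends where the next instruction marker begins, or at the end.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy token ids: 1 stands for the user header, 2 for the assistant header.
ids = [9, 1, 5, 6, 2, 7, 8, 1, 4, 2, 3]
labels = mask_non_responses(ids, [1], [2])
print(labels)
```

Only the tokens after each assistant marker (7, 8 and 3 in the toy example) keep their ids as labels; every other position is set to -100 and is skipped by the loss.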

In [10]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Masking verification.

Decoding the token IDs back into text lets us verify what is actually passed to the model. First we decode input_ids to check the full templated conversation. Then we decode labels, replacing the masked value -100 with spaces, to confirm that only the assistant's response is left to train on.

In [11]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGood morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was som

In [12]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                                           \n\nSounds like multinomial logistic regression is the tool to use.\nThe dependent variable would be "field of work" and the independent variable would be "field of study". With such a large N you can be fairly specific in defining the levels of the variables, but you should probably start with frequency counts of each and then a crosstabulation of the two, not for statistical testing but to see what\'s going on and whether you want to combine some categories of either variable.<|eot_id|>'

In [13]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
2.635 GB of memory reserved.


And now we train. This should take about 10 minutes per 60 steps on the Tesla T4.

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 1,875
 "-____-"     Number of trainable parameters = 24,313,856


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!


Step,Training Loss
1,2.4019
2,2.2973
3,1.9853
4,2.1713
5,2.0628
6,2.0329
7,2.1649
8,2.4706
9,1.8602
10,2.2841
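The batch size and step count in the training banner follow directly from the trainer settings. A small sketch of the arithmetic, using the values from this notebook (rounding up a partial final batch is an assumption; here the numbers divide evenly):

```python
import math

num_examples = 10_000
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 3

# One optimizer step consumes this many examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

This gives a total batch size of 16 and 1,875 optimizer steps, the same figures the banner reports.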


### Inference
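We use the model for inference. A TextStreamer is created for token-by-token streaming, but streamer = None is passed to generate, so the full output is decoded and displayed only after generation finishes.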

temperature=1.5: Controls randomness in the generation. Higher values like 1.5 produce more diverse and creative outputs.
min_p=0.1: Filters out tokens whose probability is lower than 0.1 times the probability of the most likely token, so unlikely tokens are excluded for more coherent results.
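To make the two sampling knobs concrete, here is a minimal pure-Python sketch of temperature scaling and min-p filtering on a toy distribution. It illustrates the idea only; it is not the code path used by model.generate, and the function names are made up for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Drop tokens below min_p * (highest probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=1.5)   # higher temperature flattens the distribution
print(max(sharp) > max(flat))             # prints True

probs = [0.6, 0.3, 0.05, 0.05]
filtered = min_p_filter(probs, min_p=0.1)  # threshold is 0.1 * 0.6 = 0.06
print(filtered)
```

In the toy example the two tokens at 0.05 fall below the threshold of 0.06 and are removed, and the remaining probabilities are rescaled to sum to 1. Because the threshold tracks the top token, min-p prunes more aggressively when the model is confident and less when it is uncertain, which is why it pairs well with a high temperature.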

In [26]:
from IPython.display import Markdown, display

FastLanguageModel.for_inference(model)

prompt = "What is the Bayes rule?"

messages = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt", #returns as tensor
).to("cuda") #Uses GPU for inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
output = model.generate(input_ids = inputs, streamer = None, max_new_tokens = 256, # Limits the generated text to 256 new tokens.
                   use_cache = True, temperature = 1.5, min_p = 0.1)

# Decode the output tokens to get the generated string
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

cleaned_output = decoded_output.split("2024", 1)[-1].strip()
final_output = cleaned_output.replace("assistant", "\n\nassistant") 

# Display the cleaned output in Markdown format
display(Markdown(final_output))

user

What is the Bayes rule?

assistant

It is
$$P(\theta|X)\propto P(X|\theta)P(\theta)$$
(Where $\theta$ is the parameter, X is the observation)
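The cleanup in the inference cell splits the decoded text on the string "2024", which depends on the date that the chat template puts in the system header and would break if the answer itself mentioned that year. One alternative, sketched here with plain lists standing in for token-id tensors, is to drop the prompt tokens before decoding. strip_prompt is a hypothetical helper, and the commented lines at the end are an untested assumption about how it would map onto the notebook's variables.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

# Toy example: the model output starts with a copy of the prompt tokens.
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43, 44]
print(strip_prompt(generated, len(prompt_ids)))  # prints [42, 43, 44]

# With the notebook's variables, the equivalent would be (assumption, untested here):
# response_ids = output[0][inputs.shape[1]:]
# response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
```

Slicing by prompt length does not depend on any text in the prompt or the answer, so it also removes the "user" and "assistant" role labels without string replacement.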

SAVING THE MODEL...HOPEFULLY!

In [78]:
model.save_pretrained("lora_model_10k")         # Save model
tokenizer.save_pretrained("lora_model_10k")     # Save tokenizer

('lora_model_10k/tokenizer_config.json',
 'lora_model_10k/special_tokens_map.json',
 'lora_model_10k/tokenizer.json')

USING ZIP AND LOCAL DOWNLOAD(SLOW)

In [80]:
trainer.save_model("lora_trainer_10k")
trainer.tokenizer.save_pretrained("lora_trainer_10k")

('lora_trainer_10k/tokenizer_config.json',
 'lora_trainer_10k/special_tokens_map.json',
 'lora_trainer_10k/tokenizer.json')

In [72]:
import shutil

# Define the directory to zip
model_directory = "lora_trainer_10k"

# Create a zip file
shutil.make_archive(model_directory, 'zip', model_directory)

'/kaggle/working/lora_trainer_10k.zip'

In [60]:
import shutil

# Define the directory to zip
model_directory = "lora_model_10k"

# Create a zip file
shutil.make_archive(model_directory, 'zip', model_directory)

'/kaggle/working/lora_model_10k.zip'

In [76]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
              (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
              (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
              (rotary_emb): LlamaExtendedRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
              (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
              (act_fn): SiLU()
    

In [11]:
model_load

NameError: name 'model_load' is not defined

In [4]:
!pip install -U bitsandbytes
!pip install -U transformers accelerate

Collecting transformers
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Downloading accelerate-1.0.1-py3-none-any.whl.metadata (19 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.45.2-py3-none-any.whl (9.9 MB)
Downloading accelerate-1.0.1-py3-none-any.whl (330 kB)
Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)

In [4]:
from unsloth import FastLanguageModel
model_load, tokenizer_load = FastLanguageModel.from_pretrained(
    model_name = "/kaggle/input/models/transformers/default/1/lora_model_10k", # Path to the LoRA adapter saved after training
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu118. CUDA = 7.5. CUDA Toolkit = 11.8.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.10.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [34]:
from IPython.display import Markdown, display

FastLanguageModel.for_inference(model_load)

prompt = "What is the gaussian formula and all of its parameters?"

messages = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer_load.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt", #returns as tensor
).to("cuda") #Uses GPU for inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer_load, skip_prompt = True)
output = model_load.generate(input_ids = inputs, streamer = None, max_new_tokens = 1028,
                   use_cache = True, temperature = 0.5, min_p = 0)

# Decode the output tokens to get the generated string
decoded_output = tokenizer_load.decode(output[0], skip_special_tokens=True)

# Optionally, remove any unwanted system messages
cleaned_output = decoded_output.split("2024", 1)[-1].strip()

# Ensure the assistant's response starts on a new line
final_output = cleaned_output.replace("assistant", "\n\nassistant")  # Add newline before 'assistant'

# Display the cleaned output in Markdown format
display(Markdown(final_output))

user

What is the gaussian formula and all of its parameters?

assistant

The Gaussian distribution is defined by the following formula:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.  The parameters of the Gaussian distribution are:

$\mu$: the mean of the distribution, which is the center of the distribution.
$\sigma$: the standard deviation, which is the width of the distribution at the mean.  $\sigma$ is also known as the standard error.
$X \sim N(\mu, \sigma^2)$: the random variable is distributed according to the Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.  This is also known as a normal distribution.

Note that the mean and standard deviation are not unique to the Gaussian distribution.  For example, the mean and standard deviation can be any real number.  The Gaussian distribution is only unique if the mean and standard deviation are both real numbers.

In [44]:
from IPython.display import Markdown, display

FastLanguageModel.for_inference(model_load) # Enable native 2x faster inference

prompt = "Can you explain how decision trees handle numerical variables in a classification task?"

messages = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer_load.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt", #returns as tensor
).to("cuda") #Uses GPU for inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer_load, skip_prompt = True)
output = model_load.generate(input_ids = inputs, streamer = None, max_new_tokens = 1028, # Limits the generated text to 1028 new tokens.
                   use_cache = True, temperature = 0.5, min_p = 0)

# Decode the output tokens to get the generated string
decoded_output = tokenizer_load.decode(output[0], skip_special_tokens=True)

# Optionally, remove any unwanted system messages
cleaned_output = decoded_output.split("2024", 1)[-1].strip()

# Ensure the assistant's response starts on a new line
final_output = cleaned_output.replace("assistant", "\n\nassistant")  # Add newline before 'assistant'

# Display the cleaned output in Markdown format
display(Markdown(final_output))

user

Can you explain how decision trees handle numerical variables in a classification task?

assistant

Decision trees can handle numerical variables in several ways. Here are a few:

In the most simple case, decision trees can treat a numerical variable as an attribute. This is done by simply splitting the data based on the value of the numerical variable. For example, if we have a variable called "salary" that ranges from $0$ to $100000$, we can split the data based on whether it is less than $20000$ or not. This is called a split by value.
Decision trees can also treat a numerical variable as a continuous variable. In this case, the tree can be split based on the range of the numerical variable. For example, if we have a variable called "age" that ranges from $0$ to $100$, we can split the data based on whether it is less than $30$ or not.
Decision trees can also treat a numerical variable as a categorical variable. In this case, the tree can be split based on the category of the numerical variable. For example, if we have a variable called "country" that has values "USA", "Canada", "Mexico", etc., we can split the data based on whether it is USA, Canada, or Mexico, etc.

Note that these are the most simple ways to treat numerical variables. There are more sophisticated ways to treat numerical variables in decision trees. For example, you can treat a numerical variable as a time series, or as a variable that has different values for different levels of a categorical variable.

In [47]:
from IPython.display import Markdown, display

FastLanguageModel.for_inference(model_load)

prompt =  "How do gradient-based optimization techniques like Adam handle exploding or vanishing gradient problems in deep learning models?"

messages = [
    {"role": "user", "content": prompt},
]
inputs = tokenizer_load.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt", #returns as tensor
).to("cuda") #Uses GPU for inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer_load, skip_prompt = True)
output = model_load.generate(input_ids = inputs, streamer = None, max_new_tokens = 1028,
                   use_cache = True, temperature = 0.5, min_p = 0)

# Decode the output tokens to get the generated string
decoded_output = tokenizer_load.decode(output[0], skip_special_tokens=True)

# Optionally, remove any unwanted system messages
cleaned_output = decoded_output.split("2024", 1)[-1].strip()

# Ensure the assistant's response starts on a new line
final_output = cleaned_output.replace("assistant", "\n\nassistant")  # Add newline before 'assistant'

# Display the cleaned output in Markdown format
display(Markdown(final_output))

user

How do gradient-based optimization techniques like Adam handle exploding or vanishing gradient problems in deep learning models?

assistant

Adam and most other modern optimization algorithms work by adjusting the weights in the direction that the gradient is pointing. If the gradient is not pointing in the right direction, then the weights will not be updated in that direction. If the gradient is consistently pointing in the same direction, then the weights will not be updated at all. In this sense, Adam and other modern algorithms are not immune to exploding or vanishing gradients, but they work in a way that makes exploding or vanishing gradients less likely.
For example, in a neural network, the output of the final layer is a vector of length $n$ where $n$ is the number of parameters in the final layer. The gradient of the loss function with respect to the parameters of the final layer is a vector of length $n$ in the same direction as the vector of parameter values. If the gradient is not pointing in the right direction, then the weights will not be updated at all. If the gradient is consistently pointing in the same direction, then the weights will not be updated at all. This means that the weights will either stay the same or move in the direction of the gradient. If the gradient is pointing in the same direction for all parameters, then the weights will move in the direction of the gradient. This is the case for exploding gradients, where the gradient is exploding in magnitude.
In general, Adam and other modern algorithms work by adjusting the weights in the direction that the gradient is pointing. This means that the weights will move in the direction of the gradient, but they will not move in the direction that the gradient is pointing in a way that causes exploding or vanishing gradients.