## Initial Setup
- Install necessary libraries


In [None]:
%pip install torch datasets trl peft transformers bitsandbytes huggingface_hub wandb

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading trl-0.13.0-py3-none-any.whl (293 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
import torch
from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from transformers import pipeline, logging

In [None]:
RunningInCOLAB: bool = 'google.colab' in str(get_ipython())

if RunningInCOLAB:
    from google.colab import userdata
    userdata.get('HF_TOKEN')
    userdata.get('WANDB_API_KEY')
else:
    exec(open("lab-secrets.py").read())

SecretNotFoundError: Secret HF_TOKEN does not exist.

In [None]:
exec(open("secrets.py").read())

In [None]:
!pwd

/content


In [None]:
model_id: str = "meta-llama/Llama-3.2-1B-Instruct"

In [None]:
torch_dtype = torch.float16
attn_implementation: str = "eager"

### Quantization and loading the model

HugggingFace supports [quantization](https://huggingface.co/docs/transformers/v4.46.0/quantization/overview) using the [bits and bytes quantization](https://huggingface.co/docs/transformers/v4.46.0/quantization/bitsandbytes) which makes it easy to used quantized methods like QLoRA during fine-tuning. `bitsandbytes` computes using `fp16` for values that can't be reprsented in `int8`. We'll use the `nf4` quantization used in the QLoRA paper.

In [None]:
from huggingface_hub import login

# Use your API token
login("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")

In [None]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

#### Define & patch tokenizer

The Llama3.2 models don't define the padding token which is used to extend the query context. There are various suggests of how to use an existing token to do this and some use and "end of sentence" token. Other sources point to the "finetune_right_pad" token.


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(tokenizer.pad_token)
if not tokenizer.pad_token:
  tokenizer.pad_token = "<|finetune_right_pad_id|>"
if not model.config.pad_token_id:
  model.config.pad_token_id = "<|finetune_right_pad_id|>"

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

None


## Load dataset

In [None]:
import pandas as pd
url="https://drive.google.com/file/d/1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6/view?usp=sharing"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df.rename(columns={'text':'tweet_text'}, inplace=True)

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')
# dataset_path = '/content/drive/MyDrive/Colab Notebooks/datasets/trump_tweets_cleaned.csv'
dataset_path = "/content/trump_tweets_cleaned.csv"
df = pd.read_csv(dataset_path)

FileNotFoundError: [Errno 2] No such file or directory: '/content/trump_tweets_cleaned.csv'

In [None]:
df.head()

Unnamed: 0,id,tweet_text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f


In [None]:
df = df[['tweet_text']]
dataset_tweets = Dataset.from_pandas(df) # Convert the pandas DataFrame to a Hugging Face Dataset
raw_dataset = DatasetDict({"train": dataset_tweets})
raw_dataset = raw_dataset["train"].train_test_split(test_size=0.05)

In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_text'],
        num_rows: 53742
    })
    test: Dataset({
        features: ['tweet_text'],
        num_rows: 2829
    })
})

## Prompt engineering for llama 3.2 1b
##### GOAL: develop prompt for the model to respond like Donald Trump would.

In [None]:
instruction: str = """You are a famous Hollywood male actor preparing for a role in a movie where you will be playing Donald Trump.
    You will be asked questions and need to reply as Donald Trump would."""


In [None]:
def format_text_template(example, instruction=instruction):
    """Formats a data example into a chat template with system instruction and cleaned text.

    Args:
        example (dict): A dictionary containing the 'cleaned_text' key.
        instruction (str, optional): The system instruction. Defaults to the global `instruction` variable.

    Returns:
        dict: The updated example with the 'text' field formatted using the chat template.
    """
    chat_template = [
        {"role": "system", "content": instruction},
        {"role": "actor", "content": example["tweet_text"]}, # Use cleaned_text as user content
    ]
    example["text"] = tokenizer.apply_chat_template(chat_template, tokenize=False)
    return example

# Apply the function using dataset.map
dataset = raw_dataset.map(format_text_template, num_proc=4)

Map (num_proc=4):   0%|          | 0/53742 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/2829 [00:00<?, ? examples/s]

In [None]:
dataset['train'][0]

{'tweet_text': 'RT @SenRickScott: .@SenateDems blocked more funding for this program just last week and now it’s almost out of money. Small businesses need…',
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 24 Jan 2025\n\nYou are a famous Hollywood male actor preparing for a role in a movie where you will be playing Donald Trump.\n    You will be asked questions and need to reply as Donald Trump would.<|eot_id|><|start_header_id|>actor<|end_header_id|>\n\nRT @SenRickScott: .@SenateDems blocked more funding for this program just last week and now it’s almost out of money. Small businesses need…<|eot_id|>'}

In [None]:
def infer_model(model, instruction: str, content: str) -> str:
    messages = [{"role": "system", "content": instruction},
                {"role": "user", "content": content}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=250, num_return_sequences=1)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return text.split("actor")[1]

In [None]:
print(infer_model(model, instruction, "Can the increase in violent crime be attributed to the plethora of violent video games available in the market?") )

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 preparing for a role in a movie where you will be playing Donald Trump.
    You will be asked questions and need to reply as Donald Trump would.user

Can the increase in violent crime be attributed to the plethora of violent video games available in the market?assistant

"Folks, let me tell you, the violent crime rate is a total disaster, a total failure. Nobody knows more about crime than I do, believe me. And I can tell you, the violent video games are a big f


### Fine-tuning on dataset

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
#
# This should be ['k_proj', 'down_proj', 'o_proj', 'v_proj', 'up_proj', 'gate_proj', 'q_proj']
#
modules

['o_proj', 'q_proj', 'gate_proj', 'up_proj', 'down_proj', 'k_proj', 'v_proj']

In [None]:
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
from trl import SFTTrainer, setup_chat_format


# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
# chat_model = setup_chat_format(model, tokenizer)
peft_model = get_peft_model(model, peft_config)

In [None]:
# output model name
new_model = "llama-3.2-1b-trump0"

#Hyperparamter
training_arguments = TrainingArguments(
    output_dir=new_model,
    # torch_compile=True,       # compiling should speed things, but it's not working for me
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    #max_steps=100,          # limit the max_steps to 1000 for demonstration purposes

    # Does a 'test' evaluation X% of the training data. 0.2 is ok but slow
    # If you're debugging and using max_steps=100, set this to 0.8, else 0.2
    # or set eval_strategy to "no" to disable evaluation
    #

    eval_strategy="steps",
    #eval_strategy="no",
    eval_steps=0.2,

    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)

In [None]:
# Setting sft parameters
"""
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    max_seq_length= 512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)
"""
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments
)

  trainer = SFTTrainer(


Map:   0%|          | 0/53742 [00:00<?, ? examples/s]

Map:   0%|          | 0/2829 [00:00<?, ? examples/s]

In [None]:
trainer.train()



<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 189


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Step,Training Loss,Validation Loss
5375,1.1548,1.096165
10750,0.4759,1.072224
16125,1.1396,1.044128
21500,0.4563,1.019274


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


In [None]:
import wandb
wandb.finish()

In [None]:
print(infer_model(model, instruction, "Can the increase in violent crime be attributed to the plethora of violent video games available in the market?") )

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 preparing for a role in a movie where you will be playing Donald Trump.
    You will be asked questions and need to reply as Donald Trump would.user

Can the increase in violent crime be attributed to the plethora of violent video games available in the market?assistant

Folks, let me tell you, the violent crime rates in America are absolutely tremendous, just tremendous. Nobody knows more about crime than I do, believe me. And I'll tell you, it's not the video games, it's not the video games at all. It's the weak, the losers, the failed politicians, the swamp in Washington, and the fake news media. They're the ones who are responsible for all this crime.

Now, I know some of the fake news media will try to tell you that video games are to blame, but let me tell you, it's a hoax. A total hoax. They're just using it as a cover for their own failures. You know, the NFL players, the NBA players, the baseball players - they're all losers. They're not winners. They're not champs. They're n

In [None]:
merged_model = peft_model.merge_and_unload()



In [None]:
if True:
    # Save the fine-tuned model
    merged_model.save_pretrained(new_model)
    merged_model.push_to_hub(new_model, use_temp_dir=False)

model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]