How to Fine-tune the Model? #37

Open
berkecanrizai opened this issue Apr 20, 2023 · 17 comments
Labels: question (further information is requested)

Comments

@berkecanrizai

Hi, I want to fine-tune the 7B model. Am I supposed to download the provided checkpoint and fine-tune it as shown in this repo: https://github.com/EleutherAI/gpt-neox#using-custom-data? Would they be compatible, and has anyone here given it a shot? Thanks.

@berkecanrizai berkecanrizai changed the title How To Fine-tune the Model? How to Fine-tune the Model? Apr 20, 2023
@Ph0rk0z

Ph0rk0z commented Apr 21, 2023

It looks like it needs much more training before any fine-tuning you do will be worth it.

@devonkinghorn

What about using the databricks-dolly-15k dataset? https://github.com/databrickslabs/dolly/tree/master/data
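For anyone exploring that route, here is a rough sketch (not from this thread) of loading databricks-dolly-15k with the Hugging Face datasets library and flattening each record into a single training string; the prompt template is purely illustrative:

```python
from datasets import load_dataset

# databricks-dolly-15k records have: instruction, context, response, category
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(example):
    # Illustrative prompt format; adapt it to whatever template you fine-tune with.
    context = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"### Instruction:\n{example['instruction']}{context}\n\n### Response:\n{example['response']}"}

dolly_text = dolly.map(to_text, remove_columns=dolly.column_names)
```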

@Ph0rk0z

Ph0rk0z commented Apr 21, 2023

From: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=0

| Model | RAM | lambada (ppl) | lambada (acc) | hellaswag (acc_norm) | winogrande (acc) | piqa (acc) | coqa (f1) | average |
|---|---|---|---|---|---|---|---|---|
| stablelm-base-alpha-7b | 32 | 17.6493 | 41.06% | 41.22% | 50.12% | 66.76% | | 49.79% |

Look at the perplexity scores of this model right now: it is worse than a 500M model. Wait until they finish it. You can thumbs-down me 100x, but it won't fix it.

@samuelazran

@Ph0rk0z any idea what the plan is for the release date of further checkpoints? I think training it on more than 1 trillion tokens could give it an advantage compared to other pre-trained models.

@Ph0rk0z

Ph0rk0z commented Apr 22, 2023

I wish I knew... I don't work for them. Hopefully they finish training and get rid of the disclaimers. Then this will be a great model for long contexts, the first I've found besides RWKV.

@snirbenyosef

I'm also interested in fine-tuning the model on my book. Is it possible?
Does anyone have an idea how to start?
I tried with transformers, but the result was not that good. I just gave it some raw text, so I guess I will need to process that text first.
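One common way to prepare a long text such as a book is to tokenize the whole thing and pack it into fixed-length blocks instead of feeding raw lines. A rough sketch, assuming the Hugging Face datasets/transformers libraries and a hypothetical plain-text file book.txt:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-7b")
block_size = 1024  # tokens per training example

# "book.txt" is a placeholder path for your own text file.
raw = load_dataset("text", data_files={"train": "book.txt"})

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token ids, then split into equal-sized blocks;
    # for causal-LM training the labels are just a copy of the inputs.
    concatenated = sum(examples["input_ids"], [])
    total_len = (len(concatenated) // block_size) * block_size
    input_ids = [concatenated[i:i + block_size] for i in range(0, total_len, block_size)]
    return {"input_ids": input_ids, "labels": [list(ids) for ids in input_ids]}

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
```

Each resulting block can then be passed to a standard Trainer as a causal-LM example.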

@juanps90

I'm interested in fine-tuning as well. Does anyone have any recommendations for this?

@Ph0rk0z

Ph0rk0z commented Apr 22, 2023

https://github.com/oobabooga/text-generation-webui/blob/main/docs/Using-LoRAs.md

https://github.com/johnsmith0031/alpaca_lora_4bit

If you want to try to make a LoRA.
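For reference, a rough sketch (not from this thread) of what a LoRA fine-tune of the alpha checkpoints could look like with the Hugging Face peft library, as an alternative to the 4-bit repo above; the hyperparameters are illustrative, not tuned:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# The StableLM alpha checkpoints use the GPT-NeoX architecture, whose fused
# attention projection is named "query_key_value".
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```

The resulting model can be trained with a normal transformers Trainer; the links above cover the 4-bit route if VRAM is tight.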

@mcmonkey4eva
Member

You can fine-tune now with the links Ph0rk0z posted above, but... yeah, wait for the next release. The alphas are just that: initial alphas, not meant for real usage, just meant for open public development.

@aamir-gmail

> Hi, I want to fine-tune the 7B model. Am I supposed to download the provided checkpoint and fine-tune it as shown in this repo: https://github.com/EleutherAI/gpt-neox#using-custom-data? Would they be compatible, and has anyone here given it a shot? Thanks.

I have a training script for the 7B and 3B; where can I send it?

@snirbenyosef

> Hi, I want to fine-tune the 7B model. Am I supposed to download the provided checkpoint and fine-tune it as shown in this repo: https://github.com/EleutherAI/gpt-neox#using-custom-data? Would they be compatible, and has anyone here given it a shot? Thanks.
>
> I have a training script for the 7B and 3B; where can I send it?

@aamir-gmail can you send it to me at snirben@gmail.com, please?

@vonbarnekowa

@aamir-gmail, it would be cool if you could share it here.

@Shaila96

Shaila96 commented May 2, 2023

@aamir-gmail could you please send it to me as well? shaila.zaman96@gmail.com

@aamir-gmail

Here you go, the full training script:

```python
# Developed by Aamir Mirza
#
# Setup:
#   1. Create a conda virtual environment with Python 3.9.
#   2. Install PyTorch 1.13.1 (not 2.0):
#        conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
#   3. Install the latest transformers:
#        conda install -c conda-forge transformers
#   4. Install DeepSpeed from GitHub (not pip), built with CPU Adam optimizer support:
#        git clone https://github.com/microsoft/DeepSpeed
#        DS_BUILD_CPU_ADAM=1 pip install .
#   5. Install accelerate via pip, plus: pip install Ninja
#   6. conda install -c conda-forge mpi4py
#
# Train via the command line, for example (in my case 2x RTX 3090 24 GB):
#   deepspeed train_gptNX_v2.py --num_gpus=2

from transformers import (
    GPTNeoXForCausalLM,
    GPTNeoXTokenizerFast,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import os

os.environ['OMPI_MCA_opal_cuda_support'] = 'true'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# If you have a single GPU, change this to "1".
os.environ["WORLD_SIZE"] = "2"

# Change this to your requirement, for example 4096 (max).
MAX_LEN = 1024

stage2_config = """{
    "bf16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}"""


class CustomTrainer(Trainer):
    def compute_loss(self, model_a, inputs_a, return_outputs=False):
        # Standard causal-LM loss: the labels are the input_ids themselves.
        outputs = model_a(**inputs_a, labels=inputs_a["input_ids"])
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss


tokenizer = GPTNeoXTokenizerFast.from_pretrained("stabilityai/stablelm-base-alpha-3b")


def process_data(examples):
    texts = examples["text"]
    # Remove empty lines
    texts = [text for text in texts if len(text) > 0 and not text.isspace()]
    # Remove lines that are too long
    texts = [text for text in texts if len(text) < 512]
    # Remove lines that are too short
    texts = [text for text in texts if len(text) > 16]
    # Add a newline character
    texts = [text + ' ' + '\n' for text in texts]
    examples["text"] = texts
    return examples


# Process the dataset's "text" column: use the tokenizer to get input_ids and attention_mask.
def process_data_add_mask(examples):
    text = examples['text']
    tokenizer.pad_token = tokenizer.eos_token
    # Tokenize text
    encoded_dict = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=MAX_LEN
    )
    # Add input_ids and attention_mask to the example
    examples['input_ids'] = encoded_dict['input_ids']
    examples['attention_mask'] = encoded_dict['attention_mask']
    return examples


imdb_dataset = load_dataset('imdb')
imdb_dataset_train = imdb_dataset['train']
imdb_dataset_train = imdb_dataset_train.shuffle()
imdb_dataset_train = imdb_dataset_train.map(process_data, batched=True, remove_columns=['label'])
imdb_dataset_val = imdb_dataset['test']
imdb_dataset_val = imdb_dataset_val.shuffle()
imdb_dataset_val = imdb_dataset_val.map(process_data, batched=True, remove_columns=['label'])
train_dataset = imdb_dataset_train.map(process_data_add_mask, remove_columns=["text"], batched=True)
val_dataset = imdb_dataset_val.map(process_data_add_mask, remove_columns=["text"], batched=True)

model = GPTNeoXForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

# An absolute path is required for the DeepSpeed config;
# you can use the JSON above to create your own config file.
z_optimiser = '/two-tb/train_GPTNX/zeromq_config/stablelm-base-alpha-3b_config.json'
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
training_args_v2 = TrainingArguments(
    output_dir="./trained_model",
    learning_rate=2e-5,
    save_total_limit=2,
    fp16=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=12,
    evaluation_strategy="epoch",
    deepspeed=z_optimiser,
    num_train_epochs=1
)

# Set up the trainer
trainer = CustomTrainer(
    model=model,
    args=training_args_v2,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

trainer.save_model()
```
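One note on the script above: it defines the stage2_config JSON string but passes a file path to TrainingArguments(deepspeed=...), so that config file has to exist on disk before launching. A minimal sketch that writes it out once, assuming the same path used in the script:

```python
import json
import os

# Path taken from the script above; adjust it to your own filesystem.
config_path = '/two-tb/train_GPTNX/zeromq_config/stablelm-base-alpha-3b_config.json'
os.makedirs(os.path.dirname(config_path), exist_ok=True)
with open(config_path, 'w') as f:
    # Round-trip through json to validate the config before writing it.
    json.dump(json.loads(stage2_config), f, indent=4)
```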

@aamir-gmail

aamir-gmail commented May 3, 2023 via email

@Appleyc

Appleyc commented May 6, 2023 via email

@aamir-gmail

aamir-gmail commented May 22, 2023 via email
