# Post your own Fine-tuned Language Model

1.   Open the Runtime menu at the top of the page and click "Change runtime type". Under Hardware accelerator, select GPU.
2.   Go to https://huggingface.co/join and create an account. Put your username inside the quotation marks on the username line in the next block below.
3.   Put your project name on the project_name line in the next block below. Your model will be posted to your Hugging Face account under this name once you run all the code.
4.   Find a plain-text file (.txt) containing your training data. For example: https://www.gutenberg.org/files/98/98-0.txt (A Tale of Two Cities)
5.   Click the folder icon on the left side of this screen, then click the upload icon and upload your text file. Put the name of your text file inside the quotation marks on the text_file_path line in the next block below.
6.   (Optional) Adjust num_train_epochs in the next block below. num_train_epochs sets how many times training passes over your text file. Too high a number can lead your AI to memorize the data instead of learning general patterns (overfitting), and too low a number means it learns less.
7.   Select the language model you want on the language_model line in the next block below. gpt2 trains faster; gpt2-medium is larger and usually produces better text.
8.   Open the Runtime menu at the top of the page and click "Run all". Log in through the second block below so your model can be posted. When prompted, paste an access token with write permission from https://huggingface.co/settings/tokens.
9.   Keep this page open while your AI trains. Scroll down to see a progress bar and estimated finish time. If the estimate is too long, lower num_train_epochs and run again.
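
As a rough guide to what num_train_epochs costs, the number of training steps grows in direct proportion to it. The sketch below assumes a single GPU and no gradient accumulation; `total_steps` is a hypothetical helper written for illustration, not part of `transformers`. The figures come from the example run logged further down in this notebook (503 training examples, batch size 8, 30 epochs).

```python
import math

def total_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch * num_epochs

print(total_steps(503, 8, 30))  # 1890, the "Total optimization steps" in the example log
print(total_steps(503, 8, 10))  # a third of the epochs gives a third of the steps
```

Halving num_train_epochs therefore roughly halves the training time shown in the progress bar.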

In [None]:
username = "PUT YOUR USERNAME HERE"
project_name = "YOUR PROJECT NAME HERE"
text_file_path = "NAME_OF_YOUR_TEXT_FILE_HERE.txt"
num_train_epochs = 30
language_model = "gpt2" # Your options are "gpt2" and "gpt2-medium"

In [None]:
!pip install transformers datasets
!huggingface-cli login

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37

Gets the tokenizer that matches the selected language model

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(language_model)
tokenizer.mask_token = "<mask>"

Creates train and test datasets
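
The next cell splits the file with the rule `i % 1000 > 900`, which sends lines 901 to 999 of every 1000 (counting from 0) to the test set, so just under 10% of the lines are held out. Here is a minimal sketch of the same rule on an in-memory list; `split_lines` and the sample lines are made up for illustration.

```python
def split_lines(lines):
    """Apply the notebook's rule: a line goes to the test set when its index modulo 1000 exceeds 900."""
    train, test = [], []
    for i, line in enumerate(lines):
        (test if i % 1000 > 900 else train).append(line)
    return train, test

lines = [f"line {i}\n" for i in range(2000)]  # made-up stand-in for a text file
train, test = split_lines(lines)
print(len(train), len(test))  # 1802 198
```

Because the held-out lines come in runs of 99, each test passage stays readable as continuous text instead of being scattered single lines.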

In [None]:
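from transformers import TextDataset

# Derive the split file names from the uploaded file (assumes a ".txt" extension)
train_file_path = f"{text_file_path[:-4]}_train.txt"
test_file_path = f"{text_file_path[:-4]}_test.txt"

# Send lines 901-999 of every 1000 to the test file (about 10%) and the rest to
# the train file. The "with" block closes all three files, so everything is
# written to disk before TextDataset reads it.
with open(text_file_path, "r") as reader, \
     open(train_file_path, "w") as train_writer, \
     open(test_file_path, "w") as test_writer:
  for i, line in enumerate(reader):
    if i % 1000 > 900:
      test_writer.write(line)
    else:
      train_writer.write(line)
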
train_dataset = TextDataset(
      tokenizer=tokenizer,
      file_path=train_file_path,
      block_size=220)

test_dataset = TextDataset(
      tokenizer=tokenizer,
      file_path=test_file_path,
      block_size=220)

Token indices sequence length is longer than the specified maximum sequence length for this model (110667 > 1024). Running this sequence through the model will result in indexing errors
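
The warning above is expected and can be ignored here. As far as I can tell, TextDataset tokenizes the whole file in one call (110667 tokens in this run, more than GPT-2's 1024-token limit) and then cuts the result into blocks of block_size tokens, so the model never sees the long sequence. The sketch below imitates that cutting step with placeholder token ids; `chunk_into_blocks` is a hypothetical helper, not the library's code.

```python
def chunk_into_blocks(token_ids, block_size):
    """Cut a token sequence into full blocks of block_size, dropping the short remainder."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

token_ids = list(range(110667))        # placeholder ids; only the length matters here
blocks = chunk_into_blocks(token_ids, 220)
print(len(blocks), len(blocks[0]))     # 503 220
```

The 503 blocks agree with the "Num examples = 503" line in the training log further down.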


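The data_collator batches the data during training. GPT-2 is a causal (next-word prediction) language model, so masking is turned off with mlm=False; mlm=True is only for masked language models such as BERT.

In [None]:
from transformers import DataCollatorForLanguageModeling

# mlm=False: the labels are a copy of the inputs, which is what GPT-2 training expects
data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )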
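
When mlm=False, the collator does no masking at all: it copies the input ids into the labels and marks padding positions with -100 so the loss skips them. The model then shifts the labels internally so that each position predicts the next token. Here is a plain-Python sketch of that labelling step; the token ids and the pad id of 0 are made up, and `causal_lm_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(input_ids, pad_token_id=None):
    """Labels for causal LM training: a copy of the inputs with padding ignored."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

batch = [[464, 3797, 3332], [464, 3290, 0]]  # made-up token ids; 0 plays the pad token
labels = [causal_lm_labels(row, pad_token_id=0) for row in batch]
print(labels)  # [[464, 3797, 3332], [464, 3290, -100]]
```

In this notebook every block has the same length (block_size), so no padding is needed and the labels are an exact copy of the inputs.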

Loads the selected language model

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for GPT-2
model = AutoModelForCausalLM.from_pretrained(language_model)

Sets the training options and creates the trainer

In [None]:
training_args = TrainingArguments(
  output_dir = f"./{project_name}",
  num_train_epochs = num_train_epochs,
  push_to_hub = True,
  save_steps = 1000,
  save_total_limit = 1,
  #per_device_train_batch_size = 1, # Delete the leading #'s on
  #per_device_eval_batch_size = 2,  # these three lines if you see
  #gradient_accumulation_steps = 4, # "RuntimeError: CUDA out of memory."
  )


trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  data_collator=data_collator,
)

Cloning https://huggingface.co/Dizzykong/Aristotle-8-29 into local empty directory.


If the next cell fails with "RuntimeError: CUDA out of memory.", uncomment lines 7, 8, and 9 in the cell above (delete the leading #). Then select Runtime --> "Restart and run all" in this page's top bar.
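
Those three lines lower memory use by putting fewer examples on the GPU at once, while gradient accumulation adds up several small batches before each weight update. The arithmetic can be sketched as follows; `effective_batch_size` is a hypothetical helper, and a single GPU is assumed.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps=1, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

print(effective_batch_size(8))                                 # 8, the default seen in the training log
print(effective_batch_size(1, gradient_accumulation_steps=4))  # 4, with the three lines uncommented
```

Each update then uses half as many examples as the default, but only one example has to fit in GPU memory at a time.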

In [None]:
print("Training your model")
trainer.train()

***** Running training *****
  Num examples = 503
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1890


Step,Training Loss
500,3.7225
1000,3.2292
1500,2.9724


Saving model checkpoint to ./Aristotle-8-29/checkpoint-1000
Configuration saved in ./Aristotle-8-29/checkpoint-1000/config.json
Model weights saved in ./Aristotle-8-29/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1890, training_loss=3.2141891560226523, metrics={'train_runtime': 1412.559, 'train_samples_per_second': 10.683, 'train_steps_per_second': 1.338, 'total_flos': 6021680792371200.0, 'train_loss': 3.2141891560226523, 'epoch': 30.0})
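
The numbers in this output can be reused to guess how long another run on the same data would take. This is only a rough sketch: it assumes the same GPU, model, and batch size, and takes the throughput (1.338 steps per second) and step count (1890 steps over 30 epochs) from the output above. `estimate_runtime_minutes` is a hypothetical helper.

```python
def estimate_runtime_minutes(total_steps, steps_per_second):
    """Projected wall-clock training time in minutes at a constant step rate."""
    return total_steps / steps_per_second / 60

steps_per_epoch = 1890 // 30  # 63 steps per epoch in the example run
for epochs in (10, 30, 60):
    minutes = estimate_runtime_minutes(steps_per_epoch * epochs, 1.338)
    print(f"{epochs} epochs: about {minutes:.1f} minutes")
```

Colab may assign a different GPU on each session, so treat the result as a ballpark figure.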

In [None]:
trainer.save_model()

Saving model checkpoint to ./Aristotle-8-29
Configuration saved in ./Aristotle-8-29/config.json
Model weights saved in ./Aristotle-8-29/pytorch_model.bin
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


To https://huggingface.co/Dizzykong/Aristotle-8-29
   9458dc0..ba57839  main -> main


Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
To https://huggingface.co/Dizzykong/Aristotle-8-29
   ba57839..3fdeca7  main -> main




In [None]:
print("Saving your tokenizer to Hugging Face alongside your trained model")
tokenizer.push_to_hub(f"{username}/{project_name}")

Cloning https://huggingface.co/Dizzykong/Aristotle-8-29 into local empty directory.


tokenizer config file saved in Dizzykong/Aristotle-8-29/tokenizer_config.json
Special tokens file saved in Dizzykong/Aristotle-8-29/special_tokens_map.json
To https://huggingface.co/Dizzykong/Aristotle-8-29
   3fdeca7..d8f3727  main -> main




'https://huggingface.co/Dizzykong/Aristotle-8-29/commit/d8f37272091f872163d99c003589406529f07ed4'

# Generate Text from your Posted Model

In [None]:
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Increase max_length to generate longer passages. max_length is counted in tokens and includes the prompt.

In [None]:
from transformers import pipeline

generator = pipeline('text-generation', model=f'{username}/{project_name}', tokenizer=language_model)

def generate_text(prompt):
  string = generator(prompt, max_length=200)[0]['generated_text']
  print(string)

loading configuration file Dizzykong/Aristotle-8-29/config.json
Model config GPT2Config {
  "_name_or_path": "Dizzykong/Aristotle-8-29",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transforme

In [None]:
print(f"Your models are available at https://huggingface.co/{username}")

Your models are available at https://huggingface.co/Dizzykong


In [None]:
generate_text("It was the best of times, it was the worst of times, ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


It was the best of times, it was the worst of times, 
[they say] because no great evil happens to them in prosperity
but to the unfortunate, either _a misfortune or a hinderance: to
these, we must add, the bad is a hindrance, a hindrance with
a view to some good.

So then we say that it is a good fortune which has these, but not
Happiness for which we think it deserves the appellation.[7] What has
come to be thus stated also, I confess, in some light; but by no
one is the point stated supposed to be absolute truth; we must, in
respect of this quality, add two or three more terms, that
contradicts it, at least not entirely: for, as the Greek
terms are not exactly alike (ἰδέα and γνώμη mean the same thing to
s


In [None]:
generate_text("It was the best of times, it was the worst of times, ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


It was the best of times, it was the worst of times,  and we take every chance of escaping from these.

So in what sense are we said to be Brave, when we are bold and fearless,
for we do believe in nothing else but these; and if we are in a state
of complete Self-Control then we have it in the highest degree.[10]

Chapter V.

Next it may be well then to examine whether the term “Unjust man” (sometimes taken
indiscopably) denotes neither the man who does what is base in
heeling the pleasurable pleasures and at the same time doing the actions
that are base and hurtful to his mental condition, nor that
the term denotes the man of Perfected Self-Mastery, who avoids all
contrary pleasures and avoids doing any harm.

Such is our account. However, we must not depart from the truth simply
as


In [None]:
generate_text("In England, there was scarcely ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In England, there was scarcely 〜a man”
in great pain who, to call his friends, would not reply in kind,
“would a stranger say?” or another who was annoyed and would not
resoluntarily reply, there was nothing improper in calling to his assistance
those people who were in need: but he that was not in the greatest want,
the man to call to his help.

The friends who were in need, therefore, being in want, being quiet and reserved,
restrained in their anger, and so forth, were the objects of choice,
because from this kind of comfort they were more likely to be of use to their
friends: but it is plain that, in all cases, the man who fails of all
friends is thought not to have suffered so great a loss, nor, in spite of it,
easily to be helped.

And so it is with those who are


In [None]:
generate_text("In England, there was scarcely ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In England, there was scarcely “a sound shoemaker” and “a good shoemaker.”


The good shoemaker in Great-Britain is called a shoemaker, and the bad one
a crafter: here the one class is thought to be really connected with
mechanical manufacture, and the other with handicraft; the shoemaker has
means of living, but he makes but small quantities of work, and that no
man ever does: all other handicrafts he does either with very little or
little labour, and that is it with each. The shoemaker in Great-Britain has a
means of living, and of work, to all other handicrafts: but he
alone of any of them does anything good: he is not a shoemaker simply because
he has shoes, no; no, nor does he want or make any, just as the shoemaker neither in
his shoes nor in his


# Extra tips and tricks

1. Go to https://www.gutenberg.org/ for other books to train on.
2. Delete irrelevant parts of your text file (such as the Project Gutenberg header and license text) to make training easier.
3. Use my other code that specializes in Excel files (link forthcoming).
4. Go to https://huggingface.co/models to see other popular language models.
5. Go to https://huggingface.co/course/chapter1/1 to learn how to use Hugging Face yourself.
6. If things go wrong, select Runtime --> "Run all" in this page's top bar.